All right, the new Wikipedia sort is live on CrosswordNexus.com and this time I am allowing users to download the original list to play with offline. Now that I’ve played with it a bit, I have some ideas for next time that I’m going to gather here. If you have some ideas too, feel free to chip in.
First observation: I don’t like the distribution of scores. I actually made a histogram of the distribution, which you can see here:
Looks okay, right? Except … the vast majority of the useful entries are clustered in the 97-100 range. That’s way too tight a range. Also, no one will care at all about pretty much anything scored 50 or below. That’s way too loose a range for the junk. So next time I need to tailor the histogram to make the bars on the right smaller and the bars on the left bigger, which spreads the useful entries over more scores and packs the junk into fewer. Of course, this will test my sorting algorithm: if it ranks something wrong, the gap between the score it gets and the score it deserves will be bigger.
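One rough way to picture that reshaping is a piecewise-linear remap of the scores. This is only a sketch, not how the current rankings are built: the breakpoints in `KNOTS` and the function name `rescale` are made up for illustration, and it would work best on unrounded scores, since four integer values between 97 and 100 can only ever land on four new values.

```python
# Hypothetical breakpoints: (old score, new score) pairs in increasing order.
# Old 0-50 is squeezed into 0-10, old 97-100 is stretched across 50-100.
KNOTS = [(0, 0), (50, 10), (97, 50), (100, 100)]


def rescale(score, knots=KNOTS):
    """Remap a score by linear interpolation between neighboring breakpoints."""
    # Anything at or below the first breakpoint gets the lowest new score.
    if score <= knots[0][0]:
        return float(knots[0][1])
    # Find the segment that contains the score and interpolate inside it.
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if score <= x1:
            return y0 + (score - x0) * (y1 - y0) / (x1 - x0)
    # Anything above the last breakpoint gets the highest new score.
    return float(knots[-1][1])


if __name__ == "__main__":
    for old in (25, 50, 97, 98.5, 100):
        print(old, "->", rescale(old))
```

Because the remap never decreases, the order of the entries stays the same. Only the spacing between them changes.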
Second: Some things that we don’t want get too high a ranking. I mentioned in the last post that Wikipedia seems to have a bias toward geographical sites. But it also has a bias toward things Americans tend not to care about … notably, soccer. Romario, Ruud van Nistelrooy and Guus Hiddink all rate 100 on the site, and there’s no way they will ever appear in an American crossword.
Is there a fix to this? Not really … unless we cheat a little bit. One bit of information I didn’t use when making the current rankings is the list of categories assigned to each article. So I can simply add some logic in the code that says something like “If the article has a category that starts with ‘Cities’, multiply the page length by 0.8, and if it has a category that ends with ‘footballers’, multiply the page length by 0.5.” The intent is to bring these articles to a length that an American would assign to them if he were making the page. Will it hurt some legitimate articles? Of course. But it will help weed out some of the chaff as well.
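Here is a minimal sketch of what that category logic could look like. The two rules and their multipliers come from the example above. Everything else is an assumption for illustration: the names `CATEGORY_RULES` and `adjusted_length`, the idea that categories arrive as a plain list of strings, and the choice to stack the penalties when more than one rule matches.

```python
# Penalty rules from the example in the post: (match kind, text, multiplier).
# The rule table layout itself is a hypothetical design choice.
CATEGORY_RULES = [
    ("prefix", "Cities", 0.8),
    ("suffix", "footballers", 0.5),
]


def adjusted_length(page_length, categories, rules=CATEGORY_RULES):
    """Scale a page length down once for each rule that matches any category."""
    length = float(page_length)
    for kind, text, multiplier in rules:
        if kind == "prefix":
            hit = any(c.startswith(text) for c in categories)
        else:
            hit = any(c.endswith(text) for c in categories)
        # Assumption: penalties multiply together when several rules match.
        if hit:
            length *= multiplier
    return length


if __name__ == "__main__":
    # Made-up category lists, just to show the call.
    print(adjusted_length(1000, ["Dutch footballers", "1976 births"]))
    print(adjusted_length(1000, ["Cities in Ohio"]))
    print(adjusted_length(1000, ["American film actors"]))
```

Keeping the rules in a table means adding another penalty later is a one-line change, and an article that matches no rule keeps its original length.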
Thoughts? Have you been playing with it any?
P.S. My son is doing very well. His cancer is in remission but he is at a high risk of infection, so we are being very careful. And right now he is in my arms and GO TO SLEEP WHY AREN’T YOU GOING TO SLEEP?!?
P.P.S. Have fun at the ACPT everyone!
P.P.P.S. “Weed out the chaff” is a terrible mixed metaphor.