The most interesting part of my Crossword Nexus website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results. I didn’t want it to return results alphabetically — I wanted it to return results based on relevance. But how can you automatically determine if a Wikipedia article is relevant? Well, the method I implemented was ordering by inlinks. The more links there were to an article, the more interesting it should be, right?
For the most part, this works pretty well. But let’s ask the site for the best results of the form ??E?L??.
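As a rough illustration of what a pattern like this means, here is a minimal Python sketch of matching it against article titles. It is not the site's actual code: the helper names are hypothetical, and the normalization (uppercase, ASCII letters only, spaces and punctuation dropped) is an assumption. Accented letters are simply dropped here, which a real implementation would want to handle better.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # "?" stands for exactly one letter; any other character must match literally.
    body = "".join("[A-Z]" if ch == "?" else re.escape(ch.upper()) for ch in pattern)
    return re.compile(body)

def normalize(title: str) -> str:
    # Assumption: only the letters A-Z count, so "Sheila E." becomes "SHEILAE".
    return "".join(ch for ch in title.upper() if "A" <= ch <= "Z")

def matches(pattern: str, title: str) -> bool:
    # fullmatch makes sure the whole normalized title fits the pattern.
    return pattern_to_regex(pattern).fullmatch(normalize(title)) is not None

titles = ["Kremlin", "Gremlin", "Sheila E.", "The Blob", "Shellac", "Crossword"]
hits = [t for t in titles if matches("??E?L??", t)]
print(hits)
```

Under this normalization, multi-word titles and titles with punctuation can still match a seven-letter pattern.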
Before we go on, try to think of some good crossword entries that would fit this pattern, preferably ones with Wikipedia pages.
Let’s see … there’s Kremlin and gremlin, Sheila E., The Blob, shellac … quite a few options. What are the top 10 results returned from Crossword Nexus?
Clenleu
Breilly
Kremlin
Fresles
Creully
Treclun
Bresles
Rieulay
Clesles
Treslon
That’s … extremely ugly. Wait, what the heck are those things even? Communes in France, are you kidding me? Is something messed up?
No, nothing’s messed up. Take a look at Clenleu’s page. See all those links at the bottom (click on “Show” on the “Communes of the Pas-de-Calais department” tab)? Yeah, all of those pages in turn link back to Clenleu. In fact, about 900 pages in all link to Clenleu, which kills the results.
Luckily, there’s an easy way around this. If we only count links in the article text itself, we can skip all of those useless links. If we do that, Clenleu only has (drum roll please) 3 inlinks. So that takes care of that. Here are the new top 10 results with this method:
Kremlin 396
Siedlce 239
Heerlen 226
Shellac 160
Sheila E. 139
Peebles 137
Ixelles 133
Breclav 88
Feedlot 82
Preslav 72
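The idea of counting only links in the article text can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the code behind the numbers above: it assumes the raw wikitext of each page is available in a dictionary, treats anything inside `{{...}}` templates (navigation boxes, infoboxes) as "not article text", and counts each linking page at most once. The page contents are made-up toy data, and `strip_templates` and `count_body_inlinks` are hypothetical helper names.

```python
import re
from collections import Counter

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")   # an innermost {{...}} template
LINK = re.compile(r"\[\[([^\]|#]+)")       # target of [[Target]], [[Target|label]], [[Target#Section]]

def strip_templates(wikitext: str) -> str:
    # Remove innermost templates repeatedly so nested ones disappear too.
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE.sub("", wikitext)
    return wikitext

def count_body_inlinks(pages: dict) -> Counter:
    counts = Counter()
    for source, wikitext in pages.items():
        body = strip_templates(wikitext)
        targets = {target.strip() for target in LINK.findall(body)}
        targets.discard(source)            # ignore self-links
        counts.update(targets)             # one vote per linking page
    return counts

# Toy pages, invented for illustration only.
pages = {
    "Page A": "Text about the [[Kremlin]] and [[Kremlin|the fortress]]. {{Navbox|list=[[Clenleu]]}}",
    "Page B": "See [[Kremlin#History]]. {{Infobox|{{Nested|[[Clenleu]]}}}}",
    "Page C": "A neighbor of [[Clenleu]].",
}
counts = count_body_inlinks(pages)
print(counts.most_common())
```

In this toy data the template links to Clenleu are ignored, so only the one link written in running text counts toward it.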
Ahhhh, better. Still no “gremlin” or “The Blob” but those finished 12th and 13th respectively. So that’s it, right? We’ve got our new ranking metric?
Well, hold on, why stop there? There might be other good possibilities, too. Let’s look at length of article, number of languages an article is translated into, and recency of last edit. Maybe these have some value too.
Length of article:
MHealth 55963
The Play 21931
Tien len 21263
Gremlin 17843
The Kliq 17477
Shellac 16519
Heerlen 16194
Sheila E. 14344
Eden Log 13981
Ixelles 13020
Uh, MHealth? I guess it’s a thing. People like to write about it, at least. “The Play” didn’t mean anything to me until I noticed it was referring to the “The band is on the field!” play. This is pretty good, too.
Next up, number of translations:
Kremlin 38
Apelles 34
Heerlen 31
Ixelles 29
Sterlet 27
Shellac 26
Siedlce 25
Buellas 25
Preslav 24
Usellus 23
This … is not so great. I fear that things translated into many languages might be mostly geographical sites.
I have high hopes for recency of last edit. Let’s take a look (the numbers are Unix timestamps of the last edit, so bigger means more recent):
O'Neills 1325721157
Kvevlax 1325693146
Alex Lee 1325693087
The Kliq 1325642247
Sheila E. 1325640534
Buellas 1325616464
The Flow 1325616163
Poe's law 1325612070
Rieulay 1325611112
Uxelles 1325605593
Oh, no, this is the worst one yet. And what is so important about Buellas that it’s translated into so many languages and updated so frequently?
Based on these results, I’m thinking of going with a metric that’s 70% inlinks and 30% article size. But what do you think? Should I exclude the other two completely? I’d like to get a little feedback before going live. Thanks!
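One way a 70/30 blend could be computed is sketched below in Python. How to put inlinks and article size on a common scale is an open choice; dividing each by the largest value in the set is an assumption made here purely for illustration, and `blended_scores` is a hypothetical helper. The figures are the ones from the two lists above, limited to the four articles that appear in both, so the resulting order need not match what the full data set would give.

```python
# (body-text inlinks, article length), copied from the lists in this post.
articles = {
    "Heerlen": (226, 16194),
    "Shellac": (160, 16519),
    "Sheila E.": (139, 14344),
    "Ixelles": (133, 13020),
}

def blended_scores(data, w_links=0.7, w_size=0.3):
    # Assumption: scale each metric to 0..1 by dividing by its maximum,
    # then take the weighted sum.
    max_links = max(links for links, _ in data.values())
    max_size = max(size for _, size in data.values())
    return {
        title: w_links * links / max_links + w_size * size / max_size
        for title, (links, size) in data.items()
    }

scores = blended_scores(articles)
ranking = sorted(scores, key=scores.get, reverse=True)
for title in ranking:
    print(f"{title:10s} {scores[title]:.3f}")
```

A different normalization (ranks or logarithms instead of dividing by the maximum, say) would shift the balance between the two metrics, so the weights only mean something relative to that choice.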
UPDATE (3/7/2012 9:31 PM) In case you were curious how the 70/30 split would look, here are the first few:
Sheila E.
Shellac
Heerlen
Siedlce
Ixelles
Gremlin
Apelles
Kremlin
Peebles
Feedlot
The Play
The Bled
EHealth
Preslav
The Blob
Breclav
Abe clan
Wheelie
That’s pretty good. Cities and communes and the like are simply overrated by Wikipedia, so I’m not sure we could ever get rid of things like Heerlen and Siedlce and Ixelles. Thoughts?