The most interesting part of my Crossword Nexus website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results. I didn’t want it to return results alphabetically — I wanted it to return results based on relevance. But how can you automatically determine if a Wikipedia article is relevant? Well, the method I implemented was ordering by inlinks. The more links there were to an article, the more interesting it should be, right?
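For the curious, the counting itself is simple. Here's a minimal sketch in Python, assuming you've already extracted (source, target) link pairs from the pagelinks dump; the names are illustrative, not the site's actual code.

```python
from collections import Counter

def count_inlinks(link_pairs):
    """Tally how many distinct pages link to each target title.

    `link_pairs` is an iterable of (source_title, target_title) tuples,
    e.g. parsed out of Wikipedia's pagelinks dump.
    """
    counts = Counter()
    for source, target in set(link_pairs):  # count each linking page once
        counts[target] += 1
    return counts

# Ranking candidate answers, most-linked first:
# ranked = sorted(candidates, key=lambda t: counts.get(t, 0), reverse=True)
```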
For the most part, this works pretty well. But let’s ask the site for the best results of the form ??E?L??.
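Each ? stands for a single unknown letter. Here's a minimal sketch of turning such a pattern into a regex and matching it against letters-only titles (the helper names are mine, not the site's actual code):

```python
import re

def pattern_to_regex(pattern):
    """Each '?' in a crossword pattern matches exactly one letter."""
    return re.compile("^" + pattern.replace("?", "[A-Z]") + "$", re.IGNORECASE)

def normalize(title):
    """Strip everything but letters, so 'Sheila E.' becomes 'SHEILAE'."""
    return re.sub(r"[^A-Za-z]", "", title).upper()

rx = pattern_to_regex("??E?L??")
for title in ["Kremlin", "Sheila E.", "The Blob", "Gremlin", "Paris"]:
    if rx.match(normalize(title)):
        print(title)  # prints all but Paris
```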
Before we go on, try to think of some good crossword entries that would fit this pattern, preferably ones with Wikipedia pages.
Let’s see … there’s Kremlin and gremlin, Sheila E., The Blob, shellac … quite a few options. What are the top 10 results returned from Crossword Nexus?
Clenleu
Breilly
Kremlin
Fresles
Creully
Treclun
Bresles
Rieulay
Clesles
Treslon
That’s … extremely ugly. Wait, what the heck are those things even? Communes in France, are you kidding me? Is something messed up?
No, nothing’s messed up. Take a look at Clenleu’s page. See all those links at the bottom (click on “Show” on the “Communes of the Pas-de-Calais department” tab)? Yeah, all of those pages in turn link back to Clenleu. In fact, about 900 pages in all link to Clenleu, which kills the results.
Luckily, there’s an easy way around this. If we only count links in the article text itself, we can skip all of those useless links. If we do that, Clenleu only has (drum roll please) 3 inlinks. So that takes care of that. Here are the new top 10 results with this method (a rough sketch of the text-only counting follows the list):
Kremlin 396
Siedlce 239
Heerlen 226
Shellac 160
Sheila E. 139
Peebles 137
Ixelles 133
Breclav 88
Feedlot 82
Preslav 72
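As promised, a rough sketch of the text-only counting. Working from raw wikitext helps here: navbox links live in transcluded templates, so they never appear in an article's own source, and stripping any remaining {{...}} markup drops infobox links too. The parsing below is deliberately crude, and `iter_articles` is a placeholder for however you stream (title, wikitext) pairs out of the XML dump.

```python
import re
from collections import Counter

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")   # innermost {{...}} blocks
WIKILINK = re.compile(r"\[\[([^\]|#]+)")   # target of [[link]] or [[link|label]]

def body_link_targets(wikitext):
    """Return the link targets that appear in the article text itself."""
    while TEMPLATE.search(wikitext):       # peel nested templates inside-out
        wikitext = TEMPLATE.sub("", wikitext)
    return {m.strip() for m in WIKILINK.findall(wikitext)}

# One counting pass over the dump:
# inlinks = Counter()
# for title, text in iter_articles(dump_path):
#     for target in body_link_targets(text):
#         inlinks[target] += 1
```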
Ahhhh, better. Still no “gremlin” or “The Blob” but those finished 12th and 13th respectively. So that’s it, right? We’ve got our new ranking metric?
Well, hold on, why stop there? There might be other good possibilities, too. Let’s look at length of article, number of languages an article is translated into, and recency of last edit. Maybe these have some value too.
Length of article:
MHealth 55963
The Play 21931
Tien len 21263
Gremlin 17843
The Kliq 17477
Shellac 16519
Heerlen 16194
Sheila E. 14344
Eden Log 13981
Ixelles 13020
Uh, MHealth? I guess it’s a thing. People like to write about it, at least. “The Play” didn’t mean anything to me until I noticed it was referring to the “The band is on the field!” play. This is pretty good, too.
Next up, number of translations:
Kremlin 38
Apelles 34
Heerlen 31
Ixelles 29
Sterlet 27
Shellac 26
Siedlce 25
Buellas 25
Preslav 24
Usellus 23
This … is not so great. I fear that things translated into many languages might be mostly geographical sites.
I have high hopes for recency of last edit. Let’s take a look (the scores below are Unix timestamps of each article’s most recent edit):
O'Neills 1325721157
Kvevlax 1325693146
Alex Lee 1325693087
The Kliq 1325642247
Sheila E. 1325640534
Buellas 1325616464
The Flow 1325616163
Poe's law 1325612070
Rieulay 1325611112
Uxelles 1325605593
Oh, no, this is the worst one yet. And what is so important about Buellas that it’s translated into so many languages and updated so frequently?
Based on these results, I’m thinking of going with a metric that’s 70% inlinks and 30% article size. But what do you think? Should I exclude the other two completely? I’d like to get a little feedback before going live. Thanks!
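In case it helps, here's roughly what a 70/30 blend could look like. Normalizing each metric against the best candidate in the result set is just one way to make the weights comparable; treat this as a sketch, not the site's exact arithmetic.

```python
def combined_score(inlinks, length, max_inlinks, max_length,
                   w_inlinks=0.7, w_length=0.3):
    """Blend text-only inlink count and article length into one score.

    Each metric is scaled to [0, 1] against the best candidate in the
    result set so that the 70/30 weights are comparable.
    """
    return (w_inlinks * inlinks / max_inlinks +
            w_length * length / max_length)

# Shellac, using the numbers above (160 inlinks, length 16519; maxima
# 396 and 55963): 0.7 * 160/396 + 0.3 * 16519/55963 ≈ 0.37
```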
UPDATE (3/7/2012 9:31 PM): In case you were curious how the 70/30 split would look, here are the first few:
Sheila E.
Shellac
Heerlen
Siedlce
Ixelles
Gremlin
Apelles
Kremlin
Peebles
Feedlot
The Play
The Bled
EHealth
Preslav
The Blob
Breclav
Abe clan
Wheelie
That’s pretty good. Cities and communes and the like are simply overrated by Wikipedia, so I’m not sure we could ever get rid of things like Heerlen and Siedlce and Ixelles. Thoughts?
March 8, 2012 at 5:00 am
These look pretty good! Something I struggle with is figuring out good ways to deal with Kremlin. When most folks say Kremlin, they mean the thing in Wikipedia’s “Moscow Kremlin” article. Wikipedia’s “Kremlin” article is about some other thing. I try to use inlinks to figure out what these things are normally called, but those ain’t perfect.
March 8, 2012 at 6:20 am
Instead of just counting links, you could attempt to implement something like this:
http://en.wikipedia.org/wiki/Pagerank
I hear it’s been known to produce meaningful results.
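In its simplest form it's just repeated redistribution of rank along links. A toy power-iteration sketch (all of Wikipedia would need sparse matrices or an out-of-core pass, so this is illustration only):

```python
def pagerank(out_links, damping=0.85, iterations=30):
    """Plain power-iteration PageRank over {page: [pages it links to]}."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            known = [t for t in targets if t in new]
            if known:
                share = damping * rank[p] / len(known)
                for t in known:
                    new[t] += share
            else:                        # dangling page: spread rank evenly
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank
```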
March 8, 2012 at 5:00 pm
I tried implementing PageRank and it just about killed my computer. Maybe I’ll try again for the next update. For this update I’ll have to be happy with some combination of these four factors, I think.
March 8, 2012 at 10:01 pm
My intuition is that PageRank wouldn’t help.
I have also noticed the over-representation of geographical references. Some of this has to do with the structure of Wikipedia, which tends to link to places a lot. Some of this has to do with the fact that we have a local bias and Wikipedia does not. At one point I was angry with my ranking algorithm for turning up some silly city in Indonesia, until I realized that it had something like 2 million people and was actually kind of a big deal even though I’d never heard of it.
So I started thinking about what I’m looking for as a puzzle constructor. I don’t want things which 50 million Southeast Asians know about but 0.5% of North Americans know about. I want things which a _substantial, uncorrelated fraction of my audience_ knows about. What I don’t know is how to get that from Wikipedia. I think most of the metrics you’re talking about are _global prominence metrics_ which are very prone to magnifying things of _narrow, deep_ interest.
So I started thinking about what would happen if I took all the pages in Wikipedia and sort of grouped them topically. Things about Southeast Asia over here. Things about beekeeping over there. Small and large clusters of related documents would emerge. Then what I want (I hypothesize) is something that gets links from all over — it’s not just popular in its own little topic-corner of the world.
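One way to make “gets links from all over” concrete, assuming some prior clustering exists: score each article by the entropy of the clusters its inlinks come from. High entropy means links arrive from many corners; low entropy means popularity within one niche. A sketch (the clustering itself is the hard part, and is taken as given here):

```python
import math
from collections import Counter

def inlink_cluster_entropy(inlink_sources, cluster_of):
    """Entropy of the topical clusters an article's inlinks come from.

    `inlink_sources` lists the pages linking in; `cluster_of` maps a
    page to its cluster id, from whatever topical grouping you trust.
    """
    counts = Counter(cluster_of[src] for src in inlink_sources
                     if src in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```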
There’s probably some graph metric that would identify this more reliably. But I don’t know what it is. Maybe PageRank would approximate it. Maybe it wouldn’t.
March 8, 2012 at 10:07 pm
Perhaps a centrality metric would work well. In particular, perhaps we should be focusing on “Betweenness centrality” instead of “Degree centrality”.
http://en.wikipedia.org/wiki/Centrality#Betweenness_centrality
Seems like it might be kind of expensive to compute, though.
http://en.wikipedia.org/wiki/Betweenness_centrality#Algorithms
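A toy illustration with networkx; exact betweenness (Brandes' algorithm) is O(V*E), which is why sampling source nodes via the `k` parameter is the usual shortcut on big graphs. The edges here are made up to show the call, nothing more.

```python
import networkx as nx

# Tiny made-up link graph.
G = nx.DiGraph([
    ("Red Square", "Moscow Kremlin"),
    ("Russia", "Moscow Kremlin"),
    ("Moscow Kremlin", "Kremlin"),
    ("Kremlin", "Fortification"),
    ("Clenleu", "Pas-de-Calais"),
    ("Pas-de-Calais", "France"),
])

scores = nx.betweenness_centrality(G)  # pass k=<sample size> to approximate
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```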
March 9, 2012 at 5:14 am
It’s an excellent point — how can we tell a computer what to do unless we can precisely define what it is we want it to do?
I think for now I’ll stick with what I’ve got here and refine it on the next iteration.
March 12, 2012 at 1:29 pm
Have you tried total number of edits or frequency of edits? Number of distinct contributors?
March 12, 2012 at 9:38 pm
Andrew —
I do think those would be useful … but I don’t know how to get that data from the Wikipedia data dumps.
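One possible route, though it would be slow at scale: the live API exposes full revision histories, so edit counts and distinct contributors could come from there instead of the dumps. A sketch using the standard revisions query with continuation:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_stats(title):
    """Total edits and distinct contributors for one article,
    walked batch by batch through the API's continuation."""
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "user", "rvlimit": "max", "format": "json"}
    edits, users = 0, set()
    while True:
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            edits += 1
            users.add(rev.get("user"))
        if "continue" not in data:
            break
        params.update(data["continue"])
    return edits, len(users)

# print(revision_stats("Kremlin"))
```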
I’d also love to run a regression on these different statistics to see how correlated they are.