Box Relatives

Thoughts about puzzles, math, coding, and miscellaneous

Ranking Wikipedia Pages

The most interesting part of my Crossword Nexus website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results. I didn’t want it to return results alphabetically — I wanted it to return results based on relevance. But how can you automatically determine if a Wikipedia article is relevant? Well, the method I implemented was ordering by inlinks. The more links there were to an article, the more interesting it should be, right?
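
The gist, as a sketch (the `inlink_counts` mapping here is a hypothetical stand-in for the precomputed title-to-inlink-count data; the real site queries a database built from a Wikipedia dump):

```python
import re

def search(pattern, inlink_counts, limit=10):
    """Rank titles matching a crossword pattern by inlink count."""
    # '?' means "any letter." The site supports full regex, but the
    # wildcard translation is all we need here.
    regex = re.compile(pattern.replace('?', '.'), re.IGNORECASE)

    def entry(title):
        # Crossword entries drop spaces and punctuation:
        # "Sheila E." -> "SheilaE"
        return ''.join(c for c in title if c.isalnum())

    matches = [t for t in inlink_counts if regex.fullmatch(entry(t))]
    matches.sort(key=lambda t: inlink_counts[t], reverse=True)
    return matches[:limit]

# search('??E?L??', {'Kremlin': 396, 'Shellac': 160, 'Feedlot': 82})
# -> ['Kremlin', 'Shellac', 'Feedlot']
```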

For the most part, this works pretty well. But let’s ask the site for the best results of the form ??E?L??.

Before we go on, try to think of some good crossword entries that would fit this pattern, preferably ones with Wikipedia pages.

Let’s see … there’s Kremlin and gremlin, Sheila E., The Blob, shellac … quite a few options. What are the top 10 results returned from Crossword Nexus?

Clenleu
Breilly
Kremlin
Fresles
Creully
Treclun
Bresles
Rieulay
Clesles
Treslon

That’s … extremely ugly. Wait, what the heck are those things even? Communes in France, are you kidding me? Is something messed up?

No, nothing’s messed up. Take a look at Clenleu’s page. See all those links at the bottom (click on “Show” on the “Communes of the Pas-de-Calais department” tab)? Yeah, all of those pages in turn link back to Clenleu. In fact, about 900 pages in all link to Clenleu, which kills the results.

Luckily, there’s an easy way around this. If we only count links in the article text itself, we can skip all of those useless links. If we do that, Clenleu only has (drum roll please) 3 inlinks. So that takes care of that. Here are the new top 10 results with this method:

Kremlin 396
Siedlce 239
Heerlen 226
Shellac 160
Sheila E. 139
Peebles 137
Ixelles 133
Breclav 88
Feedlot 82
Preslav 72

Ahhhh, better. Still no “gremlin” or “The Blob,” but those finished 12th and 13th, respectively. So that’s it, right? We’ve got our new ranking metric?
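
In case you want to try this at home, here’s roughly how the in-text counting works. The trick is that navigation boxes are transcluded templates, so their links never appear in a page’s raw wikitext; counting [[…]] links straight out of the dump skips them automatically. (A sketch only: the (title, wikitext) iterator is assumed to come from parsing the dump’s XML.)

```python
import re
from collections import Counter

# [[Target]], [[Target|shown text]], and [[Target#Section]] all link to Target.
WIKILINK = re.compile(r'\[\[([^\]|#]+)')

def intext_inlink_counts(pages):
    """Count incoming links using only each article's own wikitext."""
    counts = Counter()
    for title, wikitext in pages:
        # A set, so each article counts a given target at most once.
        targets = {m.group(1).strip() for m in WIKILINK.finditer(wikitext)}
        for target in targets:
            if target:
                # The first letter of a Wikipedia title is case-insensitive.
                counts[target[0].upper() + target[1:]] += 1
    return counts
```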

Well, hold on, why stop there? There might be other good possibilities, too. Let’s look at length of article, number of languages an article is translated into, and recency of last edit. Maybe these have some value too.

Length of article:

MHealth 55963
The Play 21931
Tien len 21263
Gremlin 17843
The Kliq 17477
Shellac 16519
Heerlen 16194
Sheila E. 14344
Eden Log 13981
Ixelles 13020

Uh, MHealth? I guess it’s a thing. People like to write about it, at least. “The Play” didn’t mean anything to me until I noticed it was referring to the “The band is on the field!” play. This is pretty good, too.

Next up, number of translations:

Kremlin 38
Apelles 34
Heerlen 31
Ixelles 29
Sterlet 27
Shellac 26
Siedlce 25
Buellas 25
Preslav 24
Usellus 23

This … is not so great. I fear that things translated into many languages might be mostly geographical sites.
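
(Side note on where that number comes from: translations show up as interlanguage links right in the wikitext, like [[fr:Clenleu]], so counting distinct language prefixes takes just a few lines. The regex below is deliberately rough; a careful version would check against the actual list of language codes.)

```python
import re

# Interlanguage links look like [[fr:Clenleu]] or [[zh-min-nan:...]].
INTERWIKI = re.compile(r'\[\[([a-z]{2,3}(?:-[a-z-]+)?):[^\]]+\]\]')

def language_count(wikitext):
    """Number of distinct languages an article links to."""
    return len(set(INTERWIKI.findall(wikitext)))
```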

I have high hopes for recency of last edit. Let’s take a look:

O'Neills 1325721157
Kvevlax 1325693146
Alex Lee 1325693087
The Kliq 1325642247
Sheila E. 1325640534
Buellas 1325616464
The Flow 1325616163
Poe's law 1325612070
Rieulay 1325611112
Uxelles 1325605593

Oh, no, this is the worst one yet. And what is so important about Buellas that it’s translated into so many languages and updated so frequently?
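
(In case those scores look opaque: they’re Unix timestamps of each page’s last edit, straight from the dump. For example:)

```python
from datetime import datetime, timezone

# The recency scores are Unix timestamps of the last edit.
print(datetime.fromtimestamp(1325721157, tz=timezone.utc))
# 2012-01-04 23:52:37+00:00
```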

Based on these results, I’m thinking of going with a metric that’s 70% inlinks and 30% article size. But what do you think? Should I exclude the other two completely? I’d like to get a little feedback before going live. Thanks!
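
For the curious, here’s the general shape of the blend. The two metrics live on very different scales (a top inlink count here is around 400, while article sizes run into the tens of thousands), so each one needs normalizing before the 70/30 weights mean anything. This sketch normalizes against the best value among the current matches; the exact normalization is still open to tweaking.

```python
def blended_score(inlinks, size, max_inlinks, max_size,
                  w_inlinks=0.7, w_size=0.3):
    """Blend two metrics on different scales into one score in [0, 1]."""
    return w_inlinks * (inlinks / max_inlinks) + w_size * (size / max_size)

# e.g., for the ??E?L?? matches above:
# blended_score(160, 16519, 396, 55963)   # Shellac
```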

UPDATE (3/7/2012 9:31 PM): In case you were curious how the 70/30 split would look, here are the first few:

Sheila E.
Shellac
Heerlen
Siedlce
Ixelles
Gremlin
Apelles
Kremlin
Peebles
Feedlot
The Play
The Bled
EHealth
Preslav
The Blob
Breclav
Abe clan
Wheelie

That’s pretty good. Cities and communes and the like are simply overrated by Wikipedia, so I’m not sure we could ever get rid of things like Heerlen and Siedlce and Ixelles. Thoughts?

9 Comments

  1. These look pretty good! Something I struggle with is figuring out good ways to deal with Kremlin. When most folks say Kremlin, they mean the thing in Wikipedia’s “Moscow Kremlin” article. Wikipedia’s “Kremlin” article is about some other thing. I try to use inlinks to figure out what these things are normally called, but those ain’t perfect.

  2. instead of just counting links, you could attempt to implement something like this:

    http://en.wikipedia.org/wiki/Pagerank

    i hear it’s been known to produce meaningful results.

  3. I tried implementing PageRank and it just about killed my computer. Maybe I’ll try again for the next update. For this update I’ll have to be happy with some combination of these four factors, I think.
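
    For the record, the standard power-iteration version looks something like the sketch below (plain dicts, nothing clever); whether it fits in memory on the full link graph is the real question.

    ```python
    def pagerank(links, damping=0.85, iterations=30):
        """Power-iteration PageRank. `links` maps page -> iterable of outlinks."""
        pages = set(links)
        n = len(pages)
        # Keep only links to pages we actually know about.
        out = {p: [t for t in links[p] if t in pages] for p in pages}
        rank = dict.fromkeys(pages, 1.0 / n)
        for _ in range(iterations):
            # Rank held by dead-end pages is redistributed uniformly.
            dangling = sum(rank[p] for p in pages if not out[p])
            base = (1.0 - damping + damping * dangling) / n
            new_rank = dict.fromkeys(pages, base)
            for p in pages:
                if out[p]:
                    share = damping * rank[p] / len(out[p])
                    for t in out[p]:
                        new_rank[t] += share
            rank = new_rank
        return rank
    ```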

  4. My intuition is that PageRank wouldn’t help.

    I have also noticed the over-representation of geographical references. Some of this has to do with the structure of Wikipedia, which tends to link to places a lot. Some of this has to do with the fact that we have a local bias and Wikipedia does not. At one point I was angry with my ranking algorithm for turning up some silly city in Indonesia, until I realized that it had something like 2 million people and was actually kind of a big deal even though I’d never heard of it.

    So I started thinking about what I’m looking for as a puzzle constructor. I don’t want things which 50 million Southeast Asians know about but 0.5% of North Americans know about. I want things which a _substantial, uncorrelated fraction of my audience_ knows about. What I don’t know is how to get that from Wikipedia. I think most of the metrics you’re talking about are _global prominence metrics_ which are very prone to magnifying things of _narrow, deep_ interest.

    So I started thinking about what would happen if I were to take all the pages in Wikipedia and sort of group them topically. Things about Southeast Asia over here. Things about beekeeping over there. Small and large clusters of related documents would be grouped. Then what I want (I hypothesize) is something that gets links from all over — it’s not just popular in its own little topic-corner of the world.

    There’s probably some graph metric that would identify this more reliably. But I don’t know what it is. Maybe PageRank would approximate it. Maybe it wouldn’t.

  5. Perhaps a centrality metric would work well. In particular, perhaps we should be focusing on “Betweenness centrality” instead of “Degree centrality”.
    http://en.wikipedia.org/wiki/Centrality#Betweenness_centrality

    Seems like it might be kind of expensive to compute, though.
    http://en.wikipedia.org/wiki/Betweenness_centrality#Algorithms
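
    If you want to experiment anyway, networkx can approximate it by sampling source nodes (the k parameter), which trades accuracy for speed on a big graph. A toy sketch:

    ```python
    import networkx as nx

    # Toy stand-in for the Wikipedia link graph.
    G = nx.DiGraph([('Kremlin', 'Moscow'), ('Moscow', 'Kremlin'),
                    ('Russia', 'Moscow'), ('Russia', 'Kremlin')])

    # Exact betweenness costs roughly O(V*E); sampling k sources approximates it.
    scores = nx.betweenness_centrality(G, k=min(100, len(G)))
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    ```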

  6. It’s an excellent point — how can we tell a computer what to do unless we can precisely define what it is we want it to do?

    I think for now I’ll stick with what I’ve got here and refine it on the next iteration.

  7. Have you tried total number of edits or frequency of edits? Number of distinct contributors?

  8. Andrew —

    I do think those would be useful … but I don’t know how to get that data from the Wikipedia data dumps.

    I’d also love to run a regression on these different statistics to see how correlated they are.

  9. Pingback: Wiki Ranking II – Lessons for next time | Box Relatives
