Box Relatives

Thoughts about puzzles, math, coding, and miscellaneous

Ranking Wikipedia Pages

The most interesting part of my Crossword Nexus website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results. I didn’t want it to return results alphabetically — I wanted it to return results based on relevance. But how can you automatically determine if a Wikipedia article is relevant? Well, the method I implemented was ordering by inlinks. The more links there were to an article, the more interesting it should be, right?
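
The gist, as a sketch (the `inlink_counts` mapping here is a hypothetical stand-in for the precomputed title-to-inlink-count data; the real site queries a database built from a Wikipedia dump):

```python
import re

def search(pattern, inlink_counts, limit=10):
    """Rank titles matching a crossword pattern by inlink count."""
    # '?' means "any letter." The site supports full regex, but the
    # wildcard translation is all we need here.
    regex = re.compile(pattern.replace('?', '.'), re.IGNORECASE)

    def entry(title):
        # Crossword entries drop spaces and punctuation:
        # "Sheila E." -> "SheilaE"
        return ''.join(c for c in title if c.isalnum())

    matches = [t for t in inlink_counts if regex.fullmatch(entry(t))]
    matches.sort(key=lambda t: inlink_counts[t], reverse=True)
    return matches[:limit]

# search('??E?L??', {'Kremlin': 396, 'Shellac': 160, 'Feedlot': 82})
# -> ['Kremlin', 'Shellac', 'Feedlot']
```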

For the most part, this works pretty well. But let’s ask the site for the best results of the form ??E?L??.

Before we go on, try to think of some good crossword entries that would fit this pattern, preferably ones with Wikipedia pages.

Let’s see … there’s Kremlin and gremlin, Sheila E., The Blob, shellac … quite a few options. What are the top 10 results returned from Crossword Nexus?

Clenleu
Breilly
Kremlin
Fresles
Creully
Treclun
Bresles
Rieulay
Clesles
Treslon

That’s … extremely ugly. Wait, what the heck are those things even? Communes in France, are you kidding me? Is something messed up?

No, nothing’s messed up. Take a look at Clenleu’s page. See all those links at the bottom (click on “Show” on the “Communes of the Pas-de-Calais department” tab)? Yeah, all of those pages in turn link back to Clenleu. In fact, about 900 pages in all link to Clenleu, which kills the results.

Luckily, there’s an easy way around this. If we only count links in the article text itself, we can skip all of those useless links. If we do that, Clenleu only has (drum roll please) 3 inlinks. So that takes care of that. Here are the new top 10 results with this method:

Kremlin 396
Siedlce 239
Heerlen 226
Shellac 160
Sheila E. 139
Peebles 137
Ixelles 133
Breclav 88
Feedlot 82
Preslav 72

Ahhhh, better. Still no “gremlin” or “The Blob,” but those finished 12th and 13th, respectively. So that’s it, right? We’ve got our new ranking metric?
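
In case you want to try this at home, here’s roughly how the in-text counting works. The trick is that navigation boxes are transcluded templates, so their links never appear in a page’s raw wikitext; counting [[…]] links straight out of the dump skips them automatically. (A sketch only: the (title, wikitext) iterator is assumed to come from parsing the dump’s XML.)

```python
import re
from collections import Counter

# [[Target]], [[Target|shown text]], and [[Target#Section]] all link to Target.
WIKILINK = re.compile(r'\[\[([^\]|#]+)')

def intext_inlink_counts(pages):
    """Count incoming links using only each article's own wikitext."""
    counts = Counter()
    for title, wikitext in pages:
        # A set, so each article counts a given target at most once.
        targets = {m.group(1).strip() for m in WIKILINK.finditer(wikitext)}
        for target in targets:
            if target:
                # The first letter of a Wikipedia title is case-insensitive.
                counts[target[0].upper() + target[1:]] += 1
    return counts
```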

Well, hold on, why stop there? There might be other good possibilities, too. Let’s look at length of article, number of languages an article is translated into, and recency of last edit. Maybe these have some value too.

Length of article:

MHealth 55963
The Play 21931
Tien len 21263
Gremlin 17843
The Kliq 17477
Shellac 16519
Heerlen 16194
Sheila E. 14344
Eden Log 13981
Ixelles 13020

Uh, MHealth? I guess it’s a thing. People like to write about it, at least. “The Play” didn’t mean anything to me until I noticed it was referring to the “The band is on the field!” play. This is pretty good, too.

Next up, number of translations:

Kremlin 38
Apelles 34
Heerlen 31
Ixelles 29
Sterlet 27
Shellac 26
Siedlce 25
Buellas 25
Preslav 24
Usellus 23

This … is not so great. I fear that things translated into many languages might be mostly geographical sites.
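
(Side note on where that number comes from: translations show up as interlanguage links right in the wikitext, like [[fr:Clenleu]], so counting distinct language prefixes takes just a few lines. The regex below is deliberately rough; a careful version would check against the actual list of language codes.)

```python
import re

# Interlanguage links look like [[fr:Clenleu]] or [[zh-min-nan:...]].
INTERWIKI = re.compile(r'\[\[([a-z]{2,3}(?:-[a-z-]+)?):[^\]]+\]\]')

def language_count(wikitext):
    """Number of distinct languages an article links to."""
    return len(set(INTERWIKI.findall(wikitext)))
```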

I have high hopes for recency of last edit. Let’s take a look:

O'Neills 1325721157
Kvevlax 1325693146
Alex Lee 1325693087
The Kliq 1325642247
Sheila E. 1325640534
Buellas 1325616464
The Flow 1325616163
Poe's law 1325612070
Rieulay 1325611112
Uxelles 1325605593

Oh, no, this is the worst one yet. And what is so important about Buellas that it’s translated into so many languages and updated so frequently?
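
(In case those scores look opaque: they’re Unix timestamps of each page’s last edit, straight from the dump. For example:)

```python
from datetime import datetime, timezone

# The recency scores are Unix timestamps of the last edit.
print(datetime.fromtimestamp(1325721157, tz=timezone.utc))
# 2012-01-04 23:52:37+00:00
```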

Based on these results, I’m thinking of going with a metric that’s 70% inlinks and 30% article size. But what do you think? Should I exclude the other two completely? I’d like to get a little feedback before going live. Thanks!
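
For the curious, here’s the general shape of the blend. The two metrics live on very different scales (a top inlink count here is around 400, while article sizes run into the tens of thousands), so each one needs normalizing before the 70/30 weights mean anything. This sketch normalizes against the best value among the current matches; the exact normalization is still open to tweaking.

```python
def blended_score(inlinks, size, max_inlinks, max_size,
                  w_inlinks=0.7, w_size=0.3):
    """Blend two metrics on different scales into one score in [0, 1]."""
    return w_inlinks * (inlinks / max_inlinks) + w_size * (size / max_size)

# e.g., for the ??E?L?? matches above:
# blended_score(160, 16519, 396, 55963)   # Shellac
```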

UPDATE (3/7/2012 9:31 PM): In case you were curious how the 70/30 split would look, here are the first few:

Sheila E.
Shellac
Heerlen
Siedlce
Ixelles
Gremlin
Apelles
Kremlin
Peebles
Feedlot
The Play
The Bled
EHealth
Preslav
The Blob
Breclav
Abe clan
Wheelie

That’s pretty good. Cities and communes and the like are simply overrated by Wikipedia, so I’m not sure we could ever get rid of things like Heerlen and Siedlce and Ixelles. Thoughts?

9 Comments

  1. These look pretty good! Something I struggle with is figuring out good ways to deal with Kremlin. When most folks say Kremlin, they mean the thing in Wikipedia’s “Moscow Kremlin” article. Wikipedia’s “Kremlin” article is about some other thing. I try to use inlinks to figure out what these things are normally called, but those ain’t perfect.

  2. instead of just counting links, you could attempt to implement something like this:

    http://en.wikipedia.org/wiki/Pagerank

    i hear it’s been known to produce meaningful results.

  3. I tried implementing PageRank and it just about killed my computer. Maybe I’ll try again for the next update. For this update I’ll have to be happy with some combination of these four factors, I think.
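
    For the record, the standard power-iteration version looks something like the sketch below (plain dicts, nothing clever); whether it fits in memory on the full link graph is the real question.

    ```python
    def pagerank(links, damping=0.85, iterations=30):
        """Power-iteration PageRank. `links` maps page -> iterable of outlinks."""
        pages = set(links)
        n = len(pages)
        # Keep only links to pages we actually know about.
        out = {p: [t for t in links[p] if t in pages] for p in pages}
        rank = dict.fromkeys(pages, 1.0 / n)
        for _ in range(iterations):
            # Rank held by dead-end pages is redistributed uniformly.
            dangling = sum(rank[p] for p in pages if not out[p])
            base = (1.0 - damping + damping * dangling) / n
            new_rank = dict.fromkeys(pages, base)
            for p in pages:
                if out[p]:
                    share = damping * rank[p] / len(out[p])
                    for t in out[p]:
                        new_rank[t] += share
            rank = new_rank
        return rank
    ```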

  4. My intuition is that PageRank wouldn’t help.

    I have also noticed the over-representation of geographical references. Some of this has to do with the structure of Wikipedia, which tends to link to places a lot. Some of this has to do with the fact that we have a local bias and Wikipedia does not. At one point I was angry with my ranking algorithm for turning up some silly city in Indonesia, until I realized that it had something like 2 million people and was actually kind of a big deal even though I’d never heard of it.

    So I started thinking about what I’m looking for as a puzzle constructor. I don’t want things which 50 million Southeast Asians know about but 0.5% of North Americans know about. I want things which a _substantial, uncorrelated fraction of my audience_ knows about. What I don’t know is how to get that from Wikipedia. I think most of the metrics you’re talking about are _global prominence metrics_ which are very prone to magnifying things of _narrow, deep_ interest.

    So I started thinking about what would happen if I were to take all the pages in Wikipedia and sort of group them topically. Things about Southeast Asia over here. Things about beekeeping over there. Small and large clusters of related documents would be grouped. Then what I want (I hypothesize) is something that gets links from all over — it’s not just popular in its own little topic-corner of the world.

    There’s probably some graph metric that would identify this more reliably. But I don’t know what it is. Maybe PageRank would approximate it. Maybe it wouldn’t.

  5. Perhaps a centrality metric would work well. In particular, perhaps we should be focusing on “Betweenness centrality” instead of “Degree centrality”.
    http://en.wikipedia.org/wiki/Centrality#Betweenness_centrality

    Seems like it might be kind of expensive to compute, though.
    http://en.wikipedia.org/wiki/Betweenness_centrality#Algorithms
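
    If you want to experiment anyway, networkx can approximate it by sampling source nodes (the k parameter), which trades accuracy for speed on a big graph. A toy sketch:

    ```python
    import networkx as nx

    # Toy stand-in for the Wikipedia link graph.
    G = nx.DiGraph([('Kremlin', 'Moscow'), ('Moscow', 'Kremlin'),
                    ('Russia', 'Moscow'), ('Russia', 'Kremlin')])

    # Exact betweenness costs roughly O(V*E); sampling k sources approximates it.
    scores = nx.betweenness_centrality(G, k=min(100, len(G)))
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    ```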

  6. It’s an excellent point — how can we tell a computer what to do unless we can precisely define what it is we want it to do?

    I think for now I’ll stick with what I’ve got here and refine it on the next iteration.

  7. Have you tried total number of edits or frequency of edits? Number of distinct contributors?

  8. Andrew —

    I do think those would be useful … but I don’t know how to get that data from the Wikipedia data dumps.

    I’d also love to run a regression on these different statistics to see how correlated they are.

  9. Pingback: Wiki Ranking II – Lessons for next time | Box Relatives
