Box Relatives

Thoughts about puzzles, math, coding, and miscellaneous

More on ranking Wikipedia pages


I’m updating the Wikipedia Regex Search, creating a new ranked list of Wikipedia pages, and I have some thoughts:

  1. Thanks to an excellent comment from Jim Kingdon, I tried including some redirects in the new results. But even excluding all the redirects marked as “bad” isn’t enough. There are so many redirects not marked as anything, and it’s impossible to differentiate between them for scoring purposes. There’s a lot of junk in there, and even for the stuff that’s not junk, how do you score it? “Mr. October” is a redirect (to Reggie Jackson, of course). Do I give it the same score as Reggie? It only has 1 inlink; do I give it a terrible score because of that? I suppose I could take an average, but the problem of the junk remains. I’ll have to table this indefinitely.
  2. One problem I’ve had with this ranking is that many things come up that my intended audience (Americans) can’t be expected to know. This is essentially because there’s only one English-language Wikipedia, and so someone super-famous in Australia will get a high ranking, even though your average American may never have heard of him/her. But there may be a solution. To create this list I essentially extract the entire link structure of Wikipedia, which gives me a directed graph. Now, if I can group nodes in that graph to certain geographical regions, I can lower the scores of anything English, Australian, etc. I have some pretty smart readers — does anyone here know of a method for doing that?
  3. Along with the entire list of ranked Wikipedia titles, I will also put out a ranked list of famous people. It’s essentially any Wikipedia title scoring 90 or above with “???? births” as a category. Right now I’m using this to try to beat the Gaffney-Gordon pangram challenge, but there are many other uses as well. For instance, in Matt Gaffney’s book “Gridlock”, Peter Gordon (wait, those two again?!) talks about how he couldn’t find any famous people whose first name ends in “i” and whose last name begins with “I.” Well, if you had this list, you could just type
    grep 'i I' FamousNames.txt

    and 90 names will come up. Now, not all of them are “famous” and some of them (like Vasili IV of Russia) don’t match what the question actually asks, but some of them are decent. In case you’re curious, the top result with a score of 100 is Juli Inkster (that name was new to me). Midori Ito, the name given to Peter after he couldn’t find one, is not in the list right now (I’m having some trouble with names that can be written with diacritics) but she will probably score 97 or higher when it comes out.
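For anyone who would rather do this in a script than with grep, here is a rough Python sketch of the famous-people filter from item 3 and the same 'i I' search. It is not the code behind the actual list: the `(title, score, categories)` tuple layout, the function names, and the sample entries (all invented names and scores) are assumptions for illustration only.

```python
import re

# "???? births" is read here as: a four-digit year followed by " births".
BIRTH_CATEGORY = re.compile(r"\d{4} births")


def famous_people(pages, min_score=90):
    """Return (title, score) pairs that look like people, highest score first.

    `pages` is an assumed iterable of (title, score, categories) tuples.
    """
    people = [
        (title, score)
        for title, score, categories in pages
        if score >= min_score
        and any(BIRTH_CATEGORY.fullmatch(c) for c in categories)
    ]
    # Sort by descending score, then alphabetically to keep ties stable.
    return sorted(people, key=lambda p: (-p[1], p[0]))


def search_names(people, pattern):
    """Rough equivalent of `grep pattern FamousNames.txt` over the titles."""
    rx = re.compile(pattern)
    return [(title, score) for title, score in people if rx.search(title)]


# Invented sample data; none of these names or scores come from the real list.
sample_pages = [
    ("Ani Ibsen", 95, ["1901 births", "Fictional painters"]),
    ("Bob Stone", 99, ["1902 births"]),
    ("Some River", 97, ["Rivers of Nowhere"]),   # no births category
    ("Kai Irving", 80, ["1903 births"]),         # below the score cutoff
    ("Lori Innes", 92, ["1904 births"]),
]

people = famous_people(sample_pages)
matches = search_names(people, "i I")
```

As with the grep version, the pattern only looks for "i I" anywhere in the title, so it will also match titles that don't really fit the question.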

One Comment

  1. “Now, if I can group nodes in that graph to certain geographical regions, I can lower the scores of anything English, Australian, etc.”

    Wikipedia’s Categories might help here. E.g., Julian Assange’s Wikipedia entry is in categories like Australian activists, Australian computer programmers, Australian this, Australian that. Charles, Prince of Wales is in categories British Anglicans, British businesspeople, …

    Entries for geographical features often come with a lat/long. So if obscure British hamlets are ranking too high, it might make sense to check for lat/longs around there.
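The category idea above could be sketched roughly like this in Python. Everything here is an assumption made for illustration: the prefix-to-region table, the two-category threshold, and the flat 10-point penalty are placeholders, not anything the ranking actually uses.

```python
# Hypothetical mapping from category-name prefixes to a region label.
REGION_PREFIXES = {
    "Australian ": "Australia",
    "British ": "United Kingdom",
    "English ": "United Kingdom",
}


def guess_region(categories, min_votes=2):
    """Return the region most categories point to, or None if evidence is thin."""
    votes = {}
    for category in categories:
        for prefix, region in REGION_PREFIXES.items():
            if category.startswith(prefix):
                votes[region] = votes.get(region, 0) + 1
    if not votes:
        return None
    region, count = max(votes.items(), key=lambda kv: kv[1])
    # Require several matching categories so one stray category is not enough.
    return region if count >= min_votes else None


def adjusted_score(score, categories, penalty=10):
    """Subtract an assumed flat penalty when a listed region is detected."""
    if guess_region(categories) is None:
        return score
    return max(0, score - penalty)


# Category names taken from the comment above; the score of 95 is made up.
assange_categories = ["Australian activists", "Australian computer programmers"]
lowered = adjusted_score(95, assange_categories)
```

Prefix matching like this is crude and will produce false positives and misses, so it is probably best treated as a first pass to be checked by hand.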
