{"id":156,"date":"2012-03-08T04:34:48","date_gmt":"2012-03-08T04:34:48","guid":{"rendered":"http:\/\/alexboisvert.com\/musings\/?p=156"},"modified":"2012-03-08T17:02:45","modified_gmt":"2012-03-08T17:02:45","slug":"ranking-wikipedia-pages","status":"publish","type":"post","link":"https:\/\/alexboisvert.com\/musings\/2012\/03\/08\/ranking-wikipedia-pages\/","title":{"rendered":"Ranking Wikipedia Pages"},"content":{"rendered":"<p>The most interesting part of my <a href=\"http:\/\/crosswordnexus.com\">Crossword Nexus<\/a> website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results.  I didn&#8217;t want it to return results alphabetically &#8212; I wanted it to return results based on relevance.  But how can you automatically determine if a Wikipedia article is relevant?  Well, the method I implemented was ordering by inlinks.  The more links there were to an article, the more interesting it should be, right?<\/p>\n<p>For the most part, this works pretty well.  But let&#8217;s ask the site for the best results of the form <strong>??E?L??<\/strong>.<\/p>\n<p>Before we go on, try to think of some good crossword entries that would fit this pattern, preferably ones with Wikipedia pages.<br \/>\n<!--more--><\/p>\n<p>Let&#8217;s see &#8230; there&#8217;s Kremlin and gremlin, Sheila E., The Blob, shellac &#8230; quite a few options.  
What are the top 10 results returned from Crossword Nexus?<\/p>\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Clenleu\">Clenleu<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Breilly\">Breilly<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Kremlin\">Kremlin<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Fresles\">Fresles<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Creully\">Creully<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Treclun\">Treclun<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Bresles\">Bresles<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Rieulay\">Rieulay<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Clesles\">Clesles<\/a><br \/>\n<a href=\"http:\/\/en.wikipedia.org\/wiki\/Treslon\">Treslon<\/a><\/p>\n<p>That&#8217;s &#8230; extremely ugly.  Wait, what the heck are those things even?  Communes in France, are you kidding me?  Is something messed up?<\/p>\n<p>No, nothing&#8217;s messed up.  Take a look at <a href=\"http:\/\/en.wikipedia.org\/wiki\/Clenleu\">Clenleu<\/a>&#8217;s page.  See all those links at the bottom (click on &#8220;Show&#8221; on the &#8220;Communes of the Pas-de-Calais department&#8221; tab)?  Yeah, all of those pages in turn link back to Clenleu.  In fact, <a href=\"http:\/\/en.wikipedia.org\/w\/index.php?title=Special:WhatLinksHere\/Clenleu&#038;limit=500\">about 900 pages<\/a> in all link to Clenleu, which kills the results.<\/p>\n<p>Luckily, there&#8217;s an easy way around this.  If we only count links in the article text itself, we can skip all of those useless links.  If we do that, Clenleu only has (drum roll please) 3 inlinks.  So that takes care of that.  Here are the new top 10 results with this method:<\/p>\n<pre>\r\nKremlin 396\r\nSiedlce 239\r\nHeerlen 226\r\nShellac 160\r\nSheila E. 139\r\nPeebles 137\r\nIxelles 133\r\nBreclav 88\r\nFeedlot 82\r\nPreslav 72\r\n<\/pre>\n<p>Ahhhh, better.  
Still no &#8220;gremlin&#8221; or &#8220;The Blob&#8221; but those finished 12th and 13th respectively.  So that&#8217;s it, right?  We&#8217;ve got our new ranking metric?<\/p>\n<p>Well, hold on, why stop there?  There might be other good possibilities, too.  Let&#8217;s look at length of article, number of languages an article is translated into, and recency of last edit.  Maybe these have some value too.<\/p>\n<p>Length of article:<\/p>\n<pre>\r\nMHealth 55963\r\nThe Play 21931\r\nTien len 21263\r\nGremlin 17843\r\nThe Kliq 17477\r\nShellac 16519\r\nHeerlen 16194\r\nSheila E. 14344\r\nEden Log 13981\r\nIxelles 13020\r\n<\/pre>\n<p>Uh, MHealth?  I guess <a href=\"http:\/\/en.wikipedia.org\/wiki\/MHealth\">it&#8217;s a thing.<\/a>  People like to write about it, at least.  &#8220;The Play&#8221; didn&#8217;t mean anything to me until I noticed it was referring to the &#8220;The band is on the field!&#8221; play.  This is pretty good, too.<\/p>\n<p>Next up, number of translations:<\/p>\n<pre>\r\nKremlin 38\r\nApelles 34\r\nHeerlen 31\r\nIxelles 29\r\nSterlet 27\r\nShellac 26\r\nSiedlce 25\r\nBuellas 25\r\nPreslav 24\r\nUsellus 23\r\n<\/pre>\n<p>This &#8230; is not so great.  I fear that things translated into many languages might be mostly geographical sites.<\/p>\n<p>I have high hopes for recency of last edit.  Let&#8217;s take a look:<\/p>\n<pre>\r\nO'Neills 1325721157\r\nKvevlax 1325693146\r\nAlex Lee 1325693087\r\nThe Kliq 1325642247\r\nSheila E. 1325640534\r\nBuellas 1325616464\r\nThe Flow 1325616163\r\nPoe's law 1325612070\r\nRieulay 1325611112\r\nUxelles 1325605593\r\n<\/pre>\n<p>Oh, no, this is the worst one yet.  And what is so important about Buellas that it&#8217;s translated into so many languages and updated so frequently?<\/p>\n<p>Based on these results, I&#8217;m thinking of going with a metric that&#8217;s 70% inlinks and 30% article size.  But what do you think?  Should I exclude the other two completely?  
I&#8217;d like to get a little feedback before going live.  Thanks!<\/p>\n<p><strong>UPDATE (3\/7\/2012 9:31 PM)<\/strong> In case you were curious how the 70\/30 split would look, here are the first few:<\/p>\n<pre>\r\nSheila E.\r\nShellac\r\nHeerlen\r\nSiedlce\r\nIxelles\r\nGremlin\r\nApelles\r\nKremlin\r\nPeebles\r\nFeedlot\r\nThe Play\r\nThe Bled\r\nEHealth\r\nPreslav\r\nThe Blob\r\nBreclav\r\nAbe clan\r\nWheelie\r\n<\/pre>\n<p>That&#8217;s pretty good.  Cities and communes and the like are simply overrated by Wikipedia, so I&#8217;m not sure we could ever get rid of things like Heerlen and Siedlce and Ixelles.  Thoughts?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The most interesting part of my Crossword Nexus website is the Wikipedia Regex search, and the most interesting part of that is the ordering of results. I didn&#8217;t want it to return results alphabetically &#8212; I wanted it to return &hellip; <a href=\"https:\/\/alexboisvert.com\/musings\/2012\/03\/08\/ranking-wikipedia-pages\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-156","post","type-post","status-publish","format-standard","hentry","category-coding"],"_links":{"self":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/156","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/comments?post=156"}],"version-history":[{"count":8,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/156\/revisions"}],"predecessor-version":[{"id":165,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/156\/revisions\/165"}],"wp:attachment":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/media?parent=156"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/categories?post=156"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/tags?post=156"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}