{"id":288,"date":"2013-07-01T04:46:46","date_gmt":"2013-07-01T04:46:46","guid":{"rendered":"http:\/\/alexboisvert.com\/musings\/?p=288"},"modified":"2013-07-10T00:58:26","modified_gmt":"2013-07-10T00:58:26","slug":"more-on-ranking-wikipedia-pages","status":"publish","type":"post","link":"https:\/\/alexboisvert.com\/musings\/2013\/07\/01\/more-on-ranking-wikipedia-pages\/","title":{"rendered":"More on ranking Wikipedia pages"},"content":{"rendered":"<p>I&#8217;m updating the <a href=\"http:\/\/www.crosswordnexus.com\/wiki.php\">Wikipedia Regex Search<\/a>, creating a new ranked list of Wikipedia pages, and I have some thoughts:<\/p>\n<ol>\n<li>Thanks to an <a href=\"http:\/\/alexboisvert.com\/musings\/2012\/11\/20\/wikipedia-regex-search-updated\/#comment-569\">excellent comment from Jim Kingdon<\/a>, I tried including some redirects in the new results.  But even excluding all the redirects marked as &#8220;bad&#8221; isn&#8217;t enough.  There are so many redirects not marked as anything and it&#8217;s impossible to differentiate between them for scoring purposes.  There&#8217;s  lot of junk in there, and even for the stuff that&#8217;s not junk &#8211; how do you score it?  &#8220;Mr. October&#8221; is a redirect (to Reggie Jackson, of course).  Do I give it the same score as Reggie?  It only has 1 inlink, do I give it a terrible score because of that?  I suppose I could take an average, but the problem of the junk remains.  I&#8217;ll have to table this indefinitely.<\/li>\n<p><!--more--><\/p>\n<li>One problem I&#8217;ve had with this ranking is that many things come up that my intended audience (Americans) can&#8217;t be expected to know.  This is essentially because there&#8217;s only one English-language Wikipedia, and so someone super-famous in Australia will get a high ranking, even though your average American may never have heard of him\/her.  But there may be a solution.  
To create this list I essentially extract the entire link structure of Wikipedia, which gives me a directed graph.  Now, if I can assign the nodes in that graph to geographical regions, I can lower the scores of anything English, Australian, etc.  I have some pretty smart readers &#8212; does anyone here know of a method for doing that?<\/li>\n<li>Along with the entire list of ranked Wikipedia titles, I will also put out a ranked list of famous people.  It&#8217;s essentially any Wikipedia title ranked 90 or above with &#8220;???? births&#8221; as a category.  Right now I&#8217;m using this to try to beat the <a href=\"http:\/\/xwordcontest.com\/2013\/06\/mgwcc-265-friday-june-28th-2013-found-in-translation.html\">Gaffney-Gordon pangram challenge<\/a>, but there are many other uses as well.  For instance, in Matt Gaffney&#8217;s book &#8220;Gridlock&#8221;, Peter Gordon (wait, those two again?!) talks about how he couldn&#8217;t find any famous people whose first name ends in &#8220;i&#8221; and whose last name begins with &#8220;I.&#8221;  Well, if you had this list, you could just type\n<pre>grep 'i I' FamousNames.txt<\/pre>\n<p>and 90 names will come up.  Now, not all of them are &#8220;famous&#8221; and some of them (like Vasili IV of Russia) don&#8217;t match what the question actually asks, but some of them are decent.  In case you&#8217;re curious, the top result with a score of 100 is <a href=\"http:\/\/en.wikipedia.org\/wiki\/Juli_Inkster\">Juli Inkster<\/a> (that name was new to me).  
Midori Ito, the name given to Peter after he couldn&#8217;t find one, is not in the list right now (I&#8217;m having some trouble with names that can be written with diacritics), but she will probably score 97 or higher when it comes out.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m updating the Wikipedia Regex Search, creating a new ranked list of Wikipedia pages, and I have some thoughts: Thanks to an excellent comment from Jim Kingdon, I tried including some redirects in the new results. But even excluding all &hellip; <a href=\"https:\/\/alexboisvert.com\/musings\/2013\/07\/01\/more-on-ranking-wikipedia-pages\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-288","post","type-post","status-publish","format-standard","hentry","category-coding"],"_links":{"self":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/288","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/comments?post=288"}],"version-history":[{"count":2,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/288\/revisions"}],"predecessor-version":[{"id":296,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/posts\/288\/revisions\/296"}],"wp:attachment":[{"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/media?parent=288"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/catego
ries?post=288"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alexboisvert.com\/musings\/wp-json\/wp\/v2\/tags?post=288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}