The news first: I’ve updated the Wikipedia Regex Search to include Wiktionary in its results. The Wikipedia results have also been updated to be current as of November 1st.
Now the problem: to test it out, I attempted to solve the most recent Matt Gaffney Contest using the search, but it didn’t turn anything up. Why? Because “Oracle of Omaha” isn’t a full-fledged Wikipedia page, just a redirect, and I exclude redirects from my results.
So what’s the fix here? The obvious one is to include redirects in my results, but I can’t include all of them wholesale. Just look at all the pages that redirect to “Condoleezza Rice” to see why. No thanks.
So is there a way to be more judicious about choosing which redirects to use? There must be; after all, Onelook seems to handle it just fine. For now, I’m thinking of comparing each redirect against a list of known “good” entries, maybe from my clue database or the collaborative word list. If a redirect title appears in one of those, I could include it and give it the same score as the page it redirects to. (Incidentally, “Oracle of Omaha” is in my clue database but not in the collaborative word list, so I’ll have to add it.)
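Here is a rough sketch of that filtering idea in Python. It is not how the search is actually built. The function names, the dictionary layout, the toy titles, and the scores are all made up for illustration, and I’m assuming titles get compared after stripping everything except letters and digits.

```python
def normalize(title):
    """Lowercase a title and keep only letters and digits.

    Assumption: word-list entries and page titles should match regardless
    of case, spaces, and punctuation.
    """
    return "".join(ch for ch in title.lower() if ch.isalnum())


def select_redirects(redirects, page_scores, known_good):
    """Return {redirect_title: score} for redirects found in a known-good list.

    redirects:   dict mapping redirect title -> title of the target page
    page_scores: dict mapping full page title -> score
    known_good:  iterable of entries from a trusted list (hypothetically,
                 a clue database or word list)
    """
    good = {normalize(entry) for entry in known_good}
    kept = {}
    for redirect, target in redirects.items():
        # Keep a redirect only if its target has a score and the redirect
        # title itself shows up in the trusted list.
        if target in page_scores and normalize(redirect) in good:
            kept[redirect] = page_scores[target]  # inherit the target's score
    return kept


# Toy data, invented for this example.
redirects = {
    "Oracle of Omaha": "Warren Buffett",
    "Condoleeza Rice": "Condoleezza Rice",
    "Condi Rice": "Condoleezza Rice",
}
page_scores = {"Warren Buffett": 87, "Condoleezza Rice": 90}
known_good = ["ORACLE OF OMAHA", "WARREN BUFFETT"]

result = select_redirects(redirects, page_scores, known_good)
print(result)
```

With this toy data only “Oracle of Omaha” survives, and it inherits the score of its target page. The misspelling and the nickname get dropped because neither appears in the trusted list.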
Is there another way to determine which redirects to use? I’d love to hear suggestions. Anything I can do to improve my tool would be great.
April 11, 2013 at 2:22 pm
Did you try excluding redirects tagged with things like {{R from short name}} or in category [[Category:Unprintworthy redirects]]? Don’t know if this exactly matches what you need, but at first glance the unprintworthy thing would seem to be close.
April 11, 2013 at 4:09 pm
I didn’t even know about that! I will definitely look into that for the next iteration. Thanks!
Pingback: More on ranking Wikipedia pages | Box Relatives