Here’s this week’s NPR puzzle:
Name a country with at least three consonants. These are the same consonants, in the same order, as in the name of a language spoken by millions of people worldwide. The country and the place where the language is principally spoken are in different parts of the globe. What country and what language are these?
Let’s see what NLTK can do for us here.
As usual, the goal is to do this in pure Python, without any external help. Getting a list of countries is super easy:
    from nltk.corpus import gazetteers

    countries = set([country
                     for filename in ('isocountries.txt', 'countries.txt')
                     for country in gazetteers.words(filename)])
Getting a list of languages? Well, let’s use WordNet for that. First, let’s look at where the word “Swahili” falls in WordNet by printing its chain of hypernyms:
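    from nltk.corpus import wordnet as wn

    # Start from the first synset for "swahili"
    synset = wn.synsets('swahili')[0]
    while synset:
        print(synset.lemma_names())
        hypernyms = synset.hypernyms()
        # Follow the first hypernym until we run out at the root
        synset = hypernyms[0] if hypernyms else None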
    [u'Swahili']
    [u'Bantu', u'Bantoid_language']
    [u'Niger-Congo']
    [u'Niger-Kordofanian', u'Niger-Kordofanian_language']
    [u'natural_language', u'tongue']
    [u'language', u'linguistic_communication']
    [u'communication']
    [u'abstraction', u'abstract_entity']
    [u'entity']
All right, looks like taking all the hyponyms of “natural_language” will work nicely. We’ll get some things we don’t need — namely language families like “Niger-Kordofanian” — but it’s all right, we’ll just remove them with the eyeball test.
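To make "taking all the hyponyms" concrete, here is a minimal sketch that walks a toy hierarchy instead of WordNet. The `TOY_HYPONYMS` dictionary and the `all_hyponyms` helper are made up for illustration; the only real data in it is the Swahili chain printed above. It is meant to show the idea of a depth-limited walk, not how NLTK implements it.

```python
from collections import deque

# Hand-made stand-in for a tiny slice of WordNet (hypothetical data structure);
# the chain itself is the one printed for "Swahili" above.
TOY_HYPONYMS = {
    "natural_language": ["Niger-Kordofanian"],
    "Niger-Kordofanian": ["Niger-Congo"],
    "Niger-Congo": ["Bantu"],
    "Bantu": ["Swahili"],
    "Swahili": [],
}


def all_hyponyms(root, hyponyms, max_depth=10):
    """Breadth-first walk collecting everything below root, up to max_depth levels."""
    seen = set()
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand nodes that already sit at the depth limit
        for child in hyponyms.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


members = all_hyponyms("natural_language", TOY_HYPONYMS)
print(sorted(members))
```

Note that the result mixes actual languages ("Swahili") with language families ("Niger-Kordofanian"), which is exactly why the eyeball test is needed afterwards.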
Now that we’re ready to go, we’ll apply the trick we used before to get members of a category and we’re off:
    from __future__ import print_function  # print() then works on Python 2 and 3

    import re
    from collections import defaultdict

    from nltk.corpus import gazetteers
    from nltk.corpus import wordnet as wn

    def just_consonants(w):
        ''' Lowercase the word and strip its vowels (y counts as a vowel) '''
        w = w.lower()
        return re.sub(r'[aeiouy]+', '', w)

    def get_category_members(name):
        ''' Use NLTK to get members of a category '''
        members = set()
        synsets = wn.synsets(name)
        for synset in synsets:
            members = members.union(set([w for s in synset.closure(lambda s: s.hyponyms(), depth=10)
                                         for w in s.lemma_names()]))
        return members

    ##################
    # Get a list of languages
    languages = get_category_members('natural_language')

    # Make a dictionary of consonantcy -> language
    lang_dict = defaultdict(list)
    for w in languages:
        lang_dict[just_consonants(w)].append(w)

    # Get a list of countries
    countries = set([country
                     for filename in ('isocountries.txt', 'countries.txt')
                     for country in gazetteers.words(filename)])

    # Make a dictionary of consonantcy -> country
    country_dict = defaultdict(list)
    for w in countries:
        country_dict[just_consonants(w)].append(w)
    cons_country_set = frozenset(country_dict.keys())

    # Print every consonantcy of length 3 or more shared by a country and a language
    counter = 1
    for consonantcy in lang_dict:
        if consonantcy in cons_country_set and len(consonantcy) >= 3:
            print(counter, country_dict[consonantcy], lang_dict[consonantcy])
            counter += 1
    1 [u'Somalia'] [u'Somali']
    2 [u'Turkey'] [u'Turki']
    3 [u'Uganda'] [u'Gondi']
    4 [u'Chad'] [u'Chad']
    5 [u'America'] [u'Maraco']
    6 [u'Tonga'] [u'Tonga']
    7 [u'Lebanon'] [u'Albanian']
    8 [u'Nepal'] [u'Nepali']
    9 [u'Slovenia'] [u'Slovene']
    10 [u'Slovakia'] [u'Slovak']
    11 [u'Armenia', u'Romania'] [u'Romany']
    12 [u'Azerbaijan'] [u'Azerbaijani']
    13 [u'Germany'] [u'German']
    14 [u'Malawi'] [u'Mulwi']
    15 [u'Malta'] [u'Yamaltu', u'Malto', u'Malti']
    16 [u'Greece'] [u'Ugric']
    17 [u'China'] [u'Chin']
    18 [u'Ukraine'] [u'Korean', u'Karen']
Well, what do you know. You could argue, I guess, for #3 or #16, but far and away the best answers are #7 (Lebanon and Albanian) and #18 (Ukraine and Korean, the first language listed, anyway). Nice puzzle! And it once again goes to show the power of using NLTK to get members of a category.
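If you want to double-check those pairs without downloading any NLTK data, the matching step is easy to try on its own. The sketch below reuses the same vowel-stripping regex on a few names copied by hand from the output; the short `countries` and `languages` lists are assumptions for illustration, not the full corpora.

```python
import re
from collections import defaultdict


def just_consonants(word):
    """Lowercase the word and drop vowels; 'y' counts as a vowel here."""
    return re.sub(r'[aeiouy]+', '', word.lower())


# Hand-picked names from the output above, not the full gazetteer or WordNet lists.
countries = ['Lebanon', 'Ukraine', 'Greece', 'Uganda']
languages = ['Albanian', 'Korean', 'Karen', 'Ugric', 'Gondi']

# Group the languages by their consonant skeleton.
by_skeleton = defaultdict(list)
for language in languages:
    by_skeleton[just_consonants(language)].append(language)

# Look each country up by its own skeleton, keeping the puzzle's 3-consonant minimum.
matches = {}
for country in countries:
    skeleton = just_consonants(country)
    if len(skeleton) >= 3 and skeleton in by_skeleton:
        matches[country] = by_skeleton[skeleton]

for country, langs in matches.items():
    print(country, just_consonants(country), langs)
```

Both Korean and Karen reduce to the same consonants as Ukraine, which is why #18 lists two languages.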