Posts Tagged ‘wikipedia’
(This was written after the first, but before viewing the second, match of the Jeopardy IBM Challenge.)
I’m a fan of Jeopardy, a professional software developer, and a strong optimist about the prospects for artificial intelligence – so I’m immensely enjoying the contest between the IBM Watson system and human Jeopardy Champions Ken Jennings and Brad Rutter.
Watson’s ability to answer quickly and confidently, even when clues are indirect or euphemistic, has been impressive. It’s rightly hailed as one of the most impressive demonstrations yet of information retrieval and even natural-language understanding.
When Alex Trebek walked by the 10 racks of 9 servers each, said to include 2880 computing cores and 15 terabytes (15,000 gigabytes) of high-speed RAM main-memory, I couldn’t shake the feeling: this seems like too much hardware… at least if any of the software includes new breakthroughs of actual understanding. As parts of the show took on the character of an IBM infomercial, my skepticism only grew.
Let me explain.
While Jeopardy questions are challenging and wide-ranging, in highly idiomatic (even whimsical) English, this trivia game remains a very constrained domain. The clues are short; the answers just a few words, at most, and usually discrete named entities – the kinds of things that have their own titled entries in encyclopedias, dictionaries, and gazetters/almanacs of various sorts.
While the clues often have wordplay, many also have signifiers that clearly indicate exactly what kind of word/phrase completion is expected. (The strongest is perhaps the word ‘this’, as in “this protein” or “‘Storm on the Sea of’ this”. But categories which promise a certain word or phrase will be in the answer, with that portion in quotes, also help brute-force search plenty.)
I strongly suspect that almost anyone with even a rudimentary understanding of language, and enough time to research clues in an offline static copy of Wikipedia, could get 90%+ of the clues right.
An offline copy of all of Wikipedia’s articles, as of the last full data-dump, is about 6.5GB compressed, 30GB uncompressed – that’s 1/500th Watson’s RAM. Furthermore, chopping this data up for rapid access – such as creating an inverted index, and replacing named/linked entities with ordinal numbers – tends to result in even smaller representations. So with fast lookup and a modicum of understanding, one server, with 64GB of RAM, could be more than enough to contain everything a language-savvy agent would need to dominate at Jeopardy.
But what if you’re not language savvy, and only have brute-force text-lookup? We can simulate the kinds of answers even a naive text-search approach against a Wikipedia snapshot might produce, by performing site-specific queries on Google.
For many of the questions Watson got right, a naive Google query of the ‘en.wikipedia.org’ domain, using the key words in the clue, will return as the first result the exact Wikipedia article whose title is the correct answer. For example:
DON’T WORRY ABOUT IT for $2000:
“It’s just acne! You don’t have this skin infection also known as Hansen’s Disease”
[site:en.wikipedia.org acne skin infection hansen’s disease]
1st result: LEPROSY (right answer)
CAMBRIDGE for $1600/Daily Double:
“The chapels at Pembroke & Emmanuel Colleges were designed by this architect ”
[site:en.wikipedia.org chapels pembroke emmanuel colleges designed architect]
1st result: CHRISTOPER WREN (right answer)
HEDGEHOG-PODGE for $2000 :
“A recent bestseller by Muriel Barbery is called this ‘of the hedgehog'”
[site:en.wikipedia.org bestseller Muriel Barbery “of the Hedgehog”]
1st result: THE ELEGANCE OF THE HEDGEHOG (right answer)
CAMBRIDGE for $2000:
“This ‘Narnia’ author went from teaching at Magdalen College, Oxford to teaching at Magdalene College, Cambridge”
[site:en.wikipedia.org “Narnia” author teaching Magdalen College Oxford Magdalene College Cambridge]
1st result: C.S. LEWIS (right answer)
Often, the correct answer isn’t first, but other trivial heuristics can reveal the answer further down. For example, discard any title that has already appeared in the question, and thus is unlikely to be the answer. Consider:
“CHURCH” and “STATE” for $400:
“A Dana Carvey character on ‘Saturday Night Live’; Isn’t that special…”
[site:en.wikipedia.org dana carvey character “saturday night live” special]
Dana Carvey (struck as appearing in clue)
2nd: THE CHURCH LADY (right answer)
ETUDE, BRUTE for $2000:
“From 1911 to 1917, this Romantic Russian composed ‘Etudes-Tableaux’ for piano ”
[site:en.wikipedia.org 1911 1917 romantic russian “etudes-tableaux” piano]
Etudes-Tableaux (struck as appearing in clue)
2nd: SERGEI RACHMANINOFF (right answer)
Even when this technique fails, it sometimes fails just like Watson or real contestants:
THE ART OF THE STEAL for $1600:
“In May 2010 5 paintings worth $125 million by Braque, Matisse & 3 others left Paris’ Museum of this art period”
[site:en.wikipedia.org may 2010 paintings 125 million braque matisse paris museum art period]
1st: Picasso (Watson’s wrong, nonsensical answer)
…iteratively stripping some words trying to find candidates matching ‘this art period’…
[site:en.wikipedia.org paintings braque matisse paris museum art period]
Braque (struck as appearing in clue)
2nd: Cubism (Ken Jennings’ wrong answer)
Matisse (struck as appearing in clue)
4th: MODERN ART (right answer; Watson’s 3rd option)
And even where this this technique doesn’t yield the answer in a top page title, the answer is usually close at hand.
HEDGEHOG-PODGE for $400:
“Some hedgehogs enter periods of torpor; the Western European species spends the winter in this dormant condition”
[site:en.wikipedia.org hedgehogs periods of torpor western european species winter dormant condition]
1st: Short-beaked Echidna (HIBERNATION, the correct answer, appears prominently in the snippet a few words from ‘torpor’)
2nd: Bat (HIBERNATION appears alongside ‘winter’ in snippet)
There are many, many other possible heuristics for knowing when to accept or reject the naive top-results, and when patterns of words in the source material could yield other answers not in the title, or create confidence in answers stitched together from elsewhere. For example, consider the clue:
HEDGEHOG-PODGE for $1600:
“Hedgehogs are covered with quills or spines, which are hollow hairs made stiff by this protein”
The phrase ‘this protein’ strongly indicates the answer is a protein; Wikipedia has a ‘list of proteins’. Only one protein on that list also appears on each of the ‘hedgehog’ and ‘Spines (zoology)’ pages: KERATIN, the correct answer.
With a full, inverse-indexed, cross-linked, de-duplicated version of Wikipedia all in RAM, even a single server, with a few cores, can run hundreds of iteratively-refined probe queries, and scan the full-text of articles for sentences that correlate with the clue, in the seconds it takes Trebek to read the clue.
That makes me think that if you gave a leaner, younger, hungrier team millions of dollars and years to mine the entire history of Jeopardy answers-and-questions for workable heuristics, they could match Watson’s performance with a tiny fraction of Watson’s hardware.
Unfortunately, Jeopardy didn’t open this as a general challenge to all, like the DARPA Grand Challenges, with a large prize to motivate creative entries. Jeopardy seems to have simply followed IBM’s lead – and perhaps even received promotional payments from IBM for doing so. (I can’t find a definitive statement either way.)
IBM is known, and rightly admired, for many things… but hardware thrift isn’t one of them. And the boost to IBM’s sales from this whole exercise wouldn’t be nearly as large if Watson were a single machine, able to be positioned on the podium next to its human challengers, barely larger than the monitor displaying Final Jeopardy answers. That wouldn’t move roomfuls of computers!
Nice job, Jeopardy and IBM, but next time: open it to stingier teams!