Little House on the Noosphere

IBM Watson: Overprovisioned “Big Iron”?

IBM Watson: 90 servers, 2880 cores, 15TB RAM

(This was written after the first, but before viewing the second, match of the Jeopardy IBM Challenge.)

I’m a fan of Jeopardy, a professional software developer, and a strong optimist about the prospects for artificial intelligence – so I’m immensely enjoying the contest between the IBM Watson system and human Jeopardy Champions Ken Jennings and Brad Rutter.

Watson’s ability to answer quickly and confidently, even when clues are indirect or euphemistic, has been remarkable. It’s rightly hailed as one of the most impressive demonstrations yet of information retrieval and even natural-language understanding.

And yet…

When Alex Trebek walked by the 10 racks of 9 servers each, said to include 2880 computing cores and 15 terabytes (15,000 gigabytes) of high-speed RAM, I couldn’t shake the feeling: this seems like too much hardware… at least if any of the software includes new breakthroughs of actual understanding. As parts of the show took on the character of an IBM infomercial, my skepticism only grew.

Let me explain.

While Jeopardy questions are challenging and wide-ranging, in highly idiomatic (even whimsical) English, this trivia game remains a very constrained domain. The clues are short; the answers just a few words, at most, and usually discrete named entities – the kinds of things that have their own titled entries in encyclopedias, dictionaries, and gazetteers/almanacs of various sorts.

While the clues often have wordplay, many also have signifiers that clearly indicate exactly what kind of word/phrase completion is expected. (The strongest is perhaps the word ‘this’, as in “this protein” or “‘Storm on the Sea of’ this”. But categories which promise a certain word or phrase will be in the answer, with that portion in quotes, also help brute-force search plenty.)
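Those signifiers are mechanically extractable, too. As a rough illustration (the function name and regex are my own sketch, nothing from Watson), a single pattern already pulls the expected answer type out of many clues:

```python
import re

# Pull the expected answer type out of a "this X" signifier
# ("this protein", "this architect", "this skin infection", ...).
# Real clues would need many more patterns than this one regex.
def expected_type(clue):
    m = re.search(r"\bthis ([a-z]+(?: [a-z]+)?)", clue.lower())
    return m.group(1) if m else None

print(expected_type("hollow hairs made stiff by this protein"))  # protein
```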

I strongly suspect that almost anyone with even a rudimentary understanding of language, and enough time to research clues in an offline static copy of Wikipedia, could get 90%+ of the clues right.

An offline copy of all of Wikipedia’s articles, as of the last full data-dump, is about 6.5GB compressed, 30GB uncompressed – that’s 1/500th Watson’s RAM. Furthermore, chopping this data up for rapid access – such as creating an inverted index, and replacing named/linked entities with ordinal numbers – tends to result in even smaller representations. So with fast lookup and a modicum of understanding, one server, with 64GB of RAM, could be more than enough to contain everything a language-savvy agent would need to dominate at Jeopardy.
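To make the inverted-index claim concrete, here is a toy version in Python – the two stub ‘articles’ are made-up stand-ins for a preprocessed dump, and the scoring (count of matched keywords) is deliberately naive:

```python
from collections import defaultdict

# Toy corpus: made-up stand-ins for preprocessed Wikipedia articles.
articles = {
    "Leprosy": "chronic infection also known as hansen's disease",
    "Hedgehog": "spiny mammal that may enter torpor in winter",
}

# Inverted index: each word maps to the set of titles containing it.
index = defaultdict(set)
for title, text in articles.items():
    for word in text.split():
        index[word].add(title)

# Candidate answers are titles matching the most clue keywords.
def lookup(*keywords):
    hits = defaultdict(int)
    for kw in keywords:
        for title in index.get(kw, ()):
            hits[title] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(lookup("hansen's", "infection"))  # ['Leprosy']
```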

But what if you’re not language savvy, and only have brute-force text-lookup? We can simulate the kinds of answers even a naive text-search approach against a Wikipedia snapshot might produce, by performing site-specific queries on Google.

For many of the questions Watson got right, a naive Google query restricted to the ‘en.wikipedia.org’ domain, using the key words in the clue, will return as the first result the exact Wikipedia article whose title is the correct answer. For example:

“It’s just acne! You don’t have this skin infection also known as Hansen’s Disease”
[site:en.wikipedia.org acne skin infection hansen’s disease]
1st result: LEPROSY (right answer)

CAMBRIDGE for $1600/Daily Double:
“The chapels at Pembroke & Emmanuel Colleges were designed by this architect”
[site:en.wikipedia.org chapels pembroke emmanuel colleges designed architect]
1st result: CHRISTOPHER WREN (right answer)

HEDGEHOG-PODGE for $2000:
“A recent bestseller by Muriel Barbery is called this ‘of the hedgehog’”
[site:en.wikipedia.org bestseller Muriel Barbery “of the Hedgehog”]
1st result: THE ELEGANCE OF THE HEDGEHOG (right answer)

CAMBRIDGE for $2000:
“This ‘Narnia’ author went from teaching at Magdalen College, Oxford to teaching at Magdalene College, Cambridge”
[site:en.wikipedia.org “Narnia” author teaching Magdalen College Oxford Magdalene College Cambridge]
1st result: C.S. LEWIS (right answer)

Often, the correct answer isn’t first, but other trivial heuristics can reveal the answer further down. For example, discard any title that has already appeared in the clue, and thus is unlikely to be the answer. Consider:

“CHURCH” and “STATE” for $400:
“A Dana Carvey character on ‘Saturday Night Live’; Isn’t that special…”
[site:en.wikipedia.org dana carvey character “saturday night live” special]
1st: Dana Carvey (struck as appearing in clue)
2nd: THE CHURCH LADY (right answer)

ETUDE, BRUTE for $2000:
“From 1911 to 1917, this Romantic Russian composed ‘Etudes-Tableaux’ for piano”
[site:en.wikipedia.org 1911 1917 romantic russian “etudes-tableaux” piano]
1st: Etudes-Tableaux (struck as appearing in clue)
2nd: SERGEI RACHMANINOFF (right answer)
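That ‘strike titles appearing in the clue’ pass is itself trivial to implement. A minimal sketch (my own illustration, using word-set containment; a real version would normalize text more carefully):

```python
import re

# Walk ranked result titles; skip any whose every word already
# appears in the clue, since such a title is unlikely to be the answer.
def first_novel_title(clue, ranked_titles):
    words = lambda s: set(re.findall(r"[a-z]+(?:'[a-z]+)?", s.lower()))
    clue_words = words(clue)
    for title in ranked_titles:
        if not words(title) <= clue_words:
            return title
    return None

clue = "A Dana Carvey character on 'Saturday Night Live'; Isn't that special"
print(first_novel_title(clue, ["Dana Carvey", "The Church Lady"]))  # The Church Lady
```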

Even when this technique fails, it sometimes fails just like Watson or real contestants:

“In May 2010 5 paintings worth $125 million by Braque, Matisse & 3 others left Paris’ Museum of this art period”
[site:en.wikipedia.org may 2010 paintings 125 million braque matisse paris museum art period]
1st: Picasso (Watson’s wrong, nonsensical answer)

…iteratively stripping some words trying to find candidates matching ‘this art period’…

[site:en.wikipedia.org paintings braque matisse paris museum art period]
1st: Braque (struck as appearing in clue)
2nd: Cubism (Ken Jennings’ wrong answer)
3rd: Matisse (struck as appearing in clue)
4th: MODERN ART (right answer; Watson’s 3rd option)

And even where this technique doesn’t yield the answer in a top page title, the answer is usually close at hand.

“Some hedgehogs enter periods of torpor; the Western European species spends the winter in this dormant condition”
[site:en.wikipedia.org hedgehogs periods of torpor western european species winter dormant condition]
1st: Short-beaked Echidna (HIBERNATION, the correct answer, appears prominently in the snippet a few words from ‘torpor’)
2nd: Bat (HIBERNATION appears alongside ‘winter’ in snippet)
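The iterative word-stripping used above is also simple in outline. In this sketch (again my own, not Watson’s), ‘search’ is a stand-in for a query against the local snapshot, and the fixed drop-the-last-keyword order is an oversimplification; the stub reproduces the Braque/Cubism trace from the art-period clue:

```python
# Re-probe with progressively fewer keywords until a candidate title
# emerges that does not already appear in the clue text.
def refine(keywords, search, clue, max_rounds=4):
    for _ in range(max_rounds):
        for title in search(keywords):
            if title.lower() not in clue.lower():
                return title
        keywords = keywords[:-1]  # shed the least essential keyword
    return None

# Illustrative stub: the full query misses; a shorter one surfaces titles.
def fake_search(keywords):
    return ["Braque", "Cubism"] if len(keywords) <= 3 else []

clue = "5 paintings by Braque, Matisse & 3 others left Paris' Museum of this art period"
print(refine(["paintings", "braque", "matisse", "museum"], fake_search, clue))  # Cubism
```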

There are many, many other possible heuristics for knowing when to accept or reject the naive top-results, and when patterns of words in the source material could yield other answers not in the title, or create confidence in answers stitched together from elsewhere. For example, consider the clue:

“Hedgehogs are covered with quills or spines, which are hollow hairs made stiff by this protein”

The phrase ‘this protein’ strongly indicates the answer is a protein; Wikipedia has a ‘list of proteins’. Only one protein on that list also appears on each of the ‘hedgehog’ and ‘Spines (zoology)’ pages: KERATIN, the correct answer.
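That intersection test reduces to a few lines over page texts. A sketch with abbreviated stand-in texts (the real pages are far longer, and plain substring matching is crude):

```python
# Keep only candidates (from a type-matched list) that occur in the
# text of every clue-related page; substring matching keeps it simple.
protein_list = ["keratin", "collagen", "insulin", "myosin"]
pages = {
    "Hedgehog": "its spines are hollow hairs made stiff with keratin",
    "Spines (zoology)": "a spine is a hard, keratin-covered projection",
}

def on_every_page(candidates, page_texts):
    return [c for c in candidates
            if all(c in text.lower() for text in page_texts.values())]

print(on_every_page(protein_list, pages))  # ['keratin']
```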

With a full, inverted-indexed, cross-linked, de-duplicated version of Wikipedia all in RAM, even a single server with a few cores can run hundreds of iteratively-refined probe queries, and scan the full text of articles for sentences that correlate with the clue, in the seconds it takes Trebek to read the clue.

That makes me think that if you gave a leaner, younger, hungrier team millions of dollars and years to mine the entire history of Jeopardy answers-and-questions for workable heuristics, they could match Watson’s performance with a tiny fraction of Watson’s hardware.

Unfortunately, Jeopardy didn’t open this as a general challenge to all, like the DARPA Grand Challenges, with a large prize to motivate creative entries. Jeopardy seems to have simply followed IBM’s lead – and perhaps even received promotional payments from IBM for doing so. (I can’t find a definitive statement either way.)

IBM is known, and rightly admired, for many things… but hardware thrift isn’t one of them. And the boost to IBM’s sales from this whole exercise wouldn’t be nearly as large if Watson were a single machine, able to be positioned on the podium next to its human challengers, barely larger than the monitor displaying Final Jeopardy answers. That wouldn’t move roomfuls of computers!

Nice job, Jeopardy and IBM, but next time: open it to stingier teams!

Written by gojomo

2011-02-16 at 20:16

17 Responses

  1. All very true, and I think that we can imagine that a hungrier, leaner team would have found some interesting techniques that may have eluded the IBM team for lack of searching. This being said, your initial Wikipedia domain-limited searches do rely upon Google… which has more than a few Watsons in all its data centers churning out the answers.

    While it is quite possible that you could devise a series of search protocols and heuristics that would give you the correct results on a smaller machine, it’s also a credit to Google that it can crawl through Wikipedia and find those pages from said keywords; still not a trivial problem at that stage.

    The Google Search Appliance is a rather watered-down version of Google’s algorithms, but I’m sure it would do a pretty good job; it can be as small as 1U, so this doesn’t really counter what you say. Just wanted to be a bit cheeky : )

    Samuel Jackson

    2011-02-16 at 22:41

    • When limiting Google to a single domain, the results are very similar to what you’d get if you indexed the same material with a local tool, such as the open-source SOLR search server. Google’s ranking is likely to be a bit better, leveraged with global info about inlinks, and their snippets (the excerpts optimized for your query shown along with the results) are always excellent. But SOLR is very good too. From experience, indexing takes a while, but the generated index is smaller than the source material, and searching is very quick – and if the index fits in RAM, even a single machine can answer queries in the blink of an eye. (That is, the kinds of test searches I used don’t require the equivalent of any Watsons, either for initial indexing or querying.)

      With a local tool, there are also advanced search techniques Google doesn’t offer: fuzzy document similarity; arbitrary boolean combinations; boosting particular terms as essential (such as those highlighted by ‘this’ as being central to the answer).

      It would surely take a lot of work and deep analysis to match Watson – the confidence intervals especially. But I think my examples show that a lot can be achieved with just Wikipedia and simple tools/heuristics. I’m pretty sure the answers to almost every Jeopardy question are stated, in one way or another, in a 30GB Wikipedia dump (which preprocessing in a manner conscious of typical Jeopardy patterns/topics would reduce to an even smaller size). So what’s occupying the other 14,970GB of Watson RAM?


      2011-02-17 at 03:10

  2. You should watch the PBS Nova documentary about the 4-years spent developing the software.

    This wasn’t some freakin’ 10-line perl script they kluged together.

    Tom Limoncelli

    2011-02-16 at 23:30

    • His point is that it could have been and would have achieved the same results. At the very least it would have been drastically more efficient in terms of computing and human resources.

      That really doesn’t speak well for those 4 years spent on development.


      2011-02-17 at 10:23

      • No it would NOT have achieved the same results. If you think it could, go ahead and write the perl script and try it out! A perl script linked to Google would have got maybe 10% of answers right, producing embarrassing garbage the rest of the time. Getting that % up to 90% is what takes the hardware. Differences in degree become differences in kind.

        Why did nature give us 100 billion neurons when mice have far fewer? After all, mice can do a lot of the things humans do. There’s no difference between the mouse and human brain except in scale. What a waste of resources for nature!


        2011-02-18 at 03:11

  3. To quote from your article:

    “The phrase ‘this protein’ strongly indicates the answer is a protein; Wikipedia has a ‘list of proteins’. Only one protein on that list also appears on each of the ‘hedgehog’ and ‘Spines (zoology)’ pages: KERATIN, the correct answer.”

    You have used human deductive reasoning to arrive at your conclusions here. If Watson is reasoning as you say, it is smarter than I thought!!!

    You have considered the Spines (zoology) page, why not the “Spines (anatomy)” or “Spines (book)”?

    But I accept some of what you say: Watson produces nonsensical answers which a human never would, indicating that it does not truly understand the human world.


    2011-02-17 at 08:13

  4. Don’t underestimate Watson. I have some experience developing AI systems, and the most effort and computational resources always go into squeezing out the last % of correctness/recognition rate.
    Sometimes, getting 90% right can be pretty easy. But each additional % can be exponentially more difficult.


    2011-02-17 at 08:19

  5. Nothing but a lame publicity stunt from an aging failing corporation whose motto “Think” is increasingly ironic. Watson has no semantic awareness, and just uses a flowchart to sift Google answers and created estimated probabilities based on cross-correlations. It didn’t even know the difference between the U.S. and Canada. What did IBM think, they would use the Jeoparetarded promo to sell the river-slurping dino to major corporations for medical diagnostics? Oops, I gave the diagnosis for a giraffe. It would be really funny if Google just duplicated its functionality and provided it free to all.

    TL Winslow

    2011-02-17 at 12:04

  6. And you must remember that all this time Watson was consuming 65 kW of power while competing against a human brain that works at 20W (peak throughput) and which can still multitask other important things.

    Let’s just stop for one second and marvel at the human brain and the 2 billion years of evolution which gave birth to it.

    AI guy

    2011-02-17 at 13:47

    • Agreed 100%. People are saying Watson is overkill but it is actually rather weak compared to the human brain.

      Memory
      Watson: 15 terabytes
      Brain: 1,000 terabytes

      Processing power
      Watson: 80 teraflops
      Brain: 100 teraflops

      So Watson is at a considerable disadvantage.


      2011-02-18 at 03:01

    • Hey AI guy,

      Do you have a reference for that 65 kW of power?



      2011-02-24 at 16:20

  7. […] Think is an online knowledge forum called IBM’s Watson “not so impressive” while Memesteading posed the question as to whether the computer was just over provisioned “big iron”. After all […]

  8. I agree. In your examples you got rid of all the functional words that give meaning to the language and still got the correct answers. For me the show was the equivalent of a student stealing the questions a day before the exam. I had the feeling during the show that Watson was cheating and not really “thinking” just by looking for key words in encyclopedic databases. Based on the first program only, you could easily see which kind of questions were easy to Watson and which not. So the second and third program questions could have been rewritten to give Watson more of a challenge and a fair play to the human contestants. And by the way, on betting for answers, Watson was a stingy and coward player.


    2011-02-18 at 02:28

  9. If solving this problem is as easy as you claim, why did Google publicly congratulate IBM on this achievement?

    Could it be the brainiacs at Google, who know something about searches, appreciate how difficult the Jeopardy! challenge was?

    I also suggest you read this:

    IBM checked out the possibility that this could’ve been done by a “basement hacker”. They found a “homebrew” approach fell well short of the goal.


    2011-02-18 at 15:33

    • I didn’t say it was ‘easy’; I suggested IBM wound up using far more hardware than necessary, probably for promotional/traditional purposes. I could see the development of the system as requiring lengthy, large-cluster analysis. But when boiled down to a trained knowledge-base sufficient to answer all Jeopardy questions, 15TB of RAM seems like about 14.5TB too much.

      Of course Google would congratulate IBM. I congratulate IBM, too. It’s an impressive achievement, both as technology and marketing! That doesn’t turn off my ability to analyze the scope of the problem, and the institutional incentives involved.

      Did the IBM team even have an incentive to optimize for hardware/cost? Or was showing off a room full of expensive computers one of the goals from the beginning?

      The excerpt from the Baker book at Gizmodo actually feeds my confidence. It reports that at the outset of the project in 2007, a single IBM neophyte (!) with only 1 month, 1 computer, and a measly 500 training clues (!) was “nearly matching” Piquant, IBM’s “state-of-the-art in Q-A technology”! Somehow Baker accepts the project leader’s spin that this was an “ideal” result for the project.


      How many man-years went into Piquant, and a lone fresh graduate matched it in the Jeopardy domain in a month? Rather than viewing this as a vindication of their plan to devote a giant team and hardware plant to the problem, why didn’t they let the “Basement Baseline” team have a couple more people, a handful more machines, and a few more months of time to work (if not 3.5 more years!), and the 200,000+ clue J!Archive of past Jeopardy rounds?

      Why didn’t Jeopardy and IBM let other teams compete for real money, like the DARPA Grand Challenge or Netflix Prize?

      The end result is impressive, but also heavily orchestrated to sell Big Iron, and unvalidated by actual resource-constrained, independent, external competitive evaluation.


      2011-02-18 at 17:27

  10. […] IBM Watson: Overprovisioned “Big Iron”? « Memesteading […]

  11. […] Because the gigabyte (-g) output rounds to the nearest gigabytes, unless you’re on Watson (IBM’s computer containing 15,000 gigabytes, or 15 terabytes, not of hard drive space, mind […]
