It is said that to explain is to explain away. This maxim is nowhere so well fulfilled as in the area of computer programming, especially in what is called heuristic programming and artificial intelligence. For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself “I could have written that”. With that thought he moves the program in question from the shelf marked “intelligent”, to that reserved for curios, fit to be discussed only with people less enlightened than he.
These words were written in 1966 by Joseph Weizenbaum, in a paper that explained—and explained away—his famous program ELIZA, the mock psychotherapist. ELIZA’s conversations could occasionally suggest a glimmer of understanding:
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP?
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY.
But Weizenbaum revealed that ELIZA was merely picking out a few keywords from the text and applying simple syntactic transformations, along with a dose of randomness. The program was manipulating symbols, but the symbols had no meaning attached to them.
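To make the trick concrete, here is a minimal sketch of the keyword-and-transformation idea in Python. It is not Weizenbaum's script: the rules, the pronoun table, and the names `RULES`, `SWAPS`, `reflect` and `respond` are all invented for illustration, and the real ELIZA ranked its keywords and had a far larger repertoire.

```python
import random
import re

# Hypothetical rules for illustration only: (keyword pattern, reply templates).
RULES = [
    (re.compile(r"\bI need ([^,.!?]+)", re.IGNORECASE),
     ["WHAT WOULD IT MEAN TO YOU IF YOU GOT {0}?", "WHY DO YOU NEED {0}?"]),
    (re.compile(r"\b(mother|father|sister|brother)\b", re.IGNORECASE),
     ["TELL ME MORE ABOUT YOUR FAMILY."]),
]
DEFAULT_REPLY = "PLEASE GO ON."

# First-person words are swapped for second-person ones before echoing.
SWAPS = {"i": "you", "me": "you", "my": "your", "am": "are"}


def reflect(fragment):
    """Swap pronouns word by word; no grammar is consulted."""
    return " ".join(SWAPS.get(word.lower(), word) for word in fragment.split())


def respond(text, rng=random):
    """Return a canned reply for the first rule whose keyword appears."""
    for pattern, templates in RULES:
        match = pattern.search(text)
        if match:
            template = rng.choice(templates)  # the "dose of randomness"
            # Templates without a {0} slot simply ignore the argument.
            return template.format(reflect(match.group(1)).upper())
    return DEFAULT_REPLY


print(respond("I need some help, that much seems certain."))
print(respond("Perhaps I could learn to get along with my mother."))
```

Nothing in this sketch knows what help or a mother is. The reply is assembled from the user's own words and a template, which is the sense in which the symbols carry no meaning.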
What about Watson, the new Jeopardy champion? Watson gave a dazzling performance this past week, decisively winning a two-game match against the best human players in the history of the quiz show. But will the magic crumble if we look closely at how it works? Does the program really understand those quirky Jeopardy clues, or is it just pushing symbols around, in the manner of ELIZA?
The most detailed account of Watson’s innards that I’ve been able to find is an article published in AI Magazine last fall by David Ferrucci of IBM, the project’s lead engineer, and a dozen colleagues from IBM and Carnegie Mellon. (“Building Watson: An overview of the DeepQA project,” AI Magazine 31(3):59–79. The article is behind a paywall at the AI Magazine web site, but resourceful internauts may find it elsewhere.)
Here’s the overview:
The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses.
Thus we learn that behind Watson’s calm, metallic voice is a clamor of 100+ agents doing their massively parallel probabilistic evidence-based thing. This is not one big brain but a society of mind. (By the way, I think “probabilistic” means simply that potential answers are scored by assigning them probabilities; as far as I can tell there is no randomness or indeterminacy in the algorithms, but I could be wrong about that.)
The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system.
This description implies that the system is indeed making an effort to dig down into the semantics of natural language. But how does it attempt to understand the clue? The rest of the paragraph is more of a shopping list than an explanation:
The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering.
The reference to (McCord 1990) is perhaps the most illuminating item in this list. The author is Michael C. McCord, who is at IBM’s Yorktown Heights lab where Watson was built. The phrase “deep parses” apparently refers to McCord’s idea of a slot grammar, which provides a single framework for combining the syntactic analysis of sentences (subject, predicate, object, etc.) with semantic features (word senses, logical relations, predicates). Unfortunately, the AI Magazine article gives no further hints about how slot grammars are used in the analysis of Jeopardy clues. (For some useful recent accounts of slot grammars, see the links in McCord’s publications list.)
The part of the question-analysis phase that Ferrucci et al. discuss at greatest length is a process they call “LAT detection.” LAT is “lexical answer type”: a word or phrase in the clue that specifies what kind of response is wanted—a person, a city, a book, a substance, and so on. Consider this clue, in a category titled “Oooh…. chess”:
Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
The LAT in the clue is “maneuver”: Whatever the answer is, it must be something that can plausibly be described as a maneuver. If you were to fixate on the wrong LAT—say “the game” or “two pieces”—you’d have no hope of coming up with the correct answer. Naming the two pieces “king and rook” would not score any points, even though that particular choice of pieces suggests you have the right idea in mind; to get credit for the answer, you need to give the name of the maneuver: “castling.”
Identifying the correct LAT is clearly important. It’s also clearly difficult. What’s not so clear is how Watson does it. In the chess example, does “maneuver” stand out from the rest of the words in the clue for grammatical reasons (it’s the subject of the main clause), or because it’s pointed to by the demonstrative adjective “this,” or for some other reason? How would you write a program to identify the LAT of an arbitrary Jeopardy clue?
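As one naive answer to that last question, here is a sketch in Python that bets entirely on the demonstrative: it takes the word following "this" or "these" and falls back on a bare pronoun. I have no evidence that Watson works this way; the function name `guess_lat` and both patterns are my own assumptions, meant only to show how far a surface heuristic gets.

```python
import re

# Assumption: the LAT is often the word right after "this" or "these".
DEMONSTRATIVE = re.compile(r"\b(?:this|these)\s+([a-z]+)", re.IGNORECASE)
# Fallback assumption: a bare pronoun stands in for the answer.
PRONOUN = re.compile(r"\b(he|she|it|they)\b", re.IGNORECASE)


def guess_lat(clue):
    """Return a guessed lexical answer type, or None if no pattern fires."""
    match = DEMONSTRATIVE.search(clue) or PRONOUN.search(clue)
    return match.group(1).lower() if match else None


chess_clue = ("Invented in the 1500s to speed up the game, "
              "this maneuver involves two pieces of the same color.")
print(guess_lat(chess_clue))  # maneuver
```

The heuristic gets the chess clue right, but it is easy to break. A clue that begins "This famous author" would yield "famous", and a clue with no demonstrative or pronoun at all yields nothing. Telling a noun from an adjective already calls for the kind of parsing the shopping list mentions.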
As we move on from the analysis of questions to the finding of answers, the algorithmic details remain a little fuzzy.
Watson has access to various sources of “structured” knowledge: relational databases, taxonomies, ontologies. With such resources, retrieval is straightforward. Yet it turns out that few clues can be reformulated as database queries. Ferrucci writes: “Watson’s current ability to effectively use curated databases to simply ‘look up’ the answers is limited to fewer than 2 percent of the clues.” I suppose this is not really surprising. If the game could be reduced to database lookup, it wouldn’t be much fun.
For the other 98 percent of the clues, I gather that the retrieval process is more like Googling for the answer. The machine has no live internet connection during the Jeopardy contest, so it can’t actually search the web. But lots of free-form textual data was loaded into the Watson servers ahead of time, including all of Wikipedia and many other reference works. Using these documents as seeds, the system then trawled the web for other sources that might be useful, and cached copies of them for use offline. About four terabytes of material was available for query answering.
As for the search methods applied to this archive, the article by Ferrucci et al. offers another shopping list:
A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.
In a long series of training runs the system was tuned to balance the competing demands of coverage, accuracy and speed.
The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.
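The "binary recall" figure is simple to compute once you have ranked candidate lists and an answer key. Here is a sketch in Python, with toy data and a function name (`binary_recall_at_k`) of my own choosing; the article does not show how IBM's evaluation harness was written.

```python
def binary_recall_at_k(ranked_candidates, correct_answers, k=250):
    """Fraction of questions whose correct answer is among the top k candidates.

    ranked_candidates: one list per question, best candidate first.
    correct_answers:   the known answer for each question, in the same order.
    """
    if len(ranked_candidates) != len(correct_answers) or not correct_answers:
        raise ValueError("need one non-empty candidate list per answer")
    hits = sum(
        1
        for candidates, answer in zip(ranked_candidates, correct_answers)
        if answer in candidates[:k]  # "binary": a hit or a miss, rank ignored
    )
    return hits / len(correct_answers)


# Toy data: four questions, with the cutoff shrunk to k=2.
ranked = [["a", "b", "c"], ["x", "y"], ["p", "q"], ["m", "n"]]
answers = ["b", "z", "p", "n"]
print(binary_recall_at_k(ranked, answers, k=2))  # 0.75
```

The measure says nothing about where in the list the right answer sits. A candidate ranked 250th counts as much as one ranked first, which is why the later scoring and ranking stages matter so much.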
The trouble with free-form textual search is that you may very well identify relevant snippets of text but still have a hard time extracting the correct answer. Indeed, the same kind of analysis that goes into figuring out the question also has to be applied to candidate answers. For example, Ferrucci et al. discuss this clue: “He was presidentially pardoned on September 8, 1974.” Among the materials retrieved by the search algorithm was the text fragment: “Ford pardoned Nixon on Sept. 8, 1974.” For a human player with a little knowledge of U.S. history, this result would be more than enough to settle the matter, but a computer program still has some work to do. Suppose the program has correctly identified the LAT of the clue as “He,” and suppose further that it knows that both “Ford” and “Nixon” refer to male persons, perhaps even that they were presidents. Which of the two names is the right choice? Several of the tests that Watson applies are essentially string-matching algorithms, similar to those that search DNA sequences for genetic patterns. Those algorithms might count how often each name occurs in association with the given date, but that result will not resolve the ambiguity in this case. The correct answer comes from a program module that undertakes a deeper logical analysis and recognizes the difference between subject and object in the two statements.
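To see why subject and object matter here, consider a toy version of that analysis in Python. The two regular expressions below stand in for a real parser and handle only sentences shaped like these examples; the names `ACTIVE`, `PASSIVE`, `extract_roles` and `resolve` are hypothetical, and nothing here is taken from Watson's code.

```python
import re

# Toy patterns, assumed for illustration: "X pardoned Y" and "Y was pardoned [by X]".
PASSIVE = re.compile(
    r"(?P<patient>[A-Z]\w+) was (?:\w+ly )?(?P<verb>\w+ed)(?: by (?P<agent>[A-Z]\w+))?"
)
ACTIVE = re.compile(r"(?P<agent>[A-Z]\w+) (?P<verb>\w+ed) (?P<patient>[A-Z]\w+)")

PLACEHOLDERS = {"he", "she", "it", "they"}  # words standing in for the answer


def extract_roles(sentence):
    """Return a dict with 'agent', 'verb' and 'patient', or None if unparsed."""
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.search(sentence)
        if match:
            return match.groupdict()
    return None


def resolve(clue, passage):
    """Fill the clue's placeholder with whoever plays the same role in the passage."""
    clue_roles, passage_roles = extract_roles(clue), extract_roles(passage)
    if not clue_roles or not passage_roles:
        return None
    if clue_roles["verb"].lower() != passage_roles["verb"].lower():
        return None
    for role in ("agent", "patient"):
        filler = clue_roles.get(role)
        if filler and filler.lower() in PLACEHOLDERS:
            return passage_roles.get(role)
    return None


clue = "He was presidentially pardoned on September 8, 1974."
passage = "Ford pardoned Nixon on Sept. 8, 1974."
print(resolve(clue, passage))  # Nixon
```

In the clue, "He" is the one receiving the pardon, so the answer must be whoever receives it in the passage. Flip the clue to "He pardoned Nixon in 1974." and the same function returns "Ford". Counting how often each name appears near the date cannot make that distinction.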
• • •
Given this glimpse into how Watson works, do we deem its intelligence to be explained, or explained away? Personally, I have mixed feelings.
I admit to a sentimental fondness for what John Haugeland called Good Old-Fashioned AI, or GOFAI—the ambitious kind of artificial intelligence that aspired to build a true thinking machine, a system with some deep internal representation (a mental model) of the world in which it functions. The outstanding example of this style is Terry Winograd’s SHRDLU program, written in 1970, which conversed about objects in a world of toy blocks on a tabletop. At the time, Winograd firmly asserted that the program was able to “understand discourse,” and he meant by this that the program understood not only the words but also the objects and relations the words referred to.
The promise of SHRDLU was that we could extend the same methods to broader domains of discourse, steadily building toward a general-purpose, human-like intelligence, with the same kind of carefully planned knowledge representation. But that never happened. Later in the 1970s, AI entered a time of troubles. When it came back in the 1980s, the emphasis had shifted, and the technology had diversified. The new AI focused on expert systems, on data mining, on statistical rather than deductive methods; another branch of AI turned away from the human cerebral cortex in favor of the motor neurons of the cockroach. Overall, the field took a more pragmatic turn, with less concern for understanding the ultimate nature of intelligence and more energy invested in getting useful results, whatever the methodology.
Watson is in this latter-day pragmatic tradition, with its 100+ agents and its massively parallel probabilistic evidence-based architecture. Compared with SHRDLU, it’s all so messy, so ad hoc, so opaque. But it works, doesn’t it.
And I suppose my own mind is not quite as tidy as I would like to believe.
• • •
Although Watson won its Jeopardy match by a wide margin and made very few mistakes along the way, the moment everyone will remember is the program’s spectacular flub of a Final Jeopardy question on the second night. The category was “U.S. Cities,” and the clue was:
Its largest airport is named for a WWII hero, its second largest for a WWII battle.
Watson replied “Toronto.” As it happens, I got that question right; just seconds after the clue was revealed, I called out “Chicago.” Later, though, when I thought about the mental process that led to my answer, I realized that this was not at all a product of well-focused deductive reasoning. I was doing the same kind of scattershot, parallel, probabilistic groping in the dark that I frown on in a machine.
My “reasoning” went something like this: If it has two airports, it must be a pretty big city…. New York has three airports…. There’s Dallas, with DFW and Love—but no heroes or battles there. Chicago has two. Oh! Midway—that must be the battle of Midway.
That’s when I pressed the buzzer.
Note how sketchy my thinking was. I had no idea O’Hare was named for a war hero. As a matter of fact, I had no idea that Midway was named for the naval battle. If I had been asked in a more straightforward way, “Why is Chicago’s second airport named ‘Midway’?”, I would have guessed that it lies halfway between Point A and Point B. The Pacific island would not have entered my consciousness. And I never bothered to dig any deeper into the catalogue of multi-airport cities—Washington, San Francisco, Houston (isn’t G. H. W. Bush a WWII hero?).
So messy and ad hoc.