
The Smartest Machine on Earth Plays Jeopardy

I don't know if anyone out there besides me saw the NOVA TV show "Smartest Machine on Earth" about the IBM Research Watson computer. Watson is scheduled to play two human Jeopardy champions on TV on Monday-Wednesday (Feb 14-16) of next week. I thought the show was excellent.
Here's a link to the broadcast:

If you are interested in going deeper, the current issue of AI Magazine is devoted to Question Answering, and contains an article by the Watson researchers. After the IBM Deep Blue chess computer successfully challenged the reigning human chess champion in 1997, AI researchers at IBM turned to other “hard problems” in AI. I am not much of a chess player myself, but I enjoyed following the progress of man against machine at the time, and I expect to tune in to watch the new IBM software play Jeopardy next week.
I admit I enjoy the drama of these human vs. computer challenges. A computer that plays Jeopardy models the famous “Turing test” for artificial intelligence proposed by mathematician and computer pioneer Alan Turing. Today, the Turing test has been largely supplanted by John Searle’s Chinese room thought experiment, a challenge to the AI research agenda that is taken quite seriously. This, perhaps, explains why IBM is willing to spend millions of dollars on this Jeopardy effort.
Essentially, Searle’s philosophical argument is that humans have minds, while computer programs that perform automated reasoning based on encoded rules do not. Searle’s challenge encapsulates the gulf between syntax in language, which is indisputably governed by formal rules, and semantic knowledge, which may or may not be. The gulf between syntax and semantics is very wide indeed, but it is one that many AI researchers are actively engaged in trying to bridge. (Things like the Semantic Web come to mind.)
Of course, I also found the show relevant in the context of my current blog topic, where I have been discussing rule-based “expert systems” approaches to computer performance analysis. As I have written earlier, I am not a huge fan of the approach, but I do acknowledge some of its benefits, particularly in filtering very large sets of performance-oriented data, like the ones associated with huge server farms. My assessment of the value of the rule-based, automated reasoning approach does appear to square with current academic thinking in the AI world. Engineering-oriented approaches dominate much of the current research in AI. The emphasis of the machine learning approach, for example, is on the underlying performance of the system, not the extent to which the cognitive capabilities of humans are modeled or imitated.
The NOVA show on Watson featured several AI luminaries from the academic world. Doug Lenat, a prominent AI researcher formerly at Stanford who is still pursuing the rule-based approach, was on camera. Lenat’s current focus is a reasoning engine in which millions of “common sense” rules are represented in CycL, a language he developed that is derived from the predicate calculus. On the NOVA program, Lenat said that the Cyc knowledge base currently consists of more than 6 million assertions in CycL.
A sample CycL assertion looks like this:
      (#$implies
         (#$isa ?A #$Animal)
         (#$thereExists ?M
            (#$and (#$mother ?A ?M)
                   (#$isa ?M #$FemaleAnimal))))

CycL is certainly interesting as an example of a Knowledge Representation (KR) language. The problem is that, by nature (pun intended), biological categories are messy. If you think about it, the assertion in the example should probably say something about the mother object being the same species as its offspring. This is both an important biological and logical constraint. The assertion I learned in biology is:
Animal a => HasA femaleParent m,
where m IsA Animal and a.species == m.species
Which, if you think about it, also implies that a new species coming into existence is a (bio)logical contradiction. I don’t know why creationists don’t argue this; the logical inconsistency seems pretty explicit to me, but perhaps their positions aren’t grounded in logic to begin with.
The CycL rule doesn’t even mention animals like snails that are hermaphrodites and can self-fertilize their own eggs, a pretty neat trick, but not entirely unknown in the Animal Kingdom. It turns out that there is more to heaven and earth than is dreamt of in this set of categorical rules that evaluate as either true or false using an automated reasoning program. Whether individual specimens belong to the same or different species is often in dispute. I remember learning in science class that there were nine planets in our solar system; now astronomers aren’t so sure. Poor Pluto. It has been demoted. There are some people who are devastated by the demotion. Poor Pluto and its acolytes.
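To make the brittleness concrete, here is a minimal Python sketch of the "every animal has a female mother of the same species" rule. It is not how Cyc evaluates CycL; the `Animal` class, the `satisfies_mother_rule` function, and the sample animals are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Animal:
    """Hypothetical record type; the field names are assumptions for this sketch."""
    name: str
    species: str
    sex: str  # "female", "male", or "hermaphrodite"
    mother: Optional["Animal"] = None


def satisfies_mother_rule(a: Animal) -> bool:
    """True if `a` has a recorded mother that is female and of the same species."""
    m = a.mother
    return m is not None and m.sex == "female" and m.species == a.species


doe = Animal("doe", "white-tailed deer", "female")
fawn = Animal("fawn", "white-tailed deer", "male", mother=doe)

# A self-fertilizing snail: its only parent is a hermaphrodite, not a female.
snail_parent = Animal("snail parent", "land snail", "hermaphrodite")
snail = Animal("snail", "land snail", "hermaphrodite", mother=snail_parent)

print(satisfies_mother_rule(fawn))   # True
print(satisfies_mother_rule(snail))  # False: the rule has no slot for this case
print(satisfies_mother_rule(doe))    # False: no mother was ever recorded
```

The rule can only answer true or false. The snail fails because the category "female" does not fit, and the doe fails merely because the data is incomplete, yet the rule cannot tell those two situations apart.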
In KR, this is known as the problem of ontologies. The problem is that the differences between a planet, an asteroid, and a comet are not always clear cut. Worse, we are blind to our own tacit assumptions. A central thesis of cultural anthropology is that reality is, to a large extent, culturally determined. Lévi-Strauss, in “La Pensée sauvage,” argues that plant and animal classification schemes used by so-called “primitive” societies are no less rigorous than the one we use that originated with Linnaeus. The American linguist (and darling of the Left) George Lakoff also writes about the socially-constructed, culturally-determined “cognitive models” that shape our thinking in “Women, Fire, and Dangerous Things.” We see the world “through a glass, darkly.” We are like the prisoners in Plato’s cave who mistake the shadows on the wall for reality.
Less philosophically, there are mathematical-logical objections to the automated reasoning approach. The fact that first-order logic is undecidable (Church and Turing, building on Gödel), or that computer programs of arbitrary complexity are subject to the Halting problem (Turing, again), ought to give proponents of the rule-based approach in AI pause, but it doesn’t seem to. They have faith in mathematical modes of reasoning that I guess I must lack.
Given some of these inherent limitations, however, the trend in AI research today is away from Lenat’s rule-based reasoning approach. For instance, Terry Winograd also appeared in the NOVA show. When he was a graduate student at MIT, Winograd conducted ground-breaking research in AI, building a program called SHRDLU that could carry out simple tasks in a small domain of physical objects (called the Blocks world) using a natural language interface. (The origin of the name SHRDLU makes for a very amusing story.) Winograd’s doctoral dissertation was later published as a book, “Understanding Natural Language” (currently out of print).
Back when I was in graduate school, Winograd’s SHRDLU program was considered one of the great success stories in “strong AI.” But Winograd, one of the rising stars in AI, subsequently became disenchanted with the mechanistic reasoning approach he used in building SHRDLU, essentially a parser for a context-free grammar with back-tracking, which is a very rigid and limited approximation of natural language understanding. Winograd famously repudiated the rule-based reasoning approach to AI in a second book, co-written with Fernando Flores, “Understanding Computers and Cognition: A New Foundation for Design.” His critique, coming from someone deep within the orthodoxy, was notorious. But, in fact, if you look at the way computer technology is used in speech recognition today, it is very far removed from the approach Winograd used back in the day. (I am thinking of the statistical approach described in Jelinek, “Statistical Methods for Speech Recognition,” which relies on Hidden Markov Models.) These statistical techniques are quite effective in recognizing human speech, but I doubt anyone would mistake them for simulating or imitating what it is we humans do when we converse with each other.
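For readers who have not met a Hidden Markov Model, here is a toy Python sketch of Viterbi decoding, the standard way to recover the most likely hidden sequence from noisy observations. The two hidden "phoneme" states, the two acoustic labels, and every probability below are made up for illustration. Real recognizers work with far larger models, and this is not taken from Jelinek's book.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden-state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that makes arriving in s most probable.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: hidden phonemes "S" and "F", observed sounds "hiss" and "puff".
states = ("S", "F")
start_p = {"S": 0.5, "F": 0.5}
trans_p = {"S": {"S": 0.7, "F": 0.3}, "F": {"S": 0.4, "F": 0.6}}
emit_p = {"S": {"hiss": 0.9, "puff": 0.1}, "F": {"hiss": 0.2, "puff": 0.8}}

prob, path = viterbi(["hiss", "hiss", "puff"], states, start_p, trans_p, emit_p)
print(path)  # ['S', 'S', 'F']
```

Nothing in the decoder knows what a phoneme is. It multiplies probabilities and keeps the best-scoring path, which is the sense in which the approach is statistical rather than an imitation of human understanding.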
On the NOVA episode, Winograd demoed a version of Eliza, another celebrated AI program from the sixties that “simulated” conversing with a sympathetic therapist. The syntactically-oriented approach used in Eliza is easy to defeat, as Winograd demonstrated to some comic effect. Unlike Watson, the program could never hope to pass Searle’s Chinese Room test, although maybe today’s computers, several orders of magnitude more powerful, can.
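A minimal Python sketch shows how little machinery a purely syntactic responder of this kind needs. The patterns and canned replies below are invented for this sketch. They are not Weizenbaum's original Eliza script, and not what was demoed on the show.

```python
import re

# Hypothetical rules, tried in order: (pattern, reply template).
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def reply(utterance: str) -> str:
    """Return the reply for the first matching pattern, else a stock phrase."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured words back; no meaning is extracted.
            return template.format(match.group(1))
    return DEFAULT_REPLY


print(reply("I am worried about my job."))
# Why do you say you are worried about my job?
print(reply("I am a duck-billed platypus, quack quack"))
# Why do you say you are a duck-billed platypus, quack quack?
```

The first reply already gives the game away: the program echoes "my job" instead of "your job" because it shuffles strings without knowing who is speaking. The second shows that nonsense is handled as earnestly as anything else.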
Despite Eliza’s simple-minded capabilities, many human subjects who interacted with Eliza were comfortable having extended “conversations” with the computer, which surprised its author, given how limited a range of human interaction the program imitated. What seems to happen with Eliza is that human subjects project human attributes onto something that exhibits or mimics recognizably human behavior. Cognitive scientists claim we develop in early childhood a “Theory of Mind” that aids us in social interaction with other humans, something clinical researchers noted was absent in autistic children. When we encounter a computer that walks like a duck and quacks like a duck, it is normal for us to assume it is a duck. Similarly, participants in the Eliza sessions naively assumed that the replies Eliza generated reflected empathy from a recognizably human Mind.
Searle’s Chinese room challenge turns Eliza on its head. It begins with a skeptic’s perspective: can the computer program present a thoroughly convincing simulation of human interaction? Can it tell a joke, can it be ironic, or coin a metaphor? Can it be intuitive? Can it truly exhibit sympathy? These are evolved human qualities and capabilities that may require elements that are not wholly logical.
Finally, Tom Mitchell, one of the prominent researchers in the machine learning school, was featured on the NOVA show. Mitchell wrote the first textbook on the subject in 1997. Several of the recently minted PhDs at Microsoft Research I worked with on computer performance issues were trained in Mitchell’s “machine learning” approach. It is a broad term, encompassing a variety of (mainly) statistical techniques for improving the performance of task-oriented computer programs through iteration and feedback (or “training”). The Watson Jeopardy-playing computer is programmed using machine learning techniques.
The iteration and feedback aspects of the machine learning approach are really trial and error, or more succinctly, error-correcting, procedures. Not only can they be quite effective, they also seem to model the incremental and adaptive procedures that biological agents (like Homo sapiens) use to learn a new skill or hone an existing one. The Watson computer trains on Jeopardy questions, and its learning algorithms are modified and adjusted to improve the probability the program will choose the correct answers. Similarly, if you are human and you want to get better at answering questions on the SAT exam, you take an SAT prep course where you practice answering a whole lot of questions from previous exams. Some of what you might learn in the class helps you with the content of the test (like vocabulary and rules of English grammar). But learning about the kinds of questions and the manner in which they are asked, on an exam where questions are often deliberately designed to trick you or confuse you, can also be extremely helpful. Having Watson train on a dataset of existing Jeopardy questions is essentially the same, proven strategy.
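The simplest error-correcting learner I know of is the perceptron, and a short Python sketch of it shows what "iteration and feedback" amounts to in code. To be clear, this is a textbook illustration and not a description of Watson's algorithms, which the show did not detail. The function names and the tiny training set (the logical AND of two inputs) are assumptions made for this sketch.

```python
def predict(weights, bias, features):
    """Guess 1 if the weighted sum of the features is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=20, lr=1.0):
    """Error-correcting training: the weights change only after a wrong guess."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:  # label is 0 or 1
            error = label - predict(weights, bias, features)  # 0, +1, or -1
            if error:
                mistakes += 1
                # Nudge the weights toward the answer that would have been right.
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        if mistakes == 0:  # a full pass with no errors: stop practicing
            break
    return weights, bias


# Practice questions with known answers: the logical AND of two inputs.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(examples)
print([predict(weights, bias, x) for x, _ in examples])  # [0, 0, 0, 1]
```

Like the SAT student, the program never gets told a rule for AND. It just keeps answering practice questions and adjusting after each wrong answer until a full pass comes back clean.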
In the upcoming televised contest, Watson is competing against two reigning Jeopardy champions, the most skilled human contestants alive. I don’t know whether Watson vs. the human Jeopardy champions is going to be David vs. Goliath or Achilles vs. Hector, but I expect it will be a very intriguing human drama.


  1. There was an interesting twist in Watson's Final Jeopardy Question (answer) last night.

    A 'Wired' article sheds some light on Watson's 'quirks':

