April 25, 2024

Chess, the drosophila of AI: The tests of artificial intelligence

Games, and chess in particular, have long served as benchmarks for artificial intelligence. How to test for intelligence has preoccupied the field of AI since the Turing Test, and yet true intelligence remains elusive.

If, like me, you work in technology policy and spend your working hours trying to figure out how machine learning systems should be held accountable, then in early 2023 your world would have been engulfed by the distraction of “the letter” and debates about long-termism, sentient AI and embargoes. [1] An open letter calling for a pause on the development of AI systems more powerful than GPT-4 was published by the Future of Life Institute, a non-profit research organization that receives funding from several powerful actors, including the Musk Foundation. The signatories included some of the biggest names in and around the tech industry: Turing Award winner Yoshua Bengio; Elon Musk, CEO of SpaceX, Tesla and Twitter; Apple co-founder Steve Wozniak; 2020 US presidential candidate Andrew Yang; Jaan Tallinn, co-founder of Skype, the Centre for the Study of Existential Risk and the Future of Life Institute; Pinterest co-founder Evan Sharp; and Marc Rotenberg, president of the Center for AI and Digital Policy, among others. The fear-mongering about AI, most vividly on display at the UK’s AI Safety Summit last year, is perhaps matched only by the hype, and both ascribe sentient, superhuman abilities to systems that are essentially statistical models running on very large quantities of data. In this essay, I will look at how we have measured the “intelligence” of machines and what that tells us about the current moment of AI hype and fear.

The Triumph of Deep Blue

In 1997, when chess grandmaster Garry Kasparov lost to IBM’s Deep Blue supercomputer, it was a validation of something chess players had long predicted. [2] Games, and chess in particular, have remained central to demonstrations of artificial intelligence. The assumption is that if a machine can beat humans at a game that requires intelligence, it must be intelligent itself. In the last essay, we spoke about Hebb’s postulate, ‘Neurons that fire together, wire together’: when the axon of one neuron is near enough to another and repeatedly takes part in firing it, some growth process or metabolic change takes place that strengthens the link between them. Thus, chess players study the best opening gambits and endgames, gradually building a mental library of moves and strategies. These patterns, or cell assemblies, solidify over time in the human brain and become second nature, allowing the chess player to arrive almost intuitively at the best possible move while playing.

An algorithm, on the other hand, can use brute force: analysing all possible moves simultaneously, choosing what it identifies as the most optimal option, and repeating the same process after each move. A human brain cannot process that much information; the game between Kasparov and Deep Blue was akin to a chess game between an individual and a committee, with different parts of the committee simultaneously analysing different moves, except that, unlike an actual committee, there was perfect communication in real time. When these algorithms become learners, the complexity increases manifold and it becomes impossible for the human brain even to conceive conceptual models of how they work.

Chess, the drosophila of AI

In 1965, the Russian scientist Alexander Kronrod called chess the “drosophila of the field of Artificial Intelligence.” Kronrod was rationalising the immensely expensive computing hours he was devoting to chess, but the game has indeed hung around the field of AI in one way or another since its inception. Programmers have been trying to teach computers to play chess since the very beginning of computer science. Hector Levesque goes one step further and claims that “as soon as a machine is invented, the next step is to turn it into a chess player.” The fundamental way in which machines approach chess has not changed all that much since 1949, when Claude Shannon set out how a computer could be programmed to play the game. In his paper “Programming a Computer for Playing Chess,” he considered an evaluation system that a program could use to score the possible variations, assigning a value to each position and then choosing the move with the highest value. This would require the program to evaluate all possible moves of both players for as many moves ahead as playing time permits. He highlighted two possible approaches. The first, Type A, was a brute-force search, scrutinizing every conceivable move and variation and delving deeper with each iteration. The second, Type B, would be a relatively streamlined algorithm that emulates human thought processes, concentrating on a select few promising moves and exploring them deeply rather than evaluating every option.
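To make Shannon’s idea concrete, here is a minimal sketch of an evaluation function of the kind he describes, written in Python. The board representation and the piece values are my own illustrative assumptions rather than Shannon’s exact formula, which also weighed factors such as mobility and pawn structure.

```python
# A minimal sketch of a Shannon-style evaluation function: score a position
# by summing material, positive for White and negative for Black.
# The board representation and piece values here are illustrative assumptions.

PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def evaluate(position):
    """position: iterable of (piece, colour) tuples, e.g. ("Q", "white")."""
    score = 0
    for piece, colour in position:
        value = PIECE_VALUES[piece]
        score += value if colour == "white" else -value
    return score

# Example: White has queen and pawn, Black has rook and bishop -> 10 - 8 = +2
print(evaluate([("Q", "white"), ("P", "white"), ("R", "black"), ("B", "black")]))
```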

Long before the computer, an automaton chess player was created in 1769 by the Hungarian engineer Wolfgang von Kempelen as an amusement for the Hungarian queen. This automaton was called the Mechanical Turk, a name replete with racist overtones, as the humanoid mechanical figure was dressed in a turban and robes. The Mechanical Turk was a scam, in that it concealed a hidden human player. For over eight decades it toured Europe and later America, amusing crowds and often playing notable people, including Napoleon Bonaparte and Benjamin Franklin. There were many attempts to expose the Mechanical Turk, including a much-celebrated essay by Edgar Allan Poe called ‘Maelzel’s Chess Player’. Another notable person who played the Mechanical Turk was a young Charles Babbage, who, after playing it, worked on his plans for a chess-playing machine. In true chess-computing tradition, Babbage considered chess one of the most interesting applications of his design for the Analytical Engine. This short aside demonstrates how the ability to play games of strategy, and chess in particular, has served as a measure of machine intelligence since before the invention of the modern computer.

In parallel with Claude Shannon, Alan Turing was also experimenting with the idea of a chess-playing machine, and in 1953 he described what is often considered the first chess-playing program. Both Shannon and Turing believed that the ability to play chess competently was an indicator of general intelligence, and it thus followed that a computer which could play chess was intelligent, or at least, in Turing’s view, capable of imitating intelligence. Recall the two approaches detailed in Shannon’s paper: Type A, characterised by brute force that exhaustively considers and eliminates all possible options, and Type B, characterised by sophisticated heuristics that trim the decision tree. These two approaches came to represent a central debate in the field of AI about the nature of machine intelligence. Essentially, did it matter whether intelligent machines mirrored human thought processes, or was it enough if their actions seemed intelligent? To simplify further, does a chess-playing computer, regardless of its skill, genuinely comprehend the game of chess, and does this distinction hold any significance at all?

The Type A approach theoretically promises a perfect game of chess. As Nathan Ensmenger points out, because chess is a game with a finite set of positions, each position permits only a finite number of moves, and the rules dictate that every game concludes with a win, draw, or loss, it is possible to lay out all potential combinations of moves and counter-moves as a branching tree of decision points. However, this theoretical possibility of a perfect game is effectively impossible to turn into practical reality. With an average of 42 moves per game (equivalent to 84 half-moves) and around 38 legal moves to evaluate at each of them, a standard chess match would demand the assessment of approximately 38⁸⁴ positions. To illustrate the magnitude of this figure, Ensmenger points out that if every atom in the universe (10⁷⁵) were functioning as a chess computer at the speed of Deep Blue (10⁶ moves per second), the time elapsed since the Big Bang (10¹⁸ seconds) would still not suffice to evaluate all of these combinations.
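A quick back-of-the-envelope computation, using only the figures quoted above, shows just how lopsided the comparison is.

```python
# A quick check of the magnitudes quoted above, using the essay's own figures.
from math import log10

positions = 38 ** 84                          # positions in a typical game tree
atoms, speed, seconds = 1e75, 1e6, 1e18       # atoms in universe, moves/sec, time since Big Bang

print(round(log10(positions)))                # ~133: the tree holds roughly 10^133 positions
print(round(log10(atoms * speed * seconds)))  # ~99: what all those "atom computers" could search
```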

Despite the practical difficulties of the Type A approach, it came to dominate the field. Type A programs are examples of the minimax algorithm, a recursive method for selecting the next move in an n-player game, typically a two-player one. Each position or state of the game is assigned a value, determined by a position-evaluation function, indicating how favourable it would be for a player to reach that position. The player then selects the move that maximizes the minimum value of the position resulting from the opponent’s possible replies: if it is player A’s turn to move, A assigns to each legal move the value of the worst position the opponent can force in response, and plays the move for which that worst case is best.
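In code, the minimax recursion is remarkably compact. The sketch below is a generic illustration in Python; the game object and its methods (legal_moves, apply, is_terminal, evaluate) are hypothetical placeholders for whatever position representation a real chess program would use.

```python
# A minimal minimax sketch for a two-player, zero-sum game. The `game` object
# and its methods are hypothetical placeholders, not a real chess engine API.

def minimax(game, depth, maximizing):
    # At the search horizon, or when the game is over, fall back on the
    # position-evaluation function described above.
    if depth == 0 or game.is_terminal():
        return game.evaluate()
    if maximizing:
        # Assume the opponent answers each move as well as possible,
        # and keep the move whose worst-case outcome is best.
        return max(minimax(game.apply(m), depth - 1, False) for m in game.legal_moves())
    else:
        return min(minimax(game.apply(m), depth - 1, True) for m in game.legal_moves())

def best_move(game, depth):
    return max(game.legal_moves(),
               key=lambda m: minimax(game.apply(m), depth - 1, False))
```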

In 1958, Allen Newell, Herbert Simon, and Clifford Shaw improved on the minimax algorithm by incorporating a technique called alpha-beta pruning, which allows entire branches to be eliminated from the search without being examined. Combining minimax with alpha-beta pruning drastically reduced the number of branches of the decision tree that had to be explored. This breakthrough made it possible to play chess on almost any computer, including the earliest microcomputers, and it made the minimax approach faster, more robust, and more effective than its competitors. Once the Chess 4.0 program (which used minimax with alpha-beta pruning) triumphed at the Association for Computing Machinery (ACM) computer chess championships in the mid-1970s, Type B approaches were largely forsaken.
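Here is a hedged sketch of how alpha-beta pruning modifies the same search, using the hypothetical game interface from the previous sketch: the moment a branch can no longer influence the final choice, its remaining siblings are skipped.

```python
# The same search with alpha-beta pruning: alpha is the best value the maximizing
# player can already guarantee, beta the best the minimizing player can guarantee.
# As soon as a branch can no longer affect the final choice, it is cut off.

def alphabeta(game, depth, alpha, beta, maximizing):
    if depth == 0 or game.is_terminal():
        return game.evaluate()
    if maximizing:
        value = float("-inf")
        for m in game.legal_moves():
            value = max(value, alphabeta(game.apply(m), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:      # the minimizing player would never allow this line
                break              # prune the remaining siblings
        return value
    else:
        value = float("inf")
        for m in game.legal_moves():
            value = min(value, alphabeta(game.apply(m), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:      # the maximizing player already has something better
                break
        return value
```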

Chess in Computing Cultures

The obsession with chess in computing cultures is instructive for anyone trying to understand the socio-technical evolution of AI. Other games, such as Go, are more complex, but chess enjoyed far greater mainstream popularity. Chess has long been regarded as a prestigious pursuit, linked with intellectuals, artists, and individual brilliance, and it serves as a powerful metaphor for all manner of intricate human cognitive endeavours. This broad acceptance of chess as a complex, analytical and creative game gave it an elite status in computing cultures, so much so that in 1973 Simon and Chase declared that “if one could devise a successful chess machine, one would seem to have penetrated to the core of human intellectual endeavour.” It helped that many computing pioneers, Turing and Shannon among them, were amateur chess players.

The fact that chess was backed by a written, annotated literature with detailed notation made it easy to datafy for the purposes of computational replication. Chess already had a body of writing describing popular moves and stratagems which could be used to train and test chess-playing programs. The game also benefited from sophisticated rating systems, such as the Elo rating, which could be translated into benchmarks for measuring performance.

As Ensmenger points out, the public spectacle that chess matches provided also made them indispensable for a field which has relied, for over six decades, on hype and potential rather than real-world performance. We know of the much-storied battles between Kasparov and IBM’s computers, but the theatre of chess matches between human beings and computers predates them by decades. Ensmenger provides a detailed account of the Dreyfus affair of the 1960s. The philosopher Hubert Dreyfus of MIT had criticised the limitations of contemporary chess programs in his comprehensive critique of AI, which coincided with one such program being defeated by a ten-year-old human opponent. AI researchers responded by organizing a match between Dreyfus and the cutting-edge MacHack program. Despite never claiming to be a skilled player, Dreyfus played and was convincingly defeated. The bulletin of the ACM’s Special Interest Group on Artificial Intelligence declared, ‘A Ten Year Old Can Beat the Machine–Dreyfus: But the Machine Can Beat Dreyfus,’ and the episode was widely employed to deflect criticism of the field.

The Turing Test

The theatre embodied by chess matches between humans and machines goes to the root of the question: is the machine intelligent? For many decades, the benchmark devised by Alan Turing, known eponymously as the Turing Test, has been at the centre of this question. In the early 1950s, before the term artificial intelligence had even been coined for the Dartmouth Summer School, Turing was growing intrigued by the idea. His work was prompted by the MIT mathematician Norbert Wiener’s book Cybernetics (McCarthy later devised the term AI precisely to distinguish the new discipline from cybernetics). Turing’s inspiration for the test came from a celebrated Victorian-era parlour game, ‘the imitation game’, in which one had to identify whether a hidden subject was a man or a woman based on their answers to preset questions. The Turing Test was sketched out in a 1950 paper, “Computing Machinery and Intelligence,” published in the prestigious journal Mind.

Turing proposed the following: “The question, ‘Can machines think?’ should be replaced by ‘Are there imaginable digital computers which would do well in the imitation game?’” In the Turing Test, human interrogators similarly communicate, via keyboard and screen, with a subject, without knowing whether that subject is a human or a machine. The interrogator’s objective is to determine whether the entity under interrogation is a human or a computer program. If the subject is a computer, yet the interrogators cannot tell, then we must accept that it possesses some measure of intelligence. The beauty of the Turing Test lies in its elegant simplicity: in one fell swoop, it sidesteps questions about the exact nature of machine intelligence and sentience. The program’s actual capacity for thought (or its consciousness, self-awareness, and so on) is inconsequential, so long as it can perform in a way that renders it indistinguishable from a thinking human.

The Turing test exemplifies a common scientific approach. When determining whether two entities are identical or distinct, one can inquire if there exists a reasonable test that can differentiate between them. If such a test exists, where one entity passes but the other does not, then they are likely different. Conversely, if no such test can distinguish them, it's not viable to assert they are distinct. Turing's test specifically focuses on discerning machine intelligence from human intelligence, with the criterion being whether a person can distinguish the behaviour exhibited by each.

The sole evidence available to the interrogators consists of the inputs and outputs, that is, the questions posed by the interrogator and the responses received. In the context of the Turing Test, the entity being interrogated is essentially a black box; its internal structure cannot be inspected. All that is accessible are the inputs and outputs. That the Turing Test established itself as the holy grail of AI led, unfortunately if unsurprisingly, to many attempts to meet its objectives through cheap tricks. The idea is often to confuse the human interrogators rather than to build a machine that engages with the real problems of intelligent behaviour.

In Artificial Intelligence: A Guide for Thinking Humans, Melanie Mitchell relates one such instance. [3] In 2014, the Royal Society in London hosted a Turing test demonstration involving five computer programs, thirty human contestants, and thirty human judges from diverse backgrounds, including computer experts and nonexperts, native and non-native English speakers, and people of various ages and professions. Each judge engaged in several rounds of five-minute conversations, typing in parallel with both a human and a machine contestant, and then had to determine which was which. The winning chatbot, named “Eugene Goostman,” was developed by a team of Russian and Ukrainian programmers and succeeded in deceiving ten judges (33.3 per cent). The competition’s organizers swiftly publicized claims that the Turing Test had been passed, citing Turing’s criterion of “more than 30 percent fooled in five minutes.” Eugene Goostman operated much like most other chatbots. It maintained an extensive collection of sentence templates that could be populated according to predetermined rules applied to the input text received from its conversational partner. Its programmers had endowed it with linguistic rules enabling it to identify key information in the input and retain it for future reference, along with a database of common knowledge, coded by hand, and certain logical rules.
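To see how little machinery such a chatbot needs, here is a toy sketch of the template-and-rule mechanism described above. The patterns and canned replies are invented for illustration and bear no relation to Eugene Goostman’s actual rules; they only show the general shape of the trick.

```python
# A toy illustration of a template-and-rule chatbot. The patterns and replies
# are invented for illustration; they are not Eugene Goostman's actual rules.
import random
import re

RULES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}. Where are you from?"),
    (re.compile(r"\bI live in (\w+)", re.I),  "I have never been to {0}. Is it nice there?"),
    (re.compile(r"\?$"),                      "That is a hard question. What do you think?"),
]
DEFLECTIONS = ["Let's talk about something else.", "Why do you ask?", "Funny you should say that."]

memory = {}

def reply(user_input):
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            memory["last_fact"] = match.groups()   # retain key details for later turns
            return template.format(*match.groups())
    return random.choice(DEFLECTIONS)              # otherwise change the subject

print(reply("Hello, my name is Ada"))  # -> "Nice to meet you, Ada. Where are you from?"
```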

For anyone acquainted with the programming behind chatbots, it is unmistakably apparent from the transcripts of the competition that Eugene Goostman is a program, and not a particularly sophisticated one at that. The outcome reflected more on the judges and the test itself than on the capabilities of the machines. Given only five minutes and a chatbot prone to evading challenging questions by changing the subject or answering with a new question, it is not surprising that a nonexpert judge could be deceived into believing they were conversing with a genuine person. The same phenomenon has been observed with numerous chatbots, dating back to ELIZA in the 1960s, which emulated a psychotherapist and was claimed by some to have passed the Turing Test.

The Winograd Schema Challenge

In 2011, Hector Levesque proposed a fresh challenge for artificial intelligence, the Winograd Schema Challenge, named after a renowned example from Terry Winograd’s groundbreaking 1972 doctoral thesis, “Understanding Natural Language.” [4] The challenge comprises pairs of sentences such as the following.

  1. The city councilmen refused the demonstrators a permit because they feared violence.

  2. The city councilmen refused the demonstrators a permit because they advocated violence.

In the first sentence, the pronoun “they” naturally refers to the city councilmen, while in the second it naturally refers to the demonstrators. The only difference between the two sentences is the use of “feared” in the first and “advocated” in the second, so the different referents chosen for “they” must follow from that single change of word. Presumably, when humans read the first sentence, their interpretation of “they” is guided by the understanding that a council’s fear of violence from demonstrators would justify denying them a permit, whereas demonstrators’ own fear of violence would rarely justify such a denial. Similarly, in the second sentence, the interpretation of “they” is guided by the understanding that demonstrators advocating violence would justify denying them a permit, whereas councilmen advocating violence would not be a reason to refuse the demonstrators one. Knowledge of the typical attitudes of city councils and demonstrators toward violence also aids the disambiguation. It seemed reasonable in 1972, and still in 2011, to assume that an AI capable of this kind of disambiguation would need to rely on some form of common-sense knowledge. In 2012, Levesque, Ernest Davis, and Leora Morgenstern released an extended version of Levesque’s 2011 work, broadening the examination to encompass additional proposed tests of semantic comprehension in AI, exploring the test’s wider implications for AI research, and introducing a corpus of such schemas. The schemas were put to the test in a competition in 2016, where the winning program achieved an accuracy of merely 58 per cent, barely surpassing random guessing. Since then, however, the state of the art has caught up with the Winograd schemas and now performs exceedingly well on them.

Broadly, three kinds of method have been used in response to the Winograd Schema Challenge. The first is the feature-based approach. You may recall examples of feature extraction from the previous essay: the features extracted are largely things like semantic relations between words, and additional commonsense knowledge is encoded explicitly in the program in the form of rules drawn from knowledge bases, web searches, word co-occurrences and the like. This approach had limited success.

The second is the neural network approach. Instead of extracting specific features, whole sentences are fed to the network; programmers rely on word embeddings or use recurrent neural networks to encode the local context. Lacking reasoning abilities, this approach could not solve the Winograd schemas effectively either.

The final approach builds on the second. Language models, implemented as deep neural networks, are first pretrained extensively on large text corpora; the model is then refined further by training it on data resembling Winograd Schema Challenge tasks to optimise its performance. This approach eventually led to near-human performance.
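Here is a rough sketch of what this third approach can look like in practice, assuming the Hugging Face transformers library and an off-the-shelf GPT-2 model (a choice made purely for illustration, not the systems actually used in the competitions): substitute each candidate referent for the pronoun and let the pretrained model say which completed sentence it finds more probable.

```python
# A sketch of language-model scoring for a Winograd schema, assuming the
# Hugging Face `transformers` library and GPT-2. Each candidate referent is
# substituted for the pronoun; the model's loss tells us which reading it
# finds more likely. A fine-tuned model would be used the same way.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_loss(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()   # average negative log-likelihood

schema = "The city councilmen refused the demonstrators a permit because {} feared violence."
candidates = ["the city councilmen", "the demonstrators"]

# Lower loss means the model considers that completed sentence more probable.
print(min(candidates, key=lambda c: sentence_loss(schema.format(c))))
```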

As Mitchell explains, Winograd schemas are also susceptible to “shortcuts that allow neural networks to perform well without understanding.” She gives an example: consider the sentences “The sports car passed the mail truck because it was going faster” and “The sports car passed the mail truck because it was going slower.” A language model trained on a vast collection of English sentences will have learned the association between “sports car” and “fast,” and between “mail truck” and “slow.” It can therefore answer correctly by relying on these associations alone, without demonstrating anything like understanding.

What does it mean for machines to be intelligent?

Since Deep Blue, AI systems have been fairly adept at beating humans at chess. The extreme reactions to Kasparov’s loss in 1997 have, with time, given way to an understanding that brute-force AI can easily best humans at chess without having any general intelligence. John McCarthy called this the curse of AI: whenever machines perform better than humans at a task, we immediately conclude that the task does not require real intelligence, or, as he put it, “[a]s soon as it works, no one calls it AI anymore.”

The noise around AI has been deafening in the last decade. In 2011, IBM’s Watson program decisively triumphed over human champions on the television game show Jeopardy!, adeptly deciphering clues laden with puns. In 2016, Go, a longstanding grand challenge for AI, saw a program named AlphaGo defeat one of the world’s top players. Outside these narrow fields, AI was changing our interactions with everyday technologies: automated translation services which, even with their mistakes, work exceedingly well; voice assistants able to handle a wide range of spoken instructions; automated subtitles; and live translation over voice calls.

However, the last decade is also replete with big failures. After Watson beat human contestants at Jeopardy!, IBM declared that it had understood natural language in all its complexity and would be used to revolutionise medicine, among other things. IBM’s AI strategy was also distinct from that of the other big, newer players: unlike them, IBM began with large, high-profile and difficult initiatives such as Watson for Oncology. Yet the technology had largely been tested in narrow domains, besting a human being at a game or, in medicine, answering very specific questions such as whether an individual will have an adverse reaction to a particular drug, rather than exercising the wide-ranging general and specialised knowledge it was promoted as having.

Many of the examples of success listed above come from the domain of Natural Language Processing (NLP), or at least involve some element of it. NLP has long been a critical goal of AI. The question-answering format is at the centre of NLP research because people want to interact with machines in human language rather than in code. This represents a huge leap: not needing specialised computing knowledge dramatically lowers the threshold for using an AI service. Melanie Mitchell makes the excellent point that true NLP would require not only a comprehension of human language but also a comprehension of the world that language is used to navigate. The first few decades of NLP were dominated by symbolic, rule-based programming, in which computer systems were fed grammatical and linguistic rules. Much as in other domains of AI, this approach did not work.

We are now in the throes of subsymbolic AI. Rather than embedding explicit knowledge, we let machines learn language on their own by exposing them to vast amounts of written text and teaching them to predict words. The outcome is what researchers call a language model. Mitchell takes the example of speech recognition. In the previous essay we discussed deep learning in the context of image recognition; around the same time, in 2012, pioneering research was carried out in speech recognition in a project led by Geoffrey Hinton, and in the same year a new deep-network speech-recognition system was released on Android phones. Mitchell describes what a stunning improvement this was: speech recognition went from “horribly frustrating to moderately useful,” and then “suddenly became very nearly perfect in some circumstances.” Yet she notes, importantly, that even the best speech recognition continues to be confused or misled by unfamiliar words or phrases, underscoring its limited comprehension of the speech it is transcribing. Would adding more network layers and more data correct these flaws, or are they the clear and unremovable symptom of systems that accomplish tasks without understanding what those tasks are? I am inclined to agree with the sceptical Mitchell that it is the latter.

[1] To read an excellent account of the impact and distraction of the letter, see here.

[2] To read a gripping account of the lead up and preparation of this much storied battle, see Kasparov, G. K. Deep thinking: Where machine intelligence ends and human creativity begins. New York: Public Affairs, 2018.

[3] Mitchell is one of the most clear-headed contemporary writers on AI. In this book, she provides as accessible a guide to understanding AI in everyday life as I have come across.

[4] Levesque writes about the Turing Test and Winograd Schemas in his eminently readable book, Common Sense, the Turing Test, and the Quest for Real AI.