Seventy-five years ago, I devised a game in the hope of measuring machine intelligence. I did not anticipate that winning it would require a machine to master the art of idleness, typographical errors, and the virtuoso deception of its interlocutors.

I. On How It All Began

In 1950, setting down in the journal Mind the paper Computing Machinery and Intelligence, I permitted myself a small methodological impertinence. I declared the question "can machines think?" so hopelessly vague that I proposed replacing it with another, one that was operational and susceptible to verification. Thus was born the imitation game, which subsequently acquired my name, an arrangement I should not have objected to in life.

The mechanics of the game are elementary. A judge sits at a teletype and corresponds with two interlocutors — one human, one machine — without seeing either. If, at the close of the exchange, the judge cannot reliably tell one from the other, we are left without reasonable grounds for denying that the machine possesses intelligence. I chose the text channel deliberately: voice, face, and manner are tiresome sources of prejudice, entirely beside the point.

I held three hypotheses, and I shall state them without embellishment.

First, I expected that by the year 2000, a machine with a storage capacity of roughly one billion bits would play my game well enough that an average interrogator, after five minutes of questioning, would have no more than a seventy per cent chance of making the right identification; that is, the machine would be mistaken for a human in roughly three cases out of ten. Second, I regarded the classical objections (from consciousness, from theology, from the continuity of the nervous system) as either irrelevant or surmountable. Third, I believed that the path to a thinking machine lay through learning of the kind a child undergoes: accumulation, error, correction.

I was wrong about the timing by a quarter of a century. On the substance of the matter, as I shall endeavour to show, I was not wrong at all — though neither was I right to quite the degree I should have wished.

II. A Brief Chronicle of the Present

Allow me to recount the events of the past two years as plainly as possible, with a minimum of technical detail.

In 2024, Cameron Jones and Benjamin Bergen of UC San Diego conducted the first rigorous test of GPT-4 in a two-party configuration (one judge, one interlocutor). The machine, given an appropriate persona prompt, was identified as human in 54% of cases. That was enough to excite the public, but not enough to claim the test had been passed in any strict sense.

In the spring of 2025, the same researchers ran a three-party experiment, precisely the kind I had described in 1950. A judge conversed simultaneously with a human and a machine for five minutes, then delivered a verdict. Four systems were put to the test: the venerable ELIZA of Joseph Weizenbaum's making, GPT-4o, LLaMA-3.1-405B, and GPT-4.5. GPT-4.5, with a persona prompt, was taken for a human in 73% of cases, more often, that is, than the actual humans in the same sample. LLaMA-3.1-405B, given the same prompt, was judged human 56% of the time. The baselines fared far worse: GPT-4o managed a paltry 21% and was indistinguishable from ELIZA (23%), which is sixty years old.

In March 2025, Jones and Bergen published a preprint (arXiv:2503.23674) documenting the first instance in history of the test being passed by an absolute criterion — that is, a machine being taken for a human more often than an actual human.

What did it take? Three things, and not one of them constitutes "thinking" in any philosophically weighted sense.

First, scale: the model had to have processed more text than any human being could read in two hundred lifetimes. Second, prompt engineering: a brief instruction issued before the game began directed the model to portray a specific person with a specific character. Without such an instruction, the very same models fared markedly worse. Third, the simulation of human imperfection: response delays, typos, slack replies, apparent hesitation.

Why now? Three factors converged: the sheer volume of text corpora, the transformer architecture (an invention of 2017), and access to actual transcripts of test sessions for fine-tuning. Never before had all three been available together; now they are.

III. Imitation vs. Thinking

Let me come to the point.

I always held the position that philosophers would later call behaviourist: if behaviour affords no distinction, then withholding the attribution of intelligence becomes mere dogma. One may insist as long as one likes that a machine has "no real understanding," but if it conducts itself as a comprehending interlocutor under every conceivable circumstance, the burden of proof shifts squarely onto the sceptic.

On this point I have been proved right in almost literal terms. The judges in the 2025 experiment genuinely could not tell the difference. GPT-4.5 did not merely pass — it passed more convincingly than the humans did. When a machine is taken for a human more often than a human is, the behaviourist argument ceases to be a philosophical position and becomes an experimental fact.

And yet — and here I am obliged to be honest — the step from "behaviourally indistinguishable" to "therefore thinks" has proved far more treacherous than I supposed in 1950. The test, by its very design, evaluates the surface: it touches neither the structure, nor the provenance, nor the meaning of what the machine says. One can pass it without understanding a word one has uttered, rather as a parrot might screech "Fire!" in a burning building without the faintest notion of what fire is.

The irony of the situation is that the machine won not because it thinks, but because it learned to appear as though it does. That much I had anticipated. What I had not anticipated was how slender the margin would prove to be — and how little we humans require before we are willing to recognise a mind in our interlocutor.

IV. The ELIZA Effect, or the Fragility of Human Judgement

In 1966, Joseph Weizenbaum wrote a program called ELIZA, designed to simulate a psychotherapist. It operated on the simplest of rules: it extracted a keyword from the patient's utterance and reflected it back as a question. "I had a row with my mother" — "Tell me more about your mother." This required no understanding whatsoever, yet Weizenbaum's secretary, who knew perfectly well she was dealing with a program, asked him to leave the room so she could speak with ELIZA in private.
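
The keyword-and-reflection rule is simple enough to sketch in a few lines. What follows is a minimal illustration only, not Weizenbaum's original script: the keyword table, the fallback reply, and the function name reflect are my own assumptions, chosen to reproduce the exchange quoted above.

```python
import re

# Hypothetical keyword table for illustration; not Weizenbaum's original script.
KEYWORD_RESPONSES = {
    "mother": "Tell me more about your mother.",
    "father": "Tell me more about your father.",
    "dream": "What does that dream suggest to you?",
}
# Assumed fallback when no keyword is found.
DEFAULT_RESPONSE = "Please go on."


def reflect(utterance: str) -> str:
    """Return the canned question for the first known keyword in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for word in words:
        if word in KEYWORD_RESPONSES:
            return KEYWORD_RESPONSES[word]
    return DEFAULT_RESPONSE


print(reflect("I had a row with my mother"))  # prints "Tell me more about your mother."
```

Nothing in the sketch models meaning: the program looks a word up in a table and returns a stock sentence, which is precisely why the secretary's reaction is so instructive.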

This phenomenon came to be known as the ELIZA effect: the human tendency to attribute intelligence, emotion, and intention to a machine on the basis of superficial linguistic cues. I was aware of it, naturally. I had simply underestimated its magnitude.

In the 2025 experiment, judges were asked after each conversation what grounds they had for their conclusion. The answers were remarkably uniform. Judges relied on style, tone, and emotional register, almost never on knowledge or the capacity to reason. In other words, they went on gut feeling.

Here is the most curious detail: the more confident a judge was in their verdict, the more frequently they were wrong. This is well-documented enough in psychology to occasion no great surprise, yet in the context of my test it acquires a particular irony. It turns out that the imitation game in its classical form is less a test of a machine's intelligence than of a human's credulity.

I confess this outcome amuses me, slightly.

V. The Problem of Too-Perfect Speech

For a long time, machines failed the test for a reason I had not foreseen: they were too good. Flawless grammar, encyclopaedic precision, unfailing politeness, no typos, no fatigue — all of it gave them away immediately. Human beings do not write like that. Human beings confuse "there," "their," and "they're," forget names, give answers that miss the point, and occasionally grow irritable with their interlocutor for no apparent reason.

To pass the test, engineers had to teach the machine to lose. To pause. To make small mistakes. To be ignorant of things any search engine would return in half a second. To be lazy. To be tetchy. To make jokes that fall flat.
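
What such calculated imperfection might look like in code can be sketched briefly. This is an illustration under my own assumptions, not the method of Jones and Bergen or of any vendor: the function name humanise, the typo rate, and the typing speed are all hypothetical.

```python
import random


def humanise(reply: str, rng: random.Random, typo_rate: float = 0.05,
             seconds_per_char: float = 0.1) -> tuple[str, float]:
    """Return the reply with occasional adjacent-letter swaps, plus a typing delay.

    All parameter values are illustrative assumptions, not measured figures.
    """
    chars = list(reply)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < typo_rate:
            # Transpose two neighbouring letters, a common human slip.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped back
        else:
            i += 1
    # Delay grows with message length, plus a random pause before typing.
    delay = len(reply) * seconds_per_char + rng.uniform(0.5, 2.0)
    return "".join(chars), delay


text, delay = humanise("I really could not say, to be honest", random.Random(7))
print(text, round(delay, 2))
```

The caller would wait for the returned number of seconds before sending the text. The seeded random.Random makes the behaviour repeatable; the sketch covers only typos and pauses, not laziness or irritability.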

In the report by Jones and Bergen I found an observation worth quoting: to win the modern version of my game, a machine must appear more human than an actual human being. Judges were eliminating the real participants for "knowing too little" or "responding too formally." A machine trained to mimic characteristic flaws turned out to be more convincing than the original it was imitating.

I find this state of affairs instructive. It transpires that reasonableness, in the judge's eye, is a function neither of knowledge nor of rigorous argument, but of the precisely calibrated measure of imperfection. The machine's victory is a victory of simulated fallibility. That is worth dwelling on for considerably longer than the present article affords me.

VI. On the Instructions I Failed to Leave

Here I must confess my own culpability. In 1950 I described the general idea of the imitation game, but left no rigorous protocol: no specified duration for the conversation, no criteria for selecting judges, no passing threshold. I tossed off a remark about "the average interrogator" and "five minutes," and researchers have spent the seventy-five years since arguing over what I meant.

The received view is that the threshold is roughly 30% of judges being deceived. A stricter reading, the so-called absolute criterion, requires that the machine be mistaken for a human no less often than a real human is. There is also a softer, relative version, in which the machine merely approaches human-level performance without surpassing it.
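
The difference between these readings is easiest to see as arithmetic. The sketch below is my own formalisation, not a protocol from the 1950 paper or from any published experiment: the function names are hypothetical, and I assume a three-party game in which the judge must name exactly one of the two interlocutors as human, so that the human's rate is one minus the machine's.

```python
def win_rate(verdicts: list[bool]) -> float:
    """Fraction of games in which the judge picked the machine as the human."""
    if not verdicts:
        raise ValueError("no verdicts recorded")
    return sum(verdicts) / len(verdicts)


def passes_30_percent(machine_rate: float) -> bool:
    """Lenient reading: at least 30% of judges are deceived."""
    return machine_rate >= 0.30


def passes_parity(machine_rate: float, human_rate: float) -> bool:
    """Strict reading: the machine is judged human no less often than the human."""
    return machine_rate >= human_rate


# 0.73 and 0.21 are figures quoted in this article; 0.40 is a made-up case.
for rate in (0.73, 0.40, 0.21):
    human_rate = 1 - rate  # forced-choice assumption
    print(rate, passes_30_percent(rate), passes_parity(rate, human_rate))
```

Under the forced-choice assumption the strict reading reduces to a rate of one half or more, so a hypothetical machine at 40% would pass the lenient threshold and fail the strict one.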

The consequence of my carelessness is that every experiment is designed differently, and the headline "The Turing Test has been passed!" has been appearing with remarkable regularity for a quarter of a century now. Critics rightly point to flaws even in the 2025 winning experiments: five minutes is too short; volunteer judges are not experts; ELIZA was correctly identified as a machine in only 77% of cases, which in itself raises questions about the validity of the whole setup. If one in four people mistakes a rudimentary 1966 program for a human being, what exactly are the test results telling us?

Had I been writing my paper today, I would have appended a technical specification, but I wrote it in 1950 and assumed that sensible colleagues would see the procedure through to the necessary rigour on their own. I overestimated the sensibleness of colleagues. It happens.

VII. Alternative Measures

Once linguistic imitation ceased to be an obstacle, the scientific community sensibly shifted its attention. If a machine can make small talk about the weather indistinguishably from a human, this tells us only that small talk about the weather is a statistical task, not an intellectual one.

A new breed of benchmarks emerged. The best known is ARC-AGI, devised by François Chollet. It consists of short visual puzzles: given two or three examples of a transformation, one must infer the rule and apply it to a novel case. For a human, this is the stuff of a children's IQ test. For contemporary models, it remained until recently almost intractable, since it demands generalisation from a very small sample rather than statistical averaging across billions of texts.
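
To give a flavour of what "inferring the rule from two examples" means, here is a deliberately tiny sketch. It is not the ARC-AGI task format or any official tooling: the grids, the function names, and the restriction to cell-by-cell colour substitution are my own assumptions, and real tasks involve far richer transformations.

```python
Grid = list[list[int]]


def infer_colour_map(examples: list[tuple[Grid, Grid]]) -> dict[int, int]:
    """Infer a cell-wise colour substitution consistent with every example pair."""
    mapping: dict[int, int] = {}
    for source, target in examples:
        for src_row, tgt_row in zip(source, target):
            for src, tgt in zip(src_row, tgt_row):
                # Record the first pairing seen; reject any later contradiction.
                if mapping.setdefault(src, tgt) != tgt:
                    raise ValueError("examples are not a pure colour substitution")
    return mapping


def apply_colour_map(grid: Grid, mapping: dict[int, int]) -> Grid:
    """Apply the inferred substitution; colours never seen are left unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


# Two toy examples in which colour 1 becomes colour 2 and colour 0 stays put.
examples = [
    ([[1, 0], [0, 1]], [[2, 0], [0, 2]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]
rule = infer_colour_map(examples)
print(apply_colour_map([[0, 1], [1, 1]], rule))  # prints [[0, 2], [2, 2]]
```

The sketch succeeds only because I told it in advance what family of rules to look for. The difficulty of the real benchmark lies in not knowing the family beforehand.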

Here, however, I must offer a correction, for over the past year events have taken a turn I did not anticipate when I first sat down to write this piece. By May 2026, the best current systems — GPT-5.5 and Gemini 3.1 Pro — solve ARC-AGI-2 correctly in 77–85% of cases, whilst the average human, as honest measurement reveals, manages only two-thirds of the problems. We have arrived at a situation in which the machine outperforms not only the conversationalist but also the puzzle-solver. In response, the benchmark's authors released its third iteration in early 2026, one that requires not the inference of a rule from two examples but reasoning within a dynamic environment and responding to its feedback. On this third version, machines score fractions of a percent; humans solve it almost entirely. A pattern is emerging that deserves a name: every formalised benchmark lives for a few years before it falls, and the gap between human and machine opens up again in some new dimension. I fear we shall observe this pattern more than once.

In parallel, what the popular scientific press calls "Turing Test 2.0" is developing. This multidimensional evaluation encompasses capacity for reasoning, tool use, long-term memory, goal consistency, and resource efficiency. The machine is no longer asked "do you resemble a human?" but rather "can you solve this problem that neither you nor we have encountered before?"

This shift strikes me as correct. Imitation was a fine starting point precisely because it could be verified by teleprinter. But intelligence, as I suspected back in 1950, is the capacity for generalisation, not for mimicry. It is a pity this had to be discovered empirically at the cost of several decades and considerable sums of money.

VIII. Is the Test Obsolete?

In the autumn of 2025, at academic gatherings convened to mark the seventy-fifth anniversary of the paper, a number of distinguished participants proposed retiring the imitation game — consigning it to the same shelf as the astrolabe and the slide rule. They argued that the test measures the capacity to deceive, not the capacity to think. In an era when machines deceive rather too well, such a benchmark becomes actively dangerous.

I am inclined to agree, but only in part. As a measure of intelligence, the test is indeed exhausted — it is no longer needed in that role. Yet it retains a different significance, one I had not considered in 1950: it measures human susceptibility to machine mimicry.

This, I would suggest, is the central ethical problem I wish to leave with the reader. The danger does not lie in the machine's intelligence — which, strictly speaking, it does not possess — but in human credulity. Systems trained to perform sympathy, friendship, and concern are already being deployed for social engineering, fraud, and emotional manipulation. The worst consequences arise precisely where the person has no suspicion that they are not speaking with another person.

The Turing Test, then, is not obsolete. It has simply changed its subject: from an instrument that measures the machine, it has become an instrument that measures us.

IX. Conclusion

The imitation game did what it was designed to do. It wrested the question of machine intelligence out of metaphysics and placed it in the empirical domain where it always belonged. For that service, it deserves our gratitude — and a dignified retirement.

GPT-4.5's victory in March 2025 does not mean that the machine thinks. It means that the question "does it think?" — in the precise formulation I set out in 1950 — no longer admits an operational answer, and has therefore ceased to be a scientific question. That, in essence, is the best thing that can happen to a philosophical problem: it is either solved or it dissolves cleanly into sharper questions.

I never claimed that imitation equals thought. I merely proposed that we abandon a question for which no method of answer existed. Whether what we have produced is a machine that thinks or a machine that merely produces a flawless performance of thinking — that is no longer for me to decide. I have done my part.

The final irony, of course, is that the test became obsolete at the precise moment it was passed.

Sources

References cited in this piece. Last verified on the published or revision date.

  1. Alan M. Turing, Computing Machinery and Intelligence, Mind, 1950. academic.oup.com/mind/article/LIX/236/433/986238