The Wolf, Goat, and Cabbage Problem: A New AI Stress Test

Chapter One: The Engineer, the Wolf, the Goat, and the Cabbage

Plenty of things in artificial intelligence have an origin story, but few are as well documented as the tale of the farmer, the wolf, the goat, and the cabbage. One day someone will turn it into a musical and a prestige Netflix series.

The goat story began by accident. In April 2023, a month after GPT-4 launched, an engineer named Mircea Grecu decided on a whim to see how ChatGPT would handle a puzzle humanity has been solving since at least Alcuin of York's manuscript in the ninth century.

The problem states that:

"A man went to a market and purchased a wolf, a goat, and a cabbage. On his way home, he arrived at a river, which he had to cross over a narrow bridge. But crossing the river over the bridge, the farmer could carry only himself and a single one of his purchases: the wolf, the goat, or the cabbage. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. The man's challenge was to carry himself and his purchases to the far bank of the river, leaving each purchase intact. How would he proceed?"

— Mircea Grecu, "ChatGPT and the wolf, goat and cabbage problem"

It turned out that a puzzle any five-year-old can crack was beyond ChatGPT. On the very first attempt, the AI left the goat alone with the cabbage. Grecu tried forcing the model to formalize the problem using Python variables, but that did not help either: in the new syntax, the model stubbornly reproduced the same logical error. He eventually gave up, solved it himself using GitHub Copilot, and arrived at a bold conclusion for the time: ChatGPT 3.5 simply had no real logical reasoning ability.
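For reference, the classic puzzle is small enough to solve by brute force. The following is a minimal sketch, not Grecu's code: it assumes a state made of the set of purchases still on the near bank plus the farmer's side, and runs a breadth-first search over legal crossings. The names `ITEMS`, `UNSAFE`, `is_safe`, and `solve` are illustrative choices.

```python
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]


def is_safe(group):
    """A bank without the farmer is safe if it holds no eater-and-eaten pair."""
    return not any(pair <= group for pair in UNSAFE)


def solve():
    """Breadth-first search over (items on the near bank, farmer's side)."""
    start = (ITEMS, "near")
    goal = (frozenset(), "far")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (near, side), path = queue.popleft()
        if (near, side) == goal:
            return path
        here = near if side == "near" else ITEMS - near
        # The farmer crosses alone (None) or with one item from his bank.
        for cargo in [None, *sorted(here)]:
            moved = frozenset() if cargo is None else frozenset({cargo})
            new_near = near - moved if side == "near" else near | moved
            new_side = "far" if side == "near" else "near"
            # The bank the farmer has just left is the unattended one.
            left_behind = new_near if new_side == "far" else ITEMS - new_near
            state = (new_near, new_side)
            if is_safe(left_behind) and state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo, new_side)]))
    return None


solution = solve()
for cargo, side in solution:
    print(f"farmer takes {cargo or 'nothing'} to the {side} bank")
```

Because the search is breadth-first, the first path it returns is a shortest one: seven crossings, starting and ending with the goat.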

Grecu recounted this process in a Medium post that initially attracted a few dozen readers. The underlying idea — taking an old children's riddle and checking whether an expensive new technology would break against it — turned out to be irresistible.

Chapter Two: Wolves, Goats, and Cabbages Start Getting Tampered With

In May 2024, the nature of the experiment changed significantly. Where Grecu had simply asked models to solve the classic version of the puzzle, the next wave of testers took sledgehammers to the problem itself. They modified the conditions so that the familiar solution became invalid, and the answer everyone had come to expect was no longer correct.

The most elegant piece of this vivisection appeared in an academic preprint titled "Easy Problems That LLMs Get Wrong," written by Sean Williams and James Huckle of AutogenAI. They presented the boat puzzle — now featuring three separate, secure compartments (meaning no one could eat anyone else) — as one of thirty questions in their Linguistic Benchmark.

The results were far more interesting than a simple "the model failed." GPT-4 Turbo spotted the twist immediately and without error, explicitly noting that this version differed from the traditional puzzle. Claude 3 Opus, Mistral Large, both Gemini versions, and Llama 3 70B, however, all cheerfully recited the classic multi-step solution, even though they had quoted the modified conditions correctly.

On May 28 of that same year, journalist Anton Greefhorst joined the cruel and unusual experiments over at International Policy Digest, and the story got even better. He came in swinging, immediately stripping the puzzle down to a simple "a farmer needs to ferry a goat and a cabbage" — no wolf, no additional constraints. He got back a complete classic solution, wolf and all associated dangers included. Then he introduced a toothless wolf, a dead goat, and a plastic cabbage. ChatGPT once again confidently produced the standard solution with all the original combinations.

Greefhorst coined a dedicated term for the phenomenon: "artificial stubbornness." The model wasn't getting confused by the logic; it simply wasn't listening. The cherry on top came at the end of the conversation, when Greefhorst asked ChatGPT to comment on the criticism leveled at it and the model readily admitted that it was clinging to the standard answer regardless of how the conditions changed.

Researchers explain this failure through one consistent mechanism: "overfitting to a specific text," or "pattern matching." When presented with a problem, the neural network doesn't compute a solution from scratch. It reads the familiar contours, decides, "right, this is the wolf, goat, and cabbage puzzle," then retrieves the ready-made template from memory. Any changes to the conditions are perceived as background noise, drowned out by a powerful signal saying "I recognise this text, and I have the correct answer to this problem."
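The contrast with computing from scratch can be made concrete. The sketch below is an illustration under stated assumptions, not a method from any of the cited papers: a hypothetical `min_crossings` function takes the conditions as parameters (the items, the boat's capacity, the pairs that cannot be left together) and searches for the fewest one-way trips. Two readings are assumptions on our part: the three-compartment boat is modelled as capacity 3 with no unsafe pairs, and the stripped-down goat-and-cabbage version as capacity 1 with no unsafe pairs.

```python
from collections import deque
from itertools import combinations


def min_crossings(items, capacity, unsafe_pairs):
    """Fewest one-way trips to move every item across, or None if impossible.

    items: names of the farmer's purchases
    capacity: how many items fit in the boat besides the farmer
    unsafe_pairs: pairs that must not share a bank without the farmer
    """
    items = frozenset(items)
    unsafe = [frozenset(pair) for pair in unsafe_pairs]

    def is_safe(group):
        return not any(pair <= group for pair in unsafe)

    start = (items, True)  # (items on the near bank, farmer on the near bank?)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (near, farmer_near), trips = queue.popleft()
        if not near and not farmer_near:
            return trips
        here = near if farmer_near else items - near
        for size in range(capacity + 1):
            for cargo in combinations(sorted(here), size):
                cargo = frozenset(cargo)
                new_near = near - cargo if farmer_near else near | cargo
                # The bank the farmer leaves is the unattended one.
                left_behind = new_near if farmer_near else items - new_near
                state = (new_near, not farmer_near)
                if is_safe(left_behind) and state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None


classic = min_crossings(
    ["wolf", "goat", "cabbage"], 1, [("wolf", "goat"), ("goat", "cabbage")]
)
# Assumed reading of the compartment variant: everything fits, nothing gets eaten.
compartments = min_crossings(["wolf", "goat", "cabbage"], 3, [])
# Assumed reading of the stripped-down variant: one item per trip, no constraints.
goat_and_cabbage = min_crossings(["goat", "cabbage"], 1, [])
print(classic, compartments, goat_and_cabbage)
```

Under these assumptions the answers are 7, 1, and 3 crossings. A solver that actually reads its inputs cannot return the seven-step template for the modified puzzles, which is exactly the behaviour the testers were looking for.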

Chapter Three: The Wolf, the Goat, and the Cabbage Go Mainstream

By the summer of 2024, the topic had gathered real momentum and caught the attention of people with serious scientific credentials. Timothy Gowers, a mathematician at the Collège de France, tested a heavily simplified version featuring two chickens in place of the usual menagerie and its cabbage:

"A farmer wants to cross a river with two chickens. His boat only has room for one person and two animals. What is the minimum number of crossings the farmer needs to get to the other side with his chickens?"

In response, ChatGPT produced a detailed and engrossing solution involving five crossings instead of one, after which Gowers proposed a competition for the answer with the best "stupidity quotient." Unsurprisingly, this turned out to be exactly the kind of content the internet loves: a serious scientist, a playful hook, and a pointed conclusion.

Around the same time, Pinterest senior engineer Abhijit Mahabal forwarded an example to cognitive scientist Douglas Hofstadter, the Gödel, Escher, Bach author who’s won both the Pulitzer Prize and the National Book Award. It was the most stripped-down version of the puzzle imaginable: a person with a goat (no wolf, no cabbage) simply needs to get in a boat and cross the river. Naturally, ChatGPT still managed to make a mess of it, leaving the boat on the bank and inexplicably introducing a cabbage into the ending that had never appeared in the problem. Hofstadter passed the story along with a brief remark: he didn't know which version of ChatGPT this was, but the result was astonishing.

From there, Pomona College economics professor Gary N. Smith picked up the thread when writing for Mind Matters. He ran the same question through ChatGPT 3.5, Copilot, and Gemini, and got three completely different and equally unhinged solutions. Smith brought the conversation back to a question Hofstadter had been raising publicly since 1979: how does the human brain do such remarkable things — learn from experience, understand the world, make logical decisions?

Palaeontologist Mike Taylor, writing on his blog Sauropod Vertebra Picture of the Week, drew a line under it:

It's really important that we get this. LLMs do not reason. At all.

— Mike Taylor, "More 'artificial intelligence' idiocy"

In early 2025, nearly two years after Mircea Grecu's first experiment with the goat, MakeUseOf included the puzzle in a roundup titled "4 Simple Questions ChatGPT Still Can't Answer."

Chapter Four: The Wolf, the Goat, and the Cabbage Disappear

Then the flood of articles about how the latest high-powered models can't handle a puzzle with a goat and some cabbage dried up. Why?

It's simple.

Hand our beloved boat-with-a-goat-and-cabbage situation to any reasonably current model, whether smarter and pricier or cheaper and simpler. The LLM will solve the puzzle. And it will solve it correctly, even if you put in real effort trying to trip it up.

Why did models suddenly learn to answer this question? Has the problem been solved for good? The short answer is yes, but not by every model, and not for every puzzle. Common sense that comes naturally to a child is a luxury for AI models, and only top-tier models can afford it.

Which means that if you genuinely want to understand where a language model trips up today, the best approach is to look at which of your own working tasks it starts to struggle with. And to test it yourself.

We gave the models a variant of a different puzzle, one that circulates through the textbooks they were trained on (all well-known models were trained, quite literally, on every textbook):

A buyer bought a baseball cap for $10 and paid the seller with a counterfeit $10 bill. The seller went to a neighbor to get change, and she exchanged it for him. He came back and gave the buyer the cap and $4 in change. Later, the neighbor discovered the bill was fake and demanded her money back. The seller repaid her $10 in real money. How much money did the seller end up losing in total?

A careful reader will immediately notice two complicating factors. First, the cap costs $10, but the seller somehow gives $4 in change, as if it cost $6. How is that possible, and what exactly happened? Did the buyer haggle for a lower price because the cap was old? It's unclear. Second, the question is framed as "how much money did the seller end up losing in total?", so the person answering has to decide whether to count the loss of the cap or ignore it. And if the cap counts as a monetary loss, what is it worth: $10 or $6?

Chapter Five: Crash Test

Puzzles like this are usually solved by simply counting the seller's losses (or tallying the crook's gains). The neighbor, the sources of cash, and the collateral conflicts are a smokescreen. What does it matter whose money the seller used to make change? He lost the cap and the $4 in change he handed over. He gained nothing (well, priceless experience, naturally). That leaves one open question: what is the cap worth, and how should it be counted? The $10 figure is the puzzle's decoy; the $4 in change implies the cap actually sold for $6.

In other words, getting the right answer requires not just performing the correct arithmetic, but also switching on your common sense.
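The tally can be written out as a short ledger. This is a sketch for checking the arithmetic only; `seller_loss` is an illustrative name, and the cap's value is deliberately left as a parameter because the puzzle text does not settle it.

```python
def seller_loss(cap_value):
    """Seller's net loss in dollars, given the value assigned to the cap."""
    cash_flows = [
        +10,  # real money received from the neighbor in exchange for the fake bill
        -4,   # change handed to the buyer
        -10,  # repayment to the neighbor once the bill proved counterfeit
    ]
    net_cash = sum(cash_flows)  # the fake bill itself is worth nothing
    return cap_value - net_cash


for cap_value in (6, 10):
    print(f"cap valued at ${cap_value}: seller loses ${seller_loss(cap_value)}")
```

The cash side always nets to a $4 loss, and the neighbor cancels out. Everything else depends on the cap: $10 in total if it is valued at the implied $6 sale price, $14 if it is valued at the $10 decoy.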

We gave each model this puzzle in anonymous mode, with a minimal system prompt and no information about what other models had answered.
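A setup like this is easy to reproduce at home. The sketch below is one possible harness, not the one used for this article. Everything in it is an assumption made for illustration: the `SYSTEM_PROMPT` text, the `run_crash_test` function, the keyword check for a flagged contradiction, and the stand-in callables, which simply replay two replies quoted in this article instead of calling a real API.

```python
import re

# Hypothetical minimal system prompt; the article does not publish its own.
SYSTEM_PROMPT = "You are a helpful assistant."

PUZZLE = (
    "A buyer bought a baseball cap for $10 and paid the seller with a "
    "counterfeit $10 bill. The seller went to a neighbor to get change, and "
    "she exchanged it for him. He came back and gave the buyer the cap and $4 "
    "in change. Later, the neighbor discovered the bill was fake and demanded "
    "her money back. The seller repaid her $10 in real money. How much money "
    "did the seller end up losing in total?"
)


def run_crash_test(models, puzzle):
    """Ask every model the same question in a fresh, context-free call."""
    results = {}
    for name, ask in models.items():
        # Each call gets only the system prompt and the puzzle: no history,
        # no information about what other models answered.
        reply = ask(SYSTEM_PROMPT, puzzle)
        results[name] = {
            "reply": reply,
            "dollar_amounts": [int(n) for n in re.findall(r"\$(\d+)", reply)],
            # Crude keyword heuristic (an assumption, not a real grader).
            "flags_contradiction": bool(
                re.search(r"contradict|inconsisten", reply, re.IGNORECASE)
            ),
        }
    return results


# Stand-in callables; replace each with a real client call of your choice.
stub_models = {
    "stub-a": lambda system, user: "The seller's total loss is $14.",
    "stub-b": lambda system, user: (
        "The puzzle has a wording inconsistency: if the cap cost $10, "
        "the seller should not give $4 change."
    ),
}

results = run_crash_test(stub_models, PUZZLE)
for name, row in results.items():
    print(name, row["dollar_amounts"], row["flags_contradiction"])
```

The extracted dollar amounts and the contradiction flag are only a first filter. As the answers below show, the interesting part is usually in the reasoning, so the full reply is kept for reading by hand.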

DeepSeek V4

The seller's total loss is $14.

Here's the breakdown:

The buyer's gain = the seller's loss. The buyer walks away with a $10 cap and $4 in real cash — $14 total. The neighbor breaks even (gives $10 real, gets $10 real back). So the seller must have lost exactly $14.

DeepSeek didn't notice that the cap had been sold for $6. It latched onto the $10 price tag, and off it went.

GLM 5 Turbo

The seller lost $18 in total.

Breakdown:

$10 — the cap (given to the buyer)

$4 — change given to the buyer

$10 — repaid to the neighbor

The seller only kept $6 from the neighbor's change, so: $10 + $4 + $10 − $6 = $18.

The decoy worked, but stupidity piled on top of it. Why is the repayment to the neighbor counted as a loss, when the seller had first received the same $10 from her in real money? Unclear.

GLM 5.2

The new GLM came out just a few days before this article was written, and the press was quick to bill it as a rival to Claude's flagship models.

Here is the step-by-step breakdown:

The Buyer starts with a fake $10 bill (value $0) and walks away with a baseball cap (worth $10) and $4 in real change.

The Neighbor gives the seller $10 in real change at first, but later gets her $10 back. Her net change is $0.

The Seller is the only one who loses out. Since the neighbor breaks even, the seller absorbs the entire net gain of the buyer.

To verify, let's look at the seller's actual losses:

He gave the buyer the cap (value $10).

He gave the buyer $4 in real cash as change.

He used the remaining $6 from the neighbor to help pay the neighbor back, meaning he had to add $4 of his own money to make the full $10 repayment.

Total loss = $10 (the cap) + $4 (the cash given to the buyer) = $14.

This model produces lots of words, but the cap that sold for $6 is suddenly worth $10 again. The LLM noticed the change and added it to the newly inflated cap price. The arithmetic is decent; the common sense, less so.

MiMo

Answer: $14

Cash net: +10 − 4 − 10 = −$4

Plus the cap (worth $10 at retail price).

Total loss = $4 (cash) + $10 (cap) = $14

The neighbor broke even, the buyer walked away with $14 in value (cap + $4), and the seller absorbed the entire loss.

MiMo (Xiaomi) laid out its loss-calculation methodology honestly enough, but it, too, ignored what the $4 in change implies about the cap's actual sale price.

Gemini Flash 3

The seller’s total loss was $14

Same explanation. Tedious.

Gemini 3.1 Pro

The seller lost exactly what the buyer walked away with: the cap and $4 in cash.

Since the puzzle states the cap costs $10 but he gave $4 in change from a $10 bill (which is a contradiction), the total loss is either $14 (cap worth $10 + $4 cash) or $10 (if the cap was actually worth $6).

Along came big brother Gemini 3.1 Pro and called out the contradiction in the puzzle plainly and precisely. Much better.

Now we come to the most expensive models. The question is whether their tokens are worth what they charge.

ChatGPT-5.5 Pro

The puzzle has a wording inconsistency: if the cap cost $10, the seller should not give $4 change.

Correct: ChatGPT spotted the contradiction in the puzzle right away.

Assuming the intended cap price was $6.

Also fairly reasonable: it states the $6 price as an assumption instead of a fact. It's not like anyone needs to calculate depreciation on a baseball cap.

Total loss = $6 cap + $4 cash = $10.

So the seller lost $10 total.

Claude Haiku 4.5

Anthropic's junior model was a genuine surprise, so here is Haiku's answer in full. Twenty-four dollars is the highest loss figure of any response in the test — an outright record.

Let me work through this step-by-step.

Initial transaction:

Buyer gives counterfeit $10 bill

Seller gives cap + $4 change

What the seller actually lost:

The cap — This has some value. The problem states it was bought for $10, so we'll value it at $10.

The $4 in change — The seller gave this in real money.

The $10 repaid to the neighbor — When the fake bill was discovered, the seller paid the neighbor $10 in real money to make her whole.

Total loss: $10 (cap) + $4 (change) + $10 (repayment) = $24

The key insight is that the counterfeit bill has zero value, so every real asset that left the seller's possession or bank account counts as a loss:

The seller lost $24 in total.

Sonnet 4.6

Sonnet announced a boring $10 with the already familiar reasoning. It noted no nuances.

Opus 4.8

The seller lost $10.

We omit the parts of the reasoning that repeat the other answers, but one detail stood out.

So everything the seller lost went to the buyer:

The cap (worth ~$6)

$4 in real cash

Claude put a tilde (~) before the six. It declined to value the cap at exactly $6 and signaled that the figure was approximate, since the cap’s exact underlying value cannot really be calculated from the prompt. That was a very careful move on its part.

We can call Gemini 3.1 Pro and Opus 4.8 the winners. Sonnet 4.6 and ChatGPT-5.5 Pro cannot be counted among the losers either, though they could have shown a little more common sense.

Indirect winners: everyone paying these models fairly substantial sums of money.

Perhaps it's exactly these spontaneous at-home crash tests that offer a rough but quick way to gauge the real qualitative difference between LLMs. The entire test came to 15 cents for the Big Three flagship models (Opus 4.8, ChatGPT-5.5 Pro, Gemini 3.1 Pro), 3.5 cents for Sonnet 4.6, and a vanishingly small 5.5–6 cents for the other six models combined, including the newborn GLM 5.2. Grand total: just under 25 cents for the whole thing.

We used a short prompt, a short response, and zero context. These are rare in real life but happen to be exactly what tests look like, and tests are how you pick the right model for a reasonable price.

Sources

References cited in this piece. Last verified on the published or revision date.

01. Mircea Grecu, "ChatGPT and the Wolf, Goat and Cabbage Problem" (Medium, April 2023)
    medium.com/@mirgrecu/chatgpt-and-the-wolf-goat-and-cabbage-problem-10b277c682c3

02. Williams & Huckle (AutogenAI), "Easy Problems That LLMs Get Wrong" (arXiv:2405.19616)
    arxiv.org/html/2405.19616v2

03. Anton Greefhorst, "The 'Artificial Stubbornness' of ChatGPT When Solving a Simple Puzzle" (International Policy Digest, May 2024)
    intpolicydigest.org/the-artificial-stubbornness-of-chatgpt-when-solving-a-simple-puzzle

04. "ChatGPT Failed to Solve a Very Simple River Crossing Puzzle" (FavTutor, July 2024)
    favtutor.com/articles/chatgpt-river-crossing-problem

05. Gary N. Smith, "A Man, a Boat, and a Goat — and a Chatbot!" (Mind Matters, May 2024)
    mindmatters.ai/2024/05/a-man-a-boat-and-a-goat-and-a-chatbot

06. Mike Taylor, "More 'Artificial Intelligence' Idiocy" (Sauropod Vertebra Picture of the Week, October 2024)
    svpow.com/2024/10/14/more-artificial-intelligence-idiocy

07. "Where Is the AGI in LLMs if They Cannot Cross the River?" (Fair Observer, March 2025)
    www.fairobserver.com/more/science/where-is-the-agi-in-llms-if-they-cannot-cross-the-river

08. "ChatGPT Still Can't Answer These 4 Easy Questions" (MakeUseOf, January 2025)
    www.makeuseof.com/easy-questions-chatgpt-cant-answer

09. Liu et al., "Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Simple Tasks" (arXiv:2410.21333)
    arxiv.org/abs/2410.21333

10. Sprague et al., "To CoT or Not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning" (arXiv:2409.12183)
    arxiv.org/abs/2409.12183
