In the lab, when LLMs summarize existing documents, hallucination has dropped to roughly 7% (Vectara HHEM). In the newsroom, when a journalist asks about an event from that morning, the same families of models return a significant problem in 45% to 76% of answers (EBU/BBC). Both figures are accurate. This piece discusses the distance between them.
Late last year, Google published a factuality benchmark that its own strongest model failed. It was a good move, an honest one. We keep trusting Google even though we have every right, indeed every duty, to doubt its products. Gemini 3 Pro topped the FACTS Suite at 68.8%, and no other model scored higher, GPT-5 and Claude Opus 4.5 included. Model capability is certainly climbing, but truthfulness in actual production is not keeping pace with it. Let's look at the table.
Claude vs ChatGPT vs Gemini: the honest scoreboard
| Measure | Claude | ChatGPT | Gemini |
|---|---|---|---|
| Vectara HHEM hallucination (lower better) | Opus 4.5: 10.9% | GPT-5.4: 7.0% | Gemini 2.5 Pro: 7.0%; Gemini 3 Pro: 13.6% |
| SimpleQA Verified accuracy | Opus 4: ~54% (only 35.5% attempted) | GPT-5 main: 46% | Gemini 3 Pro: 72.1% |
| FACTS Suite | not leading | not leading | Gemini 3 Pro: 68.8% (leader) |
| Live news problem rate (EBU/BBC) | not tested directly | ~24% | 72–76% |
| Public trust as news source (Reuters 2025) | not measured | 29% | 18% |
| Calibration | best (hedges, refuses) | confidently wrong | confidently wrong, harder-to-spot errors |
Taken together, the rows make the obvious question — which model wins — hard to answer cleanly. Gemini leads two of the three lab benchmarks and trails badly on the one test built from live journalism. Claude trails on accuracy but wins on calibration. ChatGPT ranks first on some benchmarks and last on others, yet it holds the lead in public trust: 29% against Gemini's 18%. People invest their trust by brand, not data.
What the lab benchmarks say (Vectara HHEM, SimpleQA Verified)
The Vectara HHEM leaderboard, updated 11 May 2026 across more than 7,700 articles, measures one narrow thing: given a source text and asked for a summary, does the model stay faithful to it? On that task the frontier holds near 7% hallucination, with GPT-5.4 and Gemini 2.5 Pro both at 7.0%. Claude's best entry, Opus 4.5, comes in at 10.9%.
Marketing tends to skip a catch here: the newest reasoning models often score worse rather than better. Gemini 3 Pro lands at 13.6%, close to double the error rate of its 2.5 Pro predecessor, and Claude Opus 4.6 (12.2%) trails Opus 4.5 (10.9%). Vectara reads this as reasoning models overworking the text and drifting away from the source. The explanation is sound; almost every one of us has already experienced a moment when the smart (and expensive) model turned out weaker than its dumber relative.
On SimpleQA Verified (Epoch AI), Gemini posts the best result: Gemini 3 Pro at 72.1% accuracy against 54.5% for Gemini 2.5 Pro. GPT-5's main model scores 46%. Claude Opus 4 lands around 54%, but only among the questions it chose to answer, and it attempted just 35.5% of them. Anthropic, for its part, does not publish SimpleQA in its system cards, which is worth keeping in mind when you compare how openly each lab reports its weak spots.
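The gap between accuracy on attempted questions and accuracy over the whole question set is easy to lose. As a rough sketch, assuming the roughly 54% figure really is accuracy among attempted questions (as described above) and that unattempted questions count as not answered correctly, the share of all questions Claude Opus 4 got right works out as follows. The function name is illustrative, not part of any benchmark tooling.

```python
def overall_correct_share(accuracy_when_attempted: float, attempt_rate: float) -> float:
    """Share of ALL questions answered correctly.

    Assumes a declined question is simply not counted as correct.
    """
    return accuracy_when_attempted * attempt_rate


# Figures quoted in the text: ~54% accuracy on the 35.5% of questions attempted.
claude_opus_4_share = overall_correct_share(0.54, 0.355)
print(f"Correct over the full question set: {claude_opus_4_share:.1%}")
```

Under those assumptions the result is about 19% of all questions. That is not directly comparable to the other models' headline numbers unless their attempt rates are known too, and the text does not give them.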
What the real news tests say (EBU/BBC, 45%–76% failure)
Take the same models off curated benchmarks and put them on live news, and the results shift sharply. In October 2025, the EBU and BBC ran a study across 22 broadcasters in 18 countries and 14 languages, in which working journalists graded more than 3,000 answers. Of those responses, 45% included at least one significant problem, 31% had serious sourcing flaws, and 81% contained some error, even if just a minor one. Gemini was the weakest performer, with significant problems in 76% of answers and sourcing issues in 72%, roughly three times ChatGPT's rate.
A BBC study from February 2025 had ranked ChatGPT the strongest of that round at a 15% error rate, with Gemini at 34%. Both investigations found that models altered or made up 13% of quotes attributed to BBC articles. The Tow Center reached a similar verdict using eight AI search tools and 1,600 queries: more than 60% of citations were wrong and the tools tended to be wrong with confidence. Of 134 incorrect citations, ChatGPT hedged on only 15.
AI may be able to summarize documents without much trouble, but attributing open-web news stories is a much bigger problem — and one that matters much more to journalists.

What is the most accurate AI model?
The most honest answer: none. No LLM clears roughly 70% on Google's full FACTS Suite, leading to the bleak conclusion that there’s no “most accurate” AI model, only a set of trade-offs that change with the task.
Ask for closed-fact recall and Gemini 3 Pro leads at 72.1% on SimpleQA Verified and 68.8% on FACTS. Ask for document summarization and GPT-5.4 shares the top spot with Gemini 2.5 Pro, at 7.0% hallucination. Ask about that morning's news and close to half of all answers carry a significant problem, with Gemini's rate far higher.
Accuracy vs calibration: why Claude "wins" by refusing
Claude's reputation for honesty rests on measurable data. On SimpleQA Verified, it declined to attempt nearly two-thirds of the questions. In the LumiChats run, it logged only 3 confidently wrong answers against ChatGPT's 8 and Grok's 14, and it did best on niche or ambiguous facts by signaling uncertainty instead of bluffing. Tom's Guide's stress test on the Iran strike pointed the same way: Claude stayed with verified sources while Gemini produced the most detailed answer and also the most invented one, down to fabricated times, names, and figures.
There is a strong case for treating this as the real win. Journalists can recover from not knowing a fact, but a confident fabrication that slips into print often causes more lasting damage.
In other conditions, though, Claude’s honesty fails. The most rigorous head-to-head on bibliography generation (Cabezas-Clavijo & Sidorenko-Bautista, 2025) tested free, search-less models and found Claude fabricating 64% of references, putting it below only Copilot and Perplexity for accuracy. Strip out live search and Claude's caution turns into confabulation from memory, showing that calibration depends on configuration rather than being a fixed property of the model.
Why smarter isn't truer (the o3/o4-mini paradox)
OpenAI's own system card records a paradox: a model's capability and its truthfulness have diverged. The o3 model hallucinated on 33% of PersonQA prompts, compared to 16% for the older o1, and the smaller o4-mini reached 48%. Reasoning models make more claims overall, which produces more correct answers and more fabrications at the same time. The Vectara reversals run in the same direction: Gemini 3 Pro is the newer and stronger model, and it summarizes less faithfully than the version it replaced.
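To put the reversals on one scale, the short sketch below divides each newer model's hallucination rate by its predecessor's, using only figures quoted in this piece. The helper name and the dictionary layout are illustrative choices.

```python
def relative_rate(newer: float, older: float) -> float:
    """How many times higher the newer model's hallucination rate is."""
    return newer / older


# (newer model rate, older model rate), in percent, as quoted in this piece.
comparisons = {
    "o3 vs o1 (PersonQA)": (33.0, 16.0),
    "o4-mini vs o1 (PersonQA)": (48.0, 16.0),
    "Gemini 3 Pro vs Gemini 2.5 Pro (Vectara HHEM)": (13.6, 7.0),
    "Claude Opus 4.6 vs Opus 4.5 (Vectara HHEM)": (12.2, 10.9),
}

ratios = {label: relative_rate(new, old) for label, (new, old) in comparisons.items()}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

The ratios come out at roughly 2.06x, 3.00x, 1.94x and 1.12x. PersonQA and Vectara HHEM measure different tasks, so the ratios are only meaningful within each benchmark, not across them.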
The McGill study from March 2026 found all four major assistants performing badly at attribution, with ChatGPT the worst at naming the originating outlet. Gemini covers more reporting but buries the source inside the prose. Claude, per the Reuters/Axios citation study, references news outlets least often of the group — twenty times less than Gemini and fifty times less than ChatGPT.

When to use Claude vs ChatGPT vs Gemini
| Task | Use | Why |
|---|---|---|
| Your own documents/PDFs | Claude (Citations API) | grounds claims to exact sentences in a closed corpus |
| Multi-step research report | ChatGPT Deep Research | strongest autonomous research feature, with a source list per claim |
| Live news and current events | Gemini (Search grounding) | best raw accuracy on fresh facts via Google Search; its sourcing was the weakest in the EBU/BBC study, so verify every citation |
| Minimizing the risk of believing a fabrication | Claude | best calibrated, admits uncertainty |
| Academic bibliography | none of the three (use Elicit/SciSpace/Consensus) | all three fabricate from memory; DOIs often resolve to the wrong paper |
| News attribution (who published it) | none reliable; verify by hand | ChatGPT names the outlet worst (McGill); Gemini hides sources in body text; Claude rarely cites news at all |

The 70% ceiling and what it means for journalists
Look at the two facts side by side: the best lab hallucination rate is about 7%, while the largest independent live-news study still found a significant problem in close to half of all answers. The benchmarks that improved measure a task, the faithful summary of an existing document, that journalists rarely face in practice. On open-web attribution of fresh news, a frequent duty in the industry, LLMs still lag far behind.
In the Reuters Institute's 2025 report, chatbots came last, at 9%, among the tools respondents trust to verify information. That skepticism is warranted when no leading model crosses roughly 70% on the full FACTS Suite.
Taking this data into account gives newsrooms a good operating rule: use models to find leads and rough out structure, then open every cited source by hand before anything runs. A confident citation from Claude, Gemini, or ChatGPT is a place to begin checking, not an end to the job at hand.
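One way a desk might make that rule concrete is a checklist that refuses to clear an answer until a person has opened each cited link. The sketch below is a minimal illustration, not an established newsroom tool: the `SourceCheck` class, the function names, and the simple URL pattern are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern: stops at whitespace and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


@dataclass
class SourceCheck:
    url: str
    opened_by_hand: bool = False   # a person loaded the page
    supports_claim: bool = False   # the page actually says what the model claims


def build_checklist(model_answer: str) -> list[SourceCheck]:
    """Collect each distinct URL cited in a model answer, in order of appearance."""
    seen: list[str] = []
    for url in URL_PATTERN.findall(model_answer):
        url = url.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return [SourceCheck(url) for url in seen]


def ready_to_run(checklist: list[SourceCheck]) -> bool:
    """True only if there is at least one source and every one was checked and holds up."""
    return bool(checklist) and all(
        item.opened_by_hand and item.supports_claim for item in checklist
    )


answer = "See https://example.org/a. Also (https://example.org/b) and https://example.org/a again."
checklist = build_checklist(answer)
print([item.url for item in checklist], ready_to_run(checklist))
```

An answer with no links at all never clears, which matches the spirit of the rule: an unsourced claim still needs checking by other means.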