In the lab, when LLMs summarize existing documents, hallucination has dropped to roughly 7% (Vectara HHEM). In the newsroom, when a journalist asks about an event from that morning, the same families of models return a significant problem in 45% to 76% of answers (EBU/BBC). Both figures are accurate. This piece discusses the distance between them.

Late last year, Google published a factuality benchmark that its own strongest model failed. It was a good move, an honest one. We keep trusting Google even though we have every right, indeed every duty, to doubt its products. Gemini 3 Pro topped the FACTS Suite at 68.8%, and no other model scored higher, GPT-5 and Claude Opus 4.5 included. Model capability is certainly climbing, but truthfulness in actual production does not keep pace with it. Let's look at the table.

Claude vs ChatGPT vs Gemini: the honest scoreboard

| Measure | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Vectara HHEM hallucination (lower is better) | Opus 4.5: 10.9% | GPT-5.4: 7.0% | Gemini 2.5 Pro: 7.0%; Gemini 3 Pro: 13.6% |
| SimpleQA Verified accuracy | Opus 4: ~54% (only 35.5% attempted) | GPT-5 main: 46% | Gemini 3 Pro: 72.1% |
| FACTS Suite | not leading | not leading | Gemini 3 Pro: 68.8% (leader) |
| Live news problem rate (EBU/BBC) | not tested directly | ~24% | 72–76% |
| Public trust as news source (Reuters 2025) | not measured | 29% | 18% |
| Calibration | best (hedges, refuses) | confidently wrong | confidently wrong, harder-to-spot errors |

Taken together, the rows make the obvious question — which model wins — hard to answer cleanly. Gemini leads two of the three lab benchmarks and trails badly on the one test built from live journalism. Claude trails on accuracy but wins on calibration. ChatGPT ranks first on some benchmarks and last on others, yet it holds the lead in public trust: 29% against Gemini's 18%. People invest their trust by brand, not data.
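
One way to see why no single winner emerges is to rank the assistants one row at a time. The sketch below is illustrative only: it copies the numeric rows of the scoreboard above, uses each vendor's best listed entry where a cell holds two, and the dictionary layout and the `leaders` helper are assumptions made for this example rather than part of any benchmark's tooling.

```python
# Figures copied from the scoreboard above. None means no number is reported.
# For Vectara, Gemini's best listed entry (2.5 Pro, 7.0%) is used.
SCOREBOARD = {
    # measure: (higher_is_better, {assistant: value in percent})
    "Vectara HHEM hallucination": (False, {"Claude": 10.9, "ChatGPT": 7.0, "Gemini": 7.0}),
    "SimpleQA Verified accuracy": (True, {"Claude": 54.0, "ChatGPT": 46.0, "Gemini": 72.1}),
    "Public trust as news source": (True, {"Claude": None, "ChatGPT": 29.0, "Gemini": 18.0}),
}


def leaders(higher_is_better, scores):
    """Return every assistant tied for the best reported value on one measure."""
    reported = {name: value for name, value in scores.items() if value is not None}
    best = max(reported.values()) if higher_is_better else min(reported.values())
    return sorted(name for name, value in reported.items() if value == best)


for measure, (higher_is_better, scores) in SCOREBOARD.items():
    print(f"{measure}: {', '.join(leaders(higher_is_better, scores))}")
```

Each measure names a different leader or a tie, which is the point of the paragraph above: the answer depends on which row you read.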

What the lab benchmarks say (Vectara HHEM, SimpleQA Verified)

The Vectara HHEM leaderboard, updated 11 May 2026 across more than 7,700 articles, measures one narrow thing: given a source text and asked for a summary, does the model stay faithful to it? On that task the frontier holds near 7% hallucination, with GPT-5.4 and Gemini 2.5 Pro both at 7.0%. Claude's best entry, Opus 4.5, comes in at 10.9%.

Marketing tends to skip a catch here: the newest reasoning models often score worse rather than better. Gemini 3 Pro lands at 13.6%, close to double the error rate of its 2.5 Pro predecessor, and Claude Opus 4.6 (12.2%) trails Opus 4.5 (10.9%). Vectara reads this as reasoning models overworking the text and drifting away from the source. The explanation is sound; almost every one of us has already experienced a moment when the smart (and expensive) model turned out weaker than its dumber relative.

On SimpleQA Verified (Epoch AI), Gemini posts the best result: Gemini 3 Pro at 72.1% accuracy against 54.5% for Gemini 2.5 Pro. GPT-5's main model scores 46%. Claude Opus 4 lands around 54%, but only among the questions it chose to answer, and it attempted just 35.5% of them. Anthropic, for its part, does not publish SimpleQA in its system cards, which is worth keeping in mind when you compare how openly each lab reports its weak spots.
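
To make the gap between "accuracy among attempted questions" and "share of all questions answered correctly" concrete, here is a minimal arithmetic sketch. It assumes, as the paragraph above describes, that Claude Opus 4's roughly 54% applies only to the 35.5% of questions it attempted; the function name `outcome_shares` is made up for this illustration.

```python
def outcome_shares(accuracy_when_attempted, attempt_rate):
    """Split all questions into correct, wrong, and declined shares."""
    correct = accuracy_when_attempted * attempt_rate
    wrong = (1.0 - accuracy_when_attempted) * attempt_rate
    declined = 1.0 - attempt_rate
    return correct, wrong, declined


# Inputs taken from the text: ~54% accuracy on attempted questions, 35.5% attempted.
correct, wrong, declined = outcome_shares(0.54, 0.355)
print(f"correct {correct:.1%}, wrong {wrong:.1%}, declined {declined:.1%}")
```

Under those assumptions, roughly 19% of all questions are answered correctly, about 16% incorrectly, and the rest are declined. The same split cannot be computed for the other two models from this article, because their attempt rates are not given here.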

What the real news tests say (EBU/BBC, 45%–76% failure)

Take the same models off curated benchmarks and put them on live news, and the results shift sharply. In an October 2025 study, the EBU and BBC had working journalists at 22 broadcasters across 18 countries and 14 languages grade more than 3,000 answers. Of those responses, 45% included at least one significant problem, 31% had serious sourcing flaws, and 81% contained some error, even if just a minor one. Gemini was the weakest performer, with significant problems in 76% of answers and sourcing issues in 72%, roughly three times ChatGPT's rate.

A BBC study from February 2025 had ranked ChatGPT the strongest of that round at a 15% error rate, with Gemini at 34%. Both investigations found that models altered or made up 13% of quotes attributed to BBC articles. The Tow Center reached a similar verdict using eight AI search tools and 1,600 queries: more than 60% of citations were wrong and the tools tended to be wrong with confidence. Of 134 incorrect citations, ChatGPT hedged on only 15.

AI may be able to summarize documents without much trouble, but attributing open-web news stories is a much bigger problem — and one that matters much more to journalists.


What is the most accurate AI model?

The most honest answer: none. No LLM clears roughly 70% on Google's full FACTS Suite, leading to the bleak conclusion that there’s no “most accurate” AI model, only a set of trade-offs that change with the task.

Ask for closed-fact recall and Gemini 3 Pro leads at 72.1% on SimpleQA Verified and 68.8% on FACTS. Ask for document summarization and GPT-5.4 shares the top spot with Gemini 2.5 Pro, at 7.0% hallucination. Ask about that morning's news and the assistants as a group return a significant problem in close to half of their answers, with the weakest at 76%.

Accuracy vs calibration: why Claude "wins" by refusing

Claude's reputation for honesty rests on measurable data. On SimpleQA Verified, it declined to attempt nearly two-thirds of the questions. In the LumiChats run, it logged only 3 confidently wrong answers against ChatGPT's 8 and Grok's 14, and it did best on niche or ambiguous facts by signaling uncertainty instead of bluffing. Tom's Guide's stress test on the Iran strike pointed the same way: Claude stayed with verified sources while Gemini produced the most detailed answer and also the most invented one, down to fabricated times, names, and figures.

There is a strong case for treating this as the real win. Journalists can recover from not knowing a fact, but a confident fabrication that slips into print often causes more lasting damage.

In other conditions, though, Claude’s honesty fails. The most rigorous head-to-head on bibliography generation (Cabezas-Clavijo & Sidorenko-Bautista, 2025) tested free, search-less models and found Claude fabricating 64% of references, putting it below only Copilot and Perplexity for accuracy. Strip out live search and Claude's caution turns into confabulation from memory, showing that calibration depends on configuration rather than being a fixed property of the model.

Why smarter isn't truer (the o3/o4-mini paradox)

OpenAI's own system card records a paradox: a model's capability and its truthfulness have diverged. The o3 model hallucinated on 33% of PersonQA prompts, compared to 16% for the older o1, and the smaller o4-mini reached 48%. Reasoning models make more claims overall, which produces more correct answers and more fabrications at the same time. The Vectara reversals run in the same direction: Gemini 3 Pro is the newer and stronger model, and it summarizes less faithfully than the version it replaced.
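
The mechanism, more claims producing more correct answers and more fabrications at once, can be shown with simple arithmetic. This is a sketch under loud assumptions: the claim volumes (100 and 200) are hypothetical, and the 16% and 33% PersonQA figures, which the system card reports per prompt, are applied here as if they were per-claim rates.

```python
def split_claims(total_claims, hallucination_rate):
    """Return (correct, fabricated) counts for a given claim volume and rate."""
    fabricated = round(total_claims * hallucination_rate)
    return total_claims - fabricated, fabricated


# Hypothetical volumes: the newer model is assumed to make twice as many claims.
older_correct, older_fabricated = split_claims(100, 0.16)
newer_correct, newer_fabricated = split_claims(200, 0.33)

print(f"older model: {older_correct} correct, {older_fabricated} fabricated")
print(f"newer model: {newer_correct} correct, {newer_fabricated} fabricated")
```

With these made-up volumes the newer model ends up with more correct claims and more fabricated ones, so it can look more capable and less truthful in the same run.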

The McGill study from March 2026 found all four major assistants performing badly at attribution, with ChatGPT the worst at naming the originating outlet. Gemini covers more reporting but buries the source inside the prose. Claude, per the Reuters/Axios citation study, references news outlets least often of the group — twenty times less than Gemini and fifty times less than ChatGPT.


When to use Claude vs ChatGPT vs Gemini

| Task | Use | Why |
| --- | --- | --- |
| Your own documents/PDFs | Claude (Citations API) | grounds claims to exact sentences in a closed corpus |
| Multi-step research report | ChatGPT Deep Research | strongest autonomous research feature, with a source list per claim |
| Live news and current events | Gemini (Search grounding) | best raw accuracy on fresh facts via Google Search |
| Minimizing the risk of believing a fabrication | Claude | best calibrated, admits uncertainty |
| Academic bibliography | none of the three (use Elicit/SciSpace/Consensus) | all three fabricate from memory; DOIs often resolve to the wrong paper |
| News attribution (who published it) | none reliable; verify by hand | ChatGPT names the outlet worst (McGill); Gemini hides sources in body text; Claude rarely cites news at all |
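
For a newsroom that wants this guidance in a script or an internal tool, the table above could be encoded as a plain lookup. The sketch below is one possible encoding, not an established tool: the short task keys, the `ROUTING` name, and the `suggest` helper are all invented for this example, and the reasons are abbreviated from the table.

```python
# Rows condensed from the table above: task key -> (suggested tool, reason).
# A tool of None stands for the rows where the table recommends none of the three.
ROUTING = {
    "own documents": ("Claude (Citations API)", "grounds claims in a closed corpus"),
    "research report": ("ChatGPT Deep Research", "source list per claim"),
    "live news": ("Gemini (Search grounding)", "fresh facts via Google Search"),
    "avoid fabrication": ("Claude", "best calibrated, admits uncertainty"),
    "academic bibliography": (None, "all three fabricate references from memory"),
    "news attribution": (None, "no assistant credits outlets reliably"),
}


def suggest(task):
    """Look up a task; unknown tasks and None rows both fall back to manual checking."""
    tool, reason = ROUTING.get(task, (None, "task not covered by the table"))
    if tool is None:
        return f"verify by hand ({reason})"
    return f"{tool} ({reason})"


print(suggest("live news"))
print(suggest("news attribution"))
```

Falling back to "verify by hand" for anything the table does not cover is a deliberate choice in this sketch: the default should be the cautious path, not a guess at the nearest tool.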

The 70% ceiling and what it means for journalists

Look at the two facts side by side: the best lab hallucination rate is about 7%, while the overall live-news problem rate in the largest independent study still leaves close to half of answers flawed. The benchmarks that improved measure a task, faithful summary of an existing document, that journalists rarely face in practice. On open-web attribution of fresh news, a frequent duty in the industry, LLMs still lag far behind.

Chatbots came last, at 9%, among the tools respondents in Reuters Institute's 2025 report trust to verify information. This skepticism is valid when no leading model crosses roughly 70% on the full FACTS Suite.

Taking this data into account gives newsrooms a good operating rule: use models to find leads and rough out structure, then open every cited source by hand before anything runs. A confident citation from Claude, Gemini, or ChatGPT is a place to begin checking, not an end to the job at hand.
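
The "open every cited source by hand" rule is easier to follow with a checklist. The sketch below only extracts links from a model's answer and lists them as unverified; it does not fetch or validate anything. The regular expression is a rough assumption about what a URL looks like in running text, and `citation_checklist` and the `opened_by_editor` field are hypothetical names for this example.

```python
import re

# Rough pattern for http(s) links in prose; stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>'\"]+")


def citation_checklist(answer_text):
    """Return each distinct URL in an answer, in order of appearance, marked as unverified."""
    seen = []
    for match in URL_PATTERN.finditer(answer_text):
        url = match.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return [{"url": url, "opened_by_editor": False} for url in seen]


sample = (
    "Per the report (https://example.org/report), see also "
    "https://example.org/report and https://example.com/a."
)
for item in citation_checklist(sample):
    print(item)
```

A link appearing on this list says nothing about whether it supports the claim next to it. The list only tells an editor what still has to be opened and read.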

Questions

What is the most accurate AI model?
It depends on what you measure. On lab benchmarks, GPT-5.4 and Gemini 2.5 Pro lead document-faithfulness (~7% hallucination on Vectara HHEM), and Gemini 3 Pro tops factual recall (72.1% on SimpleQA Verified). But no model clears roughly 70% on Google's full FACTS Suite, and on live news queries 45% of answers across assistants had a significant problem, rising to 76% for the weakest. There is no single 'most accurate' model, only trade-offs.
When to use Claude vs ChatGPT vs Gemini?
Use Claude when you need calibration and honesty — it hedges or refuses rather than inventing, and its Citations API grounds answers to your own documents. Use Gemini for current events, because Google Search grounding gives it the best live-fact accuracy. Use ChatGPT for multi-step Deep Research reports. None is reliable for unverified citations.
Which AI model is most accurate?
Gemini 3 Pro currently leads independent factual benchmarks (72.1% on SimpleQA Verified, 68.8% on Google's FACTS suite), but Claude is best calibrated — it admits uncertainty instead of bluffing. Accuracy rankings flip depending on whether the task is closed-fact recall, document summarization, or live news, so a single winner does not exist.
Why do smarter AI models hallucinate more?
Reasoning models make more claims overall, so they produce both more correct answers and more fabrications. OpenAI's own o3 hallucinated on 33% of PersonQA prompts versus 16% for the older o1, and o4-mini reached 48%. More capability does not automatically mean more truthfulness.

Sources

References cited in this piece. Last verified on the published or revision date.

  1. Vectara Hallucination Leaderboard
     github.com/vectara/hallucination-leaderboard

  2. SimpleQA Verified — Epoch AI Benchmarks
     epoch.ai/benchmarks/simple-qa-verified

  3. SimpleQA Verified — Research Paper
     arxiv.org/abs/2509.07968

  4. Google DeepMind FACTS Benchmark Suite
     deepmind.google/blog/facts-benchmark-suite-systematically-evaluating-the-factuality-of-large-language-models

  5. Tow Center / CJR: We Compared Eight AI Search Engines — They're All Bad at Citing News
     www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

  6. EBU/BBC News Integrity in AI Assistants Report 2025
     www.ebu.ch/files/live/sites/ebu/files/Publications/MIS/open/EBU-MIS-BBC_News_Integrity_in_AI_Assistants_Report_2025.pdf

  7. BBC News Finds That AI Tools Distort Its Journalism Into a Confused Cocktail With Many Errors
     www.niemanlab.org/2025/02/bbc-news-finds-that-ai-tools-distort-its-journalism-into-a-confused-cocktail-with-many-errors

  8. ChatGPT, Claude, Gemini, and Grok Are All Bad at Crediting News Outlets — But ChatGPT Is the Worst
     www.niemanlab.org/2026/03/chatgpt-claude-gemini-and-grok-are-all-bad-at-crediting-news-outlets-but-chatgpt-is-the-worst-at-least-in-this-study

  9. Generative AI Models Love to Cite Reuters and Axios, Study Finds
     www.niemanlab.org/2025/07/generative-ai-models-love-to-cite-reuters-and-axios-study-finds

  10. OpenAI o3 and o4-mini System Card
      cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf

  11. Anthropic Citations API
      claude.com/blog/introducing-citations-api

  12. Reuters Institute Digital News Report 2025 — Executive Summary
      reutersinstitute.politics.ox.ac.uk/digital-news-report/2025/dnr-executive-summary

  13. LumiChats: Most Accurate AI — Claude, ChatGPT, Gemini, Grok 100-Facts Test
      lumichats.com/blog/most-accurate-ai-2026-claude-chatgpt-gemini-grok-100-facts

  14. Tom's Guide: I Tested ChatGPT, Gemini, and Claude on the Iran Strike — and One AI Fed Me Fake News
      www.tomsguide.com/ai/i-tested-chatgpt-gemini-and-claude-on-the-iran-strike-and-one-ai-fed-me-fake-news