In April 2025, Meta unveiled Llama 4 Scout with an announcement that made the AI world flinch and look up in disbelief: a context window of 10 million tokens (Meta AI Blog). Today, all flagship models are jostling around a claimed window of 1 million tokens, with Gemini leading the pack at 2 million.
"Claimed" is the operative word. The working — that is, real and effective — context window for all of them is several times smaller than advertised. Then, along comes 10 million tokens, in an official press release from an ostensibly serious company. One million tokens (in English) is roughly 10 average books or 100 long-form articles. Meta was claiming it could hold 100 books (or 1,000 articles) in its so-called "memory." Impressive? To anyone who bought it, perhaps, but prompt engineers don't traffic in illusions. The calculation was aimed at the semi-expert crowd: they'd reflexively knock 50% off the 10 million figure, land on five million, and say, "Look, we know this is PR, but we already cut it in half. There has to be something behind the number, right? A working window of 5 million; any way you slice it, that's a record."
The irony is that the providers actually deploying this model on their servers don't believe in fairy tales; Groq, for instance, hard-caps Scout's working context at 131,000 tokens. Beyond that: a wall.
What the industry got, in the end, was not the record the marketing team had in mind. It turned out to be a record for the gap between a claimed context window and a working one. Independent testers, rolling their eyes at the breathless "revolution" and "infinite memory" coverage, ran Llama 4 Scout through long-context benchmarks. The result? On Fiction.LiveBench, the model scores 15.6% at 128,000 tokens. Note: not at 10 million, at 128,000. The unassuming Gemini 2.5 Pro holds 90.6% at the same length. In other words, those vaunted 10 million tokens turn back into a pumpkin the moment they meet a real task (Fiction.LiveBench).
Aggregate data from independent long-context leaderboards paints an almost absurdist picture. The flagships — recent Claude generations, Gemini Pro, the senior GPT variants — pushed to roughly the one-to-two-million-token mark and stopped, apparently because there was nowhere worth going beyond that. Meanwhile, Llama 4 Scout, the only model claiming 10 million tokens, sits comfortably in the bottom half of the overall long-context comprehension rankings.
What Is a Context Window — and Why It's Not Memory
People just getting started with AI engineering usually assume the model has memory. They've already gotten comfortable with tokens; they know a long-form article is around 10,000 tokens and a page of fine-but-still-readable PDF text is about 1,000. In fact, the model pulls from a context window, into which everything gets loaded fresh before each response: your question, instructions, documents, conversation history, user preferences and habits. Close the tab and the buffer is empty. The model has no memory, and the AI doesn't remember you; it never did. Sure, an AI provider can bundle a saved profile about you along with every prompt you send, creating the illusion of memory, but it's just an illusion. If the system doesn't send the saved profile (for example, if you open an incognito window), you'll immediately get: "Hello, and who exactly are you?"
For anyone who writes for a living, this leads to a simple and deeply uncomfortable conclusion: a novel manuscript — say, 150,000 words, which works out to more than 200,000 tokens — already doesn't fit within the honest working length of most models. The plan of "I'll just paste in the whole novel and ask" falls apart sooner than you'd expect. As context load increases, the model's comprehension likely collapses in a specific order. First goes its grasp of the novel's overall structure, then the character arcs (if they weren't fed in as a separate file), and last of all the needle-in-a-haystack test: finding one specific fact is actually the easiest thing for the model to do. Which is exactly why marketers love that particular benchmark so much, even though it's rarely useful to real users — and to authors in particular.
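The manuscript arithmetic is easy to check for your own project. Here is a minimal sketch, assuming the common rule of thumb that one token is about 0.75 English words; real tokenizers will give somewhat different counts, and the 130,000-token working window passed in below is only an illustrative value, not a property of any particular model.

```python
import math

# Assumption: ~0.75 English words per token (a rule of thumb, not a tokenizer).
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough token estimate for English prose."""
    return math.ceil(word_count / words_per_token)

def fits(word_count: int, working_window: int) -> bool:
    """True if the estimated token count fits inside the given working window."""
    return estimate_tokens(word_count) <= working_window

novel_words = 150_000
print(estimate_tokens(novel_words), fits(novel_words, working_window=130_000))
```

Swap in your own word count and whatever working length your tests actually support.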
Everyday intuition trips over another wrinkle. When a person and a model interact directly, without trained intermediaries, the person quickly runs into an uncomfortable reality: there is no hierarchy inside the context window. The model doesn't read text sequentially, filing the important parts away on a shelf; it recalculates the relationships between all tokens, each against every other, from scratch, every single time. This is the self-attention mechanism that all transformers have been built on since 2017 (Vaswani et al., "Attention Is All You Need").
This is where the math gets genuinely depressing. Attention complexity is quadratic: double the context length and you get four times the computation. Triple it and you get nine times. Going from 1 million to 10 million tokens head-on means a hundredfold increase in computational load, and a hundredfold (or greater) increase in cost. To get around those impossible numbers, the engineers who built Llama were forced to reach for architectural tricks.
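The scaling above can be checked in a few lines. This is a sketch of the naive case only: it assumes the cost of full self-attention grows with the number of token pairs, that is, with the square of the context length, and it ignores every optimization real systems use.

```python
def attention_cost_ratio(old_len: int, new_len: int) -> float:
    """Relative cost of full self-attention when context grows from old_len to new_len.

    Assumes cost is proportional to the number of token pairs (length squared).
    """
    return (new_len / old_len) ** 2

for factor in (2, 3, 10):
    ratio = attention_cost_ratio(1_000_000, factor * 1_000_000)
    print(f"{factor}x longer context -> {ratio:.0f}x attention compute")
```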

How Llama 4 Scout Got a 10-Million-Token Context Window
You might ask, how did they get a context window that big? Here's how: behind the headline figure of 10 million tokens there isn't one big engine; instead, there are three architectural sleights of hand. Let's walk through each one.
MoE — the Model That Doesn't Call Everyone Into the Meeting
The first trick is called Mixture-of-Experts (MoE). Instead of one massive network, there are several specialized sub-networks and a router that decides which one to call in for any given token. Think of a well-run meeting: you don't summon every employee in the company, you bring in only the experts who know the specific issue at hand. Everyone else keeps working at their desks. Or gets a coffee. Doesn't matter.
Llama 4 Scout has 109 billion parameters spread across 16 experts, but only 17 billion — roughly 15% — are actually involved in any given token's computation (Meta AI Blog). That's precisely why a model this size can realistically run on a couple of industrial AI accelerators rather than an entire server rack. Without MoE it couldn't run at all. Full stop.
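To make the meeting analogy concrete, here is a toy top-k router in plain Python. It is a sketch of the general Mixture-of-Experts idea, not Meta's implementation: the random router weights, the "experts" that merely scale their input, and the choice of one expert per token are all assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, top_k=1):
    """Return (expert_index, gate_weight) pairs for the top_k experts of one token."""
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

def moe_forward(token_vec, router_weights, experts, top_k=1):
    """Run only the selected experts and mix their outputs by normalized gate weight."""
    chosen = route(token_vec, router_weights, top_k)
    norm = sum(gate for _, gate in chosen)
    out = [0.0] * len(token_vec)
    for idx, gate in chosen:
        expert_out = experts[idx](token_vec)
        out = [o + (gate / norm) * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]

random.seed(0)
DIM, NUM_EXPERTS = 4, 16
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Toy experts: expert i simply multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_forward(token, router_weights, experts)
print(f"experts used for this token: {used} of {NUM_EXPERTS}")
print(f"active share at the published sizes: {17 / 109:.1%}")
```

The other fifteen experts are never called for this token, which is where the compute saving comes from.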
iRoPE — Layers With a Sense of Place, and Layers Without
For a model to understand the order of tokens, it needs positional encoding, otherwise the text collapses into an incoherent bag of words. The standard approach of recent years is RoPE (Rotary Position Embeddings): position is encoded by rotating the token vector by an angle that depends on its index (Su et al., 2021). It works beautifully, right up until the sequence length pushes beyond what the model was trained on. At millions of tokens, RoPE starts getting things wrong, and it does so with confidence.
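The rotation idea fits in a few lines. This is a minimal sketch of rotary encoding in the spirit of Su et al., using plain Python lists and the conventional base of 10000; the vectors and positions are made up. It shows the property that makes RoPE attractive: the dot product between a rotated query and a rotated key depends only on how far apart the two tokens are, not on where they sit.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

# Same relative offset (3 tokens apart) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))
print(abs(near - far) < 1e-9)
```

The math holds at any position. What breaks at millions of tokens is the model's experience: it has never seen those rotation angles during training.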
Here, Meta makes a paradoxical move: you read the spec and do a double-take — is this a typo? In iRoPE (interleaved RoPE), layers alternate: some use RoPE, others use NoPE (No Positional Encoding), meaning no positional information whatsoever (ApX Machine Learning). Wait, no position at all? The model has no idea where any given word sits, and this is supposed to be a solution? Apparently so. The RoPE layers maintain local structure, while the NoPE layers rely on the causal mask and semantics. What else were the engineers supposed to do? Encoding a position for token number 4,832,119 is meaningless; the model was never trained on sequences anywhere near that long.
Does this causal mask actually work? Does the semantics save it? How exactly was it wired in? Nobody knows. No independent experts have measured the NoPE layers' contribution in isolation; benchmarks hit the architecture as a whole.
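The interleaving itself is simple to picture. Here is a sketch that labels a stack of layers, assuming for illustration that every fourth layer skips positional encoding; the article's sources do not give the ratio, so treat `nope_every=4` as a placeholder.

```python
def layer_plan(num_layers: int, nope_every: int = 4) -> list[str]:
    """Label each layer 'rope' or 'nope'; every nope_every-th layer skips positions.

    nope_every=4 is an illustrative assumption, not a figure from the article.
    """
    return ["nope" if (i + 1) % nope_every == 0 else "rope" for i in range(num_layers)]

print(layer_plan(8))
```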
Attention Temperature
The third trick is the most unassuming. On long contexts, attention diffuses: instead of looking in the right place, the model starts gawking at everything at once like a tourist in Times Square on a Friday night, where everything is bright and vying for attention. The fix is temperature scaling in the attention formula, which sharpens the distribution so focus doesn't blur out (Meta AI Blog). Does it help? It does, on tens of thousands of tokens. On millions, probably not. But you won't find the answer to that question in a press release. That's not what press releases are for.
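Both the diffusion and the fix show up in a toy softmax. The sketch below does not reproduce Meta's scaling formula: it assumes one relevant token with a higher score, a crowd of identical distractors, and a constant multiplier on the logits standing in for the temperature adjustment. All the numbers are made up.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax over logits multiplied by scale; scale > 1 sharpens the distribution."""
    scaled = [scale * x for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weight(context_len, scale=1.0, target_logit=5.0, noise_logit=0.0):
    """Attention weight on one relevant token among (context_len - 1) distractors."""
    logits = [target_logit] + [noise_logit] * (context_len - 1)
    return softmax(logits, scale)[0]

for n in (1_000, 100_000):
    print(n, round(peak_weight(n), 4), round(peak_weight(n, scale=2.0), 4))
```

The weight on the relevant token shrinks as the crowd grows, and the sharper softmax claws some of it back. Whether that still holds at millions of tokens is exactly what this toy cannot tell you.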
The Thousand-Story Skyscraper
Say your work involves architectural firms that design skyscrapers. One of them announces: their engineers have developed a way to build not a 100-story building like the competition, but a 1,000-story one. Two to two-and-a-half miles tall. You'd do a slow double-take and ask: did they discover some special composite material? — No, they didn't, they tell you. The solution is purely architectural, built on existing technology.
You'd heave a heavy sigh — and rightly so. The difference between real skyscraper architects and the architects of Llama comes down to a few fundamental things: a sense of accountability to the market and to clients, a deep-seated aversion to cheap hype, and the absence of a crowd of wide-eyed investors at the door, ready to believe anything as long as it fits the expectations of an overheated market. That's why reputable architectural firms don't put out press releases about 1,000-story skyscrapers very often.
And what are the Llama architects actually risking? Will their building buckle under the load and collapse? No. All they need for a jaw-dropping press release is to make sure that when you stuff 10 million tokens into the model, it doesn't immediately crash with an OOM (out of memory) error and at least acts like it's still working. And that they can explain to the press why it can, in principle, handle those 10 million tokens. That, in essence, is the brilliant architectural achievement: keeping up appearances and having a good explanation ready.
And how will this 10-million-token model actually perform in the real world? It won't. First comes slowness, stuttering, and heavy wheezing. But slowness is only half the problem — there are tasks that can afford to wait. The heavy wheezing, however, gives way to hallucinations and hangs by the 60,000-token mark. At just 0.6% of the claimed 10 million tokens, the model loses the thread — and never finds its way back.
And there you have it — the complete blueprint for a 1,000-floor skyscraper: a smart elevator that doesn't take everyone up at once, a floor-numbering system that quietly stops pretending to be accurate above the hundredth floor, and a tweaked altimeter. Each solution, taken on its own, is solid engineering — no irony intended. But do they add up to a building where you can actually reach the 966th floor? No. Definitely not.

Lost in the Middle: Why LLMs Forget What's in the Center
In 2023, Nelson F. Liu and colleagues published a paper with a title that said it all: "Lost in the Middle: How Language Models Use Long Contexts." The experiment was straightforward: take several large language models, give them a long context with one critical fact buried inside, then move that fact around and watch the accuracy. What emerged was a textbook U-shaped curve. Models read the beginning and end carefully, but the middle falls apart. If the key fact lands in that blind spot, the model simply doesn't see it. It looks right at it and sees nothing.
It's the same way a student before an exam reads the first thirty pages of a textbook with real focus (taking notes, thinking it through) and tears through the last ten in a panic outside the exam room door. The two hundred pages in between? Somehow they got skimmed. The effect was named lost in the middle, and three years in, it hasn't gone anywhere. It reproduces across models from OpenAI, Anthropic, Google, and Meta: every system that's been put to the test (confirmed on the RULER benchmark, 2024).
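You can run a simplified version of this experiment against any model you use. The sketch below only builds the prompts: it plants one fact at different relative depths in filler text. The filler sentences, the "vault code" fact, and the function names are hypothetical, and the original paper's setup was more elaborate than this.

```python
def place_needle(filler: list[str], needle: str, depth: float) -> str:
    """Insert needle into filler sentences at a relative depth in [0.0, 1.0]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])

def depth_sweep(filler, needle, steps=5):
    """Yield (depth, prompt) pairs from the start of the context to the end."""
    for step in range(steps):
        depth = step / (steps - 1)
        yield depth, place_needle(filler, needle, depth)

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The vault code is 7431."  # hypothetical fact to retrieve
for depth, prompt in depth_sweep(filler, needle):
    # A real harness would send `prompt` plus a question to the model here
    # and record whether the answer contains the needle.
    print(f"{depth:.2f}", round(prompt.index(needle) / len(prompt), 2))
```

Plot accuracy against depth for your own model and you can see for yourself whether the U-shape appears.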
The Accuracy Cliff Is a Drop, Not a Slope
You might ask: is the degradation gradual? No, it's not. A graph of quality versus context length doesn't look like a gentle slide down a hill; it looks like a table that just had two legs shot out from under it in a Western. The 2025-generation models that advertised 200,000-token context windows held up reliably to around 130,000 tokens, then accuracy fell off a cliff, a phenomenon the industry has simply taken to calling the accuracy cliff (NVIDIA RULER repository).
The ~130,000 token figure is an empirical pattern specific to this model generation. Experienced users, unlike the marketing teams, trust only their own tests and try not to feed more than 130,000 tokens to a model at once. When they do, they don't hold their breath for a coherent result.
By 2026, flagship models had pushed factual recall at one million tokens to 96% and above. Does that mean the cliff disappeared? No, it just changed its nature. Models got better at locating individual facts, but according to comprehension benchmarks, they didn't get better at connecting what they found across those lengths. Find a fact: yes. Build a chain out of 20 to 50 facts: no. Benchmarks like RULER have shown this empirically: as soon as the task scales up from finding one needle to extracting and aggregating several, the effective context window, even for flagship models, shrinks severalfold.
Proactive Interference: It Turns Out Models Have Psychology
The most striking explanation came from an unexpected direction. In 2025, researchers applied the concept of “proactive interference” to language models, a term from cognitive psychology describing a situation where old information gets in the way of absorbing new information. For example, you learned a work password, then it changed, and yet you keep typing the old one, cursing yourself every time. That's exactly it.
It turned out that models suffer from exactly the same problem. The more distracting context appears before a target fact, the worse the model is at retrieving that fact, and the relationship is log-linear (Wang et al., "Unable to Forget", 2025). A neural network trained on the sum of human writing has inherited humanity's memory problems. It sounds poetic, but it comes at a steep price.
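"Log-linear" has a practical meaning worth spelling out: every tenfold increase in distracting context costs the same fixed amount of accuracy. Here is a sketch of that shape, with `base` and `slope` as purely hypothetical placeholders rather than values from the paper.

```python
import math

def recall_under_interference(n_distractors: int, base: float, slope: float) -> float:
    """Log-linear model: accuracy = base - slope * ln(n_distractors).

    base and slope are hypothetical placeholders, not measured values.
    """
    return base - slope * math.log(n_distractors)

def acc(n):
    return recall_under_interference(n, base=1.0, slope=0.1)

# The drop from 10 to 100 distractors equals the drop from 100 to 1,000.
first_drop = acc(10) - acc(100)
second_drop = acc(100) - acc(1_000)
print(abs(first_drop - second_drop) < 1e-9)
```

In other words, under this model there is no safe plateau: more preceding clutter always costs something.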

What Works Instead: RAG, Memory Agents, and Context Engineering
Let's take stock: throwing 10 million tokens at a model head-on doesn't work. A bare language model on a long context loses to a human. Marketing promises one thing; benchmarks demonstrate another. So what's a practitioner supposed to do when there's an important task and a deadline tomorrow evening?
The good news: after a couple of years of bumping their heads against this problem, the industry has come up with several approaches that actually work. None of them sound as impressive as "10 million tokens," which is why they never make it into press releases, but they deliver.
RAG — Give the Model a Search Engine, Not a Library
The oldest and most honest technique is RAG (Retrieval-Augmented Generation). The idea is almost embarrassingly simple: instead of dumping everything into the context at once, you search a database for relevant chunks and feed only those to the model.
Let's walk through a concrete scenario. You have 10,000 pages of corporate logs and the question, "What error occurred on Tuesday at 2:30 PM?" The brute-force approach loads everything at once, runs up a bill for millions of tokens, and prays the model doesn't lose the relevant entries somewhere in the middle. With RAG, you work smarter: you run a search against an index, pull out five relevant records (~2,000 tokens), and hand those to the model. The cost difference? A thousandfold. The accuracy difference? Decisively in RAG's favor, because the model will actually read 2,000 relevant tokens carefully, whereas it won’t do the same for 10 million tokens.
A writer's version of the same scenario: a trilogy and the question "which chapter did the hero break his arm?" The smart move isn't to feed all three volumes to the model — it's to find the three scenes that mention the arm and show the model only those.
Is it free? No. RAG requires a pipeline — indexing, embeddings, a vector database — meaning real engineering work. But it works; effective structure beats raw volume.
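The core loop is small enough to show. What follows is a deliberately crude sketch: it scores chunks by bag-of-words cosine similarity instead of embeddings, keeps everything in a Python list instead of a vector database, and uses made-up log lines. A production pipeline would replace each of those pieces, but the shape stays the same: retrieve first, then prompt.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9:]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by bag-of-words cosine similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Put only the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical log lines, for illustration only.
logs = [
    "Monday 09:15 INFO nightly backup completed",
    "Tuesday 14:30 ERROR database connection timeout",
    "Tuesday 16:05 INFO cache warmed",
    "Wednesday 11:00 WARN disk usage at 85 percent",
]
question = "What error occurred on Tuesday at 14:30?"
print(build_prompt(question, logs, top_k=1))
```

The model never sees the other lines, so there is no middle for the answer to get lost in.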
LOCOMO — the Benchmark Marketing Won't Quote
So how do you test memory honestly? Researchers from UNC Chapel Hill, USC, and Snap Inc. asked that question and assembled the LOCOMO benchmark (Long-term Conversational Memory). It consists of long, multi-session dialogues simulating months of interaction, averaging 19 sessions and 9,200 tokens per dialogue, with strict temporal anchoring (Maharana et al., 2024). What it tests isn't the needle-in-a-haystack retrieval that marketers love to cite, but genuine reasoning: what came first, what came later, how facts connect, and whether they contradict each other. In other words, what memory actually does.
The results are a bucket of ice water to the face. Humans score around 88 on the F1 metric. A bare language model fed the entire conversation directly into its context window scores around 38. Maybe expanding the window will fix things? No — adding length to the context window does absolutely nothing; what actually works is adding structure. Systems that index the conversation, build a relationship graph, and feed the model only the relevant nodes consistently outperform a bare context window while working with a fraction of the context.
The takeaway here is brutal for anyone writing a press release: a bare context window loses to a human by a landslide, while smart structure closes that gap with no magical intelligence behind it, just indexing, a graph, and retrieval logic. Boring? Yes, but it works.
Agentic Memory — When the Conversation Lasts for Months
The next level up is systems that can search and remember. If RAG is a library catalog, an Agentic Memory system is a librarian with a notebook.
Here's how it works: conversation history is broken down into discrete facts, stored in a knowledge graph, and anchored to a timeline. When a new question comes in, the system doesn't re-read all 35 previous sessions, it pulls the relevant graph nodes and feeds them to the model. The approach is called Agentic Memory, and the graph itself typically lives in a database like Neo4j. The real advantage is that it addresses a fundamental weakness of long contexts: temporal reasoning (figuring out what came before what, and how facts evolved over time). A bare language model handles this poorly; an agent with a graph handles it well.
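Stripped of the database, the librarian's notebook looks roughly like this. The sketch is a toy in-memory stand-in, not Neo4j and not any specific product: the `Fact` and `MemoryGraph` names, the fields, and the sample facts are all hypothetical. It shows the one thing a bare context window struggles with, which is keeping facts in time order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    value: str
    observed_on: date

class MemoryGraph:
    """Toy in-memory stand-in for a graph store; facts are edges keyed by subject."""

    def __init__(self):
        self._by_subject = {}

    def add(self, fact: Fact) -> None:
        self._by_subject.setdefault(fact.subject, []).append(fact)

    def timeline(self, subject: str, relation: str) -> list:
        """All facts for (subject, relation), oldest first."""
        facts = [f for f in self._by_subject.get(subject, []) if f.relation == relation]
        return sorted(facts, key=lambda f: f.observed_on)

    def current(self, subject: str, relation: str) -> Optional[Fact]:
        """Most recent fact, i.e. how things stand now."""
        history = self.timeline(subject, relation)
        return history[-1] if history else None

memory = MemoryGraph()
# Facts arrive out of order, as they would across many sessions.
memory.add(Fact("Anna", "lives_in", "Lisbon", date(2025, 6, 2)))
memory.add(Fact("Anna", "lives_in", "Berlin", date(2025, 1, 10)))

print([f.value for f in memory.timeline("Anna", "lives_in")])
print(memory.current("Anna", "lives_in").value)
```

Only the handful of facts returned by `timeline` or `current` would be handed to the model, not the sessions they came from.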
What if you need to keep that memory local, like on a corporate laptop or a phone, without sending data to the cloud? The industry has an answer: dynamic adapters, as in the MemLoRA approach. Instead of caching a massive context in working memory, the system distills key facts into tiny micro-weights (adapters). These are stored directly on the device and loaded into the model on the fly, turning a static neural network into a flexible system that learns as it goes.
Context Engineering — A Discipline Worth Capitalizing
Built on top of all this is a distinct engineering practice: Context Engineering. Two years ago, prompt engineering was about how to phrase a query. Context Engineering is about how to architect the system around the model so that the right information lands in the context window.
IBM, Anthropic, and Meta converge on several principles in their guidance (IBM Think, 2026; Anthropic prompting docs). First: relevance first — every token in the context window must earn its place, because noise actively hurts. Second: compression over completeness — distilled facts instead of raw data dumps. Third: provenance — every piece of information is traceable back to its source. Fourth: structured note-taking — for long tasks, the model maintains a running log rather than re-reading history from scratch.
In practice — as Anthropic's guidelines prescribe, for example — this translates into strict formatting rules: long target documents must be wrapped in clear XML tags, and control instructions placed at the very end of the prompt, to counteract attention degradation.
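As a rough illustration of that layout, here is a sketch that wraps each document in tags and puts the instruction last. The tag names and the function are assumptions chosen for readability, not a required schema; check your provider's documentation for its preferred format.

```python
def build_long_context_prompt(documents: dict[str, str], instruction: str) -> str:
    """Wrap each document in XML-style tags and place the instruction at the end."""
    parts = ["<documents>"]
    for source, text in documents.items():
        parts.append("<document>")
        parts.append(f"<source>{source}</source>")
        parts.append(f"<content>\n{text}\n</content>")
        parts.append("</document>")
    parts.append("</documents>")
    parts.append("")
    parts.append(instruction)
    return "\n".join(parts)

# Hypothetical inputs.
docs = {
    "chapter_01.txt": "The hero leaves the village at dawn.",
    "chapter_02.txt": "He breaks his arm crossing the river.",
}
instruction = "Using only the documents above, say in which chapter the hero breaks his arm."
print(build_long_context_prompt(docs, instruction))
```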
You can hardly call it a prompt in the conventional sense anymore. This is a full-blown data architecture wrapped around a language model, with indexing, routing, versioning, and logging. It might be boring, but it actually works, unlike 10 million tokens.
What to Choose
If you have a document under 130,000 tokens and a one-off query, use direct context: cheap and cheerful, works on almost any modern model. If you need the same thing but without the hassle of manual copy-pasting and with honest source citations, use Google NotebookLM: upload your files, ask your questions, and the model answers strictly from your data.
If you're dealing with long conversations, support threads, corporate history, or narrative continuity, you need RAG plus graph memory. There's no real out-of-the-box solution here; you'll either have to write code yourself or build a pipeline in a visual environment like Flowise or Langflow, though someone still has to configure that, too.
If you need local deployment, privacy, or have hardware constraints, the MemLoRA approach fits the bill: a small model plus adapters. One important caveat is that this requires an engineering build tailored to your specific hardware.
And for datasets north of a million tokens, use iterative chunking (the SnowBall pattern): slice into overlapping chunks, run them through the model sequentially, aggregate the facts. Flowise or Langflow work here too, but again, someone has to put it all together and get it running.
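One possible reading of that pattern, sketched in plain Python: split the input into overlapping windows, process them in order, and carry the accumulated facts forward. The `extract` callback stands in for a model call, and the "ERR" tokens are made-up data; the chunk size and overlap you need will depend on your model and material.

```python
def overlapping_chunks(tokens: list, size: int, overlap: int) -> list:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

def snowball(tokens, size, overlap, extract):
    """Run extract(chunk, facts_so_far) over each chunk in order, accumulating unique facts."""
    facts = []
    for chunk in overlapping_chunks(tokens, size, overlap):
        for fact in extract(chunk, facts):
            if fact not in facts:
                facts.append(fact)
    return facts

def extract_errors(chunk, facts_so_far):
    # Stand-in for a model call: here a "fact" is any token that starts with "ERR".
    return [t for t in chunk if t.startswith("ERR")]

tokens = "ok ok ERR42 ok ok ok ERR7 ok ok ERR42".split()
print(overlapping_chunks(list(range(10)), size=4, overlap=1))
print(snowball(tokens, size=4, overlap=1, extract=extract_errors))
```

The overlap exists so that a fact straddling a boundary appears whole in at least one chunk; the deduplication keeps it from being counted twice.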
All of these solutions share one thing: they design effective structure instead of piling on raw volume.

Elephant Memory, a Hole in the Middle
Let's draw a line under this whole story. Llama 4 Scout, 10 million tokens, 15,000 pages. It sounded like a magic trick, and it turned out to be one, complete with terms and conditions printed in fine print on the back of the box. To be fair, the architecture is honest: iRoPE and Mixture-of-Experts (MoE) allow the model to stay on its feet while swallowing that kind of volume. But 'didn't crash' and 'actually read' are two very different results. Engineers know this perfectly well. Which brings us to the main practical takeaway: don't take press release numbers at face value. "10 million tokens" in a marketing brochure and "10 million tokens" on a benchmark like Fiction.LiveBench are two very different numbers.
Afterword Without a Moral
Is there anyone to name and shame when all is said and done? Not really. The 10 million token story isn't an exposé, and it's certainly not a scandal. What we have here is the industry's normal cycle, playing out the same way it always does: engineers do the impossible, marketers sell it as magic, practitioners learn the hard way where the limits are, researchers explain where those limits come from, workarounds emerge — and everything settles down until the next press release.
Give it a year or two, and someone will inevitably ship a model that genuinely handles a million tokens without losing the middle. And then what? Then we'll chuckle at the days when accuracy fell off a cliff at 130,000 just like we chuckle now at GPT-3's context window of a measly 2,000 tokens. Laughing at yesterday's ceilings is one of the industry's oldest traditions.
For now, we work with what we have: an AI with an elephant's memory that forgets the entire vast elephant middle. It sounds absurd, but this is the stage of technological maturity where boring, honest engineering practices start beating polished slide decks.
And that, perhaps, is the best news in this whole story about 10 million tokens. Here at the editorial desk, we live by this rule: effective structure beats raw volume. We'll have plenty more to say about that.