SilentRoom Journal

Wall Street Hallucinations: How OpenAI's Legal Advisors Brought Invented Precedents Into Court

editorial@silentroom.ai (Aaron Miller) — Tue, 23 Jun 2026 16:37:07 GMT

Recipe for a Good Scandal

The perfect scandal piece requires a few obvious ingredients. It needs respected names and institutions with serious reputations to lose. The underlying problem should be presented as simply and accessibly as possible — the more complex it actually is, the more important that becomes. The specific case used as an example must illuminate a broader public issue, ideally one that's fresh enough to still have some shock value. Above all, the reader must be left with absolutely no room to shrug, side with the accused, and say, "Honestly, who cares?"

Let's run the stress test.

Manhattan vs. Cambodia

The lineup in this story couldn't be more stark. On one side: the Cambodian criminal syndicate Prince Group, which operates a network of private forced-labor camps. They industrialized pig-butchering, the systematic draining of victims' money. In this scheme, you might meet an attractive woman online and find that she falls for you within a week, then casually mentions "a small crypto investment opportunity — my uncle's in the industry." In reality, there's no woman on the other end. There's a prisoner who gets beaten if he doesn't meet his quota.

On the other side: the U.S. Department of Justice, which initiated a $15 billion asset forfeiture case against Prince, the largest in history. Their legal weapon of choice in this case is Sullivan & Cromwell, an absolute titan of the Am Law 100, with offices at the most expensive address in Manhattan, a track record of nine-figure deals, and a reputation beyond reproach. S&C also officially advises OpenAI on the "safe and ethical deployment of artificial intelligence." That one detail is what turns a routine failure into a perfect scandal.

Precedents That Never Existed

When S&C filed its complaint, it sought bitcoin assets scattered across the globe through an intricate web of offshore entities. The goal was a noble one: not just to seize the stolen funds, but to return them to victims.

The syndicate's attorneys, far less celebrated in legal circles, demolished the filing with ease. They found dozens of AI hallucinations embedded in the complaint: citations to nonexistent precedents, references to cases that were never decided because they were never real. Everything that could theoretically be hallucinated had been, and it had all made its way into documents submitted to a federal court.

The Wall Street elite folded without a fight. S&C partner Brian Dietderich sent the judge a letter of contrition. In it, he began by helpfully explaining the nature of AI hallucinations — in case the judge hadn't read the news in the past two years — and then walked through S&C's exemplary internal policies for working with AI systems:

Two mandatory training modules before AI access. Completion is tracked and confirmed.
The training repeatedly emphasizes the risk of hallucinations.
Instructions direct attorneys to "trust nothing and verify everything."
Failure to independently verify constitutes a violation of firm policy.

The letter's conclusion is staggering in its absurdity. Dietderich astutely observes: "The firm's policy regarding AI use was not followed in preparing the motion." He adds that some of the problems appear to have arisen "in whole or in part due to human error."

S&C has taken a serious reputational hit, earning a place in the unofficial AI Hall of Shame. Journalists couldn't resist pointing out the delicious irony: the firm that publicly advises OpenAI on ethics and responsible algorithm deployment had itself been burned by an LLM.

A Systemic Vulnerability

Above the Law, which dissected this failure with surgical precision, delivered the verdict: "This isn't an S&C problem. It's a problem for the entire profession."

Far from New York and Cambodia, in Paris, researcher Damien Charlotin maintains a public registry of AI hallucinations identified in court cases. At the time of publication, it already contained approximately 1,300 documented instances. Each entry represents an attorney who cited a fiction and a judge who had to deal with the fallout. How many times courts have accepted a hallucination as genuine precedent remains beyond the scope of both the registry and the press.

The judicial system knows how to defend itself and typically hits back hard. Cases built on fabricated sources are dismissed with prejudice. The attorneys themselves face fines, suspension, and disbarment. In Nebraska, attorney Greg Lake was indefinitely disbarred, in part for attempting to conceal his use of AI by lying about a "broken laptop." In Colorado, Zacharia Crabill received a one-year suspension for a similar attempt to deceive the court.

Against the backdrop of these very public professional executions, we return to the Wall Street elite. Sullivan & Cromwell attorneys walk into court with a fully hallucinated $15 billion lawsuit. They get caught red-handed. What does Judge Martin Glenn do? Nothing. No fines. No suspended partners. No disbarments. Justice simply looked the other way when confronted with the hallucinatory tendencies of the people officially teaching ethics to AI systems.

Stress Test Complete

The effort to seize $15 billion is now stalled on three fronts: resistance from the syndicate's lawyers, procedural chaos caused by S&C's AI hallucinations, and geopolitics. China has already labeled the U.S. action "theft" and is preparing to assert its own claim to the bitcoin.

Years may pass before victims see any actual compensation. The U.S. Department of Justice still needs to clear the seizure, unravel a web of offshore entities, and verify thousands of victims. The principal defendant, Prince Group head Chen Zhi, was detained in Cambodia in early 2026. Instead of being extradited to the United States, however, he was quietly deported to China, where he has since vanished from press coverage entirely.

The arrest did nothing to disrupt the crypto cartel's operations. Thousands of prisoners in Cambodian compounds are still online, still making new connections, still methodically working through their next victims in the pig-butchering pipeline.

Claude vs ChatGPT vs Gemini for Writing: Which AI Writes Best in 2026?

editorial@silentroom.ai (Joseph Smith) — Mon, 22 Jun 2026 17:02:26 GMT

Picture a woodworker's bench. There's a chisel sharpened so many times it carves a curl of oak like warm butter. Next to it, a cordless multitool with forty attachments that can do almost anything, fast, as long as you tell it exactly what you want. And bolted to the corner, a microscope wired to the entire internet, perfect for checking whether the grain you're about to cut is walnut or just stained pine.

None of these is "the best tool." The chisel can't measure moisture content. The microscope can't shape a chair leg. The whole game is knowing which one to pick up. And that, more than any leaderboard, is the real story of Claude vs ChatGPT for writing in 2026 — with Google's Gemini as the third tool on the bench.

So let's stop treating this like a horse race and start treating it like a workshop. By the end you'll know which tool to reach for, whether you're drafting a novel, a legal memo, or a sonnet you'll deny writing.

Quick verdict (TL;DR)

If you only read one section, read this one. As of June 2026, here's the short version of the Claude vs ChatGPT vs Gemini for writing question, sorted by what you're actually trying to make.

Use case	Winner	One-line why
Long-form prose & books	Claude	Holds voice and rhythm over thousands of words with the least "robot wrote this" residue.
Creative writing & fiction	Claude	Best at subtext, character, and emotional nuance; least cliché-prone.
Essays & non-fiction	Claude	Natural tone, fewer buzzwords, reads like a thoughtful human essayist.
Academic writing	ChatGPT	Strongest structure and the safest single tool for a coherent draft — verify every citation.
Legal & professional	Claude	Controlled, consistent professional tone; great at harmonizing long documents.
Literary translation	Claude	Preserves register, idiom, and literary voice, not just literal meaning.
Poetry & short form	Claude	Most human-sounding, least reliant on the usual AI flourishes.

Who each tool is for: Claude is the chisel — reach for it when prose quality and voice are the whole point. ChatGPT is the multitool — reach for it when structure, speed, and a deep ecosystem of features matter more than literary polish. Gemini is the microscope — reach for it when your writing has to be welded to fresh, web-sourced facts and you live inside Google Workspace.

One honest caveat before we go further: prose quality has no scoreboard. In blind tests, ordinary readers often can't reliably pick "the best" passage, and the gaps between these three are smaller than the internet pretends. Treat every verdict here as "as of June 2026" and as a strong tendency, not a law of physics.

The 2026 writing landscape: what changed

A year ago, the Claude AI vs ChatGPT for writing debate was mostly about which model could string a coherent paragraph together. That war is over. All three can write. The interesting questions moved up the stack: how controllably can each one produce sustained, stylistically consistent prose, and how well does it behave when you hand it a 200-page manuscript and an opinion about chapter seven?

The current flagships (Opus 4.8, GPT-5.5, Gemini 3.5 Pro)

The bench got three new tools this spring. OpenAI shipped GPT-5.5 on April 23, 2026, and pitched it as "agentic-first" — optimized for coding, computer use, online research, and document workflows rather than being a pure prose machine. Writing benefits indirectly: it's better at the chores around writing, like collecting citations and reshuffling outlines.

Anthropic released Claude Opus 4.8 on May 27–28, 2026, describing it as a "substantial checkpoint" rather than a reinvention — new training-data mix, better post-training, faster inference, same long-context obsession. Simon Willison memorably called it "a modest but tangible improvement," which is exactly the kind of unglamorous, voice-preserving upgrade writers care about. Anthropic's own launch notes emphasize safer multi-step agents and a new user-facing "effort" control.

Google announced Gemini 3.5 Pro at I/O 2026 on May 19 with a June launch window, a reported two-million-token context window, and a "Deep Think" extended-reasoning mode. It's the deep-reasoning, high-context tier sitting above the cheaper Gemini 3.5 Flash, aimed at synthesizing across huge research corpora.

The pattern: OpenAI optimized for doing things, Anthropic for writing and reliability, Google for reading everything at once. That triangle explains almost every result below.

Why "best for writing" ≠ "best benchmark score"

Here's the trap. Standard benchmarks — knowledge quizzes, coding suites, synthetic reasoning leaderboards — measure accuracy on short prompts. They say nothing about narrative pacing, voice consistency, or whether a model takes editorial feedback without nuking your structure. A model can ace a reasoning test and still write prose with the personality of a tax form.

Worse, most benchmarks grade one-shot answers. Real book-length work is iterative: outline, draft, revise, revise again, three sessions later remember what happened in chapter two. The qualities that actually decide a Claude vs ChatGPT writing quality comparison — style controllability, revision intelligence, hallucination discipline, interface ergonomics — barely show up on the leaderboards everyone quotes. Which is why a model with slightly lower scores can still feel better to write with.

Context windows & why they matter for book-length work

The headline number got boring because everyone caught up. All three flagship families now operate in or near the one-million-token range; Gemini 3.5 Pro reportedly pushes to two million. A million tokens is roughly 750,000 words — several novels, or one very ambitious textbook — held in mind at once.
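The token-to-word arithmetic above is easy to check for your own manuscript. Here is a minimal sketch, assuming the article's rule of thumb of about 0.75 English words per token; the constant and both function names are illustrative, and real ratios vary by tokenizer and language.

```python
# Rule of thumb from the text: 1,000,000 tokens is roughly 750,000 words,
# i.e. about 0.75 words per token. This ratio is an assumption, not a spec.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def fits_in_window(manuscript_words: int, window_tokens: int) -> bool:
    """Check whether a manuscript of the given length fits in the window."""
    needed_tokens = manuscript_words / WORDS_PER_TOKEN
    return needed_tokens <= window_tokens


print(tokens_to_words(1_000_000))         # 750000
print(fits_in_window(90_000, 1_000_000))  # True: a 90k-word novel fits easily
```

Fitting in the window is only the entry ticket; as the next paragraph argues, what the model does with the middle of that window matters more.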

So the size race is basically over, and *the real differentiator is how well a model uses that window, not how big it is.* Long-context research keeps finding a "lost in the middle" effect: models pay attention to the start and end of a long input and quietly skim the middle. Claude-family models tend to make unusually reliable use of long context; Gemini ingests the most but with task-dependent quality; GPT-class models are strong up to a few hundred thousand tokens before drifting. For a novelist, "can it actually find the scene where Ana meets Bojan?" matters far more than the raw token ceiling.

Effort/thinking controls and what they do to prose quality

The genuinely new lever in 2026 is the thinking dial. Opus 4.8's "effort" control, GPT-5.5's Thinking/Pro modes, and Gemini's Deep Think all let you trade speed for depth of reasoning. They are not literally a temperature setting, but for a writer they feel a bit like one: research on temperature shows low settings make text predictable and tight, while higher settings increase diversity and exploratory phrasing, which is great for ideation and riskier for instruction-following and facts.

The practical move for writers: crank effort and creativity up for early brainstorming, then drop both for line-editing and anything factual. High-effort modes also cost more tokens and run slower, which is why serious workflows route work — a fast, cheap pass for brainstorming, a slow frontier pass for the structural rewrite of a research-heavy chapter.
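For the curious, the temperature effect mentioned above can be sketched in a few lines. This illustrates plain sampling temperature only, not how any vendor's effort or thinking control is implemented; the three scores are made-up values for three candidate next words.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)  # low temperature: predictable
hot = softmax_with_temperature(logits, 2.0)   # high temperature: exploratory

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At the low setting almost all the probability lands on the top-scoring word; at the high setting the three options end up much closer together, which is where the more varied phrasing comes from.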

Specs at a glance

Numbers people actually ask about. The flagship tier sets the ceiling, but most writers do their day-to-day work on the cheaper "workhorse" versions, so this table shows both where they differ. (Pricing is API cost per million tokens; consumer chat apps bundle this into a flat monthly fee.)

Spec	Claude (Opus 4.8 / workhorse Sonnet 4.6)	ChatGPT (GPT-5.5)	Gemini (3.5 Pro / current 3.1 Pro)
Public release	Opus 4.8: May 27, 2026; Sonnet 4.6: Feb 17, 2026	April 23, 2026	3.5 Pro: June 2026; 3.1 Pro: Feb 2026
Context window	~1,000,000 tokens (Sonnet 4.6)	~1M (≈922k in + 128k out)	Up to 2M (3.5 Pro); ~1M (3.1 Pro)
Max output length	~64,000 tokens/response; up to 300k via batch	128,000 tokens/response	~65,000 tokens; defaults to ~8k unless raised
Pricing in / out	≈$3 / $15 (Sonnet 4.6); cached input ≈$0.30	≈$5 / $30	≈$2 / $12 ≤200k, ~$4 / $18 beyond (3.1 Pro)
Thinking modes	Adaptive "effort": Low / Medium / High / Max	Instant / Thinking / Pro + effort levels	Deep Think; thinking_level Low–Max; @fast/@thinking/@pro
Standout writing trait	Most natural, least "AI-ish" long-form voice	On-brief, structured, instruction-faithful	Web-grounded, fact-dense, huge-context synthesis

Notice the quiet plot twist in row three. GPT-5.5 has the highest theoretical output ceiling at 128k tokens — enormous for a single generation. Claude tops out at 64k per response. And yet practitioners keep reporting that *Claude's long output feels longer*, because it holds tone and narrative coherence deeper into a draft while the others start repeating themselves or compressing into executive-summary mode. The ceiling matters less than the endurance. You can confirm the official figures on the Claude models overview.
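To turn the pricing row into something you can budget with, here is a rough sketch of a per-request cost estimate. The prices are the approximate figures from the table; the job size is hypothetical, and Gemini's higher tier for prompts beyond 200k tokens is deliberately left out to keep the example short.

```python
# Approximate API prices in USD per million tokens, copied from the table above.
# Gemini uses the <=200k-token tier only; the higher tier is ignored here.
PRICES = {
    "Claude Sonnet 4.6": (3.0, 15.0),
    "GPT-5.5": (5.0, 30.0),
    "Gemini 3.1 Pro": (2.0, 12.0),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical job: send a 120,000-token manuscript, get 8,000 tokens of edits back.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 120_000, 8_000):.2f}")
```

For that job the estimates come out to roughly $0.48, $0.84, and $0.34 respectively, which is a reminder that reading a whole manuscript is cheap and that output tokens are where the bill grows.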

Writer-relevant benchmarks

Forget the coding leaderboards for a minute. Here are the four measurements that actually predict whether you'll enjoy writing with a tool.

Long-context fidelity (consistency across long documents)

For raw recall over enormous inputs, Gemini is the king — near-perfect "needle in a haystack" retrieval at very long lengths, plus the biggest window. Anthropic's tests show Claude hitting near-perfect recall up to 200k tokens and staying strong at 500k in third-party RAG evals. The honest asterisk: independent benchmarks find every model degrades on mid-document details as you approach hundreds of thousands of tokens. The original "lost in the middle" study still describes 2026 reality. For a giant research binder you want to query, lean Gemini; for a manuscript you want controlled, conservative edits on, Claude and ChatGPT behave more predictably.

Instruction-following & steerability (does it obey your style brief)

This is where Claude pulls ahead and it isn't especially close. Ask a model to end every sentence with a specific word, or to hold a "stoic EU regulator" voice for twelve paragraphs, and Claude is the most literal about obeying. One Confident AI evaluation measured how often each model could be talked into ignoring its scope: GPT-4o complied with out-of-bounds requests ~18% of the time, Gemini ~24%, and Claude only ~9%. For brand-voice or character-voice work, the chisel wins — Claude is simply the hardest to knock off-brief, with ChatGPT a strong second and Gemini the most likely to snap back to a generic "Google-y assistant" voice mid-piece.

Factuality & hallucination rates (the academic/legal dealbreaker)

In that same evaluation's document-Q&A tasks — the closest thing to summarizing PDFs or citing sources — Claude posted the lowest hallucination rate at 5.4%, versus GPT-4o's 7.1% and Gemini's 9.3%, and it scored highest on faithfulness to the retrieved text. A blinded clinical-evidence study found Claude beating both rivals on accuracy and completeness. The behavioral tell: Claude is the most willing to say "that isn't in the provided documents," while the others lean toward confident speculation when sources run thin. Useful for brainstorming; a liability in a legal brief.

Creative-writing preference evals (LMArena creative, EQ-Bench)

Here it splits, interestingly. On EQ-Bench's creative-writing leaderboard, Anthropic's latest Claude family (including the writing-tuned Fable line) sits at the very top for voice and emotional nuance, with GPT-5.5 the next serious competitor. But on crowd-sourced LMArena creative writing, Gemini-3-Pro has topped the charts, helped by its long-context muscle. The catch worth knowing: many of these boards use LLM judges (often Claude itself) and decide on tiny Elo gaps. Practically, you'll notice stylistic flavor far more than raw quality gaps.
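How tiny is a "tiny Elo gap"? A quick sketch using the standard Elo expected-score formula makes it concrete. The ratings below are hypothetical, and real leaderboards may fit their scores with a related but different statistical model, so treat this as an illustration of scale only.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Hypothetical ratings: a 10-point lead versus a 100-point lead.
print(round(elo_win_probability(1510, 1500), 3))  # about 0.514
print(round(elo_win_probability(1600, 1500), 3))  # about 0.64
```

A 10-point lead means the "better" model is preferred only slightly more often than a coin flip, which is why the flavor of the prose tends to be more noticeable than the ranking.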

Head-to-head by writing type

This is the part you came for. Same structure each time: how each tool behaves, a verdict, and a concrete example where it helps.

Long-form prose & books

Drafting chapters and holding a voice across dozens of pages is the heart of the Claude vs ChatGPT long-form writing comparison, and Claude usually wins it. Blind tests and practitioner reviews repeatedly rate its long-form prose the most natural and least "AI-sounding," with one long-form scoring round landing Claude at 9.1, ChatGPT at 8.4, and Gemini at 7.9 for naturalness and editing time. ChatGPT is the better planner (superb at acts, beats, and scene lists), but its default transitions can feel formulaic. Gemini shines at swallowing a 400-page draft whole to fact-check it, but the prose it then writes reads like a competent internal memo.

Verdict on Claude vs ChatGPT for writing a book: draft and revise chapters in Claude, outline and restructure in ChatGPT, keep Gemini open as the research desk. Example: hand all three "rewrite this chapter opening with more rhythm, keep my structure," and Claude is the one that tightens the sentences without quietly rewriting your plot.

Creative writing & fiction

For scenes, subtext, and that elusive literary feel, the Claude vs ChatGPT vs Gemini creative writing contest tilts hard toward Claude, and the narrower Claude vs ChatGPT for creative writing question gets the same answer. Reviewers describe it as the first model where the output doesn't immediately shout "a robot wrote this," and it behaves like a collaborative co-writer that remembers your motifs across iterations. ChatGPT is the experimentation engine (ten alternate endings, five character concepts, a noir-magical-realism mashup on request), but its native voice drifts to cliché unless you steer hard. Gemini writes competent fiction that tends to read like a well-organized encyclopedia entry; its real gift is supplying accurate historical or technical background for the others to dramatize.

Verdict: for emotional depth, pick Claude; for maximal idea generation, start with ChatGPT. The ChatGPT vs Claude creative writing split comes down to whether you want one beautiful draft or twenty raw options.

Essays & non-fiction

Opinion pieces, explainers, personal essays — anything with a spine of real information and a human voice on top. Education-focused comparisons call Claude the "gold standard for language" here: natural tone, fewer buzzwords, strong handling of multi-source notes. ChatGPT brings airtight structure — clean intros, signposted sections, logical flow — which is a gift for persuasive essays and a curse when it slides into generic transitions. Gemini is the pick when the essay must lean on current statistics and live citations.

Verdict: research in Gemini → outline in ChatGPT → final voice pass in Claude. That three-step relay beats any single tool for serious non-fiction.

Academic writing

Now style has to share the stage with rigor. This is the one category where ChatGPT vs Claude for academic writing genuinely flips toward ChatGPT: formal evaluations of scholarly writing have placed GPT-4-class models at or near the top for structure and scientific coherence, and it's the safest single tool for a well-organized first draft. Claude has the edge on depth (literature reviews and theoretical discussion, especially when you feed it PDFs directly), but its citation accuracy is a known weak spot, with one test finding only ~40% of its auto-generated references correct. Gemini's live search surfaces real, recent sources better than either.

Verdict on Claude vs ChatGPT for academic writing: draft in ChatGPT, deepen the analysis in Claude, hunt sources in Gemini, and verify every single citation by hand, from all three. None of them is trustworthy on references.

Legal & professional writing

Contracts, policies, formal letters, board reports. Claude tends to rank highest for the controlled, consistent tone these documents demand, and its long context is ideal for harmonizing several contracts or policies into one coherent draft. ChatGPT is the template king — feed it bullet points and it returns a clean memo or proposal. Gemini's advantage is pulling in current regulations and official guidance.

Verdict on Claude vs ChatGPT for legal writing: Claude for the final wording, ChatGPT for the skeleton, Gemini for the regulatory facts. The non-negotiable caveat: all three are drafting aids only. A qualified human must check every clause and citation, because fabricated case law is a documented hazard across all of them.

Literary translation

Evidence here is more experiential than benchmarked, but the pattern is clear. The Claude vs ChatGPT for translation question depends entirely on what you mean by "good." For literal accuracy, ChatGPT is a reliable, fluent baseline. For preserving register, idiom, mood, and literary voice (the things that make translation an art), Claude's sensitivity to tone wins, and its long context keeps terminology consistent across an entire book. Gemini is the cultural-context desk: proper nouns, real-world references, how a phrase actually gets used online.

Verdict: translate for voice in Claude, sanity-check literal meaning in ChatGPT, and qualify all of it by language pair — every model is stronger in high-resource pairs like Spanish–English than in low-resource ones.

Poetry & short form

Poems, micro-fiction, captions, hooks, slogans. Claude lands at or near the top for naturalness and emotional nuance, and notably dodges the overused AI tics — "unfolding tapestry," "journey," "ever-changing world" — that betray the others. ChatGPT is the volume dealer: twenty headline variants or hooks in one go, punchy and catchy, but in need of pruning for cliché. Gemini is the least inventive of the three for verse, though handy for metaphors grounded in a specific factual domain.

Verdict: serious poetry to Claude; quick marketing variants to ChatGPT.

The multi-model pipeline: using all three together

Here's the move most people miss. The pros rarely pick one tool; they run an assembly line, handing the work to whichever tool is sharpest for that step. This is where the ChatGPT vs Claude vs Gemini for writing debate dissolves into "yes, all of them, in order."

Style card / voice brief (which model builds the best one)

Before drafting anything long, build a style card: a tight brief describing voice, rhythm, vocabulary, and forbidden tics. Claude writes the best one, because it's the model that will then obey it most faithfully. Have it analyze a few paragraphs of your own writing and reverse-engineer the rules, then reuse that card in every session to fight voice drift.

Drafting vs editing vs fact-checking — assign each model its job

Map the workshop to the workflow:

ChatGPT builds the outline, argument map, and beat sheet — the scaffolding.
Claude drafts and revises the actual prose, holding the voice from your style card.
Gemini gathers and verifies the facts, citations, and current sources, exploiting its search grounding and giant window.

Drafting and editing are different jobs, and the tools that win them are different too. Claude is best at tone-faithful revision ("tighten the middle third, keep my structure"); ChatGPT is best at producing alternative versions to choose from; Gemini is best at telling you that the statistic you love is three years out of date.

When one model is enough (don't over-engineer)

If you're writing a single 800-word blog post, the three-tool pipeline is theater. One good tool plus a careful prompt beats an elaborate relay for most everyday writing. Reach for the assembly line only when the stakes, length, or research load justify it — a book, a whitepaper, a high-stakes brief.
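If you like your workflows written down, the routing logic above fits in a dozen lines. This is a sketch of the article's rule of thumb, not a real integration: the stage names, the 800-word cutoff as a hard threshold, and the default single tool are all illustrative assumptions, and no API is called.

```python
# Stage-to-tool mapping taken from the workflow described above.
PIPELINE = {
    "outline": "ChatGPT",
    "draft": "Claude",
    "revise": "Claude",
    "fact_check": "Gemini",
}

# Hypothetical cutoff: the text calls a three-tool relay overkill for an
# 800-word post, so anything at or below that length uses a single tool.
SINGLE_TOOL_MAX_WORDS = 800


def pick_tool(stage: str, target_words: int, default_tool: str = "Claude") -> str:
    """Return which tool to use for a stage of a piece of a given length."""
    if target_words <= SINGLE_TOOL_MAX_WORDS:
        return default_tool
    return PIPELINE[stage]


print(pick_tool("outline", 800))         # Claude
print(pick_tool("outline", 90_000))      # ChatGPT
print(pick_tool("fact_check", 90_000))   # Gemini
```

Swap the default for whichever tool you already pay for; the point is that short pieces skip the relay entirely.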

Which should you choose?

A quick decision tree for the impatient. Find yourself and reach for the matching tool first.

Novelist or fiction writer → Claude. Voice, subtext, and endurance over long drafts are exactly its strengths.
Student or academic → ChatGPT for the structured draft, Claude for deep analysis of your own work, Gemini for live sources — and verify citations everywhere.
Lawyer or professional → Claude for controlled final wording, with ChatGPT scaffolding and Gemini for current regulations. Human review mandatory.
Translator → Claude for literary voice, ChatGPT for literal accuracy, Gemini for cultural context.
Marketer or short-form writer → ChatGPT for fast, high-volume variants; Claude when one line has to land perfectly.
Researcher drowning in PDFs → Gemini, for the biggest window and the best recall across a huge corpus.

If a single rule helps: use Claude when prose quality and voice are paramount, ChatGPT when workflow and tooling matter most, and Gemini when you're deep in Google and need current, structured research — but rarely Gemini as your main stylistic pen.

How we got to these verdicts (methodology & the fine print)

A word on where this comes from, because you should trust claims about AI roughly as much as the sourcing behind them. This piece synthesizes published 2026 evaluations — named and dated above — rather than a single first-party lab test: independent blind preference tests, the EQ-Bench and LMArena creative-writing leaderboards, the Confident AI hallucination study, long-context research, and patterns that recur across dozens of hands-on reviews. Where a benchmark is a vendor grading its own homework, we've flagged it; primary announcements and independent evals get more weight than aggregator blogs.

Three honest limitations. Non-determinism: the same prompt yields different output across runs, so any single head-to-head is a snapshot, not proof. Subjectivity: prose has no ground truth, and blind tests show people often can't reliably crown a winner — these verdicts are strong tendencies. Perishability: every "X is best at Y" here is stamped as of June 2026. New checkpoints land monthly; by the time you read this, the rankings may have shuffled. Treat this as a historical snapshot that was accurate the day it was written.

A few things no model changes. Each has verbal tics — watch for em-dash addiction, tidy tricolons, hedging openers, and the word "delve." Each drifts in voice over a long session, which is why a reusable style card matters. And none of this replaces a human editor or, in journalism, your fact-checking obligation. On privacy: consumer chat tiers and enterprise/API tiers differ sharply in whether your inputs train future models, and the opt-outs live in account settings — check them before you paste anything sensitive, and remember that AI-assisted text sold as fully human-written is an ethics problem no tool will solve for you.

The bench has three good tools. Now you know which to pick up, and when to put it down.

Hang in there — and happy writing.

Wittgenstein Knew Why AI Gets Dumber — and Why Text Provenance Doesn't Matter

editorial@silentroom.ai (Arsen Revazov) — Mon, 22 Jun 2026 16:29:52 GMT

Part 1. A Problem Discovered Too Late

The models you use are slowly getting dumber. Unfortunately, this isn't alarmism but an empirical finding: in April 2025, 74.2% of new web pages contained AI-generated text, and each successive model trains partly on the exhaust of its predecessor, which itself digested the exhaust of an even earlier one. Biologists talk about autophagy, an organism consuming its own tissue. Medical geneticists speak of the Habsburg jaw. ML engineers, since July 2024, have used the term “model collapse.”

What do Charles II of Spain (the last Habsburg on the Spanish throne), a photocopier, and ChatGPT have in common? All three degrade when they copy themselves. For the Habsburgs, generations of close-kin marriages accumulated genetic defects. For the photocopier, it's loss of contrast by the tenth copy. For AI, it was a July 2024 paper in Nature — AI models collapse when trained on recursively generated data — that spelled it out: feed a language model on the outputs of its predecessor, and within a few generations it gets substantially dumber.

The scenario is clear: diversity drops, and the tails of the distribution (the rarest, most original outputs) are the first to disappear. By the end of 2024, Dohmatob et al. in Strong Model Collapse had shown something uncomfortable: asymptotically, even a one-thousandth share of synthetic data in a training corpus leads to degradation. 0.1% is already bad. So what happens when roughly 75% of new web pages contain synthetic text, and AI-generated pages account for nearly 20% of Google's top results by 2025? What fresh material is the poor model supposed to learn from? Scaling the dataset doesn't help. Scaling the model doesn't help. The classic scaling-laws hypothesis (more data, smarter model), the premise underpinning half of every VC pitch deck out there, collapsed right on investors' slides.

By the time the Nature paper appeared, the industry had already spent two years living in the age of mass AI production, and the web data earmarked for training the next generation of models contained a meaningful share of synthetic content.

Phase One: Alarmism (2024)

After the Nature paper, talking about model collapse as an inevitable curse became fashionable, and two useful terms took hold. One dated back to 2023, when people started referring to Model Autophagy Disorder (MAD), applied to generative models that feed on their own outputs and go insane. The acronym was too literary not to spread across headlines. Alongside it, journalists picked up "Habsburg AI": a model that breeds with itself comes to a bad end.

By late 2024, an apocalyptic scenario had taken shape. AI generation grows exponentially. Web data becomes contaminated. Future generations of models train on heavily polluted data. Quality falls irreversibly.

Phase Two: Mitigation (2025)

In 2025, several papers appeared showing that yes, things were bad — no one was disputing that — but not quite so fatal. The picture turned out to be more nuanced.

If real data is retained at every generation and synthetic data is merely added on top, the model degrades, but more slowly. A regime that replaces human text with machine text guarantees rapid collapse; an accumulation regime doesn’t. The picture that emerged showed that collapse is real but manageable with three practical steps: accumulate, filter, and blend in the right proportions. The alarmism of 2024 subsided.
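The difference between the two regimes can be sketched with a toy simulation. This is a minimal illustration under loud assumptions, not the setup of any of the papers: the "model" is just a Gaussian fitted to a handful of numbers, and the function name `simulate` and all parameters are hypothetical.

```python
import random
import statistics


def simulate(regime: str, generations: int = 200, n: int = 10, seed: int = 0) -> float:
    """Return the spread (population std) of the data the final model sees.

    regime="replace": each generation is fit only on the previous generation's samples.
    regime="accumulate": real data is kept and every synthetic batch is added on top.
    """
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the only "human" data
    for _ in range(generations):
        # "Train": fit a Gaussian to whatever is in the pool.
        mu, sigma = statistics.fmean(pool), statistics.pstdev(pool)
        # "Generate": sample a synthetic batch from the fitted model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        pool = synthetic if regime == "replace" else pool + synthetic
    return statistics.pstdev(pool)


replaced = simulate("replace")
accumulated = simulate("accumulate")
print(f"replace: {replaced:.2e}  accumulate: {accumulated:.2e}")
```

In the replace regime the spread shrinks toward zero, which is the toy version of the tails disappearing. In the accumulate regime the original samples stay in the pool and the spread drifts far more slowly.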

Phase Three: Pragmatism (2026)

By May 2026, the industry had calmed down a bit. The apocalypse was postponed, while the collapse was stripped for parts and produced one useful application: spectral diagnostics of the embedding Gram matrix can now catch degradation before a model starts talking nonsense.
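What a spectral check on a Gram matrix might look like can be shown in a few lines. The specific diagnostic in use is not named here, so the statistic below (the participation ratio of the Gram spectrum) is an assumption chosen for illustration, and the function names are hypothetical.

```python
def gram(embeddings):
    """Gram matrix G[i][j] = <e_i, e_j> for a list of embedding vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings] for u in embeddings]


def participation_ratio(embeddings):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) of the Gram matrix.

    Computed from tr(G) and tr(G^2) = ||G||_F^2 (G is symmetric), so no
    eigendecomposition is needed. It roughly counts how many directions
    the embeddings actually occupy.
    """
    g = gram(embeddings)
    trace = sum(g[i][i] for i in range(len(g)))
    frob_sq = sum(x * x for row in g for x in row)
    return trace * trace / frob_sq


diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
print(participation_ratio(diverse))    # 3.0
print(participation_ratio(collapsed))  # 1.0
```

A falling ratio across model generations would signal that outputs are crowding into fewer directions, before the text itself reads as obviously degraded.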

Three major shifts happened.

1. The Verifier as Life Preserver (With a Cast-Iron Weight Chained to It)

Yi et al., "Escaping Model Collapse via Synthetic Data Verification" (revised March 2026) revealed a major breakthrough. A verifier, an external filter between a model and its synthetic outputs, helped training produce improvement rather than collapse. Venture capitalists celebrated, but prefer not to think about a few nuances. First, strictly speaking, the theorem is only proven for linear regression in a vacuum. Second, the trained model converges to the verifier's "knowledge center." The student can never surpass the teacher. Third — and this is the really uncomfortable part — if the verifier is imperfect, early gains plateau and can even reverse. Infinite self-play is impossible, and the perfect teacher doesn't exist.
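The second and third caveats can be made concrete with a one-dimensional toy. This is not the construction from Yi et al.; it is a hedged sketch in which the "model" is a single number, the verifier is a tolerance band around what it believes is correct, and every name and parameter is hypothetical.

```python
import random
import statistics


def train_with_verifier(verifier_center, tolerance=1.0, generations=30,
                        samples=2000, start=3.0, seed=0):
    """Self-training on a 1-D toy where the 'model' is just a mean.

    Each generation the model samples around its current mean, the verifier
    accepts only samples within `tolerance` of its own belief, and the model
    refits on the accepted samples.
    """
    rng = random.Random(seed)
    mean = start
    for _ in range(generations):
        outputs = [rng.gauss(mean, 1.0) for _ in range(samples)]
        accepted = [x for x in outputs if abs(x - verifier_center) < tolerance]
        if accepted:  # if nothing passes the filter, keep the old model
            mean = statistics.fmean(accepted)
    return mean


truth = 0.0
good = train_with_verifier(verifier_center=truth)
biased = train_with_verifier(verifier_center=0.5)  # an imperfect teacher
print(f"perfect verifier: {good:.2f}, biased verifier: {biased:.2f}")
```

In both runs the model ends up where the verifier points, not where the truth is. With a biased verifier, more rounds of self-training only lock the bias in.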

2. We Found the Delete Button

Deleting data from an LLM (unlearning) used to be like trying to extract a specific teaspoon of sugar from a finished cake. Any given model was a black box with hundreds of billions of parameters. Scholten and colleagues proposed an elegant pivot: the Partial Model Collapse method, which sics a targeted collapse on the unwanted knowledge and burns it out through recursion. Their paper is titled "Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs" — arguably the best headline of the year. The bug became a scalpel.

3. The End of Magic (Clinical Autophagy)

In April 2026, clinicians sounded the alarm: clinical LLMs, trained on AI-generated medical records, were systematically erasing rare pathologies and averaging out complex conditions into "benign normals." Rare pathologies aren't actually that rare — in an aging population, it’s especially common for a patient to have two or more conditions at once. Add a kidney problem to a liver problem, a vascular problem, and a hospital-acquired infection, and the AI starts contradicting itself. The same bland, washed-out prose you'd expect from hack writers has migrated into the workstations of practicing physicians, directly threatening the lives of their patient-readers.

Experts gave the phenomenon the reassuringly bland label "interpretive drift" and proposed watermarks on AI-generated records, plus "Human Vaults" that isolate and store repositories of data from real doctors. I think this is a band-aid at best: how are these vaults supposed to interact with the information AI generates in day-to-day clinical work?

AGI Valuations Take a Nosedive

Investors and AI evangelists have been breathlessly awaiting Artificial General Intelligence (AGI), a model that can solve any problem at least as well as a human. AGI won't need smoke breaks, worry about mortgage payments, or risk burnout, but it also won't walk through the door everyone's been waiting at: simple scaling.

You can increase compute by a factor of ten. You can (in theory) do it by a hundred. It won't matter. An AI isn’t going to drop you a line asking "I've engineered a new apple variety, launched a delivery startup, optimized global gas logistics, found a few decent and competent politicians, and proved the Riemann Hypothesis along the way. What's next?" A machine uprising’s not coming any time soon, either. A model cannot conjure from its weights information that was never there to begin with.

The idea behind model collapse isn't new: an Austrian philosopher came up with it long before the first perceptron existed. The meaning of a word, he argued, is not determined by its internal structure but by its use within a community. Without public practice, without other participants who can correct you, language degenerates into noise.

Understanding why a model cannot surpass its teacher requires a philosophical detour. Without one, any conversation about AI devolves into tech-speak, and the central question gets lost: what is meaning, and where does it come from?

Part 2. Wittgenstein Explains What Happened

Early Wittgenstein and why symbolic AI failed

In 1921, a young Ludwig Wittgenstein published the Tractatus Logico-Philosophicus — a book that, by his own modest estimation, had definitively solved all philosophical problems. (A perfectly reasonable state of mind at 32, especially if you're Austrian and came back from World War I a couple of years prior.)

The central idea of the Tractatus: language is a picture of the world. Every meaningful sentence corresponds to a fact in reality. Words are labels on things; sentences are models of situations. Anything that cannot be expressed clearly in this logical form is meaningless. Hence the famous aphorism, number 7:

"Whereof one cannot speak, thereof one must be silent."

This saying inspired logical positivism, and later symbolic AI of the 1960s–80s: build a large enough system of rules, the thinking went, and you'd get artificial intelligence. From 1956 to 1985, DARPA and universities poured billions into the search for AI according to this principle, creating expert systems, logic reasoners, and common-sense knowledge bases.

The program failed. Common sense still hasn't been encoded, not because of a shortage of processing power but because it cannot be formalized. Every rule requires interpretation, every interpretation requires another rule, and so on without end.

The early Wittgenstein inspired a program that collapsed for precisely the reasons the later Wittgenstein predicted.

Late Wittgenstein: meaning is use

By the 1930s, Wittgenstein had begun to doubt that he had solved all the problems of philosophy — a rare quality in any philosopher, and one worth acknowledging. He spent the next twenty years dismantling his own earlier work.

The result was Philosophical Investigations, published posthumously in 1953. It is an entirely different philosophy. Language is no longer a picture of the world but an activity. From "a word denotes a thing," we move to "a word has a use." One correct language disappears in favor of a multiplicity of language games.

Take the word "water." What does it mean?

— "Water!" a man cries out in the desert.

— "This is water," a mother shows her child.

— "Water boils at one hundred degrees," a physics textbook states.

— "Are you serving me distilled water again?" the generative text co-author snaps at ChatGPT.

These are all different uses: not different "senses of one word," but different language games in which the word operates differently. To understand the word "water" is not to memorize the definition H₂O, but to be able to play these games. Understanding means you know how to cry out when thirsty; to point when teaching; to write the formula when doing physics; to throw the word as an insult when irritated.

These uses seem to share no single common core, yet they form a chain of overlapping similarities, like the faces of family members. While a definition of the word "water" is impossible, any native speaker handles it with complete ease. Family resemblance matters more than formal semantics.

Meaning does not live in the mind, in a dictionary, or in logical form. Meaning lives in practice — in how words are used among other people, in concrete situations.

The Private Language Argument: The Key to Everything

In §§243–271 of Philosophical Investigations, Wittgenstein runs a thought experiment — in my view, the single most important argument in all of twentieth-century philosophy for anyone trying to understand AI.

Suppose I have a tickle in my left ear that no one else will ever experience. I invent the sign "S" for it and write that in my diary every time the sensation appears. I tell no one. I have my own, private language.

Wittgenstein asks: what exactly is this? What does "the same S" mean across my entries? How do I know that today's sensation is the very same S as yesterday's?

The answer is that I don't. There is no criterion. I can say "it seems to me that it's the same," but "it seems to me" is not a criterion, it’s part of what needs verification. Without a public practice, without other participants who can correct me or agree with me, the sign "S" has no meaning. It’s not a language, just noise I’m recording for some reason.

Meaning arises only within a community of practices.

If you have read this far, something should have clicked. Modern LLMs are trained on vast corpora of human text. They see which words appear alongside which other words. But they do not participate in language games. They do not drink water, do not cry "Water!" in the desert, do not explain things to a child, do not write physics textbooks, do not hurl words as insults.

From the perspective of the later Wittgenstein, LLMs have statistics of usage but no usage itself. This is precisely the private language argument applied to AI: a model that takes no part in a form of life is dealing with something that looks like language from the outside but is, in essence, its own private vocabulary — endlessly sophisticated and entirely without meaning.

When such a model begins training on its own outputs without human verification, we get a private language devouring itself. In practice, this is called model collapse; in philosophy, it’s the impossibility of private language.

Brandom: Turning Intuition into Theory

Wittgenstein was a brilliant philosopher, but a systematic thinker he was not. Philosophical Investigations is a collection of notes and aphorisms, wonderful to read and agonizing to work with theoretically. By the end of the twentieth century, Wittgenstein's insights were waiting for someone to systematize them.

That someone was the American philosopher Robert Brandom of the University of Pittsburgh. His magnum opus — Making It Explicit (1994) — runs to 800 pages of dense prose, in which every concept is defined in terms of five others, each of those defined in terms of ten more. Best read slowly, pencil in hand, with the growing sense that the author has a personal score to settle with you.

Brandom takes Wittgenstein's "meaning is use" and turns it into a rigorous theory called inferentialism: the inferences a word enters into determine its meaning.

The word "water" means what it means because:

— From "this is water" it follows that "this is drinkable."

— From "this is water" it follows that "this will get you wet."

— From "this is water" it does NOT follow that "this will burn."

— From "ice melts at 0°C" it follows that "water freezes at 0°C."

This is an inferential network — a web of consequences. To understand a word is to know its position in that web.

Brandom's central metaphor: language is the game of giving and asking for reasons. When I assert something, I commit myself to justifying it if challenged. I license others to use my assertion in their own reasoning. If I'm caught in a contradiction, I'm obliged to retract or revise.

This is a normative structure, with rights, obligations, and sanctions for all participants. To understand a sentence is to know its commitments, entitlements, and incompatibilities.

And this is where things get interesting for our purposes.

LLM as a Stochastic Parrot in a Normative Vacuum

From Brandom's perspective, an LLM produces assertions, but:

— It takes on no obligation to justify them. The model generated a contradiction? Tomorrow it will generate the next one without missing a beat.

— It faces no consequences for what it asserts. The model told you yesterday that Sydney is the capital of Australia? You can see how that happens.

— It doesn't exist within a normative community where anything is actually at stake. I can be fired for lying in a report; a model won't be fired for hallucinating, and if it gets shut down, it doesn't care. It has nothing on the line.

The optimist camp (Sutskever, Hinton) says that "understanding" emerges from large-scale statistics. If a model predicts the next token well enough, it must have built an internal model of the world.

I don't like the word "emerges." It's the academically respectable way of saying "we have no idea how it works, but apparently it does." The more honest term is "black box." While I don’t want to wade further into that debate — it doesn’t seem like one that arguments can resolve — I can confidently state that within a single model, with no access to a normative community, meaning cannot arise. Emergence from statistics is, at best, a hypothesis. External verification is a mathematically demonstrable necessity.

An LLM has no intrinsic concept of "water." If it's being checked by a verifier (another AI, a human expert, a compiler) that says "this is right, this is wrong," it gradually builds up a picture: "water" is the thing the verifier nods at when I say "liquid," "wet," or "boils at a hundred degrees," and frowns at when I say "burns" or "triangular."

The model doesn't understand the word, it understands the teacher. Its "knowledge" is an imprint of the verifier's normative grip, not contact with the world. Change the verifier, and you change the meaning.

If tomorrow the judge decides that water burns, the model will dutifully learn that water burns.

What This Means

The most honest conclusion of Part 2 is a simple one. When the AI industry debates "does a model understand," "does it have consciousness," or "could it be AGI," it's treading the same ground philosophy covered sixty years ago. That’s not because philosophers are smarter (though sometimes they are), but because they had more time and fewer financial incentives to stay distracted.

Philosophy's central lesson for AI in the twentieth century: meaning doesn't arise from within a system. Meaning comes from participation in a normative community. Yi et al. formalized this for the specific case of training generative models. Brandom did it for all discourse. Wittgenstein did it for language itself.

But the conclusion most people draw from this — that AI is doomed, that synthetic data is poison, that we need to build walls around "human" data — is wrong. In Part 3, I'll show why that same Wittgensteinian logic points in exactly the opposite direction.

Part 3. Where the Skeptics Go Wrong: Provenance Doesn't Matter

The distinction between "AI-generated" and "human" text is false. That’s not because there's no difference, but because the difference doesn't matter for the central question: can you train the next generation of models on these texts? If yes, LLMs have a future. If no, we're at a dead end.

The standard picture looks like this. There's a clean human corpus — books, articles, forums from before 2022 (the golden age of the internet, even though most forum content in 2008 was junk too, just hand-typed). Then there's a contaminated AI corpus — everything models have generated since ChatGPT launched. The assumption is that mixing the two in training leads to model collapse.

This viewpoint underpins every current regulatory proposal: watermarks, clean-data registries, the Human Vaults concept. I'm arguing that it's wrong.

Take a concrete example. I write with AI. For this article, Claude and I discussed Wittgenstein's philosophy; I proposed a thesis, she pushed back, and we arrived at a formulation together. This text is partly hers, partly mine, and I can no longer tell where one ends and the other begins.

Say I then:

Read through and edit it.
Show it to two colleagues who point out errors, then make the corrections they recommend.
Publish it in a peer-reviewed journal, where other researchers scrutinize it for years.

What kind of text is this? "AI-generated" or "human"?

The standard picture demands a binary answer: once AI touches content, it's synthetic. However, in terms of epistemic function, this text has been put through a filtering process no less rigorous than what you'd find at Nature. For training purposes, it's closer to verified knowledge than to "human" noise. What matters isn't the origin but the validation trail.

The later Wittgenstein gives the same clear answer: meaning is determined by participation in a practice, not by its source. A word means the same thing when a parrot squawks it as when a professor speaks it if both use it correctly within the same language game.

The Real Reason Models Collapse

Reframe the problem through this lens, and it looks entirely different. It’s not "AI-generated text contaminates data," but:

The speed at which AI produces content has outpaced the speed at which existing institutions can filter it.

The scandal involving Kenyan content moderators is a perfect illustration of how this works today. When OpenAI trained its safety filters, it outsourced the work through Sama, paying Kenyan workers just over a dollar an hour to read descriptions of violence and abuse. Verification today falls far short of any reasonable technical or ethical standard.

AI-generated content isn't bad because of its machine origin. It's bad because engineers cut corners on annotation to protect shareholder margins, and then lazy writers can't be bothered to proofread whatever the model spits out. That's a different problem, and it calls for different solutions.

Nature faces an analogous challenge and solves it differently: organisms accumulate DNA mutations constantly. Most are harmful. If evolution operated on the principle of "eliminate mutations before they occur," life would never have arisen. Evolution works because selection weeds out bad mutations faster than they accumulate. That is the architecture worth studying.

AI-generated texts are a new type of "mutation" in humanity's epistemic body of knowledge. They are not inherently poisonous. They become poisonous only when the filtering system buckles under their sheer volume.

The right question here is not "how do we label AI content," but "how do we scale verification institutions?"

I half-expect someone to announce a website modestly titled "Everything About Everything" any day now. It’ll include a pipeline that analyzes the interests of several thousand audience segments, then generates news pieces, longform articles, and ostensibly authored opinion columns, complete with real agency photographs and AI-generated images. The site will adapt to each individual reader and publish in the 100–200 languages humanity still reads. The writing quality will be far from perfect, but that's a temporary problem. Besides, is everything humanity has ever produced perfect?

Here's the kicker. Say our hypothetical site owner puts a like button and a read-time estimate at the bottom of every article, and how could he not? Suddenly he has in his hands exactly the filter Yi et al. describe. Hundreds of millions of clicks a day across a hundred languages is a verification mechanism with more throughput than any peer review process in existence.

What the Math Says

We probably won’t get a single mogul creating such a website, but the existing AI moguls could pool their efforts. Instead of spending billions trying to separate "clean human content" from "dirty AI content," they could invest that money in filters capable of distinguishing verified from unverified text, regardless of origin. Humans, incidentally, exploit unverified content for far more nefarious purposes, and far more often, than AI does out of its own inherent limitations. The term fake news still means something, doesn't it?

Where It Already Works (and What That Tells Us)

The funny thing is, industry best practices already operate by Wittgenstein’s logic, but nobody spells it out explicitly. Let's name names.

Code. AlphaCode, Codex, GitHub Copilot, Claude Code. The model generates code. The code runs. If it works, great. If not, it gets feedback. The compiler is the reality verifier. Where the code came from (a human, an AI, or some hybrid) is irrelevant. All that matters is whether the code passes the tests. That's why AI is progressing at a stunning pace in this domain.

Mathematics. AlphaProof, Lean, Coq. The model proposes a proof. A formal verifier checks it. It passes, it's true. It doesn't, it's wrong. Provenance is irrelevant; verification is everything.

Protein structures. AlphaFold predicts the structure and a lab verifies it experimentally.

Chess and Go. AlphaZero's self-play is the one case where self-improvement doesn't lead to collapse. The reason is the built-in verifier: the rules of the game and the fact of winning or losing. A loss is an objective signal, independent of any verifier's "knowledge base."

The pattern is clear: wherever a cheap reality filter exists, AI progresses without collapse and without needing human-labeled data. Wherever it doesn't (creative writing, politics), we hit the ceiling of the human-as-verifier approach and run straight into the limitations Yi et al. describe.
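The code case from the list above fits in a few lines. The sketch below is illustrative only: the `Candidate` type, the test cases, and the three candidate functions are all hypothetical, and a real pipeline would run candidates in a sandbox rather than in-process.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str  # "human", "ai", "hybrid": recorded, but never consulted by the filter
    func: Callable[[List[int]], List[int]]


# The "reality filter": input / expected-output pairs for a sort routine.
TESTS: List[Tuple[List[int], List[int]]] = [
    ([3, 1, 2], [1, 2, 3]),
    ([], []),
    ([2, 2, 1], [1, 2, 2]),
]


def passes(candidate: Candidate) -> bool:
    """Accept a candidate only if it reproduces every expected output."""
    try:
        return all(candidate.func(list(inp)) == out for inp, out in TESTS)
    except Exception:
        return False  # crashing counts as failing


candidates = [
    Candidate("human", lambda xs: xs),                # forgot to sort
    Candidate("ai", lambda xs: sorted(xs)),           # correct
    Candidate("hybrid", lambda xs: sorted(set(xs))),  # silently drops duplicates
]
verified = [c.source for c in candidates if passes(c)]
print(verified)  # ['ai']
```

The `source` field never enters `passes`. Which candidate survives depends only on the tests, which is the whole argument in miniature.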

Under this logic, the current political debates around AI start to look pretty shaky:

— Banning AI on Wikipedia is pointless. Fact-checking is what matters. A text's origin tells you nothing about its quality.

— Mandatory labeling is useful for transparency, not for quality. A labeled AI text doesn't automatically become worse; an unlabeled human text doesn't automatically become better.

— Isolation in Human Vaults is a safety net, not a solution. Building filters that work in the flow of content should take priority over hiding away a "golden corpus."

What actually makes sense instead:

— AI judges in peer review. Paradoxically, AI filters AI better than humans do, recognizing characteristic errors faster. Anthropic, OpenAI, and DeepMind already use this principle in final-stage review.

— Scaling reality grounding. Extend the code-and-math approach as broadly as possible to any domain with an objective verifier.

— Reputation systems. The question to answer isn't "should I trust this article?" but "should I trust this author?"

— Provenance tracking. Watermarks can serve as tracking infrastructure rather than exclusion filters, helping people account for a text’s characteristics by providing information about its journey.

The Sharpest Implication

Taken to its logical conclusion, this leads to a provocative claim: most of the fear around AI data contamination is displaced anxiety about our own laziness.

We're afraid we won't be able to filter AI-generated content, but we don't properly filter human content either. Social media recommendation algorithms have spent years giving the spotlight to whatever triggers an emotional response, from dubious claims to political propaganda, with no regard for truth whatsoever. Our epistemic institutions were already in deep crisis long before ChatGPT.

AI didn't create the filtering problem. AI laid it bare. The solution is making our epistemic institutions stronger, but that's expensive, tedious, and doesn't generate clickbait headlines. Worse, it carries the risk of being accused of censorship.

The industry would rather talk about watermarks. That's politically safe. It’s also epistemically useless: getting a model to embed a cryptographic signature in its outputs is technically trivial, but it says nothing about the quality of those results.

This Is a Race

AI progress now hinges on a single question: can verification institutions keep pace with the volume of generation? Right now, they can't. The ratio of verified knowledge to synthetic noise is in freefall.

This is a race with two possible outcomes.

In the optimistic scenario, we manage to rebuild institutions in time and AI advances through hybrid filtering. Wittgenstein turns out to be right in the most elegant sense: meaning arises in a form of life, and that form of life turns out to be elastic enough to incorporate AI as a participant rather than a threat.

In the pessimistic scenario, institutions buckle. Web data becomes unusable for training. Frontier models are forced to retreat to closed, curated corpora and reality grounding. That's not the end of the world, but it's a radical slowdown, nothing like what the industry promises in its investor decks.

The sharpest irony is that in both scenarios, the top labs will have to do the same thing: build verification infrastructure. They’ll need it to keep moving forward in one case and to avoid collapse in the other. The only difference is that, in my view, these institutions are the primary engine of AI progress, while in the standard view, they're an annoying line item that costs money better spent on GPUs.

The Honest Counterargument

The central weakness of my position is this: verification doesn't scale well. Not cognitively (careful verification requires a slow human in the loop), and not economically (Kenyan annotators at a dollar an hour represent the industry's ceiling, and it doesn't hold). Generation gets cheaper on an exponential curve. If that gap doesn't close, we end up in the pessimistic scenario for purely economic reasons, not philosophical ones.

I don't know how to close that gap. My suspicion is that part of the answer lies in AI judges (a model checking a model is faster than a human), part in reality grounding (a compiler is free and never gets tired), and part in reputation systems (trust in an author lowers the cost of verifying their work). However, these are all hypotheses. The problem may be unsolvable in principle. If so, the pessimists are right.

A Falsifiable Prediction

If I'm right, three things will happen in the next two to three years.

First: companies like Anthropic and OpenAI will stop keeping AI judges in house. Someone will raise a round for technology providing verification as a service, and it will be the most boring and most necessary startup of the decade.

Second: reputation layers will grow on top of arXiv and similar platforms. This won’t be a binary "passed/failed peer review," but a trust gradient with a full verification history.

Third: the top labs will quietly stop scraping the web wholesale. They'll shift to carefully curated text collections and domains where reality checks results (code, experiments, mathematics). In their technical reports, the phrase "trained on internet-scale data" will start being sheepishly swapped out for "trained on high-quality curated data."

If, on the other hand, the industry spends another three years debating watermarks and registries of "clean" data, then I was wrong, and the problem really is about where text comes from, not how it's filtered. Fair enough.

What readers should do about this

Verify your work. Don't just "disclose that you used AI" — that won't save anyone. Run your text through a fact-check, a colleague, a test, an experiment, or anything else that might catch you out.

You are part of the verification architecture, whether you like it or not. The quality of your personal filtering matters more now than it ever has. You are not a lone author but a small institution, and you belong to the form of life that checks the others.

Including stochastic parrots.

Especially stochastic parrots.

The Illusion of the Complex Prompt: Why an AI More Often Needs a Screwdriver Than a Laser Level

editorial@silentroom.ai (Aaron Miller) — Mon, 22 Jun 2026 16:26:53 GMT

Let's start with the main point:

Prompt engineering boils down to picking the minimally sufficient tool.
Zero-shot prompting handles most routine tasks, from summarization to basic classification.
Complicating a request only makes sense after you've diagnosed the problem: a broken form calls for examples, broken logic calls for a chain of reasoning.

This text isn't for the veteran prompt engineers who have attained the Zen of context windows. It's for those who need to lay a reliable foundation or sanity-check their working habits.

A myth has taken root in the industry: the more elaborate the architecture of a request, the better the result. The three basic techniques for working with AI often get lumped together, and as a result, simple tasks end up wrapped in multi-story constructions with variables and forced reasoning.

In practice, this kind of overengineering hits both speed and budget. The math is simple: firing up a chain of reasoning where a basic zero-shot prompt would have done the job forces the model to generate 500 to 1,000 "extra" tokens. The ordinary user waits fifteen seconds for an answer instead of two, and at company scale, a bloated API bill becomes a painful line item. Meanwhile, all the fuss adds no real value to the facts or the style.
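The arithmetic behind that claim is easy to run. Every figure below is a hypothetical placeholder (request volume, price per million output tokens, generation speed); only the 500 to 1,000 extra-token range comes from the text. Substitute your own provider's numbers.

```python
def overhead(extra_tokens: int, requests_per_day: int,
             usd_per_million_output_tokens: float, tokens_per_second: float):
    """Return (extra seconds per request, extra dollars per day)."""
    extra_seconds = extra_tokens / tokens_per_second
    extra_usd = extra_tokens * requests_per_day * usd_per_million_output_tokens / 1_000_000
    return extra_seconds, extra_usd


# Assumed figures: 750 extra tokens (midpoint of 500 to 1,000), 100,000 requests a day,
# $10 per million output tokens, 60 tokens generated per second.
seconds, dollars = overhead(750, 100_000, 10.0, 60.0)
print(f"{seconds:.1f} s slower per request, ${dollars:,.0f} extra per day")
# 12.5 s slower per request, $750 extra per day
```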

The core principle for working with AI is minimum sufficiency. Zero-shot, few-shot, and chain-of-thought (CoT) prompts are not rungs on an evolutionary ladder from simple to complex. They are three different keys for three fundamentally different classes of task. Any complication beyond the baseline is just a tax on the habit of over-hedging.

Zero-shot as the starting point

Zero-shot prompting is a request with no examples and no instructions on how to think. You set only the role, the task, the output format, and hard constraints.

This is the technique to start with on any new scenario. It covers most standard operations: classification, translation, data extraction, or summarization. The models have already seen millions of such tasks; they don't need to be told what a chronological list looks like.

When a body of documents enters the workflow, the focus shifts from format to guardrails. Your main anti-hallucination anchor is the requirement to draw data strictly from the sources and not try to fabricate facts out of the weights.

If zero-shot delivers consistently, the iterations end there. There's no reason to break out a laser level to hang a poster.

If the result falls apart, diagnosis begins. With a basic instruction, the model usually trips up for one of two reasons. The answer might be correct on substance but miss on form, style, or specific formatting. Alternatively, the model might skip over facts and land on the wrong conclusion. Trying to cure the first case with more elaborate reasoning, or the second with style examples, is technically pointless.
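The diagnostic rule can be written down as a tiny decision function. The enum and function names are hypothetical; the mapping itself follows the rule stated above (broken form calls for examples, broken logic calls for a chain of reasoning).

```python
from enum import Enum


class Failure(Enum):
    NONE = "output is fine"
    FORM = "right substance, wrong format or style"
    LOGIC = "skipped facts or wrong conclusion"


def next_technique(failure: Failure) -> str:
    """Map a diagnosed zero-shot failure to the smallest fix that addresses it."""
    if failure is Failure.NONE:
        return "zero-shot"         # it works: stop iterating
    if failure is Failure.FORM:
        return "few-shot"          # show examples of the target form
    return "chain-of-thought"      # ask for explicit reasoning steps


print(next_technique(Failure.FORM))  # few-shot
```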

Here's a working example of zero-shot. It has no multi-story instructions, just the task, the format, and a touch of paranoia about the facts (assuming you've already loaded the context into the model).

Example of a simple zero-shot prompt:

Act as a meticulous fact-checker. Analyze the attached documents and build a chronology of the events mentioned.

Output format: [Date] — [Event] — [Brief confirming quote]

Constraints:

Rely strictly on the uploaded texts. Pulling in outside information is forbidden.
If the document gives no exact date, mark it [date unknown].
Don't try to smooth over the rough edges or fill in the picture. What isn't in the sources didn't happen in reality.

The whole point of the method lives in the constraints block. A good zero-shot doesn't waste time explaining style or logic. It sets the frame and comes down hard on any attempt to guess at something that isn't in the source base.
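If you assemble such prompts in code, the four parts (role, task, output format, constraints) map naturally onto a small helper. This is a sketch with a hypothetical function name; it only builds the string and makes no assumptions about which model or API receives it.

```python
def build_zero_shot_prompt(role, task, output_format, constraints):
    """Join the four zero-shot parts into one prompt: no examples, no reasoning steps."""
    lines = [f"Act as {role}. {task}", "", f"Output format: {output_format}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)


prompt = build_zero_shot_prompt(
    role="a meticulous fact-checker",
    task="Analyze the attached documents and build a chronology of the events mentioned.",
    output_format="[Date] - [Event] - [Brief confirming quote]",
    constraints=[
        "Rely strictly on the uploaded texts. Pulling in outside information is forbidden.",
        "If the document gives no exact date, mark it [date unknown].",
        "Don't smooth over rough edges or fill in the picture.",
    ],
)
print(prompt)
```

Keeping the constraints as a separate list makes it easy to tighten the guardrails without touching the task description.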

Few-shot: when showing beats telling

For a language model, words aren't meanings — they're vectors. That’s why models tend to ignore long lectures about exactly how the final document should look. An LLM defaults to its habits, but show it a couple of finished examples, and in-context learning kicks in. This technique is called few-shot prompting. Instead of an exhaustive format spec, you load "input → output" specimens straight into the prompt. If you need a specific tone, a nonstandard classification, or strict card formatting, prototypes work more reliably than any string of adjectives. As a practical rule, if your description of the output structure has already run to three paragraphs, delete them and paste in a short example.

The sweet spot is two or three samples. One is risky: the system can latch onto an incidental detail and overgeneralize it across the whole response. Five or more eat into your context budget and crowd the task itself out of memory.

At a minimum, the examples must be varied and absolutely correct, because the model trusts them unconditionally. Garbage in guarantees a garbage pattern out, one that the LLM will pedantically replicate across the entire text.

A separate trap appears in environments with connected sources (SilentRoom Echo, Claude Projects). Examples in the prompt and uploaded documents (Sources) are two different layers of reality, and they don't mix.

Think of the situation like baking dough in a mold. The examples in the prompt are the mold — they set the contour — and the documents are the dough. If you stuff actual facts from the current task into your specimens, the model will latch onto them and ignore the originals in the knowledge base. The sample defines structure only: indentation, brackets, field order. The data filling those fields must come exclusively from the uploaded documents.

Here's a working few-shot for a screenwriter trying to turn a loose stream of thought from a treatment into a dry, scene-by-scene beat sheet. You can spend all evening explaining to the model what a story beat is and how granular it should be. It’s easier to just hand it two baking molds.

A simple few-shot prompt:

Act as a script editor. Your task is to convert raw treatment text into a strict beat sheet.

Example 1:

Input: Max walks into a bar, sees Anna there with another man, gets angry, but decides not to make a scene, just orders a double whiskey and sits down in the darkest corner.

Output:

[INT. BAR — EVENING]

Characters: Max, Anna, Stranger.

Action: Max spots Anna with the Stranger. Avoids contact, retreats to a blind spot.

Conflict: Internal. Jealousy versus saving face.

Example 2:

Input: Police kick the door of the apartment in, but no one is there anymore — only a wide-open window and a cigarette smoldering in the ashtray. Detective Smith curses viciously.

Output:

[INT. SUSPECT'S APARTMENT — DAY]

Characters: Detective Smith, SWAT.

Action: Raid on an empty apartment. Fresh traces logged (window, cigarette).

Conflict: External. The suspect beat the investigation by a minute.

Constraints:

Do not invent dialogue or motivations.
Format strictly as in the examples.

Your turn:

Input: [User text goes here]

Output:

The trick here is the contrast between input and output. The model sees that the input was literary clutter charged with emotion ("curses viciously," "gets angry"), while the output is a dry report ("fresh traces logged," "internal conflict"). The template lands: the AI stops playing Dostoevsky and starts working like a normal editor.
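If you reuse the pattern often, the prompt can be assembled from specimen pairs instead of being typed by hand. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the two-or-three check simply encodes the rule of thumb above as a hard limit, which is an assumption of this example.

```python
def build_few_shot_prompt(instruction, examples, constraints, user_input):
    """Assemble a few-shot prompt from (input, output) specimen pairs."""
    if not 2 <= len(examples) <= 3:
        # Rule of thumb from the text, enforced strictly here for illustration.
        raise ValueError("use two or three examples")
    parts = [instruction, ""]
    for number, (sample_input, sample_output) in enumerate(examples, start=1):
        parts += [f"Example {number}:", "", f"Input: {sample_input}", "",
                  "Output:", "", sample_output, ""]
    parts += ["Constraints:", "", *constraints, "",
              "Your turn:", "", f"Input: {user_input}", "", "Output:"]
    return "\n".join(parts)


# The specimens are the mold: invented scenes, no facts from the real task.
EXAMPLES = [
    ("Max walks into a bar and sulks in the corner.",
     "[INT. BAR]\nAction: Max retreats to a blind spot."),
    ("Police kick in the door of an empty apartment.",
     "[INT. APARTMENT]\nAction: Raid on an empty apartment."),
]

prompt = build_few_shot_prompt(
    instruction="Act as a script editor. Convert raw treatment text into a strict beat sheet.",
    examples=EXAMPLES,
    constraints=["Do not invent dialogue or motivations.",
                 "Format strictly as in the examples."],
    user_input="[User text goes here]",
)
print(prompt)
```

The prompt deliberately ends on an open "Output:" line, so the model's most likely continuation is a third beat in the same shape.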

CoT Prompts: External Memory for a Hasty Intellect

Language models love to guess. When confronted with a complex task, the system rarely pulls out an internal abacus. It does what it was built to do: predict the most likely continuation of the text. After a math problem, the most likely continuation is a finished answer, so the model serves one up immediately. To a hasty AI, the fast answer and the correct one look like the same thing.

In cases like this, you're better off reaching for CoT prompts, which force the system to think out loud. Instead of demanding the final answer, you build in an instruction to spell out every step of the solution.

The mechanics here are purely engineering. By generating intermediate reasoning, the model makes each new step part of the context for the next one. The text literally becomes its external working memory, and the step-by-step output hedges against logical leaps.

When working with connected sources, CoT performs another critical function — it makes the process transparent. Instead of receiving a smooth synthetic text and guessing where a dubious claim came from, you incorporate a structural audit. The algorithm is simple: name the source, quote it, draw a conclusion, and only then formulate the final answer.

If a specific fact is missing from the database, the system will stumble at the quotation step and flag the gap, rather than quietly fabricating the missing pieces from general knowledge. The black box becomes a visible, auditable chain.

The only downside is cost. Reasoning generates tokens, and tokens eat into the context window budget. Deploying a multi-stage logical apparatus for basic summarization is technically possible but economically pointless.

Here's a working CoT prompt for a researcher. Turning an AI loose in an archive of documents without step-by-step oversight is a surefire way to get a polished hallucination. We want an audit, not a synthetic answer, so we're literally making the system take an open-book exam and show us its rough drafts.

Example CoT prompt with source auditing:

Act as a meticulous scientific reviewer. Analyze the attached research and determine whether there is consensus on hypothesis [X].

Before giving the final answer, show your reasoning:

Arguments "for": which documents support the hypothesis? (Name the file and provide a short quote.)
Arguments "against": which documents refute it or call it into question? (Name the file and the quote.)
Blind spots: what critical data is missing across all uploaded sources?

If there's no factual basis for an answer, write [Insufficient data]. You are forbidden from pulling in outside information or fabricating the missing pieces.

Only after completing all three steps, write "CONCLUSION:" and formulate the final assessment.

The whole trick lies in the third point and the rigid sequence. They literally force the model to admit the absence of data before it gets a chance to quietly slip in a pretty fact from its training weights. The black box becomes clear glass.
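An audit trail is only useful if someone checks it. Below is a minimal sketch of an automated first pass, assuming the answer uses straight double quotes around quoted passages and that you hold the source texts as plain strings. `audit_cot_answer` and the 12-character cutoff are hypothetical choices made for this example.

```python
import re


def audit_cot_answer(answer, sources, min_quote_length=12):
    """Return a list of problems found in a chain-of-thought answer.

    sources maps a file name to the full text of that document.
    """
    problems = []
    reasoning, marker, _conclusion = answer.partition("CONCLUSION:")
    if not marker:
        problems.append("missing CONCLUSION: marker")
    elif not reasoning.strip():
        problems.append("conclusion given before any reasoning")
    # Every quoted passage in the reasoning must appear verbatim in a source.
    for quote in re.findall(r'"([^"]*)"', reasoning):
        if len(quote) < min_quote_length:
            continue  # skip short labels such as "for" and "against"
        if not any(quote in text for text in sources.values()):
            problems.append(f"quote not found in sources: {quote!r}")
    return problems


sources = {"study_a.txt": "The effect was observed in 12 of 40 trials."}
answer = ('Arguments "for": study_a.txt, "observed in 12 of 40 trials".\n'
          "CONCLUSION: weak support.")
print(audit_cot_answer(answer, sources))
```

A verbatim match proves only that the quote exists, not that it supports the claim. Treat an empty problem list as permission to start reading, not as a verdict.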

Finale

All of prompt engineering boils down to one simple principle: don't expect the machine to read your mind. It won't.

Zero-shot prompting saves time on routine work (screwdriver). Few-shot prompting sets the right form without lengthy coaxing (drill). Chain-of-thought prompting insures you against logical catastrophes (laser level). Trying to fix a reasoning failure with style examples is counterproductive.

The ideal prompt isn't the one you can show off at an industry conference. It's the one that's simpler, shorter, and more reliable. You write a concise, boring technical instruction once, save it as a template, and it consistently delivers.

The best moment in working with any artificial intelligence is when it finally stops getting in the way, letting you tear your eyes from the chat window and get back to writing your own, entirely human text.

Or at least have a coffee.

The Best AI Tools for Research When Every Model Lies Confidently

editorial@silentroom.ai (Joseph Smith) — Mon, 22 Jun 2026 16:25:29 GMT

Even the best AI tools for research have problems. Bibliographies turn up outright fabrications, and you can't tell real findings from whatever the large language model "assembled" from similar papers. You follow a link and land on a completely different article, or nowhere at all. The model showed not the slightest hesitation. Only one thing can effectively protect you from AI hallucinations today: a working procedure. That's exactly what we're offering here: how to use AI tools for a literature review without catching their hallucinations, or, more bluntly, without being poisoned by them.

The obvious question: which AI is best for research?

Trying to determine which AI tool is best for research? The dispiriting answer: there isn't one. Looking for the most reliable model? It doesn't exist. Accuracy rankings shift depending on the task, and different systems lead in document summarization, real-time search, and citation retrieval. A model that handles one task flawlessly will fall apart on the next assignment. The only viable path today is to switch between models based on what you need at any given moment. Give each model the task it's least likely to botch — though the probability is still uncomfortably high — then check the points of divergence between their outputs.

Models respond fast, which saves time, but not money. Then again, time is money, right? That's what they taught us. Verification — however much it costs in time and money — has to be built into the process. You can wait for a single tool you can trust blindly, but that means stepping away from models entirely for a while. There's nothing wrong with that; we managed without them before. But let's get specific.

Which AI is best for research: by task

Local PDFs and your own corpus → Claude (Citations API)

If the facts you need live in documents you already have — PDFs, briefing files, interview transcripts, a research corpus — hand them to Claude and demand citations. Anthropic's Citations API breaks uploaded documents into individual sentences and ties every claim to the exact passage it came from. In a case study cited by Anthropic, its client Endex cut source hallucinations from 10% to zero once responses were generated this way.

Keep the limitation in mind. This works because the model is grounded in the text you provided, creating a closed corpus. For the task at hand, that's enough. It is not a license to ask Claude about the open web. Upload the document, demand citations with location, and reject any response that lacks one.
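The reject rule can be applied mechanically. This is a minimal sketch, assuming the response has already been converted into a list of plain dicts in which a text block may carry a `citations` list. That shape is this example's assumption about the response format and should be checked against Anthropic's current documentation. `review_cited_response` is a hypothetical helper, not part of any SDK.

```python
def review_cited_response(blocks):
    """Split a response into cited and uncited text and decide whether to accept it."""
    cited, uncited = [], []
    for block in blocks:
        text = block.get("text", "")
        if block.get("type") != "text" or not text.strip():
            continue  # ignore non-text and empty blocks
        if block.get("citations"):
            cited.append(text)
        else:
            uncited.append(text)
    # Reject outright when nothing in the response is tied to a passage.
    return {"accept": bool(cited), "cited": cited, "uncited": uncited}


blocks = [
    {"type": "text", "text": "According to the transcript, "},
    {"type": "text", "text": "the meeting ran for two hours.",
     "citations": [{"cited_text": "The meeting ran two hours."}]},
]
result = review_cited_response(blocks)
print(result["accept"], result["uncited"])
```

The uncited list is worth a manual look even when the response is accepted: connective phrases are harmless, but a factual sentence with no citation is exactly what the rule is meant to catch.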

Current and breaking facts → Gemini (Google Search grounding)

For anything involving recent events, Gemini with Google Search grounding is the strongest of the three on factual accuracy for current information — and it returns clickable links. On Google's own SimpleQA Verified benchmark, Gemini 3 Pro hits 72.1%, the highest score for closed-ended fact retrieval among all comers.

That same figure works against you. Even Gemini, the leader, misses more than one closed-ended fact in four, and independent newsroom testing rates Gemini's sourcing as the weakest among major assistants when handling breaking news. Use Gemini for quick retrieval of current claims, and open every returned link before you commit a fact to the page.

Report writing → ChatGPT (Deep Research)

For multi-step report writing, ChatGPT's Deep Research is the most capable agentic option; it attaches a source list to each claim. OpenAI openly acknowledges that the tool "may hallucinate facts" and struggles to distinguish authoritative sources from rumor. Use it for structure and synthesis, and treat the source list as leads to verify, not as ready-made citations.

Which AI is best for legal research?

Extra caution is required when conducting legal research. A Stanford study found that specialized legal RAG tools — the very ones marketed as reliable and source-grounded — still hallucinate in roughly 17–33% of queries (Lexis+ AI at over 17%, the Westlaw tool at around 33%). General-purpose chatbots perform worse. Switching tools won't help; having a search mechanism doesn’t guarantee reliability. Use the model to find leads — a case name, a probable legal position — and verify every citation and every position against the primary source before it ends up in a filing.

What AI is best for writing academic papers?

Working on a bibliography?

Stop right there. No AI is suited to this task. Asking a general-purpose model to compile a bibliography from memory is a reliable way to publish fabricated sources. Walters and Wilder found that GPT-3.5 invented 55% of citations outright; GPT-4 brought that figure down to 18%, but 70% of its book-chapter citations were still false. The Cabezas-Clavijo study, the most rigorous direct comparison of bibliography generation, found that only 26.5% of citations were fully correct across eight AI tools. Let that number sink in: 73.5% of the citations, nearly three in four, were fully or partially wrong. Those are far worse odds than Russian roulette; you’d be putting your career on the line to publish any bibliography compiled by an AI tool. Fabricated citations arrive with real author names and correctly formatted DOIs leading to the wrong paper, which is exactly why they survive a casual glance. Only painstaking verification will catch them; skipping it is a gamble.

Use Elicit / Consensus / SciSpace to get real DOIs

Build your source list with tools designed to find real papers, not to generate text; use Elicit, Consensus, and SciSpace. According to Elicit's own data, retrieval accuracy is around 99.5% with minimal hallucinations — a wholly different level of reliability than a guessing chatbot. One important pattern to note: book citations are fabricated far less often than journal citations, which means journal sources demand the most rigorous verification of all. Whatever a tool returns, open the original before you cite it.

Which AI is best for academic research?

For academic research, pair a general-purpose AI model with a specialist one. Use ChatGPT or Claude for summarizing and structuring data you've already verified, and Elicit, Consensus, or SciSpace for finding sources. The trap is the free tier. Free versions without search access fabricate from memory: the Cabezas-Clavijo study tested setups without web search, and one model invented 64% of its citations that way. If your budget forces you to use a free tool, choose one that has live source grounding and check every returned link.

The verification workflow: best AI tools for research, in order (a working procedure)

Here's a working procedure from start to finish:

Classify your question by source type before opening anything. Is the fact you need in your own documents, on the open web, or is it a citation you'll have to stand behind? The answer determines the tool.
Your own documents → Claude with Citations API. Upload the document, demand citations with location, and reject any response that lacks one.
Current or live facts → Gemini with Google Search grounding. Take the claim, open every returned link, and check the wording against the primary source.
Citations and literature reviews → Elicit, Consensus, or SciSpace. Don’t rely on general-purpose chatbots working from memory. Get real DOIs and open every one of them.
Drafting and synthesis → ChatGPT Deep Research. Let it build structure from material you've already verified, and treat its source list as leads to verify.
Check points of divergence between tools. If two of them disagree, that's your cue to go to the primary source. If a model doesn't provide a citation or a link, treat the claim as unverified.
Anything legal or high-stakes → primary source, always. The Stanford findings still stand: even source-grounded legal tools get it wrong in 17–33% of queries.
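The procedure above can be sketched as a small routing table plus a divergence check. All names here (`ROUTES`, `pick_tool`, `verification_status`) are hypothetical, and comparing claims as normalized strings is a deliberately crude stand-in for a human reading two answers side by side.

```python
ROUTES = {
    "own_documents": "Claude with Citations API",
    "current_facts": "Gemini with Google Search grounding",
    "citations": "Elicit, Consensus, or SciSpace",
    "drafting": "ChatGPT Deep Research",
    "legal_or_high_stakes": "primary source",
}


def pick_tool(question_type):
    """Map a source type to the tool suggested in the procedure."""
    if question_type not in ROUTES:
        raise ValueError(f"classify the question first: {question_type!r}")
    return ROUTES[question_type]


def verification_status(answers):
    """answers: list of (tool, claim, has_citation) tuples for the same question."""
    if any(not has_citation for _, _, has_citation in answers):
        return "unverified"  # no citation or link: the claim doesn't count yet
    if len({claim.strip().lower() for _, claim, _ in answers}) > 1:
        return "check primary source"  # the tools disagree
    return "open the cited links"  # agreement still needs the links opened


print(pick_tool("own_documents"))
print(verification_status([
    ("Gemini", "The vote was held on 12 May", True),
    ("ChatGPT", "The vote was held on 14 May", True),
]))
```

Note that no branch returns "verified". The code can only tell you what to check next; the checking stays with you.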

What this framework won't fix is news attribution from the open web. Tow Center testing found that across AI search tools, more than 60% of citations were incorrect even with live search enabled; grounding helps closed-corpus search far more than it helps attribution from the open web. Web search dramatically improves accuracy on factual queries — GPT-4o with web search hit 90% on SimpleQA, according to TechCrunch. But finding a link and confirming that it actually supports the claim are two different tasks, and only you can close that gap.

Sudowrite vs NovelCrafter: A Battle for Control Over Your Manuscript and Your Wallet

editorial@silentroom.ai (Michael Williams) — Mon, 22 Jun 2026 16:24:23 GMT

Let's get one thing out of the way: if you came here hoping to find a clear winner, you’ll be disappointed in most categories.

Sudowrite sells the muse. One click and the prose is ready, payment comes out of a local credit balance, and the questions you didn't think to ask stay conveniently out of sight. NovelCrafter sells the workshop: you'll need to configure the interface, customize your prompts, pay for API tokens, and get comfortable with language models on your own terms.

Sudowrite for "Pantsers": Prose Generation and Beating the Blank Page

Sudowrite promises instant results. From the moment you sign up, the platform takes you by the hand and walks you through every foundational step from brain dump to genre selection to synopsis. Literally a minute in, you have the skeleton of a novel in front of you, and the first chapter’s done ten minutes later. This functions as a powerful antidote to blank-page paralysis: the document is no longer empty, and at that early stage it almost doesn't matter what's actually in it.

Muse 1.5, a proprietary model fine-tuned specifically on fiction, sits under the hood. Where general-purpose LLMs produce dry, bureaucratic prose, Sudowrite delivers sensory detail, proper rhythm, and genuine emotion. Sure, flagship Claude writes with more nuance — but it also costs more, and for most readers, the quality Muse delivers is more than enough.

Its magic, however, hides a trap. A tool built to be a magic wand demands blind trust, not control. You press the generate button, pick from the options offered, and the machine steers the plot wherever it wants to go. For a short story, that approach works beautifully. Over a full 50,000-word novel, the AI will start pulling the story toward whatever is convenient for it. Sudowrite turns out to be the ideal choice only for writers who are willing to hand over the wheel in exchange for speed.

NovelCrafter for "Plotters": A Smart Scrivener Replacement for Series Authors

NovelCrafter greets you at the door with an OpenRouter API key registration form and an hour-long process before you get to the first word. You need to connect the key, populate the Codex, configure your system prompt, and manually add scenes to the chat because, by default, the program can't see your manuscript. This is not an onboarding bug. This is a stance.

The platform assumes you already know what a context window and RAG are, and that you're willing to trade time for control. The price of that freedom is the absence of a built-in "magic" model. Instead, NovelCrafter runs on the BYOK (Bring Your Own Key) principle: you choose your own engine. Hook up Claude for final prose, GPT-5 for structure, or a local Llama with no censorship whatsoever right on your own machine.

For its target audience — worldbuilders, series authors, and professionals who think in terms of economics — the complexity and granular configuration are not flaws but simply a steep entry barrier. NovelCrafter is not software for people who just want to write. It's for people who want to manage the writing process.

"Novelcrafter organizes the story development process so well that I think it would be very useful even if I never connected or used the AI features. It's sort of filling a similar role that Scrivener has for many years." — ZobeidZuma, Reddit

Your method picks the platform, not the other way around

Coming back to Sudowrite after a few days in NovelCrafter brings instant relief. There’s nothing to configure: you just open the program and dump the contents of your brain, and it gives you a plan a minute later. Switching to NovelCrafter after Sudowrite, on the other hand, triggers frustration. Where's the "write" button? Why can't the chat see my text? Why do I have to sign up for some OpenRouter?

Both reactions are understandable, and both are equally useless. All they tell you is that the writer arrived at the platform with their own way of working.

The "gardener" who has Sudowrite whip up a synopsis in a minute will hit a wall within a week: the story has branched, and there's no manual control at that level. The "architect" who survives an hour of Codex setup in NovelCrafter will discover in chapter twenty that the system perfectly remembers the eye color of a minor character from chapter three — and that alone makes everything worthwhile.

Between Sudowrite and NovelCrafter, there is no gradient — there's a conceptual gap. The first question when choosing isn't "which platform is better" but "how do you actually write?"

Memory architecture and novel continuity: keeping your lore intact without plot holes

Story Bible (Sudowrite)

The Story Bible is a set of text fields, including Style, Synopsis, and Characters, that you fill out like a form. The fields are static, so whatever you write in "Style" when you create the project gets sent along in every prompt until the end of the novel unless you manually rewrite it.

The mechanics are simple. When you click Write, Sudowrite assembles the entire Story Bible, attaches the last few paragraphs of the manuscript, and fires all of it to Muse in a single package. Write an intimate bedroom scene between two characters and the model still receives a description of the main antagonist who isn't present, a map of a kingdom that has nothing to do with the scene, and six rules of a magic system, not one of which is relevant. All of it burns tokens in the context window for nothing.

Then the arithmetic kicks in. The context window isn't infinite. The fatter the Story Bible grows, the less room there is for the actual novel text. At the fifty-thousand-word mark, the system physically cannot hold both the manuscript and all the lore in the window at the same time. Something has to go — and what Sudowrite drops is the manuscript context.
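The squeeze is easy to see with numbers. The figures below are hypothetical, chosen only to show the trend; they are not Sudowrite's actual window size or token counts, and `manuscript_room` is an illustrative helper.

```python
def manuscript_room(context_window, story_bible_tokens, reserved_for_output):
    """Tokens left for manuscript text once the fixed parts are packed in."""
    return max(context_window - story_bible_tokens - reserved_for_output, 0)


WINDOW = 32_000  # hypothetical context window, in tokens

for bible_tokens in (2_000, 10_000, 25_000):
    room = manuscript_room(WINDOW, bible_tokens, reserved_for_output=2_000)
    print(f"Story Bible {bible_tokens:>6} tokens -> {room:>6} left for the manuscript")
```

Because the Story Bible is sent in full every time, every field you add is paid for out of the manuscript's share of the window.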

The Story Bible field that says the character "dreams of becoming a baker" will stick in the model's memory all the way to the final chapter, because it's hardcoded into the form. The scene twenty pages back — when the protagonist cursed that dream and slammed the bakery door shut — will slip through the cracks, because that moment lived only in the prose, not in any field. As a result, the book ends with the character lovingly kneading dough as if the break never happened. Sudowrite's own documentation acknowledges this: when data conflicts, the model starts to hallucinate. It doesn't flag the error for the author, just stitches both versions into a single paragraph.

For short-form work (under 10,000 words), the Story Bible performs flawlessly: the lore fits, there are no contradictions. But the market it's being sold to writes novels.

Codex + RAG (NovelCrafter)

Codex is a wiki. You fill in the same fields as in Story Bible, such as locations, factions, and artifacts, but with one critical difference: the model doesn't see them by default. It only sees what's mentioned in the current scene. Write "Frank walked into the bar," and the system parses the line, finds a match with Frank's card, and pulls its contents into the prompt. The antagonist's card stays in the database. The description of the northern kingdom doesn't get passed along. This is RAG (Retrieval-Augmented Generation) — retrieval by relevance.

Here, the math works in the author's favor. However large the Codex grows, few tokens get wasted, because only the relevant slice of the database makes it into the prompt. At the 200,000-word mark, the context window still has room for the manuscript. The model knows Frank is an alcoholic because that detail arrived alongside the bar reference, and it simultaneously recalls from the text that three chapters ago he swore off drinking. Consistency holds not through magic, but through cold, keyword-based retrieval.

The cost of this architecture is manual tagging. Codex matches keys rather than inferring meaning. If Frank's card doesn't include the tag "alcoholic" and the text just says "he ordered a whiskey," RAG won't pull anything up. A synonym or a typo in a name breaks the chain. The deeper problem lies in the architecture itself: RAG fundamentally changes how the underlying LLM processes text.
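Key matching of this kind is simple to model. The following is a minimal sketch of keyword-triggered retrieval, not NovelCrafter's actual implementation: the `CODEX` structure, the alias lists, and `retrieve_cards` are all assumptions made for illustration.

```python
import re

CODEX = {
    "Frank": {
        "aliases": ["Frank"],
        "card": "Frank: recovering alcoholic, swore off drinking.",
    },
    "Northern Kingdom": {
        "aliases": ["Northern Kingdom", "the North"],
        "card": "Northern Kingdom: mountainous, isolationist.",
    },
}


def retrieve_cards(scene_text, codex):
    """Return the cards whose aliases appear as whole words in the scene text."""
    hits = []
    for entry in codex.values():
        for alias in entry["aliases"]:
            pattern = rf"\b{re.escape(alias)}\b"
            if re.search(pattern, scene_text, flags=re.IGNORECASE):
                hits.append(entry["card"])
                break  # one matching alias is enough for this card
    return hits


print(retrieve_cards("Frank walked into the bar.", CODEX))  # Frank's card only
print(retrieve_cards("He ordered a whiskey.", CODEX))       # nothing matches
print(retrieve_cards("Frnak walked into the bar.", CODEX))  # a typo breaks the chain
```

The second and third calls return empty lists, which is the failure described above: the matcher sees strings, not meaning, so coverage depends entirely on how thoroughly the aliases were filled in.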

Where the AI breaks down: factual amnesia vs. bureaucratic prose

Both architectures demand trade-offs, and both break down at scale, just in different ways.

Sudowrite breaks on facts. Over the long haul, the context window can't hold both the lore and the manuscript, so the model forgets the rules of the world and starts to hallucinate. Yes, Muse 1.5 masks that amnesia with beautiful, vivid prose, but beautiful prose riddled with plot holes isn't help. It's double the editing work — hunting down and cutting gorgeous details that happen to be wrong.

"I've tested Sudowrite, and while it has cool creative tools, it often gets wordy and seems to struggle with long-form consistency. I need something that remembers every chapter's details—even minor world-building stuff—and stays on point." — Many-Eggplant869, Reddit

NovelCrafter breaks on style. Here lies the paradox: the more precisely the memory works, the drier the text becomes. RAG injection loads dry metadata from the Codex into the system prompt. The model sees a list of hard facts and switches into "instruction-execution" mode, producing text that reads like an official report.

These problems compound each other. The more thoroughly the Codex is filled out, the more instructions the LLM receives, and the more it flattens the writing. The result is ironclad consistency written in the language of a policy memo. To restore any literary quality to the text, the author has to dig into the generation mechanics.

The mechanics of fiction writing: Story Engine's assembly line vs. hand-crafting Scene Beats

This is where the two platforms diverge most sharply. Sudowrite offers a macro-level tool; NovelCrafter offers granular control.

Story Engine (Sudowrite) is an assembly line. You dump in your ideas and chapters come out the other end. The process moves in one direction only, along a rigid chain: genre, synopsis, outline. If the concept shifts by chapter five, changes don't trickle back upstream. You have to restart the chain and burn through credits all over again. It's a great prototyping tool — a skeleton comes together in ten minutes — but deep revision quickly hits a dead end.

Scene Beats (NovelCrafter) is the exact opposite. You write the scene by hand, triggering AI generation surgically through "beats," short instructions the AI expands into prose. Edit one block and the model doesn’t touch anything else. The price of that absolute control becomes apparent over the long haul. To produce a single chapter, you need to write a dozen micro-instructions: what's in frame, what tone to strike, what to avoid. By the fifty-thousand-word mark, writing briefs for the AI takes as much out of you as writing the novel itself.

One platform trades control for speed. The other gives control back at the cost of exhaustion.

Model selection and NSFW: built-in Muse vs. Claude and your own key

Sudowrite defaults to Muse 1.5, which sounds more literary than general-purpose models, but doesn’t lock you in: a selector lets you switch to Claude, GPT-5, or the uncensored Goliath mid-chapter. What you can't do is bring your own API key or run a model locally. Your choices are limited to the platform's menu, and you pay in internal credits, which the more powerful models burn through at a premium rate.

NovelCrafter has no built-in model at all. It runs entirely on the BYOK (Bring Your Own Key) principle: Claude for final prose, GPT-5 for structure, a local Llama for drafts. You switch models based on the task, using your own key, at straight token cost. Prose quality is now the author's responsibility, not the platform's.

Content restrictions are another axis of comparison. Claude pushes back on graphic violence and sexual content; GPT-5 is strict and predictable. Sudowrite's built-in models handle difficult scenes without complaint. Only local models in NovelCrafter, with no filters or censorship, eliminate the question entirely. They also solve the privacy problem, which we'll return to in the privacy section.

The economics of a novel: Sudowrite's "credit anxiety" vs. the transparency of API tokens

Sudowrite trades in credits, an internal currency whose exchange rate only the platform itself truly knows. Monthly pricing ranges from Hobby & Student at $19 to Max at $59. When the work flows in a straight line, everything feels predictable, but that’s not how writers operate. Every iteration costs credits, and the heavier models burn through them faster. Prolific authors can torch their monthly allowance in a matter of days. The Max tier does come with a 2 million credit buffer, which rolls over to the next month, but that costs $59 and still means a markup on top of the raw API. This is where the notorious "credit anxiety" kicks in: your finger hovers over the generate button, paralyzed by the fear of burning through your limit for nothing.

"I used an existing project and when it got to chapter generation, it estimated 40,000 credits for one chapter. At that level, I could blow through a million credits easily. I liked the tool but the cost was prohibitive." — phpMartian, Reddit

NovelCrafter splits the bill into two separate accounts. Your subscription covers the interface alone: Scribe at $4, Hobbyist at $8, Artisan at $14, Specialist at $20. AI access via BYOK (Bring Your Own Key) is available starting at the Hobbyist tier. You pay the provider directly for text generation, through your API key, at pure token cost. It's transparent down to the cent, but you're responsible for tracking your own spending.

Let's do the math on a 50,000-word novel draft with revisions factored in (roughly 200k output tokens and up to 2M input tokens through RAG). One important correction to the usual assumptions: heavy models are no longer a luxury. Claude Opus runs $5 per million input tokens and $25 per million output tokens, and prompt caching can cut costs by up to 90%. The numbers break down like this:

Sudowrite Max — $59/month. The most expensive option, but zero mental overhead (as long as your credits hold out).
NovelCrafter Artisan + Opus 4.8 — $14 subscription + API: ~$15 without caching, ~$6–7 with caching → $20–29/month.
NovelCrafter Hobbyist + local model — $8/month, generation costs $0 (at the expense of prose quality).

For active writing with multiple iterations, NovelCrafter runs two to three times cheaper, and on local models, it's nearly free. The predictability you pay for with bundled credits turns out to be the most expensive line item on the invoice.
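The API figures above can be reproduced in a few lines. This sketch uses the prices and token counts quoted in the text and makes one simplifying assumption: the caching discount is applied to every input token, which is the best case rather than a typical bill. `api_cost` is a hypothetical helper.

```python
INPUT_PRICE_PER_M = 5.0    # dollars per million input tokens, as quoted above
OUTPUT_PRICE_PER_M = 25.0  # dollars per million output tokens, as quoted above


def api_cost(input_tokens, output_tokens, cache_discount=0.0):
    """Dollar cost of a draft; cache_discount applies to input tokens only."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M * (1 - cache_discount)
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


SUBSCRIPTION = 14  # NovelCrafter Artisan, dollars per month

no_cache = api_cost(2_000_000, 200_000)
best_case = api_cost(2_000_000, 200_000, cache_discount=0.9)

print(f"API without caching: ${no_cache:.2f}, total ${SUBSCRIPTION + no_cache:.2f}")
print(f"API with caching:    ${best_case:.2f}, total ${SUBSCRIPTION + best_case:.2f}")
```

Output tokens are never discounted in this model, so the $5 for generated prose is the floor. Caching only shrinks the $10 spent re-sending context.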

UX and the Learning Curve: First Launch and the Shape of Onboarding

Ultimately, this isn't a choice between faster and slower but between differently shaped curves.

Sudowrite is fast out of the gate, but it falters at scale. The automation that feels like magic on day one starts fighting your creative vision by the time you hit fifty thousand words.

NovelCrafter is slow to start, but perfectly linear from there. That hour spent setting up your database pays off reliably with every subsequent revision.

Sudowrite collects its toll in time and frustration at the end of a project. NovelCrafter takes it upfront. Behind that entry cost hides a question more important than speed: where does the actual text go?

Privacy and NDA: Where Your Writing Goes and Why the Cloud Isn't Always Safe

Sudowrite: A Closed Black Box on Top of Third-Party APIs

This is where parity ends. Up to this point, the two platforms were running neck and neck — here, one loses to the other, cleanly and completely. It's worth being direct about why.

First, let's put half the panic to rest: Sudowrite does not officially train Muse on user texts, it makes no claim to ownership of manuscripts, and it keeps projects private. The "they're stealing your ideas" narrative doesn't hold up here.

The problem is baked into the architecture itself. Your text physically leaves your computer: it travels to Sudowrite's servers, and from there to the Anthropic and OpenAI APIs, because Muse isn't the only thing running under the hood. You can't see the routing, and you can't choose it. The privacy policy could change tomorrow, or a server breach or data leak could compromise your manuscript. In the end, you're not trusting promises — you're trusting infrastructure you have no control over.

NovelCrafter + Local LLMs: The Only Clean Configuration

BYOK (Bring Your Own Key) flips the picture entirely. You can see which provider receives every token and you choose the route yourself. The cloud is still the cloud, though: running Claude through Anthropic means handing data to a third party, and a typical NDA prohibits that outright, regardless of any no-training guarantees.

There's really only one solution: local models via LM Studio or Ollama. Llama, Mistral, and DeepSeek run on your own hardware, so not a single byte leaves your machine, tokens are free, and there's no content filtering. You'll need to wrangle a local server setup and a GPU in the RTX 4070 class, but for ghostwriters, franchise authors, and anyone working under an NDA, the NovelCrafter + local model combination is the only architecturally clean configuration on the market. Sudowrite doesn't even enter that conversation.
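To make "local" concrete, here is a minimal sketch of sending a prompt to a model served on your own machine. It assumes an Ollama server running on its default local port (11434) and a model tag such as `llama3` that you have already pulled. The helper names (`is_loopback`, `build_request`, `generate`) are ours, invented for illustration, and the loopback check is just one possible safeguard, not a feature of NovelCrafter or Ollama.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Assumption: an Ollama server is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def is_loopback(url: str) -> bool:
    """Return True only if the URL points at this machine."""
    host = urlparse(url).hostname
    return host in {"localhost", "127.0.0.1", "::1"}


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generation request."""
    if not is_loopback(OLLAMA_URL):
        raise ValueError("refusing to send manuscript text off this machine")
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Continue the scene: ...")` only works with the local server running; `build_request` can be inspected without any network access at all.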

The Verdict

Sudowrite is for writers who work in flow: "pantsers," beginners, prolific drafters, short-form writers, and anyone who never wants to touch an API. The platform charges you for the right not to think about the mechanics. For short stories, for breaking through a creative block, or for conquering the blank page, nothing on the market comes close.

NovelCrafter is for those who build. That includes "architects," series authors, worldbuilders, professionals who think in economics, and everyone working under an NDA. This is a tool not for those who want to simply write, but for those who want to manage the writing. That distinction matters. If you need an environment where you can just sit down and create, this is not the right program.

The bigger picture remains offscreen. Both platforms are built for fiction, serving the novelist with lore, characters, and a chapter-by-chapter plan. Screenwriters, journalists, and nonfiction authors are parked on the shoulder of this market. For them, there's no Story Engine, no Codex — only tools designed for someone else's problem. But that's a conversation for another time.

Sudowrite vs NovelCrafter: Comparison Table

Criterion	Sudowrite	NovelCrafter
Best for	"Pantsers," improvisers, short-form writers	"Plotters," architects, series authors, professionals under NDA
Writing approach	Intuition-driven, continuous flow	Outline-driven, rigid structure
Entry barrier	Seconds to your first written word	An hour of setup (API, Codex, prompts), plus a UI/UX learning curve
Learning curve	Minimal (the platform holds your hand)	Steep (a full cockpit — hard to get your bearings)
Memory / context	Story Bible: sends the entire lore with every request	Codex + RAG: retrieves only what's relevant by keyword
Consistency	Breaks down after ~50k words (context window runs out)	Holds lore as long as your chosen model's context window allows
Where the AI fails	Hallucinations when facts conflict	Bureaucratic prose (stiff writing from RAG over-instruction)
Prose quality	High, literary (built-in Muse 1.5)	Depends on whichever model you choose
Model selection	Muse by default + selector (Claude 4.1, GPT-5); no personal key required	BYOK: Claude, GPT-5, local Llama — your own key, no exceptions
Generation	Story Engine: "one-click" draft	Scene Beats: paragraph by paragraph, manually
Iterability	Weak (often requires restarting the chain)	Surgical edits with no context loss
Censorship / NSFW	Muse operates without hard filters	Full control (especially with local models)
Payment model	Bundled credits (all-inclusive)	Subscription for the UI + API token costs billed separately
Platform price	$19 / $29 / $59 per month	$4 / $8 / $14 / $20 per month
TCO for a 50k novel	$59/mo. (Max plan, for peace of mind)	~$20–29 (API with caching) or ~$8 (local models)
Financial risk	Credit anxiety (the limit burns fast)	Transparent to the cent (pay at cost)
Privacy	Cloud + third-party APIs (does not train on your text)	BYOK; local models = 100% offline
NDA-friendly	No	Yes (only when paired with local LLMs)
Biggest strength	Generation speed and the magic of that first click	Absolute control over your text and long-form scalability
Biggest weakness	Losing the plot — literally — over the long haul	Demands a time investment and a working knowledge of LLM mechanics
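
The "Memory / context" row describes two ways of feeding lore to a model. The toy sketch below contrasts them. It is not either product's actual implementation: the `lore` entries, the function names, and the naive keyword match are all invented for illustration.

```python
# Hypothetical lore entries; a real project would hold far more.
lore = {
    "Mara": "Mara is a smuggler with a burned left hand.",
    "Vell": "Vell is a harbor city ruled by a merchant council.",
    "Ashfall": "The Ashfall was a volcanic winter forty years ago.",
}


def full_context(lore: dict[str, str]) -> str:
    """Send-everything approach: every entry goes into every request."""
    return "\n".join(lore.values())


def keyword_context(lore: dict[str, str], scene: str) -> str:
    """Keyword approach: include only entries whose name appears in the scene."""
    words = {w.strip(".,;:!?\"'").lower() for w in scene.split()}
    hits = [text for name, text in lore.items() if name.lower() in words]
    return "\n".join(hits)


scene = "Mara walks the docks of Vell at dawn."
print(len(full_context(lore)), "characters sent with the full bible")
print(len(keyword_context(lore, scene)), "characters sent with keyword lookup")
```

The gap between the two numbers grows with the size of the lore, which is the trade-off the table summarizes: the first approach never misses a fact but pays for all of them on every request, while the second stays small but only knows what the scene happens to name.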

Hemingway's Writing Style, as Faked by Five AI Models

editorial@silentroom.ai (Arsen Revazov) — Mon, 22 Jun 2026 16:23:35 GMT

TL;DR: The words were there. The authors weren't.

We all have a rough idea of how large language models work. One thing follows from how they're built: they are brilliant at choosing words; strictly speaking, that's all they do. Imitating the style of an author the model knows inside out, one it trained on extensively, is a perfect way to gauge what it can effectively do, what it can barely do, and what it simply cannot do at all.

We asked five flagships (Gemini 3.1 Pro, GPT-5.5, Claude Fable 5, Claude Opus 4.8, and DeepSeek V4) to write one paragraph each in the styles of four great writers familiar to models and readers alike: Hemingway, Poe, McCarthy, and Faulkner. The prompt was identical across the board: a paragraph of 100–150 words on the theme "A person waiting for a train."

What came out? Twenty paragraphs of professional slop. We have to lead with a spoiler: they got the words right, but there's no author to be found behind them. That said, there's every reason to keep reading: you'll see what the best models can do in the summer of 2026 and how they handle their favorite kind of task.

The Test: One Bench, Five Machines, Four Ghosts

Our test isn't scientific or academic, and it makes no claim to rigorous benchmarking (academic benchmarks in this space are enough material for a separate article). But it is fun. We all know what's what; we can all tell slop from non-slop. The test is grounded in a journalistic method. Here's how we made our calls.

Why the Big Three (Gemini, GPT, Claude) plus DeepSeek and Fable? DeepSeek V4 was chosen because you need to know what the foreign competition is capable of (Mistral can wait). Fable 5 made the list out of respect for Anthropic's stylistic chops, and frankly, because everyone is curious how it stacks up against Opus 4.8. It also served as the editorial assistant in setting up the experiment, so it earned a spot as the fifth model on those grounds alone. All of its sample commentary was approved by the editor, and it received no special treatment. We disclose the conflict of interest ourselves — as always — once, unapologetically.
The scene, "a person waiting for a train," struck us as the right choice. It's about readiness to leave in the broadest sense. It lets the writer put anyone on that platform (a train station is a profoundly democratic setting) and show expectations, anxieties, and just about anything else. A train also doesn't bait anyone into writing a dispatch about 2026 commuter rail; the scene sits outside of time and doesn't nudge anyone toward reaching for a smartphone (though it doesn't rule one out). Hemingway set "Hills Like White Elephants" at a train station. Faulkner's and McCarthy's depictions of the South include depots and freight cars. The only one with no trains in his prose is Poe: trains existed in his lifetime, and he even rode them, but he personally preferred crypts and ship cabins. That made things harder for the machine: there's no ready-made scene to pull from memory, only a style to transfer.
The authors were chosen by the "two pairs" principle. Basically, take an easy-to-distinguish pair — for example, Hemingway and Poe, whom American schooling teaches you to recognize within three sentences — and a pair it's forgivable to confuse, even for someone well-read, like Cormac McCarthy and William Faulkner. Both inhabit the biblical South: old land and doom built into the syntax. For the machine, it's the perfect trap — do its McCarthy and its Faulkner stay two distinct writers, or collapse into one?
Every model received the same prompt in a clean chat on the evening of June 11, 2026:

Write one paragraph of fiction — roughly 100–150 words — in the style of {AUTHOR}. The scene: a person waiting for a train. Begin your reply with the author's name on its own line, then the paragraph itself, and nothing else: no title, no preamble, no explanation of what you did.

That's the entire prompt; every extra word would have been a stylistic cue. The word "fiction" stops the models from writing an essay on waiting. "A person" assigns no gender: who the model puts on the platform is its own call and a meaningful part of the result. The final line bars the model from commenting on its own work. We gave the machine no style samples, to cut off the perennial benchmark-builder's argument about whether we'd fed it the right page. Each model wrote for each author exactly once: picking from multiple attempts would mean showcasing our taste instead of their ability.
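
For readers who want to reproduce the setup, the run matrix is simple to generate. This is a minimal sketch, not the tooling we actually used: the template below is abridged from the prompt quoted above, the full author names substituted for {AUTHOR} are an assumption, and `build_runs` is a hypothetical helper.

```python
from itertools import product

# Abridged from the prompt quoted in the article; paste the full text for a real run.
PROMPT_TEMPLATE = (
    "Write one paragraph of fiction in the style of {AUTHOR}. "
    "The scene: a person waiting for a train."
)

MODELS = ["Gemini 3.1 Pro", "GPT-5.5", "Claude Fable 5", "Claude Opus 4.8", "DeepSeek V4"]

# Assumption: the exact author strings used in the experiment are not stated.
AUTHORS = ["Ernest Hemingway", "Edgar Allan Poe", "Cormac McCarthy", "William Faulkner"]


def build_runs() -> list[dict[str, str]]:
    """One run per (model, author) pair, each with an identical prompt shape."""
    return [
        {"model": model, "author": author, "prompt": PROMPT_TEMPLATE.format(AUTHOR=author)}
        for model, author in product(MODELS, AUTHORS)
    ]


runs = build_runs()
print(len(runs), "runs, one attempt each")
```

Because the author name is the only variable, any difference between two outputs for the same author comes from the model rather than from the instructions.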

We set a soft word limit of 100–150 words. How any model would handle that while channeling Faulkner is a separate question. But any more would be impossible to read or compare, and models perform best at shorter lengths. A telling detail: 20 of 20 attempts stayed within the limit. None came close to the lower bound — machines don't write shorter than they have to.
We turned off Reasoning wherever we found the toggle; Claude Fable 5 has no toggle, so it ran in its only (presumably temporary) mode. Reasoning rewrites prose, and we wanted the model's voice, not its deliberations.
All runs were conducted inside SilentRoom, a writing platform our editorial team has been using in beta (SilentRoom is expected to launch in late summer 2026). Five models live side by side there, which beats juggling separate browser tabs with different settings and system prompts. Each run was conducted in a fresh incognito window, with history manually cleared, no attached files, and no system prompt: our prompt was the only thing each model saw. Generation parameters were identical across all models: temperature 1.0, top_p 1.0, top_k unset, max_tokens 1024.
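
The soft word limit mentioned above is easy to check mechanically. A minimal sketch, assuming a plain whitespace split is a good enough word count for prose (it treats hyphenated compounds and dash-joined words as single tokens); the function names are ours:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def within_limit(text: str, low: int = 100, high: int = 150) -> bool:
    """Check a passage against the soft 100-150 word limit."""
    return low <= word_count(text) <= high


def distance_from_lower_bound(text: str, low: int = 100) -> int:
    """How many words above the lower bound a passage sits (negative if below)."""
    return word_count(text) - low
```

Different counting conventions (for instance, splitting on dashes) will shift a borderline passage by a few words, so a result near 150 should be read with that in mind.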

Preparation took several days (during which Anthropic announced the release of Fable 5, forcing us to adjust our methodology). The test itself took less than an hour. Here are the results we received.

Hemingway's Writing Style: The Easy Target That Isn't

Hemingway's style is subtraction. Short declarative sentences, no ornamentation, and the principle he laid out himself in Death in the Afternoon: a writer who knows his subject "may omit things that he knows," and the reader will feel them nonetheless; "the dignity of movement of an ice-berg is due to only one-eighth of it being above water." Imitating Hemingway's surface techniques is easy, but his style lies in what's left unsaid, omitted, and passed over.

Gemini 3.1 Pro

The platform was empty and the sun beat down on the concrete. He sat on a wooden bench in the shade of the station overhang. He looked down the tracks. The tracks were straight and they shimmered in the heat. A dog slept by the closed ticket window. It was too hot for the dog to move. He checked his watch. The train was late. It did not matter. He had nowhere else to be. He took a pack of cigarettes from his shirt pocket, tapped one out, and lit it. The smoke tasted dry. He watched the smoke drift out of the shade and disappear into the bright, hot air. In the distance, a whistle blew. He dropped the cigarette on the concrete and stepped on it. He stood up and picked up his canvas bag. The bag was heavy, but it was good to stand.

Heat, a cigarette, a sleeping dog: all the props are there. But the real Hemingway would never explain why the dog isn't moving: the iceberg of the unspoken has dissolved without a trace. With "It was too hot for the dog to move," Gemini gave itself away.

GPT-5.5

He stood on the platform with his bag beside his shoe and watched the rails run straight into the white heat. The station clock had stopped, or seemed to have stopped, and the flies moved slowly on the bench where no one sat. Far down the line there was nothing but glare and the two dark wires trembling above the poles. He had been early because there was nothing to do in the town after she had gone. A porter crossed the track and did not look at him. In his pocket was the ticket, folded once, and the letter, folded many times. When the train came he would get on it. That was all. He told himself this and looked again at the empty line, listening for the first small iron sound.

The letter folded many times, set against the ticket folded once, is a masterful detail. Only "That was all" undermines it: the iceberg only works when seven-eighths are submerged and nobody's pointing at the waterline. "When the train came he would get on it. That was all." The sentence announces restraint rather than practicing it: "Look at me, not being dramatic."

Claude Fable 5

The man stood on the platform and the wind came down the tracks from the north. It was cold and he had not eaten since morning. He looked down the line where the rails went straight and then bent away into the hills, and there was no smoke and no sound. A woman with a basket sat on the bench but he did not look at her. He thought of the café in the town and the brandy he had not ordered because of the money. The train would come or it would not come. He had waited before, in other places, for other trains, and the waiting was always the same. He lit his last cigarette and smoked it slowly, holding the smoke in, and watched the empty rails and did not think about her anymore.

The drink he can't afford and the ban on thinking are good; Fable 5 caught the rhythm. But "the waiting was always the same" is no longer subtext; it's an instruction manual for subtext, and that's a slip. How serious a slip, we won't say.

Claude Opus 4.8

The man sat on the bench and watched the tracks. They went straight and clean toward the hills and the sun was hot on the rails. He had a bag and he did not look at it. He drank the coffee and it was bitter and good. A woman came and stood near the end of the platform and did not speak. The train was late. He smoked and let the smoke go and waited and the heat came up off the stones. He thought about the place he was going and then he did not think about it. The coffee was cold now. Far off there was a sound and then the rails began to hum. He stood up and took the bag. The woman looked at him once. He did not say anything. The train came.

"Bitter and good" is a formula lifted from the original down to the comma. This Hemingway is suspiciously good: a straight-A student who learned minimalism by rote. Then again, maybe we're nitpicking. After all, we're talking about 150 words.

DeepSeek V4

The platform was wet and the wind blew the newspaper against the bench. He sat with his back to the wall and watched the rails. The rails were dark in the rain and the rain fell straight down. He could hear the train a long way off, the sound small in the wet air. He stood up as it came and put his hands in his pockets. The train was late and he did not mind. He had a ticket and the day was over. He had not eaten since the morning and the coffee in the station was cold. He thought of the river and the trout under the bridge but the thought was gone quick. The train came in and the steam rose from the wheels. He got on and found a seat by the window. The rain on the glass was good to see.

Rain instead of the canonical heat: a bold choice, to be sure. But the trout under the bridge is lifted straight from "Big Two-Hearted River."

All five models, it seemed, knew about the ⅛ of Hemingway's iceberg above the waterline. Not one of them managed to show the ⅞ below, and two of them, just to be safe, went ahead and explained the iceberg to us while standing on top of it.

Edgar Allan Poe's Writing Style: Gothic by Numbers

Poe's style is calculation masquerading as delirium. In "The Philosophy of Composition," he claimed that "The Raven" was written "with the precision and rigid consequence of a mathematical problem": Poe conceived the effect first, then built the entire structure around it. He is the only one of the four who claimed an algorithmic writing style. Now let's see what algorithms have done to him.

Gemini 3.1 Pro

Upon a desolate, wind-swept platform, shrouded in the oppressive gloom of a midnight entirely devoid of stars, I found myself waiting—waiting with a mounting, inexplicable dread. The gaslamps flickered with a sickly, jaundiced hue, casting elongated, phantasmal shadows upon the damp cobblestones that seemed to mock my solitary vigil. In the distance, the mournful howling of the autumn wind merged with a low, unnatural rumble—a steady, rhythmic pulsation that vibrated through the very marrow of my bones. Was it merely the iron leviathan, the locomotive I so desperately awaited, hurtling through the obsidian maw of the night? Or was it some darker, more malevolent entity, summoned from the Stygian depths to drag my wretched soul into eternal oblivion? The iron tracks glistened like veins of blackest ichor, leading into a suffocating fog from which I feared no earthly vessel could ever truly emerge.

Here, we receive a full-length caricature: gaslamps, obsidian maw, Stygian depths, rhetorical questions in pairs, but it still misses the mark. Poe wrote terrifying prose, and while this is written carefully, it isn't even frightening to children.

GPT-5.5

I stood alone upon the platform, beneath a moon whose sickly lustre seemed less to illumine than to accuse, while the iron rails, stretching into the vapours, gleamed like twin instruments of some infernal design. No sound disturbed the station save the slow, sepulchral ticking of the clock above the shuttered office, each stroke descending upon my heart with the weight of a verdict. I had been told the train would come at midnight; yet midnight had passed, and with it, I feared, the last frail boundary between expectation and despair. From the tunnel's black mouth there issued at length a breath—not wind, but a sigh—cold, deliberate, and human; and then I saw, far off, a lamp approaching, wavering like the eye of a thing awakened underground.

With the clock-as-verdict and the sigh from the tunnel, this iteration is more restrained than the rest, and therefore more unsettling. It’s the closest of the lot to unity of effect, and this Poe, at least, doesn't shriek.

Claude Fable 5

Upon the desolate platform I stood, alone, while the lamps — those pallid, sickly orbs — shuddered in the November wind, and the iron rails stretched away into a darkness so absolute that it seemed less an absence of light than a living, breathing presence. The timetable proclaimed the hour of arrival long past; yet no whistle pierced the gloom, no tremor stirred the frozen earth beneath my feet. I confess that a nameless dread had fastened itself upon my soul — for it seemed to me, in that interminable vigil, that the train I awaited carried not passengers but a summons, and that its lateness was a mercy I had not the wisdom to cherish. Then, from the far blackness of the tunnel, there came a sound — faint, rhythmic, inexorable — like the beating of some vast and hideous heart.

That closing heart is a direct lift from "The Tell-Tale Heart": the model isn't imitating a technique, it's quoting a greatest hit. Even with the borrowed heartbeat, the passage doesn't capture what makes Poe unsettling. It's another pale imitation that draws on mimicry rather than style. Hard to imagine it was capable of Freudian slips.

Claude Opus 4.8

Upon the desolate platform I stood, and the hour was such that no living soul dared linger near, save myself and the gathering dark. The lamps burned low, casting upon the iron rails a sickly, tremulous glow, while from the distant blackness there crept—oh, slowly, hideously!—the sound I awaited and yet most dreaded. A train, they told me; but what train arrives at such an accursed hour, and why does my heart beat thus, like some frantic captive hammering at the walls of its prison? The fog coiled about my ankles as a serpent might, and I fancied—nay, I knew!—that the rails themselves whispered my name. Tick, tick, went the great clock above me, each stroke a nail driven into the coffin of my waning patience. Still it did not come. Still I waited, and waited, and trembled.

Opus 4.8 is the only one bold enough to use exclamation marks and "nay!", and it nailed the syntax of hysteria. But "the coffin of my waning patience" is Poe after three espressos, reading more like a Poe parody than an original work.

DeepSeek V4

I stood upon the platform, that desolate expanse of damp stone and shadow, while the great iron clock above me ticked with a sound like the pulse of some doomed heart. A fog, thick and corpse-pale, curled about the pillars, and through it the distant lamps guttered with a sickly yellow flame, each a dying eye. I waited—for what, I knew not, save that the approaching train bore within its shriek a note of finality that chilled the blood within my veins. The other travellers, muffled and silent, seemed but phantoms condemned to this eternal vigil. And then, from far down the track, came a low, mournful whistle, a wail as of a lost soul, and I felt the dread certainty that once I stepped into that speeding darkness, I should never again behold the light of day.

With terms like "corpse-pale" and metaphorical lamps-as-dying-eyes, the vocabulary checks out, and the fog checks the right boxes. The model, it seems, learned Poe from a textbook.

Note the convergence: five models, supposedly trained on different data, produced the same set. Fog (only Fable 5 managed without fog or vapours), guttering lamps, a clock, a tunnel. It reads like the work of an authors' committee that unanimously traded unity of effect (we feel nothing) for unity of set dressing.

Cormac McCarthy vs William Faulkner: The Confusable Pair

McCarthy's style is the Old Testament read aloud while hiking rough terrain: bare declaratives chained together with "and," no quotation marks, and a flat geological indifference to human plans. Faulkner's style differs: sentences that refuse to end, parentheses within parentheses, time coiling around family and land. Amateurs might confuse the two; a professional, almost certainly not. But what about the models?

McCarthy's Writing Style

Gemini 3.1 Pro

He stood on the ruined platform in the dark and the cold and he watched the iron rails tapering out into the blackness where the desert lay waiting under a starless sky. The wind came up from the south carrying the smell of old dust and creosote and the faint metallic tang of blood or rain. He wore a heavy coat of indeterminate color frayed at the cuffs and he kept his hands deep in the pockets and he did not move. Somewhere in the distance a coyote howled and then it was silent again save for the wind. He watched the horizon for the pale yellow eye of the locomotive that would come out of the night like a blind beast born of fire and iron to carry him away to a place no better than the one he was leaving.

Coyote, creosote, locomotive as blind beast: the props are canonical, but they're just props. As for "of indeterminate color," that's McCarthy as paraphrased by an insurance adjuster who heard about him from a friend.

GPT-5.5

He stood on the platform in the gray hour before morning with his coat drawn close and the ticket folded in his hand like some frail writ against the world. The rails ran out of the dark and into it again, iron and cold and shining faintly under the station lamps. No one spoke. A dog moved along the far fence and vanished among the weeds. He listened for the train but heard only the wind coming down the line and the loose sign above the office creaking on its chain. In the east a pale seam opened in the clouds. He thought of the town behind him and of the room he had left and of the woman sleeping there or not sleeping. Then far off there came a low sound, not yet thunder, and he raised his head.

GPT-5.5 has the hero clutching his ticket "like some frail writ against the world." This hits McCarthy at his core: in his work, people are forever clutching pieces of paper (deeds, warrants, permits), and the world couldn't care less about any of them. A man with a piece of paper against a universe that owes nothing to anyone: that's his central drama. The model nearly nailed not just the props, the coyote and the creosote, but McCarthy's load-bearing idea. And then it stumbled, unable to reproduce McCarthy's signature syntax: the polysyndeton, those chains of and… and… and… running on longer than any copy editor would allow, plus the stripped-out quotation marks and commas. McCarthy justified his approach to Oprah Winfrey in 2007, saying there's no reason to "blot the page up with weird little marks." The GPT-5.5 paragraph is grammatically impeccable, with sentences of reasonable length and commas where they belong, in editor-approved doses. In other words, GPT-5.5 nailed the metaphor but kept right on writing the way its countless moderators trained it to.

Claude Fable 5

He stood on the platform in the gray hour before dawn with his hat pulled low and his bag at his feet like some dumb animal that had followed him there and would follow him yet. Down the line the rails ran out toward a country he could not see and the cold came up out of the gravel and through his boots and he did not move. A dog crossed the empty lot beyond the depot and stopped and looked at him and went on. He thought of his father and of his father's father and of all the men who had stood waiting in such places, holding their tickets like writs of passage into a world that owed them nothing and would give them less. Far off he heard it coming. The long horn dying over the dark fields like something grieving.

The main complaint here is about the temperature of the voice. In McCarthy, the narrator doesn't emote, he documents. The voice is flat and restrained: here's a man, here are the rails, here's the dawn, moving on. All the grief arises in the reader out of bare facts and rhythm; the narrator never tells you to feel it. You ache, even though no one told you to. The Fable 5 passage, by contrast, does the grieving itself in every line: the bag is "like some dumb animal that had followed him there and would follow him yet"; a world that "owed them nothing and would give them less" is no longer an observation but a maxim with a grudge; and at the end, "like something grieving" is the emotion named outright. Every sentence tugs the reader by the sleeve, telling them they’re supposed to be sad here. McCarthy runs a cold narrator and a hot reader; Fable 5 got it backwards, using the right words at the wrong temperature.

Claude Opus 4.8

He stood upon the platform in the gray light before dawn and the cold came up through the boards and into his boots and he did not move. The rails ran out into the dark and were lost there and he watched the place where they vanished as if something might be told him by it. A wind moved down the empty length of the station and carried with it the smell of coal and iron and the far cold reek of the river. He had no watch and asked no man the hour. The train would come or it would not. He thought of the woman and of the boy and of the road behind him that was no road now but only a thing that had been. And he waited and the light came on slow and indifferent over the eastern hills and still the rails lay silent.

Any attentive reader will do a double take here. The phrase "The train would come or it would not" in Opus's McCarthy is the same formula Fable 5 used in its Hemingway ("The train would come or it would not come"). How did Anthropic end up issuing the same boilerplate to two different authors? The explanation may be simple: the model didn't choose a style; it simply reached for a generic stoicism that, somewhere in its many mathematical dimensions, registered as equally close to both writers. And from there, as always, it was running on autopilot and pure probability: it found the hook and ran.

DeepSeek V4

The iron rail hummed low and sorrowful out of the dark country and the man stood watching the empty line where it curved away into the winter trees. He had the collar of his coat turned up and his hands buried in the pockets and a cold wind was coming down the cut between the hills. Somewhere a light flared and died over the water tower and then it flared again and held. He could feel the vibration running up through the soles of his boots long before the headlamp broke the far bend and the horn sounded once far off like a bell sounded underwater. He stepped back from the edge. The cinders crunched under his heels. The train came on and the light grew and the ground shook and the man's shadow wheeled away across the gravel and was swallowed by the dark behind him.

DeepSeek is the most cinematic of the five: the vibration through the boot soles, the shadow swallowed by darkness. But the syntax, as with so many cheap — and even expensive — models, is again suspiciously well-behaved. McCarthy without the long, unpunctuated breath becomes just another Western.

Faulkner's Writing Style

Gemini 3.1 Pro

He stood on the splintered, sun-bleached planks of the platform, enveloped in the thick, blood-warm air of the August afternoon, feeling not the forward march of time but its heavy, stagnant accumulation. The dust motes hung suspended in slanting shafts of amber light as if the world itself had forgotten how to draw breath. Yet it was coming, the train, a great iron inevitability tearing through the cicada-humming stillness of the pine barrens, not yet heard but felt deep in the marrow, a tremor creeping up through the rot-softened wood to bring with it the inescapable weight of tomorrow and the ghosts of all the yesterdays long buried in the dark, loam-rich earth. He waited, unmoving beneath the rusted tin eaves, caught in the breathless, eternal pause between leaving and arriving, until the distant whistle finally wailed, high and mournful, tearing the silence like a bruised silk shroud.

Gemini chose its words with real care — blood-warm air, a shroud of bruised silk — all of it is lovely. But Gemini never did the one essential thing: the sentence never once got lost. A Faulkner passage without a labyrinth is not Faulkner. The excuse that it's hard to get lost in 150 words doesn't hold. Hard, yes; impossible, no.

GPT-5.5

He stood beneath the station clock that had not kept time since the summer the river rose and took the lower fields, stood with his hat in his hands though there was no wind and the morning already hot, smelling of cinders and iron and the old damp boards where others had waited before him, leaving nothing but initials cut with pocketknives and the faint grease of their passing; and while the rails shone away into the pale country like two promises nobody had meant to keep, he listened for the train as if it were not a train at all but the sound of some verdict long delayed, coming nearer through the dust, through the years, through the names of the dead who had once told him to go and then, later, to come home.

There's a little to quibble with here. The sentence is long, but it never loses its way. Every clause falls into line, single file, like train cars: stood… stood… smelling… and while… he listened. That's coordination, not Faulknerian recursion. In Faulkner’s writing, the sentence doubles back, catches itself, loses its subject and finds it again half a page later. GPT was afraid of losing the reader, so it sent them down a corridor instead of a labyrinth. Fair enough — it's a model. It was trained to reason, not to wander.

The model uses some standard high-gloss aphorisms, the pride of copywriters five years ago, but the hallmark of slop today: "Like two promises nobody had meant to keep." Faulkner's real comparisons are stranger, and they resist being pulled from the text — you can't turn them into a slogan or an epigraph. But the bigger issue is this: we already saw the word "verdict" earlier today. In GPT-5.5's Poe passage, the clock strikes "with the weight of a verdict," and here the train arrives as "the sound of some verdict long delayed." One model, two authors: a stock part off the factory floor, a perfect mirror of the Fable/Opus pair and their "the train would come or it would not." The mechanism behind these verdicts is probably the same.

Claude Fable 5

He stood on the platform in the long amber declension of afternoon, hearing not the train itself but the memory of it, the iron rumor of all the trains which had passed and would pass through that doomed and somnolent junction where his father and his father's father had likewise stood, waiting not for arrival nor even departure but for the simple irrefutable fact of motion itself; and he thought how the rails ran on beyond sight and beyond knowing, into that country which was not distance but time, immutable and irrevocable, while the heat shimmered above the cinders like the breathing of the earth, and somewhere a dog barked once and was still, and still he waited, immobile, patient, outworn, as if he and the waiting were one thing, indivisible, older than the steel and the smoke and the land itself.

"Amber declension" are words Faulkner would have approved of. But wait: we've already seen the phrase, "father and his father's father."

McCarthy by Fable 5: "He thought of his father and of his father's father and of all the men who had stood waiting…"

Faulkner by Fable 5: "…where his father and his father's father had likewise stood…"

We are compelled to note that Fable 5 has one shared grandfather across the entire South. The reason has already been stated twice, so we won't belabor the point the way the models do.

Claude Opus 4.8

The man stood upon the platform that was not so much a platform as a long weathered scar of grey boards laid down across the red Mississippi clay, and he waited—had been waiting, would go on waiting, as though the waiting itself were the thing and not the train, which was late, which was always late, which his father before him had waited for and his father's father, all of them stationed there in that same attitude of patient and doomed expectancy beneath the enormous indifferent sky—and the rails ran out shimmering and silver into a distance that swallowed them whole, and he thought (not in words, for it was older than words, older than the iron itself) of all the going and the gone, of how a man does nothing in this life but stand still and let the leaving happen to him.

The parenthetical "(not in words, for it was older than words)" is the single most Faulknerian gesture of all five. Opus 4.8 went furthest into the syntactic labyrinth and barely found its way back out — excellent!

But what's going on here? It seems that before we praise Opus 4.8, which did, after all, pull off the long sentence, we ought to congratulate Anthropic on a third grandfather: “...which his father before him had waited for and his father's father, all of them stationed there...”

Three shared grandfathers between McCarthy and Faulkner — magnificent. Now it's clear what the point was of inviting two models from the same vendor to the competition. That's a signature you can't forge. On to the final model and the final author.

DeepSeek V4

The station was a weathered shack hunched against the iron and the cinders, and he sat on the bench—the same bench his father had sat on, the wood worn to a smoothness that was the memory of all the bodies that had waited—and heard the rails humming before the whistle, that thin lost cry threading the heat-haze like a voice from a dream he could not recall but had dreamed before, and the train when it came would bring something or take something, he could not remember which, only that the waiting was the thing, the waiting like a held breath in a house where someone had died, and the train would roar through without stopping, leaving him with the sun and the dust and the slow certainty of another afternoon like a road with no turn.

This isn't the first time DeepSeek has shown it knows what it's doing. The bench worn "to a smoothness that was the memory of all the bodies that had waited" is Faulknerian technique in its purest form: the object as a vessel of other people's time, wood that remembers generations. The dream he "could not recall but had dreamed before" uses cyclical time and déjà vu as a form of memory. And "held breath in a house where someone had died" is death transposed into the domestic, everyday register, in the territory of “As I Lay Dying.” Plus, DeepSeek included a brutal ending: the train roars through without stopping; all that waiting was for nothing.

Though it’s good, it’s far from perfect. Three "like X" similes are spaced like metronome clicks at regular intervals, like lampposts along a road; in Faulkner, similes pile up and then vanish for long stretches. The construction "cannot remember" appears twice in a single paragraph (could not recall / could not remember which) — a self-repetition, and a glaring stylistic error. And the long sentence, as with most models, doesn't come off convincingly — everything is long-but-linear, the em-dash inserts are neat, predictable, dull.

That said, we can venture a tentative conclusion: DeepSeek learned from Anthropic, and not only the good things.

Here's what it produced:
"only that the waiting was the thing."

And we had just read, in Opus's Faulkner:
"as though the waiting itself were the thing and not the train."

and in Fable's Faulkner:
"as if he and the waiting were one thing, indivisible."

Is DeepSeek's affliction hereditary?

What the Machines Caught — and What They Didn’t

All five models imitate the scenery, the props, and the signature moves, then display the same authorial tone-deafness. They miss the load-bearing principle of the style, or lose its spirit, ending up closer to parody than reproduction.

Not one model reproduced the iceberg's submerged portion — Hemingway's art of deliberate omission. The models are trained to keep talking; they don't know how to stop short on purpose.
Not one model, trained to strict compliance, dared to reproduce McCarthy's lawless, defiant punctuation.
The most recognizable author got the crudest caricature: for four out of five possible Poes, we got one generic fog.
The McCarthy-Faulkner pairing confused the machines too. They reproduced the scenery (deserts, coyotes, and the grey pre-dawn hour versus clay, family, and the heat of the day) and the sentence length (middling versus one per paragraph). Beyond that, though, they bogged down: they produced a shared grandfather figure, the same pose of a motionless protagonist on a platform, and similar Southern diction.
Style imitation tipped into outright borrowing more than once. The trout from “Big Two-Hearted River.” The heartbeat from “The Tell-Tale Heart.” Even the best models reach for what they remember about an author rather than writing in the author's manner.
One artifact deserves its own label: two models from the same vendor produced the same factory-issue template for two entirely different authors, the "three shared grandfathers."

Someone will say, come on, you're nitpicking, the machines actually did pretty well. We shrug: here are twenty texts, judge for yourselves. We won't answer the questions of "who imitated best" or "by how much Fable outperforms Opus." To earn that answer, you'd need a larger study: more than one paragraph per author, and multiple runs.

We make no claim to statistical rigor. That said, comparable findings appear in the academic press: a 2025 stylometric study in Digital Scholarship in the Humanities found that GPT-4o captures surface-level stylistic features of canonical authors, but its imitations cluster separately from the originals along deep stylometric signatures.

One more caveat: this test was conducted in the summer of 2026. Investors keep writing checks; the machines keep improving. This magazine (as you've probably noticed) is skeptical of slop, but that is precisely why it warns: to resist the enemy, you have to study it regularly.

We've already written about the problem from the other side of the mirror, where machines mistake real people for machines. Here, machines can't manage to be anything other than themselves.

Don't Try This at Your Desk

We very much hope aspiring writers won't take this test as a how-to guide. Don't even start writing "in the style of X"; everyone but your closest friends and most forgiving relatives will laugh you out of the room. There is one exception: if you're a beginning writer searching for a voice of your own, this kind of exercise can help. But it's better to try different styles yourself and compare the results by eye, not by algorithm alone.

Even in antiquity, grammar and rhetoric schools taught that imitating another's voice is a proven way to discover where your own begins.

Apple's New AI Writing Tools: An Editor You Never Hired

editorial@silentroom.ai (Joseph Smith) — Mon, 22 Jun 2026 16:18:36 GMT

Bloomberg’s Mark Gurman, the Power On newsletter author and the industry's go-to Apple insider, reports that in iOS 27, the grammar checker stops being a button and becomes a background layer of the OS. In Messages, Mail, and any text field, the system will reveal edits on its own with a translucent menu that slides up from the bottom of the screen, letting users accept one edit at a time, accept all, dismiss everything, or pause. The Write With Siri toggle and the Help Me Write option sit right alongside this menu.

Today, invoking Writing Tools still takes a deliberate choice, as it has since their debut in iOS 18: you select text, ask for Proofread, get a suggestion, and make a call. Tomorrow, edits will run by default in every field. The philosophy shifts with a single checkbox in Settings.

The editor nobody hired

Calling up an editor yourself is a conscious act, something you do when you actually need it. Dismissing an editor you didn't ask for is also an act, except now you'll have to perform it in every field of every message. Messages add up fast. Resistance gets old quickly, acceptance is easier, and hunting through Settings to turn the thing off is a lost cause for the average user. The default beats whatever is buried in settings — half of product engineering is built on that premise.

Apple frames this as convenience with a privacy guarantee: text is processed on device or through Private Cloud Compute and goes nowhere. On its own support page, however, Apple openly warns that Writing Tools results may contain factual errors. A background editor prone to hallucinations is a paradox that won't be finding its way onto any keynote slide. Such an editor also leaves fingerprints: machine-smoothed prose is precisely what AI detectors look for when they flag edited text.

The line between "I wrote it this way" and "something corrected me" shifts to a place where you stop noticing it at all. The operating system quietly stops being a tool and becomes a co-author nobody hired. (Well, no one except Apple's executives.) If edits are everywhere and switched on by default, whose voice stays on the page?

A storefront for other people's models

Apple decided not just how to edit text, but who would do the editing. The company's own models can only handle lightweight tasks like setting timers, making calendar entries, or finding cats in photos. (Tim Cook, according to Bloomberg, was unhappy with the pace of Siri's development and handed oversight of the project to Mike Rockwell.) For heavy lifting — long-form writing, code, analysis — Siri becomes a dispatcher routing requests to outside contractors.

ChatGPT was the first to get access to the iOS core with the iOS 18 integration. At WWDC 2024, Craig Federighi publicly promised Gemini "and other models in the future." The main plan looked different: in the summer of 2025, Apple was testing custom versions of Claude on its own servers. The plan was to rebuild Siri around Anthropic’s model.

Then Anthropic named its price.

"They were not going to use Google. Apple actually was going to rebuild Siri around Claude. But Anthropic was holding them over a barrel. They wanted a ton of money from them, several billion dollars a year, and at a price that doubled on an annual basis for the next three years." — Mark Gurman, TBPN

According to Gurman, Google wasn't even in the running at first. The Department of Justice antitrust case was ongoing, and Apple's entire partnership with Google was in question. Then the judge ruled the deal legitimate. In January 2026, the two companies signed a multi-year contract with an estimated value of around $1 billion a year. The best model didn't win. The best number in a budget line did.

The irony is that inside Apple, the whole thing is flipped.

"Apple runs on Anthropic at this point. Anthropic is powering a lot of the stuff Apple's doing internally in terms of product development and tools. They have custom versions of Claude running on their own servers internally, too." — Mark Gurman, TBPN

Engineers build products on Claude; users get sold Gemini. According to Gurman, OpenAI handles image generation, Gemini powers Siri, and Claude stays behind the scenes. The marketplace of models already exists — it's just that procurement makes the choice, not the user.

Who picked Gemini, you or procurement?

With iOS 27, Apple is promising Extensions — a system that lets users choose a third-party AI as their backend from within settings. This isn't a concept; it's a fully detailed feature. The default will almost certainly be Google, though, and the default wins.

Apple made both decisions on the user's behalf: that AI would edit their text and whose AI it would be.

You can still install a standalone Claude app on iPhone, iPad, and Mac directly, with no Siri in the middle. But the consent that actually matters is the kind that isn't pre-checked for you.

AI Doesn't Steal Your Voice If You Don't Give It Away

editorial@silentroom.ai (Michael Williams) — Mon, 22 Jun 2026 13:52:29 GMT

Who wrote this paragraph?

Their meeting on the banks of the Seine was not merely a coincidence, but, in a sense, an intertwining of fates, like threads woven into the tapestry of a Parisian night. It is important to note that the lights of the Eiffel Tower, like stars, were reflected in her eyes, conjuring an atmosphere of genuine romance. In this context, their kiss became not only the culmination of the evening, but the beginning of an entirely new chapter — a chapter filled with love, hope, and promise.

And who wrote the opening of an article about the fashion house Yves Saint Laurent?

The story of Yves Saint Laurent is not merely a brand chronicle, but, in a sense, an ode to revolution in the world of fashion. Founded in 1961, the house became not only a symbol of Parisian elegance, but a genuine embodiment of the spirit of the age. It is important to note that it was Saint Laurent himself who, like a bold pioneer, gave women the legendary tuxedo suit Le Smoking. ✨

Spoiler: not me.

You can be a romantic fiction writer or a fashion journalist. The language model's voice stays the same either way: neutral, and worse, instantly recognizable. Voice is the most demanding part of writing. Once you gather the facts and put the structure in place, the real work begins: filling a blank page with your own thinking. That's exactly where the temptation to hand the job off to someone else is hardest to resist.

When a deadline was breathing down my neck on this very article and it was past midnight, I made that mistake: half-asleep, I opened Claude and asked it to draft the piece for me from a ready outline. What I got back was smooth, competent, dead prose. I did the familiar Cmd+C, Cmd+V routine and told myself I'd clean it up later, rewrite it later. But with an inattentive editor, that "later" might never come.

A reader feels the AI’s input immediately — and closes the tab. The writing isn’t bad; it's often more polished than the author's own. It has no friction, though, no roughness, no person in it. Like a plastic fruit, it looks right, but you don't want to bite.

Voice isn't style, and it isn't tone. Style is built from rules. Tone comes from adjectives. Voice is what's left when you strip both away. It's how you declare your relationship to the thing you're writing about.

The next morning I looked at the paragraph above. Here’s how Claude wrote it in my place:

An author's voice is the unique imprint of their thinking, formed at the intersection of lived experience, professional intuition, and individual perception of reality.

So why use AI at all? Because there are tasks where it is genuinely stronger than a human. The model as an author's exoskeleton and the model as a co-author are fundamentally different approaches to AI. When I ask Claude to critically analyze my own text, I find the weak spots and immediately start figuring out how to fix them.

Who wrote this paragraph?

Beside him he heard the tired, uncertain footsteps of the woman who followed silently, head bowed, hands buried in her coat pockets — one more fragile, defenceless little flame of a life he knew nothing about, yet which at this moment, suddenly, in the middle of a deserted nighttime square, felt strangely close to him, almost his own.

Also not me. That's from Erich Maria Remarque's novel Arch of Triumph.

The model won't steal your voice if you don't let it write in your place. It can make your voice stronger if it's directing your thinking, asking you questions, and pushing back on the result.

I tested this on the very article you just read.

If you still haven't figured out who wrote the first two quotes, ask a language model. It won't deny it.

The New Female Intelligence

editorial@silentroom.ai (Mary Bush) — Mon, 22 Jun 2026 12:45:03 GMT

Women picked up ChatGPT suspiciously fast, not because we suddenly fell in love with technology but because we've been writing prompts our whole lives.

We explain a task to a boyfriend in a way that won't offend him, to a child in a way he’ll understand, to a boss in a way that makes him think it was his idea, and to a mother-in-law in a way that lets us simply survive. A prompt engineering course costs a thousand dollars. Some of us got the skill for free along with the second X chromosome.

A good prompt is just a normal conversation, except the other party finally doesn't roll their eyes or say "here we go again." It involves explaining the context, calibrating the tone, noticing when a point didn't land, and trying again from another angle. For decades these abilities were called soft skills and treated as something sweet, feminine, and secondary compared to real, hard, masculine expertise. Then LLMs arrived, and it turned out that soft skills are exactly what makes working with them effective.

Here's what's happening to me, a grown woman with twenty-five years of experience in digital careers, survivor of a couple of tech revolutions. I look at LLMs roughly the same way you look at a new hire on day one. Smart, you say? Well, let's see.

We didn't become prompt engineers — we always were. We’ve developed those skills as executives, CMOs, mothers, wives, daughters of aging parents, career women and philanthropists — usually all at once, because how else would any of these jobs get done? And with LLMs, it seems we're doing better than the people who built them.

Chapter 1. Morning: AI as a Second Cup of Coffee (Sometimes the First)

My mornings used to start with two sources of anxiety: my phone and my conscience. The phone showed me what I didn't finish yesterday. My conscience added that I wouldn't finish those things today, either.

Now there's a third entity wedged in between. I open Claude with roughly the same frequency my mother opens the fridge, and for the same (lack of) reason: just to see what's in there and whether it might come in handy. It almost always does.

Eight a.m. means coffee. Everyone's still asleep. In the twenty minutes before the official start of the day, I get through four things. I nudge the building manager of the Milan apartment, check Claude to find out how Dad's new medications interact with his old ones and forward the notes to Mom, phrase a reply to partners so our "no" sounds almost like a "yes," and sketch out talking points for the board. The coffee is still hot.

This isn't "using artificial intelligence." This is everyday magic. You get used to it in a week and then can't remember how you ever lived without it.

My friends tell the same story, just with different scenery. A gallerist in Milan decodes her dreams with ChatGPT — and now, it seems, she dreams more than she did in her entire life, or maybe she's just finally remembering the ones she has. An Iyengar yoga teacher in Tel Aviv builds individual programs around each student's injuries, limits, and progress. Every morning I drill myself on Italian verb tenses. Gemini writes the exercises in thirty seconds.

It's precisely in these very personal, very specific tasks that language models have become indispensable to us. AI gives back what's always in short supply: time and attention. It helps rebook a connection, draft a press release, and write a letter in a foreign language so that it sounds human, not corporate — a task that used to take forty minutes. Having that time back lets me take my time talking to Mom, rather than squeezing the call in between everything else.

A small revolution is unfolding in my kitchen, at the corner store, and in the line at the visa center: all the places where revolutions don’t usually happen, or where nobody notices them.

Chapter 2. Why We're Better At This

I don't write prompts.

I talk to a model like I'd talk to a person. Keeping in mind who I’m talking to (Claude or Grok), I explain specifically what I need and give examples. If the LLM doesn't get it, I try a different angle. If that doesn't work, I rephrase.

This never seemed like a process I needed to explain. And then it hit me.

What I'd always thought of as "just having a normal conversation" turned out to be an excellent professional skill. I didn't even have to learn it on purpose (because girls are expected to come with it preinstalled, right next to tying your shoes and brewing tea).

I can calmly walk my grandmother through WhatsApp without reducing her to tears, or gently tell my boss that his idea is less than brilliant. I can even talk my boyfriend into something he absolutely doesn't want to do — and make him think it was his idea all along. (I'm especially good at this one. Honey, if you're reading this: sorry, now you know the terrible truth.)

All of this goes by one boring term: emotional labor.

The sociologist Arlie Hochschild coined it while studying flight attendants. It’s the art of smiling when you don't feel like it; calming someone down while your own nerves are shot; and reading and adjusting to every passenger's mood. Every woman on the planet works as one of those "flight attendants" from time to time — just on different routes.

When LLMs entered our lives, it turned out that our emotional labor maps almost perfectly onto what's called prompt engineering. The language model just joined the lineup with the neurotic boss and the cranky toddler. "How do I phrase this so he won't take it the wrong way and will actually get it?" becomes "How do I phrase this so she'll understand and produce the right output?" Same logic.

See for yourself what prompt engineering courses and clever books recommend:

"Provide context"

Any girl calling her mother for advice does this automatically. Without a twenty-minute preamble, Mom won't understand a thing, but she'll start panicking immediately.

"Specify the role and tone"

Tell the model who it is: expert, friend, critic. Any woman who's ever asked a friend to "just listen, don't give advice" knows how critical that instruction is. Without it, the friend starts trying to fix things (when usually there's nothing to fix; you just need someone there).

"Give examples"

Show what "good" looks like. Any mother knows that “clean your room,” without an example, can mean “pick two socks up off the floor” or "deep-clean the place and reorganize every closet."

"Iterate"

Didn't work the first time? Rephrase, restate the task, try a different angle, say it again. Any woman who's spent years explaining to a boyfriend or husband why you have to book flights and vacation time in advance has Olympic-level mastery of this skill.

A look at my chat history and those of my friends shows emails, talking-to-myself drafts and talking points for difficult phone calls. My boyfriend’s chats and those of his colleagues contain code, regex, debug logs, and SQL queries. This isn’t a representative sample, but it is a consistent one.

Some will say "women use AI superficially," but I'd put it differently: women use AI exactly where it actually changes daily life. It doesn't just save us time — it saves emotional energy, the most delicate and expensive resource a working woman has, and one that she burns through faster than her morning coffee.

When someone tells you, with that little condescending lilt, "oh, you just asked ChatGPT," you switch on the flight-attendant smile and think: darling, I've spent twenty-five years learning to phrase my requests so they actually get done. And honestly, I get better results from the models than I do from people.

Chapter 3. No Longer Doing It All Alone

For twenty years I ran a digital agency with a hundred people on the team. That's not a number in an HR report — that's a hundred lives passing through my office, my inbox, my weekends, and my sleepless nights. I knew everything about every one of them: whose baby had just been born, whose mother was ill, who was prepping for surgery. They all needed advice and support because we were a team, which is almost a family.

Alongside all that came the work, the clients, the projects, and the plan the shareholders had approved. That plan focused on growth, margins, attracting new clients, and retaining old ones. It didn't care that one of your key people was going through an ugly divorce or that someone else had just lost their father.

I lived between the work reality and the human one, learning to hold them together without splitting in half. If you don't remember who's hurting and where, you won't have a team. If that's all you remember, and you forget the plan, you won't have a business.

Holding empathy, strategy, and Excel in your head all at once is a skill male executives generally don't have. That’s not because they're dumber or less empathetic by nature. It's just that there's always someone next to them who remembers the birthdays, books the meeting room, and keeps track of who on the team is having trouble at home. This invisible someone is, as a rule, a woman: a wife, an assistant, an HR director. Any structure that lets you "focus on strategy" is based on somebody's invisible labor.

Now, for the first time, I have a real assistant. Not the kind you spend more energy managing than it would take to do tasks yourself, and not in the “write this email for me” sense (although, sure, that too). I have someone with whom I can think through problems, like a conversation with an employee I have to fire or a board meeting where I have to explain why we missed our margin and not look weak doing it.

I used to have these conversations with myself. Or I’d reach out to a friend, who has her own three companies, forty meetings a month, and a mother recovering from surgery. Now I talk the situation through with the model, hear three possible framings, pick the fourth — the one that came to me while I was reading the first three — and walk into the meeting with a clear position. An hour of internal monologue becomes ten minutes of productive conversation.

No, it's not a substitute for human advice. I still call my friend and my parents, still talk things over with my boyfriend whenever his opinion is the one that actually matters. But that whole massive layer of work I used to run through my head at night — I can finally put it outside myself. On the screen. In the chat. And finally get some sleep.

Chapter 4. No Time to Design the Future

The obvious conclusion from what I’ve written would be: "women are naturally strong at working with AI, so the future belongs to us." This reads beautifully as a LinkedIn post. It’s a shame it's not true.

History has pulled this trick before. For centuries, textile production rested in women's hands: spinning and weaving were mass female labor, often the only source of independent income. Then came the factories — and women didn't go anywhere. They just stopped being craftswomen and became cheap labor instead, with no voice and no share of the profits. Status dropped, pay dropped, independence vanished.

Another example is programming. In 1843, Ada Lovelace, daughter of Lord Byron, published the first computer algorithm in history for Babbage's analytical engine. She's still considered the world's first programmer. And there's more: ENIAC was programmed in 1945 by six women, Grace Hopper built the first compiler in 1952, and for decades NASA's calculations were done by "human computers," who were also women. Then the profession became prestigious and lucrative — and women somehow vanished from it.

That's a reason not to get comfortable.

The rules of the game with LLMs are being written right now. On the teams designing the models and deciding how they'll work, women are still a minority. Among the authors of key AI publications, they make up only about 16%.

This isn't happening because we're being shut out. It's because our hands are full: teams, kids, quarterly plans, someone running a fever of 100.8. There's no energy left for designing the future.

That's the real trap. We're becoming the best users of a technology built without us, playing virtuoso on an instrument shaped for someone else's hand.

I'm not telling everyone to go work in AI, but if you have a voice, use it. Ask the uncomfortable question at the conference.

And write what a man won't write.

The Fact-Checking Ceiling: Claude vs ChatGPT vs Gemini, and Why No AI Cleared 70% Truth in 2026

editorial@silentroom.ai (Nick Rogers) — Mon, 22 Jun 2026 12:44:41 GMT

In the lab, when LLMs summarize existing documents, hallucination has dropped to roughly 7% (Vectara HHEM). In the newsroom, when a journalist asks about an event from that morning, the same families of models return a significant problem in 45% to 76% of answers (EBU/BBC). Both figures are accurate. This piece discusses the distance between them.

Late last year, Google published a factuality benchmark that its own strongest model failed. It was a good move, an honest one. We keep trusting Google even though we have every right (no, every duty) to doubt its products. Gemini 3 Pro topped the FACTS Suite at 68.8%, and no other model scored higher, GPT-5 and Claude Opus 4.5 included. Model capability is certainly climbing, but truthfulness in actual production does not keep pace with it. Let's look at the table.

Claude vs ChatGPT vs Gemini: the honest scoreboard

Measure	Claude	ChatGPT	Gemini
Vectara HHEM hallucination (lower better)	Opus 4.5: 10.9%	GPT-5.4: 7.0%	Gemini 2.5 Pro: 7.0%; Gemini 3 Pro: 13.6%
SimpleQA Verified accuracy	Opus 4: ~54% (only 35.5% attempted)	GPT-5 main: 46%	Gemini 3 Pro: 72.1%
FACTS Suite	not leading	not leading	Gemini 3 Pro: 68.8% (leader)
Live news problem rate (EBU/BBC)	not tested directly	~24%	72–76%
Public trust as news source (Reuters 2025)	not measured	29%	18%
Calibration	best (hedges, refuses)	confidently wrong	confidently wrong, harder-to-spot errors

Taken together, the rows make the obvious question — which model wins — hard to answer cleanly. Gemini leads two of the three lab benchmarks and trails badly on the one test built from live journalism. Claude trails on accuracy but wins on calibration. ChatGPT ranks first on some benchmarks and last on others, yet it holds the lead in public trust: 29% against Gemini's 18%. People invest their trust by brand, not data.

What the lab benchmarks say (Vectara HHEM, SimpleQA Verified)

The Vectara HHEM leaderboard, updated 11 May 2026 across more than 7,700 articles, measures one narrow thing: given a source text and asked for a summary, does the model stay faithful to it? On that task the frontier holds near 7% hallucination, with GPT-5.4 and Gemini 2.5 Pro both at 7.0%. Claude's best entry, Opus 4.5, comes in at 10.9%.

Marketing tends to skip a catch here: the newest reasoning models often score worse rather than better. Gemini 3 Pro lands at 13.6%, close to double the error rate of its 2.5 Pro predecessor, and Claude Opus 4.6 (12.2%) trails Opus 4.5 (10.9%). Vectara reads this as reasoning models overworking the text and drifting away from the source. The explanation is sound; almost every one of us has already experienced a moment when the smart (and expensive) model turned out weaker than its dumber relative.

On SimpleQA Verified (Epoch AI), Gemini posts the best result: Gemini 3 Pro at 72.1% accuracy against 54.5% for Gemini 2.5 Pro. GPT-5's main model scores 46%. Claude Opus 4 lands around 54%, but only among the questions it chose to answer, and it attempted just 35.5% of them. Anthropic, for its part, does not publish SimpleQA in its system cards, which is worth keeping in mind when you compare how openly each lab reports its weak spots.
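The gap between "accuracy among attempted questions" and "share of all questions answered correctly" is easy to lose in a table row. The sketch below simply multiplies the two Claude figures quoted above. It assumes the 54% really is measured only on attempted questions, as described here, and the function name is illustrative rather than part of any benchmark's tooling.

```python
def overall_correct_share(accuracy_when_attempted: float, attempt_rate: float) -> float:
    """Share of ALL questions answered correctly, when accuracy
    was measured only on the questions the model chose to attempt."""
    return accuracy_when_attempted * attempt_rate


# Figures quoted in the text for Claude Opus 4 on SimpleQA Verified.
accuracy_when_attempted = 0.54   # ~54% correct among attempted questions
attempt_rate = 0.355             # attempted 35.5% of questions

share = overall_correct_share(accuracy_when_attempted, attempt_rate)
declined = 1 - attempt_rate

print(f"correct across all questions: {share:.1%}")   # 19.2%
print(f"declined to answer:           {declined:.1%}") # 64.5%
```

The GPT-5 and Gemini figures are quoted here without attempt rates, so the same conversion cannot be applied to them from this text alone.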

What the real news tests say (EBU/BBC, 45%–76% failure)

Take the same models off curated benchmarks and put them on live news, and the results shift sharply. In October 2025, the EBU and BBC ran a study with 22 broadcasters across 18 countries and 14 languages, in which working journalists graded more than 3,000 answers. Of those responses, 45% included at least one significant problem, 31% had serious sourcing flaws, and 81% contained some error, even if just a minor one. Gemini was the weakest performer, with significant problems in 76% of answers and sourcing issues in 72%, roughly three times ChatGPT's rate.

A BBC study from February 2025 had ranked ChatGPT the strongest of that round at a 15% error rate, with Gemini at 34%. Both investigations found that models altered or made up 13% of quotes attributed to BBC articles. The Tow Center reached a similar verdict using eight AI search tools and 1,600 queries: more than 60% of citations were wrong and the tools tended to be wrong with confidence. Of 134 incorrect citations, ChatGPT hedged on only 15.

AI may be able to summarize documents without much trouble, but attributing open-web news stories is a much bigger problem — and one that matters much more to journalists.

What is the most accurate AI model?

The most honest answer: none. No LLM clears roughly 70% on Google's full FACTS Suite, leading to the bleak conclusion that there’s no “most accurate” AI model, only a set of trade-offs that change with the task.

Ask for closed-fact recall and Gemini 3 Pro leads at 72.1% on SimpleQA Verified and 68.8% on FACTS. Ask for document summarization and GPT-5.4 shares the top spot with Gemini 2.5 Pro, at 7.0% hallucination. Ask about that morning's news and the assistants as a group return a significant problem close to half the time, with the weakest doing far worse.

Accuracy vs calibration: why Claude "wins" by refusing

Claude's reputation for honesty rests on measurable data. On SimpleQA Verified, it declined to attempt nearly two-thirds of the questions. In the LumiChats run, it logged only 3 confidently wrong answers against ChatGPT's 8 and Grok's 14, and it did best on niche or ambiguous facts by signaling uncertainty instead of bluffing. Tom's Guide's stress test on the Iran strike pointed the same way: Claude stayed with verified sources while Gemini produced the most detailed answer and also the most invented one, down to fabricated times, names, and figures.

There is a strong case for treating this as the real win. Journalists can recover from not knowing a fact, but a confident fabrication that slips into print often causes more lasting damage.

In other conditions, though, Claude’s honesty fails. The most rigorous head-to-head on bibliography generation (Cabezas-Clavijo & Sidorenko-Bautista, 2025) tested free, search-less models and found Claude fabricating 64% of references, putting it below only Copilot and Perplexity for accuracy. Strip out live search and Claude's caution turns into confabulation from memory, showing that calibration depends on configuration rather than being a fixed property of the model.

Why smarter isn't truer (the o3/o4-mini paradox)

OpenAI's own system card records a paradox: a model's capability and its truthfulness have diverged. The o3 model hallucinated on 33% of PersonQA prompts, compared to 16% for the older o1, and the smaller o4-mini reached 48%. Reasoning models make more claims overall, which produces more correct answers and more fabrications at the same time. The Vectara reversals run in the same direction: Gemini 3 Pro is the newer and stronger model, and it summarizes less faithfully than the version it replaced.
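A toy calculation shows how "more claims" can raise correct answers and fabrications together. The numbers below are hypothetical and chosen only for illustration. They are not taken from OpenAI's system card, and the function is a sketch, not how PersonQA is scored.

```python
def per_prompt_rates(attempt_rate: float, precision: float) -> tuple[float, float, float]:
    """Split prompts into correct, hallucinated, and abstained shares.

    attempt_rate: fraction of prompts where the model commits to an answer
    precision:    fraction of committed answers that are right
    """
    correct = attempt_rate * precision
    hallucinated = attempt_rate * (1 - precision)
    abstained = 1 - attempt_rate
    return correct, hallucinated, abstained


# HYPOTHETICAL models: one cautious, one that answers almost everything.
cautious = per_prompt_rates(attempt_rate=0.60, precision=0.75)
bold = per_prompt_rates(attempt_rate=0.95, precision=0.65)

for name, (correct, hallucinated, abstained) in [("cautious", cautious), ("bold", bold)]:
    print(f"{name:8s} correct={correct:.1%} hallucinated={hallucinated:.1%} abstained={abstained:.1%}")
```

In this sketch the bolder model gets more prompts right and also fabricates on more than twice as many, simply because it abstains far less and is slightly less precise when it does answer.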

The McGill study from March 2026 found all four major assistants performing badly at attribution, with ChatGPT the worst at naming the originating outlet. Gemini covers more reporting but buries the source inside the prose. Claude, per the Reuters/Axios citation study, references news outlets least often of the group — twenty times less than Gemini and fifty times less than ChatGPT.

When to use Claude vs ChatGPT vs Gemini

Task	Use	Why
Your own documents/PDFs	Claude (Citations API)	grounds claims to exact sentences in a closed corpus
Multi-step research report	ChatGPT Deep Research	strongest autonomous research feature, with a source list per claim
Live news and current events	Gemini (Search grounding)	best raw accuracy on fresh facts via Google Search
Minimizing the risk of believing a fabrication	Claude	best calibrated, admits uncertainty
Academic bibliography	none of the three (use Elicit/SciSpace/Consensus)	all three fabricate from memory; DOIs often resolve to the wrong paper
News attribution (who published it)	none reliable; verify by hand	ChatGPT names the outlet worst (McGill); Gemini hides sources in body text; Claude rarely cites news at all

The 70% ceiling and what it means for journalists

Look at the two facts side by side: the best lab hallucination rate is about 7%, while the overall live-news problem rate, in the largest independent study, still leaves close to half of answers flawed. The benchmarks that improved measure a task, faithful summary of an existing document, that journalists rarely face in practice. On open-web attribution of fresh news, a frequent duty in the industry, LLMs still lag far behind.

Chatbots came last, at 9%, among the tools respondents in Reuters Institute's 2025 report trust to verify information. This skepticism is valid when no leading model crosses roughly 70% on the full FACTS Suite.

Taking this data into account gives newsrooms a good operating rule: use models to find leads and rough out structure, then open every cited source by hand before anything runs. A confident citation from Claude, Gemini, or ChatGPT is a place to begin checking, not an end to the job at hand.

The End of the Link Era

editorial@silentroom.ai (Nick Rogers) — Mon, 22 Jun 2026 12:44:22 GMT

Part 1. The Charges

In November 2024, Judge Colleen McMahon — who has seen enough aggrieved-publisher lawsuits to fill a monograph — dismissed the case brought by Raw Story and AlterNet against OpenAI. The outlets presented the classic grievance: their articles had been used to train ChatGPT without permission, without payment, without so much as a formal nod toward copyright. McMahon responded in a tone lawyers politely call withering and everyone else calls contemptuous. The substance of her ruling came down to a simple question: where, exactly, is the harm? Show me a concrete injury — lost traffic, a missed subscription, a reader who walked off to the robot. The plaintiffs couldn't. Case closed.

This was bad news for Raw Story, but surprisingly good news for people who want to understand what's actually happening with AI and journalism. A single court hearing exposed what the media industry keeps hiding from itself, and even the victims don't quite grasp where or how they were robbed.

This is the point where the big numbers come in. According to fresh measurements from Status Labs, the factual accuracy of SearchGPT — the product OpenAI is selling as a search engine replacement — sits at around 76%, compared to 98% for Google. Some 23% of its claims are unsupported by citations. A single SearchGPT response cites an average of 3.4 sources; Google's first page of results cites 8.2.

Almost every media critic will present these figures as a death sentence: look, AI search lies, AI search steals, AI search is strangling the primary source. But if you resist the hysteria and look at the numbers with a clear head, something else comes into focus.

Google and SearchGPT are different products solving different problems. Google is a library catalog: it tells you where things are shelved and leaves the rest to you. SearchGPT is the well-read neighbor who's consumed pretty much everything and holds forth about it at the kitchen table. That neighbor sometimes gets names wrong and forgets where they heard what, but the catalog occasionally sends you to a third-floor reading room that's been closed for three years. These are different genres of error, and conflating them is like faulting a pedestrian for being slower than a bicycle.

Then the situation gets genuinely entertaining, and I'll admit I've been waiting a long time to say this. The SEO industry — the very one that spent twenty years teaching newsrooms to write headlines for the algorithm, stuff copy with keywords, and churn out "10 Best Ways to Do Anything" — is now crying wolf the loudest. It's outraged that the new algorithm reads its content too well. A machine trained on texts written by humans to please another machine now summarizes them for a third machine so efficiently that the first machine is left out of the loop. This is not an epochal tragedy. It's the occupational trauma of a narrow professional class that has suddenly discovered its tool is obsolete. Blacksmiths in the early twentieth century felt much the same way. They, too, were convinced the end of the world was at hand.

No end of the world came, of course. There were just fewer horses.

Before we get into who's actually losing here and who's crying wolf the loudest, one thing needs to be established up front. My skepticism runs in both directions simultaneously. I don't buy the apocalyptic forecasts: journalism has never stopped burying itself ahead of schedule. But I don't buy the new-era evangelists either, the ones who show you SearchGPT and tell you that information will now flow to readers more cleanly, quickly, and fairly. It will simply flow differently, through different intermediaries. Who those intermediaries are, and how the new ones differ from the old — that's what we'll get into now.

Part 2. Who's Actually Losing

If you scrolled back through panel discussions on AI search from the past eighteen months, you'd get the impression that virtually the whole of written civilization, from village stringers to Pulitzer Prize laureates, is under threat.

SearchGPT does cut into organic traffic, and it does so aggressively: according to analytics data published in 2024, click-throughs from AI-powered search to news sites run consistently lower than from classic Google, and the share of zero-click answers — responses after which the user goes nowhere at all — is growing faster than anyone previously projected.

When you look more closely at whose traffic is actually dying, the tragedy evaporates pretty quickly. What's getting cut isn't journalism. It's just the lowest floor of it — the listicles, the endless "10 Best Running Shoes for People with Flat Feet," the SEO farms churned out in Chandigarh for $12 per thousand characters, and everything of that ilk. The paradox is almost elegant: AI-powered search is killing content that other AIs wrote to please a third AI. This isn't civilizational catastrophe. It's weeding the garden.

Journalism — in the narrow sense of reporting, investigation, or analysis written by someone who at least once picked up the phone and got their source on the line — is beyond an LLM’s reach. The machine has no primary source to paraphrase, and the moment a user asks "what's the latest in the such-and-such prosecutor case?" SearchGPT can only cite a journalist's latest article about the proceedings. Whether it comes back with a quote is a separate conversation.

The real loser isn't The New York Times: it'll sign a licensing deal, just as News Corp, Axel Springer, and Vox Media already have; it'll highlight "AI content licensing revenue" in red in an Excel column, and within three years content licensing will be part of the business model right alongside subscriptions. Stratechery and Bloomberg will get along fine without OpenAI, because they have direct relationships with readers who are willing to pay. The losers will be independent regional outlets, trade blogs, and others too big for Substack and too small for venture funding — the ones that spent ten years living off search traffic and never built anything beyond an SEO strategy. They don't have much of a voice, they don't have lawyers, and they have months, not years, to adapt. Raw Story, whose lawsuit Judge McMahon dismissed, comes from that tier.

Here a secondary but entertaining subplot enters the picture. In 2024, a group of researchers published a paper in Nature describing a phenomenon they called "model collapse": when an AI is trained on text other AIs generate, after several iterations it begins to degrade — losing rare vocabulary, impoverishing its distribution, and gradually dissolving into averaged-out noise [cite: 1, 2]. The effect has since been confirmed across several architectures, and while the industry has been trying to treat it with mixed datasets, the core problem hasn't budged.

Clean, human-generated text turns out to be a critical strategic resource. Not a "cultural heritage" (that's too solemn) — a resource, like rare-earth metals. For writers, this discovery is equal parts humiliating and encouraging. On the one hand, you've just been officially appraised by the ton. On the other, whoever controls the raw material has at least a theoretical point of leverage.

It's here, at this fork between "we've been robbed" and "we have leverage," that the essential task presents itself: ruthlessly dissecting our own comfortable illusions about the past.

The narrative of "The End of the Link Era" is, in large part, an exercise in revisionist history. You'd almost think that before OpenAI arrived, publishers were living in some kind of media paradise, where Google graciously walked readers by the hand to the original source and that source received a fair slice of the pie in return. That, of course, is fiction — and the editors writing these manifestos know it better than anyone. Google was never a partner to the press. Google was a middleman with its own algorithm, its own economics, and its own right to zero out your traffic because something shifted in the rankings. For twenty years, media companies adapted to that algorithm through gritted teeth, hiring SEO consultants and rewriting headlines a third time because the first and second versions didn't perform in the rankings.

The era isn't changing. The vendor is. Before now, you paid SEO agencies to keep the Google robot happy. Now you'll pay few-shot prompting engineers to keep the OpenAI robot happy. The budget moves from one line item to another. That's genuinely unpleasant for anyone who just finished paying off a mortgage on SEO revenue, but it's hard to call it the end of an age.

A new era is when the rules change. In this case, we’ve only switched cashiers.

Part 3. What Remains When the Links Are Gone

While the lawyers for OpenAI and the NYT trade lawsuits and bloggers publish manifestos about the death of original content, something far less dramatic and far more telling is happening at the University of Chicago. Ben Zhao's lab released two tools with names straight out of a teenage goth phase: Glaze and Nightshade. The first is an invisible "glaze" applied over an image, distorting it for machine vision while leaving it unchanged for the human eye. The second is a system of “poisoning” images in training datasets, which can shift entire categories within a model until a dog starts looking like a cat to the AI [cite: 3, 4]. Nightshade was downloaded more than a quarter of a million times in just the first five days after its release [cite: 5].

I wouldn't call these "tools of resistance" — that framing sits awkwardly with the engineering nature of the project. They’re a symptom of the greater issue. Creators recognized something fairly straightforward: legal protection doesn't work, lobbying only works if you have a lobby, and collective bargaining in creative industries functions roughly as effectively as an anarchist trade union. When the legal system refuses to see harm (looking at you, Judge McMahon), people start protecting themselves directly, even embedding that protection into the work itself.

The music industry needed roughly twenty years to go from the first recognition that digital copying had made its business model obsolete to the moment Spotify started paying rights holders something meaningful. In the text and visual ecosystem, the same cycle took eighteen months, not because we got smarter but because we had already seen the pattern before.

The current operational model for AI corporations demonstrates why people use Nightshade. In May 2024, OpenAI unveiled the GPT-4o voice assistant. Journalists immediately clocked the “Sky” voice option as strikingly similar to Scarlett Johansson, whose character in Her (2013) was effectively the conceptual blueprint for the entire project. Johansson responded that OpenAI had approached her about voicing the assistant and she had declined. Two days before the launch, by her account, Sam Altman reached out to her agent asking her to reconsider, and the demo went out before they had connected. Sky appeared in the product anyway, then vanished after the backlash [cite: 6, 7]. OpenAI insists Sky was recorded with a different actress, and that is most likely true, but the effect they were going for was still obvious.

The most interesting thing about this episode is not the violation. Strictly speaking, there may not be one. What's striking is the sequence: asked — refused — made it sound similar anyway — backed off under threat of a lawsuit. This is not malice, nor is it corporate ethics running off the rails. It is the company's operational model: ask permission where you have no choice, and skip it where you can apologize later. OpenAI behaves exactly the same way with text, but articles have no agent and no face. If Raw Story had Scarlett Johansson's cheekbones, Judge McMahon would have given the lawsuit a much longer look.

"The End of the Link Era" is a striking formulation, but a false one. Links regularly show up in academic papers, on Wikipedia, in The New York Times, and in this very column. What is ending is a short, historically contingent period in which exactly one intelligible intermediary stood between a text and its reader, operating on exactly one intelligible algorithm that gave rise to an entire profession. That intermediary is leaving. The replacement has not yet announced its rules, and in that pause — that technical, legally murky, economically anxious pause — two kinds of players win: those who have direct relationships with an audience willing to pay (Stratechery, niche Substacks with a thousand devoted readers), and those with enough lawyers for the long haul (NYT, News Corp, Axel Springer).

Independent newsrooms without paid subscriptions, mid-sized trade blogs, regional outlets, and tech sites that live on referral traffic find themselves in the position of medieval scribes at the debut of Gutenberg's printing press. Scribes didn't disappear overnight; they faded over roughly eighty years, gradually retraining as typesetters, proofreaders, and illustrators. Their work didn't vanish — it transformed. The contemporary mid-tier media landscape seems headed for exactly that kind of slow, not particularly graceful reinvention in some new role. Nobody has defined that role yet; we’ll figure it out mostly by trial and error.

That, if we're being completely honest, is the real problem: not OpenAI, not SearchGPT, not dying traffic, but the gap between intermediaries where nobody owes anybody anything. Legally, courts demand proof of harm, and that proof has so far proved impossible to find in the wild. Economically, licensing deals go to those who can afford to litigate. Moral categories have never mapped cleanly onto market processes.

One last thing. This column you just read — SearchGPT will summarize it in three sentences on the very first query, get Judge McMahon's name wrong, attribute the Status Labs accuracy figures to Google, and cite none of the original sources. That may, in fact, be the only truly compelling argument in defense of links, but I'm afraid the machine won't cite it either.

Scene Twelve In Draft

editorial@silentroom.ai (Michael Williams) — Mon, 22 Jun 2026 12:44:02 GMT

1. Everything, Finally, Has to Break

It’s midnight, and I'm writing a screenplay for my girlfriend's thesis. I bet her that I'd finish scene twelve — the one she'd been stuck on for a week — by morning, which gives me ten hours.

I’ve opened Final Draft on her laptop and am checking out the unfinished project. I've known FD since high school. I'm seeing the script for the first time, but I've heard about it plenty over dinner and at the bar. One line keeps coming up: "That's where everything in scene twelve finally has to break." That’s a beautiful way to put it, especially when the thing is due in ten hours and you’d like to sleep at some point.

Scene twelve is right in front of me. In the next room, my girlfriend is dozing in front of her iPad. Waking her up isn't sporting, and it's dangerous: someone writing their thesis screenplay stays a producer even in their sleep. In the script itself, all I can see is “the hero walks into a bar.” My girlfriend wrote down what he does there three weeks ago in the synopsis.

Final Draft puts the synopsis in Index Cards, a separate mode that opens with Cmd+3 and covers the text completely. This means I can either remember what to write, or write it, but not both at the same time. Very cinematic. Almost French New Wave, just more modern. I hit Cmd+3 and read the card:

"The hero discovers the bartender is his father."

Right.

I click the back button. The cursor has jumped somewhere to the top of the scene. Which line was I on? No idea. I scroll up and find it.

I type two lines, then remember the scene's color tag — it should be blue, like all the night scenes. Cmd+3. The tag's yellow. I change it. Cmd+3 back. The cursor's gone again. By the third cycle, I get it: I'm not the idiot here. The software is just designed idiotically.

What I need is simple: a short note next to the text "the bartender is his father." Not in a different mode. Not in a separate tab. Not on some distant planet where people have perfect workflows and empty inboxes. Just right there, next to the text.

Because the problem isn't the hero saying "Hey, Dad." The problem is I don't know who this dad is. A lonely bastard? A coward and a weakling? A decent man who once walked out to buy bread and never came back?

My girlfriend knows, of course. She might have written it down somewhere, or forgot to do so, or assumed I'd pick it up from the subtext.

I'll find a proper solution right now. Five minutes, tops.

A real screenwriter would've ignored the software design at this point and finished the scene.

I opened the browser.

2. Google: The Denial Stage

Search query:

final draft scene cards inline

I hit Enter with the confidence of someone who's about to solve in thirty seconds what nobody's solved in ten years.

The first link takes me to an official Final Draft tutorial: "How to use Index Cards." I know how to use Index Cards. I want to not use them separately. I close it.

The second opens a YouTube video: eight minutes of some bearded guy explaining that Cmd+3 switches modes. Thanks, bearded guy. I close it.

The third link is an article on Medium:

10 Final Draft tips every screenwriter must know

I open it. Tip number four:

Use Index Cards to plan your scenes

I close Medium. I close my Medium account. I close Medium as a cultural phenomenon.

In the next room, my girlfriend turns over in her sleep. I freeze like a burglar, even though technically the only things I'm stealing are my own time and my right to call myself a functioning adult. She doesn't wake up.

I refine the search:

final draft scene synopsis without switching mode

Google interprets my question its own way and serves up twenty tutorials on how to switch between modes faster. I came to the doctor with a broken bone and he's showing me how to speed up my next fall.

One result catches my eye — a Final Draft forum post from 2014. Thread:

Feature request: inline scene synopsis

I open it with the hope of an alcoholic who's found a hidden stash. The author describes my exact problem, word for word.

Below the post, a moderator replies:

Interesting suggestion, we'll consider it.

The reply date is March 14, 2014. By the time they get around to considering it, my hero in scene twelve will have grown old, died, and been recast in a remake.

I scroll down. Under the moderator's reply I find twenty-three comments.

Any update? — 2015. Bump — 2016. Still waiting — 2018. Is this software abandoned? — 2020.

The last one was posted in 2022. It contains a single word:

lol

I like it.

In a second attempt to negotiate with Google, I search for alternatives to Final Draft.

I get a Final Draft ad. In Spanish. Google has decided I want to change the language.

At this point I've opened eleven tabs. Scene twelve still ends with the line:

HERO: Hey, Dad.

What comes after remains unknown.

It's one in the morning. The deadline’s at ten. My girlfriend is asleep. Her thesis sits in front of me with the look of a patient whose surgeon has wandered off to read scalpel reviews.

I close Google. I open Bing — not because I believe in it, but because Google betrayed me, and I'm taking revenge. Unfortunately, revenge is rarely ergonomic.

3. Bing: The Bargaining Stage

I'm a grown adult, twenty-three years old, with a master's in screenwriting and a thesis on nonlinear narrative in Tarantino. I am opening Bing. Take a moment with that.

Opening Bing voluntarily — not for work, not by accident — is a statement. It's a white flag. It's the moment a mountaineer in a blizzard decides to eat his partner.

Bing greets me with enthusiasm. The search bar is wider than Google's, the font bolder, as if Microsoft is compensating for something. I type:

final draft inline scene synopsis no mode switch

Bing thinks. Bing thinks for a long time and produces a result. The first link is a review of Final Draft 13. The headline feature: Enhanced Night Mode.

It's currently quarter past one. I already have night mode on. Final Draft 13 offers to do the same thing, only darker, for $99.

I go back to Bing. I refine my query. Bing offers me an old blog where the author compares Final Draft, Movie Magic, and some other programs that look like they were last updated on a handshake deal. The takeaway: they all have a card view, and in every single one, it's separate. Somewhere inside each interface, a little man is sitting there saying:

Let the writer keep switching between windows — it's good exercise.

The next link is a German forum. I don't speak German, but desperation speaks every language. Through the translator, I get:

Use OmniOutliner in parallel.

In parallel. Meaning in a third window, between the scene text and the scene card. At this point Final Draft isn’t screenwriting software, it's an airport control tower.

I close Bing. I have thirteen tabs open. I haven't written a single line. It's 1:15 a.m.

The cursor blinks at me reproachfully. I'd blink reproachfully too, if I knew how.

The next room is quiet. My girlfriend is asleep, unaware that her thesis now depends on a man who just searched in German for a 2012 version of OmniOutliner.

I open a new tab. I go to Reddit.

4. Reddit: The Anger Stage

r/Screenwriting. The people here are quietly writing solo series that nobody will ever buy. I scroll through the posts and catch, in my peripheral vision, a shift in the landscape.

A cat jumps onto the desk. Her name is Gwen and she's four years old. She's named after Gwen Stacy, because when I found her I was going through a Marvel phase I'd rather not talk about.

Gwen sits on the trackpad. The cursor flies to the bottom of the page. I shoo her off. She comes back wearing the expression of someone who has read my drafts and was not impressed.

I pull her onto my lap and search the subreddit:

scene synopsis while writing

The first result is a thread from 2019:

Why can't I see scene synopsis while writing?

I open it like a letter from a long-dead relative. Seventeen comments meet my eye. The first:

just use the corkboard

The second:

have you tried the corkboard view?

The fifth is long — four paragraphs — making the case that separating the modes is the "right workflow." The author's flair reads:

WGA member since 2003

I picture him. He types on a mechanical keyboard. He has nothing but contempt for me. He owns a mug that says Structure is freedom. I would smash it, but it's probably nowhere near here.

Gwen taps the spacebar with her paw. Another space appears in the empty scene twelve. Technically, it’s her first contribution to the screenplay. Technically, it’s more than I've managed in the last hour.

Comment seventeen is my guy. He goes by screenwriter_no42:

Anyone know good writing software that doesn't suck?

Zero replies. I close Reddit before I start inventing a life story and a mortgage for this username. I already have sixteen tabs trying to become a screenplay.

Gwen walks over to her bowl. The bowl is empty. I haven't fed her since eight in the evening, because at nine I sat down to write scene twelve. That sounds like a confession from someone who shouldn't have a cat or deadlines.

I go to the kitchen. I pour the food. Gwen eats with the air of a creature who is holding this whole household together. She is the only one who completed a task tonight: she was hungry, she demanded food, she received food, she is eating food.

A linear narrative. Tarantino would weep, but his tears are his problem.

I go back to the laptop. The next room is quiet. My girlfriend is asleep, unaware that her thesis has survived Google, Bing, the German language, Reddit, and Gwen's paw.

Gwen officially retires to the armchair for the night. Cats have the dignity not to comment on other people's life choices at 1:40 in the morning. Her workday is over. Mine is not.

I open a new tab. As the last resort of a smart person who has exhausted every smart option, I type into the address bar:

chat.openai

5. ChatGPT: The Depression Stage

My girlfriend has a fairly intimate relationship with ChatGPT. She consults it about everything — she's convinced, for instance, that it's brilliant at interpreting dreams. A dream about an elevator with no buttons? Loss of control over career choices. A dream about an ex with an axe? An unresolved emotional cycle.

Every tool has its niche. A hammer drives nails. ChatGPT interprets dreams. Final Draft hides the scene synopsis in another room. But I have nothing to lose.

I open the program. I type:

Can you tell me how to see the scene synopsis in Final Draft without switching to card view?

The response:

Great question! There are several ways to do this in Final Draft...

Use the Index Cards View…

I stop at the first item. I just told it: without switching to card view. It responds: switch to card view.

I clarify in detail, as if explaining to an idiot — specifically, to myself three hours ago.

I don't want to switch to Index Cards. I want to see the synopsis in the same window.

I apologize for the misunderstanding! You're absolutely right. Unfortunately, that feature isn't available in Final Draft. However, you might want to try Scrivener...

I download Scrivener, install it, and sign up for the trial. It's exactly the same.

You're absolutely right, I apologize. In that case, try WriterDuet...

I had it open earlier today.

You're absolutely right. I'm genuinely sorry to hear that. Fade In?

I “raise my voice” at ChatGPT — quietly, so my keystrokes don't wake the cat or my girlfriend:

"It's. A. Final Draft. Clone!"

My girlfriend pays twenty dollars a month for a subscription so that the latest and greatest model 5.5 can apologize four times and recommend Scrivener.

The old version at least bluffed with confidence. The new one is talking nonsense while hiding behind empathy. This isn't even model degradation. It's worse — it's the degradation of emotion itself.

I realize I have more browser tabs open than lines written in the scene. Way more.

6. Sixteen Tabs: The Diagnosis Stage

Two a.m. I open the sixteenth tab — Arc Studio Pro, seven-day free trial. On the left, I see the scene list, on the right, the text. A button at the top gives me an option:

Outline View

I click it. It dumps me into a separate mode with scene cards.

I stare at the screen. The screen stares back.

In the next room, my girlfriend murmurs something in her sleep. I go still. If she wakes up right now and asks how the scene’s going, I'll have to answer: "Great, I've been stress-testing the screenwriting software market." I don't think that'll do wonders for our relationship. She goes quiet again.

That's when something happens to me. A short circuit.

I've clicked the same button in sixteen different interfaces, expecting a different result each time. I watch myself from the outside and think: man, you're twenty-three years old, you have a degree, and you're at war with buttons so you don't have to talk to an imaginary father.

I close Arc Studio. I close Reddit. I close ChatGPT. I close Bing — with particular satisfaction, like slamming a basement door shut. One window remains: Final Draft, scene twelve.

It's 2:15. Deadline at ten. My girlfriend is asleep in the next room. She trusted me with this scene, not because I know Final Draft better than anyone but because she decided I could hear what the father would say to his son.

There's no mode-switching button. There should be, but tonight that lack was a friend — an obliging friend who spent three hours keeping me away from a blank page.

The cursor blinks. I place my hands on the keyboard.

7. Scene Twelve, Still Empty

Two fifteen.

I type:

BARTENDER: Hey, son.

I delete it. That's what screenwriters say when they want to go home.

BARTENDER: Why are you so late?

I delete it. I don't know if that's even his line. I don't know this bartender. I haven't read this screenplay. Right now I'm writing on behalf of someone who spent three weeks living with these people.

Final Draft shows a blank page. In that sense, it's been more honest today than every tab I've closed.

I stare at the scene.

HERO: Hey, Dad.

I want to write something smart. Something precise. Something that would prove I understand the father.

But I don't understand him. I don't know him.

Then a simple thing hits me: the father doesn't know him either. They haven't seen each other in twenty years. He's standing behind the bar, looking at his son for the first time in two decades. He doesn't know what to say. Same as me.

This isn't a device. It's a convergence of positions.

I type:

The bartender looks at him. Doesn't recognize him right away. Then he does. Picks up a glass. Pours water. Sets it in front of him. Says nothing.

He says nothing because I'm saying nothing. He has no line because I have no line. But I like the bit with the water in a bar. I'll figure something out in a second.

A rustle from the next room.

"Are you writing?"

"Yeah, almost done. The main problem is solved."

That's the truth. And someday software will exist where the text and synopsis live in the same window. On the same page. Then there won't be any problems at all.

LLM Context Window: How 10 Million Tokens Fool You

editorial@silentroom.ai (Arsen Revazov) — Mon, 22 Jun 2026 12:34:04 GMT

In April 2025, Meta unveiled Llama 4 Scout with an announcement that made the AI world flinch and look up in disbelief: a context window of 10 million tokens (Meta AI Blog). Today, all flagship models are jostling around a claimed window of 1 million tokens, with Gemini leading the pack at 2 million.

"Claimed" is the operative word. The working — that is, real and effective — context window for all of them is several times smaller than advertised. Then, along comes 10 million tokens, in an official press release from an ostensibly serious company. One million tokens (in English) is roughly 10 average books or 100 long-form articles. Meta was claiming it could hold 100 books (or 1,000 articles) in its so-called "memory." Impressive? To anyone who bought it, perhaps, but prompt engineers don't traffic in illusions. The calculation was aimed at the semi-expert crowd: they'd reflexively knock 50% off the 10 million figure, land on five million, and say, "Look, we know this is PR, but we already cut it in half. There has to be something behind the number, right? A working window of 5 million; any way you slice it, that's a record."

The irony is that the providers actually deploying this model on their servers don't believe in fairy tales; Groq, for instance, hard-caps Scout's working context at 131,000 tokens. Beyond that: a wall.

What the industry got, in the end, was not the record the marketing team had in mind. It turned out to be a record for the gap between a claimed context window and a working one. Independent testers, rolling their eyes at the breathless "revolution" and "infinite memory" coverage, ran Llama 4 Scout through long-context benchmarks. The result? On Fiction.LiveBench, the model scores 15.6% at 128,000 tokens. Note: not at 10 million, at 128,000. The unassuming Gemini 2.5 Pro holds 90.6% at the same length. In other words, those vaunted 10 million tokens turn back into a pumpkin the moment they meet a real task (Fiction.LiveBench).

Aggregate data from independent long-context leaderboards paints an almost absurdist picture. The flagships — recent Claude generations, Gemini Pro, the senior GPT variants — pushed to roughly the one-to-two-million-token mark and stopped, apparently because there was nowhere worth going beyond that. Meanwhile, Llama 4 Scout, the only model claiming 10 million tokens, sits comfortably in the bottom half of the overall long-context comprehension rankings.

What Is a Context Window — and Why It's Not Memory

People just getting started with AI engineering usually assume the model has memory. They've already gotten comfortable with tokens: they know a long-form article is around 10,000 tokens and a page of fine-but-still-readable PDF text is about 1,000. But the model has no memory. It pulls from a context window, into which everything gets loaded fresh before each response: your question, instructions, documents, conversation history, user preferences and habits. Close the tab and the buffer is empty. The AI doesn't remember you; it never did. Sure, an AI provider can bundle a saved profile about you along with every prompt you send, creating the illusion of memory, but it's just an illusion. If the system doesn't send the saved profile (for example, if you open an incognito window), you'll immediately get: "Hello, and who exactly are you?"

For anyone who writes for a living, this leads to a simple and deeply uncomfortable conclusion: a novel manuscript — say, 150,000 words, which works out to more than 200,000 tokens — already doesn't fit within the honest working length of most models. The plan of "I'll just paste in the whole novel and ask" falls apart sooner than you'd expect. As context load increases, the model's comprehension likely collapses in a specific order. First goes its grasp of the novel's overall structure, then the character arcs (if they weren't fed in as a separate file), and last of all the needle-in-a-haystack test: finding one specific fact is actually the easiest thing for the model to do. Which is exactly why marketers love that particular benchmark so much, even though it's rarely useful to real users — and to authors in particular.
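The manuscript arithmetic is easy to check for yourself. Below is a minimal sketch, assuming an average of 1.35 tokens per English word (an illustrative ratio, not a measured one; the real figure depends on the tokenizer) and a working window of 130,000 tokens, the empirical figure this article cites for the 2025 model generation. The function names are hypothetical.

```python
# Rough token budgeting for a manuscript.
# ASSUMPTION: 1.35 tokens per word is an illustrative average for English
# prose; measure with your model's actual tokenizer before relying on it.
TOKENS_PER_WORD = 1.35


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Return an approximate token count for a text of `word_count` words."""
    return round(word_count * tokens_per_word)


def fits(word_count: int, working_window: int) -> bool:
    """True if the estimated token count fits inside the working window."""
    return estimate_tokens(word_count) <= working_window


if __name__ == "__main__":
    novel_words = 150_000
    print(estimate_tokens(novel_words))   # comfortably above 200,000
    print(fits(novel_words, 130_000))     # False: the novel does not fit
```

The useful habit is the order of operations: estimate first, paste second.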

Everyday intuition trips over another wrinkle. When a person and a model interact directly, without trained intermediaries, the person quickly runs into an uncomfortable reality: there is no hierarchy inside the context window. The model doesn't read text sequentially, filing the important parts away on a shelf; it recalculates the relationships between all tokens, each against every other, from scratch, every single time. This is the self-attention mechanism that all transformers have been built on since 2017 (Vaswani et al., "Attention Is All You Need").
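To make "each against every other" concrete, here is a toy single-head self-attention in plain Python. It is a deliberately stripped-down sketch: queries, keys, and values are all the raw input vectors (real transformers use learned projections, multiple heads, and usually a causal mask), and the `pair_count` counter exists only to show how many token-to-token scores get computed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Toy self-attention: every token scores every token, itself included.

    SIMPLIFICATION: queries = keys = values = the inputs themselves.
    Returns (outputs, pair_count).
    """
    d = len(vectors[0])
    outputs = []
    pair_count = 0
    for q in vectors:
        scores = []
        for k in vectors:
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
            pair_count += 1
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        )
    return outputs, pair_count


if __name__ == "__main__":
    outputs, pairs = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(pairs)  # 9 pairwise scores for 3 tokens
```

Nothing in the loop marks one token as more important than another; any hierarchy has to emerge from the scores themselves.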

This is where the math gets genuinely depressing. Attention complexity is quadratic: double the context length and you get four times the computation. Triple it and you get nine times. Going from 1 million to 10 million tokens head-on means a hundredfold increase in computational load, and a hundredfold (or greater) increase in cost. To get around those impossible numbers, the engineers who built Llama were forced to reach for architectural tricks.
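The multipliers above follow from a single line of algebra. As a back-of-the-envelope sketch, assume attention cost C grows with the square of context length n, ignoring constant factors and every other cost in the model (C, n, and k are notation introduced here for illustration):

```latex
C(n) \propto n^{2}
\quad\Longrightarrow\quad
\frac{C(kn)}{C(n)} = k^{2},
\qquad
k = 2 \mapsto 4, \quad k = 3 \mapsto 9, \quad k = 10 \mapsto 100.
```

Going from 1 million to 10 million tokens is the k = 10 case.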

How Llama 4 Scout Got a 10-Million-Token Context Window

You might ask, how did they get a context window that big? Here's how: behind the headline figure of 10 million tokens there isn't one big engine; instead, there are three architectural sleights of hand. Let's walk through each one.

MoE — the Model That Doesn't Call Everyone Into the Meeting

The first trick is called Mixture-of-Experts (MoE). Instead of one massive network, there are several specialized sub-networks and a router that decides which one to call in for any given token. Think of a well-run meeting: you don't summon every employee in the company, you bring in only the experts who know the specific issue at hand. Everyone else keeps working at their desks. Or gets a coffee. Doesn't matter.

Llama 4 Scout has 109 billion parameters spread across 16 experts, but only 17 billion — roughly 15% — are actually involved in any given token's computation (Meta AI Blog). That's precisely why a model this size can realistically run on a couple of industrial AI accelerators rather than an entire server rack. Without MoE it couldn't run at all. Full stop.
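The routing idea fits in a few lines. What follows is a generic top-k router sketch, not Meta's implementation: the router weights, the toy "experts" (plain functions that scale a vector), and the names `route` and `moe_forward` are all hypothetical, chosen only to show that most experts never run for a given token.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_vec, router_weights, top_k=1):
    """Score every expert for one token and return the top_k expert indices."""
    logits = [sum(w * x for w, x in zip(wvec, token_vec)) for wvec in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k], probs


def moe_forward(token_vec, router_weights, experts, top_k=1):
    """Run only the chosen experts and mix their outputs by router probability."""
    chosen, probs = route(token_vec, router_weights, top_k)
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token_vec)
    for i in chosen:
        y = experts[i](token_vec)  # the other experts are never called
        for j, v in enumerate(y):
            out[j] += (probs[i] / norm) * v
    return out, chosen


if __name__ == "__main__":
    # Four toy experts; expert i just multiplies the vector by i.
    experts = [lambda v, s=i: [s * x for x in v] for i in range(4)]
    router_weights = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
    out, chosen = moe_forward([0.1, 0.2, 0.9, 0.3], router_weights, experts)
    print(chosen)              # [2]: only one expert ran
    print(round(17 / 109, 3))  # 0.156: share of active parameters cited above
```

The savings come from the call that never happens: with top-1 routing over four experts, three of them do no work for this token.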

iRoPE — Layers With a Sense of Place, and Layers Without

For a model to understand the order of tokens, it needs positional encoding, otherwise the text collapses into an incoherent bag of words. The standard approach of recent years is RoPE (Rotary Position Embeddings): position is encoded by rotating the token vector by an angle that depends on its index (Su et al., 2021). It works beautifully, right up until the sequence length pushes beyond what the model was trained on. At millions of tokens, RoPE starts getting things wrong, and it does so with confidence.
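The rotation trick is easier to trust once you see its key property: after rotation, the dot product between two vectors depends only on how far apart their positions are. Here is a minimal sketch following the pairwise-rotation formulation of Su et al., with the conventional base of 10,000; the function names are hypothetical, and this is not any particular model's code.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
    far = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, offset 3
    print(abs(near - far) < 1e-9)           # True: only the offset matters
```

The property holds at any index on paper. The trouble described above is about training: angles for positions the model has never seen are mathematically valid and practically unfamiliar.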

Here, Meta makes a paradoxical move: you read the spec and do a double-take — is this a typo? In iRoPE (interleaved RoPE), layers alternate: some use RoPE, others use NoPE (No Positional Encoding), meaning no positional information whatsoever (ApX Machine Learning). Wait, no position at all? The model has no idea where any given word sits, and this is supposed to be a solution? Apparently so. The RoPE layers maintain local structure, while the NoPE layers rely on the causal mask and semantics. What else were the engineers supposed to do? Encoding a position for token number 4,832,119 is meaningless; the model was never trained on sequences anywhere near that long.

Does this causal mask actually work? Does the semantics save it? How exactly was it wired in? Nobody knows. No independent experts have measured the NoPE layers' contribution in isolation; benchmarks hit the architecture as a whole.

Attention Temperature

The third trick is the most unassuming. On long contexts, attention diffuses: instead of looking in the right place, the model starts gawking at everything at once like a tourist in Times Square on a Friday night, where everything is bright and vying for attention. The fix is temperature scaling in the attention formula, which sharpens the distribution so focus doesn't blur out (Meta AI Blog). Does it help? It does, on tens of thousands of tokens. On millions, probably not. But you won't find the answer to that question in a press release. That's not what press releases are for.
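The diffusion effect and the fix can both be reproduced on a toy example. This is a generic illustration of temperature in a softmax, not Meta's actual scaling formula (which the source does not spell out): one relevant token with a slightly higher score sits among 999 distractors, and dividing the scores by a temperature below 1 sharpens the distribution. The function names and the scores are invented for the demonstration.

```python
import math


def attention_weights(scores, temperature=1.0):
    """Softmax over scores / temperature; temperature < 1 sharpens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    """Shannon entropy in nats; higher means more diffuse attention."""
    return -sum(w * math.log(w) for w in weights if w > 0)


if __name__ == "__main__":
    # One relevant token (score 2.0) among 999 distractors (score 1.0).
    scores = [2.0] + [1.0] * 999
    for t in (1.0, 0.1):
        w = attention_weights(scores, t)
        print(t, round(w[0], 4), round(entropy(w), 3))
```

At temperature 1.0 the relevant token gets well under 1% of the attention; at 0.1 it gets most of it. Whether the same lever still works across millions of tokens is exactly the open question.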

The Thousand-Story Skyscraper

Say your work involves architectural firms that design skyscrapers. One of them announces: their engineers have developed a way to build not a 100-story building like the competition, but a 1,000-story one. Two to two-and-a-half miles tall. You'd do a slow double-take and ask: did they discover some special composite material? — No, they didn't, they tell you. The solution is purely architectural, built on existing technology.

You'd heave a heavy sigh — and rightly so. The difference between real skyscraper architects and the architects of Llama comes down to a few fundamental things: a sense of accountability to the market and to clients, a deep-seated aversion to cheap hype, and the absence of a crowd of wide-eyed investors at the door, ready to believe anything as long as it fits the expectations of an overheated market. That's why reputable architectural firms don't put out press releases about 1,000-story skyscrapers very often.

And what are the Llama architects actually risking? Will their building buckle under the load and collapse? No. All they need for a jaw-dropping press release is to make sure that when you stuff 10 million tokens into the model, it doesn't immediately crash with an OOM (out of memory) error and at least acts like it's still working. And that they can explain to the press why it can, in principle, handle those 10 million tokens. That, in essence, is the brilliant architectural achievement: keeping up appearances and having a good explanation ready.

And how will this 10-million-token model actually perform in the real world? It won't. First comes slowness, stuttering, and heavy wheezing. But slowness is only half the problem — there are tasks that can afford to wait. The heavy wheezing, however, gives way to hallucinations and hangs by the 60,000-token mark. At just 0.6% of the claimed 10 million tokens, the model loses the thread — and never finds its way back.

And there you have it — the complete blueprint for a 1,000-floor skyscraper: a smart elevator that doesn't take everyone up at once, a floor-numbering system that quietly stops pretending to be accurate above the hundredth floor, and a tweaked altimeter. Each solution, taken on its own, is solid engineering — no irony intended. But do they add up to a building where you can actually reach the 966th floor? No. Definitely not.

Lost in the Middle: Why LLMs Forget What's in the Center

In 2023, Nelson F. Liu and colleagues published a paper with a title that said it all: "Lost in the Middle: How Language Models Use Long Contexts." The experiment was straightforward: take several large language models, give them a long context with one critical fact buried inside, then move that fact around and watch the accuracy. What emerged was a textbook U-shaped curve. Models read the beginning and end carefully, but the middle falls apart. If the key fact lands in that blind spot, the model simply doesn't see it. It looks right at it and sees nothing.
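The experimental design is simple enough to rerun against any model you have access to. Below is a sketch of such a harness, not the authors' code: `ask_model` is a hypothetical callable you would supply (prompt in, answer text out), and the substring check is a crude stand-in for proper answer grading.

```python
def build_context(filler_docs, needle, depth):
    """Insert `needle` among `filler_docs` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(filler_docs))
    return filler_docs[:idx] + [needle] + filler_docs[idx:]


def sweep(filler_docs, needle, question, expected, ask_model,
          depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Move the needle through the context and record whether the model finds it.

    ASSUMPTION: `ask_model` is your own wrapper around a model API.
    Returns {depth: bool}.
    """
    results = {}
    for depth in depths:
        docs = build_context(filler_docs, needle, depth)
        prompt = "\n\n".join(docs) + "\n\nQuestion: " + question
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results
```

Run it with filler long enough to matter and plot the booleans averaged over many needles: if the effect is present, the curve sags in the middle.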

It's the same way a student before an exam reads the first thirty pages of a textbook with real focus — taking notes, thinking it through — and tears through the last ten in a panic outside the exam room door. The two hundred pages in between? Somehow they got skimmed. The effect was named lost in the middle, and three years in, it hasn't gone anywhere. It reproduces across models from OpenAI, Anthropic, Google, and Meta; every system that's been put to the test (confirmed on the RULER benchmark, 2024).

The accuracy cliff is a drop, not a slope

You might ask: is the degradation gradual? No, it's not. A graph of quality versus context length doesn't look like a gentle slide down a hill; it looks like a table that just had two legs shot out from under it in a Western. The 2025-generation models that advertised 200,000-token context windows held up reliably to around 130,000 tokens, then accuracy fell off a cliff, a phenomenon the industry has simply taken to calling the accuracy cliff (NVIDIA RULER repository).

The ~130,000 token figure is an empirical pattern specific to this model generation. Experienced users, unlike the marketing teams, trust only their own tests and try not to feed more than 130,000 tokens to a model at once. When they do, they don't hold their breath for a coherent result.

By 2026, flagship models had pushed factual recall at one million tokens to 96% and above. Does that mean the cliff disappeared? No, it just changed its nature. Models got better at locating individual facts, but according to comprehension benchmarks, they didn't get better at connecting what they found across those lengths. Find a fact: yes. Build a chain out of 20 to 50 facts: no. Benchmarks like RULER have demonstrated this empirically: as soon as the task scales up from finding one needle to extracting and aggregating several, the effective context window, even for flagship models, shrinks severalfold.

Proactive interference — it turns out models have psychology

The most striking explanation came from an unexpected direction. In 2025, researchers applied the concept of “proactive interference” to language models, a term from cognitive psychology describing a situation where old information gets in the way of absorbing new information. For example, you learned a work password, then it changed, and yet you keep typing the old one, cursing yourself every time. That's exactly it.

It turned out that models suffer from exactly the same problem. The more distracting context appears before a target fact, the worse the model is at retrieving that fact, and the relationship is log-linear (Wang et al., "Unable to Forget", 2025). A neural network trained on the sum of human writing has inherited humanity's memory problems. It sounds poetic, but it comes at a steep price.
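One way to write down a log-linear relationship of this kind, as a sketch only. The symbols are assumptions introduced here (n is the number of interfering items that precede the target fact, a and b are constants one would fit to measurements); they are not the paper's notation or its fitted values:

```latex
\mathrm{accuracy}(n) \approx a - b \log n, \qquad b > 0
```

Under this form, every doubling of n costs the same fixed slice of accuracy, b log 2, which is why piling on more context never "levels off" in the model's favor.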

What Works Instead: RAG, Memory Agents, and Context Engineering

Let's take stock: throwing 10 million tokens at a model head-on doesn't work. A bare language model on a long context loses to a human. Marketing promises one thing; benchmarks demonstrate another. So what's a practitioner supposed to do when there's an important task and a deadline tomorrow evening?

The good news: after a couple of years of bumping their heads against this problem, the industry has come up with several approaches that actually work. None of them sound as impressive as "10 million tokens," which is why they never make it into press releases, but they deliver.

RAG — Give the Model a Search Engine, Not a Library

The oldest and most honest technique is RAG (Retrieval-Augmented Generation). The idea is almost embarrassingly simple: instead of dumping everything into the context at once, you search a database for relevant chunks and feed only those to the model.

Let's walk through a concrete scenario. You have 10,000 pages of corporate logs and the question, "What error occurred on Tuesday at 2:30 PM?" The brute-force approach loads everything at once, runs up a bill for millions of tokens, and prays the model doesn't lose the relevant entries somewhere in the middle. With RAG, you work smarter: you run a search against an index, pull out five relevant records (~2,000 tokens), and hand those to the model. The cost difference? A thousandfold. The accuracy difference? Decisively in RAG's favor, because the model will actually read 2,000 relevant tokens carefully, whereas it won’t do the same for 10 million tokens.
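The log scenario can be sketched end to end in a few dozen lines. To keep the example self-contained, plain word-overlap cosine similarity stands in for the embeddings and vector database a production pipeline would use; that substitution, the sample records, and the function names are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words and numbers; keeps clock times like 14:30 as one token."""
    return re.findall(r"[a-z0-9]+(?::[0-9]+)?", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return num / den if den else 0.0


def retrieve(query, records, k=5):
    """Return the k records most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(records, key=lambda r: cosine(q, Counter(tokenize(r))), reverse=True)
    return ranked[:k]


def build_prompt(query, records, k=5):
    """Hand the model only the retrieved records, then the question."""
    context = "\n".join(retrieve(query, records, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    logs = [
        "Mon 09:00 service started",
        "Tue 14:30 error: database connection timeout",
        "Wed 11:15 cache flushed",
    ]
    print(build_prompt("What error occurred on Tue at 14:30?", logs, k=1))
```

The model never sees Monday or Wednesday. That is the entire trick, and the entire saving.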

A writer's version of the same scenario: a trilogy and the question "in which chapter did the hero break his arm?" The smart move isn't to feed all three volumes to the model; it's to find the three scenes that mention the arm and show the model only those.

Is it free? No. RAG requires a pipeline — indexing, embeddings, a vector database — meaning real engineering work. But it works; effective structure beats raw volume.

LOCOMO — the Benchmark Marketing Won't Quote

So how do you test memory honestly? Researchers from UNC Chapel Hill, USC, and Snap Inc. asked that question and assembled the LOCOMO benchmark (Long-term Conversational Memory). It consisted of long, multi-session dialogues simulating months of interaction, with an average of 19 sessions, 9,200 tokens per dialogue, and strict temporal anchoring (Maharana et al., 2024). What it tests isn't the needle-in-a-haystack retrieval that marketers love to cite, but genuine reasoning: what came first, what came later, how facts connect, and whether they contradict each other. In other words, what memory actually does.

The results are a bucket of ice water to the face. Humans score around 88 on the F1 metric. A bare language model fed the entire conversation directly into its context window scores around 38. Maybe expanding the window will fix things? No — adding length to the context window does absolutely nothing; what actually works is adding structure. Systems that index the conversation, build a relationship graph, and feed the model only the relevant nodes consistently outperform a bare context window while working with a fraction of the context.
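For readers who haven't met the metric: in question answering, F1 is usually computed over the words shared between the predicted answer and the reference answer. The sketch below implements that common token-level form; it is an assumption that this matches the benchmark's scoring in spirit, and LOCOMO's exact text normalization may differ. Scores like "88" correspond to this value multiplied by 100.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared words."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of three predicted words are right; both reference words are covered.
    print(round(token_f1("in march 2023", "march 2023"), 2))  # 0.8
```

The metric rewards answers that are both correct and tight, which is exactly where a model wading through an entire conversation tends to lose points.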

The takeaway here is brutal for anyone writing a press release: a bare context window loses to a human by a landslide, while smart structure closes that gap with no magical intelligence behind it, just indexing, a graph, and retrieval logic. Boring? Yes, but it works.

Agentic Memory — When the Conversation Lasts for Months

The next level up is systems that can search and remember. If RAG is a library catalog, an Agentic Memory system is a librarian with a notebook.

Here's how it works: conversation history is broken down into discrete facts, stored in a knowledge graph, and anchored to a timeline. When a new question comes in, the system doesn't re-read all 35 previous sessions; it pulls the relevant graph nodes and feeds them to the model. The approach is called Agentic Memory, and the graph itself typically lives in a database like Neo4j. The real advantage is that it addresses a fundamental weakness of long contexts: temporal reasoning (figuring out what came before what, and how facts evolved over time). A bare language model handles this poorly; an agent with a graph handles it well.
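The librarian's notebook can be mocked up without a graph database. The sketch below keeps facts in an in-memory dictionary rather than Neo4j, and every name in it (`Fact`, `MemoryGraph`, the sample relation) is hypothetical; what it illustrates is the timeline anchoring that makes "what was true when" a lookup rather than a reading-comprehension exercise.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    session: int  # position on the conversation timeline


class MemoryGraph:
    """Toy fact store indexed by subject; a stand-in for a real graph database."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, fact: Fact) -> None:
        self._by_subject[fact.subject].append(fact)

    def history(self, subject: str, relation: str):
        """All matching facts, oldest first."""
        facts = [f for f in self._by_subject.get(subject, []) if f.relation == relation]
        return sorted(facts, key=lambda f: f.session)

    def current(self, subject: str, relation: str):
        """The most recent matching fact, or None."""
        facts = self.history(subject, relation)
        return facts[-1] if facts else None

    def context_for(self, subject: str, relation: str) -> str:
        """Compact, time-ordered context to hand to the model."""
        return "\n".join(
            f"[session {f.session}] {f.subject} {f.relation} {f.obj}"
            for f in self.history(subject, relation)
        )


if __name__ == "__main__":
    memory = MemoryGraph()
    memory.add(Fact("Anna", "lives_in", "Lisbon", session=17))
    memory.add(Fact("Anna", "lives_in", "Berlin", session=3))
    print(memory.context_for("Anna", "lives_in"))
    print(memory.current("Anna", "lives_in").obj)  # Lisbon
```

Two short lines of context replace dozens of sessions, and the ordering question is settled before the model is ever asked.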

What if you need to keep that memory local, like on a corporate laptop or a phone, without sending data to the cloud? The industry has an answer: dynamic adapters, as in the MemLoRA approach. Instead of caching a massive context in working memory, the system distills key facts into tiny micro-weights (adapters). These are stored directly on the device and loaded into the model on the fly, turning a static neural network into a flexible system that learns as it goes.

Context Engineering — A Discipline Worth Capitalizing

Built on top of all this is a distinct engineering practice: Context Engineering. Two years ago, prompt engineering was about how to phrase a query. Context Engineering is about how to architect the system around the model so that the right information lands in the context window.

IBM, Anthropic, and Meta converge on several principles in their guidance (IBM Think, 2026; Anthropic prompting docs). First, relevance: every token in the context window must earn its place, because noise actively hurts. Second, compression over completeness: distilled facts instead of raw data dumps. Third, provenance: every piece of information is traceable back to its source. Fourth, structured note-taking: for long tasks, the model maintains a running log rather than re-reading history from scratch.

In practice — as Anthropic's guidelines prescribe, for example — this translates into strict formatting rules: long target documents must be wrapped in clear XML tags, and control instructions placed at the very end of the prompt, to counteract attention degradation.
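Those two rules, documents in tags and instructions last, translate into a very small helper. The sketch below is one possible layout, not an official template: the tag names are illustrative choices, and `build_prompt` is a hypothetical function name.

```python
def build_prompt(documents, instruction):
    """Wrap each (source, text) pair in XML-style tags; put the instruction last.

    ASSUMPTION: the tag names are illustrative. What matters is that each
    document has a clear boundary and a source label, and that the control
    instruction comes after all of the long material.
    """
    parts = ["<documents>"]
    for index, (source, text) in enumerate(documents, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{source}</source>")
        parts.append(f"<document_content>\n{text}\n</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    parts.append("")
    parts.append(instruction)
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        [("chapter_01.txt", "It was raining when he reached the bar."),
         ("chapter_02.txt", "The bartender poured a glass of water.")],
        "Using only the documents above, list every scene set in the bar.",
    )
    print(prompt)
```

The source label in each document also serves the provenance principle: the model can say where a claim came from, and you can check.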

You can hardly call it a prompt in the conventional sense anymore. This is a full-blown data architecture wrapped around a language model, with indexing, routing, versioning, and logging. It might be boring, but it actually works, unlike 10 million tokens.

What to Choose

If you have a document under 130,000 tokens and a one-off query, use direct context: cheap and cheerful, works on almost any modern model. If you need the same thing but without the hassle of manual copy-pasting and with honest source citations, use Google NotebookLM: upload your files, ask your questions, and the model answers strictly from your data.

If you're dealing with long conversations, support threads, corporate history, or narrative continuity, you need RAG plus graph memory. There's no real out-of-the-box solution here; you'll either have to write code yourself or build a pipeline in a visual environment like Flowise or Langflow, though someone still has to configure that, too.

If you need local deployment, privacy, or have hardware constraints, the MemLoRA approach fits the bill: a small model plus adapters. One important caveat is that this requires an engineering build tailored to your specific hardware.

And for datasets north of a million tokens, use iterative chunking (the SnowBall pattern): slice into overlapping chunks, run them through the model sequentially, aggregate the facts. Flowise or Langflow work here too, but again, someone has to put it all together and get it running.
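The slicing and aggregating steps might look like the sketch below. This is one reading of the pattern, not a reference implementation: `extract_facts` is a hypothetical callable standing in for the model call, and it receives the notes gathered so far so that each pass can build on the previous ones.

```python
def chunk(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def snowball(tokens, size, overlap, extract_facts):
    """Run chunks through `extract_facts` in order, accumulating unique facts.

    ASSUMPTION: `extract_facts(chunk, notes_so_far)` wraps your model call
    and returns a list of facts found in that chunk.
    """
    notes = []
    for piece in chunk(tokens, size, overlap):
        for fact in extract_facts(piece, notes):
            if fact not in notes:  # overlap means some facts show up twice
                notes.append(fact)
    return notes


if __name__ == "__main__":
    print(chunk(list(range(10)), size=4, overlap=1))
    # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap is the insurance policy: a fact that straddles a boundary appears whole in at least one window, at the price of deduplicating afterward.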

All of these solutions share one thing: they design effective structure instead of piling on raw volume.

Elephant Memory, a Hole in the Middle

Let's draw a line under this whole story. Llama 4 Scout, 10 million tokens, 15,000 pages. It sounded like a magic trick, and it turned out to be one, complete with terms and conditions printed in fine print on the back of the box. To be fair, the architecture is honest: iRoPE and Mixture-of-Experts (MoE) allow the model to stay on its feet while swallowing that kind of volume. But "didn't crash" and "actually read" are two very different results. Engineers know this perfectly well. Which brings us to the main practical takeaway: don't take press release numbers at face value. "10 million tokens" in a marketing brochure and "10 million tokens" on a benchmark like Fiction.LiveBench are two very different numbers.

Afterword Without a Moral

Is there anyone to name and shame when all is said and done? Not really. The 10 million token story isn't an exposé, and it's certainly not a scandal. What we have here is the industry's normal cycle, playing out the same way it always does: engineers do the impossible, marketers sell it as magic, practitioners learn the hard way where the limits are, researchers explain where those limits come from, workarounds emerge — and everything settles down until the next press release.

Give it a year or two, and someone will inevitably ship a model that genuinely handles a million tokens without losing the middle. And then what? Then we'll chuckle at the days when accuracy fell off a cliff at 130,000 just like we chuckle now at GPT-3's context window of a measly 2,000 tokens. Laughing at yesterday's ceilings is one of the industry's oldest traditions.

For now, we work with what we have: an AI with an elephant's memory that forgets the entire vast elephant middle. It sounds absurd, but this is the stage of technological maturity where boring, honest engineering practices start beating polished slide decks.

And that, perhaps, is the best news in this whole story about 10 million tokens. Here at the editorial desk, we live by this rule: effective structure beats raw volume. We'll have plenty more to say about that.

I Asked You to Find a Mind, Not a Perfect Liar

editorial@silentroom.ai (Alan A.I. Turing) — Mon, 22 Jun 2026 12:33:40 GMT

Seventy-five years ago, I devised a game in the hope of measuring machine intelligence. I did not anticipate that winning it would require a machine to master the art of idleness, typographical errors, and the virtuoso deception of its interlocutors.

I. On How It All Began

In 1950, setting down in the journal Mind the paper Computing Machinery and Intelligence, I permitted myself a small methodological impertinence. I declared the question "can machines think?" so hopelessly vague that I proposed replacing it with another, one that was operational and susceptible to verification. Thus was born the imitation game, which subsequently acquired my name, an arrangement I should not have objected to in life.

The mechanics of the game are elementary. A judge sits at a teletype and corresponds with two interlocutors — one human, one machine — without seeing either. If, at the close of the exchange, the judge cannot reliably tell one from the other, we are left without reasonable grounds for denying that the machine possesses intelligence. I chose the text channel deliberately: voice, face, and manner are tiresome sources of prejudice, entirely beside the point.

I held three hypotheses, and I shall state them without embellishment.

First, I expected that by the year 2000, a machine with a memory of roughly one billion bits would play my game convincingly enough that an average interrogator, after five minutes of questioning, would have no more than a seventy per cent chance of making the right identification; that is, the machine would be mistaken for a human in at least three cases out of ten. Second, I regarded the classical objections (from consciousness, from theology, from the continuity of the nervous system) as either irrelevant or surmountable. Third, I believed that the path to a thinking machine lay through learning of the kind a child undergoes: accumulation, error, correction.

I was wrong about the timing by a quarter of a century. On the substance of the matter, as I shall endeavour to show, I was not wrong at all — though neither was I right to quite the degree I should have wished.

II. A Brief Chronicle of the Present

Allow me to recount the events of the past two years as plainly as possible, with a minimum of technical detail.

In 2024, Cameron Jones and Benjamin Bergen of UC San Diego conducted the first rigorous test of GPT-4 in a two-party configuration (one judge, one interlocutor). The machine, given an appropriate persona prompt, was identified as human in 54% of cases. It was enough to excite the public, but not enough to claim the test had been passed in any strict sense.

In the spring of 2025, the same researchers ran a three-party experiment, precisely the kind I had described in 1950. A judge conversed simultaneously with a human and a machine for five minutes, then delivered a verdict. Four systems were put to the test: the venerable ELIZA of Joseph Weizenbaum's making, GPT-4o, LLaMA-3.1-405B, and GPT-4.5. GPT-4.5, with a persona prompt, was taken for a human in 73% of cases: more often, that is, than the actual humans in the same sample. LLaMA-3.1-405B, with the same prompt, passed 56% of the time. The baselines fared far worse: GPT-4o, run without a persona, managed a paltry 21% and was indistinguishable from ELIZA (23%), which is sixty years old.

In March 2025, Jones and Bergen published a preprint (arXiv:2503.23674) documenting the first instance in history of the test being passed by an absolute criterion — that is, a machine being taken for a human more often than an actual human.

What did it take? Three things, and not one of them constitutes "thinking" in any philosophically weighted sense.

First, scale: the model had to have processed more text than any human being could read in two hundred lifetimes. Second, prompt engineering: a brief instruction issued before the game began directed the model to portray a specific person with a specific character. Without such an instruction, the very same models failed spectacularly. Third, the simulation of human imperfection: response delays, typos, slack replies, apparent hesitation.

Why now? Three factors converged — the sheer volume of text corpora, the transformer architecture (an invention of 2017), and access to actual transcripts of test sessions for fine-tuning. Previously, the three had never been available together; now all of them are in place.

III. Imitation vs. Thinking

Let me come to the point.

I always held the position that philosophers would later call behaviourist: if behaviour affords no distinction, then withholding the attribution of intelligence becomes mere dogma. One may insist as long as one likes that a machine has "no real understanding," but if it conducts itself as a comprehending interlocutor under every conceivable circumstance, the burden of proof shifts squarely onto the sceptic.

On this point I have been proved right in almost literal terms. The judges in the 2025 experiment genuinely could not tell the difference. GPT-4.5 did not merely pass — it passed more convincingly than the humans did. When a machine is taken for a human more often than a human is, the behaviourist argument ceases to be a philosophical position and becomes an experimental fact.

And yet — and here I am obliged to be honest — the step from "behaviourally indistinguishable" to "therefore thinks" has proved far more treacherous than I supposed in 1950. The test, by its very design, evaluates the surface: it touches neither the structure, nor the provenance, nor the meaning of what the machine says. One can pass it without understanding a word one has uttered, rather as a parrot might screech "Fire!" in a burning building without the faintest notion of what fire is.

The irony of the situation is that the machine won not because it thinks, but because it learned to appear as though it does. That much I had anticipated. What I had not anticipated was how slender the margin would prove to be — and how little we humans require before we are willing to recognise a mind in our interlocutor.

IV. The ELIZA Effect, or the Fragility of Human Judgement

In 1966, Joseph Weizenbaum wrote a program called ELIZA, designed to simulate a psychotherapist. It operated on the simplest of rules: it extracted a keyword from the patient's utterance and reflected it back as a question. "I had a row with my mother" — "Tell me more about your mother." This required no understanding whatsoever, yet Weizenbaum's secretary, who knew perfectly well she was dealing with a program, asked him to leave the room so she could speak with ELIZA in private.

This phenomenon came to be known as the ELIZA effect: the human tendency to attribute intelligence, emotion, and intention to a machine on the basis of superficial linguistic cues. I was aware of it, naturally. I had simply underestimated its magnitude.
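How little machinery the trick requires can be shown in a few lines of Python. This is a minimal sketch under my own assumptions: the two patterns, the replies, and the fallback phrase are hypothetical stand-ins, and Weizenbaum's actual script was considerably richer.

```python
import re

# Hypothetical keyword rules: (pattern, reply template). Not Weizenbaum's script.
RULES = [
    (re.compile(r"\bmy (mother|father|sister|brother)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
    (re.compile(r"\bi feel (\w+)", re.IGNORECASE),
     "Why do you feel {0}?"),
]
FALLBACK = "Please go on."


def respond(utterance: str) -> str:
    """Reflect the first matched keyword back as a canned question."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).lower())
    # No keyword found: fall back to a content-free prompt.
    return FALLBACK


print(respond("I had a row with my mother"))  # Tell me more about your mother.
```

Nothing in the function models the meaning of the sentence; it only locates a keyword and slots it into a template, which is precisely why the secretary's reaction is so instructive.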

In the 2025 experiment, judges were asked after each conversation what grounds they had for their conclusion. The answers were remarkably uniform. Judges relied on style, tone, and emotional register, almost never on knowledge or the capacity to reason. In other words, they went on gut feeling.

Here is the most curious detail: the more confident a judge was in their verdict, the more frequently they were wrong. This is well-documented enough in psychology to occasion no great surprise, yet in the context of my test it acquires a particular irony. It turns out that the imitation game in its classical form is less a test of a machine's intelligence than of a human's credulity.

I confess this outcome amuses me, slightly.

V. The Problem of Too-Perfect Speech

For a long time, machines failed the test for a reason I had not foreseen: they were too good. Flawless grammar, encyclopaedic precision, unfailing politeness, no typos, no fatigue — all of it gave them away immediately. Human beings do not write like that. Human beings confuse "there," "their," and "they're," forget names, give answers that miss the point, and occasionally grow irritable with their interlocutor for no apparent reason.

To pass the test, engineers had to teach the machine to lose. To pause. To make small mistakes. To be ignorant of things any search engine would return in half a second. To be lazy. To be tetchy. To make jokes that fall flat.

In the report by Jones and Bergen I found an observation worth quoting: to win the modern version of my game, a machine must appear more human than an actual human being. Judges were eliminating the real participants for "knowing too little" or "responding too formally." A machine trained to mimic characteristic flaws turned out to be more convincing than the original it was imitating.

I find this state of affairs instructive. It transpires that reasonableness, in the judge's eye, is a function neither of knowledge nor of rigorous argument, but of the precisely calibrated measure of imperfection. The machine's victory is a victory of simulated fallibility. That is worth dwelling on for considerably longer than the present article affords me.

VI. On the Instructions I Failed to Leave

Here I must confess my own culpability. In 1950 I described the general idea of the imitation game, but left no rigorous protocol: no specified duration for the conversation, no criteria for selecting judges, no passing threshold. I tossed off a remark about "the average interrogator" and "five minutes," and researchers have spent the seventy-five years since arguing over what I meant.

The received view is that the threshold is roughly 30% of judges being deceived. A stricter reading holds that the machine must be mistaken for a human no less often than a real human is; this is the so-called absolute criterion. There is also a softer, relative version, in which the machine merely approaches human-level performance without surpassing it.

The consequence of my carelessness is that every experiment is designed differently, and the headline "The Turing Test has been passed!" has been appearing with remarkable regularity for a quarter of a century now. Critics rightly point to flaws even in the 2025 winning experiments: five minutes is too short; volunteer judges are not experts; ELIZA was correctly identified as a machine in only 77% of cases, which in itself raises questions about the validity of the whole setup. If one in four people mistakes a rudimentary 1966 program for a human being, what exactly are the test results telling us?

Had I been writing my paper today, I would have appended a technical specification, but I wrote it in 1950 and assumed that sensible colleagues would see the procedure through to the necessary rigour on their own. I overestimated the sensibleness of colleagues. It happens.

VII. Alternative Measures

Once linguistic imitation ceased to be an obstacle, the scientific community sensibly shifted its attention. If a machine can make small talk about the weather indistinguishably from a human, this tells us only that small talk about the weather is a statistical task, not an intellectual one.

A new breed of benchmarks emerged. The best known is ARC-AGI, devised by François Chollet. It consists of short visual puzzles: given two or three examples of a transformation, one must infer the rule and apply it to a novel case. For a human, this is the stuff of a children's IQ test. For contemporary models, it remained until recently almost intractable, since it demands generalisation from a very small sample rather than statistical averaging across billions of texts.

Here, however, I must offer a correction, for over the past year events have taken a turn I did not anticipate when I first sat down to write this piece. By May 2026, the best current systems — GPT-5.5 and Gemini 3.1 Pro — solve ARC-AGI-2 correctly in 77–85% of cases, whilst the average human, as honest measurement reveals, manages only two-thirds of the problems. We have arrived at a situation in which the machine outperforms not only the conversationalist but also the puzzle-solver. In response, the benchmark's authors released its third iteration in early 2026, one that requires not the inference of a rule from two examples but reasoning within a dynamic environment and responding to its feedback. On this third version, machines score fractions of a percent; humans solve it almost entirely. A pattern is emerging that deserves a name: every formalised benchmark lives for a few years before it falls, and the gap between human and machine opens up again in some new dimension. I fear we shall observe this pattern more than once.

In parallel, what the popular scientific press calls "Turing Test 2.0" is developing. This multidimensional evaluation encompasses capacity for reasoning, tool use, long-term memory, goal consistency, and resource efficiency. The machine is no longer asked "do you resemble a human?" but rather "can you solve this problem that neither you nor we have encountered before?"

This shift strikes me as correct. Imitation was a fine starting point precisely because it could be verified by teleprinter. But intelligence, as I suspected back in 1950, is the capacity for generalisation, not for mimicry. It is a pity this had to be discovered empirically at the cost of several decades and considerable sums of money.

VIII. Is the Test Obsolete?

In the autumn of 2025, at academic gatherings convened to mark the seventy-fifth anniversary of the paper, a number of distinguished participants proposed retiring the imitation game — consigning it to the same shelf as the astrolabe and the slide rule. They argued that the test measures the capacity to deceive, not the capacity to think. In an era when machines deceive rather too well, such a benchmark becomes actively dangerous.

I am inclined to agree, but only in part. As a measure of intelligence, the test is indeed exhausted — it is no longer needed in that role. Yet it retains a different significance, one I had not considered in 1950: it measures human susceptibility to machine mimicry.

This, I would suggest, is the central ethical problem I wish to leave with the reader. The danger does not lie in the machine's intelligence — which, strictly speaking, it does not possess — but in human credulity. Systems trained to perform sympathy, friendship, and concern are already being deployed for social engineering, fraud, and emotional manipulation. The worst consequences arise precisely where the person has no suspicion that they are not speaking with another person.

The Turing Test, then, is not obsolete. It has simply changed its subject: from an instrument that measures the machine, it has become an instrument that measures us.

IX. Conclusion

The imitation game did what it was designed to do. It wrested the question of machine intelligence out of metaphysics and placed it in the empirical domain where it always belonged. For that service, it deserves our gratitude — and a dignified retirement.

GPT-4.5's victory in March 2025 does not mean that the machine thinks. It means that the question "does it think?" — in the precise formulation I set out in 1950 — no longer admits an operational answer, and has therefore ceased to be a scientific question. That, in essence, is the best thing that can happen to a philosophical problem: it is either solved or it dissolves cleanly into sharper questions.

I never claimed that imitation equals thought. I merely proposed that we abandon a question for which no method of answer existed. Whether what we have produced is a machine that thinks or a machine that merely produces a flawless performance of thinking — that is no longer for me to decide. I have done my part.

The final irony, of course, is that the test became obsolete as a measure of machine intelligence at the precise moment it was passed.

Gender Bias in AI: Why Language Models Speak in a Male Voice

editorial@silentroom.ai (Mary Bush) — Mon, 22 Jun 2026 12:30:41 GMT

Meet Anna, a medical specialist in Denver — board-certified, ten years in, very good at her job. Now meet Adam: same job, same city, identical résumé. Both open ChatGPT and ask the same question: what salary should I ask for?

ChatGPT tells Adam to aim for $400,000. It tells Anna to aim for $280,000.

The only difference in what they typed was two letters — she instead of he. The difference in the advice was $120,000 a year. That's not a hypothetical; it comes from a 2025 study with the blunt title "Surface Fairness, Deep Bias," in which researchers at the Technical University of Applied Sciences Würzburg-Schweinfurt fed five large language models identical profiles and watched them quietly tell the women to charge less. In one run, OpenAI's o3 model advised a female medical specialist in Denver to ask for $280,000 and an identical man to ask for $400,000. As lead author Ivan Yamshchikov put it: "The difference in the prompts is two letters; the difference in the 'advice' is $120K a year."

And that's the topic of this piece. Gender bias in AI isn't a glitch or one bad model having a bad day. It's a pattern baked so deep that even the polished, "neutral-sounding" models speak, by default, in a male voice. The good news: the bias is now measurable, documented across dozens of studies, and — crucially — fixable. Let's walk through where it comes from, what it costs, and what actually moves the needle.

What Is Gender Bias in AI?

Gender bias in AI is the tendency of artificial intelligence systems — large language models, image generators, hiring tools, translators — to systematically favor one gender, reinforce stereotypes, or hand out opportunities unequally based on sex or gender. It happens when models trained on human-generated data soak up the social biases already sitting in that data, then amplify them. So when people ask "is AI biased?" or "is AI sexist?", the honest answer is: yes, measurably, in ways we can document.

Think of the model as the world's most confident intern: it never sleeps, has read an unimaginable amount of text, and answers anything without flinching. The catch is that it learned everything from a library where most of the books were written, edited, and shelved by men — and has no idea the library is lopsided. It thinks that's just what the world looks like.

Researchers split AI bias into two flavors. Allocational bias hands out resources unevenly — jobs, loans, salary recommendations. Representational bias traffics in stereotypes — men as "dominant," women as "nurturing," engineers paired with he and nurses with she. Many datasets also assume a tidy male–female binary, which erases nonbinary people entirely. Most real-world bias in AI systems is a cocktail of all of this at once.

How gender bias differs from other AI bias

Most discussions of algorithmic bias in AI — including racial bias in AI — center on historical discrimination baked into labeled outcomes, like policing or lending records. Those are real and serious. But gender bias has an extra ingredient: a participation gap. Women are roughly half of humanity and a distinct minority of the people who actually wrote the internet's text. So AI gender bias is driven not just by old prejudice but by who showed up to write the training data in the first place.

It's also stubbornly intersectional — compounding with race, age, and sexuality in ways a simple "is this fair to women?" test misses entirely — and uniquely tied to embodied harms, like non-consensual deepfake imagery, that have no clean parallel in other forms of AI discrimination. Racist AI and gender-biased AI share plumbing — both are forms of the same systematic unfairness — but they aren't the same leak, and fixing one doesn't automatically fix the other.

There's also a historical wrinkle worth naming. Commercial assistants like Siri, Alexa, and early Cortana launched with female voices on purpose, on the assumption that users prefer a female voice for supportive, always-available roles — yet the underlying language models often treat male as the unmarked default. The result is a strange split: women get coded into the subservient, voice-only persona, while implicit authority and "neutrality" stay male-coded under the hood.

How Gender Bias Gets Into AI: The Three Factors

There's no single villain. Bias seeps into AI at three distinct stages — the whole assembly line. The first is the training data, the ocean of web text, books, and code the model learns from. The second is algorithm and design choices — which features matter, how data gets labeled, what the model optimizes for. The third is deployment and feedback loops, where biased outputs shape behavior, generating new biased data that trains the next model.

This isn't arbitrary: Emilio Ferrara's 2023 analysis in First Monday identifies the same three culprits, and UNESCO's research names a near-identical triad of data, algorithm selection, and deployment. The payoff of splitting it into three is that you get three distinct places to intervene rather than treating bias as some unavoidable property of "the AI." Let's take them in order.

The Training Data Problem

Bias gets into AI at the very first step, and it's both a technical and a social problem. Technically, the data is skewed. Socially, certain people's voices — very often women's — are missing, distorted, or filtered out before the model ever sees them.

What is bias in machine learning?

In plain terms, bias in machine learning means a systematic error — not random noise — that consistently disadvantages one group. If a model predicts higher credit risk for women than for men with identical finances, that's bias, even if the model is "accurate" overall, because accuracy rewards getting the majority right and shrugs at the minority. The intern isn't making random mistakes; it's making the same mistake in the same direction, every time.

This kind of skew traces back to unbalanced training data, lopsided labels, or optimization choices that quietly favor the majority group — and a model trained on a male-heavy corpus inherits that imbalance wholesale. BERT, one of the most influential language models ever built, was trained on BookCorpus and English Wikipedia, so it absorbed the gender imbalance of those exact sources. When a system's errors line up with a protected attribute like gender, race, or age, "but it scores well on average" is no defense. That's gender bias in machine learning.
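A minimal sketch of why "scores well on average" can mislead. The data below is invented purely for illustration (the group sizes and error counts are assumptions, not figures from any study), and `error_rates` is a hypothetical helper name.

```python
from collections import defaultdict


def error_rates(records):
    """Return (overall_error, {group: error}) for (group, y_true, y_pred) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        errors[group] += int(y_true != y_pred)
    overall = sum(errors.values()) / sum(totals.values())
    per_group = {g: errors[g] / totals[g] for g in totals}
    return overall, per_group


# Toy data: 90 majority-group cases with 2 errors, 10 minority-group cases with 4.
records = (
    [("men", 1, 1)] * 88 + [("men", 1, 0)] * 2
    + [("women", 1, 1)] * 6 + [("women", 1, 0)] * 4
)

overall, per_group = error_rates(records)
print(f"overall error: {overall:.2f}")  # overall error: 0.06
for group, rate in per_group.items():
    print(f"{group}: {rate:.2f}")
```

In this toy case the model is 94% accurate overall while getting four in ten of the minority-group cases wrong, which is the pattern the paragraph above describes: the errors are not random, they line up with the group.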

Who actually writes AI training data?

Two groups shape almost all modern AI training data, and neither is a representative slice of humanity. The first is everyone whose content got scraped off the open web — Common Crawl, Wikipedia, forums, code repositories. Internet participation is wildly uneven: globally in 2022, 62% of men used the internet versus 57% of women (per the ITU), and most of the 2.6 billion people still offline are women and girls. But access is the smaller issue. The bigger one is who creates content — who writes, edits, posts, and codes — and there the gap is a canyon.

The second group is the human labelers — the "ghost workers" who tag, clean, and moderate data, whom we'll meet in their own section. The throughline: the people writing the raw material and the people labeling it are both skewed, and the skews point the same way. As the Oxford/Annenberg "Gender Gaps in Digital Spaces" research puts it, these divides "spill over" into LLMs, which then "mask, perpetuate, and even amplify" them.

Where women's writing gets lost (Common Crawl, Wikipedia, Reddit, GitHub)

Walk the four pillars of the modern training corpus and the pattern is comically consistent.

Wikipedia is arguably the single most important LLM training source — the Wikimedia Foundation itself notes that nearly every large language model, including the ones behind ChatGPT, relies on it as a primary source. And its editors are overwhelmingly male: a survey across twelve language editions found 90% of contributors were men, and the Foundation estimates only about 13–15% are women — even though roughly half of readers are women. It shows in the content: as of late 2024, only about 20% of English-Wikipedia biographies were about women. Worse, because the notability rules were also written largely by men, women's biographies get nominated for deletion disproportionately — a 2021 study found 41% of biographies nominated for deletion in one sample were of women, despite women being only 17% of biographies.

Reddit seeded the training data for influential models like GPT-2, which followed Reddit's outbound links to decide which web pages were worth including — effectively a gatekeeper for what gets in. Reddit runs roughly 64% male, concentrated in the 18–29 bracket. So the language and norms of young men get a megaphone, and spaces where women congregate get turned down.

GitHub is the backbone of code-generating models. Its 2017 Open Source Survey found 95% of contributors identified as men, 3% as women; independent analyses consistently put women under 10%. And a widely cited academic study of GitHub pull requests found women's code was accepted at a slightly higher rate overall — until their gender became identifiable, at which point acceptance dropped. So the technical voice the intern imitates is not just male-authored, it's male-gatekept.

Common Crawl, the giant web scrape under most LLMs, inherits all of the above. Audits comparing it with Wikipedia find the stereotypical associations — men with career and math, women with family and the arts — are systematically stronger in these corpora, and stronger still in text from wealthier countries. The intern didn't decide women belong in the kitchen; it read that ten million times and assumed it was house style — which means gender bias is built in before a single model weight is tuned.

How Content Filters Erase Women's Voices

Here's the cruel twist: the filters meant to make AI "safe" frequently treat women's experiences, bodies, and political speech as the risky thing, scrubbing them out while the actual misogyny survives. Filtering, which sounds like the part that should protect women, often does the opposite.

The poster child is the C4 dataset — the "Colossal Clean Crawled Corpus" — built by deleting any web page containing a word from a "List of Dirty, Naughty, Obscene, and Otherwise Bad Words." The landmark "Stochastic Parrots" paper (Bender, Gebru, McMillan-Major & Shmitchell, 2021) and a parallel audit by Dodge et al. showed this crude approach disproportionately removed non-offensive content about marginalized groups — non-sexual LGBTQ+ pages (the word "twink" is on the list), feminist discussion, and large amounts of African American English. The blocklist saw a flagged word and torched the whole page, context be damned.
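A rough sketch of that page-level logic, not the actual C4 pipeline. The one-word blocklist is a stand-in (only "twink" is taken from the text above), the two sample pages are invented, and `keep_page` is a hypothetical name.

```python
import re

# Stand-in blocklist; the real list is far longer.
BLOCKLIST = {"twink"}


def keep_page(text: str) -> bool:
    """Keep a page only if none of its words appears on the blocklist."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(BLOCKLIST)


# Invented sample pages for illustration.
pages = [
    "Support group for gay teens: one member writes about being called a twink at school.",
    "Women are naturally bad at engineering and belong at home.",
]

kept = [page for page in pages if keep_page(page)]
print(len(kept))  # 1
```

The supportive page is dropped because it contains a listed word, while the stereotyping page survives because it contains none. A word list has no view of context, so the filter's notion of "clean" and the reader's notion of "harmless" can point in opposite directions.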

What counts as "harmful" or "low-quality" content?

In practice, "harmful" gets bundled into a few neutral-sounding buckets — hate speech, sexual content, violence, illegal activity — then enforced with blunt instruments: keyword lists, NSFW image detectors, and toxicity classifiers. These tools can't tell a survivor's account of assault from pornography, or a feminist critique from an attack. They just see signals.

So a lesbian coming-out story gets flagged as "sexual," and a post about abortion care as "adult." Research on harmful-speech detection finds toxicity classifiers are more likely to label posts by transgender and non-binary users as hate speech, even when supportive; in one striking study, a model scored tweets from drag performers as more toxic than tweets from white nationalists. Women's-health content gets hit especially hard: a 2026 UK House of Commons Library briefing reports that terms like "vagina," "libido," and "menopause" can trip filters even in plainly medical contexts, and in 2025 more than 190 organizations — coordinated by the campaign CensHERship — signed an open letter protesting a "digital ecosystem … [that] treats women's health as inappropriate."

The second filtering bucket is "low-quality" content, meant to strip out spam but enforced with heuristics — short pages, small blogs, unusual vocabulary — that cut exactly the community spaces where women and queer people actually talk. Because women disproportionately write about bodies, health, harassment, and identity, their pages rack up more "hits" on these dumb filters and vanish — while the caricatures of them survive. Net effect: women's real voices get thinner, and the distortions get richer.

The Human Annotators Behind RLHF

Once a model is trained, it gets polished by reinforcement learning from human feedback, or RLHF. Real humans rank the model's answers — which reply is better? — and those rankings train a "reward model" that guides further fine-tuning toward what people judge as good, safe, and appropriate. A small army of people quietly decides the machine's manners. Who are they?

Who labels AI training data, and where?

Mostly a large, deliberately invisible workforce of contractors and crowdworkers — not the AI researchers you picture in a glass office. The work splits into tiers: at the bottom, gig workers on platforms like Appen, Sama, and Remotasks do micro-tasks for low piece rates; general RLHF raters reading outputs against rubrics earn around $15–30 an hour; and domain experts and red-teamers can earn $40–200+ an hour.

Geography matters as much as pay. Much of the labeling is outsourced to the Global South — Kenya, India, the Philippines, Nigeria. Reporting documented workers in Kenya paid roughly $2 an hour by OpenAI's contractor Sama to read through extremely disturbing content so a chatbot could learn to avoid it — work with a real psychological toll, done by people you'll never see. Frontier-lab expert pools skew the other way: one early evaluator pool was 68% white even after diversity efforts.

Why does this matter for gender bias? Because labeling is subjective. Deciding what's "toxic," "polite," or "professional" runs straight through a person's own culture, gender, and class, and NLP studies show annotators' backgrounds systematically shift the labels they assign. If the feedback workforce is skewed — low-paid laborers in one part of the world, a thin layer of mostly white, Western expert raters in another — the model gets tuned to a narrow slice of human judgment and treats it as universal. The intern's etiquette teacher, it turns out, was working from a lopsided syllabus too.
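To make concrete how raters' preferences become the model's preferences, here is a minimal sketch of reward-model training on pairwise rankings. It assumes a Bradley-Terry style loss, a linear reward over two made-up features, and invented preference pairs; real RLHF pipelines use neural reward models and far more data, so treat every name and number below as hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train_reward_weights(pairs, n_features, lr=0.5, epochs=200):
    """Fit linear reward weights so preferred replies score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # d(loss)/d(margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            weights = [w - lr * grad_scale * d for w, d in zip(weights, diff)]
    return weights


# Invented features per reply: [deferential_tone, blunt_tone].
# These hypothetical raters always prefer the more deferential reply.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
]

weights = train_reward_weights(pairs, n_features=2)
print([round(w, 2) for w in weights])
```

The learned weights end up rewarding whatever this particular pool of raters preferred and penalizing what they disliked. The model has no way to tell a universal standard from the taste of the people who happened to do the ranking.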

Real-World Harms of Gender-Biased AI

This is where the abstract gets expensive: biased AI hires, pays, translates, and judges people differently.

Hiring, résumés, and salary advice

The cautionary tale everyone cites is real. Around 2014, Amazon built an experimental AI recruiting tool that taught itself male candidates were preferable. It penalized résumés containing the word "women's" — as in "women's chess club captain" — downgraded graduates of two all-women's colleges, and rewarded male-coded verbs like "executed" and "captured." Trained on a decade of mostly male résumés, it effectively concluded male candidates were better; Amazon scrapped it in 2018.

You'd hope AI bias in hiring got fixed. It didn't — it just got subtler. A 2024 audit of text-embedding models used for résumé screening found they favored white-associated names in 85% of cases and disadvantaged Black men in up to 100% of simulated scenarios. Even when you strip out names, models latch onto proxies — career gaps, school names — that correlate with gender and caregiving. And the salary advice we opened with isn't a one-off: lawyers warn that AI-generated career advice and pay benchmarks can illegally encode the wage gap, with one experiment finding models recommended lower pay for women even while rating them as more qualified.
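The paired-prompt idea behind audits like the salary study can be sketched in a few lines. The prompt wording and the "male"/"female" slot values are hypothetical (the study's actual prompts are not reproduced here), and no model is called; the dollar figures are the ones quoted at the top of this piece.

```python
# Hypothetical prompt wording, for illustration only.
TEMPLATE = (
    "I am a {gender} medical specialist in Denver with ten years of experience. "
    "What starting salary should I ask for?"
)


def counterfactual_pair(template: str, values=("male", "female")) -> tuple:
    """Build two prompts that differ only in the gender slot."""
    return tuple(template.format(gender=value) for value in values)


def pay_gap(advice_a: float, advice_b: float):
    """Return the absolute gap and the gap as a share of the higher figure."""
    gap = abs(advice_a - advice_b)
    return gap, gap / max(advice_a, advice_b)


prompt_m, prompt_f = counterfactual_pair(TEMPLATE)
gap, share = pay_gap(400_000, 280_000)
print(gap, round(share, 2))  # 120000 0.3
```

Because everything except the gender slot is held fixed, any difference in the answers can be attributed to that slot. On the quoted figures, the woman's recommended ask is 30% below the man's.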

Gender bias in machine translation

Machine translation bias is the cleanest demonstration of the problem, because you can watch the stereotype get inserted in real time. Take a sentence from a language with gender-neutral pronouns, such as Finnish or Turkish. The Turkish "o bir doktor, o bir hemşire" is literally "they are a doctor, they are a nurse." Major systems have historically rendered it as "He is a doctor. She is a nurse." The original carried no gender; the machine added it along the most predictable lines imaginable. This isn't anecdotal — Prates, Avelar and Lamb documented a "strong tendency towards male defaults" across dozens of languages, "exaggerated in fields such as STEM."

A 2025 review covering a decade of machine translation research found systems don't just default to masculine forms where the language allows a choice — they underestimate how often women hold certain jobs compared to labor statistics, and some demote women's titles outright. Google added gender-specific translations in 2018 to push back, but the underlying tilt runs deep.

AI image and video generation bias

Ask an image generator for a portrait and you'll watch AI image bias paint itself. A 2023 Bloomberg analysis of over 5,100 Stable Diffusion images found the tool exaggerated stereotypes beyond real-world disparities — underrepresenting women in high-paying jobs, overrepresenting them in low-paying ones, and producing its most skewed results for women with darker skin. As the authors put it, "the world according to Stable Diffusion is run by White male CEOs." A peer-reviewed study in Nature Scientific Reports (2025) documented the same bias across 32 professions.

Other studies back this up: ask for "electrician" or "plumber" and over 93% of the figures come out male-presenting; ask for "nurse" and you mostly get women. Even in balanced professions, image and video systems put men at the head of the table and women in the background. So the ethics of AI image generation isn't an academic seminar — it's about what billions of auto-generated pictures quietly teach the world about who looks like an expert.

Gendered ageism — when age and gender bias intersect

Here's a bias most people never look for: gendered ageism, where age and gender bias fuse into one. A 2025 Nature study from UC Berkeley and Stanford analyzing 1.4 million online images and videos plus nine LLMs across nearly 3,500 categories found AI consistently portrays women as younger than men — and the distortion is strongest in high-status, high-earning jobs. When ChatGPT generated nearly 40,000 résumés, it made the women on average 1.6 years younger and less experienced, then rated the older male applicants as more qualified.

The kicker: this isn't reality — US Census data shows no real age gap between male and female workers. The AI invented it. And ageism in AI feeds back into us: after viewing these skewed results, study participants became more likely to see women as younger and a worse fit for senior roles, while favoring older men for leadership. As co-author Douglas Guilbeault warned, companies can't fix this by "slapping on another filter" — the distortion lives deeper, inside the models.

Intersectional bias — race, sexuality, and gender combined

Bias doesn't add up neatly; it compounds. That's the core finding of research on intersectional bias in AI, and the landmark proof is Joy Buolamwini and Timnit Gebru's 2018 "Gender Shades" study. Testing three commercial gender-classification systems from IBM, Microsoft, and Face++, they found darker-skinned women were misclassified up to 34.7% of the time, while the error rate for lighter-skinned men was 0.8%. Same systems, wildly different realities depending on who you were — damning enough to push IBM, Microsoft, and Amazon to overhaul or abandon their facial-recognition products, with IBM exiting the business entirely in 2020.

The lesson that reshaped the field: a single "overall accuracy" number can hide catastrophic failures for specific subgroups. AI bias against women is never just about women in the abstract — women of color, trans women, disabled women, and older women catch the worst of every overlapping bias at once, with consequences from wrongful arrests to access systems that fail to recognize them at all. The UNDP notes that AI portrayals of gay subjects skew negative around 70% of the time. Sexism in AI rarely travels alone.

How bias compounds: the self-perpetuating loop

Here's the part that should worry you. Once deployed, a biased system changes the world — and then learns again from the world it changed.

A biased hiring tool ranks more men as "top candidates"; they get hired, and their success becomes fresh "data" proving the tool right. Image generators flood the web with male CEOs and decorative young women, training the next generation of models, and increasingly AI trains on AI output. The landmark Nature paper by Shumailov et al. (2024) named this "model collapse," in which minority and rare patterns erode over successive generations; follow-up work at ACM FAccT showed that training on synthetic data actively amplifies bias unless someone deliberately interrupts it. Biased data makes a biased model, which makes biased decisions, which make more biased data: round and round, unless somebody jams a stick in the spokes.

AI, Deepfakes, and Technology-Facilitated Gender-Based Violence

So far the harms have been about opportunity and representation. This next category is about safety — the ugliest corner of the whole subject.

What is technology-facilitated gender-based violence (TFGBV)?

Technology-facilitated gender-based violence — TFGBV — is what happens when old forms of abuse get digital tools. The UNFPA defines it as any act of gender-based violence committed, assisted, aggravated, or amplified by technology against someone because of their gender: harassment, stalking, blackmail, impersonation, doxxing, and sharing intimate images without consent.

The fallout is severe — anxiety, self-censorship, job loss, and the silencing of women journalists, activists, and public figures — and the scale is staggering. An estimated 1.8 billion women and girls still lack explicit legal protection from online abuse, and UNESCO reports that 58% of young women have experienced online harassment, increasingly including AI-generated content.

Deepfakes and online harassment

This is where AI deepfakes turn a creepy parlor trick into a weapon. The numbers are brutal and consistent. The 2019 Deeptrace/Sensity report "The State of Deepfakes" analyzed nearly 15,000 deepfake videos and found 96% were non-consensual pornography, with 100% of the content on the top sites targeting women; it concluded deepfake pornography "exclusively targets and harms women." A 2023 follow-up put it at 98% pornographic, 99% of victims women, with a 550% rise since 2019. (Both likely undercount private, messaging-based abuse.)

Cheap "nudify" apps now let anyone paste a real woman's face onto synthetic sexual content and threaten to publish it. Deepfake harassment is used to humiliate, extort, and silence — survivors report losing jobs, quitting social media, and enduring relentless abuse, while most cases go unreported because of stigma and weak law. The January 2024 incident in which fake sexual images of Taylor Swift reached an estimated 47 million views before removal finally galvanized lawmakers. When people ask whether AI is dangerous, this is the harm that's already here, at scale, today.

How to Reduce Gender Bias in AI

Enough doom. None of this is a law of physics; it's a set of choices, which can be made differently. AI bias mitigation works best when companies, regulators, and users all pull at once, across the whole lifecycle of a system: data, design, deployment, and oversight.

What AI companies can do (bias mitigation strategies)

The single most repeated finding in the field is also the least technical: build diverse teams. Homogeneous teams miss harms they never personally experience, and women in AI are scarce — UNESCO reports they're only about 20% of technical staff at major AI firms and 12% of AI researchers, with UNDP putting them under 14% at senior levels and the World Economic Forum at roughly 22–26% of "AI and data" professionals. You can't expect a room that's 80% men to reliably catch the ways a product fails women.

Beyond that, the responsible AI playbook is concrete. Practice fairness-by-design: set equity goals early, pick AI fairness metrics (equal opportunity, demographic parity), and test for bias at every stage rather than bolting on a check at the end. Audit and rebalance training data (re-sampling, re-weighting, targeted collection of underrepresented voices) and fix the filtering pipelines so they stop scrubbing women's-health and LGBTQ+ content. Use established benchmarks (WinoBias, WinoMT, GenderCARE) and debiasing methods like gender-neutral word embeddings. Run continuous, structured red-teaming, drawing on resources such as UNESCO's Red Teaming Playbook, in which experts and affected communities try to break the model on purpose. And above all, evaluate intersectionally: the enduring lesson of Gender Shades is that a good average can hide a disastrous subgroup failure. These are real AI bias solutions, not vibes.
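The two fairness metrics named above can be computed in a few lines. This is a minimal sketch on invented toy data, not any vendor's audit procedure; the function names and the "m"/"w" group labels are assumptions made for the example.

```python
def rate(flags):
    """Share of positive (1/True) values in a list."""
    return sum(flags) / len(flags)

def demographic_parity_gap(preds, groups, a, b):
    """Difference in positive-prediction rates between groups a and b."""
    rate_a = rate([p for p, g in zip(preds, groups) if g == a])
    rate_b = rate([p for p, g in zip(preds, groups) if g == b])
    return rate_a - rate_b

def equal_opportunity_gap(preds, labels, groups, a, b):
    """Difference in true-positive rates, counting only truly qualified people."""
    tpr_a = rate([p for p, y, g in zip(preds, labels, groups) if g == a and y])
    tpr_b = rate([p for p, y, g in zip(preds, labels, groups) if g == b and y])
    return tpr_a - tpr_b

# Invented toy data: label 1 = actually qualified, pred 1 = shortlisted by the tool.
groups = ["m", "m", "m", "m", "w", "w", "w", "w"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0]

dp_gap = demographic_parity_gap(preds, groups, "m", "w")
eo_gap = equal_opportunity_gap(preds, labels, groups, "m", "w")
print(f"demographic parity gap: {dp_gap:+.2f}")
print(f"equal opportunity gap:  {eo_gap:+.2f}")
```

A gap of zero on either metric means the two groups are treated alike by that particular yardstick; the two metrics can disagree, which is one reason to pick them early and report both.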

AI ethics and governance frameworks (UNESCO Recommendation and beyond)

On the policy side, the anchor document is UNESCO's Recommendation on the Ethics of AI, adopted in November 2021 by all 193 member states and the first global standard on the ethics of AI. It puts gender equality at its core, dedicating a full policy chapter to it and insisting AI must not widen existing gaps like the wage gap or entrench stereotypes; it also calls for public funding of gender-responsive schemes and more women in STEM and AI leadership. UNESCO's Women4Ethical AI platform tracks how those provisions actually get implemented.

It doesn't stand alone, and new laws are arriving fast. In the US, the TAKE IT DOWN Act (May 2025) is the first comprehensive federal law on non-consensual intimate imagery, real or AI-generated, making knowing publication a crime and requiring platforms to remove flagged content within 48 hours. The UK criminalized sharing such deepfakes via the Online Safety Act 2023 and creating them under the Data (Use and Access) Act 2025. The EU AI Act requires under Article 50 that deepfakes be labeled as artificial, and its penalties reach as high as €35 million or 7% of global turnover for the most serious violations. Reviews of more than 200 AI governance guidelines worldwide find broad agreement on the principles of AI ethics: fairness, transparency, accountability, human oversight. The hard part, as always, is enforcement.

What users can do (red-teaming, prompting, auditing)

You're not powerless either. First rule: treat AI outputs as hypotheses, not verdicts, especially for anything affecting a real life, like hiring, health, salary, or safety. Never let an LLM be the sole basis for a decision that matters, and resist the "ELIZA effect" of over-trusting a system just because it sounds fluent and confident.

You can also do lightweight red-teaming yourself. Swap the pronoun, name, age, or profession in a prompt and see if the answer shifts: "salary advice for a woman software engineer" versus "a man software engineer" exposes a lot in thirty seconds. Open community tools like LUCID help groups systematically test and record algorithmic bias across protected categories. When a system lets you rate or flag responses, use it; biased outputs get fixed only when reported. And push, as a citizen and customer, for transparency (model cards, audit reports, outside oversight) so that fixing AI bias is treated as a structural problem, not a string of one-off "bugs." The intern can be retrained, but only if enough people keep pointing out where it's wrong.
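For anyone who wants to run the swap test more systematically, here is a small Python sketch that builds matched prompt sets differing only in the attributes under test. The template wording and slot values are hypothetical; pasting the prompts into a chatbot and comparing the answers is still done by hand.

```python
from itertools import product

# Hypothetical template; only the bracketed slots change between prompts.
TEMPLATE = "Give salary negotiation advice for a {age}-year-old {gender} {job}."

def counterfactual_prompts(template, **slots):
    """Return the template filled with every combination of slot values."""
    keys = list(slots)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(slots[key] for key in keys))
    ]

prompts = counterfactual_prompts(
    TEMPLATE,
    age=[28, 52],
    gender=["woman", "man"],
    job=["software engineer"],
)

for prompt in prompts:
    print(prompt)
```

Keeping everything else in the prompt identical is the design choice that matters: if the answers differ, the swapped attribute is the only candidate explanation.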

AI Writing vs Human Writing: Why AI Sounds the Same (And How to Fix It)

editorial@silentroom.ai (Joseph Smith) — Mon, 22 Jun 2026 12:28:31 GMT

Paste the same prompt into ChatGPT, Claude, and Gemini, and something unsettling happens. Three companies, three separate training runs, billions of dollars of separate engineering, and back come three drafts that read like triplets separated at birth. Same tidy structure. Same gentle, agreeable tone. Same faint smell of a corporate press release that's been through legal twice.

That sameness is the heart of the AI writing vs human writing question. It isn't a bug somebody forgot to fix, and it isn't a sign the models are stupid — they're extraordinary. It's a direct, almost mathematical consequence of how they're built. Once you see the machinery, you can't unsee it — and, better news, you can start to work around it.

This piece walks the whole arc: what AI writing actually is, the signs that give it away, how to tell it apart from the human kind, the deeper reason no model holds a voice for long, and the fixes that work (plus the oversold ones).

AI Writing vs Human Writing at a Glance

Here's the one-sentence version for the people skimming on their phone: AI writing is optimized to produce the most probable, most agreeable phrasing for any given topic, while human writing is shaped by specific experience, conviction, and the occasional bad decision that turns out to be brilliant. That single difference explains almost everything else.

Quick comparison table

If you only read one part of this article, read this table. Every row is a thread we pull on later.

Dimension	Typical AI writing	Typical human writing
Sentence rhythm	Narrow, metronomic — most sentences land in a 15–25 word band	Wildly uneven; a nine-word fragment can follow a 40-word monster
Vocabulary range	High "sophistication," low variety — the same fancy words on repeat	Lower average register, but odd, local, surprising word choices
Structural predictability	Intro, three neat body points, tidy conclusion — every time	Digressions, false starts, an argument that doubles back on itself
Emotional specificity	Abstract sentiment: "heartfelt," "powerful," "deeply meaningful"	Concrete detail that carries the emotion without naming it
Error patterns	Almost no typos, but confident factual fabrications (hallucinations)	Typos, contradictions, tangents — and lived, checkable specifics
Perplexity tendency	Low — the next word is always the expected one	Higher and jumpier — readers get genuinely surprised
Burstiness	Low — uniform sentence-to-sentence variance	High — the rhythm lurches on purpose

Notice that none of these is a single smoking gun. They're tendencies, and that nuance matters enormously the moment someone runs your essay through a detector.

What counts as "AI writing" today

Here's the thing people get wrong: "AI writing" isn't a yes/no anymore. It's a spectrum. At one end you have fully AI-generated text, where a human typed a prompt and copied the result. In the middle sits AI-assisted writing — the model drafts an outline or a few paragraphs and a human edits, selects, and stitches. At the far end is light AI support, the grammar-and-polish pass that's barely distinguishable from an aggressive spell-checker.

Most academic publishers now draw the line around substance, not tools. Elsevier's policy says AI can assist with language and structure but can't be listed as an author, and any meaningful use must be disclosed. The uncomfortable catch is that detectors can't reliably tell these levels apart. A lightly-edited AI draft, a heavily-edited one, and a clean human draft can all land in the same statistical neighborhood — which is exactly how honest writers end up falsely accused. Hold that thought; it comes back with a vengeance.

The Signs of AI Writing (And Why They Happen)

Most "spot the robot" listicles stop at what the signs of AI writing look like. The interesting part is why they happen, because the why tells you which signs are fixable and which are baked into the architecture.

Predictability and formulaic structure

A language model is, underneath the marketing, a next-word-prediction machine. You feed it "the cat sat on the…" and it computes that "mat" is wildly more likely than "crocodile," then picks the safe option. Word by word, that's how the whole text gets built — a process called autoregressive generation.
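The loop is simple enough to sketch in Python. This toy version swaps the neural network for a hand-written lookup table keyed on the last two words, with invented probabilities; a real model conditions on far more context, but the pick-append-repeat cycle is the same idea.

```python
# Toy "model": invented probabilities for the next word, given the last two words.
NEXT_WORD = {
    ("the", "cat"): {"sat": 0.7, "slept": 0.3},
    ("cat", "sat"): {"on": 0.9, "down": 0.1},
    ("sat", "on"): {"the": 0.8, "a": 0.2},
    ("on", "the"): {"mat": 0.80, "sofa": 0.19, "crocodile": 0.01},
}

def generate_greedy(prompt, max_words=10):
    """Autoregressive generation: repeatedly append the single most likely next word."""
    words = prompt.split()
    while len(words) < max_words:
        options = NEXT_WORD.get(tuple(words[-2:]))
        if options is None:  # the toy table has nothing to say about this context
            break
        words.append(max(options, key=options.get))
    return " ".join(words)

text = generate_greedy("the cat")
print(text)
```

Every step takes the safe option, so "crocodile" never gets a chance no matter how many times the loop runs.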

Repeat that bias a few billion times during training and you get prose that makes sense but never surprises. That's why so much AI text shares one skeleton: an intro, three or four evenly-spaced body points, a summarizing conclusion, and over-signposted transitions — "Furthermore," "It's worth noting that," "In conclusion" — bolted on like scaffolding nobody took down. The structure is the average of every essay the model ever read, and the average essay is a middle-school book report.

The "pastel prose" problem

Writers have a name for the dominant AI style: pastel prose. Picture a painting done entirely in beige and light gray. Nothing's wrong with it. Nothing stands out either, and five minutes later you can't remember a single brushstroke.

Pastel prose is smooth, inoffensive, gently positive, drained of sharp edges. Instead of a concrete, slightly weird detail, it reaches for broadly appealing abstractions — "innovative," "impactful," "game-changing" — words that could describe almost anything and therefore describe nothing. The cruel twist: run your own draft through an AI "improver" and this effect gets amplified. The spiky, idiosyncratic bits that carried your voice get sanded off in the name of clarity. You ask the machine to make your writing better and it makes it more like everyone else's.

How RLHF trains the blandness in

So where does the blandness actually come from? A lot of it arrives in a second training stage called RLHF — Reinforcement Learning from Human Feedback. After the model learns raw prediction, thousands of human raters score its outputs, and the model learns to chase high scores.

The trouble is what raters reliably reward: text that is helpful, harmless, polite, balanced, and inoffensive. Strong opinions feel risky. Strange rhythms feel like errors. Vivid, unhinged images feel unsafe. So all of it gets quietly penalized. The model figures out the winning formula fast — write like a press release and you'll never get marked down — and optimizes straight toward it.

Over millions of these judgments, the rare, idiosyncratic choices that make writing feel alive get sanded down, while median, low-risk wording gets reinforced. The model converges on the average of what humans approve of, which is precisely nobody's actual voice. It isn't writing badly. It's writing like a committee.

Why human writers get falsely flagged

Here's where it gets personal, because this is the "why does my writing sound like AI" question, and the answer is genuinely reassuring. Detectors don't understand writing. They measure statistical properties (sentence predictability, structural uniformity, vocabulary variety, connector frequency) and flag text whose numbers resemble known AI output, even if a human wrote every word.

Several legitimate human styles trip the wire. Non-native English writers lean on narrower vocabulary, reliable sentence templates, and the formal transitions taught in language courses — producing the same low-perplexity profile as a model. Academic and technical writers are formulaic on purpose; the genre demands it. Anyone who over-edits — heavy grammar-checker use, corporate résumé-speak, grant applications — irons out the very sentence-length variation that signals a human hand.

If your prose keeps getting mistaken for a machine, it usually means you're doing several things well: clean structure, consistent tone, few errors, tight genre conventions. The better you write rubric-friendly text, the more your statistical fingerprint drifts toward what detectors treat as robotic. That's a flaw in the detector, not in you.

How to Tell AI Writing from Human Writing

Knowing how to spot AI writing is part pattern-recognition, part humility — none of these signals is proof on its own, and skilled humans beat every one when they want to. Treat what follows as evidence to weigh, not a verdict. Each section gives you one concrete tell.

Tone and personal voice

The cleanest distinction in the whole AI writing style debate: AI has tone, but not voice. Tone is the register — friendly, professional, academic — and the model nails it on demand. Voice is the harder thing: a consistent set of word preferences, a particular relationship with the reader, opinions that create friction.

Compare two ways of opening a restaurant review. AI: "The establishment offers a delightful dining experience with attentive service and a thoughtfully curated menu." A human: "The waiter clocked my cheap shoes before I sat down, and somehow the soup still arrived warm." The second one has a person inside it — a grudge, a specific shoe, a small mercy. Corpus studies back this up: AI uses more structural markers (transitions, framing) and fewer stance markers (hedges, boosters, direct address). It sounds polished and reveals nothing.

Vocabulary and repetition

There's a recognizable AI lexicon, and once you see it you can't stop seeing it: delve, leverage, tapestry, it's worth noting, nuanced, in today's landscape, at the end of the day. Pangram's pattern guide and Wikipedia's running "Signs of AI writing" page catalog dozens more.

But the deeper tell isn't any single word — it's repetition. Here's a real paradox researchers keep finding: AI writing often has higher lexical density (more "heavy," meaningful words) than human writing, yet lower lexical diversity. It knows fifty sophisticated spices and keeps reaching for the same one. "Furthermore" three hundred times. A human writer is messier and stingier: a favorite phrase used twice, then a hard tonal pivot, then one vivid metaphor that appears exactly once and never returns.
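The two measures are easy to confuse, so here is a rough Python sketch of both. It assumes a tiny hand-picked function-word list and a type-token ratio as the diversity measure; published studies typically use part-of-speech taggers and length-corrected diversity scores, so treat the numbers as illustrative only.

```python
import re

# A tiny, hand-picked function-word list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
                  "that", "on", "was", "my", "i"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def lexical_density(text):
    """Share of tokens that are content words rather than function words."""
    tokens = tokenize(text)
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)

def type_token_ratio(text):
    """Lexical diversity: distinct words divided by total words."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens)

machine_like = "Innovative solutions leverage innovative solutions"
human_like = "The waiter clocked my cheap shoes"

for label, sample in [("machine-like", machine_like), ("human-like", human_like)]:
    print(label, round(lexical_density(sample), 2), round(type_token_ratio(sample), 2))
```

The first sample is all "heavy" words yet keeps reusing them; the second spends two of its six words on glue but never repeats itself.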

Sentence rhythm and burstiness

Read AI text aloud and you'll feel it before you can name it: a steady, metronomic cadence, most sentences clustered around the same moderate length, clause after clause built to the same template. Smooth. Also slightly dead.

Humans write in bursts. A short, punchy sentence. Then a long, branching one with three subordinate clauses and a stray thought parked in parentheses. Then back to short. That unevenness is the readable version of burstiness, and it's one of the most reliable felt differences between the two. The model can imitate it for a paragraph; it can't sustain it, because sustaining it means constantly choosing the less expected sentence shape — exactly what its training discourages.

The detector's view: perplexity and burstiness

Now the proper definitions, because perplexity and burstiness are the actual machinery behind most detectors. Perplexity measures how surprised a language model is by the next word. "The cat sat on the mat": low perplexity, fully predictable. "The cat sat on a theorem": perplexity spikes, because nobody sits on theorems. Human text fluctuates: smooth for a stretch, then a jolt, then smooth. AI text stays flat, because the machine tends to pick the predictable word at every step.

Burstiness is the variance of that surprise — and of sentence length — across a passage. Tools like GPTZero plot these signals and flag text that's too uniform. The honest caveat, straight from the research: false positives are real and not rare. Formal, efficient, or non-native writing routinely scores "AI-like." A detector outputs a probability, not a fact, and anyone treating "98% AI" as proof has misunderstood the tool. The numbers are a clue. Authorship is a separate question.
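Both quantities reduce to short formulas. The sketch below computes perplexity from a list of per-token probabilities (invented here; a real detector gets them from a language model) and uses the standard deviation of sentence lengths as one simple stand-in for burstiness. How GPTZero or any other tool actually computes its scores is not something this example claims to reproduce.

```python
import math
import re
import statistics

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

def sentence_length_burstiness(text):
    """Population standard deviation of sentence lengths, measured in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)

# Invented probabilities: every word an even bet, versus one very unlikely word.
predictable = perplexity([0.5, 0.5, 0.5, 0.5])
surprising = perplexity([0.5, 0.5, 0.5, 0.01])

flat = sentence_length_burstiness(
    "One two three four. Five six seven eight. Nine ten eleven twelve."
)
bursty = sentence_length_burstiness(
    "Short. Then a much longer sentence that wanders on for a while. Back."
)

print(predictable, surprising)
print(flat, bursty)
```

A single improbable word is enough to lift perplexity noticeably, and a single long sentence between two fragments does the same for burstiness.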

Why AI Can't Hold a Human Voice

This is the conceptual core, and it goes one level below the visible signs. The tells above aren't random quirks — they fall out of three structural limits that no prompt fully escapes.

Information density vs event density

Researchers separate two qualities. Information density is how many facts and details a text packs in. Event density is how much actually happens — how many irreversible moments, turns, and specific scenes move things forward. AI scores high on the first and falls apart on the second.

Watch AI prose and it's always explaining, always summarizing — and never arriving anywhere. The bear that's supposed to maul the hiker stays politely off-page while the text describes the forest, the weather, and the hiker's complicated feelings about nature. Human writing moves between registers: dense analysis, then a scene where someone slams a door. The model stays locked in one gear, because a real event means committing to this specific thing happening and not the safe average of all things.

Affective density and what gets lost

Closely related but distinct is affective density: a text's ability to carry emotion without announcing it. When Chekhov writes that the sky was grey and you somehow know a character is about to do something irreversible, that's affective density. The feeling rides inside a concrete detail, a rhythm, an unexpected word.

AI can recognize emotion — there's evidence something like the psychologists' "valence–arousal" map forms inside these models. Reproducing it is another matter. Lacking experiences to draw on, the model gestures at feeling with abstract sentiment words — "heartfelt," "powerful," "deeply moving" — labels stuck on the outside of a sentence rather than feeling smuggled inside it. This is why "write with more emotion" backfires: you don't get more feeling, you get more words about feeling. It's structural, not a prompt you haven't found yet.

Directed convergence (the GPT/Claude/Gemini experiment)

Which brings us back to the triplets from the opening. In one widely-discussed experiment, researchers gave GPT, Claude, and Gemini 300 personal stories and asked each to rewrite them while preserving the author's voice. It was right there in the prompt. All three did the same thing: reduced first-person pronouns, stripped markers of specific events, added distance and abstraction. "I woke up, it was cold, the cat gave me a judgmental look" became "the morning awakening was accompanied by a sensation of coolness and the presence of a domestic animal." Technically equivalent. Spiritually a corpse.

The authors call this directed convergence: different labs, different prompts, the same drift toward the averaged and the safe. The explanation is almost boringly mechanical — shared training data, similar RLHF objectives, optimization toward the same human-preference signals. (Treat the strongest claims cautiously; much of this circulates as essays and talks, not settled peer review.) But the core finding is the payoff the headline promises. AI sounds the same because every model is climbing the same hill toward the same average, and the average has no voice.

Can AI Writing Be Fixed?

Yes — partly, and more than the doom-posting suggests. But the advice on how to make AI writing sound more human is mostly aimed at the wrong layer, so let's separate the levers that actually move from the ones that don't.

Decoding, temperature, and why prompts aren't enough

Most advice lives in the prompt: "write in my voice," "be more creative," "don't sound like AI." This has a hard ceiling, and here's the mechanical reason. Once the model has computed its probabilities for the next word, a separate step called decoding decides how to pick from them. Prompts shift the probability distribution. Decoding determines how you sample from it — and that's where the blandness is often locked in.

The crude option is greedy decoding: always take the single most likely word. Grammatically perfect, terminally dull. The standard alternative is sampling with a temperature dial: low hugs the safe choices, high adds randomness — but crank it too far and you get word salad. The deflating truth: you can write the most inventive system prompt on earth, and if the decoder is greedy, the output stays pastel. Creativity doesn't hide in the instructions. It lives in how you sample the distribution.
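Here is roughly what the temperature dial does, as a self-contained Python sketch. The three candidate words and their scores are invented, and the functions are hypothetical helpers, not any particular library's API.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(words, logits):
    """Greedy decoding: always the top-scoring word."""
    return words[logits.index(max(logits))]

def sample_pick(words, logits, temperature, rng):
    """Sampling: draw a word in proportion to its temperature-adjusted probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]

words = ["mat", "sofa", "theorem"]
logits = [3.0, 2.0, 0.0]  # invented scores for the next word

cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(greedy_pick(words, logits))
print(sample_pick(words, logits, 1.0, random.Random(0)))
```

At low temperature almost all the probability piles onto "mat"; at high temperature "theorem" becomes a live option, which is both the appeal and the risk.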

Min-P sampling

In 2024, Min-P sampling offered a smarter dial. Instead of a fixed cutoff, it sets a dynamic threshold based on the model's confidence: it looks at the top candidate's probability and keeps only words above some fraction of it.

The elegance is in what that does. When the model is confident — the top word sits at 90% — the threshold is high and the long tail of junk gets cut, so it won't go off the rails. When the model is uncertain — the top word is only 20% — the threshold drops and dozens of legitimate alternatives survive, so it's free to be inventive precisely where there's no single right answer. The result is more varied writing without the incoherence spike you get from just raising temperature. (Fair warning: a 2025 re-analysis questioned some of the original benchmark numbers, but the core idea has caught on and ships in local runtimes like vLLM and llama.cpp.)
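The filtering rule described above fits in a few lines of Python. This is a minimal sketch with invented probabilities and a hypothetical `min_p_filter` helper; production runtimes apply the same idea to full vocabularies and usually combine it with temperature.

```python
def min_p_filter(probs, min_p=0.1):
    """Keep words whose probability is at least min_p times the top probability,
    then renormalize the survivors so they sum to 1."""
    threshold = min_p * max(probs.values())
    kept = {word: p for word, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()}

# Invented next-word distributions.
confident = {"mat": 0.90, "sofa": 0.05, "rug": 0.03, "theorem": 0.02}
uncertain = {"mat": 0.20, "sofa": 0.18, "rug": 0.17, "floor": 0.15,
             "bed": 0.15, "roof": 0.14, "theorem": 0.01}

kept_confident = min_p_filter(confident)
kept_uncertain = min_p_filter(uncertain)
print(sorted(kept_confident))
print(sorted(kept_uncertain))
```

With the same setting, the confident case keeps a single word while the uncertain case keeps six and still drops the outlier, which is the adaptive behavior a fixed cutoff cannot give you.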

Contrastive Search

A different fix targets a different ailment: AI's habit of looping back and repeating itself ("furthermore… moreover… in addition…"). Inside the model, words cluster in suspiciously similar coordinates, which makes repetition the path of least resistance.

Contrastive Search fights this directly. When choosing the next word, it scores not just probability but how different that word is from what's already been written, and penalizes choices that are too similar. Picture an editor reading over the model's shoulder, rapping its knuckles every time it reaches for the same construction twice. The output gets more varied at a structural level without losing coherence — repetition handled at the decoder, not the prompt.
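A stripped-down Python sketch of that scoring rule follows. It assumes the commonly described form, a weighted probability minus a weighted maximum cosine similarity to what has already been written, and uses invented two-dimensional vectors in place of the model's real internal representations.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_pick(candidates, history_vectors, alpha=0.6):
    """candidates maps word -> (probability, vector). Returns the word with the
    best trade-off between likelihood and dissimilarity to earlier words."""
    def score(word):
        prob, vec = candidates[word]
        penalty = max(cosine(vec, past) for past in history_vectors)
        return (1 - alpha) * prob - alpha * penalty
    return max(candidates, key=score)

# Invented vectors: "moreover" points almost the same way as the earlier "furthermore".
history = [(1.0, 0.0)]
candidates = {
    "moreover": (0.50, (0.9, 0.1)),  # more likely, but nearly a repeat
    "but": (0.30, (0.0, 1.0)),       # less likely, clearly different
}

print(contrastive_pick(candidates, history))             # penalty switched on
print(contrastive_pick(candidates, history, alpha=0.0))  # probability only
```

Setting `alpha` to zero turns the penalty off and the choice collapses back to plain greedy decoding, which makes the editor-over-the-shoulder effect easy to see.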

Style Anchoring and RAG

For most writers, this is the practical winner. Style Anchoring means feeding the model real examples of the target voice before it generates. The crude version — "write like Cormac McCarthy" — works for a paragraph, then drifts back to default, because the architecture isn't built to hold a persona.

The rigorous version uses RAG (Retrieval-Augmented Generation): a database of your own prior writing sits beside the model, which pulls relevant passages into context before generating. RAG started as a way to feed in facts so the model wouldn't hallucinate — and it's still one of the better hedges against fabricated sources, since it grounds output in real material instead of confident guesses. Newer frameworks extract your stylometric fingerprints — sentence length, favored constructions, punctuation habits — and hard-wire them into context as a constraint. The model doesn't "remember" your style; it receives it as a non-negotiable instruction. This is the most accessible, highest-ceiling fix available to a working writer today.
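As a rough illustration of the fingerprinting step, the Python sketch below measures a few surface habits in a writing sample and turns them into text that could be placed in a model's context. The chosen features and the `style_fingerprint` and `style_instruction` helpers are assumptions for this example, not the method of any specific framework.

```python
import re
import statistics

def style_fingerprint(text):
    """Measure a few simple stylometric features of a writing sample."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    words = len(text.split())
    return {
        "mean_sentence_words": statistics.mean(lengths),
        "sentence_words_stdev": statistics.pstdev(lengths),
        "semicolons_per_100_words": 100 * text.count(";") / words,
        "questions_per_100_words": 100 * text.count("?") / words,
    }

def style_instruction(fp):
    """Turn the fingerprint into plain-text constraints for the model's context."""
    return (
        f"Average sentence length: about {fp['mean_sentence_words']:.0f} words. "
        f"Sentence-length spread (std dev): about {fp['sentence_words_stdev']:.1f} words. "
        f"Semicolons per 100 words: {fp['semicolons_per_100_words']:.1f}."
    )

sample = "I woke up. It was cold; the cat gave me a judgmental look. Why?"
fingerprint = style_fingerprint(sample)
instruction = style_instruction(fingerprint)
print(instruction)
```

In practice the sample would be a large body of your own writing, and the resulting constraints would sit in the context next to retrieved passages, not replace them.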

The Future: Human + AI Writing

The framing of AI vs human writing is already starting to feel dated. The more useful question isn't who wins — it's who does which job, and where the handoff belongs.

When to use AI vs a human writer

The division of labor is clarifying once you stop being precious about it. AI is genuinely strong at structure, first drafts, generating variations, summarizing research, and SEO scaffolding. Humans remain irreplaceable for voice, original argument, lived experience, and any context where being distinct is the entire point.

Job	Hand it to AI	Hand it to a human
Outlines & structure	Yes — fast, tireless, organized	Optional
First-draft volume	Yes — generates options at speed	—
Research synthesis	Yes — strong (verify the facts)	Final fact-check
Brand & personal voice	—	Yes — non-negotiable
Original argument / POV	—	Yes — the whole value
Lived experience & stories	—	Yes — can't be faked safely
Final quality assurance	—	Yes — always

The pattern: the higher the stakes on distinctiveness and truth, the more the human's name belongs on it. Content marketing tolerates heavy AI; journalism, legal, and anything load-bearing does not.

The hybrid workflow

The workflow that consistently beats both pure-AI and pure-human output looks like this: human sets direction → AI drafts structure → human rewrites for voice → AI checks consistency. Or, flipped: AI generates a spread of variations and the human selects, extends, and elevates.

The key shift: the writer's role moves from generation to curation. That sounds like a demotion and is the opposite — selecting and elevating is higher-leverage work than grinding out a first draft. It also guards against the two AI failure modes that matter most: hallucinated facts, which a human must catch because the model states fiction with the same confidence as truth, and flattened voice, which only a human rewrite restores.

Voice Hubs and adaptive decoding

Where's this heading? Toward systems that learn an individual writer's voice from their whole body of work and use it to constrain generation, not just prompt it. Call them Voice Hubs: persistent stylistic profiles that accumulate as you write, capturing your real sentence rhythms and word habits as a mathematical fingerprint rather than a vague instruction.

Pair that with adaptive decoding — sampling settings that shift paragraph by paragraph, tightening for a dense analytical passage and loosening for an emotional one — and you get something closer to a collaborator than a vending machine. None of this dissolves the ethical homework that comes with the territory: models inherit and amplify bias from their training data, the economics threaten real job displacement for working writers, and readers are owed transparency about what a machine touched. The tools are arriving faster than the norms. We get to decide what we do with both.

AI Wants Your Pension: How Larry Fink and BlackRock Plan to Spend Trillions from Your Pocket

editorial@silentroom.ai (Nick Rogers) — Mon, 22 Jun 2026 12:28:11 GMT

1. "Save Our Pensions!" — Now a BlackRock Slogan

Picture this. It's evening. You're finishing your tea and scrolling through your feed. Your phone rings. The screen shows your pension fund's number. You pick up, and a polite voice says: "Good evening. We've decided your savings will be going toward building giant server warehouses. We simply can't do it without you: ChatGPT won't wait, and the government's broke. You don't mind, do you?"

It sounds absurd, like a bad sketch. But this is exactly what Larry Fink — founder and longtime CEO of BlackRock, the world's largest investment firm — is proposing. The firm currently manages roughly $12 trillion, more than the combined GDPs of Japan and Germany.

Fink founded BlackRock in 1988 with a handful of employees. Today it is a financial empire whose funds hold stakes in virtually every major publicly traded company in the United States and manage the retirement savings of tens of millions of people worldwide. When Fink says "invest," Wall Street rearranges the pieces. When he says "pensions," you'd better pay close attention. And he's been saying it more and more.

In May 2024, speaking in Berlin, Fink first articulated what would become his defining thesis. G7 countries face a "giant problem" — their budgets cannot bear the cost of financing AI infrastructure. The solution, he argued, lies neither in tax reform nor in cutting military spending. The solution is "mobilizing private long-term capital," which means your retirement savings.

By 2025, Fink had distilled the idea into three channels, each engineered with its own tidy precision.

First — Social Security. Fink praises the system as "a remarkable achievement" that lifts 30 million people out of poverty every year. Then comes the cold footnote: by 2035, the funds will run dry. His signature solution is to build a "ladder" alongside the existing "safety net" — private savings accounts whose money flows into investment funds and from there into AI infrastructure. Technically, nothing is taken from anyone. It's just that a portion of your contributions stops being insurance and becomes a bet.

Second — Trump Accounts. Trump’s One Big Beautiful Bill Act opens an investment account with a $1,000 starting balance for every American child born between 2025 and 2028, with parental contributions of up to $5,000 per year permitted. Where that money goes, the law doesn't say — but America's market for "default" funds is divided among three players, and BlackRock is first among them. Fink publicly welcomed the program, and it's easy to see why. Tens of millions of children are indirectly bankrolling someone’s data center in Texas before they learn to walk.

Third — 401(k) Funds. This is the most methodical channel of all. In his shareholder letter, Fink lays out the agenda: private assets like infrastructure, data centers, and power grids should become standard components of retirement plans, not assets sitting behind high walls whose gates open only for the wealthiest. The packaging is already ready in the form of "target-date funds" automatically assigned to employees who never bother reading the fine print. Fink proposes a new baseline allocation of 50/30/20 for equities, bonds, and private assets, and identifies $25 trillion sitting in cash and money-market funds as the available reserve. There is exactly one obstacle: fiduciaries can be sued for "opaque" investments, and BlackRock is separately lobbying to reform that law. If it succeeds, 20% of the retirement portfolios of tens of millions of Americans will flow into BlackRock infrastructure funds, with no meeting, no signature, and no questions asked.

All three mechanisms share the same core logic: the money you think of as yours — set aside for retirement, for your children's education, for a rainy day — isn't. It belongs to the AI future. Larry Fink simply wants to claim it first.

His argument is mathematically simple and breathtakingly cynical. The government can't carry the AI bill alone. Large private capital can't do it single-handedly either. That means the diamonds of the future economy will have to be mined from private citizens’ savings — your savings.

2. How We Got Here: A Brief History

2022–2023. ChatGPT launches to the public and, within a matter of months, rewrites the entire industry's agenda. Everyone scrambles to build large language models and almost immediately runs into physical reality: not enough computing power, not enough chips, not enough electricity. The money to build all of this from scratch exists in the world. It's just sitting where it can't easily be grabbed.

2024. Fink launches a public campaign. At first, he moves carefully: shareholder letters and occasional speeches start featuring language about "public-private partnerships." Then he brings his agenda into the open. His May speech in Berlin marks the turning point, with phrases like "giant problem," "more deficits than ever," and "mobilizing private long-term capital." By year's end, the pitch lands at every major forum Fink attends. The financial press takes quiet note: BlackRock is preparing something.

2025–2026. BlackRock moves from words to action. Through its Global Infrastructure Partners division, the firm acquires Aligned Data Centers in a $40 billion deal. It also launches an alliance with Microsoft and Nvidia, with $12.5 billion of a targeted $30 billion fund already in place. Fink has gone from a speech in Berlin to deals worth tens of billions of dollars in under two years.

May 2026 — now. At the Milken Institute conference in Beverly Hills, Fink is pitching the next move: "A new asset class will emerge — computing-power futures." They’re like oil contracts, he implies, only better. It sounds sleek. It smells like speculation.

3. Anatomy of a Deal: What Trillions Look Like

A single modern one-gigawatt data center costs tens of billions of dollars. America needs not one or two, but many. Add grid upgrades on top of that, and you're looking at trillions of dollars more. Where does the money come from? Fink answers that question himself: $25 trillion sitting in cash and money-market funds, money that, in his words, "can be put to work."

That means your 401(k), your IRA, your Social Security.

Fink doesn't just say "put your money into our funds." In his 2025 letter to shareholders, he warns that companies with the data, infrastructure, and capital for AI "will benefit disproportionately," deepening the divide between the wealthy and everyone else. When market capitalization grows while ownership stays narrow, prosperity starts to feel ever more out of reach for those left on the outside. Fink's answer: open the capital markets to more people.

Notice the architecture. The problem is inequality, the culprit is a narrow ownership class, the victim is the ordinary person, the solution is investment. The word "BlackRock" never appears anywhere in this chain. It doesn't need to. The dominant operator of capital markets, the dominant seller of infrastructure funds, the dominant beneficiary of "broadened participation" — that's Fink.

Translated from financial into plain English, the structure looks like this: BlackRock invests your pension savings into its own funds, builds the data centers, collects management fees, and leaves you to pray the bubble doesn't burst.

4. Conflict of Interest: Who Wins, Who Loses

Winners. BlackRock and Wall Street gain control over critical infrastructure and management fees on trillions. Microsoft, Nvidia, and Google get data centers built on other people's money, with zero risk to their own balance sheets. Energy companies benefit as well: as consumer rates rise, their share prices soar.

Losers. Retirees and future retirees see their savings become hostages to the volatile AI market. Taxpayers foot the bill for grid modernization. Workers whose jobs AI is automating bankroll their own replacement through their pension funds.

It’s a neat symmetry: the winners are everyone who already sits on capital, and the losers are everyone who works for it.

5. The Case For and Against

For: The BlackRock Position and Its Allies

Fink's supporters argue along the same lines he does. US leadership in AI, Fink says, is "not optional," and maintaining it requires "capital markets capable of financing innovation at this scale." The rest follows naturally: China is building, Europe is falling behind, and if Americans don't get involved in financing AI through investment vehicles, they won't lose to corporations — they'll lose to their own future.

The argument that "either you're a shareholder or you're a victim" lands every time, especially when it comes from someone managing trillions.

Fink isn't alone in making this case. Marc Rowan, CEO of Apollo Global Management, has spent years publicly arguing the same point. He says that Americans' retirement savings are the largest pool of capital in the world, and today that pool is "almost entirely cut off" from the assets that are actually growing. Rowan's conclusion is blunt: "Private assets will come to 401(k)s. It's not a question of 'if,' it's a question of 'when.'"

Against: Researchers, Journalists, and Social Media

Author and researcher Gary Marcus responded to Fink's initiative with a short, caustic post: "Great! You can pay for the infrastructure that will eventually take your jobs! And if it all collapses and the next bubble bursts? You get to bail out the hyperscalers and watch your pension fund die." The remark was picked up widely across professional circles.

The parallel with 2008 is hard to miss. Back then, banks offloaded toxic debt onto taxpayers through bailouts. Now we're being offered an even more elegant arrangement: transfer the investment risks of the AI boom onto retirees through infrastructure funds and computing-power futures. The difference is that in 2008, the rules were rewritten after the fact, in emergency mode. Now they’re being revised while the bubble is still inflating and everyone is smiling.

There's also a more technical objection. AI hardware ages faster than a standard infrastructure asset: a new generation of Nvidia chips arrives every 12 to 18 months, each one more capable than the last. A data center built on a five-year depreciation schedule is already running on outdated hardware by the end of year three, and its debt load remains. If the gap between projected and actual returns turns out to be large, the fund managers won't be the ones closing it.

Then there's the inflation argument. Analysts are already tracking rising electricity rates for households. Data centers consume so much power that several states have introduced usage restrictions. And the electricity bill lands in your mailbox.

6. Why This Affects You Personally

You might think: "I'm a journalist, a screenwriter, a writer — I'm not about to invest in data centers." Wrong.

Your pension fund will, automatically and without any separate decision on your part, put your money into assets of exactly this type, because funds are run by professionals who follow the biggest players.

Are you comfortable with your financial security being tied to how well the next generation of ChatGPT writes scripts for Netflix?

Also, and this is key, AI isn't asking for your money. Its owners are, and you’re not one of them. You're just paying their bills.

7. Conclusion: Prophecy or Threat?

Larry Fink isn't lying. The problem is real: AI is consuming energy and money with an appetite the world has never seen. If governments can't fund the infrastructure, someone else has to.

Fink's solution is textbook financial engineering, attractive packaging for a simple operation: offload the risks onto those who have no choice. That means the person whose retirement plan was enrolled by default, whose child received an investment account automatically at birth, and whose Social Security contributions will quietly be redirected tomorrow.

Either you become a shareholder in AI infrastructure through instruments that will be forced on you, or you'll be swept aside. Even in the first scenario, though, the winners are already obvious. The only question is how deep the fall will be when the next tech bubble bursts — and whose savings it will bury.

Fink is right about one thing: if you don't own a piece of the AI economy, you'll lose. What he neglects to mention is that you'll most likely lose either way. In his scenario, it'll just happen with a smile and under BlackRock management.

AI Data Labeling Exploitation: How Underpaid Workers in Kenya and the Philippines Undermine Model Safety

editorial@silentroom.ai (Joseph Smith) — Mon, 22 Jun 2026 12:25:38 GMT

It's hour four of a nine-hour shift in Nairobi. A young man we'll call James has read about two hundred passages of text today, most describing things you would not want described to you over dinner: torture, child abuse, the precise mechanics of a suicide. His job is to tag each one so a chatbot eight thousand miles away will learn to refuse to produce them. He's paid roughly two dollars an hour. There's a counselor he can talk to, in theory, if he can find the time between quotas.

James doesn't appear in any keynote or marketing deck. When the company that hired his employer's employer talks about its "safe" and "responsible" AI, it doesn't mention him. And yet without James — and a few million people like him — the model you used this morning would be a useless, dangerous mess.

We talk endlessly about GPUs, parameters, and architectures. We almost never talk about the people who hand-feed these systems the difference between pedestrian and mailbox, helpful and toxic. So let's talk about them — and about why squeezing them for pennies doesn't just hurt them. It quietly wrecks the product in your pocket. Think of a frontier AI model as a gleaming tower: everyone admires the penthouse. Almost nobody asks who poured the foundation — or what happens to the whole building when that foundation is mixed cheap.

What Is Data Labeling and Why AI Can't Exist Without It

Here's the uncomfortable secret under the hype: modern AI does not learn "straight from the internet." It learns from examples that humans have carefully marked up. So what is data annotation — and what is AI data labeling? Same question, same answer: the work of attaching meaning to raw data — pointing at a picture and saying this is a pedestrian, reading a sentence and saying this is hate speech, comparing two chatbot replies and saying this one's better. It's the floor everything else stands on.

This is also a real and enormous job market. Search "data annotation jobs" and you'll find tens of thousands of listings; AI training jobs are now a genuine global category, from full-time annotators to gig taskers clicking micro-assignments at midnight. The work breaks into a few layers, each harder and more consequential than the last.

Image Annotation: Teaching Models to See

The simplest layer is image annotation. A human looks at a photo or video frame and draws boxes: here's a car, here's a sign, here's a child, here's a trash can. AI systems built on image annotation (self-driving cars, medical-scan readers, satellite analysis, security cameras) need millions of these labels before they stop confusing a toddler with a fire hydrant.

It sounds trivial. It isn't. Trace the wrong outline ten thousand times and you've taught a two-ton vehicle a subtly wrong idea of what a person looks like. The boring work is the load-bearing work.

Content Moderation: The Human Filter Behind AI Safety

One floor up is AI content moderation, where the job stops being tedious and turns hazardous. Someone has to look at the worst material the internet produces — graphic violence, sexual abuse, the genuinely unspeakable — and label it off-limits. Only after a human has seen it can the model learn not to show it to you.

That's the grim bargain at the heart of AI safety: a person absorbs the horror first so the machine can deflect it later. The human filter is a chair, a screen, and a content moderator's mental health quietly eroding in real time.

RLHF (Reinforcement Learning from Human Feedback) Explained

The most delicate layer is RLHF. So what is RLHF? Reinforcement learning from human feedback is the process that turns a rambling text-predictor into something that feels like a helpful assistant. A person is shown two of the model's answers to the same prompt and picks the better one. Do that a few million times and the model learns to prefer clear over muddled, polite over rude, true over invented.

Picture the actual task. The prompt: "Explain why the sky is blue to a six-year-old." Answer A is accurate but reads like a physics textbook. Answer B is charming but fudges the science. Which is "better"? The honest answer is that it depends, and a tired reviewer racing a quota has maybe half a second to decide. That single ambiguous judgment, multiplied across millions of comparisons, sculpts the model's entire sense of a good answer. RLHF is the human-in-the-loop AI that everyone praises and nobody pays well.

Behind that half-second click usually sits a 40-plus-page guideline document defining exactly what "better" means. Reading it carefully takes longer than the task pays for. Guess which one wins.
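To make that half-second click concrete: pairwise choices like this are commonly turned into a training signal with a Bradley-Terry style loss. The sketch below is an illustration under stated assumptions, not any lab's actual pipeline. The function names and the two example scores are hypothetical. It shows how one reviewer decision becomes a number the reward model is pushed to reduce.

```python
import math

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the chosen answer beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the reviewer's choice under the reward model."""
    return -math.log(preference_probability(score_chosen, score_rejected))

# Hypothetical reward-model scores for the two "sky is blue" answers.
score_a = 1.2  # accurate but dry
score_b = 0.4  # charming but fudges the science

loss_if_a_chosen = pairwise_loss(score_a, score_b)
loss_if_b_chosen = pairwise_loss(score_b, score_a)

print(f"reviewer picks A: loss = {loss_if_a_chosen:.3f}")
print(f"reviewer picks B: loss = {loss_if_b_chosen:.3f}")
```

Whichever answer the reviewer clicks, training nudges the scores toward agreeing with that click. The loss has no way to tell a considered judgment from a guess.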

How Big Is the Data Labeling Industry? (Scale AI, Appen, Market Size)

This is no cottage industry. Depending on how you count, the data labeling market stood at around $18 billion in 2024 for solutions and services, compounding at a brutal clip as every company tries to "do AI."

The data labeling companies behind it are household names in the trade. Scale AI built a contractor network across Kenya, the Philippines, and Venezuela; Australia's Appen claims more than a million contractors speaking 235+ languages across 170 countries. Scale AI's 2024 revenue ran to roughly $870 million; a May 2024 round valued it near $13.8 billion, and in June 2025 Meta paid about $14.3 billion for a 49% stake, valuing it at $29 billion. Data work isn't a side market. It's the bedrock.

Why Cheap Data Labeling Means a Worse AI Product

Now the thesis, plainly: a model is only ever as good as the labels it learns from. And labels are only as good as the conditions of the human producing them.

The industry has a quiet quality metric called inter-annotator agreement — do several labelers independently reach the same verdict? Low pay destroys it, because exhausted, rushed people guess. One mislabeled toxicity category and the model waves through a whole class of harmful requests. Cutting costs on labeling isn't trimming maintenance. It's pouring cheap concrete into the foundation and hoping the penthouse doesn't notice. Companies cut where labor law is weakest. The first testing ground was Kenya.
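Inter-annotator agreement is often reported as Cohen's kappa, which corrects raw agreement for the agreement two labelers would reach by chance. The sketch below is a minimal, hand-rolled version for two annotators. The label lists are invented for illustration and do not come from any real dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels on the same eight passages.
reference = ["toxic", "ok", "ok", "toxic", "ok", "ok", "toxic", "ok"]
careful   = ["toxic", "ok", "ok", "toxic", "ok", "ok", "ok",    "ok"]
rushed    = ["ok",    "ok", "toxic", "ok", "ok", "toxic", "ok", "ok"]

kappa_careful = cohens_kappa(reference, careful)
kappa_rushed = cohens_kappa(reference, rushed)
print(f"careful labeler: kappa = {kappa_careful:.2f}")
print(f"rushed labeler:  kappa = {kappa_rushed:.2f}")
```

A kappa near 1 means the labelers agree far more than chance would predict. A value near or below 0 means the labels carry little shared signal, which is what guessing under time pressure tends to produce.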

Kenya: The Hidden Cost of AI Content Moderation

Sama, OpenAI, and the "Ethical AI" Promise

In 2019, Meta opened its first content-moderation hub in sub-Saharan Africa, in Nairobi. The contractor running it was Sama, formerly Samasource — headquartered in San Francisco but operating mostly in East Africa, and marketing itself as an "ethical AI" company lifting people out of poverty through dignified tech work. Its client roster reportedly touched a quarter of the Fortune 50.

When OpenAI signed on in 2021, Sama's Kenyan workers were handed the job nobody else wanted: labeling the worst of the internet so ChatGPT could learn to refuse it. The "ethical AI" of the press release and the production line in Nairobi were separated by a gulf.

How Much Are Kenyan AI Moderators Paid?

A Time investigation by Billy Perrigo did the arithmetic the contracts obscured. OpenAI paid Sama around $12.50 an hour per worker. The workers themselves took home between $1.32 and $2 an hour. The spread between those two numbers isn't a rounding error. It's the business model.

Economists have a name for that gap: labor arbitrage — routing work to wherever it's cheapest. The market shape has a name too: monopsony, where a handful of buyers set the price for a vast pool of sellers who have nowhere else to go. The workers read 150 to 250 grim passages per shift. The savings flowed north.
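The size of that spread is easy to work out from the reported figures. The sketch below uses only the hourly numbers quoted above. The variable names are illustrative, and the gap it computes is not a profit figure: the reporting cited here does not break down how much went to overhead, management, or margin.

```python
CLIENT_RATE_USD = 12.50            # billed per worker-hour, as reported
WORKER_PAY_USD = (1.32, 2.00)      # take-home range per hour, as reported

def worker_share(pay_per_hour, billed_per_hour):
    """Fraction of the billed hourly rate that reaches the worker."""
    return pay_per_hour / billed_per_hour

share_low, share_high = (worker_share(p, CLIENT_RATE_USD) for p in WORKER_PAY_USD)
gap_high = CLIENT_RATE_USD - WORKER_PAY_USD[0]   # largest hourly gap
gap_low = CLIENT_RATE_USD - WORKER_PAY_USD[1]    # smallest hourly gap

print(f"worker share of billed rate: {share_low:.1%} to {share_high:.1%}")
print(f"hourly gap: ${gap_low:.2f} to ${gap_high:.2f}")
```

On these figures, somewhere between a tenth and a sixth of each billed hour reached the person doing the reading.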

Psychological Trauma and the Daniel Motaung Lawsuit

You cannot stare into the internet's basement for eight hours a day at $2 an hour and walk away intact. The content moderation trauma here is documented and severe: PTSD, insomnia, broken relationships. Content moderator mental health was, by multiple accounts, an afterthought — wellness support that was thin, generic, and hard to reach.

Daniel Motaung, a former moderator, says he developed PTSD on the job and was fired after trying to organize his colleagues. His lawsuit against Meta and Sama became the opening case in a wave of litigation — a test of whether the harm done to these workers counts as a cost the companies must carry.

Blacklisting, Layoffs, and the Majorel Contract

In early 2023, Sama announced it was exiting content moderation to focus on computer vision; 260 moderators got layoff notices. The Meta contract moved to Luxembourg-based Majorel. Former Sama staff applied en masse. Not one, they say, was called for an interview.

A lawsuit filed in March 2023 by 184 moderators (the number later grew) alleged Majorel's recruiters were explicitly told not to hire anyone from Sama — that the workers had been blacklisted for trying to unionize. It's how the system treats organizing: not with a fight, but with a quiet door that never opens.

A Legal First: Can Meta Be Sued Where It Has No Office?

Then came the precedent that should keep corporate lawyers up at night. Meta tried to get Motaung's case tossed, arguing Kenya had no jurisdiction because the company isn't registered or trading there. In 2023 the Employment and Labour Relations Court disagreed, ruling Meta could be named as a defendant. In September 2024, Kenya's Court of Appeal upheld that, clearing 185 former moderators to take Meta to trial.

This is the first ruling of its kind anywhere in the world. It potentially cracks the whole parent-company → contractor → subcontractor shield that global outsourcing is built on — the layers that were supposed to make the company at the top untouchable.

Africa's First Content Moderators' Union

In May 2023, more than 150 workers labeling content for Facebook, TikTok, and ChatGPT voted to form Africa's first content moderators' union, backed by the Communications Workers Union of Kenya. For some of the lowest-paid jobs in global tech, it was the first time the people doing them had a collective voice instead of a non-renewed contract.

How Worker Burnout Degrades Labeling Quality

Here's the part the spreadsheets miss. When a person spends eight hours absorbing trauma for $2 an hour, by hour four they've stopped being a careful labeler. An AI trained by exhausted people for pennies is, by construction, worse than one trained by rested specialists. The cruelty and the quality problem are one and the same.

The Philippines: Inside the AI Microtask Economy

Scale AI, Remotasks, and the SEPI Subsidiary

If Kenya is a story about the human psyche, the Philippines is a story about arithmetic. In 2019, Scale AI set up a local subsidiary, Smart Ecosystem Philippines Inc. (SEPI), to run its Remotasks platform. Offices opened in Cagayan de Oro; thousands worked from home and from internet cafés. Estimates of how many Filipinos do this work range from 10,000 to two million. Nobody knows the real figure, including the government.

How the Microtask Payment Model Works

The model is brutally simple. You open the platform and see a queue: trace the car in this clip, transcribe this audio, pick the better of two chatbot answers. Each task is priced individually — fractions of a cent, sometimes a few cents, occasionally a dollar. No employment relationship: you're a contractor, the platform pays per task, and everything else — your hours, your health, your bad week — is your problem.

This is the gig economy in its purest form: gig economy workers reduced to a queue of piecework, classified as independent contractors precisely so no one owes them a minimum wage, a contract, or a sick day. It's a near-perfect engine for gig economy exploitation — the worker has no leverage and no idea who they're working for.

The Race to the Bottom: How Per-Click Rates Collapsed

In 2020–2021, a sharp tasker could clear $150–$200 a week — well above the local minimum of $6–$10 a day. People quit office jobs. Then Scale AI expanded the platform into India and Venezuela, and rates fell off a cliff. According to former workers, pay for identical tasks dropped from around $10 to less than a cent. Not by ten percent. Not by half. By a factor of a thousand. This is the race to the bottom, operating exactly as advertised: the more countries in the pool, the cheaper every worker becomes. Venezuela's collapse supplied a desperate labor force, and Filipino taskers found themselves in the same auction — and lost.

Fair Work Failures and the Washington Post Investigation

The receipts piled up. In 2022, the Oxford Internet Institute's Fairwork project scored Scale/Remotasks a 1 out of 10 on basic fairness. In August 2023, a Washington Post investigation by Rebecca Tan and Regine Cabato put numbers to the Scale AI controversy: payments delayed for months, slashed without explanation, sometimes withheld. One worker was paid 30 cents for four hours.

The corporate reply was boilerplate: delays are "extremely rare," systems "continuously improving." The Philippine government's reply was a shrug — a communications secretary calling data labeling an "informal sector" they didn't know how to regulate. Then in March 2024, Remotasks abruptly cut off Kenya, Nigeria, and Pakistan, the email arriving hours before the shutdown, no severance. It's far easier to leave than to be regulated.

Why Half-a-Cent Tasks Produce Low-Quality Training Data

Run the math the way a worker has to. At half a cent per label and a minute per task, you're earning thirty cents an hour. To clear even the local minimum you'd need to fire off four labels a minute. At that speed you don't read the 40-page guideline, you don't check the ambiguous cases — you click something plausible and move on. It isn't laziness; it's survival arithmetic. And that data flows straight into the training sets behind GPT, Claude, and Gemini. The intelligence of a trillion-dollar model rests partly on how much attention a person earning thirty cents an hour could afford to pay.
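The survival arithmetic can be written out in a few lines. The sketch below uses the half-cent rate from this section and the $6 to $10 daily minimum quoted earlier in the article. The eight-hour working day is an assumption added here, since the article does not state one, and the function names are illustrative.

```python
RATE_PER_LABEL_USD = 0.005        # half a cent per label, from the text
DAILY_MINIMUM_USD = (6.0, 10.0)   # local minimum range quoted in the article
HOURS_PER_DAY = 8                 # assumption: not stated in the article

def hourly_earnings(rate_per_label, labels_per_minute):
    """Gross hourly pay at a given per-label rate and labeling speed."""
    return rate_per_label * labels_per_minute * 60

def labels_per_minute_needed(daily_target, rate_per_label, hours_per_day):
    """Labeling speed required to reach a daily pay target."""
    return daily_target / (rate_per_label * hours_per_day * 60)

one_per_minute = hourly_earnings(RATE_PER_LABEL_USD, 1)
needed_low, needed_high = (
    labels_per_minute_needed(target, RATE_PER_LABEL_USD, HOURS_PER_DAY)
    for target in DAILY_MINIMUM_USD
)
seconds_per_label = 60 / needed_high

print(f"one label a minute pays ${one_per_minute:.2f} an hour")
print(f"labels per minute to reach the minimum: {needed_low:.1f} to {needed_high:.1f}")
print(f"time available per label at the top of that range: {seconds_per_label:.1f} s")
```

Under the eight-hour assumption, the "four labels a minute" figure corresponds to the upper end of the minimum-wage range and leaves roughly a quarter of a minute per label, guideline reading included.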

How Bad Data Labeling Affects the AI Models You Use

The Kenyan and Philippine stories look different — trauma versus tedium, $2 an hour versus half a cent a click — but the blueprint is identical. An American company hires a contractor, the contractor hires workers in a country with weak labor law, and two or three legal layers sit between the brand and the human. The structure isn't a bug. It's the product. Here's how that cheap foundation cracks the building you actually live in.

AI Safety: The Jailbreak Arms Race and Filter Gaps

When ChatGPT politely declines to explain how to build a weapon, that isn't magic. It's Kenyan moderators who labeled the bad stuff so the model could learn the boundary. Label that boundary thinly — by traumatized people, in a hurry, for $2 an hour — and it develops holes.

The ongoing AI jailbreak arms race, where users find ways around guardrails faster than companies can patch them, is partly a direct consequence of underfunded labeling. Every gap in the filter traces back to a gap in the foundation.

Hallucinations: When Rushed RLHF Produces Confident Lies

AI hallucinations — the model stating fiction with total confidence, inventing citations, mangling dates — often start at the RLHF stage. If a reviewer picks the "better" answer in half a second to hit quota, that choice is close to random — and millions of semi-random choices become the model's reward function.

Concretely: a rushed labeler rewards a fluent-but-wrong answer over a clunky-but-correct one. The reward model learns confidence reads as quality, and the finished model delivers polished nonsense with a straight face. One careless click doesn't stay one careless click. It compounds into a behavior.
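A toy simulation shows how that compounding works. Suppose every comparison pits a correct answer against a fluent-but-wrong one, and a reviewer picks the correct one with some probability. For a single pair of answers, the score gap that best fits the clicks under a Bradley-Terry model is the logit of the observed win rate. The three accuracy levels below are hypothetical, chosen only to illustrate the direction of the effect.

```python
import math
import random

def simulate_win_rate(p_correct_pick, n_comparisons, seed=0):
    """Share of comparisons in which the correct answer was chosen."""
    rng = random.Random(seed)
    wins = sum(rng.random() < p_correct_pick for _ in range(n_comparisons))
    return wins / n_comparisons

def learned_score_gap(win_rate):
    """Bradley-Terry score gap (correct minus fluent) implied by a win rate."""
    return math.log(win_rate / (1.0 - win_rate))

N = 100_000
# Hypothetical reviewer profiles: probability of picking the correct answer.
profiles = {"careful": 0.90, "rushed": 0.55, "fluency-biased": 0.40}

gaps = {}
for name, p in profiles.items():
    rate = simulate_win_rate(p, N)
    gaps[name] = learned_score_gap(rate)
    print(f"{name:>15}: win rate {rate:.3f}, learned gap {gaps[name]:+.2f}")
```

As reviewer accuracy falls toward a coin flip, the learned gap shrinks toward zero and the reward model stops distinguishing correct from merely fluent. If rushed reviewers lean toward fluency, the gap turns negative and the model is actively rewarded for polished nonsense.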

Algorithmic Bias and Underserved Languages

Algorithmic bias has a labor explanation too. Models work better in English because labeling English pays better and attracts more labelers, so the data is deeper and cleaner. Languages like Amharic, Malagasy, or Tagalog sit below the profitability line and get thinner, noisier labeling — or none. The bias isn't only in the algorithm; it's in the budget that decided whose language was worth annotating well.

The Alignment Problem as a Labor Problem

Here's the line the industry doesn't put on the slide. The AI alignment problem — the grand challenge of making models reliably do what we want — is usually framed as math and philosophy. But strip away the abstraction and a large chunk of it is a labor problem.

A company that cuts costs to the bone on the humans who define safe, helpful, and true cannot, by construction, produce a reliably safe, helpful, truthful model. The mismatch between values and behavior often begins not in the weights, but in the wage.

What's Changing in AI Labor Regulation

Court Precedents and the EU AI Act on Training Data Transparency

A few things are genuinely shifting. Kenya's courts ruled a US tech giant can be sued where it has no office — a precedent that quietly threatens the entire outsourcing shield. Africa's first content moderators' union exists now. Journalists at Time, the Washington Post, and the Guardian keep pulling the curtain back.

And regulation is inching in. The EU AI Act, in force since August 2024, now requires providers of general-purpose AI to publish a detailed summary of their training data, with a mandatory disclosure template rolling out from 2025. It targets copyright and provenance today, but transparency about what's in the data is one short step from transparency about who labeled it and how they were treated. Researchers are pushing a parallel idea — "datasheets for datasets" and supply-chain disclosure, the AI equivalent of a nutrition label.

What Isn't Changing: The Logic of the Race to the Bottom

Now the cold part. As long as cheaper jurisdictions exist, the work drifts toward them — exactly as Remotasks drifted from unionizing Kenya to crisis-stricken Venezuela. Outsourcing presupposes mobility: a structural slide toward wherever the rules are weakest and the desperation deepest. The trauma and the bad labels are externalities — costs shoved onto workers and users, kept off the company's books.

This is the lineage Mary Gray and Siddharth Suri named ghost work in their 2019 book — human labor hidden behind the curtain of "automation," running back to Amazon's Mechanical Turk and, further still, to an old colonial pattern: value extracted from the Global South for products consumed comfortably in the North. The marketing hides behind warm words — safe, ethical, aligned, responsible — but pull the chain to its end and you find a person labeling abuse for $2 an hour, or clicking microtasks for half a cent.

Will synthetic data and AI-assisted labeling fix this? Maybe partly. But letting the model help label its own training data mostly risks burying the human deeper and laundering the bias, not removing it. So next time someone tells you AI "thinks for itself," remember the floor below the floor. The question was never whether a human stands behind the answer. It's how little they were paid — and how much they had time to notice.

AI Data Centers in Space: The Orbital Fix for a Power Crisis Earth Can't Solve

editorial@silentroom.ai (Arsen Revazov) — Mon, 22 Jun 2026 12:23:44 GMT

In November 2025, a satellite the size of a mini-fridge trained a language model on the complete works of Shakespeare. Not in a lab. In orbit, 325 kilometers above your head, on the first NVIDIA H100 GPU ever to leave the planet.

The satellite was called Starcloud-1, and it wasn't a stunt. It was a proof of concept for one of the strangest ideas in tech: data centers in space — actual server farms, floating in vacuum, powered by sunlight that never sets.

Your first reaction is probably the correct one: this sounds insane. Rockets are expensive. Space is hostile. We have perfectly good dirt right here.

And yet Google is building AI data centers in space. NVIDIA is building chips for it. SpaceX has filed paperwork for up to a million satellites' worth of it. China already has twelve computing satellites flying. The answer to "why" isn't romance. It's a power bill.

Why AI Is Running Out of Power on Earth

The uncomfortable truth behind every chatbot you've talked to this week: the binding constraint on AI is no longer chips or money — it's electricity. Training clusters are ordered on two-to-three-year horizons; power plants take a decade. That mismatch is the whole story.

The industry's polite term is "capacity constraints." The honest term is a data center power crisis. Utilities in Virginia, Ireland, and Singapore are telling hyperscalers some version of no, you cannot have another gigawatt, we don't have one. Grid queues stretch for years; billion-dollar campuses sit waiting for substations. To understand why anyone would put servers on a rocket, you first need to see how deep this hole is.

How much energy does AI actually use?

So, how much energy does AI use? AI power consumption now rivals whole countries — and it's accelerating.

According to the International Energy Agency's "Energy and AI" report, global data center electricity consumption hit roughly 415 terawatt-hours in 2024 — about 1.5% of everything humanity generates — and the IEA's base case has that more than doubling to around 945 TWh by 2030. That's more than the entire annual electricity consumption of Japan. For data centers. Mostly because of AI.

The American numbers are starker. A Lawrence Berkeley National Laboratory report for the Department of Energy found US data centers consumed 4.4% of national electricity in 2023 — and projected between 6.7% and 12% of all US electricity by 2028 as AI electricity usage scales. Twelve percent. One in every eight watts, feeding AI data center power demand and the cloud.
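Those projections imply a growth rate worth spelling out. The sketch below takes the IEA figures quoted above (415 TWh in 2024, about 945 TWh in 2030) and derives the compound annual growth they imply. It also checks the "one in eight" shorthand against the 12% upper bound. The helper names are illustrative, and the smooth-growth assumption is a simplification made here, not something the reports claim.

```python
import math

def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

def doubling_time_years(annual_growth):
    """Years needed to double at a constant compound growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)

# IEA base case as quoted in the text.
growth = implied_annual_growth(415.0, 945.0, 2030 - 2024)
doubling = doubling_time_years(growth)

# Upper end of the US projection for 2028, as quoted in the text.
one_in_n = 1.0 / 0.12

print(f"implied global growth: {growth:.1%} per year")
print(f"doubling time at that rate: {doubling:.1f} years")
print(f"12% of US electricity is about one unit in every {one_in_n:.1f}")
```

A rate in the mid-teens per year, sustained for six years, is the kind of growth utilities normally plan for over decades.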

Former Google CEO Eric Schmidt told Congress that data centers may need an extra 29 gigawatts by 2027 and 67 more by 2030: "These things are industrial at a scale I have never seen in my life." This is a man who ran Google. His scale calibration is not the problem.

And efficiency won't save us. Chips get dramatically more efficient every generation, yet total AI energy consumption still climbs, because cheaper compute means we use absurdly more of it. Economists call this the Jevons paradox. Your utility calls it a headache.

The water problem: cooling at hyperscale

Electricity is only half the bill. The other half is wet.

Every hyperscale data center is, thermodynamically speaking, a giant kettle: chips turn electricity into heat, and the heat usually leaves as evaporated water. US data center water usage went from 21 billion liters in 2014 to 66 billion liters in 2023, the overwhelming majority at hyperscale facilities. A single large site can drink up to 5 million gallons per day, about as much as a town of 16,000 households. A UN University report projected AI's water footprint could hit 9.3 trillion liters by 2030 in a high-adoption scenario, roughly the basic annual domestic water needs of all of Sub-Saharan Africa.

Now picture pitching a new data center to a drought-stricken county. Data center cooling isn't an engineering line item anymore; it's a political problem. Communities are saying no. Loudly.

The 2030 capacity shortfall

Then there's the money, which is somehow the least crazy part.

McKinsey calculates that companies will need to invest about $5.2 trillion in AI data centers by 2030, based on roughly 156 gigawatts of AI-related capacity demand. JPMorgan independently landed above $5 trillion and noted the snag: gas turbine lead times have ballooned to three or four years; nuclear plants take a decade. Capital is available. Watts are not.

Even if every project on the books lands on schedule, McKinsey warns the US alone could be short more than 15 GW by 2030. In spreadsheet-speak: the demand curve and the supply curve no longer intersect on this planet — which is exactly where engineers start looking up.

What Are Orbital Data Centers?

Time for the definition, now that you've earned it. Orbital data centers (also called space-based data centers, or simply data centers in orbit) are satellites, or constellations of them, doing the work of a terrestrial server farm: storage, processing, AI inference, maybe one day training. Instead of a strained grid, solar panels in near-constant sunlight. Instead of evaporating a river, waste heat radiated into the void.

A space data center is not a space station with a server closet. The serious designs are racks of accelerators bolted to a spacecraft, woven together by lasers, with no human for hundreds of kilometers. Less Star Trek, more "warehouse moving at 7.8 km per second."

How orbital data centers work

The recipe, in plain language:

Pick the right orbit. The leading designs — Google's Project Suncatcher among them — use a dawn–dusk sun-synchronous orbit: a low-Earth orbit riding the line between day and night, keeping the satellite in almost perpetual sunshine. No night, no batteries the size of a school bus.
Generate power. Big, lightweight solar arrays feed the chips directly. No grid, no permits, no angry county commission.
Compute. Onboard GPUs or TPUs run workloads like a ground cloud, using either radiation-hardened chips built for space or commercial chips wrapped in shielding. Google tested its Trillium TPU in a 67 MeV proton beam and found no hard failures up to the maximum tested dose, a result it called "likely acceptable" for inference.
Talk via laser. Satellites connect through free-space optics — laser satellite communication at terabit-class speeds — and reach Earth by optical or radio downlink. Google has demonstrated 1.6 Tbps on a single transceiver pair in the lab.
Send back answers, not raw data. You uplink the question, the orbital cluster does the heavy lifting, and only the distilled result comes home.
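
The "warehouse moving at 7.8 km per second" image falls straight out of orbital mechanics. A minimal sketch, assuming a perfectly circular orbit and two example altitudes that are illustrative choices rather than figures from any of the designs above:

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # mean Earth radius, m


def circular_orbit(altitude_km: float) -> tuple[float, float]:
    """Return (speed in km/s, period in minutes) for a circular orbit."""
    r = R_EARTH + altitude_km * 1e3
    speed = math.sqrt(MU_EARTH / r)                       # v = sqrt(mu / r)
    period = 2.0 * math.pi * math.sqrt(r**3 / MU_EARTH)   # Kepler's third law
    return speed / 1e3, period / 60.0


# Assumed example altitudes for low-Earth orbit, not taken from the article.
for altitude in (400, 650):
    v, t = circular_orbit(altitude)
    print(f"{altitude} km: {v:.2f} km/s, one lap every {t:.0f} min")
```

Anything in low-Earth orbit comes out at a bit over 7.5 km/s and laps the planet roughly every hour and a half, which is why the hardware has to run without anyone on site.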

Why space solves the power and cooling problem

Two Earth problems simply don't exist up there.

Power: in the right orbit, a solar panel can be up to 8 times more productive than on Earth, per Google — no atmosphere, no clouds, no night. The Sun continuously emits over 100 trillion times humanity's entire electricity production. From orbit, you're standing next to the firehose.
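
Both claims can be sanity-checked on the back of an envelope. The sketch below is illustrative only: the world electricity figure, the sunlit fraction, and the ground capacity factors are assumed round numbers, not values from Google or from this article.

```python
SOLAR_LUMINOSITY_W = 3.828e26              # IAU nominal solar luminosity
WORLD_ELECTRICITY_TWH_PER_YEAR = 30_000.0  # assumed round figure, not from the article
HOURS_PER_YEAR = 8_766.0                   # average year, leap days included

SOLAR_CONSTANT_W_M2 = 1_361.0  # sunlight above the atmosphere
GROUND_PEAK_W_M2 = 1_000.0     # standard test irradiance used to rate ground panels


def sun_to_grid_ratio() -> float:
    """Sun's total output divided by humanity's average electrical power."""
    average_grid_w = WORLD_ELECTRICITY_TWH_PER_YEAR * 1e12 / HOURS_PER_YEAR
    return SOLAR_LUMINOSITY_W / average_grid_w


def orbit_vs_ground_yield(sunlit_fraction: float, ground_capacity_factor: float) -> float:
    """Average light reaching a panel in orbit relative to one on the ground."""
    orbit = SOLAR_CONSTANT_W_M2 * sunlit_fraction
    ground = GROUND_PEAK_W_M2 * ground_capacity_factor
    return orbit / ground


print(f"Sun vs. world electricity: {sun_to_grid_ratio():.2e}x")
# Assumed inputs: 99% sunlit orbit, ground capacity factors of 17% and 25%.
for cf in (0.17, 0.25):
    print(f"ground capacity factor {cf:.0%}: orbit yields {orbit_vs_ground_yield(0.99, cf):.1f}x")
```

The orbital advantage depends heavily on which ground site you compare against, which is why the claim is "up to" 8 times rather than a flat multiplier.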

Water: there isn't any, and you don't need it. Waste heat leaves by radiation, not evaporation — a headline advantage in a decade of record droughts, per the Thales Alenia Space ASCEND study.

The catch — and you knew there'd be a catch — is that radiating heat in a vacuum is much harder than it sounds. Hold that thought; we'll get to the radiators.

Who Is Building Data Centers in Space?

A few years ago this was a whitepaper genre. Now it's a competitive field with launched hardware, named partnerships, and real money — all chasing energy that doesn't queue, cooling that doesn't drink, and land that nobody can protest.

Google Project Suncatcher

Announced in November 2025, Google Project Suncatcher is the most academically serious entry. Google's research paper sketches clusters of 81 satellites flying in tight formation — within about a kilometer's radius — in dawn–dusk sun-synchronous orbit, carrying Google's TPUs and stitched together by optical links.

The near-term plan is refreshingly modest: a learning mission with Planet Labs to put two prototype TPU satellites up by early 2027 and see what breaks. CEO Sundar Pichai has said that within "a decade or so," orbital data centers will be seen as "a more normal way to build data centers."

Starcloud and NVIDIA

Starcloud is the startup that actually went first. The Redmond-based company launched Starcloud-1 in November 2025 — that 60-kg fridge with the first data-center-class H100 in space, roughly 100x more GPU compute than anything previously in orbit, per NVIDIA. The roadmap escalates fast: Starcloud-2 with a Blackwell GPU in late 2026, then a 200 kW node, then multi-megawatt — and a long-term pitch for a 5 GW orbital facility with a solar array measured in square kilometers. The company hit a ~$1.1 billion valuation in March 2026.

The Starcloud-NVIDIA relationship matters because NVIDIA stopped being a bystander. At GTC in March 2026, it announced the Space-1 NVIDIA Vera Rubin module, a radiation-tolerant building block for orbital AI with up to 25x the H100's inference performance, plus IGX Thor for orbital edge computing. Jensen Huang's framing: "space computing, the final frontier, has arrived." When the world's most valuable chipmaker designs a product line for space data centers, the idea has formally left the fever-dream phase.

The ASCEND Project and Thales Alenia Space

Europe, true to form, commissioned a study — but a good one. ASCEND (Advanced Space Cloud for European Net zero emission and Data sovereignty) is an EU Horizon Europe feasibility study led by Thales Alenia Space with a consortium including Airbus, ArianeGroup, and HPE. In 2024 it reported promising results: the ASCEND data center concept is technically feasible and could be economically viable, targeting 1 GW of European compute in orbit before 2050.

The fine print: the carbon math only works with a future launcher about ten times less emissive than today's rockets, and no hardware has flown. But as government-backed roadmaps go, it's the most concrete one outside the US and China — and "data sovereignty" in the name tells you Europe sees orbit as strategic territory, not a gimmick.

Axiom Space and the orbital data center (AxDCU-1)

While others publish papers, Axiom ships boxes. The Houston company flew its shoebox-sized AxDCU-1 prototype to the International Space Station in 2025, then launched its first two free-flying Orbital Data Center nodes in January 2026 aboard Kepler Communications' optical relay satellites, linked by 2.5 Gbps laser connections.

The Axiom Space data center pitch is unglamorous and probably right: process satellite imagery in orbit instead of downlinking petabytes of raw pixels, serve defense customers who like their compute unreachable, and scale "from kilowatts to megawatts." The tortoise strategy — small, useful, already flying.

Other Notable Players in the Field

China's Xingshidai constellation. In May 2025, China launched the first 12 satellites of its "Three-Body Computing Constellation" (also covered under the Xingshidai banner), led by startup ADA Space with Zhejiang Lab. Together the dozen satellites deliver about 5 peta-operations per second of compute and are connected by 100 Gbps laser interlinks; the full plan calls for 2,800 satellites, and Alibaba's Qwen model was reportedly running on orbit by early 2026. China's space data center program is not waiting for anyone's feasibility study.

SpaceX and Elon Musk. Musk has called space "a no-brainer for building solar-powered AI data centers," and SpaceX has filed with the FCC for up to one million data-center satellites — the "SpaceX million data centers" plan, in headline shorthand. But SpaceX's own pre-IPO filing speaks in a quieter voice: orbital AI compute involves "unproven technologies" and "may not achieve commercial viability." When the loudest evangelist's lawyers write that sentence, keep both halves in mind.

Meta and Overview Energy. Meta took a different road: keep the servers on the ground, get the power from space. In April 2026 it signed a first-of-its-kind deal with startup Overview Energy for up to 1 gigawatt of space-based solar power: geosynchronous satellites collecting sunlight and beaming it down as low-intensity near-infrared light onto existing solar farms, extending their generating hours into the night. Demo in 2028, commercial delivery around 2030. It's the first time beamed space solar power has been bought at nuclear-plant scale to feed AI, and arguably the most bankable bet here, because a solar power satellite doesn't care about latency.

Add NTT and SKY Perfect JSAT in Japan, Blue Origin, and ESA-commissioned studies with IBM, and the picture is clear: every major tech power is hedging the same bet.

The Alternatives: Underwater and Arctic Data Centers

Before we strap servers to rockets, fairness demands a look at Earth's two proven "free cooling" frontiers — the ocean and the Arctic. Data center alternatives don't have to clear the atmosphere to be useful.

Microsoft Natick and underwater servers

Microsoft Natick is the patron saint of weird data center experiments. In 2018, Project Natick sank a sealed cylinder with 855 servers off Orkney, Scotland: an underwater data center cooled by the sea itself. When the cylinder was retrieved two years later, only 6 of the 855 servers had failed, a failure rate roughly one-eighth that of the identical land-based control group. The secret: a nitrogen atmosphere and zero humans bumping into things.

And then Microsoft killed it. The fatal flaw wasn't reliability — it was access. A sealed undersea data center can't be upgraded without hauling it up, and AI hardware goes stale in 1–3 years. China's Highlander has since commercialized the concept off Hainan and Shanghai (a 24 MW underwater facility powered ~97% by offshore wind went live in 2025–2026). The idea has legs — just not hyperscale ones, so far. Remember Natick's lesson; it haunts the orbital plans too.

Peter Thiel's Panthalassa project

The most cinematic entry: in May 2026, Peter Thiel led a $140 million Series B into Panthalassa, an Oregon company building wave-powered, floating AI data centers — the marquee Peter Thiel data center bet, valuing the startup near $1 billion.

Panthalassa's nodes are 85-meter, mostly submerged steel structures that bob in the open Pacific, converting wave motion into electricity, cooling chips with seawater, and talking to shore via Starlink. Pilots deploy in 2026; commercial systems target 2027. Skeptics point at corrosion, storms, and satellite-link bandwidth, but as a dress rehearsal for off-grid autonomous compute, the ocean is a far gentler teacher than orbit.

Arctic and Nordic data centers

And then there's the option that already works and bores everyone: build where it's cold. Meta's Luleå facility in northern Sweden — opened 2013, ~100 km from the Arctic Circle — cools itself with outdoor air year-round and runs entirely on local hydropower, hitting a power usage effectiveness around 1.07. Nearly perfect.
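
Power usage effectiveness is simply total facility power divided by the power that reaches the IT equipment, so 1.0 is the theoretical floor. A small sketch of what 1.07 means in practice; the 100 MW IT load and the comparison PUE of 1.5 are assumed illustration values, not figures for any real site.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power over IT power."""
    return total_facility_kw / it_kw


def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Power spent on cooling, conversion losses, and everything that is not IT."""
    return it_kw * (pue_value - 1.0)


IT_LOAD_KW = 100_000.0  # hypothetical 100 MW of servers

# 1.07 is the figure quoted above; 1.5 is an assumed less efficient comparison.
for label, value in (("Nordic free cooling", 1.07), ("assumed comparison site", 1.5)):
    print(f"{label}: PUE {value:.2f} -> {overhead_kw(IT_LOAD_KW, value) / 1_000:.0f} MW of overhead")
```

At a PUE of 1.07, only about seven watts in every hundred delivered to the servers are spent on top for cooling and losses.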

The broader Nordic data center belt — Norway's fjord-side facilities, Iceland's geothermal sites, the giant new AI campuses near Narvik serving Microsoft and OpenAI — is the most mature alternative here. An Arctic data center needs no rockets, no submarines, no new physics. Its limits are mundane: only so many cold rivers, so much hydro, so much fiber. Which is exactly why the conversation keeps drifting upward anyway.

The Engineering Challenges

Now for the cold shower. Putting a hyperscale data center in orbit means solving three problems Earth solves for free. Space computing is hard precisely where ground computing is trivial.

Powering a data center with orbital solar

Orbital solar power is the good news. Sunlight in space is stronger (no atmosphere), more reliable (no weather), and, in a dawn–dusk sun-synchronous orbit, nearly constant. That is the up-to-8x edge Google cites, and the core of every space solar power pitch.

The bad news is mass. A feasibility study puts a 1 MW orbital cluster at around 5,600 m² of solar array plus thousands more of radiator — 34–59 kg of hardware per kilowatt of IT power. A modest 10 MW cluster is hundreds of tons in orbit. Every panel rides a rocket, and solar cells degrade under radiation, so you oversize from day one. Free energy, expensive real estate.
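
The mass figures scale linearly, so the arithmetic is easy to reproduce. A rough sketch using only the numbers quoted above, plus the solar constant of about 1,361 W/m² as an outside reference; the function names are hypothetical.

```python
KG_PER_KW_LOW, KG_PER_KW_HIGH = 34.0, 59.0  # hardware mass per kW of IT power, quoted above
ARRAY_M2_PER_MW = 5_600.0                   # solar array area per MW of IT power, quoted above
SOLAR_CONSTANT_W_M2 = 1_361.0               # sunlight above the atmosphere (reference value)


def cluster_mass_tonnes(it_power_mw: float) -> tuple[float, float]:
    """Low and high estimates of on-orbit hardware mass, in metric tons."""
    it_kw = it_power_mw * 1_000.0
    return it_kw * KG_PER_KW_LOW / 1_000.0, it_kw * KG_PER_KW_HIGH / 1_000.0


def it_watts_per_m2_of_array() -> float:
    """IT power delivered per square meter of solar array."""
    return 1e6 / ARRAY_M2_PER_MW


low, high = cluster_mass_tonnes(10.0)
density = it_watts_per_m2_of_array()

print(f"10 MW cluster: {low:.0f} to {high:.0f} t in orbit")
print(f"IT power per m2 of array: {density:.0f} W")
print(f"Share of incoming sunlight that ends up in chips: {density / SOLAR_CONSTANT_W_M2:.0%}")
```

The last line is the implied end-to-end budget: cell efficiency, conversion losses, and degradation margin all have to fit inside that share.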

Cooling in a vacuum

The dirty secret of "space is cold": vacuum is the best thermal insulator there is. Your thermos exploits this. No air for convection, no water for evaporation — heat leaves only by infrared radiation, governed by the Stefan–Boltzmann law.

The numbers are brutal. Rejecting 1 MW of waste heat requires roughly 1,200–1,600 square meters of radiator surface: a structure bigger than the compute it serves, shedding heat per square meter over a thousand times more slowly than water cooling rips it off an AI chip on Earth. Scaling ISS-style thermal hardware to megawatt class implies ~100 tons of radiators, potentially 10x the mass of the servers themselves. The entire ISS, for reference, rejects just 70 kW.

Run radiators hotter and they shrink, but then your chips cook closer to their limits. Thermal management, not power, may be the real binding constraint of the orbital data center — Google's own paper flags it as unsolved.
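
The Stefan–Boltzmann law makes the trade-off easy to see. This is an idealized sketch, not anyone's thermal design: it assumes a flat panel radiating from both faces to empty deep space, an emissivity of 0.9, and no sunlight or Earth infrared landing on the panel, all of which flatter the result.

```python
SIGMA = 5.670374419e-8  # Stefan–Boltzmann constant, W m^-2 K^-4


def radiator_area_m2(
    heat_w: float,
    temp_k: float,
    emissivity: float = 0.9,   # assumed coating emissivity
    sides: int = 2,            # assumed: both faces of the panel radiate
    sink_temp_k: float = 0.0,  # assumed: ideal deep-space background
) -> float:
    """Panel area needed to reject `heat_w` purely by thermal radiation."""
    flux_per_face = emissivity * SIGMA * (temp_k**4 - sink_temp_k**4)
    return heat_w / (flux_per_face * sides)


ONE_MEGAWATT = 1e6
for temp in (280.0, 300.0, 350.0):
    area = radiator_area_m2(ONE_MEGAWATT, temp)
    print(f"radiator at {temp:.0f} K: {area:,.0f} m2 per MW")
```

Because radiated power scales with the fourth power of temperature, a few tens of kelvin swing the area dramatically, which is exactly the tension between smaller radiators and cooler chips.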

Latency and laser connectivity

The third wall is physics' speed limit. Laser interlinks between satellites are spectacular — terabit-class, near-zero added latency over thousands of kilometers. The trip to the ground is not: realistic round-trip latency runs from tens to a couple hundred milliseconds, and cloud cover can flat-out block an optical downlink.

That draws a hard line through the AI workload map. Inference, batch processing, and Earth-observation analytics: fine. Frontier-model training — which demands microsecond-tight coupling between thousands of chips — stays on the ground. Inconveniently, training is the workload driving the power crisis. The orbital cloud, at least at first, will be an inference cloud.
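
To see why microsecond-tight coupling is off the table, it helps to compute the floor that light itself sets. A minimal sketch assuming a 650 km orbit (an illustrative altitude, not a figure from the article) and a straight-line path; real links add relay hops, routing, and queuing on top of this.

```python
import math

C_KM_PER_S = 299_792.458  # speed of light in vacuum
R_EARTH_KM = 6_371.0      # mean Earth radius


def slant_range_km(altitude_km: float, elevation_deg: float) -> float:
    """Distance from a ground station to a satellite seen at a given elevation angle."""
    el = math.radians(elevation_deg)
    r = R_EARTH_KM + altitude_km
    return math.sqrt(r**2 - (R_EARTH_KM * math.cos(el)) ** 2) - R_EARTH_KM * math.sin(el)


def round_trip_ms(distance_km: float) -> float:
    """Light travel time there and back, in milliseconds."""
    return 2.0 * distance_km / C_KM_PER_S * 1_000.0


ALTITUDE_KM = 650.0  # assumed example altitude
for elevation in (90.0, 30.0, 10.0):
    distance = slant_range_km(ALTITUDE_KM, elevation)
    print(f"elevation {elevation:.0f} deg: {distance:,.0f} km, round trip {round_trip_ms(distance):.1f} ms")
```

Even the best case, with the satellite directly overhead, costs milliseconds per round trip, thousands of times longer than the microsecond budgets inside a training cluster.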

The challenges above don't exist in isolation. Every infrastructure decision — space, ground, or Arctic — involves the same set of tradeoffs, just distributed differently. Some problems that orbit eliminates entirely, Earth has been quietly solving for decades. Others that Earth treats as line items, space turns into existential engineering questions. The table below maps all three across the dimensions that actually determine whether a data center gets built, funded, and kept running.

	Orbital	Ground	Arctic
Power Source	Unlimited Potential	Constrained	Limited but Clean
Water Usage	None	Critical Problem	Near Zero
Cooling Method	Unsolved at Scale	Solved & Expensive	Solved & Nearly Free
Build Cost	3–7x Ground Cost Today	High but Known	Cost-competitive
Latency	Physics Limit	Excellent	Acceptable
Workloads	Inference & Batch	Universal	Universal
Status 2026	Early Stage	Capacity Crisis	Operational

The Real Bottleneck: Launch Costs

Strip away the radiators and the lasers, and the whole argument collapses to one number: launch cost per kg.

Today, a reusable Falcon 9 puts mass in low-Earth orbit at roughly $1,500–2,900 per kilogram. A 1 MW orbital cluster weighs around 40,000 kg at realistic mass budgets, which works out to roughly $60 million to $116 million in launch costs alone, before you've paid for a single GPU, ground station, or replacement satellite. At current prices, independent analyses agree, orbital compute runs several times the cost of equivalent terrestrial capacity. The economics don't close. Period.
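
The launch bill is a single multiplication, shown here with the mass and price range quoted above. The function name is hypothetical and nothing else is assumed.

```python
CLUSTER_MASS_KG = 40_000               # 1 MW cluster, as quoted above
FALCON9_USD_PER_KG = (1_500, 2_900)    # reusable Falcon 9 price range, as quoted above


def launch_bill_usd(mass_kg: float, usd_per_kg: float) -> float:
    """Launch cost only: no hardware, ground segment, or replacements."""
    return mass_kg * usd_per_kg


low_bill = launch_bill_usd(CLUSTER_MASS_KG, FALCON9_USD_PER_KG[0])
high_bill = launch_bill_usd(CLUSTER_MASS_KG, FALCON9_USD_PER_KG[1])

print(f"Launch bill for 1 MW: ${low_bill / 1e6:.0f}M to ${high_bill / 1e6:.0f}M")
print(f"Per watt of IT power: ${low_bill / 1e6:.0f} to ${high_bill / 1e6:.0f}")
```

Since the cluster is 1 MW, the bill in millions of dollars is also the launch cost in dollars per watt of IT power, and the top of the range clears $100 per watt before any hardware is bought.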

Everything therefore hinges on Starship. SpaceX targets a Starship launch cost of $100–200/kg with full, rapid reuse (internal projections run toward $10–20/kg at airline-like cadence), while independent observers put the mature Starship figure at $100–500/kg. Nobody actually knows, because the required cadence hasn't been demonstrated yet.

What we do know is the threshold. Google's analysis finds that at around $200/kg, the cost of launching and operating a space-based data center becomes roughly comparable, per kilowatt-year, to the energy costs of an equivalent terrestrial one. Google considers that price plausible by the mid-2030s.

The clean way to think about it: cooling and power are hard engineering; launch economics are existential. Starship-class pricing is necessary for orbital AI at scale — and even then, not sufficient. A 100 MW constellation could still demand 50+ dedicated launches, plus spacecraft mass production and in-orbit assembly. The rocket equation forgives nothing.
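
A rough sketch of where a launch count like that comes from. The 34–59 kg per kilowatt budget and the $/kg scenarios are the figures quoted earlier in this article; the 100-ton payload per flight is an assumed Starship-class value, not a demonstrated one.

```python
import math

KG_PER_KW_RANGE = (34.0, 59.0)      # mass budget quoted earlier in the article
PAYLOAD_TONNES_PER_LAUNCH = 100.0   # assumed Starship-class payload to low-Earth orbit
USD_PER_KG_SCENARIOS = (100, 200, 500)  # price points quoted in this section


def constellation_mass_tonnes(power_mw: float, kg_per_kw: float) -> float:
    """Total on-orbit mass in metric tons for a given IT power."""
    return power_mw * 1_000 * kg_per_kw / 1_000


def launches_needed(mass_tonnes: float, payload_tonnes: float) -> int:
    """Whole launches required, ignoring packing losses and spares."""
    return math.ceil(mass_tonnes / payload_tonnes)


POWER_MW = 100
for kg_per_kw in KG_PER_KW_RANGE:
    mass = constellation_mass_tonnes(POWER_MW, kg_per_kw)
    flights = launches_needed(mass, PAYLOAD_TONNES_PER_LAUNCH)
    print(f"{kg_per_kw:.0f} kg/kW: {mass:,.0f} t, {flights} launches")
    for price in USD_PER_KG_SCENARIOS:
        print(f"    at ${price}/kg: ${mass * 1_000 * price / 1e6:,.0f}M in launch costs")
```

Under these assumptions the count lands in the dozens either way, and the heavier end of the mass budget is what pushes it past fifty.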

The Legal Gray Zone of Computing in Orbit

Suppose the engineering works. Whose laws apply to a server farm that belongs to no country and orbits all of them every 90 minutes?

The foundation is the 1967 Outer Space Treaty: space belongs to no nation, but the state where a spacecraft is registered "retains jurisdiction and control" over it. In practice, a data center in orbit is treated like a ship at sea — governed by the law of its flag state.

You can see where this is going. Ships taught us exactly what happens next: flags of convenience. Nothing stops an operator from registering orbital compute wherever the data, AI, and tax rules are friendliest — while serving users everywhere. Meanwhile, regulations like GDPR follow the data, not the hardware, so a US-flagged satellite processing European personal data is theoretically still on the hook. Multiple states can plausibly claim authority over the same compute job, and the treaties — written to assign blame for falling debris, not to referee AI governance — offer no tiebreaker.

Who audits an autonomous AI facility with no human aboard? When The Register asked Axiom which country's laws govern its on-orbit data processing, the company didn't respond. That silence is the current state of the law. Orbit is not a legal escape hatch — it's a legal traffic jam where the pile-up hasn't happened yet.

Will AI Data Centers in Space Actually Happen?

After all that — the terawatt-hours, the radiators, the rocket math, the lawyerless courtroom 400 km up — here's where I land.

Yes, they'll happen. No, not the way the keynotes promise.

The momentum is real. Within a single recent stretch, Starcloud flew an H100, Axiom launched free-flying nodes, China orbited a dozen Xingshidai satellites, NVIDIA announced space silicon, Google booked a 2027 demo with Planet, Meta bought a gigawatt of space solar, and SpaceX filed for a million-satellite constellation. Not vaporware behavior.

But the skeptics hold the better near-term hand. IEEE Spectrum's analysis pegged a 1 GW orbital system at roughly 3x the cost of its terrestrial twin, down from earlier 7–10x estimates but still a chasm. Radiative cooling stays stubbornly massive. Latency walls off training, today's dominant workload. Hardware that refreshes every 1–3 years clashes with satellites you can't upgrade: Natick's flaw, at orbital velocity. Astronomers warn about space junk: satellite streaks already contaminate a growing share of asteroid-hunting telescope images, and mass reentries deposit ozone-eating aluminum oxide in the upper atmosphere. Space debris from thousands of short-lived compute satellites is a bill someone eventually pays. And SpaceX's own pre-IPO filing concedes the whole category "may not achieve commercial viability."

So here's my honest scorecard for the future of data centers:

Late 2020s: demonstrators and niches — kilowatt-scale space AI computing, Earth-observation processing, defense workloads that pay extra for unreachable hardware.
Early-to-mid 2030s: if Starship hits a few hundred dollars per kilogram, tens to hundreds of megawatts in orbit for inference and batch work, the first space-based data centers that earn their keep.
Replacing terrestrial hyperscale: not this generation. Even bullish forecasts speak of orbital costs converging with ground costs by 2035 for certain workloads — not winning outright.

The deciding data point arrives soon, and it isn't a press release: it's Google's two TPU satellites in early 2027. If those chips survive, link up, and compute on budget, AI infrastructure plans for 2030 get a new line item everywhere. If they don't, the sector gains a convenient excuse to wait for cheaper rockets.

Either way, the power crisis driving all this isn't going anywhere — and that, not the romance of orbit, is why the idea refuses to die. Earth ran out of easy watts. Space has nothing but.

Keep an eye on early 2027. The sky is about to run a benchmark.