Neural Research

Context window size has become a headline capability metric. "We support 1 million tokens." "200,000 token context." The assumption: bigger context means better performance. The marketing describes how much a model can accept. It says nothing about how reliably the model uses what it is given.

Models perform worst on information buried in the middle of long contexts. They capture details at the beginning and end. The middle becomes a dead zone.

Most models claiming 200K tokens become unreliable at around 130K. Performance doesn't degrade smoothly; it drops suddenly past an invisible threshold where recall collapses. You can feed the model 50 documents and still fail to give it usable knowledge.
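One way to see the position effect for yourself is to plant a known fact (a "needle") at different relative depths of otherwise irrelevant text and ask the model to recall it. The sketch below only builds the probe contexts; the function names and the default depths are illustrative assumptions, and sending the probes to a model is left to whatever client you already use.

```python
def build_probe(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler_sentences))
    parts = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(parts)


def probe_set(filler_sentences, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one probe context per depth so recall can be compared by position."""
    return {d: build_probe(filler_sentences, needle, d) for d in depths}
```

Asking the same question against every probe in the set, and repeating at several total lengths, gives a rough picture of where the middle starts to go dark for a given model.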

Failure Modes in Production

Context poisoning

Agents can internalize incorrect information and reuse it repeatedly. In one documented case, an AI agent hallucinated a state, wrote it into its own context, and then acted on that false belief for extended periods.

In production, this looks like an agent retrieving an outdated API endpoint, failing, and then retrying that same endpoint again and again because every attempt reinforces it in context. Once poisoned, the context carries the error forward across steps and, where history is persisted, across sessions.
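A simple mitigation for the retry loop described above is to track which values have already failed and remove them from the context before the next step. The following is a minimal sketch, not a production design: `EndpointGuard`, its `max_failures` default, and the plain substring match are all assumptions made for illustration.

```python
class EndpointGuard:
    """Tracks endpoints that have failed so an agent loop stops reusing them."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = {}  # endpoint -> number of recorded failures

    def record_failure(self, endpoint):
        self.failures[endpoint] = self.failures.get(endpoint, 0) + 1

    def is_blocked(self, endpoint):
        return self.failures.get(endpoint, 0) >= self.max_failures

    def scrub(self, context_lines):
        """Drop context lines that mention a blocked endpoint (substring match)."""
        blocked = [e for e in self.failures if self.is_blocked(e)]
        return [
            line for line in context_lines
            if not any(e in line for e in blocked)
        ]
```

Substring matching is crude (it would also drop a line mentioning a longer path that merely starts with a blocked one), so a real system would likely want structured records of where each fact came from rather than raw text lines.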

Context contradiction

When different parts of the context conflict, models often resolve the contradiction incorrectly—but with high confidence. As context grows, models rely more on provided context and less on pretrained knowledge.

This creates a failure mode known as context over-reliance, where incorrect context overrides correct internal knowledge.

Conversation history mismanagement

After multiple turns, systems accumulate thousands of tokens of history. Early signals (user preferences, constraints) are treated the same as trivial later responses.

There is no weighting, no prioritization, and often no summarization. The system lacks a concept of importance over time.
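To make "importance over time" concrete, here is a small sketch of priority-aware history selection. It assumes each turn has already been labeled with a priority and a token count by some upstream step; the `Turn` fields, the 0/1/2 priority scale, and the greedy selection are illustrative choices, not a description of how any particular system works.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int      # position in the conversation
    text: str
    priority: int   # hypothetical label: 2 = constraint/preference, 1 = normal, 0 = trivial
    tokens: int     # token count supplied by the caller's tokenizer


def select_history(turns, budget):
    """Greedily keep the highest-priority turns that fit in `budget` tokens.

    Within the same priority, newer turns are considered first.
    """
    ranked = sorted(turns, key=lambda t: (-t.priority, -t.index))
    kept, used = [], 0
    for turn in ranked:
        if used + turn.tokens <= budget:
            kept.append(turn)
            used += turn.tokens
    # Restore chronological order before rebuilding the prompt.
    return sorted(kept, key=lambda t: t.index)
```

With this ordering, an early constraint such as a dietary preference survives trimming even when dozens of trivial acknowledgements come after it.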

Session amnesia

Context resets between sessions. Agents do not retain long-term memory unless external systems are built. This forces every interaction to start from zero, ignoring accumulated user knowledge.
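The "external system" in question can start out very small. The sketch below persists a few per-user facts to a JSON file and renders them as a preamble for the next session. The class name, file layout, and `key: value` rendering are assumptions for illustration; real deployments typically need a proper database, access control, and expiry.

```python
import json
from pathlib import Path


class SessionMemory:
    """Minimal per-user key-value memory persisted as a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing facts if the file is already there, else start empty.
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, user_id, key, value):
        self.facts.setdefault(user_id, {})[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def preamble(self, user_id):
        """Render stored facts as lines to prepend to a new session's context."""
        facts = self.facts.get(user_id, {})
        return "\n".join(f"{k}: {v}" for k, v in sorted(facts.items()))
```

A new session would construct `SessionMemory` with the same path and prepend `preamble(user_id)` to the context, so the interaction no longer starts from zero.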

The rapid growth of the AI memory infrastructure market reflects this gap—it exists because native context handling is insufficient.

The Specific Mechanism

The advertised context window is a theoretical maximum under controlled conditions. It is not the effective operational limit in real-world usage.

Benchmarks test retrieval: the answer exists somewhere in the context. Production tasks require reasoning across distributed, noisy, and sometimes contradictory information.

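The gap between advertised and reliable context is rarely measured, and it typically falls between 30% and 50% of the advertised window. A 200K model that becomes unreliable at around 130K has a gap of 35%.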
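For capacity planning, the 30–50% figure can be turned into a working token budget. The helper below reads the gap as a share of the advertised window, which is an interpretation rather than something a vendor publishes; the function name and the integer-percent parameters are hypothetical.

```python
def reliable_range(advertised_tokens, gap_low_pct=30, gap_high_pct=50):
    """Return (pessimistic, optimistic) reliable token budgets.

    Gaps are integer percentages of the advertised window, so the
    arithmetic stays exact.
    """
    pessimistic = advertised_tokens * (100 - gap_high_pct) // 100
    optimistic = advertised_tokens * (100 - gap_low_pct) // 100
    return pessimistic, optimistic
```

For a 200,000-token window this gives a budget between 100,000 and 140,000 tokens, which brackets the roughly 130K point where 200K models are said to become unreliable.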

The Industry Cost

Enterprises built RAG pipelines and agent systems assuming context windows behave as specified. They do not, and the cost shows up as:

Context-poisoned agents operating in production
Session resets degrading user experience
Contradiction-driven incorrect outputs in critical workflows

The emergence of a multi-billion dollar memory infrastructure layer is not a feature—it is a workaround.

What Needs to Exist

A context reliability benchmark.

Not maximum size—but effective performance:

At what depth does recall drop below 90%? 80%? 50%?
How does performance vary by position (start, middle, end)?
How does contradiction density impact accuracy?
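The first of these questions reduces to a small aggregation once probe results exist. The sketch below assumes each result is a `(depth_tokens, correct)` pair produced by some evaluation harness that is not shown here; the function names and default thresholds are illustrative.

```python
from collections import defaultdict


def recall_by_depth(results):
    """Aggregate (depth_tokens, correct) pairs into {depth_tokens: recall}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for depth, correct in results:
        totals[depth] += 1
        hits[depth] += int(correct)
    # Keys are inserted in ascending depth order.
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def first_drop_below(recall, thresholds=(0.9, 0.8, 0.5)):
    """For each threshold, the shallowest depth whose recall is below it (None if never)."""
    return {
        t: next((d for d, r in recall.items() if r < t), None)
        for t in thresholds
    }
```

The same grouping works for the other two questions if the key is changed from depth to a position bucket (start, middle, end) or to a measure of contradiction density.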

This benchmark should be standardized, versioned, and published per model.

Today, it does not exist.

Every team building on long-context systems is operating without visibility into their most critical assumption.

Context Windows Are Lying to You