Blog

Engineering logs and system observations.

Experiments, product decisions, and practical learnings from building AI-native systems.

Context Windows Are Lying to You
Blog · Feb 4, 2026

Context window size is marketed as a capability metric: "We support 1 million tokens." "200,000 token context." The assumption is that a bigger context means the model can use more information, which means better performance. The marketing is about window size. The actual behavior is not.

AI Training on Synthetic Data Is Collapsing the Diversity It Was Supposed to Expand
Blog · Feb 3, 2026

Recursive training on AI-generated outputs was promised as the solution to data scarcity. Instead, a mathematically proven 'model collapse' is erasing the long-tail diversity of human knowledge.

The 'Lost in the Middle' Problem Is Killing RAG Systems Nobody Admits Are Broken
Blog · Feb 1, 2026

RAG was promised as the cure for hallucinations, but a structural U-shaped attention bias is rendering most enterprise deployments unreliable. We examine why adding more context often makes RAG worse, not better.

"Reasoning" Models Don't Know When They're Wrong — And They Sound Most Confident When They Are
Blog · Oct 3, 2025

Chain-of-thought reasoning was supposed to fix hallucinations, but research shows it amplifies overconfidence. We explore the Confidence Inversion Hypothesis and why reasoning traces can be misleading.

LLM-as-Judge Is the New LMArena — And It Has the Same Problems
Blog · Jul 2, 2025

Using LLMs to evaluate other LLMs has become the dominant production paradigm. However, research shows these 'automated judges' suffer from the same systematic biases as the human raters they were meant to replace.
