Context Engineering and the Bitter Lesson of AI Agent Architecture

Gopal Khakda
Feb 26, 2026
In my last post, I argued that building AI agents is about decisions, not code. Ask the right questions before you build.
This post is the uncomfortable sequel: whatever you decide today will be wrong tomorrow.
Not because you made bad choices. Because models will improve, and the structure you needed yesterday becomes the bottleneck you fight tomorrow. The real lessons — context engineering, framework lock-in, architectural humility — only emerge after you've shipped and watched things break.
This is the Bitter Lesson. PostHog learned it after a year of production agents. Tavily learned it after rebuilding their research system from scratch. We've learned it too.
Here's what it means for how you build.
The Bitter Lesson of AI Agent Architecture
In 2019, AI researcher Rich Sutton published a short essay that's become required reading. His thesis was simple and uncomfortable:
Over 70 years of AI research, general methods that leverage computation always win. Hand-crafted knowledge and clever structures get outpaced by scale.
Chess is the canonical example. For decades, researchers encoded human expertise: opening theory, positional understanding, endgame patterns. Deep Blue used some of this, combined with brute-force search. But AlphaZero crushed it — learning entirely through self-play, with no human knowledge at all.
The pattern repeats everywhere. Speech recognition. Computer vision. Natural language. Every time researchers thought they'd found the right structure, raw computation and learning proved them wrong.
The lesson is bitter because it means our carefully crafted understanding — our hard-won insights about how things should work — often becomes a hindrance, not a help.
How Model Improvements Bulldoze AI Agent Frameworks
PostHog spent a year building their AI agent. Their biggest learning?
"Model improvements change more than what you think at first."
Twelve months ago, reasoning models were experimental. Today, reasoning is essential to their agent's capabilities. Tool calling has improved massively — frontier models now handle complex tools with far greater reliability.
The two changes that transformed their implementation:
Cost-effective reasoning with o4-mini — simplified complex query creation
Reliable tool use with Claude 4 family — the agent could finally use diverse tools without going off track
They couldn't have predicted which improvements would matter most. But the improvements kept coming, and each one made some piece of their architecture obsolete.
Tavily experienced the same thing. Seven months ago, they abandoned their first deep research architecture entirely. It was sophisticated and clever — they thought that was a good thing. But its assumptions became bottlenecks when the next generation of models arrived.
The message is clear: whatever you build today is a snapshot of current model capabilities. Plan for it to change.
Why Agents Beat Workflows in 2026
Here's a take that would have been wrong a year ago: agents beat workflows.
PostHog tried graph-style workflows for months. In the GPT-4o era, calling tools in a loop with the same system prompt was a recipe for confusion. They built elaborate orchestration graphs to compensate.
It didn't work. In a graph, the LLM can't self-correct. Context gets lost between nodes. The architecture that was supposed to add reliability actually reduced it.
Today, their architecture is "pleasingly straightforward": a single LLM loop that calls tools until the task is done. No complex orchestration. The model decides what to do next.
This only works because models got better at tool calling. A year ago, the structure was necessary. Today, it's a constraint.
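The shape of that loop is easy to sketch. Below is a minimal illustration with a stubbed `call_model` and a toy tool registry standing in for a real LLM client — not PostHog's actual code, just the pattern:

```python
# Minimal single-loop agent: call the model, run any requested tool,
# feed the result back, repeat until the model says it's done.
# `call_model` and TOOLS are illustrative stubs, not a real LLM client.

def call_model(messages):
    # Stand-in for a chat-completion call that returns either
    # a tool request or a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "example"}}
    return {"answer": "done"}

TOOLS = {
    "search": lambda query: f"results for {query!r}",
}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:  # the model decided the task is complete
            return reply["answer"]
        tool_result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": tool_result})
    return "step budget exhausted"
```

The only control flow is the loop and a step budget; everything else — which tool, when to stop — is the model's decision.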
Lance Martin at LangChain had the same experience with deep research. His parallel workflow — workers writing separate report sections — was fast and reliable for its time. But when tool calling improved and MCP emerged, his structure couldn't adapt. He had to rebuild.
The lesson isn't "never use workflows." It's that the right architecture depends on current model capabilities, and those capabilities keep shifting.
Single-Loop Agents Beat Multi-Agent Orchestration
Even in an agentic approach, it's tempting to organize tasks into specialized subagents. The Widget CEO delegates to the Widget Engineer, verified by the Widget Tester, with input from the Widget Product Manager.
The idea is smart. The results are dumb.
Context is everything for an LLM. Every layer of abstraction introduces context loss. The model's ability to string tools together and self-correct evaporates when you split the work across agents.
PostHog learned this directly. Claude Code proved it at scale. Incredible things come from a single LLM loop with simple tools.
Subagents still have their place — when you need to parallelize independent, self-contained tasks. But for sequential work that builds on itself, keep it in one loop.
The Simplest AI Agent Reliability Tool: To-Do Lists
If a single loop is the answer, how do you keep the agent on track over many iterations?
PostHog discovered something deceptively simple: the todo_write tool.
It does almost nothing. The agent writes its next steps, and that's it. But the effect is dramatic. Every time the agent uses it, it reinforces what it needs to do next. Instead of getting lost after a few tool calls, it keeps going — constantly correcting course.
This is one of those intuitive superpowers, like chain-of-thought prompting was. A simple mechanism that unlocks emergent behavior.
If you're building multi-step agents, add a to-do tool. Let the agent write its plan at each step. The self-reinforcement compounds.
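The mechanics are almost trivially simple. A sketch of what such a tool might look like — the names here are illustrative, not PostHog's actual implementation:

```python
# Sketch of a todo_write tool: the agent rewrites its plan each step,
# and the current plan is re-injected into context on every call.
# Illustrative only; not PostHog's actual implementation.

class TodoList:
    def __init__(self):
        self.items = []

    def todo_write(self, items):
        """Replace the plan with the agent's latest next steps."""
        self.items = list(items)
        return "ok"

    def render(self):
        """Render the plan so it can be appended to the next prompt."""
        return "\n".join(f"[ ] {item}" for item in self.items)

todos = TodoList()
todos.todo_write(["fetch usage data", "aggregate by week", "write summary"])
print(todos.render())
```

The tool stores nothing the agent couldn't have written in plain text; the value is the forced restatement of the plan on every iteration.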
Context Engineering: The Hidden Lever for AI Agents
Most agents propagate full tool responses through every iteration. You call a tool, get a response, and the whole thing goes into context for the next call.
This works for simple tasks. It breaks down for research.
Here's what happens with traditional context management:
Iteration 1: 3,000 tokens (query + prompt + tool response)
Iteration 2: 5,000 tokens (all of iteration 1 + new response)
Iteration 3: 7,000 tokens (all previous + new response)
Iteration 10: 21,000 tokens per call
Per-call context grows linearly, so the cumulative tokens billed across all calls grow quadratically. Costs spike. The model drowns in information and loses focus.
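The arithmetic is easy to check. Modeling the figures above as a fixed base (query + prompt) plus a fixed-size tool response per iteration:

```python
# Naive accumulation: 1,000-token base (query + prompt) plus
# 2,000 tokens of tool output appended per iteration.
base, per_response = 1_000, 2_000

per_call = [base + per_response * n for n in range(1, 11)]
print(per_call[0], per_call[-1])  # 3000 ... 21000, matching the figures above
print(sum(per_call))              # cumulative tokens billed over 10 calls
```

Ten iterations at these sizes bills 120,000 tokens in total, even though the final answer only ever needed a fraction of that context.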
Tavily solved this with context distillation. Instead of carrying forward raw tool outputs, they distill each response into a short reflection — the key findings in 2-3 sentences. Only the reflections propagate. Raw sources get fetched at the end, specifically for citations.
The result:
Iteration 1: 3,000 tokens → distill to ~100 token reflection
Iteration 2: 1,100 tokens (query + prompt + reflection + new response)
Iteration 3: 1,200 tokens
Final: Reflections + raw sources for citation
They reduced token consumption by 66% while achieving state-of-the-art on DeepResearch Bench. Quality went up. Costs went down.
The insight applies beyond research. Whatever context you carry forward, ask: does the next iteration need the raw data, or just the insight? Distill aggressively. Keep the window clean.
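A sketch of the distillation pattern — in a real agent, `distill` would be a cheap LLM call ("summarize the key findings in 2-3 sentences"); here it is a stub, and the function names are assumptions, not Tavily's API:

```python
# Context distillation sketch: keep only short reflections in the
# rolling context; stash raw outputs for citation at the end.
# `distill`, `research_step`, and `fetch` are illustrative names.

def distill(raw_response):
    # Stub for a cheap summarization LLM call.
    return raw_response[:80]

def research_step(query, reflections, raw_store, fetch):
    raw = fetch(query)
    raw_store.append(raw)             # kept out of context, used for citations
    reflections.append(distill(raw))  # only this propagates forward
    return reflections

reflections, raw_store = [], []
fake_fetch = lambda q: f"Long article about {q}... " * 50
for q in ["agent loops", "context windows"]:
    research_step(q, reflections, raw_store, fake_fetch)

print(reflections)
```

The rolling context only ever sees the reflections; the raw material survives in `raw_store` for the citation pass at the end.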
The Structured Output Problem (And Why It Proves the Bitter Lesson)
Structured output is another place the bitter lesson strikes.
When you ask an LLM for JSON, typed schemas, or Pydantic models, reliability varies wildly across providers and model versions. Some follow schemas strictly. Others drift. Long nested JSON? Older models stop midway or produce malformed output.
So you build workarounds. Retry logic. Fallback parsers. Multiple parsing libraries. Then newer models arrive, follow schemas better, and your workarounds become dead weight — adding complexity and latency for problems that no longer exist.
This is the pattern: you build for today's limitations, and tomorrow's capabilities make your workarounds obsolete.
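The defensive layer the passage describes tends to look like this — strict parsing first, then progressively looser fallbacks. A sketch, not any particular library's API:

```python
import json
import re

# Defensive JSON parsing for model output: strict parse first,
# then dig a JSON object out of surrounding prose as a fallback.
def parse_model_json(text):
    try:
        return json.loads(text)  # strict: the model followed the schema
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", text, re.DOTALL)  # fallback: JSON buried in prose
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller retries or degrades gracefully

print(parse_model_json('Sure! Here it is: {"name": "widget", "count": 3}'))
```

Every branch after the first `json.loads` exists to paper over a model limitation — and each is a candidate for deletion once schema adherence improves.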
AI Agent Frameworks Considered Harmful
PostHog started with the OpenAI SDK, then migrated to LangChain + LangGraph. Today, they say they absolutely wouldn't make that migration again.
"Every time you use a framework, you lock into its ecosystem."
The problem isn't that frameworks are bad. It's that LLMs evolve faster than frameworks can keep up. The ecosystems are fragile.
LLM calling abstractions crumble when providers add new features. OpenAI and Anthropic format web search results entirely differently. The frameworks try to maintain one facade, but it keeps breaking.
Worse: orchestrators like LangGraph lock you into a specific way of thinking. When that way becomes obsolete — when agents beat workflows — good luck refactoring everything away.
AI may settle on its own React someday — a dominant, stable abstraction everyone builds on. But for now, the framework wars rage on. Stay neutral and low-level. Use primitives you control. Make removal easy.
Self-Balancing Agents: Beyond Rigid AI Agent Orchestration
The best-performing open-source deep research agent right now is ThinkDepth.ai. It topped the DeepResearch Bench, beating Google's and OpenAI's implementations.
Its approach: instead of rigid orchestration, it uses high-level self-balancing rules.
Traditional agents follow prescribed steps: plan, then research, then write. ThinkDepth lets the model decide. It provides constraints — close the information gap, then close the generation gap — but doesn't dictate how. The agent creates or removes its own steps. It adjusts strategy mid-task.
The structure is minimal: rules about what to achieve, not instructions about how. A smarter model makes smarter decisions within the same loose framework. You don't rebuild when models improve — the system just performs better.
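One way to picture the difference: the structure reduces to a handful of goal-level rules injected into a single loop, rather than a hard-coded plan-research-write pipeline. The rules below are illustrative, not ThinkDepth's actual prompts:

```python
# Goal-level rules instead of a prescribed pipeline (illustrative,
# not ThinkDepth's actual prompt). The model decides the steps.
RULES = [
    "Close the information gap: keep researching until open questions are answered.",
    "Close the generation gap: keep revising until the report covers the findings.",
    "You may add, reorder, or drop your own steps at any time.",
]

def build_system_prompt(task):
    rules = "\n".join(f"- {r}" for r in RULES)
    return (
        f"Task: {task}\n"
        f"Operate under these rules, in whatever order you choose:\n{rules}"
    )

print(build_system_prompt("Survey recent work on agent architectures"))
```

Nothing in the prompt says *how* to close either gap — which is exactly why a smarter model performs better inside the same loose framework without a rebuild.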
AI Agent Evaluation: Why Evals Are Not Enough
Some say evals are everything. For foundation models, maybe. For agents, it's more nuanced.
Reality is gnarly. For many real-world, multi-step tasks, setting up a realistic test environment is harder than building the agent itself. You can cover expected happy paths, but users will create paths you never imagined.
PostHog found something more valuable than raw evals: Traces Hour. A weekly gathering focused 100% on analyzing LLM traces from production — real user interactions.
Evals make the most sense when they stem from such investigations. They're directional feedback, not optimization targets. The question isn't "did the score go up?" It's "did this change make the agent more reliable in practice?"
Tavily agrees: careful trace monitoring consistently provided higher-signal feedback than any single eval score. Reduced token usage, reliability, lower latency — these matter more than a one-point bump on a benchmark.
Transparency in AI Agents: Show Every Step
PostHog tried hiding details that seemed raw — chains of reasoning, failed tool calls, argument values. The constant feedback: "It's hard to trust when the process is a mystery."
They started streaming every tool call and reasoning token. Turns out humans are like LLMs: a black box is uninspiring no matter how good the results. Details give confidence. Show it all — the good, the bad, the ugly.
If you're building agents for users, transparency builds trust. Don't hide the work.
Five Lessons for Building Production AI Agents
Synthesizing everything:
1. Watch for the bulldozer.
Model improvements will obsolete your architecture. Design knowing this. What assumptions about model limitations are baked into your system? Name them. Revisit them.
2. Prefer single loops over complex orchestration.
Subagents and graphs lose context. A single LLM loop with simple tools self-corrects. Add a to-do tool for multi-step reliability.
3. Context engineering matters.
Don't propagate raw tool outputs forever. Distill into reflections. Deduplicate sources. Keep the window clean. Fetch raw data only when you need it for citations.
4. Avoid framework lock-in.
Use low-level primitives you control. Document why each piece of structure exists. Make removal easy.
5. Watch traces, not just evals.
Production behavior matters more than benchmarks. Review real user interactions weekly. Let investigations drive your evaluation design.
Build Like Every Architecture Has a Shelf Life
Because it does.
The agent architecture you design today is a snapshot of current model capabilities. Six months from now, the landscape shifts. Your structure becomes a constraint.
This isn't a reason not to build. You have to build something. Today's models aren't good enough to work without guidance.
But build with humility. Document your assumptions. Make removal easy. Revisit when the world changes.
PostHog rebuilt their architecture multiple times in a year. Tavily abandoned their first system after seven months. This isn't failure — it's the job.
The best agent builders aren't the ones who find the perfect structure. They're the ones who keep restructuring — adding what's needed, removing what's not, evolving with the models they depend on.
The Uncomfortable Truth
We want our work to last. We want to build systems that stand the test of time.
But in AI, nothing stands still. The structure that makes you successful today becomes the bottleneck that holds you back tomorrow.
This is bitter. It means your best work has an expiration date. It means you'll rebuild systems you thought were finished.
But it's also liberating. You don't have to get it perfect. You just have to get it right enough for now — and stay ready to change.
The paradigm is shifting. The models are evolving. The agents are improving.
The question is whether you're evolving with them.