Models Eat Their Scaffolding

Where the complaints are, the roadmap follows.

Sunil Mallya · April 2026

The Scaffold Is the Roadmap

In late 2022, ChatGPT became the fastest-growing consumer product up to that point. The demos were magical. Then people tried to ship it, and discovered that the gap between demo and production was enormous. Models hallucinated, drifted off task, and forgot what you asked for mid-response. So an entire scaffolding industry emerged: LangChain, prompt engineering as a six-figure job, guardrail companies raising venture rounds. The model was brilliant and unusable at the same time.

Then the models improved, and the scaffolding disappeared. Prompt ensembling → RLHF. Output validators → structured outputs. RAG → million-token windows. Chain-of-thought → native reasoning. Agent frameworks → platform harnesses, happening right now. Each was critical at the time, and each was absorbed within two years.

AI platforms evolve by absorbing the scaffolding around them. The workarounds builders create today become the primitives of tomorrow.

This is the operating logic of how AI platforms mature. When a model has a deficiency, builders don't wait for the fix. They patch around it, wrap it in scaffolding, bolt on whatever makes it work. The workaround ships, becomes load-bearing, and before long entire businesses are built on top of it. Then the next generation of the model, or the platform around it, absorbs the workaround natively. The scaffold becomes a primitive, and the company built on the gap either moves up the stack or becomes a rounding error.

Now look at what's being scaffolded today. Verification systems for domains that don't have test suites. Memory layers so agents can learn from their own runs. Trajectory capture so the next execution is better than the last. Multimodal loops so agents can see screens and read documents, not just process text. Infrastructure to make agents cheap enough to run always-on instead of on-demand.

Those are the complaints. They are also the roadmap.

But there is a corollary: not all scaffolding gets absorbed. Some of it solves problems that are permanent. Trust (how does one agent prove its identity to another?). Governance (how do you audit what an autonomous system did after the fact?). Permissions (who decides which agent gets access to what?). Accountability (who is responsible when an agent takes an irreversible action?). Compliance (how do you prove to a regulator that your agent followed the rules?). These aren't model deficiencies. They're structural problems that get harder as models get more capable. Knowing the difference between temporary scaffolding and permanent infrastructure is the entire game.

The Evidence: 2022 to Now

2022: The Model Was Brittle

The intro told the story of the production gap. Here's what the scaffolding actually looked like.

The reliability problem came first. Self-consistency sampling: run the same prompt five times, take majority vote. Prompt ensembling: rephrase the question three ways, average the results. Few-shot curation: hand-pick examples for every prompt because the model couldn't generalize from instructions alone. Model routers sent hard queries to GPT-4 and cheap ones to smaller models. LangChain chained prompts with retry logic and output parsers. Prompt marketplaces appeared. Companies raised venture capital around coaxing reliable behavior out of a model.
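Self-consistency sampling, as described above, can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: `self_consistent_answer` and `fake_model` are hypothetical names, and the model is a scripted stand-in that returns the right answer three times out of five.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistent_answer(sample: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Call the model n times on the same prompt and return the most common answer."""
    answers = [sample(prompt).strip().lower() for _ in range(n)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a noisy model: a fixed script of five replies.
_fake_outputs = cycle(["42", "41", "42", "forty-two", "42"])


def fake_model(prompt: str) -> str:
    return next(_fake_outputs)


print(self_consistent_answer(fake_model, "What is 6 * 7?"))  # prints 42
```

The cost is visible in the signature: five calls to get one answer, which is why this scaffold became overhead once single-pass inference was reliable enough.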

Then there was format trust. Models couldn't produce valid JSON. Every production system had output parsers, Pydantic schemas with retry loops, constrained decoding libraries. The pattern was: parse, fail, reprompt, retry, log.
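The parse, fail, reprompt, retry, log pattern looked roughly like the sketch below. It uses only the standard library; `call_with_retries` and `flaky_model` are hypothetical names, and the model is a scripted stand-in that needs three attempts to produce a valid object.

```python
import json
import logging
from typing import Callable

log = logging.getLogger("validator")


def call_with_retries(model: Callable[[str], str], prompt: str,
                      required_keys: set[str], max_attempts: int = 3) -> dict:
    """Parse the model's reply as JSON; on failure, log it and reprompt with the error."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = model(current_prompt)
        try:
            data = json.loads(raw)  # parse
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            log.warning("attempt %d failed: %s", attempt, err)  # log
            # reprompt: tell the model what went wrong, then retry
            current_prompt = (f"{prompt}\n\nYour last reply was invalid ({err}). "
                              "Return only valid JSON.")
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Hypothetical stand-in: chatty prose, then a missing key, then a valid reply.
_replies = iter(['Sure! Here is the JSON: {"name": "Ada"}',
                 '{"name": "Ada"}',
                 '{"name": "Ada", "age": 36}'])


def flaky_model(prompt: str) -> str:
    return next(_replies)


result = call_with_retries(flaky_model, "Return name and age as JSON.", {"name", "age"})
print(result)  # prints {'name': 'Ada', 'age': 36}
```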

And behavioral trust. You couldn't trust models to stay on topic, resist adversarial inputs, or avoid producing things you couldn't ship. Input classifiers, output filters, prompt injection detection, PII scrubbers, toxicity scorers. Entire companies launched around making models deployable.

All of it was scaffolding around one root cause: the model was brilliant and required a babysitter.

Absorbed.

RLHF baked instruction-following into the weights. Structured outputs forced schema compliance at the token generation level. The validator retry loop, the prompt ensembles, the few-shot curation, all became unnecessary overhead. LangChain survived by moving up the stack. The companies built on prompt fragility either pivoted or became footnotes.

2023: The Model Learns to Think

The reasoning loop.

Telling a model to "think step by step" dramatically improved performance. A prompting trick. A complete cheat that worked embarrassingly well. Builders extended it: Tree of Thoughts explored reasoning branches in parallel, Reflexion built agents that learned from failure, Process Reward Models scored intermediate steps.

Then came AutoGPT. The idea was correct: give a model a goal, let it decompose, act, observe, loop. The execution was held together with hope. AutoGPT hallucinated tool calls, spiraled into infinite loops, drifted from the goal by step four. People got excited, then disillusioned within weeks.

The failure was diagnostic. What broke wasn't the loop, because the loop was right. What broke was the substrate. No reliable environment, no ground truth, no error recovery. The loop needed something to grip.

The context window.

Early context windows were 4,000 tokens. You couldn't fit the document in. RAG was the workaround: retrieve chunks at query time. An entire industry emerged around it: vector databases, retrieval pipelines, chunking strategies, embedding models. That mattered enormously. But as context windows expanded and models got better at using long context, retrieval stopped looking like the product and started looking like plumbing. RAG still matters when the model needs to fetch external, live sources of data at runtime, or when provenance, permissions, and efficiency matter. What changed is that it is no longer, by itself, a durable differentiator.
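To make the retrieval workaround concrete, here is a toy sketch of the chunk, embed, retrieve, and prompt steps. Everything in it is an illustrative assumption: the "embedding" is a bag of word counts rather than a learned model, the "vector database" is a Python list, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Only the retrieved chunks go into the prompt, not the whole document."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = ["refunds are issued within 30 days",
        "shipping takes five business days",
        "our office is closed on holidays"]
print(retrieve("how long do refunds take", docs, k=1))
# prints ['refunds are issued within 30 days']
```

With a large enough context window, the three documents could simply be pasted into the prompt, which is the sense in which retrieval turned into plumbing.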

Fine-tuning was the other approach to the knowledge gap: train the model on your domain data so it already knows what you need. But the foundation model companies were on a data acquisition spree, training on everything they could find and paying data companies for the rest, so each new base model arrived knowing more than the last fine-tuned version did. Meanwhile, in-context learning proved shockingly competitive. Fine-tuning became less compelling as a moat when the pitch was simply "we trained on our corpus." Static dataset training is a snapshot, not a flywheel. The stronger moat moved to owning the reinforcement environment: the tasks, evaluators, reward signals, and user workflow that generate better trajectories over time. Fine-tuning survived, but in a narrower role: shaping behavior, reliability, and domain-specific performance rather than serving as the primary knowledge transfer mechanism.

Partially absorbed.

Reasoning scaffolds were absorbed most completely. o1 internalized the scratchpad, and DeepSeek R1 showed you could train reasoning into the model through reinforcement on outcomes. Retrieval followed a different path: longer context windows made RAG less existential and less differentiated, but not obsolete. Fine-tuning narrowed too: static dataset training became a weaker moat as advantage shifted toward owning the reinforcement environment. Explicit planning frameworks thinned as models learned to plan implicitly.

One thread did not resolve: memory.

Models have no persistence. Every conversation starts cold. No matter how large the context window gets, it empties between sessions. Builders tried everything: stuff it all into context, sliding window summarization, external retrieval memory, MemGPT treating the context window as main memory and external storage as disk, paging between them the way an operating system does. The sophistication of these solutions revealed how fundamental the problem was. Context is what you can hold in your head right now. Memory is what you carry between sessions. They are as different as RAM and disk. This one carries forward to 2026.
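The context-versus-memory distinction can be shown with a small sketch. It is not how MemGPT or any shipping product works; `SessionAgent` is a hypothetical class that keeps a bounded in-process context and appends every note to a file, so a second session starts with an empty context but can still look things up.

```python
import json
import tempfile
from collections import deque
from pathlib import Path


class SessionAgent:
    """Hypothetical agent with a bounded context and an append-only memory file."""

    def __init__(self, memory_file: Path, context_size: int = 3):
        self.memory_file = memory_file
        # Context: small, fast, and gone when this object is discarded.
        self.context: deque = deque(maxlen=context_size)

    def observe(self, note: str) -> None:
        self.context.append(note)  # the oldest note falls out when the deque is full
        with self.memory_file.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"note": note}) + "\n")  # memory: survives the session

    def recall(self, keyword: str) -> list[str]:
        """Naive keyword lookup over everything ever observed."""
        if not self.memory_file.exists():
            return []
        with self.memory_file.open(encoding="utf-8") as f:
            notes = [json.loads(line)["note"] for line in f if line.strip()]
        return [n for n in notes if keyword.lower() in n.lower()]


memory_path = Path(tempfile.mkdtemp()) / "memory.jsonl"

first = SessionAgent(memory_path)
for note in ["user prefers metric units", "deploy failed: missing env var",
             "retry succeeded", "user is in Berlin"]:
    first.observe(note)

second = SessionAgent(memory_path)  # a new session: context starts cold
print(list(second.context))         # prints []
print(second.recall("prefers"))     # prints ['user prefers metric units']
```

Keyword lookup is the weakest possible form of recall. The sketch only shows where the boundary sits, not how to make memory coherent.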

2024: Tools and the Action Layer

Models had knowledge. They had reasoning. What they didn't have was hands.

Builders gave them hands: function calling, browser automation, code interpreters, computer use. The action layer arrived. At first this lived mostly in wrappers around the model. Then the model companies pulled it inward: training models to recognize when tools were needed, emit structured tool calls, recover from tool errors, and expose tool-use patterns as native platform capabilities. An ecosystem of tool providers, connector frameworks, and middleware companies sprang up to bridge models to the outside world.

Complex tasks required sequencing: what's the order? What's contingent? What if a step fails? Builders constructed the planning layer explicitly. ReAct interleaved reasoning and action. Plan-and-Execute separated planning from execution. And then: how do you know the tool call worked? You build a critic. LLM-as-judge emerged: use one model to score another. Expensive, imprecise, the only available answer.
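The interleaving of reasoning and action that ReAct introduced reduces to a short loop. The sketch below is an assumption-laden illustration: `react_loop`, `scripted_policy`, and the `multiply` tool are hypothetical, and the policy is a scripted stand-in for a model that would normally read the transcript and choose the next step.

```python
from typing import Callable

# A policy reads the transcript and returns (thought, action, argument).
Policy = Callable[[list[str]], tuple[str, str, str]]


def react_loop(policy: Policy, tools: dict[str, Callable[[str], str]],
               question: str, max_steps: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        try:
            observation = tools[action](arg)
        except Exception as err:  # an unknown tool or a tool failure becomes an observation
            observation = f"error: {err!r}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("step budget exhausted")  # guards against infinite loops


def multiply(arg: str) -> str:
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)


def scripted_policy(transcript: list[str]) -> tuple[str, str, str]:
    """Hypothetical stand-in for a model: calls the tool once, then finishes."""
    last = transcript[-1]
    if last.startswith("Observation: "):
        return ("I have the result.", "finish", last.removeprefix("Observation: "))
    return ("I should compute this with a tool.", "multiply", "6,7")


print(react_loop(scripted_policy, {"multiply": multiply}, "What is 6 * 7?"))  # prints 42
```

Nothing in the loop checks whether the observation is correct, only that a tool returned something. That missing check is the job the critic was built to do.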

Partially absorbed.

Tools largely collapsed into the model layer. Model providers shipped standard tool use as a native capability: function calling, structured tool interfaces, built-in code execution. MCP emerged as the common protocol for exposing remote tools and shared data sources to models. At the same time, in local code workflows, the center of gravity often moved back toward direct CLI execution: simpler, faster, easier to inspect. The middleware market that existed to wire models to APIs shrank as the platform absorbed the plumbing. Explicit planning thinned as models learned to plan implicitly.

What remains is the verification gap: you can give a model hands, but you cannot yet trust what it does with them. Every irreversible action an agent takes is a trust debt that has not been repaid.

More precisely, it is the model company, not the model, that absorbs the scaffold.

2025: Agents Collapse into the Harness

The year of the agent. LangGraph, CrewAI, AutoGen, dozens more. These weren't trivial projects. They existed because the model alone couldn't manage state across steps, recover from errors, coordinate multiple calls, or maintain context through a long task. The frameworks provided the scaffolding that made agents usable in production. But the ones that delivered reliable results shared a common foundation: a code harness underneath.

Code has an oracle. Tests pass or they don't. Execution gives you ground truth. Write code, run it, feed the error back, loop. The harness solved what AutoGPT couldn't: it gave the loop a floor.
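The write, run, feed-the-error-back loop can be sketched directly. This is a simplified illustration with hypothetical names (`run_tests`, `harness_loop`, `fake_generate`): the generator is a scripted stand-in for a model, and the candidate code runs through a bare `exec`, where a real harness would use an isolated sandbox.

```python
import traceback
from typing import Callable, Optional


def run_tests(source: str, tests: str) -> Optional[str]:
    """Run candidate code and its tests; return None on success, else the error text."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted input; sandbox this in practice
        exec(tests, namespace)
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def harness_loop(generate: Callable[[str, Optional[str]], str],
                 task: str, tests: str, max_iters: int = 5) -> str:
    """Ask for code, run the tests, feed any failure back, and repeat until green."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        source = generate(task, feedback)
        feedback = run_tests(source, tests)  # the oracle: pass or fail
        if feedback is None:
            return source
    raise RuntimeError("tests still failing after max_iters attempts")


def fake_generate(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: buggy first draft, fixed once it sees an error."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(harness_loop(fake_generate, "write add(a, b)", TESTS))
```

The loop terminates because `run_tests` returns an unambiguous signal. Replace the test suite with a question like "is this memo correct?" and there is nothing comparable to return.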

Then the model companies shipped the harness themselves. Anthropic shipped Claude Code. OpenAI shipped Codex. Manus showed a general-purpose agent could work outside the code domain. The independent agent framework layer got absorbed into the platform, right on schedule.

The harness also escaped the terminal. Agents moved into native surfaces: IDEs, desktop apps, browser operators, and persistent workspaces. That mattered for the same reason the code harness mattered. Reliability came from being embedded in an environment with files, permissions, local state, and a concrete action surface, not from the model alone.

Multi-agent and swarm frameworks are next. Startups are building orchestration for swarms of specialized agents. But model companies are already shipping the primitives: sub-agents, background agents, parallel tool execution, agent-to-agent delegation. The ecosystem builds the scaffold, proves the demand, and the platform absorbs it.

Absorbing now.

The harness is load-bearing, not collapsing. But it has a structural ceiling. It requires a closed environment and an oracle. Code has pytest. "Is this strategy memo correct?" does not. Most knowledge work has no test runner. And the harness is single-threaded while real work is concurrent.

The real lesson: the agent is not the whole product. The environment is.

2026 and Beyond: What Comes Next

The pattern tells you what gets absorbed. The more interesting question is what's being scaffolded right now, because that's the roadmap for the next two years.

Agents can't verify open-ended work.

This is the single biggest bottleneck. Code agents work because tests exist. The entire agent loop depends on an unambiguous signal: did it work or didn't it? In law, finance, operations, research, or strategy, that signal is usually weaker, slower, contested, or missing entirely. Agents can already operate in those domains, but they still cannot verify their work with the same reliability they can in software. Until that changes, the limiting factor is not action. It is trust.

Agents don't learn from their own runs.

Every session starts cold. But the execution traces, the sequences of actions, tool calls, errors, and corrections, are training data. The companies capturing and learning from agent trajectories will have the Tesla Autopilot advantage: fleet learning. The model gets better because the agents ran, better models produce better traces, and the cycle compounds. The company that closes this loop first has a moat that widens with every agent execution.
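As a rough picture of what "capturing trajectories" means at the data level, the sketch below records each step of a run and serializes successful runs as JSON lines. The schema is an assumption for illustration only: `Step`, `Trajectory`, and `training_examples` are hypothetical names, not any platform's format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    action: str
    tool_input: str
    observation: str
    error: bool = False


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    succeeded: bool = False

    def record(self, action: str, tool_input: str,
               observation: str, error: bool = False) -> None:
        self.steps.append(Step(action, tool_input, observation, error))

    def to_json(self) -> str:
        return json.dumps(asdict(self))  # asdict recurses into the nested Step objects


def training_examples(trajectories: list[Trajectory]) -> list[str]:
    """Keep successful runs. Steps that errored and were corrected stay in: they are signal."""
    return [t.to_json() for t in trajectories if t.succeeded]


good = Trajectory(task="rename config key")
good.record("run_tests", "pytest -q", "1 failed", error=True)
good.record("edit_file", "config.py", "ok")
good.record("run_tests", "pytest -q", "all passed")
good.succeeded = True

bad = Trajectory(task="migrate database")
bad.record("run_migration", "v2", "timeout", error=True)

examples = training_examples([good, bad])
print(len(examples))  # prints 1
```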

Agents don't retain coherent memory.

This is not the RAG/context window problem; that's solved. This is genuine persistence: what did the agent learn last week, what does the user prefer, what failed before and why. Episodic memory, not retrieval. Human memory is integrated into cognition. It shapes how you reason, not just what facts you can access. What's coming is a data layer that persists identity and experience across agents, sessions, and systems. That layer is infrastructure in the same sense that a database is infrastructure. It does not disappear when models improve.

Agents are too expensive to run continuously.

Current agents are on-demand. You invoke them, they run, they stop. The next unlock is making agents cheap enough to run continuously in the background: monitoring, acting proactively, catching things before you ask. This is the difference between a tool you use and a colleague you work with. Inference cost has been dropping fast, but "cheap enough to run an agent 24/7 on every employee's behalf" is a different threshold entirely. When it's crossed, agent deployment goes from project-based to ambient.

Agents are blind outside of text.

Real work happens across modalities: reading a dashboard, watching a screen, scanning a PDF, listening to a meeting, interpreting a diagram. The interface between agent and world is still narrow. Models are becoming natively multimodal, but the agent harness around them hasn't caught up. The loop that lets an agent see a screen, decide what to do, act, and verify the result across vision, text, and code is the next scaffolding layer. Some of it will get absorbed into models. The orchestration across modalities won't.

What stays permanent.

Not everything above is scaffolding. Verification and trust require external auditability, for the same reason banks have external auditors regardless of how good their internal controls are. Agent identity and authentication are protocol problems, not capability problems. The governance layer only becomes more important as agents gain more capability. These are the layers that models cannot absorb. The enduring companies will be built on them.

From AI Scaffold To Model Primitive

Year	Scaffold	→	Absorbed by
2022	Prompt templates, few-shot libraries	→	RLHF, instruction tuning
2022	Output parsers, retry loops	→	JSON mode, structured outputs
2022	Safety wrappers, guardrail layers	→	Native safety, system prompts
2022	Self-consistency sampling	→	Reliable single-pass inference
2023	Chain-of-thought, reasoning scaffolds	→	o1, DeepSeek R1, native reasoning
2023	RAG pipelines, vector databases	→	Long context, retrieval as plumbing
2023	Autonomous agent loops	→	Needed a harness to work
2023	Fine-tuning for domain knowledge	→	Models absorbed domains + RL environments became the moat
2023	Persistent agent memory	→	Harness-native memory, still incomplete
2024	Function calling wrappers, browser automation	→	Native tool use, computer use
2024	Connector middleware, API shims	→	MCP for remote tools, CLI for local execution
2024	LLM-as-judge evaluation	→	Rubric-based evals, not final oracles
2024	Explicit planning scaffolds	→	Implicit planning in-weights
2025	Agent orchestration frameworks	→	Platform harnesses, execution environments
2025	Observability and tracing layers	→	Platform observability (early)
2025	Cross-session memory layers	→	Built into harnesses, still incomplete
2025	Approval checkpoints, human-in-the-loop	→	Hooks, permission systems
2025	Swarm coordination scaffolding	→	Sub-agents, background agents
2026	Verification oracles for non-code	→	External domain eval APIs?
2026	Trajectory capture / fleet learning	→	Platform-native training loops?
2026	Persistent agent memory	→	Platform memory layers?
2026	Multimodal agent loops	→	Native multimodal models + platform orchestration?
2026	Agent identity / auth protocols	→	Industry standard (OAuth for agents)?
2026	Observability and compliance	→	Independent infra: auditors don't get absorbed

Legend: Absorbed · Partially absorbed · Absorbing · Open