The Myth of RLVR

Why Verifiable Rewards Aren't Enough and What It Takes to Build Open-World Agents

The Myth of RLVR

RLVR is real. It works. But the claims being made about it do not survive contact with the actual technical constraints. The idea that RLVR unlocks scalable intelligence, that it forms the foundation for general-purpose agents, or that it solves the hard problems of learning isn't quite true. The central mistake is subtle but consequential:

The myth of RLVR is the belief that correctness is the bottleneck.

It isn't.

Verifiable rewards haven't changed the fundamental limitations of reinforcement learning, instead they've just made reward signals cheaper to produce. The real bottlenecks are information flow, credit assignment, and trajectory-level learning, i.e. How does information about the goal propagate back through a long sequence of actions? How does the agent determine which decisions actually mattered? How does it learn from entire trajectories rather than isolated steps? These are the hard problems of RL, they have always existed and don't go away. Verifiable rewards don't fix any of them.

Consider these examples in an enterprise setting:

  • Ask an agent to resolve a customer support ticket: it reads the history, pulls up the account, checks three internal tools, drafts a response, escalates to a specialist, waits, follows up, closes the ticket. Was the customer satisfied? Verifiable. Was the escalation necessary, or did the agent miss something obvious in step two? Invisible.
  • Ask an agent to source a vendor for a new software contract: it researches options, compares pricing, schedules calls, negotiates terms, flags risks to legal. A verifier can confirm a contract was signed. It cannot tell you whether the agent left money on the table, overlooked a better vendor, or introduced risks that will surface in six months.

The pattern is the same: long horizons, entangled decisions, and outcomes that obscure the reasoning that produced them. A verifier sees the destination; it cannot see the path and in reinforcement learning, the path is everything (remember Bellman equations?). This is why RL did not start with final outcomes, but with Bellman's recursive view of value: future consequences are pulled backward to earlier states so learning can happen before the episode ends, i.e. turn delayed outcomes into intermediate learning signals. By evaluating how good it is to be in this state, the agent can distinguish early mistakes from late ones, near-misses from dead ends, and recoverable situations from unrecoverable ones. That is the essence of credit assignment: attributing success or failure to the right decisions at the right time, rather than collapsing an entire trajectory into a single judgment at the end. Modern RLVR-style recipes, particularly critic-free approaches like GRPO remove this machinery. They replace state-level evaluation with outcome ranking and defer all learning to a terminal pass/fail signal. A sparse binary reward fifty steps later does not solve credit assignment. It is the worst case for the very machinery credit assignment was invented to solve. Worse, default RLVR recipes systematically underutilize the most informative data available: failures.

RLVR optimizes for worlds we can grade, not worlds we live in.

We'll discuss more on the importance of value functions and the critic later in this blog.


1. What RLVR Actually Is (Once You Remove the Marketing)

RLVR is not a philosophy or a training ideology. It is a property of an environment.

An RL problem qualifies as RLVR only if:
  1. There exists a single, unambiguous notion of correctness
  2. Correctness can be programmatically verified
  3. Verification is cheap enough to support iterative optimization

You are in RLVR territory if you can write:

reward = float(model_output == ground_truth)

If you cannot, because correctness is subjective, contextual, delayed, multivalued, or preference-dependent, then RLVR does not apply. You are back in RLHF, RLAIF, surrogate reward modeling, or heuristic supervision, regardless of what you call it.

This distinction matters because RLVR is often discussed as if it were a general paradigm. It is not. It is a narrow class of problems where verification collapses to computation. That narrowness explains both why RLVR can look impressive and why its promise does not generalize.


2. The Thin Slice Where RLVR Works

Jason Wei articulates a useful heuristic: the ease of training an AI system is proportional to how verifiable the task is. Call this Verifier's Law.

Once you see it, it's hard to unsee. Tasks that are objective, fast to check, and cheap to score are exactly the ones RL tears through. That's why benchmarks fall so quickly. They're not just hard problems, they're verifiable problems. Almost by construction.

RLVR shines in domains that look like this: the reasoning is self-contained, the rules are stable, and the outside world doesn't interfere. Math problems, formal logic, competitive programming. An AIME problem has one right answer. A verifier checks equality. A coding problem either passes the tests or it doesn't. In that narrow slice of the world, RLVR is genuinely useful. You get clean feedback at scale, and learning works.

The trouble starts when you step outside that slice. The tasks we actually want agents to handle are almost the opposite: they're the ones where verification is the hard part. Sometimes it's harder to check an answer than to produce one. Fact-checking an essay. Validating a scientific claim. Deciding whether a business judgment was sound. Brandolini's Law applies here: it takes far more effort to refute something than to generate it. And in many real tasks, verification isn't just expensive, it's not even well-defined. Did the agent buy a good gift? Did it negotiate the right vendor contract? Did it make the right call when the information was incomplete? There's no answer key for these. They're contextual, subjective, and often only clear in hindsight.

This is the real reason RLVR looks so strong on benchmarks. The benchmarks are chosen to fit inside the model. They avoid long horizons, unstable environments, ambiguous goals, and real world interaction. That works fine for puzzles, but it becomes a fatal limitation the moment you try to build agents for the world we actually live in.


3. Verifiable ≠ Learnable: The Credit Assignment Problem

A common confusion in RLVR discourse is the belief that correct rewards are therefore useful rewards. Reality check – "They are not".

A reward can be perfectly accurate and still be information-theoretically insufficient to train a policy. The core issue is credit assignment.

Policy-gradient and trajectory-sampling methods like PPO, GRPO, REINFORCE, and related variants attempt to assign responsibility for an outcome across a sequence of actions. When rewards are delayed, sparse, or global, the signal-to-noise ratio collapses.

This has nothing to do with whether the reward is true. A single scalar at the end of a long trajectory does not explain which decisions mattered, why they mattered, or how to change them. This is why RL variance explodes with horizon, why sample complexity scales poorly, and why perfect verifiers do not magically make RL efficient.

RLVR improves reward accuracy. It does not improve reward bandwidth.

To put it simply: accuracy tells you whether the signal is true, but bandwidth tells you how much the signal teaches you about what to do differently. A binary pass/fail at the end of a 50-step trajectory is perfectly accurate and nearly useless for learning because it carries almost no information about which of those 50 steps should change. Verification is not teaching.

This isn't just intuition, it's mathematics. The noise in assigning blame grows quadratically with the number of steps: a 50-step task isn't 50x harder to debug than a 1-step task, it's closer to 2,500x harder.

Example: The Gordon Ramsay problem

Picture Gordon Ramsay judging your 50-step wedding cake. You sifted the flour, creamed the butter, separated the eggs, folded the batter, calibrated the oven, timed the layers, tempered the chocolate, whipped the frosting. He takes one bite and screams "TOO DRY!"

You know he's right, and the verification is perfect. But which of your 50 steps caused it? Was it over-mixing at step 12? Wrong oven temperature at step 23? Not enough eggs at step 7? Each step is a suspect, and the verdict tells you nothing about the crime.

RLVR gives you an infallible Gordon Ramsay. It doesn't give you a cooking instructor who watches you over-mix and says "stop" before the damage is done.


4. Outcome-Level Rewards are Poor

RLVR was a genuine advance for the right class of problems. Math competition problems, code synthesis with unit tests, formal proofs, games with win conditions these share a common structure: relatively short horizons, dense intermediate feedback, and outcomes that correlate tightly with the quality of reasoning. When the gap between decision and outcome is narrow, a binary signal at the end still carries usable information. Verification works because credit assignment is tractable.

The problem is that we've taken a technique that works for constrained, verifiable domains and treated it as a general recipe for agent learning. It isn't. The limitations aren't incidental, they're structural.

In long horizon tasks, learning comes primarily from mistakes, not success. A successful trajectory shows one path that worked. A failed trajectory encodes many constraints about what doesn't. Dead ends prune search spaces. Instability reveals system limits. Counterexamples dominate program synthesis. Error correction matters more than repetition. Success teaches you one way forward. Failure teaches you the shape of the wall.

Outcome-level rewards erase that shape. A single terminal pass/fail contains almost no information about the decisions that produced it, the mutual information between a one-bit outcome and any intermediate action in a long trajectory is tiny. Success compresses fifty steps into one bit. Failure compresses many qualitatively different mistakes into the same zero. Near-misses and catastrophes look identical. A mistake at step 3 and a mistake at step 47 look identical. The internal structure of failure where things went wrong and why is lost. The reward can be perfectly correct and still be a terrible teacher not because it's noisy, but because it doesn't carry enough information to support credit assignment across the horizon.

The contrast with process supervision makes this concrete. Outcome level supervision yields one scalar per episode. Step level supervision yields orders of magnitude more signal and preserves causal structure. This is why process supervised reward models outperform outcome supervised ones on reasoning tasks; they address credit assignment directly by retaining information about failure.

RLVR is a cleaner signal, but a weaker one.


5. Why Policy-Gradient RL Is the Wrong Tool, Even with Perfect Rewards

To be precise, the critique here targets policy-gradient, trajectory-sampling reinforcement learning as commonly deployed today and not all of RL. These methods are already known to suffer from structural weaknesses: high-variance gradients, poor scaling with horizon and dimensionality, extreme sensitivity to reward shaping, instability across seeds, and heavy dependence on interaction data.

RLVR doesn't fix these problems. It exposes something more fundamental.

Actor–Critic Was the Actual Breakthrough

Value-free methods like REINFORCE have always existed. They are conceptually clean and mathematically elegant. They are also not what drove the last decade of progress in reinforcement learning. Every major practical breakthrough came from actor–critic methods, for one reason: value functions make long horizon learning tractable. The critic is not an optimization trick or a variance reduction hack. It is the agent's internal model of progress.

A learned value function evaluates intermediate states before outcomes are known. It tells the agent whether it is drifting toward recovery or collapse, whether a partial trajectory is promising or doomed. It allows the agent to anticipate failure rather than react to it after the fact. Without this internal notion of "how things are going," long horizon credit assignment becomes almost impossible. Remove the critic, and policy gradients collapse into blind sampling hoping that enough rollouts eventually stumble into signal.

Why Learning Policy and Value Together Actually Matters

Actor–critic works because policy and value co-evolve on real data.

Every real interaction updates both components simultaneously. The policy learns which actions to take. The value function learns how good the resulting states actually are. Crucially, both are trained on the same trajectories, under the same state distribution, in a tight feedback loop.

As the policy improves, it visits more informative regions of the state space. As the critic improves, it provides lower-variance, localized learning signals. Each stabilizes the other.

This coupling is not incidental, it is the core advantage of modern reinforcement learning. Break that loop, and RL loses its edge.

What Goes Wrong with GRPO-Style and RLVR-Style Training

Methods like GRPO or RLVR explicitly sever this coupling. They replace learned value functions with contrastive comparisons, preference scores, or externally verified outcomes. Learning is driven not by internal estimates of progress, but by relative judgments made after the fact.

The result is training that is:

The agent is navigating without a compass.

These methods work when the task is shallow enough, i.e. short trajectories, dense feedback, cheap mistakes. But that's not because they represent better reinforcement learning. It's because the problem has been simplified enough to tolerate the absence of a critic. LLMs amplify this effect. Their pretrained priors already encode vast amounts of task structure, effectively compressing what looks like a long horizon problem into a short horizon preference nudge. The policy arrives nearly competent; RLVR just steers at the surface. Contrastive rewards appear effective not because they solve credit assignment, but because pretraining already did the heavy lifting. We are not witnessing a breakthrough in reinforcement learning. We are witnessing language models good enough to hide the fact that it's still broken. Just to be clear, RLVR is not fake, but its scope of influence is routinely overstated.

Value-free methods scale only when pretraining compresses the effective horizon to near zero. RLVR's success isn't a breakthrough in reinforcement learning. It's parasitic on the breakthroughs already baked into the base model.

Is This Even Reinforcement Learning?

The deeper issue is taxonomic. Much of what gets labeled RLVR is not reinforcement learning in any meaningful sense. The typical pipeline: generate candidate outputs, score them with a verifier, filter to successes, fine-tune on the survivors. This is rejection sampling followed by supervised learning. There is no temporal credit assignment, no policy-gradient optimization over trajectories, no value function guiding exploration. The "RL" in RLVR is often vestigial.

Some pipelines do use PPO or GRPO. But when researchers ablate the components, the results are humbling. Dong et al. (2025) found that simple rejection sampling achieves accuracy within a few points of GRPO on math benchmarks. The gains come primarily from filtering, not from policy gradients.

The gradients are not teaching. The filtering is.

RLVR Amplifies; It Does Not Create

Does RLVR teach models new reasoning abilities, or does it surface capabilities already present in the base model? Recent work suggests the latter.

Yue et al. (2025) compared RLVR-trained models against their base models using pass@k evaluation. The findings are striking: RLVR models outperform at low k, but base models catch up as k increases. At pass@256, the base model solves problems the RLVR model cannot.

The interpretation: RLVR concentrates probability mass on high-reward outputs, improving sampling efficiency at the cost of exploration. The model becomes better at surfacing what it already knew. It does not learn new strategies. As the authors put it: RLVR-trained models generate reasoning paths already within the base model's output distribution.

RLVR sharpens the blade. It does not forge a new one.


6. Steelmanning the RLVR Argument

To steelman the RLVR position: outcome level verifiers dramatically reduce reward noise, eliminate reward model hacking, and enable scalable self-improvement in domains where ground truth exists. In narrow, symbolic domains, this is a real advance.

The problem is not that RLVR is ineffective. The problem is that its effectiveness is tightly coupled to properties i.e short horizons, stable environments, unambiguous correctness; same qualities that disappear in real world agent tasks.


7. The Hard Problem of Building Agent Environments

WebArena. OSWorld. WorkArena. BrowserGym. These benchmarks represent serious engineering efforts to evaluate autonomous agents, and they have catalyzed real progress. The field owes them a debt. But building good evaluation environments is genuinely hard for reasons that go beyond engineering effort.

Why Researchers Build Simulated Worlds

RLVR requires verifiable rewards. Real-world tasks rarely provide them. This creates a dilemma: you cannot train with RLVR on tasks that lack ground truth verification, but those are precisely the tasks you want agents to perform.

The field's response has been to construct environments where verification is possible. WebShop simulates e-commerce so purchase success can be checked programmatically. OSWorld virtualizes desktop operating systems so file and application states can be inspected. WorkArena rebuilds ServiceNow workflows in a sandboxed replica. BrowserGym wraps web interactions in controllable containers. MiniWoB++ reduces web tasks to simplified, deterministic puzzles.

This strategy has now extended to enterprise software. Companies are building simulated versions of Salesforce, SAP, Workday entire fake corporate backends where agents can be trained and evaluated without touching production systems. The appeal is obvious: you get reproducibility, scale, and safety. You can run thousands of trajectories without corrupting real databases or annoying real customers. This is a rational research strategy. If you need verifiable rewards to train, you build worlds that provide them. The benchmarks are not arbitrary; they are shaped by RLVR's requirements. But this creates a subtle irony: we build fake worlds because real worlds are too hard to verify, then celebrate progress on the fake worlds as evidence that agents are ready for real ones.

What Simulations Cannot Capture

Simulated environments are valuable precisely because they simplify reality. But that simplification has costs:

Non-stationarity: Real websites change layouts, update APIs, deprecate features, and redesign workflows. Amazon today is not Amazon six months ago. Simulated environments are frozen snapshots. An agent trained on a static replica has never experienced the world shifting beneath it.

This raises a deeper question: is the agent learning knowledge about a specific interface, or learning how to act in a class of environments?

Example: Salesforce opportunity management

Consider an agent trained to manage sales opportunities in Salesforce. In the training environment, moving a deal to the next pipeline stage means clicking a specific "Stage" dropdown, selecting "Proposal/Price Quote," and filling in three required fields: Expected Close Date, Amount, and Probability. The agent learns this sequence and executes it reliably.

But Salesforce implementations vary wildly across organizations. Company A uses the standard pipeline with seven stages; Company B has a custom twelve-stage process with different names and validation rules. Company A requires Probability as a manual input; Company B auto-calculates it based on stage and hides the field entirely. Company A runs Lightning Experience; Company B is still on Classic, where the same action requires navigating a completely different page layout. Company C has built a custom "Deal Desk Approval" workflow that triggers when Amount exceeds $50,000, a business rule that exists nowhere in the UI, only in backend automation.

The underlying action is the same: advance this opportunity to reflect that we've sent a proposal. But the surface representation is different in every deployment. If the agent's knowledge is entangled, meaning "move to Proposal stage" is bound to a specific DOM element, a specific dropdown value, a specific field layout, then every customer implementation is a distribution shift. The agent isn't learning what it means to advance a deal. It's memorizing click sequences.

The knowledge that matters (what actions achieve what business goals) should be invariant to these surface differences. An agent that truly understood opportunity management would recognize that "Proposal/Price Quote," "Quote Sent," and "Pricing Review" are semantically equivalent stages in different orgs, that the required fields are about capturing commitment and forecast accuracy, not about filling in boxes. But nothing in outcome level training encourages that disentanglement. The verifier checks whether the stage field changed. It cannot check whether the agent understood why.

Agents trained this way learn to navigate the simulator they trained on, not the space of possible Salesforce instances. And in enterprise software, no two instances are the same.

Irreversibility and consequences: In simulation, mistakes are free. You can reset the environment and try again. In reality, actions have stakes: money is spent, emails are sent, files are deleted, customers are annoyed. Agents trained without consequences never learn caution. They have no representation of "this action is hard to undo" because in their training world, everything was undoable.

Example: Vendor contract negotiation

Consider the contract example from earlier: an agent sourcing a vendor for a software contract researches options, compares pricing, negotiates terms, flags risks to legal. In simulation, you can verify whether a contract was signed. But suppose the agent locked in unfavorable payment terms, waived a liability clause that legal would have caught, or failed to notice the vendor's financial instability. These aren't binary failures. They're judgment calls that unfold over months. A verifier can check the signature. It cannot check whether the agent understood what it was signing away.

Human judgment resists rubrics. Experienced procurement officers develop intuitions about which vendors are bluffing, which contract terms matter in practice, when to push back and when to concede. This knowledge is illegible, living in heuristics, pattern recognition, and contextual awareness that no one has written down. You cannot build a reward function for "negotiated wisely" because no one can fully specify what that means. The simulation either ignores these dimensions or reduces them to crude proxies, and agents trained on proxies learn to optimize proxies.

Ambiguity and underspecification: Benchmarks must define success precisely to enable automated scoring. But real user requests are vague: "find me a good flight" does not specify whether good means cheap, fast, or convenient. "Handle this customer complaint" does not come with a rubric. "Draft a response to this RFP" does not tell you which tradeoffs the company is willing to make. Agents trained on precisely specified tasks never learn to ask clarifying questions, make judgment calls, or navigate genuine ambiguity, because in their training worlds, ambiguity was designed out.

Edge cases and the long tail: Benchmarks are constructed from templates and known task distributions. They cover the common cases well. But real deployment means encountering situations no one anticipated: a website in maintenance mode, a form field that behaves unexpectedly, a user request that doesn't fit any template, a customer who is angry about something outside the agent's scope. The 1% of cases that break agents in production are systematically absent from training, not because benchmark designers are careless, but because you cannot template what you haven't foreseen.

None of these gaps reflect poorly on benchmark designers. They reflect the inherent difficulty of building environments that are simultaneously verifiable and realistic. The two properties are in tension.

The Verification Bottleneck

Automated evaluation at scale requires programmatic success conditions. There is no way around this. If you want to run thousands of agent trajectories, you need a function that returns pass or fail. This is not a design flaw; it is a constraint imposed by the need for reproducibility and scale.

The consequence is selection bias. Benchmarks can only include tasks where success is programmable. Tasks involving subjective judgment, open ended goals, multistakeholder tradeoffs, or context dependent quality are systematically excluded not because benchmark designers ignore them, but because no one knows how to score them automatically.

The simulated Salesforce can check whether a lead was updated. It cannot check whether the update was right, whether the sales rep would have made the same call, whether the context justified the action, or whether downstream processes will break. Enterprise software is full of actions that are technically valid but contextually wrong. Verification catches syntax. It misses tribal knowledge and process semantics.


8. What Actually Scales: Hybrid Systems

What scales in practice is not "better RL," but systems that reduce how much RL they need. The most capable systems today are explicitly hybrid. They are built around mechanisms that inject structure, foresight, and reuse, rather than relying on trial-and-error to rediscover them:

RLVR is useful as a scalpel: for alignment, calibration, and local preference shaping. It is not a backbone for acquiring long horizon competence. There's a lot of software engineering left to be done with existing agentic architectures and LLMs that exist today to crack long horizon tasks.


Conclusion

RLVR works not because it solves the hardest problems in learning, but because it carefully sidesteps them. It shines in settings where correctness is easy to verify, environments are stable, and tasks are short and self-contained. In those cases, verifiable rewards are a real and practical advantage. But that is not where intelligence is forged. Real intelligence emerges in messy regimes where feedback is delayed or incomplete, where the world changes under you, where progress is incremental, and where failure carries structure and meaning. Those are precisely the settings where RLVR runs out of road. Policy-gradient reinforcement learning does not scale gracefully into that territory. And training that focuses only on final outcomes quietly throws away the most valuable learning signal we have: how things went wrong, how close we were to success, and which constraints mattered along the way. If we care about building systems that genuinely learn and not just systems that look good under clean evaluation. We have to stop optimizing away the hard parts and start learning from them.

What would falsify this argument?

It would be weakened by evidence that long horizon agents can reliably learn from terminal, verifiable rewards alone without heavy imitation, search, process supervision, or auxiliary signals and still acquire stable intermediate representations and recoverable failure boundaries across non-stationary environments. No such evidence exists at scale. That absence is telling: extracting rich learning signals from sparse outcome level rewards is fundamentally hard. RLVR belongs in the toolbox, not at the center of agent design.

The myth of RLVR is the belief that correctness is the bottleneck. It isn't. The real bottleneck is everything that happens when correctness cannot be verified and everything we fail to learn when failure is thrown away.


TL;DR: The 3 Minute Version

The core claim: RLVR (Reinforcement Learning with Verifiable Rewards) works, but not because it solves hard problems. It works because it sidesteps them. The belief that "correctness is the bottleneck" is the myth. The real bottlenecks are information flow, credit assignment, and learning from failure, and verifiable rewards don't address any of these.

What RLVR actually is: A narrow class of problems where (1) correctness is unambiguous, (2) verification is programmatic, and (3) checking is cheap. If you can write reward = float(output == ground_truth), you're in RLVR territory. Most real world tasks don't qualify.

Where it works: Math competitions, coding challenges, formal proofs. Short horizons, stable rules, self-contained reasoning. These are verifiable problems, which is why benchmarks fall quickly.

Why it fails elsewhere: A binary pass/fail at the end of a 50-step trajectory carries almost no information about which steps to change. The noise in credit assignment grows quadratically with horizon, making a 50-step task ~2,500x harder to learn from than a 1-step task. Verification is not teaching.

The pretraining dependency: RLVR's success with LLMs isn't a breakthrough in RL. It's parasitic on pretraining. The base model arrives nearly competent; RLVR just steers at the surface. Studies show RLVR-trained models don't solve problems the base model can't solve with enough samples. It sharpens the blade but doesn't forge a new one.

Is it even RL? Much of what's called RLVR is actually rejection sampling followed by supervised learning. Generate candidates, filter by verifier, fine-tune on successes. No temporal credit assignment, no value function. The "RL" is often vestigial.

The simulation trap: We build fake worlds (WebArena, OSWorld, simulated Salesforce) because real worlds are too hard to verify, then celebrate progress on fake worlds as evidence agents are ready for real ones. But the gap is structural, not incremental. Real environments are non-stationary: Amazon's checkout flow today isn't what it was six months ago, but your simulator is frozen in time. Real actions are irreversible: when your agent sends the wrong email, deletes the wrong file, or locks in bad contract terms, there's no reset button. Real tasks are ambiguous: "handle this customer complaint" doesn't come with a rubric, and the right answer depends on context no one wrote down. Real deployment surfaces the long tail: the 1% of cases that break your agent in production are precisely the ones that couldn't be templated in training. And here's the deeper problem: an agent trained to "advance a Salesforce opportunity" by clicking specific buttons in a specific layout has learned the simulator, not the task. Move it to a different company's Salesforce instance with different fields, different stages, different workflows, and it's lost. The knowledge that matters, what business goal this action achieves, was never disentangled from the surface mechanics of one particular configuration.

What actually scales: Hybrid systems that reduce how much RL they need. Predictive modeling over blind exploration. Search and planning over pure policy learning. Tool use. Memory and retrieval. Offline learning from real trajectories including failures.

The bottom line: RLVR belongs in the toolbox, not at the center of agent design. Real intelligence emerges in messy regimes where feedback is delayed, environments shift, and failure carries structure. Those are precisely the settings where RLVR runs out of road.