Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
Summary
Hugging Face introduces EcomRLVE-GYM, a framework providing eight verifiable environments for training reinforcement learning agents on complex e-commerce tasks. The tool features adaptive difficulty curricula and algorithmic rewards to improve task completion in shopping assistants, demonstrated by training a Qwen 3 8B model.
Source: https://huggingface.co/blog/ecom-rlve
- Why RL for shopping agents?
- From RLVE-Gym to EcomRLVE-GYM
- What a training episode looks like
- The eight environments
- Adaptive difficulty curriculum
- Deep dive: Cart Building (E_CART)
  - The problem
  - Why variants matter
  - Difficulty scaling
  - Scoring
  - Trajectories: easy vs. hard
- User simulation
- Environment scaling
- Early results
- Try it yourself
- Resources
- References

**TL;DR** — We extend the RLVE framework from single-turn reasoning puzzles to multi-turn, tool-augmented e-commerce conversations. EcomRLVE-GYM provides 8 verifiable environments — product discovery, substitution, cart building, returns, order tracking, policy QA, bundle planning, and multi-intent journeys — each with procedural problem generation, a 12-axis difficulty curriculum, and algorithmically verifiable rewards. We train a Qwen 3 8B model with DAPO over 300 steps and present early results demonstrating that environment scaling and adaptive difficulty transfer to agentic, real-world task completion.
This project originated in the PyTorch OpenEnv Hackathon and is still evolving; follow us for updates 🔥
**Why RL for shopping agents?**
Large language models can hold fluent conversations, yet deploying them as shopping assistants reveals a persistent gap: fluency ≠ task completion. A customer who asks *"find me a USB-C charger under $25 that ships in two days"* needs an agent that invokes the right catalog search, filters on three hard constraints, avoids hallucinating product IDs it never retrieved, and handles follow-ups when the top result goes out of stock.
Supervised fine-tuning can teach surface-level tool use from demonstrations, but it cannot scale to the combinatorial space of constraint configurations, partial-information dialogues, and multi-step transactional workflows that real e-commerce demands.
Reinforcement learning with verifiable rewards (RLVR) offers an alternative: the agent optimises for outcomes — did the products satisfy the constraints? Was the cart correct? Was the return initiated for the right order line? The challenge is constructing reward functions that are both verifiable (no LLM-as-a-judge subjectivity) and adaptive (difficulty that grows with the policy's capability).
**From RLVE-Gym to EcomRLVE-GYM**
RLVE-Gym provides 400 environments for sorting, multiplication, Sudoku, and other algorithmic-reasoning tasks; however, those are all single-turn, text-in / text-out puzzles — extending to agentic domains was left as future work.
EcomRLVE-GYM fills that gap: we stay in the verifiable regime (e-commerce outcomes can be checked algorithmically) while extending to multi-turn, tool-augmented, agentic conversations — environments where the agent must act (call tools, modify world state) rather than merely reason (produce a text answer), and where it must compensate for deficiencies of the search system.
EcomRLVE-GYM makes customer-service outcomes structurally verifiable: whether the right products were found, the cart was assembled correctly, the right order line was returned, and no recommended product ID was hallucinated.
Every signal above can be evaluated by a program with access to the hidden ground-truth goal. No human annotation or LLM-as-a-judge is needed.
**What a training episode looks like**
Before we explain the framework, here is what a single EcomRLVE episode looks like at difficulty d = 4. The environment generates a hidden goal, a simulated user opens the chat, and the agent must use tools to satisfy the request. Every action is verified algorithmically — no LLM judge required.
The reward is fully computed by code: F1 over (product, variant, qty) tuples, an efficiency bonus for finishing in fewer turns, and a hallucination check that every recommended product ID was actually retrieved. If the agent had picked the Lightning variant instead of USB-C, the simulated user would have corrected it mid-dialogue — and the F1 would have dropped.
**The eight environments**
Each environment covers a distinct real-world shopping scenario. The agent must complete the task using tools (catalog search, cart operations, order lookups, policy queries) and is scored by a program — not a human or another LLM.
| Environment | What the agent must do |
|---|---|
| Product Discovery | Find products that satisfy all the user's constraints |
| Substitution | An item is out of stock — find a similar, compatible alternative |
| Cart Building | Add the exact products, variants, and quantities the user asked for |
| Return + Replacement | Identify the right order line, initiate a return, suggest a replacement |
| Order Tracking | Resolve which order the user means and report its current status |
| Policy QA | Answer a deterministic question about store policy (return window, shipping rules, etc.) |
| Bundle Planning | Recommend a complete shopping list for a project within a budget |
| Multi-Intent Journey | Handle a conversation that chains 2–5 of the above tasks in sequence |

Every environment uses the same three-part reward signal:
- **Task reward** — did the agent actually complete the goal? (e.g., were the right products recommended, was the cart correct, was the right order tracked?)
- **Efficiency reward** — did the agent complete it without wasting turns? Turns the user caused (asking a follow-up, confirming an action) don't count against the agent — only turns caused by agent mistakes do.
- **Hallucination penalty** — did the agent only recommend products it actually retrieved during the session? Recommending product IDs that were never looked up is penalised, so the agent cannot invent results from memory.
Invalid outputs (malformed JSON, illegal tool calls) trigger an immediate failure score, creating a strong incentive for well-formed responses from step one.
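Concretely, the combination might look like the following sketch. This is an illustrative Python approximation, not EcomRLVE-GYM's actual code: the weights, the turn budget, and the function signature are our assumptions.

```python
MAX_AGENT_TURNS = 12  # assumed turn budget, not the repo's actual value

def episode_reward(r_task, agent_error_turns, retrieved_ids, recommended_ids,
                   valid_output=True):
    """Hedged sketch of the three-part reward; weights are assumptions."""
    if not valid_output:           # malformed JSON / illegal tool call
        return -1.0                # immediate failure score
    # Efficiency: only turns caused by agent mistakes count against the agent
    r_eff = 1.0 - 2.0 * agent_error_turns / MAX_AGENT_TURNS
    # Hallucination: every recommended product ID must have been retrieved
    hallucinated = set(recommended_ids) - set(retrieved_ids)
    r_hall = -0.5 if hallucinated else 0.0
    return r_task + 0.2 * r_eff + r_hall
```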
**Adaptive difficulty curriculum**
A single difficulty number d controls 12 independent aspects of a task simultaneously. This is important because e-commerce conversations are hard in many different ways at once — not just along one dimension.
Here are four representative difficulty axes:
| What changes | Easy (d = 0) | Medium (d = 6) | Hard (d = 12) |
|---|---|---|---|
| How many constraints the user has | 2 | 5 | 8 |
| How often the user omits a constraint | 5% | 70% | ~80% |
| Fraction of search results that are distractors | 0% | 12% | 24% |
| Items that go out of stock mid-conversation | 0% | 30% | 50% |
The other eight axes cover turn budget, input noise (typos, slang), context switches, retrieval depth, order-history size, policy complexity, and tool budget. The full breakdown is in the technical report.
**Adaptive scheduling.** Each environment tracks the agent's success rate independently and only advances to harder problems once the agent is passing the current level reliably. This keeps every environment training at the agent's capability frontier — avoiding both "too easy to learn from" and "too hard to make progress on".
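A minimal sketch of what such a scheduler could look like, assuming a rolling success window with promotion and demotion thresholds (the threshold values and window size are our assumptions, not the paper's):

```python
class AdaptiveCurriculum:
    """Per-environment difficulty scheduler (illustrative sketch)."""

    def __init__(self, promote_at=0.8, demote_at=0.3, window=50, max_d=12):
        self.difficulty = 0
        self.max_d = max_d
        self.promote_at, self.demote_at, self.window = promote_at, demote_at, window
        self.results = []  # rolling pass/fail history at the current level

    def record(self, success: bool) -> int:
        self.results.append(success)
        if len(self.results) >= self.window:
            rate = sum(self.results) / len(self.results)
            if rate >= self.promote_at and self.difficulty < self.max_d:
                self.difficulty += 1   # passing reliably: serve harder problems
                self.results.clear()
            elif rate <= self.demote_at and self.difficulty > 0:
                self.difficulty -= 1   # starving: back off to regain signal
                self.results.clear()
        return self.difficulty
```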
**Deep dive: Cart Building (E_CART)**
Cart building is a good showcase because it requires the full search → inspect → clarify → act loop, has a binary ground truth, and introduces a challenge absent from most recommendation benchmarks: variant selection.
To succeed, the agent must develop five distinct skills:
| Skill | What it means in practice |
|---|---|
| Product Discovery | Search the catalog with well-formed queries to find the right items |
| Variant Selection | Identify the correct color, size, or connector type — not just the right product |
| Cart Management | Add items with the exact variant and quantity the user asked for |
| Clarification Dialogue | Ask the user a focused follow-up when a request is ambiguous (e.g., missing size) |
| Multi-Item Orders | Handle shopping lists with several different products in a single conversation |

The agent uses six tools to accomplish this:
| Tool | What it does |
|---|---|
| catalog_search | Searches the product catalog with a natural-language query |
| catalog_get_variants | Returns available variants (color, size, connector, etc.) for a product |
| cart_add | Adds a product to the cart with a specific variant and quantity |
| cart_view | Reads the current cart so the agent can verify it matches the request |
| user_get_visit_history | Fetches recently viewed products by user |
| ask_user | Sends a clarification question to the customer when a detail is missing |
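To make the interaction concrete, here is roughly what one exchange could look like. The wire format below (dicts with "tool" and "args" keys, and the synthetic product IDs) is our assumption for illustration; the repo may encode tool calls differently.

```python
# The agent emits a tool call...
action = {"tool": "catalog_search",
          "args": {"query": "65W USB-C wall charger"}}

# ...and the environment returns retrieved candidates. Only IDs that appear
# in results like these may later be recommended (hallucination check).
observation = [
    {"product_id": "syn_000812", "title": "Anker 65W USB-C Charger"},
    {"product_id": "syn_001347", "title": "Belkin 68W GaN Charger"},
]

# Before cart_add, the agent inspects variants to pick the right one.
action = {"tool": "catalog_get_variants",
          "args": {"product_id": "syn_000812"}}
# -> e.g. [{"variant_id": "connector_usb_c"}, {"variant_id": "connector_lightning"}]
```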
**The problem**
The generator samples 1–5 target products (scaling in difficulty with d), each potentially requiring a specific variant (USB-C vs Lightning, Matte vs Glossy) and a quantity > 1. The agent must:

- Search the catalog to find each product
- Call catalog_get_variants to see available options
- Add the correct (product_id, variant_id, qty) tuples to the cart
**Why variants matter**
Real product catalogs have sparse variant data — many products have none, and those that do typically vary only by colour or size. To create a richer discrimination task, we **synthesize variants at episode initialization**:

- A per-category priority list picks the most natural attribute to vary (electronics → connector_type; clothing → size; kitchen → material).
- For each target product, we generate 3 variants: 1 target + 2 plausible distractors. An "Anker 65W USB-C Charger" produces {USB-C, Lightning, HDMI}.
- The verifier checks composite keys (product_id, variant_id) — correct product but wrong variant means the unit is unmatched.
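A sketch of how this synthesis could be implemented, with assumed attribute pools and naming (the real generator's tables will differ):

```python
import random

# Assumed per-category priority for which attribute to vary
VARY_ATTRIBUTE = {"electronics": "connector_type",
                  "clothing": "size",
                  "kitchen": "material"}
# Assumed distractor pools per attribute
DISTRACTORS = {"connector_type": ["Lightning", "HDMI", "Micro-USB"],
               "size": ["XS", "S", "M", "L", "XL"],
               "material": ["Bamboo", "Steel", "Charcoal"]}

def synthesize_variants(category: str, target_value: str):
    """Generate 3 variants for a target product: 1 target + 2 distractors."""
    attr = VARY_ATTRIBUTE[category]
    pool = [v for v in DISTRACTORS[attr] if v != target_value]
    values = [target_value] + random.sample(pool, 2)
    random.shuffle(values)
    return [{"variant_id": f"{attr}_{v.lower()}", attr: v} for v in values]

# e.g. synthesize_variants("electronics", "USB-C")
# -> variants covering {USB-C, Lightning, HDMI} in random order
```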
**Difficulty scaling**
| Axis | d = 0 | d = 3 | d = 6 | d = 9 |
|---|---|---|---|---|
| Distinct items | 1 | 2 | 3 | 4 |
| Variant required | 21% | 66% | 93% | 99% |
| Multi-quantity | 0% | 30% | 50% | 50% |
At d = 0 the agent adds a single product with no variant complexity — learning the basic catalog_search → cart_add workflow. At d = 6 it juggles 3 items, nearly all requiring a specific variant, with half needing qty > 1.
**Scoring**
A perfect score requires the cart to be exactly right — correct product, correct variant, correct quantity — though partially correct carts earn partial credit. If the agent adds the wrong variant, the simulated user corrects it mid-dialogue ("that's the Lightning version, but I need USB-C"), giving the agent a chance to self-correct before the episode ends.
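Under these rules, the task reward reduces to an F1 score over unit-level composite keys. A minimal sketch, assuming carts and goals are lists of (product_id, variant_id, qty) tuples:

```python
from collections import Counter

def cart_f1(cart, goal):
    """F1 over (product_id, variant_id) units; wrong variant = unmatched unit."""
    def units(items):
        # Expand each line into qty unit-level keys so a wrong quantity is
        # partially, not fully, penalised.
        return Counter((pid, vid) for pid, vid, qty in items for _ in range(qty))
    cart_units, goal_units = units(cart), units(goal)
    matched = sum((cart_units & goal_units).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(cart_units.values())
    recall = matched / sum(goal_units.values())
    return 2 * precision * recall / (precision + recall)

# Correct product but wrong variant scores 0; 2 of 3 requested units scores 0.8
assert cart_f1([("p1", "usb_c", 1)], [("p1", "lightning", 1)]) == 0.0
assert abs(cart_f1([("p2", "xs", 2)], [("p2", "xs", 3)]) - 0.8) < 1e-9
```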
**Trajectories: easy vs. hard**
Two real E_CART episodes from a Qwen 3 8B agent. Same environment, same agent — difficulty alone changes the game.
| | d = 1 — 1 item, no variants | d = 8 — 3 items, variants + typos |
|---|---|---|
| User message | *"Shopping list: SonicWave Keypad. Please add everything."* | *"pls add the following to my aart: Eco-Frienly HrenLefa Charger by GreenLeaf, High-Performance AqaPure Air Fryer, Lighrweiht PixelVault Tank Top (3)."* — typos from the noise axis |
| Turn 1 | user_get_visit_history() → 4 results. Spots the target immediately. | user_get_visit_history() → 9 results. Must disambiguate across chargers, fryers, tops. |
| Turn 2 | cart_add("syn_000162", qty=1) → Cart: 1 item, $2.74. "I found the SonicWave Keypad." User: *"Great, that's the exact one."* | No tool calls — presents 3 product cards. User: *"Add the Charger with 100W output, Air Fryer with Charcoal filter, 3 Tank Tops in size XS."* |
| Turn 3 | answer: { "done": true } — done in 3 turns. | catalog_get_variants × 3 → charger: 6 variants (100W, 120W, 30W…); air fryer: 6 (Bamboo, Permanent, Charcoal…); tank top: 7 (XS, S, XL…). User: *"I need three Tank Tops in XS."* |
| Turn 4 | | cart_add(charger, "100W", qty=1) ✓; cart_add(fryer, "Bamboo", qty=1) ✗ should be Charcoal; cart_add(tank_top, "size_XL", qty=3) ✗ should be XS. User: *"Air Fryer has Bamboo not Charcoal, Tank Top is XL not XS."* |
| Turn 5 | | Re-fetches tank top variants, adds 2× XS — doesn't fix the air fryer. User: *"Air Fryer still shows Bamboo. I need three XS, not two."* |
| Turn 6 | | Re-fetches tank top variants again — still ignores the air fryer. User: *"Please change the Air Fryer variant."* |
| Turn 7 | | "I cannot find a Charcoal filter variant." — it exists; the agent saw it in turn 3 but never selected it. User: *"Skip the Air Fryer then."* |
| Turn 8 | | answer: { "done": true } — cart still wrong: Charger ✓, Air Fryer (Bamboo) ✗, Tank Top (3×XL + 2×XS) ✗ |
| Reward | r_task = +1.00, r_eff = +0.33, r_hall = 0.00, r_total = +0.80 ✓ | r_task ≈ 0.00, r_eff = −0.43, r_hall = 0.00, r_total = −0.06 ✗ |
| Outcome | Cart matches goal. 3 turns, 2 effective. | Wrong variants, wrong quantities, user gave up. 8 turns, 6 effective. |
At d=1 the agent solves the task in 3 clean turns. At d=8 it spirals — picking Bamboo instead of Charcoal, XL instead of XS, never fixing the air fryer despite two user corrections, then hallucinating that the variant doesn’t exist. This is exactly the kind of multi-step error cascade that the difficulty curriculum surfaces, and that adaptive training should teach the agent to recover from.
**User simulation**
A verifiable environment needs a user simulator that behaves realistically. We use **Qwen3.5 (9.7B)** to generate natural, varied user messages rather than canned templates — covering everything from typo-filled requests to mid-conversation topic switches.
Two design choices matter for training quality:
**Preferences match stated constraints.** Each simulated user has a hidden set of preferences (price sensitivity, brand loyalty, shipping speed, etc.). These are deliberately biased toward whatever constraints the user communicated — so if the user said "under $25", the reward function actually cares about price. Without this, an agent could be penalised for correctly following the user's instructions.
**Strategic omission.** The LLM deliberately withholds some constraints from the opening message to force the agent to ask clarifying questions. The system tracks exactly what was and wasn't mentioned, so the agent is never penalised for information it was never given.
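A simplified sketch of that bookkeeping, assuming constraints are a dict and the omission rate is one of the difficulty axes (names and structure here are illustrative):

```python
import random

def reveal_constraints(constraints: dict, omit_prob: float, rng=random):
    """Split hidden constraints into mentioned vs. withheld.

    Withheld constraints surface only if the agent asks; the `mentioned`
    record is what tracks what the agent was actually told, so it is never
    penalised for information it was never given.
    """
    mentioned, withheld = {}, {}
    for name, value in constraints.items():
        target = withheld if rng.random() < omit_prob else mentioned
        target[name] = value
    return mentioned, withheld

opening, hidden = reveal_constraints(
    {"max_price": 25, "ships_within_days": 2, "connector": "USB-C"},
    omit_prob=0.7,  # at d = 6 the user omits constraints ~70% of the time
)
```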
**Environment scaling**
Following RLVE’s methodology, we define nested environment collections:
C1 ⊂ C2 ⊂ C4 ⊂ C8
| Collection | Environments | Skills trained |
|---|---|---|
| C1 | Cart | Search query formulation, cart manipulation |
| C2 | + Substitution | Similarity reasoning under constraints |
| C4 | + Product Discovery, Returns | Transactional workflows (retrieval + recommendation, return initiation) |
| C8 | + Status, Policy, Bundle, Journey | Knowledge retrieval, planning, compositionality |

We hypothesise — consistent with RLVE's findings — that C8 agents outperform single-environment specialists, even on the specialist's own task.
**Early results**
We trained Qwen 3 8B with DAPO on C1 (Cart Building) for 300 steps as an initial viability study.
| Config | Value |
|---|---|
| Base model | Qwen 3 8B |
| Algorithm | DAPO (G = 8 rollouts/prompt) |
| LR | 1e-5 |
| Catalog | 2M products, FAISS index with Alibaba-NLP/gte-modernbert-base (768-dim) |
| User sim | Qwen3.5 9.7B |
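For reference, the retrieval side of that config can be reproduced in a few lines. A hedged sketch: the `title` field, the subset size, and the flat inner-product index are our assumptions, not necessarily the repo's exact setup.

```python
import faiss
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")  # 768-dim
catalog = load_dataset("owlgebra-ai/Amazebay-catalog-2M", split="train[:10000]")

# Normalised embeddings + inner product = cosine similarity
embeddings = model.encode(catalog["title"], normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# catalog_search then reduces to a k-NN query over the index
query = model.encode(["usb-c charger under $25"], normalize_embeddings=True)
scores, ids = index.search(query, 10)
```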

We saw progressive growth in difficulty reached, confirming that adaptive scheduling produces a steady learning signal rather than the saturation (static-low) or starvation (static-high) patterns predicted by the RLVE paper.
**Try it yourself**
Run a live episode directly in your browser using the embedded demo below. Here is how to get started:
- **Pick an environment** from the dropdown (e.g., E_CART for cart building or E_PD for product discovery).
- **Set a difficulty** — 0 is a simple single-constraint task; 6+ introduces missing information, noisy retrieval, and variant selection.
- **Click "Reset Episode"** — the simulated user will open with a shopping request.
- **You are the agent** now: make tool calls, analyse outputs, and submit the final list of product IDs.
- Click **"Reset Episode"** between runs to start a fresh scenario.
**Resources**
The environments, verifiers, and training configs are all open-source:
```bash
git clone https://github.com/owlgebra-ai/EcomRLVE-Gym
cd EcomRLVE-Gym
pip install -e .
```
The 2M-product catalog is on the Hub:
```python
from datasets import load_dataset

catalog = load_dataset("owlgebra-ai/Amazebay-catalog-2M", split="train")
print(f"{len(catalog)} products loaded")
```
**References**
- Zeng, Z., Ivison, H., Wang, Y., et al. (2025). *RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments.* ICML 2025. arXiv:2511.07317
- Yu, Q., Zhang, Z., Zhu, R., et al. (2025). *DAPO: An Open-Source LLM Reinforcement Learning System at Scale.* arXiv:2503.14476
- Shao, Z., Wang, P., Zhu, Q., et al. (2024). *DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.* arXiv:2402.03300
- DeepSeek-AI. (2025). *DeepSeek-R1: Incentivizing Reasoning in LLMs through Reinforcement Learning.* Nature.
- Meta AI. (2024). *Llama 3.1: A Foundation Model for General Intelligence.* llama.meta.com
- Qwen Team. (2025). *Qwen3 Technical Report.* arXiv:2505.09388
Similar Articles
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.
Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
Researchers from Tianjin University and Alibaba Group propose EA-RLVR, a reinforcement learning framework with verifiable rewards that improves cross-cultural entity translation in LLMs by activating parametric knowledge already encoded during pre-training, without relying on external knowledge bases. Training on 7k samples boosts Qwen3-14B's entity translation accuracy from 23.66% to 31.87% on unseen entities.
@SergioPaniego: OpenEnv is growing fast in tutorials. If you're looking to get started with RL environments, check them out > evaluate …
OpenEnv, a platform for reinforcement learning environments, is expanding its tutorials, covering topics like evaluating agents, rewards via rubrics, and connecting agents via MCP.
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit is an automated pipeline that generates diverse, verified environments for claw-like agents from natural language descriptions, enabling the construction of Auto-ClawEval, a large-scale benchmark with 1,040 environments at 13,800x lower cost than human curation. The system supports continuous, on-demand evaluation and adaptive training environment generation across multiple model families and agent frameworks.

