@SergioPaniego: https://x.com/SergioPaniego/status/2067270222671741360

X AI KOLs Timeline 06/17/26, 03:36 PM Tools

trl openreward reinforcement-learning rl-environments training open-source

Summary

OpenReward environments now integrate directly into TRL's GRPOTrainer via a single OpenRewardSpec, allowing zero-glue-code training against a catalog of RL environments. The integration is experimental and part of a broader effort to make environment and agent RL first-class in TRL.

https://t.co/AKHNVGmBPz

Original Article


Cached at: 06/18/26, 02:06 AM

Train against a live reward environment in TRL, now with OpenReward

TL;DR: OpenReward environments now plug straight into TRL’s GRPOTrainer. One OpenRewardSpec wires an ORS environment (its tasks, tools, and reward) into the trainer’s three slots, so you can train against the OpenReward catalog (or a self-hosted or local ORS server) with no glue code. pip install trl.

OpenReward is an open ecosystem of RL environments built on the Open Reward Standard (ORS), a public HTTP/SSE protocol for how an environment exposes its tasks, tools, sessions, and rewards. Because ORS is just a protocol, the same environment can run on the hosted openreward.ai catalog, self-hosted on your own infra, or locally while you develop it.

One OpenRewardSpec resolves an environment into the trainer’s three slots, so you pick one from the catalog, hand it over, and train:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

# Resolves the env, its tasks, and its ORS-computed reward into the three trainer slots.
spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_tool_calling_iterations=20),
    train_dataset=spec.train_dataset,  # the ORS task list
    environment_factory=spec.environment_factory,  # one ORS session per rollout
    reward_funcs=spec.reward_funcs,  # the ORS-computed reward
)
trainer.train()
```

That runs today. The policy calls the environment’s tools turn by turn, the environment scores the outcome, and GRPO trains on it. The harness (the tool surface and the loop) comes from the environment’s tools; the only part being trained is the policy. Point the spec at a catalog name (set OPENREWARD_API_KEY) or at a URL for a self-hosted or local server. A full runnable script is seta.py.
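To make the loop described above concrete, here is a minimal, library-free sketch of one group of rollouts. Everything in it is hypothetical and for illustration only: `ToyEnvironment`, its `guess` tool, `run_rollout`, and `group_relative_advantages` are not part of TRL or ORS, and the random guesser stands in for the policy. Only the names `num_generations` and `max_tool_calling_iterations` are borrowed from the config above. The advantage step uses a common GRPO formulation (reward minus group mean, divided by group standard deviation); the exact computation inside TRL may differ.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for one environment session (not the ORS API)."""

    target: int
    guesses: list = field(default_factory=list)

    def guess(self, value: int) -> str:
        """The only tool: returns feedback the policy sees on its next turn."""
        self.guesses.append(value)
        if value == self.target:
            return "correct"
        return "higher" if value < self.target else "lower"

    def reward(self) -> float:
        """Environment-owned score: 1.0 if the last guess hit the target."""
        solved = bool(self.guesses) and self.guesses[-1] == self.target
        return 1.0 if solved else 0.0


def run_rollout(env, rng, max_tool_calling_iterations):
    """Call the tool turn by turn, then let the environment score the outcome."""
    low, high = 1, 16
    for _ in range(max_tool_calling_iterations):
        value = rng.randint(low, high)  # stand-in for the policy's tool call
        feedback = env.guess(value)
        if feedback == "correct":
            break
        if feedback == "higher":
            low = value + 1
        else:
            high = value - 1
    return env.reward()


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
num_generations = 8  # rollouts sampled for the same task

# One fresh environment session per rollout, all on the same task.
rewards = [
    run_rollout(ToyEnvironment(target=11), rng, max_tool_calling_iterations=3)
    for _ in range(num_generations)
]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that score above their group's mean get a positive advantage and the rest a negative one. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.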

Install, set your key, and launch on a single node. The commands shown here follow the example and run vLLM in server mode, with two GPUs for the vLLM server and two for training:

```bash
pip install "trl[vllm,openreward]"
export OPENREWARD_API_KEY=…

# Terminal 1: vLLM server (2 GPUs)
CUDA_VISIBLE_DEVICES=2,3 trl vllm-serve \
    --model Qwen/Qwen3-4B \
    --tensor-parallel-size 2 \
    --port 8000

# Terminal 2: training (2 GPUs)
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    --num_processes 2 \
    examples/scripts/openreward/seta.py \
    --vllm-mode server \
    --vllm-server-base-url http://localhost:8000
```
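Before launching, a small optional preflight check can catch the two easiest mistakes in this setup: a missing API key, and GPU lists that overlap between the two terminals. This is a sketch under assumptions, not something TRL or OpenReward ships; `preflight` and `disjoint_gpu_sets` are hypothetical helper names, and the device strings mirror the `CUDA_VISIBLE_DEVICES` values used in the commands.

```python
import os


def disjoint_gpu_sets(server_devices: str, trainer_devices: str) -> bool:
    """Return True if the two comma-separated GPU lists share no device."""
    server = {d.strip() for d in server_devices.split(",") if d.strip()}
    trainer = {d.strip() for d in trainer_devices.split(",") if d.strip()}
    return server.isdisjoint(trainer)


def preflight(env, server_devices: str, trainer_devices: str) -> list:
    """Collect human-readable problems; an empty list means nothing was found."""
    problems = []
    # The key is needed when the spec points at a catalog name.
    if not env.get("OPENREWARD_API_KEY"):
        problems.append("OPENREWARD_API_KEY is not set")
    if not disjoint_gpu_sets(server_devices, trainer_devices):
        problems.append("vLLM server and trainer GPU lists overlap")
    return problems


if __name__ == "__main__":
    # Device lists as used in Terminal 1 and Terminal 2.
    for problem in preflight(os.environ, "2,3", "0,1"):
        print(problem)
```

The key check only matters for catalog environments; when the spec points at a URL for a self-hosted or local server, the key may not be required.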

NOTE: OpenReward support is experimental (it lives under trl.experimental), so expect the API to keep evolving. It is one step in a broader direction to make environment and agent RL first-class in TRL, with the design being worked out in the open: environment-owned reward (#5912), environment-owned dataset (#5903), and a single rollout-source contract that unifies environment and agent rollouts (#5974).

TRL also integrates OpenEnv, the open environment standard. For the wider landscape of RL environment frameworks beyond TRL, see The ultimate guide to RL environments.

Resources

TRL OpenReward guide: https://huggingface.co/docs/trl/openreward
Runnable example (seta.py): https://github.com/huggingface/trl/blob/main/examples/scripts/openreward/seta.py
OpenReward catalog: https://openreward.ai
Open Reward Standard (ORS): https://openrewardstandard.io
The ultimate guide to RL environments: https://huggingface.co/spaces/AdithyaSK/rl-environments-guide
Agent Glossary (the vocabulary used here): https://huggingface.co/blog/agent-glossary
TRL OpenEnv integration: https://huggingface.co/docs/trl/openenv


