@GokuMohandas: https://x.com/GokuMohandas/status/2066853420326384055
Summary
This technical guide explains why organizations should build their own learning loops on open-source AI models rather than renting intelligence from frontier labs, drawing on case studies from finance, robotics, and biotech.
View Cached Full Text
Cached at: 06/16/26, 05:39 PM
Stop renting your intelligence: a technical guide to building your own learning loop
@satyanadella recently wrote that the winning move in AI is no longer picking the best model, it’s building a learning loop on top of one, so that your data and usage compound into IP nobody can rent back to you. This post is a deep dive on exactly why you’d want to do this, and a blueprint for how, by looking at how companies across industries (finance, robotics, autonomy, ecommerce, biology, etc.) have done precisely this to win.
Part 1: The case
While the industry debates whether to build or rent, here’s what’s actually happening right now.
Whose loop are you building?
Every time you use a frontier model, you hand the lab signal: the prompts, the traces, the edge cases, the workflows the model then learns to serve better. That alone helps the lab build its loop against yours. The deepest version of the leak is the forward-deployed engineer (FDE) relationship, where frontier labs embed engineers inside your company, build on your proprietary processes, and gather exactly the context they need to construct reinforcement learning environments that improve the next generation of their models. Either way, light touch or deep embed, you are funding the extraction of your own institutional knowledge into a model you’ll then keep renting.
In the short run this is actually pretty rational on both sides. The labs get the data they need to keep moving up the capability curve. And you, the customer, get implementation expertise that’s genuinely scarce and an ROI that’s genuinely fast.
But in the long run there are only two exits. Either you get hooked, renting back, at a premium, a model trained on traces of your own work. Or, as Satya recommends, you learn to build the learning loop yourself: your own evals, your own RL environments, your own post-training stack, on an OSS base you control.
“A frontier without an ecosystem is not stable. The real opportunity is not in picking the best model but in building a learning loop on top of models where human capital and token capital compound. Private RL environments should let models grow stronger on real traces from inside the organization. This loop becomes the new IP of the firm.” Satya Nadella, post on X
This post is the engineering version of that argument. It’s for the people who agree with the direction and want to know what it actually takes. What a learning loop is, why it isn’t just a fancy fine-tune, where the hard parts are, and how to build one on infrastructure you control. The companies already doing this (Nubank, Physical Intelligence, Coinbase, Torc, Runway, Bedrock Robotics, Recursion, Reflection AI, Notion, and many more) are following patterns we’re about to discuss.
Is OSS just a red herring?
Before we start though, there’s a counter-argument worth taking seriously, because it’s the most credible one against the position this post defends. Dario Amodei, CEO of Anthropic, recently put it like this:
“I’ve actually always seen [open source] as a red herring. When I see a new model come out I don’t care whether it’s open source or not. It actually doesn’t matter either way. Because ultimately you have to host it on the cloud. The people who host it on the cloud do inference. These are big models, they’re hard to do inference on. It’s not free. You have to run it on inference and someone has to make it fast on inference.” Dario Amodei, Anthropic (interview)
Dario is right about the surface mechanics. “Open weights” isn’t “open source” because hosting an OSS model isn’t free, the big ones are genuinely hard to serve, and making them fast is real engineering. But his conclusion to “therefore just rent” only holds if you ignore three things:
1. The hosting problem now has well-mapped solutions. The capabilities that were genuinely hard a year ago like multimodal serving, diffusion-based image and video generation, efficient inference for mixture-of-experts models, are now solved in the OSS ecosystem. Ray, vLLM, HuggingFace, KubeRay, SGLang, etc. have closed those gaps. xAI runs Grok Imagine on Ray. DigitalOcean cut P99 TTFT 70% with prefix-aware routing on Ray + vLLM. Tripadvisor cut batch LLM inference cost 70–82% versus competitive API offerings. Apple, BMW, Adobe, and more. The frontier-lab framing of “we’re the only ones who can serve big models fast” is definitely no longer true.
2. The OSS models are genuinely good now. Qwen, Llama, DeepSeek, Mistral, Gemma, Kimi for text and code. NVIDIA’s Cosmos line for action-conditioned world models and video. And there’s more that are industry specific too.
3. Hosting is now table stakes and extending is the moat. This is the move the “red-herring” framing doesn’t address at all. Once you can host a model, you can also extend it by post-training it on your own data, wrapping it in an RL environment that rewards your outcomes, building a learning loop that compounds every time someone uses your product.
4. The runtime has to be open, not just the model. If the runtime itself is proprietary (the inference-as-a-service and training-as-a-service category, all built on closed orchestration), you’ve traded model lock-in for runtime lock-in. The whole point of owning your intelligence is portability so that your weights are yours, your data is yours, and your code runs unchanged on whatever infrastructure you choose. That requires the runtime to be community-owned and OSS. The production platform on top, in-house, managed, on VMs or on K8s, can be whatever fits your team. The runtime cannot…
One more thing the framing leaves out is that renting is more expensive than owning at scale and the cost curve is what eventually forces migration. I’ve seen this not just at “enterprise scale” specifically, but any team with real usage can hit this within a few months.
DimensionRentOwnCostLinear in volume, no ceilingHigh upfront, flat marginalPerformanceTuned for the vendor’s average customerTuned for your traffic: TTFT, throughput, KV-cache reuse, prompt caching, batch shapes, speculative decoding policyQualityFrontier-general, same model for everyone, sometimes it gets straight up borkedSpecialized for your tasks via post-training and RL on your traces
Hosting can be simplified with the right blueprint for how to do it (and how many companies have already done it). And once you have it, you get something the rented stack can’t give you: the ability to make the model yours.
Part 2: What a learning loop actually is
The anatomy of a learning loop
A “learning loop” is a specific architecture wired into a feedback cycle.
Each box is a workload and together are what Satya calls the “hill climbing machine” (my assumption), the thing that compounds every time someone uses your product. The reason most companies don’t have one yet is not because any one single piece is hard, but stitching them together as a production system (not a notebook lol) requires a runtime that can move data, schedule heterogeneous GPU work, host services, train, and synchronize weights across all of them.
Before we go further, I want to quickly cover the the RL environment, which is easily the most misunderstood box in the diagram.
What an RL environment actually is
And why it isn’t just a fine-tune.
The single biggest source of confusion in the “own your intelligence” conversation is conflating fine-tuning with reinforcement learning. They aren’t the same thing and the difference is the entire reason owning your loop matters.
An RL environment is a programmable simulator of your business. It has four parts:
-
State. The situation the model is looking at. This could be a customer support ticket, code repo, financial transaction, your bot’s camera frame, medical record, etc.
-
Action space. What the model can do. Like write a reply, call a tool, edit a file, delete backups of your prod DB, place a trade, send a motor command, order a lab test, etc.
-
Transition. How the world updates after the action. This can be a real tool execution, a sandboxed environment or a simulator.
-
Reward. Something that scores the action against the outcome that actually matters to your business. Did the ticket get resolved? Did the tests pass? Did the trade make money? Did the bot grasp the burrito?
python# RL env class Environment: def reset(self) -> State: …
def step(self, state: State, action: Action) -> tuple[State, Reward]:
next_state = self._transition(state, action) # the world responds
reward = self._score(state, action, next_state) # programmatic, not labeled
return next_state, reward
The model basically runs this loop millions of times (lots of synthetic data gen here too). An RL algorithm (PPO, GRPO, DAPO, DPO-on-rollouts, and similar) updates the policy weights to make high-reward actions more likely over time.
Now let’s compare this to what supervised fine-tuning is doing. LoRA fine-tuning on collected data is basically imitation because you hand the model a static dataset of (input, desired_output) pairs and the model learns to reproduce the labels. The day your fraud patterns change, your product launches or your tool APIs evolve, the model is stale until you re-label and re-train. This kind of drift is very common and you need to constantly relabel (expensive).
But with RL, the model isn’t learning to imitate examples but instead, it’s learning to achieve outcomes. Every new rollout, every chargeback, every passing test, every successful agent session generates fresh training signal automatically and this loops just keeps compounding (one shot vs. cycle).
DimensionLoRA fine-tuneRL environmentWhat it teachesImitate labelsAchieve outcomesSignal sourceLabels (expensive, slow, static snapshot)Programmatic reward (cheap, continuous, process)Improves over time?No, frozen at trainingYes, every rollout is new training dataHandles novel situations?Only if they look like training dataYes, the model explores and is scoredMulti-step / tool use?Hard, needs labeled trajectoriesNative, the loop is the unit of trainingWhat it optimizesToken-level likelihoodThe business outcome you defined
The same blueprint, every industry
The learning-loop diagram (product → traces → eval → RL environment → post-training → serving → back to product) is industry-agnostic but what changes between industries is just what’s inside each box: the data modality, the action space, and the reward function. But otherwise, it’s pretty much the same.
StageFinanceCode agentsPhysical AILife sciencesCustomer opsProductRecsys, credit scoring, fraud screeningIDE, autonomous code reviewRobot fleet, autonomy stackResearch assistant, lab automationSupport agent, ops copilotTracesTransactions, app events, chargebacksRepo edits, test runs, PR reviewsSensor logs, video, action sequences, sim rolloutsExperiment results, paper queries, lab readingsTickets, transcripts, resolutionsRL env stateCustomer state + transaction historyRepo snapshotCamera frame + proprioceptionHypothesis + prior experimentsConversation state + CRM recordAction spaceDecision (approve / deny / route)Edit / tool call / run testsMotor command / motion planRun assay / query DB / propose moleculeReply / call tool / escalateTransitionReal downstream effect (chargeback, retention)Sandboxed test executionSimulator or real-world rolloutWet-lab or in-silico runLive or shadowed conversationRewardRealized P&L, retention, fraud caughtTests pass, lint clean, tokens usedTask success, sim score, sim-to-real transferHit rate, assay outcome, citation impactResolution, CSAT, time savedPost-trainLoRA-per-task on top of a transaction backboneRLVR with GRPO on sandboxed testsVLA fine-tune + closed-loop sim flywheelRLVR on experimental outcomesDPO / GRPO on labeled and shadowed tracesServingOnline scoring + batch refreshMulti-replica policy with autoscalePolicy server queried by sim and real robotsTool-using agent endpointConversational endpoint with traffic routing
Part 3: How you get there
Owning your loop has two independent axes and conflating them is why this topic can be confusing.
Capability is what you build. How much of the model is actually yours: borrowed embeddings → then your own embeddings → then a fine-tune → then an RL environment → then a continuous loop.
Hosting is where it runs. Where the weights actually execute: a rented API → then self-hosted OSS → then a hybrid of both behind one gateway.
These two axes are completely orthogonal because you can climb the capability staircase while still renting your hosting and you can self-host a model you’ve barely customized. Most teams I’ve worked with move up both over time but rarely in lockstep. We’ll start with where it runs and why all of this gets hard at production scale, then spend the rest of the post climbing the capability staircase one step at a time.
The hosting migration: from rent to own
“Build your own learning loop” sounds like a binary choice between renting from a frontier API and standing up an in-house foundation-model team. But in reality, this typically happens over three phases.
I’ve seen the framework below over and over (industry agnostic) but you can see specific details from Coinbase and BMW to see their exact journey.
Phase 1: Renting from frontier APIs
Route to frontier APIs (OpenAI, Anthropic, Google, etc.) through a thin AI gateway that handles auth, monitoring, and guardrails. One OpenAI-compatible interface upstream and multiple providers downstream to start proving out new features and workflows.
Phase 2: Self-hosting OSS models
Then you can stand up self-hosted inference for the modalities and use cases where rental costs or privacy concerns are too big to ignore. A common OSS layer here is Ray + vLLM for inference, KubeRay for orchestration, HuggingFace as the model source and you can fine-tune with LoRA for high-load use cases.
It’s all the same OpenAI-compatible surface the gateway but now backed by an OSS model running inside your perimeter:
python# Phase 2: self-host an OSS model behind the same OpenAI-compatible API
the gateway is already talking to.
from ray import serve from ray.serve.llm import LLMConfig, build_openai_app
llm_config = LLMConfig( model_loading_config=dict( model_id=“Qwen/Qwen3-32B-Instruct”, # LoRA adapters trained on your traces, hot-swappable per request lora_config=dict(max_num_adapters=8), ), accelerator_type=“H100”, deployment_config=dict( num_replicas=“auto”, max_ongoing_requests=64, ), engine_kwargs=dict(tensor_parallel_size=4), )
The gateway can now route to this exactly as it routes to Bedrock or Azure.
serve.run(build_openai_app({“llm_configs”: [llm_config]}), route_prefix=“/v1”)
The OSS unlock
Credit to the OSS community here because there are now OSS model and serving ecosystems that cover every modality your product needs. Of course, text and code have had production-grade OSS foundations for a while (Llama, Qwen, DeepSeek, Mistral, Gemma, Kimi). The harder modalities were the ones dominated by closed APIs (Sora for video, GPT-image, audio gen, native multimodal). But now we’ve got models like NVIDIA’s Cosmos line for action-conditioned world models and video, strong open weights for image generation (Flux), open speech models (Whisper), tons of VLMs and even multimodal serving for diffusion models (vLLM-omni).
So now you realize that hosting is real work. You start incurring GPU costs, set up a serving stack beyond just an API key. But the reason teams do it anyway is because the next phase (and the loop on top of it) is only reachable from here.
Phase 3: One gateway for all
The final phase (for now) is hybrid hosting. You have a gateway that routes to frontier APIs for some endpoints (low-volume, frontier-only capability stuff) and to self-hosted OSS models for others (high-volume, privacy-sensitive, cost-sensitive, product-specific since you maybe extended it). Ray Serve for nice load balancing, autoscaling, etc. and all under the same OpenAI-compatible interface upstream.
Why this is hard at production scale
I’m going to quickly list out some MLOps/LLMOps challenges here if it’s not already clear why every company hasn’t already set this up:
-
Eval infrastructure: Most teams have downloaded benchmarks and a few notebooks but not evals tied to the outcomes their business actually cares about… This takes time and thought to get right.
-
Trace collection: Your product has to be instrumented to capture every (state, action, outcome) triple but luckily lots of great tracing tools out there (or build your own quickly too).
-
Reward design: Scoring an action against a real business outcome is not trivial and you must involve domain experts here.
-
Sandboxing and safety: Executing the agent’s actions safely needs a sandbox with the right blast radius… a container for a code agent, a simulated ledger for a financial one, a sim for a robot, etc. I found this is extremely contextul and might be worth building out yourself, depending on your space.
-
Distributed RL at scale: Rollout and training staleness, weight-sync latency, off-policy correction, and replay-buffer management, updated recipies, etc. are all reasons you want to go OSS (and no home-grown) here.
-
Continuous deployment: A loop is only a loop if new policies ship back to production through versioned checkpoints, traffic-shifted rollouts, regression evals, and rollback.
I’m not listing these out to discourage setting up the learning loop but to point out which gaps you want to build on oss vs. use a bespoke saas.
Part 4: The climb
You don’t build a full RL loop right away but get there (quickly these days) over time.
Apple described their internal user base where roughly 70% of their teams are standard users doing post-training (SFT) with PyTorch + Accelerate on FSDP / DeepSpeed and 30% are advanced users doing pre-training and RL on PyTorch + Ray with FSDP2 / TP / PP . Most teams start at Stages 1–3 (embeddings, private evals, SFT) and then climb to Stages 4–5 (RL env + production loop). All of which can run on the same oss Ray runtime.
Each step is a workload and I’ll show the Ray primitives that handles its hard part and also what a platform adds once you run it at company scale. Everything runs on one open-source runtime (Ray), while the platform on top (Anyscale, KubeRay, or your own K8s) is a reversible choice, so you’re never locked in one level down.
Stage 1: Own your embeddings. This is definitely where most teams should sart because not every team is ready to train a full foundation model, but almost any team can own its embeddings. Train (or post-train) an embedding model on your data, then reuse the embeddings across every downstream product, search, recommenders, content matching, fraud, retrieval for AI agents. And then watch every downstream consumer improve from these contextual embeddings.
This is what Tripadvisor did with Ray to produce image embeddings, review (text) embeddings, geo embeddings, and using them everywhere from real-time location search to AI-agent RAG to content collections. It’s also what Nubank’s nuFormer is at the financial-services level, a single transaction-sequence backbone whose embeddings lift credit, fraud, churn, and recsys simultaneously. And Adobe Firefly does it inside the training loop itself, with JIT embeddings on Ray Serve feeding the trainer.
All of these teams build the embedding once, reuse everywhere, and watch the improvements compound across teams that used to ship their own ad-hoc features.
At a high level, it’s two Ray workloads that do most of the work. First, embed the whole corpus in one batch pass with Ray Data and vLLM, then write the vectors where every downstream product can read them:
python# Batch-embed your corpus with Ray Data + vLLM, then reuse the vectors
across search, recommenders, fraud, and RAG. Re-run on a schedule to refresh.
import ray from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
ds = ray.data.read_parquet(“s3://corpus/documents/2026/”) # text, reviews, transactions, …
embed_cfg = vLLMEngineProcessorConfig( model_source=“BAAI/bge-large-en-v1.5”, # or your own post-trained backbone task_type=“embed”, engine_kwargs=dict(max_model_len=512), concurrency=8, # scale out across replicas accelerator_type=“L4”, # embedding is cheap; small GPUs are fine ) embedder = build_llm_processor( embed_cfg, preprocess=lambda row: dict(prompt=row[“text”]), postprocess=lambda row: dict(id=row[“id”], embedding=row[“embeddings”]), )
ds = embedder(ds) ds.write_parquet(“s3://embeddings/corpus/2026-06/”) # one asset, every team reads it
Then we can put the same model behind an OpenAI-compatible endpoint with Ray Serve, so online products (search, agent RAG) all hit one /v1/embeddings surface instead of each rolling their own:
python# Serve the embedding backbone once; every downstream caller shares it. from ray import serve from ray.serve.llm import LLMConfig, build_openai_app
embed_config = LLMConfig( model_loading_config=dict(model_id=“BAAI/bge-large-en-v1.5”), accelerator_type=“L4”, deployment_config=dict(num_replicas=“auto”, max_ongoing_requests=128), )
Exposes POST /v1/embeddings, the same API your gateway already speaks.
serve.run(build_openai_app({“llm_configs”: [embed_config]}), route_prefix=“/v1”)
Runtime: Ray Data LLM for the batch inference, Ray Serve for the shared /v1/embeddings endpoint. At scale: schedule the re-embed as a recurring job and let it autoscale, so refreshing the whole corpus is cheap.
Stage 2: Private evals. Now we can start to build an eval harness that runs continuously on production traffic shadows. This alone, before any post-training, will tell you which off-the-shelf model is actually best for your business (don’t blindly follow public leaderboards here, just test things out).
pythonimport ray from ray.data import read_parquet
traces = read_parquet(“s3://your-traces/last-week/”)
def score(trace): return { “model_version”: trace.model_version, “resolved”: did_resolve(trace), # outcome you care about “csat”: trace.customer_score, “tokens_used”: trace.tokens, }
results = traces.map(score).to_pandas() print(results.groupby(“model_version”).mean())
Eval isn’t a notebook or a one-off script… at company scale it’s a fleet. Zoox shared their pattern where training emits multiple checkpoints, each fans out to an asynchronous eval job on a separate Ray cluster, custom evaluators (including LLM-as-Judge) plug in independently of the training loop, and results stream back to an experiment dashboard. The eval workers run on smaller, cheaper GPUs than the training cluster, because eval is latency-bound per batch and not throughput-bound. Again, this is all on the same Ray substrate but with different hardware profile (right sized) so we can lower the cost.
Runtime: Ray Data for trace ingestion, Ray for distributed eval on cheap GPUs. At scale: observability that tracks eval metrics across model versions over time.
Stage 3: SFT on collected traces. Once you have evals, then you can responsibly post-train. We can take any OSS base (Llama, Qwen, Mistral, DeepSeek, Kimi, whatever fits), supervised-fine-tune on traces of your best outcomes, deploy it, and measure with your eval harness.
But before you can fine-tune, you have to process/curate the data, and that’s often the real bottleneck for most teams. For multimodal workloads (video, lidar, audio, document images, robot trajectories) the gap between “we have this data sitting in S3” and “we have a training-ready dataset” is months of engineering to get right.
Here’s a simplified multimodal curation pipeline (the example is shaped around robot trajectory data, but the structure, CPU decode → GPU model stages → GPU embed/dedup → training-ready shards, is identical for video, audio, document images, and financial event streams):
pythonimport ray from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
Read trajectory data. Ray Data ships standard readers (parquet, images,
binary files, …) and is extensible via custom Datasource subclasses for
different domain formats.
ds = ray.data.read_parquet(“s3://fleet-trajectories/2026/”)
CPU stages: decode, scene-split, filter out idle segments.
ds = ( ds.map_batches(decode_video, num_cpus=4, batch_size=8) .map_batches(scene_split, num_cpus=2) .filter(lambda x: x[“motion_energy”] > 0.1) )
GPU stage: caption with a VLM. Ray Data LLM stays on-GPU through the
processor (NVIDIA integration), no CPU round-trips.
captioner_cfg = vLLMEngineProcessorConfig( model_source=“Qwen/Qwen2-VL-7B-Instruct”, engine_kwargs=dict(tensor_parallel_size=1, max_model_len=4096), concurrency=8, accelerator_type=“H100”, ) captioner = build_llm_processor( captioner_cfg, preprocess=lambda row: dict(messages=to_vlm_prompt(row)), postprocess=lambda row: dict(caption=row[“generated_text”]), ) ds = captioner(ds)
GPU stage: dedupe by embedding similarity.
ds = ds.map_batches(embed_and_dedup, num_gpus=1, batch_size=64)
Write training-ready shards directly to the training cluster’s store.
ds.write_parquet(“s3://training/cosmos-vla-2026/”)
You can use basically this same code to handle a financial use case (tokenize hundreds of billions of transactions for a transaction backbone), a recsys use case (refresh embeddings across hundreds of millions of users on a schedule), or a document use case (chunk and embed the corpus that feeds a private search layer).
Once the data is curated, post-training is a Ray Train job. The same ScalingConfig abstraction handles 1 GPU and 64 GPUs:
python# LoRA fine-tune on collected traces with Ray Train.
Real pattern, lifted from the Ray Train + QLoRA Anyscale template.
from peft import LoraConfig, get_peft_model from transformers import ( AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, )
import ray from ray.train import RunConfig, ScalingConfig from ray.train.torch import TorchTrainer from ray.train.huggingface.transformers import ( RayTrainReportCallback, prepare_trainer, )
def train_func(config): tokenizer = AutoTokenizer.from_pretrained(config[“model_name”]) model = AutoModelForCausalLM.from_pretrained(config[“model_name”]) model = get_peft_model(model, LoraConfig( r=config[“lora_r”], lora_alpha=config[“lora_alpha”], target_modules=[“q_proj”, “k_proj”, “v_proj”, “o_proj”], ))
trainer = Trainer(
model=model,
args=TrainingArguments(
output_dir=config["output_dir"],
per_device_train_batch_size=config["batch_size"],
learning_rate=config["learning_rate"],
num_train_epochs=config["num_epochs"],
bf16=True,
),
train_dataset=load_traces(config["dataset_uri"], tokenizer),
)
trainer = prepare_trainer(trainer)
trainer.add_callback(RayTrainReportCallback())
trainer.train()
TorchTrainer( train_loop_per_worker=train_func, train_loop_config={ “model_name”: “Qwen/Qwen3-7B”, “dataset_uri”: “s3://traces/last-month/”, “output_dir”: “/mnt/cluster_storage/qwen3-7b-sft/”, “lora_r”: 64, “lora_alpha”: 32, “batch_size”: 2, “learning_rate”: 2e-4, “num_epochs”: 1, }, scaling_config=ScalingConfig(num_workers=8, use_gpu=True), run_config=RunConfig(storage_path=“/mnt/cluster_storage”, name=“qwen3-7b-sft”), ).fit()
Swap train_loop for a DPO step, a GRPO step, or a DAPO step and the framing is identical. The SkyRL library on top of Ray Train provides the RL algorithm implementations and the weight-sync coordination shown in the RL env snippet in Stage 4.
Runtime: Ray Data for the heterogeneous CPU and GPU curation, Ray Train for the run (one ScalingConfig from 1 to 64 GPUs, elastic and fault-tolerant). At scale: workload examples/templates to scaffold the job, a workspace you can iterate in from your IDE and promote to production without a rewrite, and spot management to keep training cheap.
Stage 4: Minimal RL env on one workflow. Now we can pick a well-bounded loop like a single category of customer ticket, one class of code task, one robot skill, one fraud workflow, etc. Then we set up the state / action / transition / reward, write the reward carefully, run RL, measure against the eval harness from Stage 2. Trust me that you’ll learn so much more about your workflow this way than by doing countless SFTs.
Here’s the shape of that loop on Ray:
pythonimport ray from ray import serve from ray.train.torch import TorchTrainer from ray.train import ScalingConfig
SkyRL ships the GRPO trainer and rollout/trainer weight-sync coordination;
the real entrypoint is config-driven (skyrl.train.entrypoints.main_base) and
typically launched from the novaskyai/skyrl-train-ray container image.
The two symbols below are pseudo-stand-ins to keep this snippet readable.
1. The policy is served as an autoscaling Ray Serve deployment.
Rollout actors hit this endpoint for every action.
@serve.deployment( ray_actor_options={“num_gpus”: 1}, max_ongoing_requests=16, ) class Policy: def init(self, model_id: str): self.device = torch.device(“cuda”) self.policy = load_policy(model_id, dtype=torch.float16).to(self.device).eval()
async def act(self, state: dict) -> dict:
with torch.no_grad():
batch = self._preprocess(state)
return {"action": self.policy.select_action(batch).cpu().numpy().tolist()}
2. Environment workers run real tool calls in a sandbox,
score the outcome, and emit (state, action, reward) tuples.
For a code agent, the “tool” is a code execution sandbox.
For a fraud agent, it’s a feature store + ground-truth lookup.
@ray.remote(num_cpus=2) class CodeAgentEnv: def init(self, repo_snapshot: str): self.sandbox = Sandbox.from_snapshot(repo_snapshot)
def step(self, state, action):
observation = self.sandbox.run(action) # real tool exec
reward = self._score(observation) # programmatic reward
return observation, reward
def _score(self, obs):
# Reward = unit tests pass + linter clean - tokens used.
# Reward design is where the engineering judgment lives.
return obs.tests_passed - 0.01 * obs.tokens_used
3. Rollout coordinator fans out across many environment actors
in parallel, collects trajectories, ships them to the trainer.
@ray.remote def rollout(policy_handle, env, n_steps: int): trajectory = [] state = env.reset.remote() for _ in range(n_steps): action = ray.get(policy_handle.act.remote(state)) next_state, reward = ray.get(env.step.remote(state, action)) trajectory.append((state, action, reward)) state = next_state return trajectory
4. Trainer consumes trajectories, runs the RL update,
and broadcasts the new weights back to the Serve deployment
via Ray weight sync.
trainer = TorchTrainer( train_loop_per_worker=grpo_step, # provided by SkyRL scaling_config=ScalingConfig(num_workers=64, use_gpu=True), )
for epoch in range(num_epochs): trajectories = ray.get([ rollout.remote(policy, env, n_steps=128) for env in env_pool ]) result = trainer.fit(trajectories) # Real SkyRL is launched via Hydra config: # python -m skyrl.train.entrypoints.main_base –config-path ../config –config-name ppo # The loop shown here is the conceptual shape SkyRL implements internally. sync_weights_to_serve(result.checkpoint, target=Policy) # Ray weight sync
A few things to notice about this shape:
-
Five different kinds of workloads are running on one cluster: serving (Policy), CPU-bound sandboxed execution (CodeAgentEnv), GPU rollout inference, distributed training, and weight broadcast. Each has different hardware needs and needs to be coodinated but this is exactly the heterogeneous, multi-stage GPU orchestration Ray is suited for.
-
Ray weight sync is the load-bearing primitive. Naive RL stacks crash here because the trainer produces new weights but the rollout actors keep serving stale ones, and the policy gradient estimator quietly biases.
-
The sandbox is yours. No frontier API gives you this and this is your IP.
Runtime: Ray Serve (policy), Ray remote actors (the sandboxed env), Ray Train + SkyRL (trainer), and Ray weight sync. At scale: sim-in-the-loop templates and head-node fault tolerance so long rollouts survive node failures.
Stage 5: Production RL loop. Now we can start combining multiple workflows, continuous trace collection, scheduled retraining, automated deployment, regression evals on every policy update. This is the hill-climbing machine! You’ll actually end up changing nuances about how your product actually works / collects traces, etc. to make this all work.
And running an RL loop on a single team’s Ray cluster is one thing, but doing this across many teams is a different job. No matter what you choose for your platform (KubeRay, Anyscale, a homegrown setup, etc.), make sure it gives you:
-
Multi-tenant scheduling so experiments fill spare GPU capacity without preempting production jobs and you get meaningfully more experiment throughput out of the same fleet.
-
Observability: structured events, task and actor aggregates, per-node metrics and hardware profiling. Generic APM tools treat every process the same and won’t tell you why your tasks/actors failed.
-
Cost discipline: instance selection, autoscaling, spot management, image optimization, fast startup.
-
K8s control plane or VM stack that rides on the standards your platform team already runs (KubeRay, Kueue, and similar), not a parallel universe.
-
One agent-first surface: a structured, composable CLI + IDE/workspace integration, so engineers and coding agents operate the same way and you can iterate against a real cluster, then promote to production without a rewrite.
-
Tons of examples and expertise you’ve got enough to figure out so start with blueprints, recipes from your provider on how to do these workloads.
Runtime: Ray Core scheduling (rack-aware placement, scaling to tens of thousands of actors). At scale: everything above but just make sure runtime underneath stays open-source so you can move without rebuilding loops.
Come see all of this in action
Everything we discussed is completely buildable today on OSS ecossytem and infrastructure + patterns already proven by companies in every industry. The teams that start now will compound for years before the rest catch up. If this resonates, the entire program at Ray Summit 2026 is built around it.
-
Physical AI track. Talks from Physical Intelligence, Torc, Bedrock Robotics, and the Ray team on multimodal curation, VLA training, and simulation-driven RL.
-
Agents and RL track. SkyRL deep dives, weight sync internals, customer stories from teams running production RL loops in life sciences, recsys, and code agents.
-
Scale track. Ray Core’s new rack-aware scheduling (and the ~10x scheduling improvements), plus the Ray Data and Ray Serve LLM benchmark work landing this year.
-
Customer keynotes from the companies named throughout this post, with the actual architectures, the actual reward functions, the actual lessons.
🎟️ Free Ray Summit 2026 tickets. The first few groups of people to email [email protected] with the subject line “Stop renting your intelligence” get a free ticket on me! See the full program at anyscale.com/ray-summit/2026 (full lineup will be updated soon).
Own your data. Own your model. Own the loop.
Similar Articles
@rhythmrg: https://x.com/rhythmrg/status/2066561780495896785
The article argues that enterprises should post-train their own custom AI models for mission-critical, high-volume use cases to achieve differentiation, cost savings, and control over tradeoffs, rather than relying solely on general frontier models.
@TheAhmadOsman: That's why:
A tweet by @TheAhmadOsman advocates for open-source AI, arguing that artificial intelligence must remain accessible and community-governed to avoid dependency on closed corporate systems.
Microsoft Satya Nadella says the future is the learning loop. But who really owns it?
Satya Nadella argues companies must own their learning loop, not just AI models. The article warns that dependency on API providers risks losing control, and advocates for building systems that allow model swapping without losing institutional knowledge.
@TheAhmadOsman: Local AI is the future Learning how to run Opensource models (Inference), how to evaluate them systematically (Evals), …
A tweet from @TheAhmadOsman emphasizes that local AI is the future and recommends learning skills like running open-source models, conducting evals, and customizing models through fine-tuning.
@oneill_c: https://x.com/oneill_c/status/2054604986269802579
The article argues that serious AI companies are moving from wrapping general models to training their own specialized models using proprietary interaction data, as specialisation now routinely matches or beats frontier models for in-distribution agentic tasks, driving better unit economics.