prompt caching, but for rl training - 7.5x speedup on long-prompt/short-response workloads
Summary
A new optimization technique for open-source RL training engines introduces prompt caching during training, achieving up to 7.5x speedup on long-prompt, short-response workloads by reducing redundant compute.
Similar Articles
Prompt Caching in the API
OpenAI introduces Prompt Caching, an automatic feature that reduces API costs by 50% and improves latency by reusing recently cached input tokens on GPT-4o, GPT-4o mini, o1-preview, and o1-mini models. The feature automatically applies to prompts longer than 1,024 tokens without requiring developer integration changes.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.
CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward
CacheRL trains small agent foundation models for multi-step tool-calling tasks, achieving 92% process accuracy (approaching GPT-5's 94%) with 100x less compute using cached rollouts and hybrid reward shaping, with innovations in knowledge transfer, cache-aware rewards, and iterative SFT/GRPO training.
Building Fast & Accurate Agents with Prime-RL Post Training (22 minute read)
Ramp presents a case study on using reinforcement learning post-training to build Fast Ask, a specialized spreadsheet retrieval agent that improves accuracy and reduces latency compared to general-purpose models.
How I easily cut my input token burn ~90% on long agent runs
The author shares a practical tip to reduce input token costs by ~90% on long agent runs using prompt caching: placing unchanged text (system prompt, tool definitions, context) at the start of every prompt to leverage cached prefixes from LLM providers.