prompt caching, but for rl training - 7.5x speedup on long-prompt/short-response workloads

Reddit r/LocalLLaMA 05/11/26, 09:01 PM Tools

Summary

A new optimization technique for open-source RL training engines introduces prompt caching during training, achieving up to 7.5x speedup on long-prompt, short-response workloads by reducing redundant compute.

most open source RL engines pack sequences naively: prompt + response, repeated for every sample in the group. this is fine for short prompt, long completion workloads but inefficient for long prompt, short completion workloads. with 1000-token prompts and 100-token responses at G=8, you're processing 8800 tokens when only 1800 are unique. about 5x wasted compute. the fix is conceptually simple: compute the prompt once, then compute all G responses after it. it's analagous to inference prefix caching, except training needs gradients to flow back through the prompt, which breaks causal attention in the obvious implementation. getting it right required different tricks for full vs. linear attention layers. you can read about it in the blogpost in the comments. Numbers on Qwen3.5-4B: \- 16k prompt / 64 out → 7.5x \- 16k / 128 → 7.3x \- 16k / 1k → 5.4x \- 8k / 4k → 1.7x

Original Article

Similar Articles

Prompt Caching in the API

OpenAI Blog

OpenAI introduces Prompt Caching, an automatic feature that reduces API costs by 50% and improves latency by reusing recently cached input tokens on GPT-4o, GPT-4o mini, o1-preview, and o1-mini models. The feature automatically applies to prompts longer than 1,024 tokens without requiring developer integration changes.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv cs.LG

Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.

CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

arXiv cs.CL

CacheRL trains small agent foundation models for multi-step tool-calling tasks, achieving 92% process accuracy (approaching GPT-5's 94%) with 100x less compute using cached rollouts and hybrid reward shaping, with innovations in knowledge transfer, cache-aware rewards, and iterative SFT/GRPO training.

Building Fast & Accurate Agents with Prime-RL Post Training (22 minute read)

TLDR AI

Ramp presents a case study on using reinforcement learning post-training to build Fast Ask, a specialized spreadsheet retrieval agent that improves accuracy and reduces latency compared to general-purpose models.

How I easily cut my input token burn ~90% on long agent runs

Reddit r/AI_Agents

The author shares a practical tip to reduce input token costs by ~90% on long agent runs using prompt caching: placing unchanged text (system prompt, tool definitions, context) at the start of every prompt to leverage cached prefixes from LLM providers.