Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
Summary
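The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) for modular AI systems that chain multiple language model calls with distinct prompt templates. Composed with automatic prompt optimization, it reports an average 11% accuracy improvement over the post-trained LM across classification, many-hop search, and privacy-preserving delegation tasks, and a 5% improvement over prompt optimization alone. The implementation is open-sourced in DSPy as the dspy.GRPO optimizer.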
Source: https://huggingface.co/papers/2508.04660
Abstract
mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
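The grouping idea in the abstract can be illustrated with a small, library-free sketch. This is not the dspy.GRPO implementation: the function names, the rollout dictionary layout, and the toy data are all hypothetical. It assumes that every LM call inherits the reward of the rollout it came from, that an interrupted rollout still receives a reward (0.0 here), and that advantages use the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_calls_by_module(rollouts):
    """Collect LM calls from all rollouts into one group per module.

    Each rollout is a dict with a scalar "reward" and a list of
    (module_name, completion) "calls". Every call is credited with its
    rollout's reward (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for module_name, completion in rollout["calls"]:
            groups[module_name].append((completion, rollout["reward"]))
    return dict(groups)


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization within a single group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rollouts for one input of a hypothetical two-module program.
rollouts = [
    # Two search hops, then an answer: a longer trajectory.
    {"reward": 1.0, "calls": [("search", "query A"), ("search", "query B"), ("answer", "Paris")]},
    # One search hop, then an answer.
    {"reward": 0.0, "calls": [("search", "query C"), ("answer", "Lyon")]},
    # Interrupted before the answer module ran (reward of 0.0 is assumed).
    {"reward": 0.0, "calls": [("search", "query D")]},
]

module_groups = group_calls_by_module(rollouts)
module_advantages = {
    name: group_relative_advantages([reward for _, reward in calls])
    for name, calls in module_groups.items()
}
```

Because groups are formed per module rather than per rollout, the `search` group here holds four calls while the `answer` group holds two. The interrupted rollout still contributes a training signal to `search` without requiring a matching `answer` call.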
Similar Articles
BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
BiasGRPO proposes a framework using Group Relative Policy Optimization (GRPO) to stabilize social bias mitigation in LLMs by normalizing rewards across sampled completions, outperforming DPO and PPO on multiple benchmarks. The authors also release a compute-efficient bias reward model designed for integration into multi-objective RLHF pipelines.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
N-GRPO introduces semantic neighbor mixing in the GRPO framework to enhance mathematical reasoning diversity while preserving semantic consistency, achieving improvements on math benchmarks and out-of-distribution tasks.
GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
GD^2PO introduces a conflict-aware filtering mechanism to mitigate multi-reward conflicts in reinforcement learning for large language models, preventing signal cancellation and improving training efficiency.
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
F-GRPO proposes a factorized group-relative policy optimization framework that unifies candidate generation and ranking in a single autoregressive LLM, addressing credit assignment issues and improving top-ranked performance across sequential recommendation and multi-hop QA benchmarks.