Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Papers with Code Trending 08/06/25, 05:28 PM Papers

Summary

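The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) for modular AI systems that combine multiple language model calls with distinct prompt templates. Composed with automatic prompt optimization, mmGRPO improves accuracy by 11% on average against the post-trained LM across classification, many-hop search, and privacy-preserving delegation tasks, and by 5% against prompt optimization on its own. An open-source implementation is available in DSPy as the dspy.GRPO optimizer.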

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
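The grouping step described in the abstract can be illustrated with a small sketch. This is not the dspy.GRPO implementation: the rollout format, the function names, and the choice to let every LM call inherit its rollout's final reward are assumptions made for illustration only. It shows how calls might be grouped by module across rollouts of different lengths, including a rollout that stops early, and how a standard group-relative advantage (reward minus group mean, divided by group standard deviation) could then be computed per module.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_calls_by_module(rollouts):
    """Collect LM calls from all rollouts into one group per module.

    Assumption (hypothetical format): each rollout is a dict with a scalar
    "reward" and a list of "calls", each call being
    (module_name, prompt, completion). Every call inherits the final
    reward of the rollout it came from.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for module_name, prompt, completion in rollout["calls"]:
            groups[module_name].append((prompt, completion, rollout["reward"]))
    return dict(groups)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # 0.0 when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts of the same hypothetical two-module program.
rollouts = [
    # Two search hops, then an answer.
    {"reward": 1.0, "calls": [("generate_query", "q0", "a"),
                              ("generate_query", "q1", "b"),
                              ("answer", "ctx", "x")]},
    # One search hop, then an answer (variable length).
    {"reward": 0.0, "calls": [("generate_query", "q0", "c"),
                              ("answer", "ctx", "y")]},
    # Interrupted before the answer module ran; reward 0.0 is an assumed fallback.
    {"reward": 0.0, "calls": [("generate_query", "q0", "d")]},
]

groups = group_calls_by_module(rollouts)
advantages = {
    module: group_relative_advantages([reward for _, _, reward in calls])
    for module, calls in groups.items()
}

for module, values in advantages.items():
    print(module, [round(v, 3) for v in values])
```

In this sketch the interrupted rollout still contributes its one completed call to the generate_query group, while the answer group only contains calls from rollouts that reached that module. How the actual optimizer forms groups and assigns rewards to partial trajectories should be checked against the paper and the DSPy source.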

Original Article

Cached at: 05/08/26, 09:08 AM

Source: https://huggingface.co/papers/2508.04660

Abstract

mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.

Similar Articles

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Gradient Extrapolation-Based Policy Optimization

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
