@harold_matmul: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)
Summary
The article explains how GEPA (Genetic-Pareto Optimization) within DSPy is used for efficient prompt tuning, specifically applied to pretraining data curation at Microsoft AI, allowing researchers to replace manual prompt engineering with automated compute-driven optimization.
Are you still tuning your LLMs by hand? - An ode to GEPA
I think GEPA (also called dspy.GEPA) is still highly under-appreciated by the ML community.
Credit is due to @lateinteraction, @LakshyAAAgrawal (and others). Many thanks to them for being pioneers and creating the building blocks of “modern AI engineering”.
I’ve used GEPA successfully in the context of pretraining data curation at Microsoft AI, so I thought I’d give a brief overview of why I reach for this tool :)
Omar Khattab (@lateinteraction), Jun 3: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)
Quoting Lakshya A Agrawal (@LakshyAAAgrawal), Jun 3: Excited to see the use of GEPA-optimized LLM judges for data filtering in MAI-Thinking-1 model’s pre-training pipeline! x.com/mustafasuleyma…
If I had to give a one-liner summary for the article:
dspy.GEPA allows me to tune task-specific LLMs by spending compute instead of human time.
As a deep learning researcher, my productivity and impact are somewhat tied to my ability to keep GPUs busy, and to replace my manual labour with compute.
Within that mindset, GEPA (and DSPy) are really great tools.
But it goes beyond that. Structuring your work around GEPA also acts as a good forcing function to have good experiment hygiene, as it rewards you for creating evals for your tasks (which you should always do!).
But I’m getting ahead of myself.
**Why and when do we need it?**
We, the user, want an LLM to do a task well (e.g. classification, ranking, code generation, grading).
How do we approach this task?
The simplest approach is to tweak the prompt. Let’s see why GEPA helps.
The old way
- Look at your data, understand the task details (hopefully label and build an eval).
- Repeat this loop until satisfied:
  - Write the prompt
  - Run inference
  - Analyze failure modes manually (and maybe you have an eval score)
The tuning alone can take a few hours, and is specific to the LLM you picked.
With GEPA:
- Look at your data, understand the task details (hopefully label and build an eval).
- Write a dspy.GEPA optimization loop.
- Run the prompt tuning loop.
- You now have a task-optimized prompt.
This is pretty nice! The only manual labour needed is to look at the data. The tuning should take a few minutes, and can be easily rerun for any LLM.
How does prompt tuning with GEPA work?
I won’t go into many details, as other people explain it much better (see GEPA website).
At a high level, it is simply a loop optimizing the task prompt, given some grader and a reflection LLM.
Check out this diagram.
Prompt tuning with GEPA in a nutshell
The reflection LLM is usually a very strong LLM, e.g. Opus or GPT-5.5, capable of understanding and summarizing the mistakes made by the main LLM.
To be precise, GEPA is the optimizer in this loop. Its name stands for “Genetic-Pareto”, which is a good summary of how the optimizer chooses its best candidates.
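To make the “Pareto” part concrete, here is a minimal sketch in plain Python. It is not GEPA's actual implementation, and the function names and scores are made up for illustration. The idea it shows: instead of keeping only the prompt with the best average, keep every prompt that is the best on at least one validation example, then sample the next parent among those, weighted by how many examples each one wins.

```python
import random


def pareto_wins(scores):
    """Count, for each candidate prompt, the examples on which it ties for best.

    scores maps a candidate name to its per-example scores (all lists have
    the same length). Candidates that never win are dropped.
    """
    n_examples = len(next(iter(scores.values())))
    wins = {name: 0 for name in scores}
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] += 1
    return {name: count for name, count in wins.items() if count > 0}


def sample_parent(scores, rng):
    """Pick the next candidate to mutate, weighted by its number of wins."""
    wins = pareto_wins(scores)
    names = sorted(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Hypothetical per-example scores for three candidate prompts.
scores = {
    "prompt_a": [1.0, 0.0, 1.0, 1.0],
    "prompt_b": [0.0, 1.0, 0.0, 1.0],
    "prompt_c": [0.5, 0.0, 0.5, 0.5],
}
front = pareto_wins(scores)
parent = sample_parent(scores, random.Random(0))
```

Here prompt_b survives even though prompt_a has the better average, because prompt_b is the only candidate that solves the second example. prompt_c never wins an example, so it is dropped.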
As one can expect, the optimization loop is only as good as the objective, and thus we get explicitly rewarded when building good evals!
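The loop described above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: `task_llm` and `reflection_llm` are hypothetical keyword-based stand-ins for real LLM calls, the data is made up, and the search is a greedy hill-climb on a single dataset rather than GEPA's Pareto-based selection with separate train and val sets. It only shows the shape of the loop: run, grade, reflect on the feedback, propose a new prompt, keep it if it scores better.

```python
from dataclasses import dataclass


@dataclass
class Graded:
    score: float
    feedback: str


def task_llm(prompt, text):
    """Stand-in for the main LLM: answers 'stem' if a prompt keyword appears."""
    keywords = [k.strip() for k in prompt.split(":", 1)[1].split(",") if k.strip()]
    return "stem" if any(k in text.lower() for k in keywords) else "other"


def grader(example, prediction):
    """Compare with the human label; return a score and textual feedback."""
    text, label = example
    if prediction == label:
        return Graded(1.0, "")
    return Graded(0.0, f"expected {label} for: {text}")


def reflection_llm(prompt, feedbacks):
    """Stand-in for the reflection LLM: adds the longest word of the first
    missed STEM page as a new keyword. It only handles misses, not false alarms."""
    marker = "expected stem for: "
    for fb in feedbacks:
        if fb.startswith(marker):
            new_keyword = max(fb[len(marker):].lower().split(), key=len)
            return f"{prompt}, {new_keyword}"
    return prompt


def evaluate(prompt, dataset):
    results = [grader(ex, task_llm(prompt, ex[0])) for ex in dataset]
    mean = sum(r.score for r in results) / len(results)
    return mean, [r.feedback for r in results if r.feedback]


def optimize(seed_prompt, dataset, budget=5):
    best_prompt = seed_prompt
    best_score, feedbacks = evaluate(best_prompt, dataset)
    for _ in range(budget):
        if not feedbacks:
            break  # nothing left to fix
        candidate = reflection_llm(best_prompt, feedbacks)
        score, candidate_feedbacks = evaluate(candidate, dataset)
        if score > best_score:  # keep the candidate only if it improves
            best_prompt, best_score, feedbacks = candidate, score, candidate_feedbacks
    return best_prompt, best_score


dataset = [
    ("Proof of the theorem", "stem"),
    ("Celebrity gossip roundup", "other"),
    ("Solving the integral step by step", "stem"),
    ("Best pizza in town", "other"),
]
seed = "Label the page as stem or other. keywords: equation"
best_prompt, best_score = optimize(seed, dataset)
```

Even in this toy, the quality of the result depends entirely on what the grader rewards and on how informative its feedback is.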
A note on evals
In the previous paragraphs, the term “eval” is a bit overloaded, as it actually encompasses:
- the train/val samples,
- the grader, which is itself task-specific and can be a composition of human labels and/or a reward model (this can be anything, but one or more LLM rubric graders usually do the trick).
Note that in the loop above, the grader returns both a score & textual feedback.
This extra flexibility comes from the fact that our optimization is driven by an LLM. The feedback gives the optimizer more signal about which solution is preferred.
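As an illustration, here is a minimal sketch of a grader that combines a human label with a rubric check and returns both a score and feedback. It is plain Python, not the dspy API. The `rubric_grader` is a hypothetical rule-based stand-in for an LLM rubric grader, and the 0.7 weight is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class GradeResult:
    score: float
    feedback: str


def label_grader(predicted, human_label):
    """Exact-match check against a human label."""
    if predicted == human_label:
        return GradeResult(1.0, "Matches the human label.")
    return GradeResult(
        0.0, f"Predicted '{predicted}' but the human label is '{human_label}'."
    )


def rubric_grader(rationale, criteria=("formatting", "reasoning")):
    """Stand-in for an LLM rubric grader: checks that the rationale
    addresses each criterion by name."""
    missing = [c for c in criteria if c not in rationale.lower()]
    score = 1.0 - len(missing) / len(criteria)
    if missing:
        return GradeResult(score, "Rationale does not address: " + ", ".join(missing) + ".")
    return GradeResult(score, "Rationale covers all rubric criteria.")


def combined_grader(predicted, human_label, rationale, label_weight=0.7):
    """Weighted mix of both graders; the feedback strings are concatenated."""
    label_part = label_grader(predicted, human_label)
    rubric_part = rubric_grader(rationale)
    score = label_weight * label_part.score + (1 - label_weight) * rubric_part.score
    return GradeResult(score, f"{label_part.feedback} {rubric_part.feedback}")


result = combined_grader("high", "low", "The formatting is clean.")
```

The score alone says this sample went badly. The feedback says why: the label is wrong and the rationale never discusses reasoning. That is the part a reflection LLM can act on.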
Use-cases
To motivate the reader, I’m listing here a few use-cases where I’ve used dspy.GEPA.
- Quality classification of web pages. This is mentioned in the MAI-Thinking-1 paper. For STEM web pages, the grader was an LLM rubric grader judging formatting and reasoning. For code web pages, the grader was mainly human labels.
- A human+GEPA-in-the-loop interface to bootstrap a few thousand labels for training an embedding classifier.
- **Reverse-engineering human preferences/priors that are under-specified.** You can label 100 samples (good/bad), then tune GPT-5.5 to match human labels. You can then read the prompt to see if your priors can be made explicit.
Why prompt tuning?
One may ask the very legitimate question, “Why should I use prompt tuning, why not finetune a model, or use an embedding-based classifier?”.
The answer to this boils down to how much these 3 factors matter to you:
- Quality
- Time to first solution
- Scale (or cost)
The first two factors are usually the most important to me, and prompt tuning with an LLM ticks those two boxes easily.
Indeed, we get:
- Tune strong external models without hosting them.
- Flexibility. We can switch the main LLM and rerun the optimization loop very easily, which allows us to compute a Pareto frontier of quality vs cost.
- Ease of use. While running a pipeline for the task, for the same model, one can switch between many different prompts on the fly to accommodate different distributions of the data.
- No train-inference mismatch. Assuming I want to run LLM inference at large scale, I can pick my inference settings in advance (nvfp4, fp8 kv cache, specific kernels), and then run the GEPA tuning with those specific numerics.
- **We can use reasoning.** This lowers throughput significantly, but it can be necessary for hard problems, such as evaluating math correctness. This is not possible with more scalable methods like embedding-based classifiers.
- The optimization loop is fast. This depends on how exhaustive the search is, but running a GEPA tuning loop can take a few minutes to a few hours.
Thanks to the above flexibility, I can get a solution for my task within a day, get a quality-cost Pareto frontier for my task, maybe run an ablation, and then decide whether I need to scale things up.
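For the quality-cost Pareto frontier, a minimal sketch, assuming you have rerun the tuning loop once per model and recorded a cost and an eval score for each run. The model names and numbers below are made up for illustration.

```python
def pareto_frontier(runs):
    """Keep the runs that no other run beats on both cost and quality.

    runs is a list of (name, cost, quality) tuples. A run is dominated if
    another run is at least as cheap and at least as good, and strictly
    better on one of the two.
    """
    frontier = []
    for name, cost, quality in runs:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in runs
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda run: run[1])


# Hypothetical results: (tuned model, relative cost, eval score).
runs = [
    ("small-model", 1.0, 0.78),
    ("small-model-reasoning", 4.0, 0.85),
    ("medium-model", 5.0, 0.83),
    ("large-model", 20.0, 0.91),
]
frontier = pareto_frontier(runs)
frontier_names = [name for name, _, _ in frontier]
```

In this made-up example, medium-model drops out because small-model-reasoning is both cheaper and better. The remaining three are the real options to choose between.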
We can reduce costs, for example, by distilling the tuned LLM's decisions into an embedding-based classifier, or by using smaller quantized LLMs.
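A minimal sketch of the distillation idea, with loud assumptions: the 2-D vectors stand in for real embeddings, the labels stand in for decisions made by the tuned LLM, and a nearest-centroid rule stands in for whatever classifier you would actually train (for example logistic regression on real embeddings).

```python
import math


def fit_centroids(embeddings, llm_labels):
    """Average the embeddings of each class, using the LLM's labels as targets."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, llm_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label of the closest class centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy embeddings and the (hypothetical) decisions of the tuned LLM on them.
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
llm_labels = ["keep", "keep", "drop", "drop"]

centroids = fit_centroids(embeddings, llm_labels)
new_page_label = predict(centroids, (0.7, 0.3))
```

The expensive LLM labels a sample once, and the cheap classifier then handles the bulk of the data. It is worth checking the classifier against held-out LLM labels before trusting it at scale.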
Conclusion
Dear reader, if you’re here, thank you! I hope this article was able to convince you that there is some value in GEPA, and maybe to rethink your workflows :)
If you have questions, feel free to reach out. My DMs are always open to discuss interesting research and engineering problems.