@harold_matmul: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)


Summary

The article explains how GEPA (Genetic-Pareto Optimization) within DSPy is used for efficient prompt tuning, specifically applied to pretraining data curation at Microsoft AI, allowing researchers to replace manual prompt engineering with automated compute-driven optimization.


Cached at: 06/24/26, 08:29 PM



Are you still tuning your LLMs by hand? - An ode to GEPA

I think GEPA (also called dspy.GEPA) is still highly under-appreciated by the ML community.

Credit is due to @lateinteraction, @LakshyAAAgrawal (and others). Many thanks to them for being pioneers, and for creating the building blocks of “modern AI engineering”.

I’ve used GEPA successfully in the context of pretraining data curation at Microsoft AI, so I thought I’d give a brief overview of why I reach for this tool :)

Omar Khattab (@lateinteraction), Jun 3: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

Quoting Lakshya A Agrawal (@LakshyAAAgrawal), Jun 3: Excited to see the use of GEPA-optimized LLM judges for data filtering in MAI-Thinking-1 model’s pre-training pipeline! x.com/mustafasuleyma…

If I had to give a one-liner summary for the article:

dspy.GEPA allows me to tune task-specific LLMs by spending compute instead of human time.

As a deep learning researcher, my productivity and impact are somewhat tied to my ability to keep GPUs busy, and to effectively replace my manual labour with compute.

Within that mindset, GEPA (and DSPy) are really great tools.

But it goes beyond that. Structuring your work around GEPA also acts as a good forcing function to have good experiment hygiene, as it rewards you for creating evals for your tasks (which you should always do!).

But I’m getting ahead of myself.

**Why and when do we need it?**

We, the users, want an LLM to do a task well (e.g. classification, ranking, code generation, grading).

How do we approach this task?

The simplest approach is to tweak the prompt. Let’s see why GEPA helps.

The old way:

  • Look at your data, understand the task details (hopefully label and build an eval)

  • Repeat this loop until satisfied:

      • Write the prompt

      • Run inference

      • Analyze failure modes manually (and maybe you have an eval score)

The tuning alone can take a few hours, and is specific to the LLM you picked.

With GEPA:

  • Look at your data, understand the task details (hopefully label and build an eval).

  • Write a dspy.GEPA optimization loop.

  • Run the prompt tuning loop.

  • You now have a task-optimized prompt.

This is pretty nice! The only manual labour needed is to look at the data. The tuning should take a few minutes, and can be easily rerun for any LLM.

How does prompt tuning with GEPA work?

I won’t go into many details, as other people explain it much better (see GEPA website).

At a high level, it is simply a loop optimizing the task prompt, given some grader and a reflection LLM.

Check out this diagram.

Prompt tuning with GEPA in a nutshell


The reflection LLM is usually a very strong LLM, e.g. Opus or GPT-5.5, capable of understanding and summarizing the mistakes made by the main LLM.

To be precise, in this loop, GEPA is the optimizer. Its name stands for “Genetic-Pareto”, which is a good summary of how the optimizer chooses its best candidates.

As one can expect, the optimization loop is only as good as the objective, and thus we get explicitly rewarded when building good evals!
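To make the loop concrete, here is a toy sketch in plain Python. It is not GEPA's implementation and does not use the DSPy API: the task LLM, the grader, and the reflection LLM are all rule-based stubs I made up, and the candidate filter is a simplified reading of "Genetic-Pareto" (keep any prompt that is best on at least one validation example, then mutate one of them using the grader's feedback).

```python
import random
from dataclasses import dataclass

# Toy task: classify a number as "even" or "odd". Everything below is a
# stand-in: a real setup would call LLMs where these functions use rules.

@dataclass
class Graded:
    score: float   # 1.0 if correct, else 0.0
    feedback: str  # text the reflection step can read

def task_llm(prompt: str, x: int) -> str:
    """Stub for the main LLM: only 'knows' parity if the prompt mentions it."""
    if "parity" in prompt:
        return "even" if x % 2 == 0 else "odd"
    return "even"  # an under-specified prompt always answers "even"

def grader(x: int, prediction: str) -> Graded:
    """Stub grader returning a score plus textual feedback."""
    gold = "even" if x % 2 == 0 else "odd"
    if prediction == gold:
        return Graded(1.0, "correct")
    return Graded(0.0, f"for input {x} expected {gold}; check the parity")

def reflection_llm(prompt: str, feedback: list[str]) -> str:
    """Stub for the reflection LLM: rewrites the prompt from the feedback."""
    if any("parity" in f for f in feedback) and "parity" not in prompt:
        return prompt + " Decide by the parity of the number."
    return prompt

def evaluate(prompt, data):
    return [grader(x, task_llm(prompt, x)) for x in data]

def optimize(seed_prompt, train, val, steps=5, seed=0):
    rng = random.Random(seed)
    candidates = [seed_prompt]
    val_scores = {seed_prompt: [g.score for g in evaluate(seed_prompt, val)]}
    for _ in range(steps):
        # Simplified per-example Pareto filter: keep candidates that are
        # best (ties included) on at least one validation example.
        best = [max(s[i] for s in val_scores.values()) for i in range(len(val))]
        front = [c for c in candidates
                 if any(val_scores[c][i] == best[i] for i in range(len(val)))]
        parent = rng.choice(front)
        # Run the parent on a small training batch and collect failure feedback.
        batch = rng.sample(train, k=min(3, len(train)))
        feedback = [g.feedback for g in evaluate(parent, batch) if g.score < 1.0]
        child = reflection_llm(parent, feedback)
        if child not in val_scores:  # only score prompts we have not seen
            candidates.append(child)
            val_scores[child] = [g.score for g in evaluate(child, val)]
    return max(candidates, key=lambda c: sum(val_scores[c]))

train = [1, 2, 3, 4, 5, 7, 9]
val = [10, 11, 12, 13]
best_prompt = optimize("Classify the number.", train, val)
print(best_prompt)
```

In this toy, the seed prompt fails on odd numbers, the feedback mentions parity, and the stubbed reflection step folds that hint into the prompt. The real optimizer is more elaborate, but the division of roles (task LLM, grader, reflection LLM, candidate pool) is the part worth internalizing.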

A note on evals

In the previous paragraphs, the term “eval” is a bit overloaded, as it actually encompasses

  • the train/val samples,

  • but also the grader, which is itself task-specific, and can be a composition of human labels and/or a reward model (this can be anything, but one or multiple LLM rubric graders usually do the trick).

Note that in the loop above, the grader returns both a score & textual feedback.

This is an extra flexibility, allowed by the fact that our optimization is driven by an LLM. This can be useful to give more signal to the optimizer as to which solution is preferred.
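As an illustration, a grader that returns both a score and textual feedback might look like the sketch below. The names (`GradeResult`, `grade_quality_label`) and the 0.25 rationale penalty are hypothetical choices of mine, not part of DSPy or of the pipeline described here; the point is only that the feedback string says why a sample lost points, which a bare number cannot.

```python
from dataclasses import dataclass

@dataclass
class GradeResult:
    score: float   # in [0, 1], used to rank candidate prompts
    feedback: str  # free text, meant to be read by the reflection LLM

def grade_quality_label(predicted: str, human_label: str, rationale: str) -> GradeResult:
    """Hypothetical grader for a good/bad page-quality classifier."""
    notes = []
    # Main signal: agreement with the human label.
    score = 1.0 if predicted == human_label else 0.0
    if score == 0.0:
        notes.append(f"Predicted '{predicted}' but the human label is '{human_label}'.")
    # Small rubric-style penalty: a decision without a rationale is harder to audit.
    if not rationale.strip():
        score = max(0.0, score - 0.25)
        notes.append("No rationale was given for the decision.")
    return GradeResult(score, " ".join(notes) or "Matches the human label.")

result = grade_quality_label("good", "bad", "")
print(result.score, "|", result.feedback)
```

The same shape works when the score comes from an LLM rubric grader instead of human labels: the rubric's explanation becomes the feedback.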

Use-cases

To motivate the reader, I’m listing here a few use-cases where I’ve used dspy.GEPA.

  • Quality classification of web pages. This is mentioned in the MAI-Thinking-1 paper. For STEM web pages, the grader was an LLM rubric grader judging formatting and reasoning. For code web pages, the grader was mainly human labels.

  • Human+GEPA in the loop interface to bootstrap a few thousand labels for training an embedding classifier.

  • **Reverse-engineering human preferences/priors that are under-specified.** You can label 100 samples (good/bad), then tune GPT-5.5 to match human labels. You can then read the prompt to see if your priors can be made explicit.

Why prompt tuning?

One may ask the very legitimate question: “Why should I use prompt tuning? Why not finetune a model, or use an embedding-based classifier?”

The answer to this boils down to how much these 3 factors matter to you:

  • Quality

  • Time to first solution

  • Scale (or cost)

The first two factors are usually the most important to me, and prompt tuning with an LLM ticks those two boxes easily.

Indeed, we get:

  • Tune strong external models without hosting them.

  • Flexibility. We can switch the main LLM and rerun the optimization loop very easily, which allows us to compute a Pareto frontier of quality vs cost.

  • Ease of use. While running a pipeline for the task, with the same model, one can switch between many different prompts on the fly to accommodate different distributions of the data.

  • No train-inference mismatch. Assuming I want to run LLM inference at large scale, I can pick my inference settings in advance (nvfp4, fp8 kv cache, specific kernels), and then run the GEPA tuning with those specific numerics.

  • **We can use reasoning.** This lowers throughput significantly, but it can be necessary for hard problems, such as evaluating math correctness. This is not possible with more scalable methods like embedding-based classifiers.

  • The optimization loop is fast. This depends on how exhaustive the search is, but running a GEPA tuning loop can take a few minutes to a few hours.

Thanks to the above flexibility, I can get a solution for my task within a day, get a quality-cost Pareto frontier for my task, maybe run an ablation, and then decide whether I need to scale things up.
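For reference, computing a quality-cost Pareto frontier from a handful of tuning runs takes only a few lines. The sketch below uses made-up model names and numbers purely for illustration; in practice each row would be one model's eval score after rerunning the tuning loop, paired with its inference cost.

```python
from typing import NamedTuple

class Run(NamedTuple):
    model: str      # hypothetical model name
    cost: float     # e.g. dollars per million documents (made-up numbers)
    quality: float  # eval score of the tuned prompt on the validation set

def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no cheaper-or-equal run matches or beats on quality."""
    frontier = []
    best_quality = float("-inf")
    # Sort by cost ascending, and by quality descending among equal costs.
    for run in sorted(runs, key=lambda r: (r.cost, -r.quality)):
        if run.quality > best_quality:
            frontier.append(run)
            best_quality = run.quality
    return frontier

runs = [
    Run("small-quantized", 1.0, 0.71),
    Run("medium", 4.0, 0.70),
    Run("medium-reasoning", 9.0, 0.83),
    Run("large", 20.0, 0.86),
    Run("large-reasoning", 60.0, 0.85),
]
frontier = pareto_frontier(runs)
for run in frontier:
    print(run.model, run.cost, run.quality)
```

Runs off the frontier (here "medium" and "large-reasoning") cost more than some alternative without scoring higher, so they can be dropped before deciding what to scale up.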

We can reduce costs, for example, by distilling the tuned LLM decisions into an embedding-based classifier, or by using smaller quantized LLMs.
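The distillation idea can be sketched with stand-ins: the tuned LLM acts as a teacher that labels unlabeled text, and a cheap classifier on top of embeddings is fit to those labels. Everything here is a toy assumption: `teacher_label` is a keyword rule standing in for the tuned LLM, `embed` is a word-count vector standing in for a real embedding model, and nearest-centroid stands in for whatever classifier you would actually train.

```python
from collections import defaultdict

def teacher_label(text: str) -> str:
    """Stub for the tuned LLM + prompt (the expensive teacher)."""
    return "good" if "proof" in text or "theorem" in text else "bad"

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: counts of a few marker words."""
    vocab = ["proof", "theorem", "lemma", "buy", "click", "sale"]
    words = text.lower().split()
    return [float(words.count(v)) for v in vocab]

def fit_centroids(texts):
    """Student: one mean embedding per teacher label (nearest-centroid)."""
    sums, counts = {}, defaultdict(int)
    for t in texts:
        label, vec = teacher_label(t), embed(t)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(centroids, text):
    """Assign the label whose centroid is closest in squared distance."""
    vec = embed(text)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

unlabeled = [
    "proof of the theorem follows from the lemma",
    "theorem and lemma with a short proof",
    "click here to buy now big sale",
    "sale sale click to buy",
]
centroids = fit_centroids(unlabeled)  # the teacher is only called here
student_guess = predict(centroids, "a lemma used in the proof")
print(student_guess)
```

Once the student is fit, inference no longer calls the teacher at all, which is where the cost saving comes from. How much quality survives the distillation is something to measure with the same eval used for tuning.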

Conclusion

Dear reader, if you’re here, thank you! I hope this article was able to convince you that there is some value in GEPA, and maybe to rethink your workflows :)

If you have questions, feel free to reach out. My DMs are always open to discuss interesting research and engineering problems.
