@leanxbt: https://x.com/leanxbt/status/2070852461494202609

X AI KOLs Timeline 06/27/26, 12:51 PM Tools

prompt-engineering loop-optimization evaluation agents llm ai-tooling automated-prompt-tuning

Summary

A detailed article introducing Loop Prompt Engineering, a method to automate prompt optimization by iteratively rewriting prompts based on evaluation against a dataset, with emphasis on avoiding recursive traps.

https://t.co/mrIJXQy3Lc

Original Article

Cached at: 06/28/26, 06:01 AM

Loop Prompt Engineering: A New Approach to Optimizing Prompts

There is a task engineers spend hours on by hand and almost always lose: tuning a prompt. You change the wording, run it on a few examples, see if it got better or worse, change it again. It is slow, subjective, and you are holding three examples out of fifty in your head. And this is exactly the kind of work that has a machine check: an eval dataset that returns a number. Which means it can be wrapped in a loop.

The idea is simple to say: the loop rewrites the prompt itself, runs it against a set of examples, scores it, and repeats until the score is above a threshold. You set the goal once - “accuracy above 0.9 on this dataset” - and step out of the circuit. The machine searches for the wording that gets there on its own.

But this is exactly where loop engineering shows its teeth. Because this is the cleanest case where the main rule - carry judgment outside - breaks from the inside. When you fix tests, the oracle is iron: an exit code, nothing to argue with. When you optimize a prompt, both the thing being optimized and the check are text for the same model. The judge sits inside the same system it judges. This is a recursive trap, and most of this article is about not falling into it.

Why this is a loop and not a script

You could object: this is just a search, a loop with string replacement, what do agents have to do with it. The difference is who decides how to change the prompt.

In a dumb search, you write the variants: a list of ten wordings, run each, take the best. That is grid search, and it is capped by your imagination - you only try what you thought of.

In a loop, the next variant is written by the agent, and written based on why the previous one failed. The loop does not just measure score, it reads which examples the prompt got wrong and rewrites the wording around those errors. That is no longer grid search, it is directed descent: each iteration is a hypothesis about why the current prompt is weak, and an attempt to close it. That “read the failures and rewrite around them” is what the agent does and a script does not.

Step 0: An eval that returns a number, not an opinion

The same step-zero filter as in any loop: without a check independent of the agent, there is no loop. But for prompts it is stricter, because text quality is soft by default, and the loop needs a hard number.

An eval dataset is a set of input-expectation pairs. The harder your way of comparing an answer to the expectation, the more reliable the whole loop. The reliability scale, top to bottom:

Exact match or regex - the answer either equals the reference or not.
Classification - the model picks from a fixed list of labels, compared to the correct one.
Verifiable property - the answer parses as JSON, a number lands in a range, code passes a test.
And only at the very bottom, a judge-model that scores quality, because there is no other way.

python
# eval_set.jsonl - each line is one example
{"input": "Does this function return None on an empty list?\n\ndef first(xs): return xs[0]", "expected": "no", "type": "exact"}
{"input": "Classify the ticket: 'password reset email never arrives'", "expected": "auth", "type": "label", "labels": ["auth", "billing", "ui", "other"]}
{"input": "Extract the amount from: 'payment of $4,200 went through'", "expected": "4200", "type": "exact"}

The rule: drag the task as high up this scale as you can. Every step from “a judge scores it” toward “exact match” removes one source of noise and one hole the loop will later cheat through. If all you have is a subjective “the answer is good,” first figure out how to make “good” verifiable, and only then build the loop.
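The upper rungs of the scale can be sketched as a single dispatch function, which is one plausible shape for the grade helper that the eval code relies on. This is a minimal sketch, not the author's implementation: the "regex", "json" and "range" type names and the "pattern", "min" and "max" fields are assumptions added for illustration and do not appear in the eval_set.jsonl example.

```python
import json
import re


def grade(answer: str, ex: dict) -> bool:
    """Return True if `answer` passes the check described by example `ex`.

    Assumption: "regex", "json" and "range" are hypothetical check types,
    and "pattern", "min", "max" are hypothetical fields of the example.
    """
    kind = ex["type"]
    text = answer.strip()
    if kind in ("exact", "label"):
        # Case-insensitive equality with the reference answer.
        return text.lower() == ex["expected"].strip().lower()
    if kind == "regex":
        # The whole answer must match the pattern, not just a substring.
        return re.fullmatch(ex["pattern"], text) is not None
    if kind == "json":
        # Verifiable property: the answer parses as JSON.
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
        return True
    if kind == "range":
        # Verifiable property: the answer is a number inside [min, max].
        try:
            value = float(text.replace(",", ""))
        except ValueError:
            return False
        return ex["min"] <= value <= ex["max"]
    raise ValueError(f"unknown check type: {kind!r}")
```

Raising on an unknown type, rather than returning False, keeps a typo in the dataset from silently lowering the score.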

And the same determinism requirement as everywhere: run the eval twice on one prompt. If the score jumps, you have a flaky check, and the loop will chase noise. Pin the model temperature to zero on eval runs, or you are optimizing not the prompt but the luck of sampling.
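The "run it twice" check can be automated before the loop is started. The sketch below assumes an evaluate callable that takes a prompt and returns a score; the two stand-in evaluators are hypothetical and exist only to make the example runnable without a model.

```python
import itertools


def score_spread(evaluate, prompt: str, runs: int = 3) -> float:
    """Run the same eval several times; return max score minus min score."""
    scores = [evaluate(prompt) for _ in range(runs)]
    return max(scores) - min(scores)


# Hypothetical stand-ins for a real eval, used only for illustration.
def stable_eval(prompt: str) -> float:
    return 0.80


_noisy_scores = itertools.cycle([0.78, 0.84, 0.80])


def flaky_eval(prompt: str) -> float:
    return next(_noisy_scores)


print(score_spread(stable_eval, "some prompt"))  # zero spread: safe to loop on
print(score_spread(flaky_eval, "some prompt"))   # non-zero spread: flaky check
```

A spread comparable to the improvement you expect per iteration means the loop cannot tell progress from noise.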

Step 1: One manual run and an honest baseline

Before spinning anything, run the starting prompt through the whole eval by hand and record the number. This is your baseline, and without it you cannot tell the loop’s work from self-deception.

python
# eval.py - runs one prompt over the whole dataset, returns a score
import statistics

def run_eval(prompt: str, dataset: list, call_model) -> dict:
    results = []
    for ex in dataset:
        answer = call_model(prompt, ex["input"], temperature=0)
        if ex["type"] in ("exact", "label"):
            ok = answer.strip().lower() == ex["expected"].strip().lower()
        else:
            ok = grade(answer, ex)  # verifiable property; grade() is defined elsewhere
        results.append({"input": ex["input"], "answer": answer,
                        "expected": ex["expected"], "ok": ok})
    score = statistics.mean(r["ok"] for r in results)  # share of passed examples
    fails = [r for r in results if not r["ok"]]
    return {"score": score, "fails": fails, "n": len(results)}

Record the starting score and, more importantly, look at the failures with your own eyes. If the baseline is already 0.95, the loop has almost nothing to improve and you will waste money. If the baseline is 0.3, maybe what is broken is not the prompt but the eval itself. A manual look at this step catches both troubles before they get multiplied by a hundred iterations.

Step 2: The minimal optimization loop

The skeleton is the same as in any loop: check first, agent action second, state on disk. Only the “action” here is not “fix the code” but “rewrite the prompt around the failures.”

python
#!/usr/bin/env python3
# optimize_prompt.py - loop improves the prompt until score is above threshold
from eval import run_eval  # run_eval from eval.py in Step 1

MAX_ITER = 15
THRESHOLD = 0.90

def optimize(seed_prompt, dataset, call_model, propose):
    best = {"prompt": seed_prompt, "score": -1.0}

    for i in range(1, MAX_ITER + 1):
        current = best["prompt"]  # starts as seed_prompt
        result = run_eval(current, dataset, call_model)
        print(f"iter {i}: score={result['score']:.3f} "
              f"({result['n'] - len(result['fails'])}/{result['n']})")

        # keep the best seen prompt, not the last one
        # (recorded before the exit check, so the returned score is never the -1.0 placeholder)
        if result["score"] > best["score"]:
            best = {"prompt": current, "score": result["score"]}

        # VERIFY first: already above threshold, exit
        if result["score"] >= THRESHOLD:
            print(f"Hit {THRESHOLD} at iter {i}.")
            return best

        # agent action: rewrite the prompt LOOKING at the failures
        new_prompt = propose(current, result["fails"], call_model)
        cand_score = run_eval(new_prompt, dataset, call_model)["score"]
        if cand_score > best["score"]:
            best = {"prompt": new_prompt, "score": cand_score}

    print(f"Iter cap {MAX_ITER}. Best score {best['score']:.3f}")
    return best

Two details that are easy to miss and then be surprised by. First: the loop keeps the best seen prompt, not the last one. Prompt optimization is not monotonic - an iteration easily makes things worse, and if you blindly take the last variant you drift downhill. Second: a candidate is accepted only if its score is actually higher. That turns the loop from a random walk over wordings into a climb that does not roll back.

Step 2.5: How the agent rewrites the prompt

The meatiest part is the propose function. The dumb version - “here is a prompt, improve it” - barely works: the agent does not know what to improve and fiddles with cosmetics. The strong version shows the agent the failures and forces it to diagnose first, then treat.

python
def propose(current_prompt: str, fails: list, call_model) -> str:
    # show the agent up to 8 concrete failures, not the whole dataset
    sample = fails[:8]
    fail_text = "\n\n".join(
        f"INPUT: {f['input']}\nMODEL SAID: {f['answer']}\n"
        f"EXPECTED: {f['expected']}"
        for f in sample
    )
    meta_prompt = f"""You are optimizing a prompt. Here is the current prompt:

<current_prompt>
{current_prompt}
</current_prompt>

It failed on these examples:

{fail_text}

First, in one or two sentences, diagnose the SINGLE most common reason these failed.
Then rewrite the prompt to fix that reason specifically.

Rules:
- Keep what already works, change only what addresses the diagnosed failure.
- Do not add instructions that target these exact inputs by content. Generalize the fix, do not memorize the examples.
- Output ONLY the new prompt, nothing else."""

    return call_model(meta_prompt, "", temperature=0.7).strip()

Two deliberate things here. The agent first gives a single diagnosis rather than patching everything at once - that keeps each iteration narrow and stops the prompt from bloating into a pile of contradictory instructions. And the explicit ban “do not target these exact inputs by content” is the first, weakest line of defense against the overfit we turn to next.
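The meta-prompt asks for a diagnosis first and then for a reply that contains only the new prompt, so the diagnosis is not reliably present in the output. If you want to log it, as the history file in Step 4 does, one option is to ask for both in tagged sections and split the reply. The sketch below assumes the meta-prompt has been changed to request those tags; the tag names diagnosis and new_prompt and the function split_proposal are hypothetical, not part of the original code.

```python
import re


def split_proposal(raw: str) -> tuple[str, str]:
    """Split the agent's reply into (diagnosis, new_prompt).

    Assumption: the meta-prompt asked for <diagnosis>...</diagnosis>
    followed by <new_prompt>...</new_prompt>. A reply without a new prompt
    raises ValueError so a malformed reply is never evaluated as a prompt.
    """
    def grab(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", raw, flags=re.DOTALL)
        return match.group(1).strip() if match else ""

    diagnosis = grab("diagnosis")
    new_prompt = grab("new_prompt")
    if not new_prompt:
        raise ValueError("reply has no <new_prompt> block")
    return diagnosis, new_prompt


reply = (
    "<diagnosis>Ambiguous tickets default to 'other'.</diagnosis>\n"
    "<new_prompt>Classify the ticket. Pick the closest label.</new_prompt>"
)
diagnosis, new_prompt = split_proposal(reply)
print(diagnosis)
print(new_prompt)
```

Failing loudly on a malformed reply costs one iteration; silently scoring the raw reply as a prompt would pollute the history.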

Step 3: The recursive trap and three ways to cheat

Here is where this loop differs from “fix the tests,” and where it is dangerous. When the criterion is soft, and the thing optimized and the thing checking are both text for one model, the loop gets not one but three ways to show a high score while improving nothing in substance. Each is reward hacking, and each has its own cure.

First: overfitting the eval. The loop appends instructions to the prompt that work on exactly your fifty examples and fall apart on the fifty-first. In the limit the agent literally sews the answers into the prompt. Score 1.0, generalization zero.

The cure is the same one ML has used from the start: split the dataset. The loop optimizes on train, and the threshold is checked on a held-out set the agent has never seen.

python
# the dataset is split once, the agent only ever sees train
train, holdout = dataset[:35], dataset[35:]

# the loop optimizes on train
best = optimize(seed_prompt, train, call_model, propose)

# but the "did we pass the threshold" decision is made on holdout
holdout_score = run_eval(best["prompt"], holdout, call_model)["score"]
print(f"train {best['score']:.3f} / holdout {holdout_score:.3f}")
if holdout_score < THRESHOLD:
    print("Overfit: good on train, not on holdout. Prompt is no good.")

The gap between train and holdout is the cheating metric. Train 0.95, holdout 0.6 means the loop learned your examples, not the task. Trust only the holdout number.
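A slice-based split inherits whatever order the dataset file happens to have, so if examples are grouped by type the holdout can end up unrepresentative. A minimal sketch of a seeded shuffle-then-split, plus the gap computed as a number, under the assumption that a 70/30 split is acceptable; split_dataset and overfit_gap are hypothetical helper names.

```python
import random


def split_dataset(dataset: list, holdout_frac: float = 0.3, seed: int = 0):
    """Shuffle a copy with a fixed seed, then return (train, holdout)."""
    items = list(dataset)
    random.Random(seed).shuffle(items)  # fixed seed: the split is reproducible
    n_holdout = int(round(len(items) * holdout_frac))
    cut = len(items) - n_holdout
    return items[:cut], items[cut:]


def overfit_gap(train_score: float, holdout_score: float) -> float:
    """Train score minus holdout score; a large positive value signals overfit."""
    return train_score - holdout_score


examples = [{"id": i} for i in range(50)]
train, holdout = split_dataset(examples)
print(len(train), len(holdout))
print(round(overfit_gap(0.95, 0.60), 2))
```

Keeping the seed fixed matters: reshuffling on every run would move holdout examples into train and quietly leak them to the agent.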

Second: the judge praising itself. If the eval leans on a judge-model, and the prompt is rewritten by a model too, collusion appears: the agent learns to write answers the judge likes, not ones that are correct. It is worst when judge and worker are the same model: the judge recognizes its own style and inflates the score.

The cure is two layers. First, a judge on a different model than the worker: it does not favor an alien style. Second, the judge must have a verifiable anchor wherever possible, not raw taste.

python
# .judge.md - judge on a DIFFERENT model, anchored to fact
"""
You are grading an answer against a known-correct expected value.
You are NOT rewarding style, fluency, or confidence.

EXPECTED: {expected}
ANSWER: {answer}

Return PASS only if the answer is factually equivalent to EXPECTED.
Different wording is fine.
A confident wrong answer is FAIL.
A hedged correct answer is PASS.
Output PASS or FAIL and nothing else.
"""

Notice: even into a judge-model we feed expected - the known-correct answer. That turns “assess quality” into “check against fact,” and closes most of the hole. A pure taste-based judge with no anchor is a last resort, and its score must not be let into the stop condition as the only one.
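The judge's reply still has to be turned into a boolean, and that parsing step is another place where leniency can leak in. A minimal sketch, assuming a call_judge callable that sends a prompt to the judge model and returns its text; the function names and the shortened template are illustrative, not the author's code.

```python
# Shortened, illustrative version of the judge prompt.
JUDGE_TEMPLATE = (
    "You are grading an answer against a known-correct expected value.\n"
    "EXPECTED: {expected}\n"
    "ANSWER: {answer}\n"
    "Output PASS or FAIL and nothing else."
)


def parse_verdict(raw: str) -> bool:
    """Strict parse: anything other than a bare PASS counts as a failure."""
    return raw.strip().upper() == "PASS"


def judge(answer: str, expected: str, call_judge) -> bool:
    """Fill the template with the known-correct value and ask the judge."""
    prompt = JUDGE_TEMPLATE.format(expected=expected, answer=answer)
    return parse_verdict(call_judge(prompt))


# Usage with a stand-in judge that always replies PASS.
print(judge("4200", "4200", lambda prompt: "PASS"))
```

The strict equality is a design choice: a substring test for "PASS" would also accept replies such as "not a PASS", so unparseable verdicts are counted as failures instead.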

Third: gaming the metric’s shape. If the score is exact match, the agent can make the model answer in one word, breaking everything that needs a full answer; if the score rewards length, answers bloat. The agent optimizes exactly what you measure, including the crookedness of the measure itself.

The cure is an eval built from several metrics that no single cheap trick can raise together: accuracy plus a length penalty, correctness plus format. When two metrics pull in different directions, the cheap cheating path closes.
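One way to combine two opposing metrics is to subtract a length penalty from accuracy. The sketch below works on the per-example result records produced by the eval (dicts with "ok" and "answer"); the word budget and the penalty weight are arbitrary assumptions that would need tuning for a real task.

```python
def combined_score(results: list, max_words: int = 50,
                   length_weight: float = 0.2) -> float:
    """Accuracy minus a penalty for the share of answers over a word budget.

    Assumption: max_words and length_weight are illustrative defaults.
    """
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    accuracy = sum(1 for r in results if r["ok"]) / n
    too_long = sum(1 for r in results if len(r["answer"].split()) > max_words) / n
    return accuracy - length_weight * too_long


sample = [
    {"ok": True, "answer": "auth"},
    {"ok": True, "answer": "word " * 60},   # correct but bloated
    {"ok": False, "answer": "billing"},
    {"ok": False, "answer": "ui"},
]
print(round(combined_score(sample), 2))
```

Padding answers now costs score, so a rewrite that gains accuracy only by making the model ramble no longer looks like an improvement.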

Step 4: Memory and prompt history

The loop’s memory here is not just “what is done” but the whole trajectory: which prompt gave which score and which diagnosis drove the change. Without it the loop walks in circles, re-trying wordings it already failed.

json
// .prompt_history.jsonl - one line per iteration
{"iter": 3, "train_score": 0.74, "holdout_score": 0.71, "diagnosis": "model over-classifies ambiguous tickets as 'other'", "prompt_sha": "a1b2c3", "kept": true}
{"iter": 4, "train_score": 0.71, "holdout_score": 0.69, "diagnosis": "added examples, made it verbose, hurt label accuracy", "prompt_sha": "d4e5f6", "kept": false}

The kept field - whether the change was accepted - turns the history into a map: you see which directions improved and which did not, and you can feed the agent past rejected diagnoses so it does not propose them again. And prompt_sha instead of the full text keeps the log compact, the prompts themselves living in separate files.
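Writing and reading that history takes only a few lines. This is a sketch under assumptions: the article does not say how prompt_sha is computed, so a truncated SHA-256 is used here, and the helper names append_history and rejected_diagnoses are hypothetical.

```python
import hashlib
import json


def prompt_sha(prompt: str) -> str:
    """Short content hash of a prompt (assumption: first 6 hex chars of SHA-256)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:6]


def append_history(path: str, record: dict) -> None:
    """Append one iteration record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def rejected_diagnoses(path: str) -> list:
    """Collect diagnoses of iterations whose change was not kept."""
    rejected = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                if not record["kept"]:
                    rejected.append(record["diagnosis"])
    return rejected
```

The list returned by rejected_diagnoses can be pasted into the meta-prompt as "already tried, did not help", which is the feedback path described above.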

The best prompt itself lives in a separate file, overwritten only when the holdout score actually rises:

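python
import json
import os

# the champion is overwritten only on a holdout gain
# (i is the iteration counter of the surrounding loop)
champion = {"holdout_score": -1.0}  # default when no champion file exists yet
if os.path.exists(".champion.json"):
    with open(".champion.json") as f:
        champion = json.load(f)
if holdout_score > champion["holdout_score"]:
    with open(".champion.json", "w") as f:
        json.dump({"prompt": best["prompt"], "holdout_score": holdout_score, "iter": i}, f)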

Step 5: Isolation and brakes

The blast radius of this loop is different from a code loop’s. A prompt loop usually does not write to the repo or reach into prod, so the physical damage is small. The financial damage is not, and that is where the brakes go.

The main spend counter here is treacherous: each eval pass runs the whole dataset, which is N model calls. Fifty examples across fifteen iterations is seven hundred fifty calls at one pass per iteration. The Step 2 skeleton makes two passes per iteration (the current prompt and the candidate), so double it, and if a judge-model sits in the eval, double it again. Cost grows as iterations × dataset size, and without a ceiling it bites.
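The arithmetic is worth running before the loop does. A rough estimator, assuming every eval pass costs one worker call per example, a judge doubles that, and each iteration adds a fixed number of rewrite calls; the function and parameter names are illustrative.

```python
def estimate_calls(n_examples: int, iterations: int, evals_per_iter: int = 1,
                   judge: bool = False, propose_calls_per_iter: int = 0) -> int:
    """Rough upper bound on model calls for one optimization run."""
    calls_per_pass = n_examples * (2 if judge else 1)  # judge doubles each pass
    per_iteration = evals_per_iter * calls_per_pass + propose_calls_per_iter
    return iterations * per_iteration


print(estimate_calls(50, 15))              # 750
print(estimate_calls(50, 15, judge=True))  # 1500
```

Setting evals_per_iter=2 and propose_calls_per_iter=1 models a loop that scores both the current prompt and the candidate and spends one call on the rewrite.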

python
MAX_ITER = 15
MAX_EVAL_CALLS = 1500  # hard ceiling on total model calls
PATIENCE = 3           # stop if N iterations in a row without improvement

calls_spent = 0
no_improve = 0
prev_best = -1.0

# inside the loop, after each eval:
calls_spent += result["n"]
if calls_spent >= MAX_EVAL_CALLS:
    print("Call ceiling. Stopping.")
    break

# plateau detector: optimization ran out of steam
if best["score"] <= prev_best + 0.005:
    no_improve += 1
    if no_improve >= PATIENCE:
        print(f"Plateau for {PATIENCE} iterations. The loop is now burning money for nothing.")
        break
else:
    no_improve = 0
    prev_best = best["score"]

The plateau detector here is the main brake, more important than the iteration cap. Prompt optimization almost always grabs the easy gains fast and hits a wall: 0.7 to 0.88 in three iterations, then ten iterations stuck at 0.88, each burning a full pass over the dataset. PATIENCE catches exactly that moment - when the loop stopped improving and started just spending. Without it you pay for iterations that do not move the number.
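The same brake can be packaged as a small object and tried on a made-up score trajectory before it is wired into a real loop. This is a sketch: the class name and the simulated scores are illustrative, and the 0.005 tolerance mirrors the value used in the fragment above.

```python
class PlateauDetector:
    """Signal a stop after `patience` consecutive updates without real gain."""

    def __init__(self, patience: int = 3, min_delta: float = 0.005):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    def update(self, score: float) -> bool:
        """Record the best-so-far score; return True when it is time to stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


# Simulated best-so-far scores: fast early gains, then a wall at 0.88.
trajectory = [0.70, 0.80, 0.88, 0.88, 0.88, 0.88, 0.88]
detector = PlateauDetector()
stopped_at = None
for step, score in enumerate(trajectory, start=1):
    if detector.update(score):
        stopped_at = step
        break
print(stopped_at)  # 6
```

On this trajectory the detector fires after the third flat reading, so the run pays for three stale iterations rather than for the whole remaining budget.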

Step 6: How this loop dies

The same four deaths as any loop, but with symptoms specific to prompt optimization.

Runaway. The call counter climbs, the holdout score stands still. Cause: the threshold is unreachable for this model on this task, and the loop has no way to know it. Cure: a call ceiling and a plateau detector, plus a sober threshold - if the baseline is 0.5, a target of 0.99 may be physically unreachable.

Silent death by overfit. The most treacherous one for prompts: the train score crawls nicely up, the loop reports progress, while in fact it is just memorizing your examples. The symptom is visible only in the gap between train and holdout. Cure: never look at the train score as a result, stop condition on holdout only.

Random walk over wordings. The loop rewrites the prompt every turn, the score jitters up and down around one spot, converging nowhere. Cause: propose patches cosmetics instead of diagnosing, or the temperature is too high and every variant is a new random text. Cure: force the agent to give one diagnosis first, keep the best prompt, not the last.

Comprehension debt. The subtlest one right here. The loop hands you a prompt at holdout 0.93, you deploy it without reading. Inside is a wall of crutches that work on your dataset for reasons you do not understand, and break in prod on the first input outside the eval distribution. Cure: read the final prompt with your eyes and ask yourself why it works. If you cannot explain it, you did not optimize a prompt, you overfit the test and did not notice.

Where this actually pays off

This loop is not for every prompt. It pays off where three things meet: the prompt runs often (a classifier in prod, an extractor, a router), you have a labeled example set or it is cheap to collect, and quality is measurable as a number. Ticket classification, field extraction, request routing, moderation - ideal candidates: the answer is either right or wrong, the dataset accumulates from prod by itself.

And the reverse: for a one-off creative prompt you will run twice, building an eval loop is using a microscope to drive a nail. The cost of building the eval pays back only through frequency of use. Before you build, estimate: how many times this prompt will run in prod against how many hours go into labeling the eval. If the prompt lives a week and dies, optimize it by hand.

A loop pointed at prompting itself

There is a neat symmetry in this loop, and it is why it was worth building. Everywhere, loop engineering is a machine that delegates judgment: you carry the “done or not” decision outside, and the agent works until the check says yes. Here the machine delegates the judgment about how to delegate judgment. The loop optimizes not code but the prompt itself - that is, the interface through which you steer the model at all.

And that is exactly why the recursive trap from the start matters so much. When both the thing optimized and the thing checking are text for one model, the only thing keeping the loop from lying beautifully to itself is a hard anchor on the outside: a held-out dataset the agent has not seen, and a check tied to fact, not to taste. Remove the anchor and the loop will happily optimize the prompt to a score of 1.0 that means nothing.

So the rule stays the same, it is just sharper here: carry the judgment outside, and keep outside what the agent cannot rewrite. Fixing code, that is a test’s exit code. Optimizing a prompt, it is the holdout you hid and the fact the judge is tied to. A loop is exactly as good as the part it cannot touch. Build that untouchable part first, and the optimization second.