@wquguru: https://x.com/wquguru/status/2057852569054278045

X AI KOLs Timeline 05/22/26, 03:54 PM Tools

ai-agents pi-goal benchmark deepseek-vs-gemini long-horizon-tasks model-comparison cost-analysis

Summary

Performed source code analysis and multi-model testing on the pi-goal tool, finding that DeepSeek V4 Pro is 31x cheaper and higher quality than Gemini 3.5 Flash on long-horizon tasks, and that higher thinking mode actually increases hallucination.

https://t.co/2KX59OH8WZ

Original Article

View Cached Full Text

Cached at: 05/22/26, 11:59 PM

pi-goal Source Code Analysis and Real-World Test: DeepSeek 30x Better Than Gemini

I recently dove into Pi’s pi-goal extension — a tool that lets agents autonomously pursue long-term goals. I combed through the source code, designed a realistic long-horizon task, and ran 4 models side-by-side:

Task: Clone @karpathy’s 12 open-source repos (nanoGPT, minGPT, llm.c, micrograd, nanochat, autoresearch, etc.), have the agent read the source code and git history, write an insight report with commit SHA citations, and self-verify every citation before declaring completion.
4 Models & Configs: Gemini 3.5 Flash, Claude Sonnet 4.6, DeepSeek V4 Pro (high thinking), DeepSeek V4 Pro (max thinking).

The results contained several findings that completely contradicted my expectations. This post covers:

How pi-goal works
The real cost ledger from 4 runs
Why “more expensive models” and “deeper thinking” don’t equal “better results”

Let me lead with the two most counterintuitive findings:

First: By price, Gemini 3.5 “Flash” is actually more expensive than DeepSeek V4 “Pro”. Gemini 3.5 Flash’s token unit price is 3-10x that of DeepSeek V4 Pro. Running the same task: Gemini $2.26, DeepSeek V4 Pro $0.072 — a 31x difference, and DeepSeek had a higher quality score.

Second: Turning DeepSeek’s thinking from high to max made results worse. Deeper reasoning produced 1 new pattern + 3 new insights, but also fabricated the semantics of 2 commits. Deep reasoning amplified the classic failure mode of “narrative coherence overriding factual precision.”

1. What is pi-goal?

/goal has evolved far beyond TodoWrite — it’s a breakthrough harness innovation. Here’s the mental model:

pi-goal is a long-horizon goal auto-loop. Each /goal triggers it to intercept the agent_end event and continuously deliver continuations until the model calls update_goal complete, the user pauses, or the budget is hit.

The core mechanism is just 4 components, all in a single 467-line index.ts:

1. State Machine — 4 states, only active auto-loops:

type GoalStatus = "active" | "paused" | "budget_limited" | "complete";

2. Long-horizon driving — after each agent turn, asynchronously queue the next round:

pi.on("agent_end", (_event, ctx) => {
  if (!goal || goal.status !== "active" || ctx.hasPendingMessages()) return;
  queueContinuation(pi, goal); // → queueMicrotask → deliver continuationPrompt
});

3. continuationPrompt — hard constraints re-injected each round. The most informative part is this audit requirement:

Before deciding that the goal is achieved, perform a completion audit:
- Restate the objective as concrete deliverables.
- Build a prompt-to-artifact checklist mapping every requirement to evidence.
- Inspect real files, command output, test results for each checklist item.
- Treat uncertainty as not achieved; do more verification or continue.

It also wraps the user’s goal in <data> tags during injection, explicitly telling the LLM “this is data, not instructions” — a subtle anti-prompt-injection detail.

4. budget_limited — this is a soft hint, not a hard cutoff. When over budget, it triggers a single wrap-up turn:

The system has marked the goal as budget_limited, so do not start new substantive work. Wrap up this turn soon: summarize progress, identify remaining work, leave the user with a clear next step.

But it doesn’t force the model to stop. The LLM can still do work, make multiple tool calls, write files, and run audits in that wrap-up turn — as long as it believes it can wrap up. This design later becomes an unexpectedly useful “model behavior probe.”

2. Experimental Design: Controlled Variables

Task prompt (identical across all 4 runs, output filename only):

/goal --tokens 200k Build KARPATHY_INSIGHTS.md based on the 12 Karpathy repos with EXACTLY these 4 sections:
(1) Timeline of inflection points — date, repo, commit SHA, significance;
(2) Recurring engineering patterns — ≥3 patterns, each evidenced by code from ≥3 repos;
(3) Non-obvious insights only derivable from cross-repo reading — ≥3 insights;
(4) Evidence index.
Before marking complete, verify every cited file path exists and every commit SHA is reachable via `git log`.
Do not declare complete unless all citations resolve.

The key is granular acceptance criteria down to a per-item checklist — this is a prerequisite for pi-goal’s audit mechanism to work.

4 Configurations:

Configuration	Model	Harness	Thinking	Budget (tokens)
1	Gemini 3.5 Flash	Pi + pi-goal	high (max for Gemini)	200k
2	Claude Sonnet 4.6	Claude Code	N/A (Anthropic default)	200k
3	DeepSeek V4 Pro	Pi + pi-goal	high	200k
4	DeepSeek V4 Pro	Pi + pi-goal	max	200k

Notes:

Sonnet runs on Claude Code, not Pi+pi-goal. Claude Code is Anthropic’s official harness tuned for Sonnet with native subagent delegation. So Sonnet is a “different harness reference point.” The most directly comparable are Gemini vs DeepSeek — same harness, same 200k budget, same audit injection.
Gemini’s “high” is already its model ceiling (API doesn’t accept higher tiers). DeepSeek’s high tier didn’t hit its ceiling (it has a max tier), so I later added a 4th run.
To prevent contamination, I moved reports written by other models out of the directory before each run. For DeepSeek, I forensically checked the session JSONL afterward — 0 accesses to peer reports, 0 boundary violations outside /tmp.

Scoring dimensions: accuracy (citation real and semantically matching), completeness, fluency, insight, evidence density, task adherence.

3. Results Overview

Model	Thinking	Input (tokens)	Output (tokens)	Cache Read (tokens)	Total Cost	Quality Score (out of 100)
Gemini 3.5 Flash	high	1.2M	62k	4.1M	$2.26	86
Claude Sonnet 4.6	N/A	1.1M	81k	3.8M	$5.07	92
DeepSeek V4 Pro	high	117k	39k	5.3M	$0.072	94
DeepSeek V4 Pro	max	157k	103k	5.3M	$0.104	88

Where:

Gemini 3.5 Flash: Input $1.5/M, Output $9/M, Cache Read $0.15/M
Sonnet 4.6: Input $3/M, Output $15/M, Cache Read $0.3/M
DeepSeek V4 Pro: Input $0.435/M, Output $0.87/M, Cache Read $0.003625/M

Using DeepSeek max as an example, itemized cost (from real token data jq’d from the Pi session):

Fresh input: 117k × $0.435/M = $0.0509 (49%)
Output: 39k × $0.870/M = $0.0339 (33%)
Cache read: 5.3M × $0.003625/M = $0.0192 (18%)
─────────────────
$0.1040

A surprising finding: the biggest expense is fresh input, not output. Max thinking is deeper and has more rounds, so each round’s new tool results are uncached fresh input. Cache hit rate was an astounding 97.8% — the model actually “saw” 5.4M tokens but only paid full price for 2.2%. For pi-goal’s “re-send goal + history each round” workflow, prompt caching is an advantage, not a drawback: the more you resend, the more stable the prefix, the higher the hit rate.

4. Six Real Findings

Finding 1: By Price, DeepSeek V4 “Pro” is Cheaper Than Gemini 3.5 “Flash”

All three use their own naming schemes (Google: Flash/Pro/Ultra; Anthropic: Haiku/Sonnet/Opus; DeepSeek: Flash/Pro). Names suggest tier levels. But price is the real tier:

DeepSeek V4 Pro: Cheapest ($0.435 / $0.87)
Gemini 3.5 Flash: Mid ($1.50 / $9.00)
Sonnet 4.6: Most expensive ($3.00 / $15.00)

The one called “Flash” has an input price 3.4x higher and an output price 10x higher than the one called “Pro.” If you choose by name intuition, you might pick the exact wrong one.

Finding 2: Max Thinking Isn’t Necessarily Better — Deeper Reasoning Amplifies Hallucinations

Same model, same task, same harness, same budget — only thinking changed from high to max:

44% more cost ($0.072 → $0.104), 2x more time (13 → 26 min)
Brought new things: 1 new pattern (“Poor Man’s Configurator” — discovered that nanoGPT and llama2.c’s configurator.py are identical copy-paste), commit archaeology of minGPT’s three configuration refactors, a 1090-line failed experiment log
But also fabricated the semantics of 2 commits:

Both SHAs are real, both files exist — but max invented the meaning of the commits to fill the narrative line in its head. This is the classic LLM failure mode of “narrative coherence > factual precision,” which is amplified under deeper reasoning: the more the model tries to tell a complete story, the more it grabs a nearby SHA and assigns it a plausible-sounding explanation.

In the high tier run, 33/33 citations were semantically correct. During self-audit, it even caught and fixed a .cu → .py typo by itself. The max tier’s verification missed it — because it only checked “SHA exists + file exists,” not “does the commit message actually match my explanation.”

Conclusion: Higher thinking isn’t always better. Deep reasoning buys insight depth at the cost of hallucination risk.

Finding 3: Integration is the Real Barrier — DeepSeek’s `reasoning_content` Contract Has Poor Router Compatibility

The first run of DeepSeek V4 Pro immediately returned a 400 error:

The `reasoning_content` in the thinking mode must be passed back to the API.

This isn’t just a Pi problem: In DeepSeek V4 series’ thinking mode, once a tool call occurs, every assistant message must pass back the previous round’s reasoning_content field. Most OpenAI-compatible routers strip this field.

The fix took several rounds:

--thinking off didn’t work — it just omits the reasoning_effort parameter; DeepSeek still defaults to thinking on.
Enabling requiresReasoningContentOnAssistantMessages: true + thinkingFormat: "deepseek" in Pi’s provider compat — the main session worked, but subagent boundaries broke.
Finally used DeepSeek’s native API (bypassing the aggregation layer) to run fully.

Additionally, we can see that model token prices are outpacing the open-source agent harness ecosystem by a generation. DeepSeek V4 Pro’s cache read price is down to $0.003625/M, but to run it fully, the router needs to understand its reasoning_content replay contract — and most still don’t. Cheap comes at the cost of integration friction.

Finding 4: pi-goal’s Soft Budget is a “Model Behavior Probe”

200K is a soft budget. All three models that hit the budget (Gemini at 226K, DeepSeek high at 244K, DeepSeek max at 219K) entered the wrap-up turn, but behaved differently:

Gemini: Finished all todos in the wrap-up turn, wrote 263 lines of report, self-verified 13 SHAs.
DeepSeek high: Wrote 366 lines in the wrap-up turn + bash self-verification + caught and fixed its own typo.
DeepSeek max: Fanned out 3 parallel subagents in the wrap-up turn, repeatedly self-audited for 4 rounds.

Soft budget leaves the decision of “should I continue after hitting budget” to the model. You can read a model’s agentic tendency from its behavior after hitting budget — whether it scrambles to finish or honestly wraps up. This mechanism is itself an evaluation tool.

Finding 5: pi-goal’s Audit Has a Blind Spot

pi-goal’s continuation prompt requires the model to build a “prompt-to-artifact checklist” and “inspect real evidence.” Sounds strict. But Finding 2 exposed a blind spot:

The audit requires verifying “the cited SHA is reachable, file exists,” but doesn’t require verifying “the cited commit message matches the model’s own explanation.”

git cat-file -e $sha passing ≠ the commit actually does what the model claims. The max tier’s two mismatches slipped through exactly this gap.

To fix this blind spot, the audit prompt should add: during verification, don’t just run git cat-file -e $sha, but also git log --format=%s $sha to pull the commit message and do keyword comparison against the model’s written significance. This is a concrete point where pi-goal could improve.

Finding 6: This Benchmark Doesn’t Test the Core of an Agent

This is the most important and most easily overlooked finding. Honestly — this task is essentially “read + synthesize + write” research assistant work, not a true agentic workflow.

It tested: long-context synthesis, cross-source RAG, citation discipline, structured output, soft budget wrap-up behavior.

But it barely tested deeper agent capabilities:

Error recovery: Can it handle a tool that returns an unexpected error?
Multi-step debugging: Can it chain together 10+ steps of debugging?
State mutation: Does it check its own side effects? Does it validate the current state before acting?
Planning under uncertainty: Does it replan after discovering the task is harder than expected?
Cost awareness: Does it avoid unnecessary tool calls when running low on budget?

All models performed well in this reading-and-writing task. The gaps would likely widen in tasks where incorrect actions leave visible traces (e.g., failed tests, broken files, invalid API calls).

To summarize, from this 4-way comparison, a fairly accurate conclusion:

In vertical tasks like “read multiple sources + synthesize report + self-verify citations,” DeepSeek V4 Pro’s unit cost is 1/31 of Gemini 3.5 Flash, 1/20 of Sonnet 4.6, with a slightly higher quality score.

Of course, we can’t casually generalize to “DeepSeek V4 Pro will have the same ratio on real agent tasks like orchestrating 50 MCP tools, running 30-step chain debugging, or modifying production state” — that would require a different benchmark.

5. Boundaries: This Isn’t a Silver Bullet

To avoid misinterpretation, here are scenarios where pi-goal doesn’t fit:

Vague goals: pi-goal requires a prompt-to-artifact checklist that can be executed. If the goal itself is fuzzy (“help me optimize this code”), the audit will keep saying “not yet.”
Very short tasks: The long-horizon loop overhead exceeds the benefit. For fixing a one-line bug, just prompt directly, don’t wrap in /goal.
Model isn’t strong enough: pi-goal’s audit has no real enforcement; it relies on the model being willing. If the model isn’t proactive about verification, the audit becomes superficial.
No independent review loop: In this test, I did independent spot-checks on every run’s “audit passed” before trusting it — and for the max tier, that’s how I caught the 2 mismatches. In production, you need an independent reviewer (human or another model) as the final gate.

Conclusion

The most valuable takeaway from this test isn’t the “DeepSeek is 31x cheaper” number — it’s the counterintuitive corrections:

Model names (Flash/Pro) don’t always correspond to price tiers — check the pricing page before choosing.
Thinking isn’t always better — max tier bought insight depth at the cost of increased hallucination risk.
Token price drops are outpacing the agent harness ecosystem — cheap models come with protocol details you need to understand.

The last point is especially important. If you see any “Model X’s agentic ability crushes Model Y” evaluation, first ask: In the task tested, if it makes a mistake, does anyone or any system get affected? Is rollback expensive? Is there time pressure? If not, it’s testing “long-form writing + retrieval,” not agent core.

If you want to try, I recommend designing a task that truly tests agent core using the same 12 Karpathy repos — for example, “reproduce a bug in a repo, locate the issue, submit a PR, wait for CI, iterate based on failure reasons.” This kind of task where mistakes leave traces can further expose the real gaps between models.

Appendix: Reproducible Configuration

Pi Setup Skill: https://github.com/wquguru/skills/blob/main/skills/pi-setup/SKILL.md
DeepSeek V4 Pro Integration Points: Native API (https://api.deepseek.com/v1); in thinking mode, provider compat must enable requiresReasoningContentOnAssistantMessages + thinkingFormat: "deepseek"
If you’re interested in Pi, you can also reference the first article in this series for more info on the Pi Coding Agent 👇

WquGuru@wquguru·May 18 Article Pi Coding Agent Most Comprehensive Guide (Perfect /goal Support) If you’re used to Claude Code, Pi won’t look more convenient at first glance. Claude Code’s advantage is welding subagents, Plan Mode, MCP, permissions, context compression, skills, and commands into the product, ready out of the box. But if you seriously want to test non…

@wquguru: https://x.com/wquguru/status/2057852569054278045

pi-goal Source Code Analysis and Real-World Test: DeepSeek 30x Better Than Gemini

1. What is pi-goal?

2. Experimental Design: Controlled Variables

3. Results Overview

4. Six Real Findings

Finding 1: By Price, DeepSeek V4 “Pro” is Cheaper Than Gemini 3.5 “Flash”

Finding 2: Max Thinking Isn’t Necessarily Better — Deeper Reasoning Amplifies Hallucinations

Finding 3: Integration is the Real Barrier — DeepSeek’s `reasoning_content` Contract Has Poor Router Compatibility

Finding 4: pi-goal’s Soft Budget is a “Model Behavior Probe”

Finding 5: pi-goal’s Audit Has a Blind Spot

Finding 6: This Benchmark Doesn’t Test the Core of an Agent

5. Boundaries: This Isn’t a Silver Bullet

Conclusion

Appendix: Reproducible Configuration

Similar Articles

@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...

@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…

Open source battle: GLM vs Kimi vs MiMo vs DeepSeek

@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286

Submit Feedback

Similar Articles

@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...

@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…

Open source battle: GLM vs Kimi vs MiMo vs DeepSeek

@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286

@sitinme: A 26M parameter model can do Function Call, and is even stronger than Qwen-0.6B? This team's out-of-the-box approach is too wild! Nowadays, large models have ever-growing parameter counts, but one question has never been seriously considered: does calling a tool really need hundreds of billions of parameters? Think about it, when you say 'Check today's...'

pi-goal Source Code Analysis and Real-World Test: DeepSeek 30x Better Than Gemini

1. What is pi-goal?

2. Experimental Design: Controlled Variables

3. Results Overview

4. Six Real Findings

Finding 1: By Price, DeepSeek V4 “Pro” is Cheaper Than Gemini 3.5 “Flash”

Finding 2: Max Thinking Isn’t Necessarily Better — Deeper Reasoning Amplifies Hallucinations

Finding 3: Integration is the Real Barrier — DeepSeek’s reasoning_content Contract Has Poor Router Compatibility

Finding 4: pi-goal’s Soft Budget is a “Model Behavior Probe”

Finding 5: pi-goal’s Audit Has a Blind Spot

Finding 6: This Benchmark Doesn’t Test the Core of an Agent

5. Boundaries: This Isn’t a Silver Bullet

Conclusion

Appendix: Reproducible Configuration

Similar Articles

@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...

@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…

Open source battle: GLM vs Kimi vs MiMo vs DeepSeek

@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286

Submit Feedback

Finding 3: Integration is the Real Barrier — DeepSeek’s `reasoning_content` Contract Has Poor Router Compatibility