Heads up for DeepSWE benchmark: The cost is measured per task, not the total run.

Reddit r/singularity 05/31/26, 11:22 PM News

benchmark cost-analysis deep-swe mimo gpt tokens psa

Summary

The costs listed for the DeepSWE benchmark are per task, not for the full run. With 113 tasks, a full run of Mimo V2.5 Pro comes to ~$225, while Mimo V2.5 (non-pro) is projected at ~$7.15. Keep this in mind before running an expensive model.

I was running the DeepSWE benchmark and saw Mimo V2.5 Pro listed at $1.99, so I figured running Mimo V2.5 (non-pro) would come in under $1.99. But it's not like Artificial Analysis, where the listed cost is the total amount: here the cost is per task, so you need to multiply it by the total number of tasks, which is 113. This means that Mimo V2.5 Pro is actually ~$225 for a full run and GPT 5.5 medium is ~$264 in total.

Fortunately, Mimo V2.5 (non-pro) has cost about $0.89 for the first 14 tasks, so it looks like a complete run is going to cost ~$7.15 in total, and I'm still planning to let it run. But beware if you're about to run the benchmark with a more expensive model thinking that it's a cheap benchmark to run in general.

Here's the projection based on what it's done so far:

### **So far (14 tasks): total cost $0.89**

* **Cache hits (98.8%):** 153.5M tokens | $0.43
* **Cache misses (1.2%):** 1.8M tokens | $0.25
* **Output:** 723K tokens | $0.20

### **Projected (113 tasks): total cost ~$7.15**

* **Cache hit cost:** $3.47
* **Cache miss cost:** $2.04
* **Output cost:** $1.64
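For anyone who wants to sanity-check their own run before committing, the arithmetic above can be sketched in a few lines of Python. This is only a rough sketch: the function names are made up for illustration, the dollar figures are the ones quoted in the post, and the projection assumes the remaining tasks cost about the same on average as the ones already finished, which may not hold.

```python
# Rough cost projection for a DeepSWE run.
# Assumption: cost scales linearly with the number of tasks.
# Function names are hypothetical; figures are the ones quoted in the post.

TOTAL_TASKS = 113  # task count stated in the post


def full_run_cost(per_task_usd: float, total_tasks: int = TOTAL_TASKS) -> float:
    """Scale a listed per-task cost up to a full run."""
    return per_task_usd * total_tasks


def project_from_partial(spent_usd: float, tasks_done: int,
                         total_tasks: int = TOTAL_TASKS) -> float:
    """Extrapolate the full-run cost from what a partial run has cost so far."""
    if tasks_done <= 0:
        raise ValueError("tasks_done must be positive")
    return spent_usd / tasks_done * total_tasks


if __name__ == "__main__":
    # Listed per-task cost for Mimo V2.5 Pro: $1.99
    print(f"Mimo V2.5 Pro, full run: ${full_run_cost(1.99):.2f}")
    # Mimo V2.5 (non-pro): $0.89 spent on the first 14 tasks
    print(f"Mimo V2.5, projected full run: ${project_from_partial(0.89, 14):.2f}")
```

Scaling the rounded $0.89 this way gives a figure a few cents above the ~$7.15 in the projection. That gap is probably just rounding in the displayed amounts, but that's a guess.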


Similar Articles

An Exam for Active Observers

arXiv cs.CL

This paper introduces ActiveVision, a benchmark to evaluate active observation in multimodal large language models. Frontier models like GPT-5.5 and Claude Fable 5 perform poorly, solving only 10.6% and 3.5% of tasks respectively, compared to human 96.1%, highlighting a lack of iterative visual perception.

Rate-Utility Frontiers for Language Encodings: Comparing Tokens, Bytes, and Pixels Under Controlled Linguistic Content

arXiv cs.CL

This paper compares subword tokens, raw bytes, and rendered pixels as text encodings for language models under controlled linguistic content across 13 languages. It traces rate–utility frontiers and finds that no encoding dominates across tasks, with pixels preserving surface form best, bytes preserving cross-lingual alignment best, and tokens supporting topic prediction best.

Frontier AI performance across the business disciplines: a case-grounded benchmark of knowledge work and analytical reasoning

arXiv cs.CL

This paper introduces BusinessCaseBench, a benchmark of business case questions from 18 disciplines with expert grading rubrics. It finds that frontier AI models already score highly and show rapid improvement, with implications for business education and professional work.

Before the Action: Benchmarking LLMs on Prospective Hypothesis Discovery

arXiv cs.CL

The paper introduces HypoArena, a benchmark for evaluating LLMs' ability to proactively construct hypothesis spaces from incomplete evidence, and experiments on 15 frontier LLMs reveal capability stratification.

ContinuityBench: A Benchmark and Systems Study of Stateful Failover in Multi-Provider LLM Routing

arXiv cs.LG

Introduces ContinuityBench, a benchmark and systems study for stateful failover in multi-provider LLM routing, proposing new metrics (CPR, CLO) and a history-forwarding proxy architecture achieving 99.20% context preservation.
