Heads up for DeepSWE benchmark: The cost is measured per task, not the total run.

Reddit r/singularity News

Summary

DeepSWE benchmark costs are reported per task, not per full run. With 113 tasks, a full run of Mimo V2.5 Pro costs ~$225, while Mimo V2.5 (non-pro) is projected at ~$7.15. Users should be aware of this before running expensive models.

I was running the DeepSWE benchmark and saw Mimo V2.5 Pro listed at $1.99, so I figured running Mimo V2.5 (non-pro) would cost less than $1.99. But it's not like Artificial Analysis, where the figure is the total for the run: here the cost is per task, so you need to multiply it by the total number of tasks, which is 113. This means that Mimo V2.5 Pro is actually ~$225 for a full run ($1.99 × 113) and GPT 5.5 medium comes to ~$264 in total.

Fortunately, the first 14 tasks of my Mimo V2.5 (non-pro) run cost about $0.89, which projects to ~$7.15 for the full run, so I'm still planning to let it finish. But beware if you're about to run the benchmark with a more expensive model, thinking that it's a cheap benchmark to run in general.

Here's the projection based on what it's done so far:

### **So far (14 tasks): Total Cost $0.89**

* **Cache hits (98.8%):** 153.5M tokens | $0.43
* **Cache misses (1.2%):** 1.8M tokens | $0.25
* **Output:** 723K tokens | $0.20

### **Projected (113 tasks): Total Cost ~$7.15**

* **Cache hit cost:** $3.47
* **Cache miss cost:** $2.04
* **Output cost:** $1.64
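
The arithmetic in the post can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of DeepSWE or any tool, and the projection assumes that cost grows linearly with the number of tasks, which may not hold if later tasks are larger or hit the cache less often.

```python
TOTAL_TASKS = 113  # task count given in the post


def full_run_cost(per_task_cost: float, total_tasks: int = TOTAL_TASKS) -> float:
    """Scale a per-task cost figure up to a full benchmark run."""
    return per_task_cost * total_tasks


def project_from_partial(cost_so_far: float, tasks_done: int,
                         total_tasks: int = TOTAL_TASKS) -> float:
    """Linearly extrapolate the spend of a partial run to all tasks."""
    if tasks_done <= 0:
        raise ValueError("tasks_done must be positive")
    return cost_so_far / tasks_done * total_tasks


if __name__ == "__main__":
    # Per-task figure quoted in the post for Mimo V2.5 Pro.
    print(f"Mimo V2.5 Pro, full run: ${full_run_cost(1.99):.2f}")
    # Partial-run spend quoted in the post for Mimo V2.5 (non-pro).
    print(f"Mimo V2.5, projected:    ${project_from_partial(0.89, 14):.2f}")
```

With the rounded $0.89 figure this projects to about $7.18, slightly above the ~$7.15 quoted in the post; the difference most likely comes from the post's projection using unrounded per-category costs.
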

Similar Articles

Someone did an audit on the new DeepSWE, the results aren't pretty

Reddit r/singularity

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.