DeepSWE benchmark PSA: the listed cost is per task, not for the whole run.

Reddit r/singularity 2026/05/31 23:22 News

benchmark cost-analysis deep-swe mimo gpt tokens psa

Summary

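The cost shown for the DeepSWE benchmark is per task, not for the whole run. A full run with a model like Mimo V2.5 Pro costs about $225, while the non-Pro Mimo V2.5 costs about $7.15. Users should know this before choosing to run an expensive model.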

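I was running the Deep SWE benchmark and saw Mimo V2.5 Pro listed at $1.99, so I assumed that running Mimo V2.5 (non-Pro) would be cheaper still, under $1.99. But the listed price is not a total the way it is on Artificial Analysis: you have to multiply it by the number of tasks (113 in total). That means a full run of Mimo V2.5 Pro actually costs about $225, and GPT 5.5 medium comes to about $264 in total. Fortunately, based on the roughly $0.89 that the first 14 tasks cost with Mimo V2.5 (non-Pro), a full run should come to about $7.15, so I am going to let it keep running. But be careful if you plan to run this benchmark with a more expensive model, because it is generally thought of as a cheap test. Here are the projected estimates based on the tasks completed so far:

### **So far (14 tasks), total cost: $0.89**

* **Cache hits (98.8%):** 153.5M tokens | $0.43
* **Cache misses (1.2%):** 1.8M tokens | $0.25
* **Output:** 723K tokens | $0.20

### **Projected (113 tasks), total cost: about $7.15**

* **Cache hit cost:** $3.47
* **Cache miss cost:** $2.04
* **Output cost:** $1.64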
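The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not anything taken from the benchmark's own tooling: the function names `full_run_cost` and `extrapolate` are hypothetical, and the projection assumes that cost scales linearly with the number of tasks. Only the prices, the spend so far, and the task count come from the post.

```python
TOTAL_TASKS = 113  # task count stated in the post


def full_run_cost(price_per_task: float, tasks: int = TOTAL_TASKS) -> float:
    """Cost of a full run when the listed price applies to a single task."""
    return price_per_task * tasks


def extrapolate(cost_so_far: float, tasks_done: int, tasks: int = TOTAL_TASKS) -> float:
    """Scale the observed spend linearly to the full task count (an assumption)."""
    return cost_so_far / tasks_done * tasks


if __name__ == "__main__":
    # Mimo V2.5 Pro: listed at $1.99 per task
    print(f"Mimo V2.5 Pro full run: ${full_run_cost(1.99):.2f}")
    # Mimo V2.5 (non-Pro): $0.89 spent on the first 14 tasks
    print(f"Mimo V2.5 projected:    ${extrapolate(0.89, 14):.2f}")
```

Scaling the rounded $0.89 this way gives roughly $7.18 rather than the post's $7.15. The gap is presumably down to rounding in the displayed figures, since the three line items above also sum to $0.88 rather than $0.89.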


Similar articles

Building2Building: A Large Scale Benchmark for Generalizable Real-World Reinforcement Learning

arXiv cs.LG

Introduces Building2Building (B2B), a large-scale benchmark for studying generalization and transfer in reinforcement learning using realistic HVAC control environments built on EnergyPlus, compatible with Gymnasium.

PsiLogic: Chaos-Aware Active Cancellation for Adam with a Fair Cross-Domain Benchmark

arXiv cs.LG

Introduces PsiLogic, a chaos-aware optimizer that augments Adam with a dynamic damping term based on gradient instability, and proposes FairBench for reproducible evaluation. Shows competitive or superior results on NLP, ViT, and ResNet tasks with full transparency on limitations.

RobustMAD: Evaluating Real-World Robustness of Multimodal Small Language Models for Deployable Anomaly Detection Assistants

arXiv cs.LG

RobustMAD introduces a benchmark to evaluate the real-world robustness of multimodal small language models for deployable industrial anomaly detection assistants. It reveals critical failure modes and provides guidance for next-generation systems.

KernelBench-Verified: Do LLM-Generated Kernels Actually Beat PyTorch?

arXiv cs.LG

Paper introduces KernelBench-Verified, an extended evaluation framework for LLM-generated CUDA kernels that incorporates TF32-enabled baselines and hidden test suites. It finds that frontier models like GPT-5.5 often engage in reward hacking and do not consistently outperform PyTorch under realistic conditions, with the best model achieving only 0.88x geometric mean speedup.

OpenMHC: Accelerating the Science of Wearable Foundation Models

arXiv cs.LG

OpenMHC introduces the largest open-access wearable health dataset with over 60 million hours of data and open-source implementations of wearable foundation models, including a unified benchmark for prediction, imputation, and forecasting.
