Opus 4.8 shows improved build quality and lower cost compared to Opus 4.7 on the MineBench 3D block-structure benchmark, though with some inconsistencies. The model demonstrates streamlined thinking and more efficient inference.
**Some Notes:** * *Average Inference Time: 24.8 min (1,487seconds)* * *Total Cost (for 15 builds): $41.52* * Much cheaper than Opus 4.7 was, despite having the same API pricing * The CoT / thinking times have clearly been streamlined (similar to what OpenAI has been doing with their latest releases) which lowers overall cost, but despite that, the output seems better than Opus 4.7, so that's good * This is, in my opinion, one of the first Claude models in a long time that actually feels like a genuinely impressive release; its builds are actually of similar quality to GPT 5.5, though a bit more inconsistent * During generation, the model had to retry 5 builds due to either hallucinations with the given block palette (it used blocks which were not available) or malformed outputs * That's pretty on par with the Claude models, though the adaptive thinking seems to work better this time around (in previous attempts the model would spend all of it's output tokens for CoT and not have enough left over to finish its actual JSON output) * In my opinion, Opus 4.8 is a clear improvement over Opus 4.7 (or maybe it's what Opus 4.7 was supposed to be originally 🤷♂️) * Feel free to see all the other updates on the [GitHub release](https://github.com/Ammaar-Alam/minebench/releases/tag/3.6.0) (thanks for the suggestion!) * **If you enjoy these posts please feel free to help** [**fund**](https://buymeacoffee.com/ammaaralam) **the benchmark** **Benchmark:** [https://minebench.ai/](https://minebench.ai/) **Git** **Repository:** [https://github.com/Ammaar-Alam/minebench](https://github.com/Ammaar-Alam/minebench) **Previous Posts:** * [Comparing GPT 5.4 and GPT 5.5](https://www.reddit.com/r/singularity/comments/1sxapqb/differences_between_gpt_54_and_gpt_55_on_minebench/) * [Comparing Kimi K2.5 and Kimi K2.6](https://www.reddit.com/r/LocalLLaMA/comments/1srs4uj/differences_between_kimi_k25_and_kimi_k26_on/) * [Comparing Opus 4.6 and Opus 4.7](https://www.reddit.com/r/ClaudeAI/comments/1sofgno/differences_between_opus_46_and_opus_47_on/) * [Comparing GPT 5.4 and GPT 5.4-Pro](https://www.reddit.com/r/OpenAI/comments/1rr0vi4/differences_between_gpt_54_and_gpt_54pro_on/) * [Comparing GPT 5.2 and GPT 5.4](https://www.reddit.com/r/singularity/comments/1rluvdz/difference_between_gpt_52_and_gpt_54_on_minebench/) * [Comparing GPT 5.2 and GPT 5.3-Codex](https://www.reddit.com/r/OpenAI/comments/1rdwau3/gpt_52_versus_gpt_53codex_on_minebench/) * [Comparing Opus 4.5 and 4.6, also answered some questions about the benchmark](https://www.reddit.com/r/ClaudeAI/comments/1qx3war/difference_between_opus_46_and_opus_45_on_my_3d/) * [Comparing Opus 4.6 and GPT-5.2 Pro](https://www.reddit.com/r/OpenAI/comments/1r3v8sd/difference_between_opus_46_and_gpt52_pro_on_a/) * [Comparing Gemini 3.0 and Gemini 3.1](https://www.reddit.com/r/singularity/comments/1ra6x6n/fixed_difference_between_gemini_30_pro_and_gemini/) **Extra Information (if you're confused):** Essentially it's a benchmark that tests how well a model can create a 3D Minecraft like structure. So the models are given a palette of blocks (think of them like legos) and a prompt of what to build, so like the first prompt you see in the post was a fighter jet. Then the models had to build a fighter jet by returning a JSON in which they gave the coordinate of each block/lego (x, y, z). It's interesting to see which model is able to create a better 3D representation of the given prompt. The smarter models tend to design much more detailed and intricate builds. The repository readme might provide might help give a better understanding. *(Disclaimer: This is a public benchmark I created, so technically self-promotion :)*
Anthropic released Claude Opus 4.8, an incremental update over Opus 4.7 with sharper judgment and longer autonomous work capability, though some engineers remain skeptical about its code generation without extensive guidance.