METR evaluated an early version of Claude Mythos

Reddit r/singularity News

Summary

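METR evaluated an early version of Claude Mythos Preview in March 2026 on its time-horizons task suite and estimated a 50%-time-horizon of at least 16 hours (95% CI 8.5 to 55 hours). That places the model at the upper end of what the current suite can measure. Only 5 of the suite's 228 tasks are estimated at 16 hours or longer, so METR cautions that estimates in this range are unstable and not robust enough for precise comparisons or extrapolations.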

[https://metr.org/time-horizons/](https://metr.org/time-horizons/)

"We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.

We believe that this task suite could still distinguish a much more capable model from current publicly-known state-of-the-art models. But we do not consider measurements at this range to be robust enough for precise quantitative comparisons or extrapolations. In principle the time-horizon methodology allows us to measure higher capability models by adding longer tasks, and we’re working on updated methods. But these are still in development; for now, we advise caution in interpreting recent time-horizon numbers."
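For readers unfamiliar with the term, a 50%-time-horizon is roughly the task length (measured in human completion time) at which a model is expected to succeed half the time. The sketch below illustrates the idea by fitting a logistic curve of success against log2 task length and solving for the 50% point. It is not METR's code: the task lengths and outcomes are invented for illustration, and the function names `fit_logistic` and `time_horizon_minutes` are hypothetical.

```python
import math

def fit_logistic(log2_minutes, successes, steps=20000, lr=0.05):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(log2_minutes)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_minutes(a, b, p=0.5):
    """Task length (in minutes) at which the fitted success probability equals p."""
    x = (math.log(p / (1.0 - p)) - a) / b  # invert the sigmoid
    return 2.0 ** x                        # convert from log2(minutes)

# Invented data: human task length in minutes, and whether the model succeeded.
task_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920]
succeeded    = [1, 1, 1, 1, 1,  0,  1,  1,   0,   1,   0,   0]

xs = [math.log2(m) for m in task_minutes]
a, b = fit_logistic(xs, succeeded)
horizon = time_horizon_minutes(a, b)
print(f"slope={b:.2f}, estimated 50% horizon ~ {horizon:.0f} minutes")
```

In this toy setup only two tasks sit above 16 hours (960 minutes), so flipping a single outcome there would shift the fitted curve noticeably. That is the same kind of instability the quoted statement describes when only 5 of 228 tasks cover the 16+ hour range.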
