Tag
M3 achieves solid benchmark scores but impresses with its ability to perform risk assessment and pre-mortem analysis before making code changes, highlighting a more cautious and thorough approach to refactoring in messy legacy repos.
A detailed critique of the METR AI time horizons graph reveals numerous severe methodological errors, including biased human baselines, unmeasured data, and test-training contamination, undermining its conclusions about AI capabilities.