Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

arXiv cs.AI 06/08/26, 04:00 AM Papers

ai-safety chain-of-thought reasoning frontier-models benchmarks task-completion time-horizons

Summary

This paper measures how well frontier AI models reason without explicit chain-of-thought (CoT) across more than 30,000 questions spanning 43 benchmarks. It finds that the no-CoT 50% task-completion time horizon has been doubling roughly every year and, by the authors' median estimate, could exceed 7 minutes by 2028, which raises concerns for safety oversight that relies on CoT monitoring.

arXiv:2606.07157v1 Announce Type: new Abstract: Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
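
As a rough consistency check on the projections quoted above, the two median thresholds can be turned into an implied doubling time. This is an illustrative back-of-the-envelope calculation, not the authors' fit: it assumes purely exponential growth between 2028 and 2030, and it treats the "exceed 7 minutes" and "exceed 25 minutes" thresholds as if they were point estimates. The symbol $T_{\text{double}}$ is introduced here for illustration only.

$$
T_{\text{double}} = \frac{\Delta t \, \ln 2}{\ln\left(\mathrm{TH}_{2030} / \mathrm{TH}_{2028}\right)} = \frac{2 \ln 2}{\ln(25/7)} \approx \frac{1.386}{1.273} \approx 1.09 \text{ years}
$$

Under these assumptions the projections correspond to about 13 months per doubling, which is in line with the abstract's "doubling roughly every year".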

Original Article

Cached at: 06/08/26, 09:14 AM

# Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Source: [https://arxiv.org/abs/2606.07157](https://arxiv.org/abs/2606.07157)
Authors: [Dewi Gould](https://arxiv.org/search/cs?searchtype=author&query=Gould,+D), [Francis Rhys Ward](https://arxiv.org/search/cs?searchtype=author&query=Ward,+F+R), [Anders Cairns Woodruff](https://arxiv.org/search/cs?searchtype=author&query=Woodruff,+A+C), [Rauno Arike](https://arxiv.org/search/cs?searchtype=author&query=Arike,+R), [Josh Hills](https://arxiv.org/search/cs?searchtype=author&query=Hills,+J), [Alex Serrano](https://arxiv.org/search/cs?searchtype=author&query=Serrano,+A), [Ida Caspary](https://arxiv.org/search/cs?searchtype=author&query=Caspary,+I), [Jason Ross Brown](https://arxiv.org/search/cs?searchtype=author&query=Brown,+J+R), [Jo J. Jiao](https://arxiv.org/search/cs?searchtype=author&query=Jiao,+J+J), [Patrick Leask](https://arxiv.org/search/cs?searchtype=author&query=Leask,+P), [Twm Stone](https://arxiv.org/search/cs?searchtype=author&query=Stone,+T), [Ram Potham](https://arxiv.org/search/cs?searchtype=author&query=Potham,+R), [Ionut Gabriel Stan](https://arxiv.org/search/cs?searchtype=author&query=Stan,+I+G), [Harry Mayne](https://arxiv.org/search/cs?searchtype=author&query=Mayne,+H), [Simeon Hellsten](https://arxiv.org/search/cs?searchtype=author&query=Hellsten,+S), [Shubhorup Biswas](https://arxiv.org/search/cs?searchtype=author&query=Biswas,+S), [Ariana Azarbal](https://arxiv.org/search/cs?searchtype=author&query=Azarbal,+A), [William L. Anderson](https://arxiv.org/search/cs?searchtype=author&query=Anderson,+W+L), [Elle Najt](https://arxiv.org/search/cs?searchtype=author&query=Najt,+E), [Ryan Greenblatt](https://arxiv.org/search/cs?searchtype=author&query=Greenblatt,+R), [Julian Stastny](https://arxiv.org/search/cs?searchtype=author&query=Stastny,+J)

[View PDF](https://arxiv.org/pdf/2606.07157)

> Abstract: Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
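
The abstract defines the 50% time horizon but does not say how it is estimated. The sketch below shows one plausible way to compute such a horizon, assuming a logistic fit of success against the base-2 logarithm of human completion time, which is a common choice for time-horizon estimates. It is not taken from the paper: the function names (`fit_logistic`, `time_horizon_50`) and the toy data are hypothetical, and the paper's actual procedure may differ.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, max_iter=100, tol=1e-12):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # entries of the negative Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0.0:
            raise ValueError("degenerate data: cannot fit a slope")
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def time_horizon_50(attempts):
    """Human minutes at which the fitted success probability equals 50%.

    `attempts` is a list of (human_minutes, succeeded) pairs, one per attempt.
    """
    xs = [math.log2(minutes) for minutes, _ in attempts]
    ys = [1.0 if ok else 0.0 for _, ok in attempts]
    a, b = fit_logistic(xs, ys)
    if b >= 0.0:
        raise ValueError("success does not decrease with task length")
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0, i.e. x = -a / b.
    return 2.0 ** (-a / b)


# Hypothetical toy data (not from the paper): successes out of 10 attempts
# for tasks that take humans the given number of minutes.
success_counts = {0.25: 9, 0.5: 8, 1: 7, 2: 5, 4: 3, 8: 2, 16: 1}
attempts = [
    (minutes, i < k)
    for minutes, k in success_counts.items()
    for i in range(10)
]
horizon = time_horizon_50(attempts)
print(f"estimated 50% time horizon: {horizon:.2f} minutes")
```

The toy success rates are symmetric around the 2-minute tasks on a log scale, so the fitted 50% crossing lands at 2 minutes. The same fit could in principle be applied with o3-mini reasoning token counts in place of human minutes to obtain a reasoning token horizon, though the abstract does not confirm that the authors compute it this way.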

## Submission history

From: Dewi Gould \[[view email](https://arxiv.org/show-email/32da10a3/2606.07157)\] **\[v1\]** Fri, 5 Jun 2026 11:17:08 UTC (4,603 KB)

Similar Articles

@jietang: Recent thoughts: The Shift to Long-Horizon Tasks The most likely breakthrough this year will be in long-horizon tasks. …

Detecting misbehavior in frontier reasoning models

@WGOV: Cognitive offloading and the speedup illusion in human-AI interaction Sunny Yu, Myra Cheng, Ahmad Jabbar, Ilia Sucholut…

Do AI agents spend more time waiting for humans than actually working?

Open-World Evaluations for Measuring Frontier AI Capabilities
