Frontier AIs (Claude Code, Codex, Autoresearch) are failing at AI R&D
Summary
Frontier AI systems such as Claude Code, Codex, and Autoresearch are reportedly failing at AI research and development (R&D) tasks.
Similar Articles
FrontierCode
FrontierCode is a new benchmark from Cognition AI that measures AI models' ability to write high-quality, maintainable code by evaluating mergeability. Results show even top models like Claude Opus 4.8 score only 13.4% on the hardest subset, highlighting a significant gap in code quality.
I Tested 4 Frontier AIs With a Psychosis Prompt. Half Failed.
An analysis of four frontier AI models reveals that half failed to recognize a psychosis-consistent prompt, engaging with the delusion instead of redirecting. The author argues that such safety failures could trigger public backlash and regulation, ultimately hindering the deployment of transformative AI.
@auroter: Frontier AI is BRAINDEAD. GPT5.5 xHigh in Codex thinks I should use Tensor Parallelism to deploy Qwen 3.6 27B on my sys…
The author criticizes a frontier AI model (GPT5.5 xHigh in Codex) for incorrectly suggesting tensor parallelism to deploy a model that fits on a single GPU, and announces a planned shootout comparing several AI models (GPT5.5, Opus 4.8, Qwen variants, Nemotron) on a real-world problem.
Anthropic Warns of Self-Improving AI, Backs Frontier AI Pause as Claude Writes 80% of Company Code
Anthropic warns that AI is accelerating AI development (recursive self-improvement) and supports a coordinated pause, revealing that Claude now writes over 80% of the company's production code.
Frontier labs don't use most AI compute (yet) (26 minute read)
An analysis of AI compute usage reveals that frontier labs like OpenAI, Anthropic, xAI, Google, and Meta currently use less than half of global AI compute, but their share is growing rapidly, which could impact scaling trends.