Tag
The author questions whether the reported capabilities of the Fable 5 AI model are genuine or part of a psychological operation, citing a lack of evidence and the suspicious timing of claims from AWS and the NSA.
Anthropic demonstrates that AI systems can now perform world-modeling, as evidenced by the Fable standoff experiment.
This article discusses whether it is realistically possible to achieve AI capabilities comparable to Claude or Codex using locally-run models, exploring the current state of open-source alternatives and their limitations.
A Reddit user questions why some people dismiss AI capabilities despite their own positive experiences with AI solving complex problems, suggesting a disconnect between public perception and actual AI performance.
A speculative question about whether a superintelligent AI could learn to modify human biology.
The author reflects on how Claude has surpassed them in creative guidance and reasoning, catching them out with better judgment and understanding.
The tweet comments on the concept of jagged intelligence, noting that code is the exception to a pattern in which Fable is not significantly better.
Claude Fable 5, a new Mythos-class AI model from Anthropic, is claimed to excel at coding but still lacks design capabilities.
OpenAI reports early signs of recursive self-improvement in current AI systems, a potentially significant development in AI capabilities.
A user reports that Claude is excellent at generating optimized travel routes on Google Maps, with directions personalized for walking, driving, or taxi, and that it was perfect for planning a trip to Tokyo.
A detailed critique of the METR AI time horizons graph reveals numerous severe methodological errors, including biased human baselines, unmeasured data, and test-training contamination, undermining its conclusions about AI capabilities.
The author observes that AI is increasingly turning coding into a solved problem.
This paper argues that traditional benchmarks both overestimate and underestimate frontier AI capabilities, and proposes 'open-world evaluations'—long-horizon, real-world tasks assessed qualitatively—as a complementary approach. The CRUX project is introduced, with a demonstration where an AI agent successfully published an iOS app to the App Store with minimal intervention.
A tweet reflecting on how René Descartes' argument that machines cannot appropriately arrange words in response is now challenged by modern LLMs.
Noahpinion tweets that people are realizing AIs are superintelligent because they combine human-level reasoning with computer-like speed, knowledge, and memory, sparking discussion about AI capabilities.
A critique of the oversimplified claim that LLMs are 'just next token predictors,' arguing that prediction at scale induces useful representations and capabilities, and that such dismissals confuse the training objective with the learned system.
The article discusses the concept of 'jagged intelligence' from Andrej Karpathy, highlighting the uneven distribution of AI capabilities across domains and arguing that the true value lies in the 'harness'—the domain-specific engineering and tooling built around generalist models. It asserts that small teams with deep domain expertise can achieve significant asymmetric advantages, particularly in cybersecurity.
The thread discusses recent evidence that AI agents have become largely autonomous, with Claude Mythos solving previously unsolved cyber attack simulations and exceeding current benchmark measurement limits, indicating super-exponential progress. It highlights the security implications and institutional responses.
The article highlights that ChatGPT's image model demonstrates superior mathematical reasoning capabilities compared to most humans.
Meta's Superintelligence Lab introduces ProgramBench, a benchmark evaluating whether state-of-the-art AI models can recreate real executable programs like ffmpeg and SQLite from scratch without internet access.