The post shares early benchmark scores and evaluation metrics for an open-weight model stack run on a single AMD MI300X, noting competitive performance against closed-source alternatives.
The article discusses the growing disconnect between high AI benchmark scores and actual real-world performance, highlighting issues like consistency, latency, and context handling.
Simon Willison examines OpenAI's evaluation of GPT-5.5's cybersecurity capabilities and its performance on cyber tasks.
Claude Opus 4.7 scores lower than versions 4.6 and 4.5 on the SimpleBench evaluation.
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
MIT researchers developed a new method for identifying overconfident LLMs by measuring cross-model disagreement across similar models, rather than relying solely on self-consistency metrics. This approach better captures epistemic uncertainty and more accurately identifies unreliable predictions in high-stakes applications.
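The contrast between self-consistency and cross-model disagreement can be sketched with a toy example. This is an illustrative simplification, not the MIT researchers' actual method: the function names, the modal-answer comparison, and the pairwise disagreement metric are all assumptions made for the sketch.

```python
from collections import Counter


def self_consistency(samples):
    # Fraction of a single model's sampled answers that match its modal answer.
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)


def cross_model_disagreement(answers_by_model):
    # Fraction of model pairs whose modal answers differ.
    # answers_by_model: dict mapping model name -> list of sampled answers.
    modal = [Counter(s).most_common(1)[0][0] for s in answers_by_model.values()]
    n = len(modal)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(1 for i, j in pairs if modal[i] != modal[j]) / len(pairs)


# A confidently wrong model can look reliable under self-consistency alone:
model_a = ["42"] * 5                          # perfectly self-consistent
model_b = ["42", "42", "42", "17", "42"]      # mostly agrees with A
model_c = ["17"] * 5                          # also self-consistent, but disagrees

print(self_consistency(model_a))              # 1.0 despite possible overconfidence
print(cross_model_disagreement({"a": model_a, "b": model_b, "c": model_c}))
```

The point of the sketch: model C would pass a self-consistency check with a perfect score, while the cross-model view flags that a third of model pairs disagree.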
OpenAI submitted proof attempts for the First Proof challenge, a research-level math competition testing whether AI can produce correct, checkable proofs. The company's internal model successfully solved at least five of the ten problems, demonstrating significant progress in sustained reasoning and rigorous mathematical thinking.
Google DeepMind announced an expanded partnership with the UK AI Security Institute (AISI) via a new Memorandum of Understanding to deepen collaborative research on AI safety, security, and risk mitigation.
OpenAI publishes a framework for business leaders on using AI evaluations (evals) to measure and improve AI system performance in organizational contexts, distinguishing between frontier evals for model development and contextual evals tailored to specific business workflows.
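A contextual eval of the kind the framework describes can be sketched as a small harness that grades model outputs against workflow-specific test cases. Everything here is a hypothetical illustration: the function names, the checker-per-case design, and the pass threshold are assumptions, not part of OpenAI's framework.

```python
def run_contextual_eval(model_fn, cases, threshold=0.9):
    # Run each test case through model_fn and grade it with that case's checker.
    # cases: list of (input, checker) pairs where checker(output) -> bool.
    # Returns (pass_rate, meets_threshold).
    results = [checker(model_fn(inp)) for inp, checker in cases]
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold


# Toy "model" standing in for an LLM: routes support tickets by keyword.
def toy_router(ticket):
    return "billing" if "invoice" in ticket.lower() else "general"


cases = [
    ("Where is my invoice?", lambda out: out == "billing"),
    ("App crashes on start", lambda out: out == "general"),
]
rate, ok = run_contextual_eval(toy_router, cases, threshold=0.9)
print(rate, ok)  # 1.0 True
```

The same harness can be rerun unchanged when a new model version ships, which is the property that makes contextual evals useful for the rapid-adoption workflow the article describes.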
Intercom shares three lessons from rapidly adopting AI to transform their customer service platform: testing models early and deeply, building AI-first from the ground up rather than bolting it on, and using rigorous evaluation processes to quickly adopt new models like GPT-4.1.
OpenAI outlines 10 safety practices it actively uses and improves upon, including empirical red-teaming, alignment research, abuse monitoring, and voluntary commitments shared at the AI Seoul Summit. The company emphasizes a balanced, scientific approach to safety integrated into development from the outset.
The Frontier Model Forum announces the creation of a new AI Safety Fund with over $10 million in initial funding from major AI companies (Anthropic, Google, Microsoft, OpenAI) and philanthropic partners to support independent AI safety research. The fund will focus on developing model evaluations and red-teaming techniques to assess frontier AI systems' dangerous capabilities.