Tag
This paper presents an applied evaluation of foundation models for time series forecasting against supervised approaches across four operational domains, and proposes a Complexity Router that assigns each series to the model class that best balances accuracy and inference cost.
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
Apex-Testing, a benchmark for evaluating agentic coding models on real private GitHub repositories, has been updated with recent models and detailed metrics, including cost, time, and an Elo-based leaderboard.
The author tests multiple coding agent harnesses (GitHub Copilot, Pi, Claude Code, OpenCode) using the same Qwen3.6 27B model, finding that harness design significantly impacts performance, with OpenCode excelling at web searches and web development, and GitHub Copilot struggling with file editing tools.
This paper introduces a population coupling trend and h-field diagnostic to analyze the relationship between coding and reasoning capabilities across frontier AI models, finding that capabilities cooperate but with varying emphasis per lab. It provides a playbook for measurement and predicts benchmark saturation trends.
A tweet evaluating the Grok Build model, praising its speed, instruction following, and logical reasoning, but noting sloppy code output.
The article argues that the difference between impressive and useless AI often lies not in the model itself but in the surrounding workflow—context, memory, tool access, and orchestration. It suggests that workflow architecture may become a more significant competitive advantage than raw model capability.
A tool that tracks the Elo history of major AI models on the LMSYS Arena leaderboard, revealing hidden trends such as performance degradation and upgrades over time.
Anthropic's Claude Mythos Preview model has been evaluated by XBOW and UK AISI, showing unprecedented autonomous cybersecurity capabilities, including solving end-to-end cyber ranges and finding thousands of vulnerabilities. The announcement emphasizes the need to prepare for rapidly advancing AI capabilities in cybersecurity.
The article argues that perceived degradation in coding agents is often due to untracked changes in agent instances and configuration rather than the underlying model itself, highlighting a critical lack of baseline measurement in current AI agent workflows.
Wink Engineering evaluates the efficacy of neural super-resolution as a pre-filter for license plate OCR, concluding that it fails to improve accuracy and often leads to hallucinated characters compared to training directly on low-resolution data.
The author launches 'AI IQ', a new tool that scores frontier AI models on the human IQ scale, providing visualizations of model performance, intelligence costs, and EQ comparisons rather than standard leaderboard tables.
The author critiques the stylistic clarity and recognizable 'tics' of frontier models, noting this reduces their 'aura,' but argues that claims about their lack of analytical or informational value are largely incorrect.
An opinion piece suggesting that AI teams will increasingly focus on 'harness engineering' and advocating for a review article on the framework.
Shares early benchmark scores and evaluation metrics for an open-weight model stack run on a single AMD MI300X, noting competitive performance against closed-source alternatives.
The article discusses the growing disconnect between high AI benchmark scores and actual real-world performance, highlighting issues like consistency, latency, and context handling.
Simon Willison evaluates the cyber capabilities of OpenAI's GPT-5.5, examining its performance on cybersecurity tasks.
Claude Opus 4.7 shows decreased performance compared to versions 4.6 and 4.5 on the SimpleBench evaluation.
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
MIT researchers developed a new method for identifying overconfident LLMs by measuring cross-model disagreement across similar models, rather than relying solely on self-consistency metrics. This approach better captures epistemic uncertainty and more accurately identifies unreliable predictions in high-stakes applications.
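To make the contrast in the last summary concrete, here is a minimal sketch of how a self-consistency score and a cross-model disagreement score can differ. It is not the MIT researchers' method, whose details are not given here. The function names, the exact-match comparison, and the sample answers are all assumptions chosen for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def self_consistency(samples: list[str]) -> float:
    """Fraction of one model's samples that match its most common answer."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


def cross_model_disagreement(target_answer: str, peer_answers: list[str]) -> float:
    """Fraction of peer models whose answer differs from the target model's.

    Exact match after normalization is an illustrative assumption; a real
    system would likely need a semantic comparison.
    """
    if not peer_answers:
        raise ValueError("peer_answers must not be empty")
    target = normalize(target_answer)
    differing = sum(1 for a in peer_answers if normalize(a) != target)
    return differing / len(peer_answers)


if __name__ == "__main__":
    # Hypothetical data: the target model repeats the same answer every time...
    target_samples = ["Paris", "paris", " Paris "]
    # ...but most similar models (also hypothetical) answer differently.
    peers = ["Lyon", "Lyon", "Paris", "Marseille"]

    print(self_consistency(target_samples))            # 1.0
    print(cross_model_disagreement("Paris", peers))    # 0.75
```

In this toy case the target model looks fully confident by self-consistency alone, while three of four peers disagree with it. That gap is the kind of signal the summary describes as a flag for possible overconfidence.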