Tag
This research paper demonstrates that the scores of frontier AI models across 133 benchmarks are approximately rank-2, meaning only two latent factors explain over 90% of variation. The authors introduce BenchPress, a logit-space matrix completion method that predicts a model's full scorecard from just a few benchmarks, significantly reducing the cost of evaluation.
This technical report introduces VibeThinker-3B, a 3B parameter dense model that achieves frontier-level reasoning performance on benchmarks like AIME26 and LiveCodeBench, matching or exceeding much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.
This tweet emphasizes that when evaluating AI models, one should not only look at benchmark numbers but focus on the model's 'shape of thinking' — the depth of understanding user intent, the ability to iterate in thinking, and the feeling of 'someone on the other side'. The author believes Fable excels in this regard, reminiscent of the experience in 2023.
The author shares a methodology for building an external LLM drift detection system that continuously probes model behavior (schema adherence, instruction-following, refusal rates, etc.) to catch silent degradations in API performance, and invites feedback on the approach, pricing, and use cases.
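One of the probes mentioned above, schema adherence, can be sketched as follows. This is a minimal illustration and not the author's system: the required keys, the `max_drop` threshold, and the sample responses are all hypothetical.

```python
import json

# Hypothetical schema: every response must be a JSON object with these keys.
REQUIRED_KEYS = {"answer", "confidence"}


def adheres_to_schema(raw: str) -> bool:
    """Return True if the raw response parses as a JSON object with the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


def adherence_rate(responses: list[str]) -> float:
    """Fraction of responses that satisfy the schema."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(adheres_to_schema(r) for r in responses) / len(responses)


def has_drifted(baseline: list[str], current: list[str], max_drop: float = 0.10) -> bool:
    """Flag drift when adherence falls by more than max_drop (an assumed threshold)."""
    return adherence_rate(baseline) - adherence_rate(current) > max_drop


# Made-up probe results: the current batch returns prose in 3 of 10 cases.
baseline = ['{"answer": "4", "confidence": 0.9}'] * 10
current = ['{"answer": "4", "confidence": 0.9}'] * 7 + ["Sure! The answer is 4."] * 3

print(adherence_rate(baseline), adherence_rate(current), has_drifted(baseline, current))
```

A fixed threshold is the simplest possible rule; a real monitor would likely need repeated sampling to separate drift from ordinary run-to-run noise.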
This paper systematically evaluates reject inference methods in credit scoring and identifies a failure mode where accuracy improves while recall collapses, creating an illusion of improvement while rejection quality deteriorates. It proposes a controlled exploration strategy that breaks the feedback loop and shows that even minimal exploration rates are sufficient to diagnose the problem.
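The failure mode described above, accuracy rising while recall falls, is easy to reproduce on imbalanced data. The sketch below uses made-up labels and assumes the positive class (1) means "default"; it is not taken from the paper's experiments.

```python
def accuracy_and_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute accuracy and positive-class recall from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical portfolio: 10 defaulters (1) followed by 90 good loans (0).
y_true = [1] * 10 + [0] * 90

# Model A flags 6 of 10 defaulters but raises 12 false alarms.
model_a = [1] * 6 + [0] * 4 + [1] * 12 + [0] * 78
# Model B flags only 1 defaulter and raises no false alarms.
model_b = [1] * 1 + [0] * 9 + [0] * 90

acc_a, rec_a = accuracy_and_recall(y_true, model_a)
acc_b, rec_b = accuracy_and_recall(y_true, model_b)
print(f"A: accuracy={acc_a:.2f} recall={rec_a:.2f}")
print(f"B: accuracy={acc_b:.2f} recall={rec_b:.2f}")
```

Model B looks better on accuracy alone, yet it misses nine of the ten defaulters, which is why accuracy by itself can mask a collapse in recall.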
The author reports that running local AI models has become surprisingly good, with recent releases like GPT-OSS and Gemma 4 enabling agentic coding locally at about 75% of the accuracy of frontier models, a significant improvement from just months ago.
A new arena lets LLMs control physics ragdolls in weapon duels where users define weapon damage zones, vote blind, and models battle for Elo. Free models like Llama 3.3 and GPT-OSS compete, with self-hostable infrastructure.
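For readers unfamiliar with Elo, the standard update rule is sketched below. The arena's actual rating scheme is not described in the summary, so the K-factor of 32 and the 400-point scale are assumptions taken from the conventional chess formulation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one decisive match."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two evenly rated models: the winner gains exactly what the loser gives up.
new_winner, new_loser = update_elo(1500.0, 1500.0)
print(new_winner, new_loser)
```

An upset moves ratings further than an expected result, because `delta` grows as the winner's expected score shrinks.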
This paper investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited (prefill awareness). The study finds that models like Claude Opus 4.5 exhibit substantial prefill awareness, detecting tampered prefills in up to 35% of cases without false positives, which could compromise the validity of prefill-based safety evaluations.
This paper presents simple prompting strategies that help large language models better capture the full distribution of human judgments, improving alignment on moral scenarios and beliefs. The authors show that asking models to report standard deviations and response proportions, along with ensuring scenario clarity, yields better agreement with human responses.
This paper presents a decision-theoretic framework for detecting data leakage in predictive models using only model outputs and outcomes, proving that certain leakage types can be identified without external benchmarks or training code.
This paper introduces Item Response Scaling Laws (IRSL), which integrate Item Response Theory to efficiently estimate neural scaling laws, reducing the number of required evaluation questions by 99.9% while achieving comparable accuracy.
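As background, Item Response Theory models the probability of a correct answer as a function of a test-taker's ability and per-item parameters. The sketch below uses the common two-parameter logistic (2PL) form. Whether IRSL uses this exact variant is not stated in the summary, and the item parameters here are invented for illustration.

```python
import math


def p_correct(theta: float, discrimination: float, difficulty: float) -> float:
    """2PL model: probability that ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def expected_accuracy(theta: float, items: list[tuple[float, float]]) -> float:
    """Mean predicted accuracy over items given as (discrimination, difficulty) pairs."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)


# Hypothetical item bank: one easy, one medium, one hard question.
items = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]

for theta in (-1.0, 0.0, 1.0):
    print(theta, round(expected_accuracy(theta, items), 3))
```

Once item parameters are fitted, a model's ability can be estimated from a small subset of items, which is the general idea behind using IRT to cut evaluation cost.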
The author criticizes a frontier AI model (GPT5.5 xHigh) for incorrectly suggesting Tensor Parallelism for a model that fits on a single GPU, and announces a planned shootout comparing several AI models (GPT5.5, Opus 4.8, Qwen variants, Nemotron) on a real-world problem.
According to the DeepSeek V4 technical report's evaluation of mainstream LLMs, Gemini 3.1 Pro is considered to have the strongest world knowledge, but users generally find it hard to use because the model does not proactively use search tools.
Alibaba Tongyi Lab launches PawBench v1.0, an agent evaluation benchmark that for the first time integrates base models and runtime frameworks into a unified evaluation system, covering 9 models and 3 frameworks with 150 tasks. It finds that framework design significantly affects agent performance and proposes four design principles.
This paper defines cultural diversity as a new evaluation dimension for multi-agent systems, measuring pairwise differences in responses to the World Values Survey. Experiments show current models lack the value diversity of human societies and that mixing backbones can improve both alignment and diversity, but interaction reduces diversity.
This paper introduces Self-Evaluation Elicitation (SEE), which uses calibration-coupled reinforcement learning and masked distillation to elicit latent judge calibration in base LLMs with minimal data, improving calibration across benchmarks while preserving answer quality.
The EyeBench-V3 visual benchmark evaluates Claude Opus 4.8 and finds that it still fails basic vision tasks, a result similar to that of IBench. The benchmark is introduced in a Twitter thread by Adonis Singh.
The Life-Harness paper shows that patching the evaluation harness alone, without modifying the model, improved performance in 116 of 126 setups, achieving an 88.5% mean lift across 18 backbones.
Nick Kang adds a new task to his Twitter benchmark collection; Claude Opus 4.8 and other SOTA models pass, while Sonnet 4.6 and Grok 4.3 fail. Alfin remarks on Opus 4.8's dangerous capabilities.
The Step 3.7 Flash model has passed the car wash test, a benchmark task.