model-evaluation

#model-evaluation

You Don't Need to Run Every Eval

arXiv cs.LG · yesterday

This research paper demonstrates that the matrix of frontier AI model scores across 133 benchmarks is approximately rank 2, meaning that just two latent factors explain over 90% of the variation. The authors introduce BenchPress, a logit-space matrix completion method that predicts a model's full scorecard from just a few benchmarks, significantly reducing the cost of evaluation.
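As a rough illustration of the idea (not the authors' BenchPress implementation), the sketch below fits a rank-2 basis to logit-transformed scores from fully evaluated models, then estimates a new model's two latent factors from a handful of observed benchmarks by least squares. The function names, the clipping constant, and the SVD-plus-least-squares approach are all assumptions made for this example.

```python
import numpy as np

EPS = 1e-3  # assumed clipping constant so the logit stays finite


def logit(p):
    return np.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_basis(scores, rank=2):
    """scores: (n_models, n_benchmarks) array of accuracies in (0, 1)."""
    z = logit(np.clip(scores, EPS, 1.0 - EPS))
    mean = z.mean(axis=0)
    # Rows of vt are benchmark-space directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(z - mean, full_matrices=False)
    return mean, vt[:rank]


def predict_scorecard(mean, basis, observed_idx, observed_scores):
    """Estimate every benchmark score from the few that were actually run."""
    z_obs = logit(np.clip(observed_scores, EPS, 1.0 - EPS)) - mean[observed_idx]
    # Solve for the latent factors using only the observed benchmarks.
    coef, *_ = np.linalg.lstsq(basis[:, observed_idx].T, z_obs, rcond=None)
    return sigmoid(mean + coef @ basis)
```

With real scorecards the fit would only be approximate, so held-out benchmarks would be needed to check prediction error before trusting the completed entries.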

#model-evaluation

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Hacker News Top · 2d ago

This technical report introduces VibeThinker-3B, a 3B parameter dense model that achieves frontier-level reasoning performance on benchmarks like AIME26 and LiveCodeBench, matching or exceeding much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.

#model-evaluation

@FinanceYF5: 2/ He never looks at benchmark numbers when evaluating models. The only thing he truly cares about is: [The shape of the model's thinking] — How deeply can it understand user intent? — How far can it iterate in its thinking? — Does it make you feel like there's someone on the other side? Fable gave him this sense of aliveness. 'It feels like returning to 2023'

X AI KOLs Following · 4d ago

This tweet describes an evaluator who never looks at benchmark numbers and instead judges a model by the 'shape of its thinking': how deeply it understands user intent, how far it can iterate in its thinking, and whether it feels like there is someone on the other side. Fable gave him that sense of aliveness, which he likened to returning to 2023.

#model-evaluation

Building independent LLM drift detection - sharing the methodology, looking for feedback on the approach

Reddit r/artificial · 6d ago

The author shares a methodology for building an external LLM drift detection system that continuously probes model behavior (schema adherence, instruction-following, refusal rates, etc.) to catch silent degradations in API performance, and invites feedback on the approach, pricing, and use cases.

#model-evaluation

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

arXiv cs.LG · 2026-06-18

This paper systematically evaluates reject inference methods in credit scoring and identifies a failure mode where accuracy improves while recall collapses, creating an illusion of improvement while rejection quality deteriorates. It proposes a controlled exploration strategy that breaks the feedback loop and shows that even minimal exploration rates are sufficient to diagnose the problem.
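To make the failure mode concrete, here is a small worked example with made-up confusion-matrix counts (they are not taken from the paper). It assumes recall is measured on the defaulter class, which the summary does not specify; the function names are likewise only illustrative.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all applicants classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of true positives (here: assumed to be defaulters) that are caught."""
    return tp / (tp + fn)


# Hypothetical counts for 1,000 applicants, 50 of whom actually default.
before = dict(tp=40, fp=50, tn=900, fn=10)
after = dict(tp=10, fp=5, tn=945, fn=40)

for name, cm in (("before", before), ("after", after)):
    acc = accuracy(**cm)
    rec = recall(cm["tp"], cm["fn"])
    print(f"{name}: accuracy={acc:.3f} recall={rec:.3f}")
```

In this toy case accuracy rises from 0.940 to 0.955 while recall falls from 0.800 to 0.200, because the rare class contributes little to overall accuracy.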

#model-evaluation

Running local models is good now

Hacker News Top · 2026-06-16

The author reports that running AI models locally has become surprisingly good: recent releases such as GPT-OSS and Gemma 4 enable local agentic coding at roughly 75% of the accuracy of frontier models, a significant improvement over just a few months ago.

#model-evaluation

I built an arena where LLMs sword-fight with real physics. You decide which part of the blade is sharp, vote blind, and free OpenRouter models battle for Elo. Llama 3.3 is currently stabbing GPT-OSS in the face.

Reddit r/AI_Agents · 2026-06-12

A new arena lets LLMs control physics ragdolls in weapon duels where users define weapon damage zones, vote blind, and models battle for Elo. Free models like Llama 3.3 and GPT-OSS compete, with self-hostable infrastructure.
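For readers unfamiliar with Elo, the following is a minimal sketch of the standard rating update applied after one blind vote. The post does not describe the arena's actual rating code, so the K-factor of 32, the 400-point scale, and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A won the vote, 0.0 if A lost, 0.5 for a draw."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; the first wins one duel.
print(update_elo(1500, 1500, 1.0))
```

Because the two adjustments are equal and opposite, the total rating in the pool stays constant from match to match.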

#model-evaluation

Prefill Awareness in Large Language Models

arXiv cs.AI · 2026-06-12

This paper investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited (prefill awareness). The study finds that models like Claude Opus 4.5 exhibit substantial prefill awareness, detecting tampered prefills in up to 35% of cases without false positives, which could compromise the validity of prefill-based safety evaluations.

#model-evaluation

LLMs Can Better Capture Human Judgments--With the Right Prompts

arXiv cs.CL · 2026-06-12

This paper presents simple prompting strategies that help large language models better capture the full distribution of human judgments, improving alignment on moral scenarios and beliefs. The authors show that asking models to report standard deviations and response proportions, along with ensuring scenario clarity, yields better agreement with human responses.

#model-evaluation

A prior-free blind detection of information leakage from model predictions

arXiv cs.LG · 2026-06-11

This paper presents a decision-theoretic framework for detecting data leakage in predictive models using only model outputs and outcomes, proving that certain leakage types can be identified without external benchmarks or training code.

#model-evaluation

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

arXiv cs.LG · 2026-06-09

This paper introduces Item Response Scaling Laws (IRSL), which integrate Item Response Theory to estimate neural scaling laws efficiently, reducing the number of evaluation questions required by 99.9% while achieving comparable accuracy.

#model-evaluation

@auroter: Frontier AI is BRAINDEAD. GPT5.5 xHigh in Codex thinks I should use Tensor Parallelism to deploy Qwen 3.6 27B on my sys…

X AI KOLs Following · 2026-06-08

The author criticizes frontier AI, citing GPT5.5 xHigh in Codex, which incorrectly suggested Tensor Parallelism for a model that fits on a single GPU, and announces a planned shootout comparing several AI models (GPT5.5, Opus 4.8, Qwen variants, Nemotron) on a real-world problem.

#model-evaluation

@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...

X AI KOLs Following · 2026-06-07

According to the DeepSeek V4 technical report's evaluation of mainstream LLMs, Gemini 3.1 Pro is considered to have the strongest world knowledge, but users generally find it hard to use because the model does not proactively use search tools.

#model-evaluation

@0xLogicrw: Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system. The evaluation cross-tests 9 large models with three frameworks: Hermes, OpenClaw, and QwenPaw, covering 150 real-world tasks and 4050 ...

X AI KOLs Timeline · 2026-06-05

Alibaba Tongyi Lab has launched PawBench v1.0, an agent evaluation benchmark that for the first time integrates base models and runtime frameworks into a unified evaluation system. It cross-tests 9 models with 3 frameworks (Hermes, OpenClaw, and QwenPaw) on 150 real-world tasks, finds that framework design significantly affects agent performance, and proposes four design principles.

#model-evaluation

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Hugging Face Daily Papers · 2026-06-04

This paper defines cultural diversity as a new evaluation dimension for multi-agent systems, measuring pairwise differences in responses to the World Values Survey. Experiments show current models lack the value diversity of human societies and that mixing backbones can improve both alignment and diversity, but interaction reduces diversity.

#model-evaluation

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Hugging Face Daily Papers · 2026-06-03

This paper introduces Self-Evaluation Elicitation (SEE), which uses calibration-coupled reinforcement learning and masked distillation to elicit latent judge calibration in base LLMs with minimal data, improving calibration across benchmarks while preserving answer quality.

#model-evaluation

opus 4.8 is still very much blind - EyeBench-V3 visual benchmark (similar to IBench)

Reddit r/singularity · 2026-06-01

The EyeBench-V3 visual benchmark, which is similar to IBench, finds that Claude Opus 4.8 still fails basic vision tasks. The benchmark is introduced via a Twitter thread by Adonis Singh.

#model-evaluation

@rohit4verse: 2 months ago, I wrote "The Harness Is Everything" 1.3M views. Last week's Life-Harness paper: 116 of 126 model-environm…

X AI KOLs Timeline · 2026-05-31

The Life-Harness paper shows that patching the evaluation harness alone, without modifying the model, improved performance in 116 of 126 setups, achieving an 88.5% mean lift across 18 backbones.

#model-evaluation

@nick_kango: One more task to add to my twitter benchmark collection:) Btw, Opus 4.8 and all the SOTA models passed when i tried tha…

X AI KOLs Timeline · 2026-05-30

Nick Kang adds a new task to his Twitter benchmark collection; Claude Opus 4.8 and other SOTA models pass, while Sonnet 4.6 and Grok 4.3 fail. Alfin remarks on Opus 4.8's dangerous capabilities.

#model-evaluation

Step 3.7 Flash passes the car wash test

Reddit r/LocalLLaMA · 2026-05-29

The Step 3.7 Flash model has passed the car wash test, a successful result on one specific benchmark.
