Mathematician Timothy Gowers recounts how ChatGPT 5.5 Pro produced PhD-level mathematical research in about an hour with minimal human input, solving open problems from a combinatorics/additive number theory paper and prompting him to significantly revise his assessment of LLMs' mathematical capabilities.
METR evaluated an early version of Claude Mythos Preview in March 2026 using their time-horizons task suite, estimating a 50%-time-horizon of at least 16 hours, indicating the model is at the upper end of what current benchmarks can measure, with caveats about stability at longer time ranges.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting that MTP's benefits vanish when acceptance drops below roughly 50%.
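That threshold falls out of a simple cost model for draft-and-verify decoding: each step pays for the drafted tokens plus one verification pass, so low acceptance means paying the overhead without emitting extra tokens. A back-of-the-envelope sketch (the draft length and relative draft cost below are assumed values, not figures from the post):

```python
# Back-of-the-envelope model of draft-and-verify speedup vs. acceptance rate.
# k (tokens drafted per step) and draft_cost (cost of one drafted token
# relative to one full forward pass) are assumptions, not measured values.
def expected_speedup(acceptance: float, k: int = 4, draft_cost: float = 0.2) -> float:
    # Tokens emitted per verification step: the accepted prefix plus one
    # correction token, a truncated geometric series in the acceptance rate.
    tokens_per_step = sum(acceptance ** i for i in range(k + 1))
    # Cost per step: k cheap drafted tokens plus one full verification pass.
    cost_per_step = 1.0 + k * draft_cost
    # Baseline decoding emits 1 token per unit cost.
    return tokens_per_step / cost_per_step

for label, a in [("code", 0.66), ("break-even", 0.50), ("JSON", 0.08)]:
    print(f"{label:>10}: acceptance={a:.2f} -> ~{expected_speedup(a):.2f}x")
```

With these assumed constants the model lands near the reported pattern: roughly 1.4x at 66% acceptance, break-even around 50%, and a clear slowdown at 8%.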
A developer created a new benchmark called continuity-benchmarks to test AI coding agents' ability to maintain consistency with project rules during active development, addressing gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating that open-weights approaches are within striking distance of closed models.
This paper presents empirical measurements of information density in web pages from the perspective of LLM agents, using a curated benchmark of 100 URLs across five categories. It finds that structural extraction reduces token count by an average of 71.5% while preserving answer quality, and reveals an undocumented compression layer in Claude Code.
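For orientation, the kind of structural extraction being measured looks roughly like the sketch below: strip scripts, navigation, and styling, keep headings, paragraphs, and list items, and compare approximate token counts. This is an illustration, not the paper's pipeline; the tag lists and the whitespace token proxy are assumptions.

```python
# Rough illustration of structural extraction: strip boilerplate from a page
# and compare approximate token counts against the raw HTML. Not the paper's
# method; tag choices and the whitespace tokenizer are simplifying assumptions.
from bs4 import BeautifulSoup

def extract_structure(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely carry answer-relevant content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Keep headings, paragraphs, and list items as plain lines.
    lines = [el.get_text(" ", strip=True)
             for el in soup.find_all(["h1", "h2", "h3", "p", "li"])]
    return "\n".join(line for line in lines if line)

def rough_tokens(text: str) -> int:
    return len(text.split())  # whitespace count as a crude token proxy

raw = open("page.html").read()          # any saved page, as an example
extracted = extract_structure(raw)
print(f"raw ~{rough_tokens(raw)} tokens, extracted ~{rough_tokens(extracted)} tokens")
```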
The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing quality vs. speed.
Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
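As a rough pointer rather than the benchmark's actual setup, speculative decoding in recent vLLM releases is configured along the lines below; the model id, the drafting method, and the speculative_config keys are assumptions that vary across vLLM versions, and a DFlash integration would supply its own drafter.

```python
# Hedged sketch of a vLLM speculative-decoding run. The model id is a
# placeholder and the speculative_config keys follow recent vLLM docs but
# may differ by version; DFlash would replace the ngram drafter shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b",  # placeholder, not a verified model id
    speculative_config={
        "method": "ngram",             # stand-in drafting method
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```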
A new IR benchmark release addresses the broken state of text benchmarking on DL19/DL20/BEIR, enabling meaningful measurement of improvements from current-era training methods.
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
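The retrieval problem such a benchmark targets reduces to ranking skill descriptions against a task query. A minimal sketch under assumptions (the hash-based embed() stand-in below is not SkillRet's encoder; a fine-tuned sentence encoder would take its place):

```python
# Minimal sketch of skill retrieval by embedding similarity. embed() is a
# hash-based bag-of-words stand-in, not SkillRet's encoder.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    dim = 512
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)   # unit-normalize for cosine scoring

skills = [
    "parse CSV files and summarize columns",
    "send a Slack message to a channel",
    "resize and crop images in a folder",
]
skill_vecs = embed(skills)
query_vec = embed(["summarize the columns of this spreadsheet"])[0]
scores = skill_vecs @ query_vec             # cosine similarity (unit-norm vectors)
for idx in np.argsort(-scores)[:2]:
    print(f"{scores[idx]:.3f}  {skills[idx]}")
```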
This paper introduces DataDignity, a framework and benchmark (FakeWiki) for pinpoint provenance, aiming to identify the specific training data sources that support an LLM's response. It proposes ScoringModel and SteerFuse methods to improve attribution accuracy over standard retrieval baselines.
This paper presents a comprehensive benchmark for evaluating adversarial attacks and defenses in Graph Neural Networks, highlighting the need for standardized and fair experimental protocols.
This paper introduces Sequor, a new benchmark for evaluating how well AI models follow constraints in long, multi-turn conversations. It highlights that current models struggle significantly with maintaining instruction adherence over extended interactions.
This paper introduces a benchmark for semantic segmentation in low-resource dialectal Arabic and proposes a model that improves performance on conversational speech compared to standard baselines.
This paper introduces a unified benchmark to evaluate the robustness of Graph Neural Networks on noisy, text-derived knowledge graphs and the effectiveness of graph construction methods in the biomedical domain.
This paper introduces IRC-Bench, a benchmark for recognizing implicit entities in first-person reminiscences using contextual cues rather than explicit mentions. It evaluates various LLM and retrieval configurations, finding QLoRA-adapted Llama 3.1 8B to be the top performer in open-world settings.
This paper identifies a failure mode called Entity Identity Confusion in multimodal knowledge editing, where models incorrectly bind image-entity relationships. It introduces EC-Bench to diagnose this issue and proposes mitigation strategies for faithful editing.
Introduces TableVista, a comprehensive benchmark for evaluating foundation models on multimodal table reasoning under visual and structural complexity, comprising 3,000 problems expanded into 30,000 multimodal samples. Evaluation of 29 models reveals performance degradation on complex layouts and vision-only settings.
XL-SafetyBench is a benchmark of 5,500 test cases across 10 country-language pairs to evaluate LLM safety and cultural sensitivity, distinguishing jailbreak robustness from cultural awareness.