Mathematician Timothy Gowers recounts how ChatGPT 5.5 Pro produced PhD-level mathematical research in about an hour with minimal human input, solving open problems from a combinatorics/additive number theory paper and prompting him to significantly revise his assessment of LLMs' mathematical capabilities.
METR evaluated an early version of Claude Mythos Preview in March 2026 using their time-horizons task suite, estimating a 50%-time-horizon of at least 16 hours, indicating the model is at the upper end of what current benchmarks can measure, with caveats about stability at longer time ranges.
A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on an M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting that MTP's benefits vanish when acceptance drops below roughly 50%.
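That threshold falls out of a simple cost model for draft-and-verify decoding: each step pays for the drafted tokens plus one verification pass, so low acceptance means paying the overhead without emitting extra tokens. A back-of-the-envelope sketch (the draft length and relative draft cost below are assumed values, not figures from the post):

```python
# Back-of-the-envelope model of draft-and-verify speedup vs. acceptance rate.
# k (tokens drafted per step) and draft_cost (cost of one drafted token
# relative to one full forward pass) are assumptions, not measured values.
def expected_speedup(acceptance: float, k: int = 4, draft_cost: float = 0.2) -> float:
    # Tokens emitted per verification step: the accepted prefix plus one
    # correction token, a truncated geometric series in the acceptance rate.
    tokens_per_step = sum(acceptance ** i for i in range(k + 1))
    # Cost per step: k cheap drafted tokens plus one full verification pass.
    cost_per_step = 1.0 + k * draft_cost
    # Baseline decoding emits 1 token per unit cost.
    return tokens_per_step / cost_per_step

for label, a in [("code", 0.66), ("break-even", 0.50), ("JSON", 0.08)]:
    print(f"{label:>10}: acceptance={a:.2f} -> ~{expected_speedup(a):.2f}x")
```

With these assumed constants the model lands near the reported pattern: roughly 1.4x at 66% acceptance, break-even around 50%, and a clear slowdown at 8%.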
A developer created a new benchmark called continuity-benchmarks to test AI coding agents' ability to maintain consistency with project rules during active development, addressing gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating that open-weights approaches are within striking distance of closed models.
This paper presents empirical measurements of information density in web pages from the perspective of LLM agents, using a curated benchmark of 100 URLs across five categories. It finds that structural extraction reduces token count by an average of 71.5% while preserving answer quality, and reveals an undocumented compression layer in Claude Code.
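For orientation, the kind of structural extraction being measured looks roughly like the sketch below: strip scripts, navigation, and styling, keep headings, paragraphs, and list items, and compare approximate token counts. This is an illustration, not the paper's pipeline; the tag lists and the whitespace token proxy are assumptions.

```python
# Rough illustration of structural extraction: strip boilerplate from a page
# and compare approximate token counts against the raw HTML. Not the paper's
# method; tag choices and the whitespace tokenizer are simplifying assumptions.
from bs4 import BeautifulSoup

def extract_structure(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely carry answer-relevant content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Keep headings, paragraphs, and list items as plain lines.
    lines = [el.get_text(" ", strip=True)
             for el in soup.find_all(["h1", "h2", "h3", "p", "li"])]
    return "\n".join(line for line in lines if line)

def rough_tokens(text: str) -> int:
    return len(text.split())  # whitespace count as a crude token proxy

raw = open("page.html").read()          # any saved page, as an example
extracted = extract_structure(raw)
print(f"raw ~{rough_tokens(raw)} tokens, extracted ~{rough_tokens(extracted)} tokens")
```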
The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing quality vs. speed.
Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
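As a rough pointer rather than the benchmark's actual setup, speculative decoding in recent vLLM releases is configured along the lines below; the model id, the drafting method, and the speculative_config keys are assumptions that vary across vLLM versions, and a DFlash integration would supply its own drafter.

```python
# Hedged sketch of a vLLM speculative-decoding run. The model id is a
# placeholder and the speculative_config keys follow recent vLLM docs but
# may differ by version; DFlash would replace the ngram drafter shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b",  # placeholder, not a verified model id
    speculative_config={
        "method": "ngram",             # stand-in drafting method
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```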
A new IR benchmark release addresses the broken state of text benchmarking on DL19/DL20/BEIR, enabling meaningful measurement of improvements from current-era training methods.
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
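The retrieval problem such a benchmark targets reduces to ranking skill descriptions against a task query. A minimal sketch under assumptions (the hash-based embed() stand-in below is not SkillRet's encoder; a fine-tuned sentence encoder would take its place):

```python
# Minimal sketch of skill retrieval by embedding similarity. embed() is a
# hash-based bag-of-words stand-in, not SkillRet's encoder.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    dim = 512
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)   # unit-normalize for cosine scoring

skills = [
    "parse CSV files and summarize columns",
    "send a Slack message to a channel",
    "resize and crop images in a folder",
]
skill_vecs = embed(skills)
query_vec = embed(["summarize the columns of this spreadsheet"])[0]
scores = skill_vecs @ query_vec             # cosine similarity (unit-norm vectors)
for idx in np.argsort(-scores)[:2]:
    print(f"{scores[idx]:.3f}  {skills[idx]}")
```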
This paper introduces DataDignity, a framework and benchmark (FakeWiki) for pinpoint provenance, aiming to identify the specific training data sources that support an LLM's response. It proposes ScoringModel and SteerFuse methods to improve attribution accuracy over standard retrieval baselines.
This paper presents a comprehensive benchmark for evaluating adversarial attacks and defenses in Graph Neural Networks, highlighting the need for standardized and fair experimental protocols.
This paper introduces Sequor, a new benchmark for evaluating how well AI models follow constraints in long, multi-turn conversations. It highlights that current models struggle significantly with maintaining instruction adherence over extended interactions.
This paper introduces a benchmark for semantic segmentation in low-resource dialectal Arabic and proposes a model that improves performance on conversational speech compared to standard baselines.
This paper introduces a unified benchmark to evaluate the robustness of Graph Neural Networks on noisy, text-derived knowledge graphs and the effectiveness of graph construction methods in the biomedical domain.
This paper introduces IRC-Bench, a benchmark for recognizing implicit entities in first-person reminiscences using contextual cues rather than explicit mentions. It evaluates various LLM and retrieval configurations, finding QLoRA-adapted Llama 3.1 8B to be the top performer in open-world settings.
This paper identifies a failure mode called Entity Identity Confusion in multimodal knowledge editing, where models incorrectly bind image-entity relationships. It introduces EC-Bench to diagnose this issue and proposes mitigation strategies for faithful editing.
Introduces TableVista, a comprehensive benchmark for evaluating foundation models on multimodal table reasoning under visual and structural complexity, comprising 3,000 problems expanded into 30,000 multimodal samples. Evaluation of 29 models reveals performance degradation on complex layouts and vision-only settings.
XL-SafetyBench is a benchmark of 5,500 test cases across 10 country-language pairs to evaluate LLM safety and cultural sensitivity, distinguishing jailbreak robustness from cultural awareness.