ai-benchmark

#ai-benchmark

AI runs Indian Grocery simulation for 30 days. GPT 5.5 nails it!

Reddit r/AI_Agents · 22h ago

DukaanBench evaluates LLMs on Indian grocery store management, testing inventory, marketing, and perishability under capital constraints; GPT 5.5 succeeded.


@xdotli: A big pain point in using AI benchmarks is encountering errors after its first release. Today, we're releasing SkillsBe…

X AI KOLs Following · 2026-06-16

SkillsBench 1.1 is released as the first audited, error-free benchmark for AI agent skills, showing rapid capability improvement from ~36% to 67% resolution rate and demonstrating that skills can substitute for model scale.


@KyleHessling1: Qwopus Coder leading the pack here! Even my old 18B frankenmerge is holding 4th in this eval above some much newer and …

X AI KOLs Timeline · 2026-06-16

A tweet thread discusses benchmark results where Qwopus Coder tops the leaderboard, while Cohere's North-Mini-Code-1.0 lands last on an agentic tool-calling board, showing surprising outcomes for smaller models.


Humans outperform AI at this highly rigorous mathematics test

Reddit r/singularity · 2026-06-14

The First Proof test evaluated four AI systems on novel research-level math problems, with the top model scoring only 6 out of 10, demonstrating that current AI still lags behind top mathematicians in rigorous reasoning.


Fable passes the "When A.I. Passes This Test, Look Out" test

Reddit r/ArtificialInteligence · 2026-06-12

Claude Fable achieves 53% on the 'Humanity's Last Exam' benchmark, surpassing the expected end-of-2025 milestone earlier than projected, indicating rapid AI progress.


@maiff20: I gave Codex the 2025 college entrance exam math paper and told it not to access the internet. Its reasoning actually used Python to compute rather than retrieving memorized information from the model. It finished in 7 minutes, scored 150 points, and even pointed out contradictions in the answers.

X AI KOLs Timeline · 2026-06-08

A user tested OpenAI Codex on the 2025 college entrance exam math paper with internet access prohibited. Codex computed its answers in Python rather than recalling memorized information, finished in 7 minutes, scored a perfect 150 points, and pointed out contradictions in the official answers.


PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)

Reddit r/LocalLLaMA · 2026-05-31

PolyRange is a new open-source benchmark for evaluating offensive AI capabilities on web targets, designed to resist contamination by generating fresh tasks per deployment and including active defense tiers.


@sashimikun_void: GPT-5.5 outperformed Claude Opus 4.8 on the DEEPSWE benchmark. Opus 4.8 takes twice as long, generates three times the …

X AI KOLs Following · 2026-05-30

GPT-5.5 outperforms Claude Opus 4.8 on the DEEPSWE benchmark, achieving higher scores with lower cost and less token bloat.


@bentossell: wait… if most people think 5.5 is better than 4.7, i assume that’s due to terminal coding benchmark… 4.8 is still outpe…

X AI KOLs Following · 2026-05-28

The tweet discusses the release of Claude Opus 4.8, which improves upon Opus 4.7 with sharper judgment and longer independent work, though it notes that version 5.5 still outperforms it on a terminal coding benchmark.


AI can finally pass the Turing Test better than a human, study warns

Reddit r/ArtificialInteligence · 2026-05-20

A new study published in PNAS shows that advanced LLMs like GPT-4.5 can pass the Turing Test, with participants finding them more human than actual humans, prompting a reevaluation of what the test measures.


Am I missing something about GPT-5.5 efficiency?

Reddit r/singularity · 2026-05-11

A user questions the token efficiency of GPT-5.5 versus GPT-5.4 in Codex, analyzing a chart from Artificial Analysis and praising Cursor's token performance.


ProgramBench (5 minute read)

TLDR AI · 2026-05-07

ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.


The International 2018: Results

OpenAI Blog · 2018-08-23

OpenAI Five competed against top professional Dota 2 teams at The International 2018, losing both matches against elite human players while demonstrating competitive gameplay and strategic depth learned through self-play.
