DukaanBench evaluates LLMs on managing an Indian grocery store, testing inventory, marketing, and perishability handling under capital constraints; GPT-5.5 succeeded.
SkillsBench 1.1 is released as the first audited, error-free benchmark for AI agent skills, showing rapid capability improvement from ~36% to 67% resolution rate and demonstrating that skills can substitute for model scale.
A tweet thread discusses benchmark results where Qwopus Coder tops the leaderboard, while Cohere's North-Mini-Code-1.0 lands last on an agentic tool-calling board, showing surprising outcomes for smaller models.
The First Proof test evaluated four AI systems on novel research-level math problems, with the top model scoring only 6 out of 10, demonstrating that current AI still lags behind top mathematicians in rigorous reasoning.
Claude Fable scores 53% on the Humanity's Last Exam benchmark, reaching the milestone projected for the end of 2025 ahead of schedule, a sign of rapid AI progress.
A user tested OpenAI Codex on the 2025 college entrance exam math paper, with internet access disabled and Python limited to actual computation. It completed the test in 7 minutes, scored a perfect 150 points, and pointed out contradictions in the official answers.
PolyRange is a new open-source benchmark for evaluating offensive AI capabilities on web targets, designed to resist contamination by generating fresh tasks per deployment and including active defense tiers.
GPT-5.5 outperforms Claude Opus 4.8 on the DEEPSWE benchmark, achieving higher scores with lower cost and less token bloat.
A tweet discusses the release of Claude Opus 4.8, which improves on Opus 4.7 with sharper judgment and longer independent work, though it notes that version 5.5 still outperforms it on a terminal coding benchmark.
A new study published in PNAS shows that advanced LLMs like GPT-4.5 can pass the Turing Test, with participants finding them more human than actual humans, prompting a reevaluation of what the test measures.
A user questions the token efficiency of GPT-5.5 versus GPT-5.4 in Codex, analyzing a chart from Artificial Analysis and praising Cursor's token performance.
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
OpenAI Five competed against top professional Dota 2 teams at The International 2018, losing both matches against elite human players while demonstrating competitive gameplay and strategic depth developed through self-play.