here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or linkedIn (if you dare)... or just paste them below for community shaming.
Summary
A web app that allows users to benchmark their own performance against open source LLMs on five benchmarks, with the option to add results to a CV or LinkedIn.
Similar Articles
CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement
CollabBench is a new benchmark for evaluating and training LLM agents in cooperative games, featuring diverse player simulation and a collaborative training paradigm. Experiments show 19.5% higher efficiency and 24.4% improved affective performance over base models.
The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.
I made a small open-source benchmark runner for testing OpenClaw agents on my own real workflows
A developer shares a personal open-source benchmark runner for testing OpenClaw agents on real, messy workflows. The tool allows users to define private evaluation cases, run agents in their actual workspace, and generate reports, aiming to provide more relevant signals than public benchmarks.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; top model scores only 44.48%, highlighting large room for improvement.
@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.