Here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or LinkedIn (if you dare)... or just paste them below for community shaming.

Reddit r/LocalLLaMA 05/28/26, 10:39 PM Tools

Summary

A web app that allows users to benchmark their own performance against open source LLMs on five benchmarks, with the option to add results to a CV or LinkedIn.

[https://benchmark-yourself.streamlit.app/](https://benchmark-yourself.streamlit.app/)

BBQ is 🔥

* Rule 4: Limit Self-Promotion - this is not self promotion
* The 1/10th rule is a good guideline: self-promotion should not be more than 10% of your content. - my content is high quality and diversified
* Affiliation must be disclosed: No engagement farming, No “I found this..”, etc. - I am not affiliated with streamline or oMLX or anything.

Similar Articles

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

arXiv cs.CL

CollabBench is a new benchmark for evaluating and training LLM agents in cooperative games, featuring diverse player simulation and a collaborative training paradigm. Experiments show 19.5% higher efficiency and 24.4% improved affective performance over base models.

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv cs.CL

A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.

I made a small open-source benchmark runner for testing OpenClaw agents on my own real workflows

Reddit r/openclaw

A developer shares a personal open-source benchmark runner for testing OpenClaw agents on real, messy workflows. The tool allows users to define private evaluation cases, run agents in their actual workspace, and generate reports, aiming to provide more relevant signals than public benchmarks.

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

arXiv cs.CL

CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; top model scores only 44.48%, highlighting large room for improvement.

@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…

X AI KOLs Following

ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.
