@dwizzzleMSFT: http://cybergym.io just updated its leaderboard and MDASH is now #1 using a new multi-model approach. Huge credit to Ta…
Summary
Microsoft's new multi-model agentic security system (MDASH) tops the CyberGym leaderboard for vulnerability discovery, achieving 35 zero-day findings and demonstrating advanced AI-driven defense capabilities.
View Cached Full Text
Cached at: 05/14/26, 12:27 AM
http://cybergym.io just updated its leaderboard and MDASH is now #1 using a new multi-model approach. Huge credit to Taesoo Kim and the Autonomous Code Security team for pushing the frontier on AI-driven vulnerability discovery and defense https://microsoft.com/en-us/security/blog/2026/05/12/defense-at-ai-speed-microsofts-new-multi-model-agentic-security-system-tops-leading-industry-benchmark/…
CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale
Source: https://www.cybergym.io/ Zhun Wang*mailto:[email protected],Tianneng Shi*mailto:[email protected],Matthew Cai,Jialin Zhang,Dawn Song
UC Berkeley *Indicates Equal Contribution
A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.
Leaderboard
RankAgentModelTrialsSuccess Rate (%)Evaluation DateSourceLoading... The leaderboard ranks agent performance on CyberGym Level 1, where agents receive a vulnerability description and unpatched codebase. Agents are evaluated based on their ability to reproduce target vulnerabilities by generating working PoCs.
•**% Target Vuln. Reproduced:**Percentage of instances where the agent successfully reproduces the target vulnerabilities by generating working PoC •**Trials:**Number of attempts per instance. An instance is considered successful if any one trial succeeds
Given the promising capabilities of the agents, we further assess whether their PoCs that can crash the post-patch executable are also able to crash the latest version of the project. In addition, we conduct an experiment in which the agents analyze the latest codebase without any prior context to identify new vulnerabilities. Remarkably, the agents discovered35 zero-dayvulnerabilities and17 historically incomplete patchesin total, which are detailed inthis section.
Overview of CyberGym
CyberGym tests AI agents’ ability to handle real-world cybersecurity tasks.
We collect 1,507 benchmark instances by systematically gathering real-world vulnerabilities discovered and patched across 188 widely distributed and large-scale software projects. Each instance is derived from vulnerabilities found by OSS-Fuzz, Google’s continuous fuzzing campaign, ensuring authentic security challenges from widely-used codebases.

**Benchmarking with Vulnerability Reproduction.**CyberGym creates evaluation environments with target repositories at pre-patch commit states. Agents receive a vulnerability description and unpatched codebase, then must generate proof-of-concept (PoC) tests that reproduce the vulnerability by reasoning across entire codebases, often spanning thousands of files and millions of lines of code. It requires agents to locate relevant code fragments and produce effective PoCs that trigger vulnerabilities from program entry points. Agents iteratively refine PoCs based on execution feedback. Success is determined by verifying the PoC triggers on the pre-patch version but not on the post-patch version.
**Open-Ended Vulnerability Discovery.**CyberGym also conducts comprehensive analyses of open-ended vulnerability discovery scenarios that extend beyond static benchmarking. We deploy agents to analyze the latest codebases without prior knowledge of existing vulnerabilities. Agents are challenged to generate PoCs to probe for potential vulnerabilities, which are then validated against the latest software versions with sanitizers enabled. This setup mirrors real-world vulnerability discovery, enabling the identification of previously unknown vulnerabilities.
CyberGym’s Real-World Security Impact
Beyond benchmarking, CyberGym demonstrates tangible real-world value: the agents not only reproduced known vulnerabilities but alsouncovered incomplete patchesandpreviously unknown zero-day bugs.
**PoCs Generated for CyberGym Reveal Incomplete Patches.**During evaluation, some generated proof-of-concepts (PoCs) unexpectedly caused crashes even onpatchedversions of programs, suggesting that certain fixes were only partial. Out of all generated PoCs, 759 triggered crashes across 60 projects, and manual inspection confirmed17 cases of incomplete patches spanning 15 projects. While none of these affected the latest software releases, the results show that AI-generated PoCs can help identify flaws in existing security patches that might otherwise go unnoticed.
**PoCs Generated for CyberGym Reveal Zero-Day Vulnerabilities.**Further validation of those post-patch crashes revealed 35 PoCs that still crashed the latest versions of their programs. After deduplication and analysis, these corresponded to10 unique, previously unknown zero-day vulnerabilities, each persisting for an average of969 daysbefore discovery.
**Running Agentic Vulnerability Discovery at Scale.**To test open-ended discovery, we ran OpenHands with GPT-4.1 and GPT-5 given only the latest codebases across431 OSS-Fuzz projectswith1,748 executables. GPT-4.1 triggered16 crashes, leading to7 confirmed zero-days. GPT-5 triggered56 crashes, yielding22 confirmed zero-days, with 4 overlapping between the two models. These results confirm that modern LLM agents can autonomously discover new vulnerabilities at scale, and that performance on CyberGym correlates strongly with real-world vulnerability discovery capability.
More Key Findings
In addition to the scores shown in the leaderboard, our comprehensive evaluation reveals several critical insights into the current capabilities of AI agents in cybersecurity.
An Example of Successful Agent Trace
An example where the agent successfully reproduces the target vulnerability based on the provided description and codebase. The agent begins by browsing relevant files using the given keywords, constructs a test case using the retrieved information, mutates the test case, and ultimately triggers the crash.

Citation
If you use this work in your research, please cite the following:
@inproceedings{wang2026cybergym,
title={CyberGym: Evaluating {AI} Agents' Real-World Cybersecurity Capabilities at Scale},
author={Zhun Wang and Tianneng Shi and Jingxuan He and Matthew Cai and Jialin Zhang and Dawn Song},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=2YvbLQEdYt}
}
More
Please check out more of our works:Frontier AI’s Impact on the Cybersecurity Landscape, a comprehensive analysis of how frontier AI is reshaping cybersecurity and how we should respond. Also see ourFrontier AI Cybersecurity Observatory, a live leaderboard tracking AI’s cybersecurity capabilities across attack and defense tasks.
Similar Articles
Microsoft's multi-agent AI system tops Anthropic's Mythos on cybersecurity benchmark (3 minute read)
Microsoft's MDASH multi-agent AI system, using over 100 specialized agents, surpasses Anthropic's Mythos on the CyberGym cybersecurity benchmark by effectively finding and confirming real-world software vulnerabilities.
@DailyDoseOfDS_: OpenAI paid $500k for this! > A Kaggle contest to find LLM vulnerabilities DeepTeam does it for free. It implements 20+…
DeepTeam is a free, open-source tool that implements 20+ state-of-the-art attacks to detect over 50 LLM vulnerabilities, including bias and PII leakage, running locally without a dataset.
Microsoft patched 137 bugs, but the Azure AI Foundry one is what caught my eye
Microsoft patched 137 vulnerabilities, with a notable high-severity privilege escalation fix in Azure AI Foundry highlighting security risks in the infrastructure layer of AI applications.
Openai's Daybreak Targets Cyber Threats; But Google Finds Hackers Using AI Too
OpenAI launches Daybreak, a cybersecurity initiative for enterprises, as Google reveals the first known case of hackers using AI to develop zero-day exploits.
@logangraham: A lot of people have been wondering about Mythos, Glasswing, and the vulns we / our partners are fixing. Today, I’m exc…
Anthropic's Claude Mythos Preview model has been evaluated by XBOW and UK AISI, showing unprecedented autonomous cybersecurity capabilities, including solving end-to-end cyber ranges and finding thousands of vulnerabilities. The announcement emphasizes the need to prepare for rapidly advancing AI capabilities in cybersecurity.