@rohanpaul_ai: ByteDance Seed delivered again. They released EdgeBench, to test whether AI agents can improve through experience, usin…

X AI KOLs Timeline 07/03/26, 12:15 AM Tools

bytedance-seed edgebench benchmark ai-agents evaluation real-world-tasks learning

Summary

ByteDance Seed released EdgeBench, a benchmark that tests whether AI agents can improve through experience by performing real-world tasks over 12+ hours, shifting evaluation from static knowledge to dynamic learning.

Original Article

ByteDance Seed delivered again.

They released EdgeBench to test whether AI agents can improve through experience, using 134 real-world tasks that each run for at least 12 hours.

The big deal is that it shifts AI evaluation from “what does the model already know?” to “can the model learn while doing real work?”

Huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions.

Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning.

EdgeBench instead gives agents 12-hour real-world tasks with feedback loops, so it can measure whether the agent improves through experience.

Each task has a local workspace for fast trial and error, plus a hidden judge that gives stronger feedback on submitted work, which is meant to feel closer to real expert work.
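A toy sketch of that two-tier loop, to make the idea concrete. Everything here is hypothetical and not taken from EdgeBench: `run_task`, `propose`, `local_score`, and `judge_score` are made-up names, the "task" is just guessing a number, and the real benchmark's interface may look nothing like this. The only point is the shape: many cheap local trials, with occasional submissions to a judge the agent cannot inspect.

```python
import random

def run_task(propose, local_score, judge_score, steps=20, submit_every=5, seed=0):
    """Toy loop: cheap local trials every step, a judge submission every few steps."""
    rng = random.Random(seed)
    best_candidate, best_local = None, float("-inf")
    judge_scores = []  # raw judge score for each submission, in order
    for step in range(1, steps + 1):
        candidate = propose(best_candidate, rng)
        score = local_score(candidate)  # fast local feedback
        if score > best_local:
            best_candidate, best_local = candidate, score
        if step % submit_every == 0:
            # Submit the best local candidate to the (hypothetical) hidden judge.
            judge_scores.append(judge_score(best_candidate))
    return judge_scores

# Hypothetical toy task: find a number close to 3.0.
def propose(current, rng):
    return 0.0 if current is None else current + rng.uniform(-1.0, 1.0)

def local_score(x):
    return -abs(x - 3.0)    # cheap proxy available in the workspace

def judge_score(x):
    return -(x - 3.0) ** 2  # stand-in for the hidden judge's metric

scores = run_task(propose, local_score, judge_score)
print(scores)
```

In this toy the local proxy and the judge agree on which candidate is better, so the judge scores never go down. In a real task the two can disagree, which is part of why the judge's feedback is the stronger signal.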

The authors then ran frontier agents for about 38,000 total hours and tracked how their best score changed as they kept interacting with the task environment.
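"Best score" here reads as a running maximum: at any point in the run, the score that counts is the highest one seen so far. A minimal sketch, assuming scores arrive as a simple list in submission order; the numbers are invented for illustration and are not results from the paper.

```python
from itertools import accumulate

def best_so_far(scores):
    """Return the running maximum of a sequence of scores."""
    return list(accumulate(scores, max))

# Made-up scores from successive submissions (not real EdgeBench data).
submissions = [40.0, 38.5, 52.0, 51.0, 60.0]
print(best_so_far(submissions))  # [40.0, 40.0, 52.0, 52.0, 60.0]
```

A curve built this way can only stay flat or go up, so a bad later attempt never lowers it.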

The big result is that when scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow, then faster, then starts to level off.
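One common way to write such a curve is a sigmoid applied to the logarithm of elapsed time. The sketch below assumes that form; the post does not give the paper's exact parameterization, and `s_max`, `t_mid`, and `k` are illustrative names and values, not fitted numbers.

```python
import math

def log_sigmoid_score(t_hours, s_max=1.0, t_mid=2.0, k=1.5):
    """Sigmoid in log-time: reaches s_max / 2 at t_mid and levels off toward s_max."""
    z = k * (math.log(t_hours) - math.log(t_mid))
    return s_max / (1.0 + math.exp(-z))

# Score rises with time, then flattens as it approaches s_max.
for t in (0.25, 1.0, 2.0, 6.0, 12.0):
    print(t, round(log_sigmoid_score(t), 3))
```

Here `t_mid` sets when the curve hits half of its ceiling and `k` sets how sharply it rises around that point.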

They also found that newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.
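For a sense of scale, doubling every 3 months compounds to a factor of 2^(m/3) after m months. The helper below is plain arithmetic, with `growth_factor` as a made-up name; it only shows what the rate would imply if it held steady, which the post does not claim.

```python
def growth_factor(months, doubling_months=3.0):
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

print(growth_factor(3))   # 2.0
print(growth_factor(12))  # 16.0
```

So a steady 3-month doubling would mean a 16x increase over a year.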

Blog https://edge-bench.org
Paper https://edge-bench.org/paper.pdf
GitHub https://github.com/ByteDance-Seed/EdgeBench…
Dataset https://huggingface.co/datasets/ByteDance-Seed/EdgeBench…

EdgeBench covers 134 long real-world tasks across science, software engineering, optimization, professional work, formal math, and games.

The point is that the authors are trying to measure general agent learning from feedback across many kinds of work, not performance on one domain or short one-shot tasks.

This is how EdgeBench lets agents learn during a task: they can keep testing ideas in a local environment, then submit work to a hidden judge for stronger feedback.

The point is that the benchmark is built around real feedback loops, so it measures whether an agent can improve through trial, error, and revision rather than just produce one final answer.

This shows what “learning during the run” looks like in practice: the agent starts with a rough gravitational-wave reconstruction, then improves through several distinct discoveries over 12 hours.

The point is that EdgeBench is measuring real iterative progress, where feedback helps the agent find better structure, fix bottlenecks, and raise its score from 42.8 to 67.0 instead of just making more random attempts.
