@rohanpaul_ai: ByteDance Seed delivered again. They released EdgeBench, to test whether AI agents can improve through experience, usin…
Summary
ByteDance Seed released EdgeBench, a benchmark that tests whether AI agents can improve through experience by performing real-world tasks that run for 12 hours or more, shifting evaluation from static knowledge to dynamic learning.
Cached at: 07/03/26, 02:28 AM
ByteDance Seed delivered again.
They released EdgeBench to test whether AI agents can improve through experience, using 134 real-world tasks that run for at least 12 hours.
The big deal is that it shifts AI evaluation from “what does the model already know?” to “can the model learn while doing real work?”
Huge, because future AI agents will not just answer questions from training data. They will enter messy environments, use tools, make attempts, read feedback, fix mistakes, and slowly build better solutions.
Most current benchmarks are too short for that, so they mostly test memory, coding skill, or one-shot reasoning.
EdgeBench instead gives agents 12-hour real-world tasks with feedback loops, so it can measure whether the agent improves through experience.
Each task has a local workspace for fast trial and error, plus a hidden judge that gives stronger feedback on submitted work, which is meant to feel closer to real expert work.
The authors then ran frontier agents for about 38,000 total hours and tracked how their best score changed as they kept interacting with the task environment.
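A minimal sketch of what "tracking the best score" could mean, assuming it is simply the running maximum over an agent's judged submissions. The function name and the sample scores are made up for illustration and do not come from the paper.

```python
from itertools import accumulate

def best_so_far(scores):
    """Return the running maximum of a sequence of judge scores."""
    return list(accumulate(scores, max))

# Hypothetical scores from successive submissions on one task (not EdgeBench data).
submissions = [10.0, 8.5, 14.0, 13.0, 21.5]
print(best_so_far(submissions))  # [10.0, 10.0, 14.0, 14.0, 21.5]
```

A failed attempt never lowers this curve, so it only moves when the agent finds something better.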
The big result is that when scores are averaged across many tasks, learning follows a very clean log-sigmoid curve, meaning progress is slow, then faster, then starts to level off.
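The post does not give the formula. One plausible reading of "log-sigmoid" is a logistic curve in the logarithm of elapsed time, sketched below. The function name, the parameter names, and every numeric value are assumptions for illustration, not fitted to EdgeBench results.

```python
import math

def log_sigmoid_score(hours, ceiling, midpoint_hours, steepness):
    """Logistic curve in log(time): slow start, faster middle, plateau near `ceiling`."""
    z = steepness * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))

# Illustrative parameters only (assumed, not taken from the paper).
for h in (0.25, 1, 2, 4, 12):
    score = log_sigmoid_score(h, ceiling=100.0, midpoint_hours=2.0, steepness=1.5)
    print(f"{h:>5} h -> {score:.1f}")
```

Under this reading the score reaches half of the ceiling at `midpoint_hours`, and each further doubling of time buys less once the curve starts to flatten.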
They also found that newer agents seem to learn from environments much faster, with the top models roughly doubling their 2-hour learning speed every 3 months.
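To put the doubling claim in numbers: if a quantity doubles every 3 months, it grows by a factor of 2^(months/3). The snippet below is plain compound-growth arithmetic. The function name is made up, and assuming the trend holds for a full year is an extrapolation, not a claim from the paper.

```python
def growth_factor(months, doubling_months=3.0):
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)

print(growth_factor(3))   # 2.0
print(growth_factor(12))  # 16.0
```

So a steady 3-month doubling would compound to roughly 16x over 12 months.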
Blog: https://edge-bench.org
Paper: https://edge-bench.org/paper.pdf
GitHub: https://github.com/ByteDance-Seed/EdgeBench…
Dataset: https://huggingface.co/datasets/ByteDance-Seed/EdgeBench…

EdgeBench covers 134 long real-world tasks across science, software engineering, optimization, professional work, formal math, and games.
The point is that the authors are trying to measure general agent learning from feedback across many kinds of work, not performance on one domain or short one-shot tasks.
This is how EdgeBench lets agents learn during a task: they can keep testing ideas in a local environment, then submit work to a hidden judge for stronger feedback.
The point is that the benchmark is built around real feedback loops, so it measures whether an agent can improve through trial, error, and revision rather than just produce one final answer.
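The two-tier loop described above can be sketched as a toy hill climber: cheap local trials pick a candidate, and the hidden judge decides whether it is kept. Everything here is a hypothetical stand-in (the function names, the single-number "solution", the scoring rules) and is not the EdgeBench API.

```python
import random

def local_eval(x):
    """Rough local proxy score (toy assumption: it is slightly miscalibrated)."""
    return -abs(x - 2.5)

def hidden_judge(x):
    """Stronger held-out score (toy assumption: the true optimum is 3.0)."""
    return -abs(x - 3.0)

def propose(x, rng):
    """Generate a nearby candidate solution."""
    return x + rng.uniform(-0.5, 0.5)

def run_agent(start, rounds, tries_per_round, seed=0):
    rng = random.Random(seed)
    best, best_judged = start, hidden_judge(start)
    history = [best_judged]
    for _ in range(rounds):
        # Fast trial and error in the local workspace.
        candidates = [propose(best, rng) for _ in range(tries_per_round)]
        pick = max(candidates, key=local_eval)
        # Submit the locally best candidate; keep it only if the judge agrees.
        judged = hidden_judge(pick)
        if judged > best_judged:
            best, best_judged = pick, judged
        history.append(best_judged)
    return best, history

best, history = run_agent(start=0.0, rounds=20, tries_per_round=5)
print(round(history[0], 2), round(history[-1], 2))
```

Because a candidate is only adopted when the judged score improves, `history` never decreases, which mirrors the best-score tracking used in the benchmark's curves.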
This shows what “learning during the run” looks like in practice: the agent starts with a rough gravitational-wave reconstruction, then improves through several distinct discoveries over 12 hours.
The point is that EdgeBench is measuring real iterative progress, where feedback helps the agent find better structure, fix bottlenecks, and raise its score from 42.8 to 67.0 instead of just making more random attempts.