SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Hugging Face Daily Papers 04/22/26, 12:00 AM Papers

Summary

SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.

Original Article

View Cached Full Text

Cached at: 04/23/26, 03:35 AM

Paper page - SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Source: https://huggingface.co/papers/2604.20087

Abstract

Continual skill learning methods for LLM agents show mixed performance across diverse tasks, with improvements dependent on task structure and feedback mechanisms rather than model scaling.

Skills have become the de facto way to enableLLM agentsto perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluatingcontinual skill learningmethods, comprising 20 verified,skill-dependent tasksacross 15 sub-domains derived from areal-world skill taxonomy, evaluated at three levels: skill quality,execution trajectory, andtask outcome. Using this benchmark, we evaluate recentcontinual learning techniques, those leveraging one-shot, self/teacher feedback, andskill creatorto generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereasself-feedbackalone inducesrecursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation andcontinual learning techniques.

View arXiv page View PDF Project page GitHub1 Add to collection

Get this paper in your agent:

hf papers read 2604\.20087

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.20087 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.20087 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.20087 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Paper page - SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Submit Feedback

Similar Articles

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing