SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Hugging Face Daily Papers

Summary

SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, namely those that use one-shot generation, self/teacher feedback, and a skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also reveals that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.
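
The claim that every method beats the no-skill baseline while none leads on all tasks can be made concrete with a small comparison routine. The following is a minimal sketch, not the benchmark's own evaluation code: the task names, the score values, and the helper functions are all hypothetical and invented for illustration, and only the method labels follow the abstract.

```python
from collections import Counter
from statistics import mean

# Hypothetical task-outcome scores (method -> task -> score in [0, 1]).
# These numbers are invented for illustration; they are NOT results from the paper.
scores = {
    "no_skill":         {"task_a": 0.20, "task_b": 0.30, "task_c": 0.10},
    "one_shot":         {"task_a": 0.60, "task_b": 0.40, "task_c": 0.20},
    "teacher_feedback": {"task_a": 0.50, "task_b": 0.70, "task_c": 0.30},
    "skill_creator":    {"task_a": 0.40, "task_b": 0.50, "task_c": 0.50},
}


def best_method_per_task(scores):
    """Return {task: method with the highest score on that task}."""
    tasks = next(iter(scores.values())).keys()
    return {t: max(scores, key=lambda m: scores[m][t]) for t in tasks}


def dominant_method(scores):
    """Return the method that wins every task, or None if there is none."""
    winners = Counter(best_method_per_task(scores).values())
    n_tasks = len(next(iter(scores.values())))
    for method, wins in winners.items():
        if wins == n_tasks:
            return method
    return None


def all_beat_baseline(scores, baseline="no_skill"):
    """True if every non-baseline method has a higher mean score than the baseline."""
    base = mean(scores[baseline].values())
    return all(
        mean(task_scores.values()) > base
        for method, task_scores in scores.items()
        if method != baseline
    )


if __name__ == "__main__":
    print(best_method_per_task(scores))
    print(dominant_method(scores))
    print(all_beat_baseline(scores))
```

With these made-up numbers, each task has a different winner, so `dominant_method` returns `None` even though `all_beat_baseline` returns `True`. That is the same qualitative pattern the abstract describes.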

Cached at: 04/23/26, 03:35 AM

Source: https://huggingface.co/papers/2604.20087

Abstract

Continual skill learning methods for LLM agents show mixed performance across diverse tasks, with improvements dependent on task structure and feedback mechanisms rather than model scaling.

Similar Articles

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv cs.AI

This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Hugging Face Daily Papers

SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.