Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
Summary
This paper identifies "library drift," a silent failure mode in self-evolving LLM skill libraries in which unbounded skill accumulation degrades retrieval and causes performance to stagnate. It provides trace-level diagnostics and a verified governance recipe that lifts pass@1 from 0.258 to 0.584 on MBPP+ hard-100.
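To put the reported numbers in perspective, the minimal sketch below computes the absolute and relative gain implied by the two pass@1 values quoted in the summary. The function name `pass_at_1_gain` is hypothetical and used only for illustration; the only figures taken from the source are 0.258 and 0.584, and nothing here reflects how the paper itself computes its metrics.

```python
def pass_at_1_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative ratio) between two pass@1 scores.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return after - before, after / before


# Scores quoted in the summary (MBPP+ hard-100).
absolute, ratio = pass_at_1_gain(0.258, 0.584)
print(f"absolute gain: {absolute:.3f}")
print(f"relative ratio: {ratio:.2f}x")
```

With these figures the gain is about 0.326 in absolute terms, or roughly 2.26 times the baseline score.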
Similar Articles
The Scaling Laws of Skills in LLM Agent Systems
This paper identifies two coupled scaling laws for skill libraries in LLM agent systems: routing accuracy decays logarithmically with library size, and execution dynamics show a rescue effect. The laws are validated across 15 models and over a million decisions, and law-guided optimization significantly improves performance.
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
This paper introduces SkillDAG, a self-evolving typed directed graph for LLM skill selection at scale. It models inter-skill relationships and lets agents query and evolve the graph during execution, outperforming baselines on ALFWorld and SkillsBench.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
SkillAdaptor is a training-free step-level skill adaptation framework with explicit failure attribution for LLM agents, improving performance on WebShop, PinchBench, and Claw-Eval.