Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

arXiv cs.AI Papers

Summary

This paper identifies 'library drift' as a silent failure mode in self-evolving LLM skill libraries, where unbounded skill accumulation causes retrieval degradation and performance stagnation. It provides trace-level diagnostics and a verified governance recipe that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 on MBPP+ hard-100 over 100 rounds.

arXiv:2605.19576v1. Announce type: new. Abstract: Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom: LLM-authored skills deliver a +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench). Yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and another imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
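
To make the governance idea in the abstract concrete, the following is a minimal Python sketch of an append-only evidence log combined with outcome-driven retirement and a bounded active cap. It is not the paper's implementation: the class names, the `delta` contribution score (pass with the skill minus pass without it), and the `active_cap` and `min_uses` parameters are all assumptions chosen for illustration. The meta-skill authoring prior, attribution verdicts, and router engagement metrics are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    active: bool = True


@dataclass
class SkillLibrary:
    active_cap: int = 3   # hypothetical bound on simultaneously active skills
    min_uses: int = 2     # hypothetical evidence threshold before retirement
    skills: dict = field(default_factory=dict)
    evidence_log: list = field(default_factory=list)  # append-only by convention

    def add(self, name):
        self.skills[name] = Skill(name)

    def record(self, round_id, name, passed_with, passed_without):
        """Append one outcome record; earlier entries are never edited."""
        delta = int(passed_with) - int(passed_without)  # assumed contribution score
        self.evidence_log.append({"round": round_id, "skill": name, "delta": delta})

    def contribution(self, name):
        """Return (mean delta, number of records) for one skill."""
        deltas = [e["delta"] for e in self.evidence_log if e["skill"] == name]
        if not deltas:
            return 0.0, 0
        return sum(deltas) / len(deltas), len(deltas)

    def govern(self):
        """Retire skills that have enough evidence and do not help, then enforce the cap."""
        for s in self.skills.values():
            score, n = self.contribution(s.name)
            if s.active and n >= self.min_uses and score <= 0:
                s.active = False  # outcome-driven retirement
        active = [s for s in self.skills.values() if s.active]
        active.sort(key=lambda s: self.contribution(s.name)[0], reverse=True)
        for s in active[self.active_cap:]:
            s.active = False      # bounded active cap
        return sorted(s.name for s in self.skills.values() if s.active)


# Usage: four hypothetical skills, a cap of two active skills.
lib = SkillLibrary(active_cap=2, min_uses=2)
for skill_name in ["a", "b", "c", "d"]:
    lib.add(skill_name)
lib.record(1, "a", True, False)
lib.record(2, "a", True, False)
lib.record(1, "b", False, False)
lib.record(2, "b", False, True)
lib.record(1, "c", False, True)
lib.record(1, "d", True, False)
active_names = lib.govern()
```

In this sketch, `min_uses` keeps a skill from being retired on a single bad outcome, which is one simple way to avoid the premature retirement that the abstract reports as actively harmful. The threshold value itself is arbitrary here.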

Similar Articles

The Scaling Laws of Skills in LLM Agent Systems

arXiv cs.CL

This paper identifies two coupled scaling laws for skill libraries in LLM agent systems: routing accuracy decays logarithmically with library size, and execution dynamics show a rescue effect. The laws are validated across 15 models and over a million decisions, and law-guided optimization significantly improves performance.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv cs.AI

This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.