Tag
This paper identifies that naive skill accumulation in LLM agents can cause performance regressions, as skills beneficial for some tasks hurt others. The authors propose Assay, a framework that measures per-skill causal contributions and applies per-task masking, achieving state-of-the-art results on AppWorld and τ-bench without weight updates.
This paper presents an experiment where GPT-4.1 is asked to pick a random number between 1 and 100, 10,000 times, and the resulting distribution is analyzed for bias compared to a uniform baseline.