Tag
This paper analyzes two capabilities in self-evolving LLM agents: harness-updating and harness-benefit. It finds that harness-updating is flat across base capability levels, while harness-benefit is non-monotonic, with mid-tier models benefiting most.
This paper introduces a population coupling trend and an h-field diagnostic to analyze the relationship between coding and reasoning capabilities across frontier AI models. It finds that the two capabilities cooperate, though the emphasis varies by lab. It also provides a measurement playbook and predicts benchmark saturation trends.