Tag
Self-Harness introduces a new paradigm where LLM-based agents iteratively improve their own operating harness by mining model-specific weaknesses, proposing harness modifications, and validating them through regression testing, achieving substantial performance gains on Terminal-Bench-2.0 across multiple base models.
This paper introduces an auto-research framework using specialist agents to iteratively refine training recipes through an empirical loop of code execution and feedback. The system autonomously improves performance on tasks like Parameter Golf and NanoChat without human intervention by leveraging lineage feedback.