This paper introduces the Explanation Fairness Taxonomy (EFT) to analyze disparities in how LLMs justify decisions across demographic groups, finding significant biases in explanation quality and tone despite balanced decisions.
This empirical study validates theoretical findings on feature repulsion and spectral lock-in during the grokking phenomenon in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
This paper challenges the assumption that adding more scaffolding components to LLM agents always improves performance, demonstrating through systematic experiments that cross-component interference often leads to degradation. The study finds that simpler, task-specific subsets of components frequently outperform fully equipped 'all-in' agents across various model scales.
SWE-chat introduces a 6,000-session dataset of real-world coding agent interactions, revealing that only 44% of agent-generated code survives into committed changes and highlighting inefficiencies and security issues in current AI-assisted development.
This paper presents the first large-scale empirical study of agent context files (READMEs) used in agentic coding tools, analyzing their structure, maintenance patterns, and content. It highlights that while functional context is well-covered, non-functional requirements like security and performance are rarely specified.
This foundational empirical study demonstrates power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.