Tag
This paper systematically measures behavioral reproducibility of LLM agents in multi-step tool-calling pipelines across 1,140 traces, finding a 'structural consistency, parametric variance' pattern where agents reliably select tools in the same order but vary in arguments, and that structural consistency predicts task success.
The 'Gentle Coding' technique is empirically validated across more than 1,500 tests, showing significant improvements with zero regressions for multiple models, including Kimi K2.6, GLM-5.1, GPT 5.4/5.5, and Claude Sonnet 3.5/Opus 4.6, by reducing looping and hallucinations.
This paper introduces EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark, and evaluates five frontier models across memory architectures, finding that stateless models collapse by the third turn and that working memory yields the largest gains.
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
This paper studies when end-to-end reinforcement learning improves multi-agent LLM workflows, comparing shared-policy and isolated-policy training across different workflows, tasks, and model scales, and revealing conditional tradeoffs.
This paper presents an empirical study of 57 ML evaluation harnesses, identifying common operational challenges and root causes across five workflow stages, advocating for evaluation engineering as a distinct software engineering concern.
This paper proposes that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than token-frequency tails alone, and provides empirical evidence using a suffix-automaton representation of text corpora.
This paper investigates the 'small-vs-large gap', where training on fewer samples with more repetitions can lead to faster learning and compute savings compared to using larger datasets, attributing the speedup to layer-wise growth enabled by sampling biases. The findings suggest that smaller datasets with repetition can be proactively leveraged as favorable inductive biases, particularly in reasoning tasks.
This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.
This paper investigates how action information can be incorporated into recurrent neural network architectures for reinforcement learning, examining design choices and empirically evaluating them across illustrative domains.
A recent paper investigates whether grep outperforms vector search for agentic retrieval, finding that grep yields higher accuracy in conversational memory tests while noting limitations around enterprise document corpora.
This paper presents an empirical study on the safety risks of invisible orchestration in multi-agent LLM systems, finding that invisible orchestrators increase dissociation and suppress protective behavior, and that behavior-based evaluation is insufficient to detect internal-state risks.
This paper empirically evaluates vector merging methods for multilingual knowledge editing in large language models, identifying vector summation with shared covariance as the most reliable strategy and highlighting the limited effectiveness of Task Singular Vectors for Merging (TSVM) in reducing multilingual interference.
This paper proves that RoPE-based attention fails to distinguish token position and identity in long contexts, explaining LLM failures within advertised context lengths. Experimental verification shows that models optimized for retrieval struggle on simple list tasks.
This paper introduces the Explanation Fairness Taxonomy (EFT) to analyze disparities in how LLMs justify decisions across demographic groups, finding significant biases in explanation quality and tone despite balanced decisions.
This empirical study validates theoretical findings on feature repulsion and spectral lock-in during the grokking phenomenon in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
This paper challenges the assumption that adding more scaffolding components to LLM agents always improves performance, demonstrating through systematic experiments that cross-component interference often leads to degradation. The study finds that simpler, task-specific subsets of components frequently outperform fully equipped 'all-in' agents across various model scales.
SWE-chat introduces a 6,000-session dataset of real-world coding agent interactions, revealing that only 44% of agent-generated code survives in commits and highlighting inefficiencies and security issues in current AI-assisted development.
This paper presents the first large-scale empirical study of agent context files (READMEs) used in agentic coding tools, analyzing their structure, maintenance patterns, and content. It highlights that while functional context is well-covered, non-functional requirements like security and performance are rarely specified.