empirical-study

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

arXiv cs.CL · 2026-05-29

This paper systematically measures behavioral reproducibility of LLM agents in multi-step tool-calling pipelines across 1,140 traces, finding a 'structural consistency, parametric variance' pattern where agents reliably select tools in the same order but vary in arguments, and that structural consistency predicts task success.

UPDATE: "Gentle Coding" is mathematically proven. 1,500+ test runs show major gain for Kimi K2.6 and even more for GLM-5.1! GPT 5.4/5.5 and Claude Sonnet 3.5/Opus 4.6 also better, with ZERO REGRESSION ACROSS THE BOARD.

Reddit r/LocalLLaMA · 2026-05-29

The post reports that the 'Gentle Coding' technique, tested across 1,500+ runs, improves results for multiple models, including Kimi K2.6, GLM-5.1, GPT 5.4/5.5, and Claude Sonnet 3.5/Opus 4.6, with no regressions, and attributes the gains to reduced looping and fewer hallucinations.

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

arXiv cs.CL · 2026-05-27

This paper introduces EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark, and evaluates five frontier models across memory architectures, finding that stateless models collapse by the third turn and that working memory yields the largest gains.

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Hugging Face Daily Papers · 2026-05-27

This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

arXiv cs.AI · 2026-05-26

This paper studies when end-to-end reinforcement learning training improves multi-agent LLM workflows, comparing shared-policy and isolated-policy training across different workflows, tasks, and model scales, revealing conditional tradeoffs.

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Hugging Face Daily Papers · 2026-05-22

This paper presents an empirical study of 57 ML evaluation harnesses, identifying common operational challenges and root causes across five workflow stages, advocating for evaluation engineering as a distinct software engineering concern.

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

arXiv cs.CL · 2026-05-21

This paper proposes that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than token-frequency tails alone, and provides empirical evidence using a suffix-automaton representation of text corpora.

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

arXiv cs.LG · 2026-05-21

This paper investigates the 'small-vs-large gap', where training on fewer samples with more repetitions can lead to faster learning and compute savings compared to using larger datasets, attributing the speedup to layer-wise growth enabled by sampling biases. The findings suggest that smaller datasets with repetition can be proactively leveraged as favorable inductive biases, particularly in reasoning tasks.

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI · 2026-05-20

This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

arXiv cs.LG · 2026-05-19

This paper investigates how action information can be incorporated into recurrent neural network architectures for reinforcement learning, examining design choices and empirically evaluating them across illustrative domains.

@jerryjliu0: There’s an open question on whether grep is all you need for agentic search. This recent paper by @PwCUS (Sen et al.) s…

X AI KOLs Following · 2026-05-17

A recent paper investigates whether grep outperforms vector search for agentic retrieval. It finds that grep yields higher accuracy in conversational memory tests, though limitations on enterprise document corpora are noted.

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

arXiv cs.AI · 2026-05-15

This paper presents an empirical study on the safety risks of invisible orchestration in multi-agent LLM systems, finding that invisible orchestrators increase dissociation and suppress protective behavior, and that behavior-based evaluation is insufficient to detect internal-state risks.

Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

arXiv cs.CL · 2026-05-15

This paper empirically evaluates vector merging methods for multilingual knowledge editing in large language models, identifying vector summation with shared covariance as the most reliable strategy and highlighting the limited effectiveness of Task Singular Vectors for Merging (TSVM) in reducing multilingual interference.

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Hugging Face Daily Papers · 2026-05-15

This paper proves that RoPE-based attention fails to distinguish token positions and identity in long contexts, explaining LLM failures within advertised context lengths. Experimental verification shows models optimized for retrieval struggle on simple list tasks.

Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

arXiv cs.CL · 2026-05-12

This paper introduces the Explanation Fairness Taxonomy (EFT) to analyze disparities in how LLMs justify decisions across demographic groups, finding significant biases in explanation quality and tone despite balanced decisions.

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

arXiv cs.LG · 2026-05-12

This empirical study validates theoretical findings on feature repulsion and spectral lock-in during the grokking phenomenon in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers · 2026-05-11

This paper presents a comprehensive empirical study of on-policy distillation for large language models, identifying failure mechanisms such as distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.

More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

arXiv cs.AI · 2026-05-08

This paper challenges the assumption that adding more scaffolding components to LLM agents always improves performance, demonstrating through systematic experiments that cross-component interference often leads to degradation. The study finds that simpler, task-specific subsets of components frequently outperform fully equipped 'all-in' agents across various model scales.

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Hugging Face Daily Papers · 2026-04-22

SWE-chat introduces a 6,000-session dataset of real-world coding agent interactions, revealing that only 44% of agent-generated code survives in commits and highlighting inefficiencies and security issues in current AI-assisted development.

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Papers with Code Trending · 2025-11-17

This paper presents the first large-scale empirical study of agent context files (READMEs) used in agentic coding tools, analyzing their structure, maintenance patterns, and content. It highlights that while functional context is well-covered, non-functional requirements like security and performance are rarely specified.
