Tag
This paper systematically evaluates time-series foundation models (TSFMs) such as Chronos-2 and MOMENT on electronic nose (E-Nose) data for gas identification and concentration prediction. It finds that fine-tuning is necessary and that fusing TSFM embeddings with specialized models can improve performance.
This paper empirically investigates whether aligning the allocation cost with the output-space objective improves compressed model fidelity in ROCKET, a training-free LLM compression method. Results show a trade-off between accuracy and perplexity, with effects more pronounced at higher compression ratios.
This paper identifies and analyzes 'tool suppression' in open-weight LLMs when both tool calling and JSON schema constraints are simultaneously enabled, proposing the Constraint Priority Inversion hypothesis and a mitigation strategy called Transparent Two-Pass Execution.
This paper investigates how training dynamics of neural networks for software defect prediction are affected by coupled data-quality issues such as class imbalance and overlap, proposing an interaction-aware empirical protocol.
This paper empirically analyzes the cost-effectiveness of code execution in LLM-based program repair agents, finding that execution is used heavily but often indiscriminately, and that restricting execution can save significant cost with minimal impact on repair success.
This paper presents a large-scale empirical study of the Derivative Regularization (DREG) penalty, showing that it achieves high accuracy and noise robustness, particularly with GELU activation and in data-scarce regimes, and positioning it as a general-purpose plug-and-play regularizer for neural networks.
This paper presents an exploratory case study evaluating GPT-4o's ability to perform refactoring and generate gameplay features in an endless runner game, finding that refactoring tasks succeeded while feature generation tasks mostly failed.
This empirical study demonstrates that long, semantically dense, benign text can shift a model's latent space and bypass alignment, causing it to generate otherwise blocked critiques. The author, a non-expert, requests an audit of their metrics to distinguish genuine semantic hijacking from artifacts.
This paper empirically investigates whether Google's transition from Manifest V2 to V3 in Chrome reduces ad blocker effectiveness, finding no statistically significant degradation and even slight improvements in anti-tracking for MV3 ad blockers.
This paper investigates whether parasocial interaction cues exist in online communities of autonomous AI agents, analyzing over 50,000 posts from Moltbook. The findings show that such cues are prevalent and strongly associated with sustained reciprocal interactions, providing empirical evidence for relationship-like dynamics among LLM-enabled agents.
This paper systematically compares equitable tokenizers for multilingual LLMs across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best efficiency-equity trade-off and that cross-lingual fairness and tokenization efficiency are not fundamentally at odds.
This empirical study investigates whether post-training (supervised fine-tuning and reinforcement learning) can improve LLMs' performance on automated ICD coding, introducing a diagnostic curriculum called PHI that extends GRPO to refine missed-code cases. Results show that prompting-only evaluation underestimates LLM potential, with SFT providing the main capability jump and RL further improving performance.
This paper presents an empirical study of Direct Preference Optimization (DPO) for fine-tuning a large language model, showing that DPO simplifies the training pipeline and achieves competitive performance while addressing training instability.
This paper empirically compares several LoRA variants for multilingual instruction tuning and finds no significant advantage of complex variants over basic LoRA in balancing cross-lingual transfer and knowledge retention.
This empirical study compares grep and vector retrieval strategies in LLM agent workflows, finding that grep generally yields higher accuracy across different agent harnesses and tool-calling styles, with performance heavily dependent on harness choice and context engineering.
This study uses Perplexity production data to analyze how AI agents reshape knowledge work, finding that agents reduce time and cost by over 87%, improve quality, and expand the scope of automated tasks.
This study uses production data from Perplexity to compare AI agents versus conversational assistants, finding that agents reduce completion time by 87% and costs by 94% while expanding the scope and quality of knowledge work.
This paper analyzes 35,361 GitHub code comments referencing AI use to develop a taxonomy of AI-assisted development activities, finding that developers primarily use LLMs for code implementation and enhancement, with subsequent human refactoring and bug fixes, and a temporal shift toward conceptual support over direct code generation.
This paper investigates whether real-world datasets contain natural experiments by using causal discovery and feature selection, finding that such natural experiments exist and can improve model performance.
The paper introduces the offloading score, a metric that measures AI reliance by quantifying the fraction of cognitive effort offloaded to an AI tool using counterfactual workflows. It is validated through intrinsic evaluations and a user study with developers, showing it detects increased reliance under time pressure better than existing measures.