A commentary emphasizing that despite AI advances, human understanding remains crucial for safe and humane deployment, urging users to verify AI outputs and treat AI with respect.
This paper investigates why multi-step tool-use reinforcement learning (RL) often collapses or yields limited gains, identifying probability spikes in control tokens as a key cause. It shows that interleaving supervised fine-tuning with RL improves stability and explores various supervisory signals to guide robust training.
Weave of Formal Thought (WoFT) introduces a sound and complete constrained decoder for code generation that guarantees syntactic validity relative to the full Tree-sitter specification, and a fine-tuning method that trains models to interleave grammar symbols using reweighted wake-sleep, improving perplexity on Python code generation.
Introduces SFL-MTSC, a structured aggregation framework for robust multi-intent spoken language understanding using LLM self-consistency at the semantic frame level, showing improved slot F1 and overall accuracy on the MAC-SLU benchmark.
This paper presents a red teaming framework for LLMs that uses a multi-role architecture to systematically uncover vulnerabilities, particularly in faithfulness. The framework demonstrated a 7.9% increase in attack success rate in QA tasks and highlights the impact of architectural choices over parameter scaling on model safety.
Hybrid-IR introduces a dual-path retrieval framework that combines graph-based and dense retrieval with iterative reasoning to improve complex medical QA, addressing limitations of existing RAG methods. Experiments on three benchmarks demonstrate its effectiveness.
This paper develops a codebook for self-stigma among people who use drugs and analyzes 72,115 Reddit posts to examine prevalence, co-occurrence, and temporal patterns of cognitive, affective, and behavioral stigma indicators, finding that self-stigma is expressed as an integrated phenomenon with behavioral indicators often preceding core indicators.
This survey reformulates industrial continual learning for LLMs as a closed-loop update-and-release problem in a versioned ecosystem, identifying key challenges and proposing five lifecycle design principles for sustainable model evolution.
ScaleToT proposes a method to generalize structured LLM reasoning for low-activity user modeling at billion scale, using tree-of-thought refinement and training a student model to reduce cost. An online A/B test in advertising deployment showed a 6.738% increase in LT30.
This paper explores cross-lingual prompting strategies to improve access to parametric knowledge in large language models, demonstrating significant gains in knowledge transfer and factual recall across 17 languages on multilingual benchmarks.
The paper discusses the small scaling exponents of large language models, arguing that they indicate an unsustainable regime in terms of energy resources. It also examines the 'pedestal effect' and draws analogies with fluid turbulence to comment on data smoothness.
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench tasks.
ATRIA is a multi-agent system for ECG report generation that mirrors the clinician's iterative workflow, enabling bidirectional editing, evidence grounding, and clinician-in-the-loop verification.
AVOC introduces a retrieval-inspired token compression method for omni-modal LLMs that effectively handles hour-long audio-video inputs by selecting informative tokens based on relevance, importance, and diversity. The framework achieves state-of-the-art results on long-form audio-video understanding benchmarks, surpassing prior methods by significant margins.
This paper introduces 'pigeonholing,' a phenomenon where bad prompts cause LLMs to collapse and repeat errors, leading to a 38-40% performance drop. Experiments across 10 tasks and 10 models show that the effect worsens with more conversation turns, and the authors propose RLVR with synthetic errors as a mitigation.
This paper proposes monitoring LLM misalignment by decomposing it into fine-grained cognitive processes (misalignment indicators) and detecting them via linear probes on internal activations, achieving high AUROC on out-of-distribution transcripts.
This paper introduces CAVEWOMAN, a two-channel evaluation protocol for assessing the effects of linguistic input and output compression on LLMs. It finds that output compression reduces costs, while input compression increases costs and degrades accuracy, challenging the common 'caveman style' advice.
This paper extends contextual entrainment from the token level to the sentence level, showing that sentences appearing in the prompt, even counterfactual ones, become more probable during inference. The effect decreases with model size and is driven by 2-4% of attention heads, which can be ablated without performance loss.
This paper proposes version-aware operations and transaction memories for the MeMo architecture, enabling direct editing of explicit correlation matrix memories instead of full retraining when knowledge changes.
This paper introduces CAMS, a modular multi-document summarization framework that extracts atomic claims with token-level provenance, clusters equivalent claims, and rewrites them into summaries with fine-grained, multi-source traceability, significantly improving faithfulness and citation precision.