Trending stories ranked by heat, importance and recency.
ToolBench-X is a benchmark for evaluating large language model agents under a range of tool-environment reliability hazards. It reveals a substantial performance gap relative to clean environments.
OPERA proposes a reinforcement learning method for open-ended tasks using intrinsic rewards based on perplexity dynamics, replacing unreliable LLM-as-a-judge reward models. It achieves state-of-the-art results on Qwen3-8B, matching proprietary models in creative writing and other open-ended tasks.
This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.
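To make "ternary weights" concrete, here is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58. It is an illustration only: the summary does not say which quantizer BitEmbed uses, and the function names `ternarize` and `dequantize` are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, +1} with one shared scale.

    Assumption: absmean scaling as in BitNet b1.58, not necessarily
    the scheme BitEmbed applies.
    """
    # Shared scale is the mean absolute weight (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights from the ternary codes."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
ternary, scale = ternarize(weights)
print(ternary)  # [1, 0, 1, -1, 0, 1]
print(dequantize(ternary, scale))
```

Each weight then needs under two bits of storage plus one floating-point scale per tensor, which is where the storage savings of a ternary encoder would come from.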
MedGuards proposes a multi-agent framework for detecting and correcting errors in medical text using specialized agents and confidence-guided arbitration, improving reliability without additional training. Experiments on multilingual clinical notes show significant improvements.
This paper proposes ReverieMem, a three-layer memory architecture for book-based LLM role-playing agents that prevents factual overreach and stylistic monotony. It also introduces the KBF-QA benchmark and achieves significant improvements in knowledge boundary fidelity and narrative quality.
This paper identifies and analyzes 'tool suppression' in open-weight LLMs, which occurs when tool calling and JSON schema constraints are enabled at the same time. It proposes the Constraint Priority Inversion hypothesis to explain the effect and a mitigation strategy called Transparent Two-Pass Execution.
Riazi-8B is an Urdu large language model fine-tuned for mathematical reasoning, achieving improved performance on MGSM-Urdu through continued pre-training and supervised fine-tuning on Urdu Chain-of-Thought data.
BiPACE introduces a drop-in advantage estimator that fixes state-action credit mismatch in stepwise group-based RL for LLM agents, using bisimulation-guided state clustering and action counterfactual estimation, achieving significant performance gains on ALFWorld, WebShop, and TextCraft with Qwen2.5 models.
This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.
This paper shows that when a language model's lossy memory retains a wrong conclusion but drops the supporting evidence, the model produces confident incorrect answers, whereas an empty memory leads it to abstain. The authors propose a source-first compression policy that preserves recomputable sources instead of conclusions to maintain correctability, and demonstrate the mechanism across multiple models and dialogue systems.
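The contrast between the three memory states can be sketched with a toy example. This is a simplified illustration, not the authors' implementation: the record layout and the helpers `compress_conclusion_first`, `compress_source_first`, and `answer` are all hypothetical.

```python
def compress_conclusion_first(record):
    """Lossy policy: keep the derived conclusion, drop the evidence."""
    return {"conclusion": record["conclusion"]}


def compress_source_first(record):
    """Source-first policy: keep the raw sources, drop the conclusion."""
    return {"sources": list(record["sources"])}


def answer(memory, recompute):
    """Answer from memory; return None (abstain) when memory is empty."""
    if "sources" in memory:
        return recompute(memory["sources"])  # conclusion is re-derived
    if "conclusion" in memory:
        return memory["conclusion"]          # stored value cannot be checked
    return None


# The stored conclusion is wrong: the sources actually sum to 245.
record = {"sources": [120, 80, 45], "conclusion": 255}

print(answer(compress_conclusion_first(record), sum))  # 255
print(answer(compress_source_first(record), sum))      # 245
print(answer({}, sum))                                 # None
```

Only the source-first memory can recover from the earlier mistake, because the wrong conclusion was never stored as the thing to trust.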
This paper introduces Agent-Authored World Modeling (AAWM), a training procedure that constructs world-model supervision based on the policy's own decision needs rather than next-observation prediction, aligning the learning objective with the dynamics required for effective decision-making.
This paper introduces directional sharpness, a new metric for certifying the generalization performance of machine learning models that is both efficient to compute and more reliable than existing proxies like test accuracy or traditional sharpness, even when training deviates from prescribed procedures.
This survey synthesizes research on toxicity detection and detoxification for multilingual large language models, cataloging threat models, task formulations, detection approaches, and mitigation strategies, while identifying persistent challenges such as uneven language coverage and culturally contingent definitions of harm.
TRACER is a training-free framework for traffic accident reconstruction that formulates the problem as closed-loop structured inference, iteratively refining event-anchored motion hypotheses under geometric and kinematic constraints, achieving improved fidelity and consistency over data-driven and physics-based baselines.
This paper argues that standard output-level evaluations of machine unlearning overestimate success, showing that methods can appear successful at the output layer while retaining structured representation-level discrepancies relative to retrained models. The authors propose retraining-consistent representation forgetting as a stronger evaluative lens.
This paper presents Geo-Strat-RL, a synthetic environment that uses reinforcement learning with verifiable rewards (RLVR) to train vision-language models to reason about geological event histories from stratigraphic diagrams and seismic data, demonstrating improved reconstruction and cross-domain transfer.
Local Branch Routing (LBR) is a token-level test-time scaling framework that expands a local lookahead tree and uses a lightweight router to select the best branch. LBR improves reasoning on mathematical benchmarks over chain-of-thought and other baselines.
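The general idea of a local lookahead with a router can be sketched as follows. This is a toy under stated assumptions: the summary does not describe the tree shape, the router, or how much of the chosen branch is committed, so `expand_branches`, `lbr_step`, the score table, and the choice to commit only the first token are all hypothetical.

```python
def expand_branches(prefix, propose, depth):
    """Enumerate every continuation of `prefix` up to `depth` tokens ahead."""
    branches = [[]]
    for _ in range(depth):
        branches = [b + [tok] for b in branches for tok in propose(prefix + b)]
    return branches


def lbr_step(prefix, propose, router, depth=2):
    """Score each lookahead branch and commit the first token of the best one."""
    branches = expand_branches(prefix, propose, depth)
    best = max(branches, key=lambda b: router(prefix + b))
    return prefix + best[:1]


# Toy stand-ins for a language model and a learned router.
SCORES = {"a": 1, "b": 0, "aa": 1, "ab": 1, "ba": 0, "bb": 5}


def propose(seq):
    return ["a", "b"]  # candidate next tokens


def router(seq):
    return SCORES.get("".join(seq), 0)  # branch quality estimate


print(lbr_step([], propose, router, depth=1))  # ['a']
print(lbr_step([], propose, router, depth=2))  # ['b']
```

With a one-token horizon the step picks the locally best token, while a two-token lookahead sees that the initially weaker token leads to the higher-scoring branch.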
Hybrid-IR introduces a dual-path retrieval framework that combines graph-based and dense retrieval with iterative reasoning to improve complex medical QA, addressing limitations in existing RAG methods. Experiments on three benchmarks demonstrate its effectiveness.
This paper systematically studies the damage caused by exact document repetition during language model pretraining, showing that repeating a moderately sized subset a moderate number of times maximally harms performance, and that repetition can waste up to 33% of compute (as measured by compute-equivalent loss).
iLLaDA is an 8B-parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.