Trending

Trending stories ranked by heat, importance and recency.

#101

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv cs.CL · 10h ago Cached

Introduces ToolBench-X, a benchmark that evaluates large language model agents under a range of tool-environment reliability hazards and reveals a substantial performance gap relative to clean environments.

#102

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

arXiv cs.CL · 10h ago Cached

OPERA proposes a reinforcement learning method for open-ended tasks using intrinsic rewards based on perplexity dynamics, replacing unreliable LLM-as-a-judge reward models. It achieves state-of-the-art results on Qwen3-8B, matching proprietary models in creative writing and other open-ended tasks.

#103

BitNet Text Embeddings

arXiv cs.CL · 10h ago Cached

This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.

#104

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

arXiv cs.CL · 10h ago Cached

MedGuards proposes a multi-agent framework for detecting and correcting errors in medical text using specialized agents and confidence-guided arbitration, improving reliability without additional training. Experiments on multilingual clinical notes show significant improvements.

#105

Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents

arXiv cs.CL · 10h ago Cached

This paper proposes ReverieMem, a three-layer memory architecture for book-based LLM role-playing agents that prevents factual overreach and stylistic monotony. It also introduces the KBF-QA benchmark and achieves significant improvements in knowledge boundary fidelity and narrative quality.

#106

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

arXiv cs.CL · 10h ago Cached

This paper identifies and analyzes 'tool suppression' in open-weight LLMs when both tool calling and JSON schema constraints are simultaneously enabled, proposing the Constraint Priority Inversion hypothesis and a mitigation strategy called Transparent Two-Pass Execution.

#107

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

arXiv cs.CL · 10h ago Cached

Riazi-8B is an Urdu large language model fine-tuned for mathematical reasoning, achieving improved performance on MGSM-Urdu through continued pre-training and supervised fine-tuning on Urdu Chain-of-Thought data.

#108

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

arXiv cs.CL · 10h ago Cached

BiPACE introduces a drop-in advantage estimator that fixes state-action credit mismatch in stepwise group-based RL for LLM agents, using bisimulation-guided state clustering and action counterfactual estimation, achieving significant performance gains on ALFWorld, WebShop, and TextCraft with Qwen2.5 models.

#109

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 10h ago Cached

This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.

#110

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

arXiv cs.CL · 10h ago Cached

This paper shows that a language model with a lossy memory that retains a wrong conclusion but drops the evidence produces confident incorrect answers, whereas an empty memory leads to abstention. The authors propose a source-first compression policy that preserves recomputable sources instead of conclusions to maintain correctability, and demonstrate the mechanism across multiple models and dialogue systems.

#111

Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making

arXiv cs.CL · 10h ago Cached

This paper introduces Agent-Authored World Modeling (AAWM), a training procedure that constructs world-model supervision based on the policy's own decision needs rather than next-observation prediction, aligning the learning objective with the dynamics required for effective decision-making.

#112

Certification of Machine Learning Models via Directional Sharpness

arXiv cs.LG · 10h ago Cached

This paper introduces directional sharpness, a new metric for certifying the generalization performance of machine learning models that is both efficient to compute and more reliable than existing proxies like test accuracy or traditional sharpness, even when training deviates from prescribed procedures.

#113

A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

arXiv cs.CL · 10h ago Cached

This survey synthesizes research on toxicity detection and detoxification for multilingual large language models, cataloging threat models, task formulations, detection approaches, and mitigation strategies, while identifying persistent challenges such as uneven language coverage and culturally contingent definitions of harm.

#114

TRACER: Training-Free Closed-Loop Structured Inference for Traffic Accident Reconstruction

arXiv cs.LG · 10h ago Cached

TRACER is a training-free framework for traffic accident reconstruction that formulates the problem as closed-loop structured inference, iteratively refining event-anchored motion hypotheses under geometric and kinematic constraints, achieving improved fidelity and consistency over data-driven and physics-based baselines.

#115

Erased, but Not Gone: Output Forgetting Is Not True Forgetting

arXiv cs.LG · 10h ago Cached

This paper argues that standard output-level evaluations of machine unlearning overestimate success, showing that methods can appear successful at the output layer while retaining structured representation-level discrepancies relative to retrained models. The authors propose retraining-consistent representation forgetting as a stronger evaluative lens.

#116

Geo-Strat-RL: Learning Geological Event Reasoning from Verifiable Tasks

arXiv cs.LG · 10h ago Cached

This paper presents Geo-Strat-RL, a synthetic environment that uses reinforcement learning with verifiable rewards (RLVR) to train vision-language models to reason about geological event histories from stratigraphic diagrams and seismic data, demonstrating improved reconstruction and cross-domain transfer.

#117

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

arXiv cs.CL · 10h ago Cached

Introduces Local Branch Routing (LBR), a token-level test-time scaling framework that expands a local lookahead tree and uses a lightweight router to select the best branch. LBR improves reasoning on mathematical benchmarks over chain-of-thought and other baselines.

#118

Hybrid-IR: Dual-Path Hybrid Retrieval with Iterative Reasoning for Complex Medical Question Answering

arXiv cs.CL · 10h ago Cached

Hybrid-IR introduces a dual-path retrieval framework that combines graph-based and dense retrieval with iterative reasoning to improve complex medical QA, addressing limitations of existing RAG methods. Experiments on three benchmarks demonstrate its effectiveness.

#119

Internal Data Repetition Destroys Language Models

arXiv cs.LG · 10h ago Cached

This paper systematically studies the damage caused by exact document repetition during language model pretraining. It shows that the most harmful regime is repeating a moderately sized subset a moderate number of times, and that repetition can waste up to 33% of compute (as measured by compute-equivalent loss).

#120

Improved Large Language Diffusion Models

arXiv cs.CL · 10h ago Cached

iLLaDA is an 8B-parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.
