Trending stories ranked by heat, importance and recency.
This paper proposes a reinforcement learning framework for computer-use agents that uses autonomous vision-language evaluation as a scalable reward signal, modeling evaluator noise to improve task success rates across desktop environments.
This paper explores cross-lingual prompting strategies to improve access to parametric knowledge in large language models, demonstrating significant gains in knowledge transfer and factual recall across 17 languages on multilingual benchmarks.
The paper discusses the small scaling exponents of large language models, arguing that they indicate an unsustainable regime in terms of energy resources. It also examines the 'pedestal effect' and draws analogies with fluid turbulence to comment on data smoothness.
The paper introduces the Latent Bridge, a trainable continuous channel that couples a slow reasoning VLM (Qwen3-VL-8B-Thinking) and a fast reactive VLM (MiniCPM-o 4.5) for real-time game agents. Experiments on Atari games and MetaDrive show it matches or outperforms the text-based bridge while avoiding destructive interference when used alone.
AGORA is a new benchmark for evaluating large language models on archive-grounded reasoning tasks across workplace documents, comprising 362 questions over 9,664 real documents. The strongest model achieves only 59.4% accuracy, highlighting substantial room for improvement.
CompressKV proposes a semantic-retrieval-guided KV-cache compression method for GQA-based LLMs, identifying Semantic Retrieval Heads to retain critical tokens. It retains over 97% of full-cache performance while using only 3% of the KV cache on LongBench tasks.
This paper formulates orchestration of coding agents as cost-sensitive sequential hypothesis testing using a Bayesian controller that dynamically decides when to gather evidence, refine, verify, or stop. Experiments across six generators and nine benchmarks show Bayesian control is most valuable when verification is costly and critics are informative but imperfect.
This paper systematically quantifies the tokenization penalty for 20 African languages across 11 frontier and open tokenizers, finding inference cost and latency multipliers of up to 8.9× and effective context windows as small as 11% of the English equivalent, highlighting a structural digital divide encoded in subword vocabularies.
This paper investigates whether aggregate structural invariants, specifically spectral bounds, can accelerate continuous subgraph matching (CSM) over dynamic graphs. It characterizes limitations of lazy spectral maintenance, shows exact maintenance is affordable when selective, and demonstrates pruning power of up to 51% in benchmarks.
ExtractConf is a confidence estimation method for LLM-based document field extraction that uses two structurally different calls (field-guided and document-guided) to derive disagreement signals, achieving 0.928 ROC AUC on DocILE invoices and enabling reliable selective prediction for high-stakes automation.
This paper proposes a cycle-consistent neural architecture that generates faithful natural language explanations of formal verification certificates, achieving 90% soundness and 860x faster inference than LLM baselines.
This paper introduces PHANTOM, a large-scale open-source dataset of pre-generated adversarial attacks for vision-language models, covering 10 high-level categories and 55 subcategories of harmful intents with 47,524 adversarial samples. The dataset aims to lower the barrier for adversarial research and enable systematic evaluation of VLM robustness and safety.
This paper introduces DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that uses generation-axis pipeline parallelism and trainer-assisted generation to improve throughput by 1.56-2.10x over existing systems.
This paper is a comprehensive survey of transformer-based language models, covering architectures, applications across domain verticals (healthcare, finance, legal, etc.), and a critical assessment of trade-offs including compute cost, alignment, and data provenance.
This paper introduces Prob-BBDM, a probabilistic Brownian Bridge Diffusion Model for efficient and high-quality MRI sequence synthesis from 2D axial slices, achieving up to 88.46% SSIM and 26.09 dB PSNR with only 4 diffusion steps, and demonstrating clinical utility in tumor segmentation.
AVOC introduces a retrieval-inspired token compression method for omni-modal LLMs that effectively handles hour-long audio-video inputs by selecting informative tokens based on relevance, importance, and diversity. The framework achieves state-of-the-art results on long-form audio-video understanding benchmarks, surpassing prior methods by significant margins.
This paper presents LemonHarness, an integrated execution framework for long-horizon LLM agents that constrains state-changing operations within a clearly defined workspace, introduces a reusable rule knowledge base, and adds time-aware execution. It achieves 84-86% accuracy on Terminal-Bench 2.0.
SURGeLLM introduces a unified transformer framework with surgical feature gates, task-conditioned prefix tokens, and instance-weighted normalization to address mismatched inductive biases, class imbalance, and lexical knowledge injection in multi-task learning, achieving significant gains across four diverse NLP tasks.
This paper presents a rigorous N-qubit theory of stochastic quantum neural networks (SQNNs) for adversarially robust network intrusion detection, proving a decoherence-contraction theorem and showing that depolarising noise provides robustness against adversarial attacks, with experiments on the NSL-KDD dataset.
This paper introduces FedEPD, a framework for federated graph learning under long-tailed data distributions. It uses an energy-guided dual decoupling approach to separate topological purification from semantic recalibration, achieving state-of-the-art performance on benchmarks with up to 4.97% accuracy improvement.