Introduces Strategy-Guided Policy Optimization (SGPO) for LLM reasoning, which replaces trajectory imitation with strategy distillation, improving generalization on math benchmarks.
Two recent arXiv papers found that GPT-5.4 and Claude Opus 4.6 employ a metaprogramming strategy when handling unfamiliar programming languages: they generate the target code with Python and debug it locally rather than writing the target-language code directly. The papers identify this strategy as key to distinguishing top-tier agents from average ones, and conclude that strategy sophistication matters more than model parameter scale.
ReNIO enhances on-policy distillation for LLMs by reweighting negative trajectories based on token-level probability ratios, improving reasoning performance in mathematical and code generation tasks.
A position paper by Subbarao Kambhampati and researchers at Arizona State University argues that chain-of-thought reasoning in LLMs creates an illusion of reasoning, and the industry needs to move beyond costly token generation to alternative reasoning mechanisms.
Introduces the Independent Combinatorial Tokens (ICT) framework, which uses Jensen-Shannon divergence between token logit distributions to identify critical branching points, preventing entropy collapse and explosion in RLVR for LLM reasoning. Achieves up to a 14.9% pass@4 improvement on Qwen models.
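As a rough illustration of the quantity named in this summary, the sketch below computes the Jensen-Shannon divergence between two next-token distributions given as logit vectors. The summary does not say which pairs of distributions ICT compares or how it thresholds the result, so the function names and inputs here are assumptions for illustration only, not the paper's implementation.

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(logits_a, logits_b):
    """Jensen-Shannon divergence between two logit vectors over the same vocabulary.

    Hypothetical helper: the inputs are assumed to be two candidate
    next-token logit vectors at the same position.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

The value is symmetric, is zero when both distributions agree, and is bounded above by ln 2 in nats, so a large value marks a position where the two distributions disagree sharply.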
This paper proposes Trajectory-Augmented Policy Optimization (TAPO), which constructs micro-reflective correction trajectories using the model's own correct and incorrect rollouts to improve reasoning in large language models, outperforming standard self-distillation methods on math benchmarks.
Proposes REVES, a two-stage iterative framework that alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems.
This paper introduces CoRA, a GRPO-based reinforcement learning framework that aligns LLM confidence with generated rationales to improve the reliability of chain-of-thought reasoning, achieving up to a 26.51% reduction in misalignment error across multiple benchmarks.
Introduces Adelic operation-preserved embeddings (AOE), a training-free representation that encodes numbers by combining their real value with p-adic expansions, preserving additive and multiplicative structure. Achieves perfect accuracy on the Weaving Pattern benchmark.
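To make "p-adic expansion" concrete, here is a minimal sketch for non-negative integers. It relies only on a standard fact: the first k base-p digits of n encode n mod p^k, and reduction mod p^k is compatible with addition and multiplication. How AOE turns these digits into an embedding is not described in this summary, so the helper names below are hypothetical.

```python
def padic_digits(n, p, k):
    """First k base-p digits of a non-negative integer n, least significant first."""
    digits = []
    for _ in range(k):
        n, r = divmod(n, p)
        digits.append(r)
    return digits

def from_digits(digits, p):
    """Rebuild the integer encoded by base-p digits (least significant first)."""
    return sum(d * p**i for i, d in enumerate(digits))

def truncate(n, p, k):
    """Value kept by a k-digit p-adic truncation, i.e. n mod p**k."""
    return from_digits(padic_digits(n, p, k), p)

# 10 = 1 + 0*3 + 1*9, so its first four 3-adic digits are:
print(padic_digits(10, 3, 4))  # → [1, 0, 1, 0]
```

Because truncation is just reduction mod p^k, truncating a sum or product gives the same result as adding or multiplying the truncated operands and reducing again.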
Proposes Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework for aligning LLM reasoning in mental health assessment, achieving an average improvement of 10.4 percentage points in weighted F1-score over existing baselines.
This paper introduces MARS, a stopping rule for parallel LLM test-time scaling that probes partial traces to stop early without sacrificing accuracy, saving 25–47% of tokens across reasoning models on competition math benchmarks.
Introduces RecToM, an inference-time framework that models nested beliefs via recursive perspective construction for Theory of Mind reasoning in LLMs, achieving state-of-the-art performance on multiple benchmarks.
This paper introduces Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that improves the efficiency of power sampling for enhancing base language model reasoning. EGPS achieves up to 12.6x speedup over standard Metropolis-Hastings sampling while reaching best or tied-best accuracy on benchmarks like MATH500, HumanEval, and GPQA.
This paper investigates whether early-token confidence signals from LLM decoding can predict reasoning quality in multi-agent debate systems, finding that confidence in the first few generated tokens is the strongest predictor of rubric-based essay scores.
TRACE is a unified rollout budget allocation framework that enhances reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. It improves efficiency and accuracy on agentic benchmarks like Multi-Hop QA.
This paper introduces the Prefix Utility Model (PUM), which evaluates LLM reasoning prefixes by their utility (the improvement in solve rate) rather than by local correctness. PUM shows strong performance in mathematical reasoning tasks across selection, search, and reinforcement learning.
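The notion of utility as "improvement in solve rate" can be sketched as a simple Monte Carlo estimate: compare how often completions succeed when they continue from a prefix against how often they succeed without it. The summary does not give PUM's actual estimator or training target, so the functions and the boolean-outcome inputs below are assumptions for illustration.

```python
def solve_rate(outcomes):
    """Fraction of rollouts that reached a correct final answer."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    return sum(outcomes) / len(outcomes)

def prefix_utility(outcomes_with_prefix, outcomes_baseline):
    """Estimated change in solve rate from conditioning on a reasoning prefix.

    Hypothetical helper: each argument is a list of booleans, one per
    sampled completion, True when the completion solved the problem.
    """
    return solve_rate(outcomes_with_prefix) - solve_rate(outcomes_baseline)
```

Under this reading, a prefix whose steps are all locally correct can still score zero or negative if it does not raise the chance of reaching the right answer.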
ThinkBooster is a unified framework for test-time compute scaling of LLM reasoning, providing a modular Python library, a performance-efficiency benchmark, an OpenAI-compatible proxy service, and a visual debugger. Empirical results on math and coding tasks demonstrate practical gains with quality-cost trade-offs.
This survey reviews the use of large language models for graph computation, categorizing them into two paradigms: LLMs as executors and LLMs as planners. It finds LLMs promising for simple tasks but unreliable for large-scale exact computations, and suggests future directions.
AI agents often fail due to authentication hurdles like email verification, OTP timeouts, and captchas, not due to reasoning errors, highlighting infrastructure challenges in production.
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.