Tag
This paper proposes SGR, a framework that enhances LLM stepwise reasoning by integrating external knowledge graphs through query-relevant subgraph generation, combining Cypher-based reasoning with collaborative reasoning integration. Experiments on CWQ, WebQSP, GrailQA, and KQA Pro show improved reasoning accuracy over standard prompting and knowledge-enhanced baselines.
This paper introduces CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step agentic RAG pipelines, where early-stage errors propagate and amplify across reasoning steps. CHARM achieves an 89.4% cascade detection rate and 82.1% error propagation reduction across multiple benchmarks with low latency overhead.
This paper proposes SGDR (State-Grounded Dynamic Retrieval), an online skill learning method for web agents that enables stepwise, state-aware skill reuse rather than static task-level retrieval. Experiments on WebArena show that SGDR achieves a 37.5% success rate with GPT-4.1, a relative gain of roughly 10.6% over strong baselines.
An article summarizing Anthropic's 2025 paper on mechanistic interpretability, showing that LLMs are not black boxes and that circuit tracing can reveal multi-step reasoning and human-identifiable concepts.
This paper proposes HyperGuide, a method that distills reasoning progress into a hyperbolic geometric signal to guide step-by-step generation in LLMs, improving multi-step reasoning efficiency without explicit tree search.
This paper presents a reinforcement learning post-training pipeline for tool-calling LLM agents operating on FHIR healthcare data, achieving 77% answer correctness on FHIR-AgentBench with a smaller Qwen3-8B model, compared to 50% with o4-mini.
GPT-5.5 delivers a 19-percentage-point improvement in multi-step reasoning and financial modeling and significantly reduces the burden of knowledge work, to the excitement of the Box team.