SPEAR is a code-augmented agentic prompt optimizer that uses a Python sandbox for structural error analysis, achieving state-of-the-art performance on multiple LLM evaluation suites including industrial judge tasks, BBH, and GSM8K.
GuardedRepair is a proposed framework for post-hoc repair of LLM mathematical reasoning. It applies selective replacement with safety guards to fix errors while minimizing harm to correct traces. On GSM8K, it improves accuracy from 95.60% to 96.89% without breaking correct answers.
This paper identifies a 'positional copying' shortcut where small language models answer arithmetic questions by copying the last number before the answer delimiter, bypassing actual reasoning. This effect explains why shuffling CoT steps retains performance; it accounts for 89-92% of teacher-forcing accuracy in 1-3B models on GSM8K.
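The shortcut described above can be made concrete with a small baseline that never reasons at all. The following is a minimal sketch, not the paper's code: it assumes GSM8K-style traces in which the final answer follows a `####` delimiter, and the function name `copy_last_number` and the number-matching regex are illustrative choices.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators (e.g. "1,250.5").
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def copy_last_number(trace: str, delimiter: str = "####") -> Optional[str]:
    """Return the last number that appears before the answer delimiter.

    This is a no-reasoning baseline: it ignores what the steps say and
    only looks at the position of numbers in the text.
    """
    prefix = trace.split(delimiter, 1)[0]
    matches = NUMBER.findall(prefix)
    if not matches:
        return None
    # Drop thousands separators so "1,250" compares equal to "1250".
    return matches[-1].replace(",", "")


trace = "She has 16 - 3 - 4 = 9 eggs left. 9 * 2 = 18 dollars.\n#### 18"
print(copy_last_number(trace))
```

Because the baseline depends only on which number comes last before the delimiter, reordering the earlier steps does not change its output, which is consistent with the shuffling observation in the summary.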
HRM-Text introduces a Hierarchical Recurrent Model that decouples computation into slow and fast layers, enabling efficient pretraining from scratch on only 40 billion tokens and a $1,500 budget while achieving performance competitive with larger models.
OpenAI trained a system that uses verifiers to solve grade school math word problems, reaching 90% of child-level accuracy and nearly doubling the accuracy of fine-tuned GPT-3. The approach addresses language models' weakness in multistep reasoning by training verifiers to evaluate candidate solutions and select the best one.
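The selection step in that summary, scoring several candidate solutions and keeping the highest-scoring one, can be sketched as follows. This is an illustration under stated assumptions: `select_best` and `toy_verifier` are hypothetical names, and the toy verifier is a hand-written arithmetic checker standing in for the trained verifier model, which it does not resemble.

```python
import re
from typing import Callable, Sequence

# Finds simple statements of the form "a op b = c" with non-negative integers.
EQUATION = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def toy_verifier(solution: str) -> float:
    """Stand-in scorer (assumption): fraction of stated equations that hold."""
    checks = [
        OPS[op](int(a), int(b)) == int(c)
        for a, op, b, c in EQUATION.findall(solution)
    ]
    return sum(checks) / len(checks) if checks else 0.0


def select_best(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=verifier)


candidates = [
    "16 - 3 = 12, 12 * 2 = 24",  # contains an arithmetic slip
    "16 - 3 = 13, 13 * 2 = 26",  # every stated equation holds
]
print(select_best(candidates, toy_verifier))
```

Any callable that maps a solution string to a score can be passed as `verifier`, so a learned model could replace the toy scorer without changing `select_best`.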