gsm8k

#gsm8k

SPEAR: Code-Augmented Agentic Prompt Optimization

arXiv cs.CL · 2026-05-27

SPEAR is a code-augmented agentic prompt optimizer that uses a Python sandbox for structural error analysis, achieving state-of-the-art performance on multiple LLM evaluation suites including industrial judge tasks, BBH, and GSM8K.

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

arXiv cs.CL · 2026-05-26

GuardedRepair is a framework for post-hoc replacement of LLM mathematical reasoning. It replaces reasoning traces selectively, using safety guards to fix errors while minimizing harm to traces that are already correct. On GSM8K it improves accuracy from 95.60% to 96.89% without breaking correct answers.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

arXiv cs.LG · 2026-05-25

This paper identifies a 'positional copying' shortcut where small language models answer arithmetic questions by copying the last number before the answer delimiter, bypassing actual reasoning. This effect explains why shuffling CoT steps retains performance; it accounts for 89-92% of teacher-forcing accuracy in 1-3B models on GSM8K.
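
To make the shortcut concrete, here is a minimal sketch of a readout that ignores the reasoning and simply copies the last number before the answer delimiter. The `####` delimiter follows the GSM8K reference-solution format; the function name, the regular expression, and the example trace are assumptions for illustration and are not taken from the paper.

```python
import re
from typing import Optional

# Assumption: "####" marks the final answer, as in GSM8K reference solutions.
DELIMITER = "####"
# Matches integers and decimals, allowing thousands separators such as 1,000.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def copy_last_number(cot: str, delimiter: str = DELIMITER) -> Optional[str]:
    """Return the last number that appears before the delimiter.

    No arithmetic is performed: this is the pure positional-copying baseline.
    """
    prefix = cot.split(delimiter, 1)[0]
    matches = NUMBER.findall(prefix)
    if not matches:
        return None
    return matches[-1].replace(",", "")


# Hypothetical chain-of-thought trace ending at the answer delimiter.
trace = "Tom has 3 boxes of 4 apples, so 3 * 4 = 12. He eats 2, leaving 12 - 2 = 10. #### "
print(copy_last_number(trace))
```

A baseline like this gives a reference point: if a model's accuracy stays close to it, the answer may be read off by position rather than computed.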

HRM-Text: Efficient Pretraining Beyond Scaling

arXiv cs.CL · 2026-05-21

HRM-Text introduces a Hierarchical Recurrent Model that decouples computation into slow and fast layers, enabling efficient pretraining from scratch on only 40 billion tokens and a $1,500 budget while remaining competitive with larger models.

Solving math word problems

OpenAI Blog · 2021-10-29

OpenAI trained a system that uses verifiers to solve grade school math word problems, reaching 90% of child-level accuracy and nearly doubling the performance of fine-tuned GPT-3. The approach addresses language models' weakness in multistep reasoning by training verifiers to evaluate candidate solutions and select the best one.
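
The select-the-best-candidate step can be sketched as follows. This is only an illustration of the selection logic: `select_best` and `toy_verifier` are hypothetical names, and the toy scorer (which checks multiplication statements in the text) stands in for a trained verifier model, which would assign a learned correctness score instead.

```python
import re
from typing import Callable, Sequence

# Matches statements of the form "a * b = c" in a candidate solution.
EQUATION = re.compile(r"(\d+) \* (\d+) = (\d+)")


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a trained verifier.

    Scores a solution by the fraction of its multiplication statements
    that are arithmetically correct (0.0 if it contains none).
    """
    checks = [int(a) * int(b) == int(c) for a, b, c in EQUATION.findall(solution)]
    return sum(checks) / len(checks) if checks else 0.0


def select_best(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=verifier)


# Hypothetical candidates sampled for the same problem.
candidates = [
    "3 * 4 = 13, so the answer is 13",
    "3 * 4 = 12, so the answer is 12",
]
print(select_best(candidates, toy_verifier))
```

Passing the verifier as a function keeps the selection step independent of how candidates are generated or scored.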
