Tag
This paper proposes Privileged-Future On-Policy Self-Distillation (PF-OPSD) for controlled concrete reasoning, combining world models' visual simulation with language models' abstract reasoning to improve prediction accuracy and robustness on two new benchmarks.
GraphARC is a new benchmark for abstract reasoning on graph-structured data, extending the ARC paradigm to graphs. Evaluations of state-of-the-art language models reveal a comprehension-execution gap and performance degradation on larger instances, highlighting scaling challenges.
This paper introduces A2RBench, an automated pipeline for generating formally verifiable abstract reasoning benchmarks for LLMs. The pipeline uses cycle consistency to ensure unique solutions, and evaluations reveal that current LLMs significantly underperform humans on 3D reasoning tasks.