Tag
Introduces 'Machine Studying' as a new formulation of continual learning in which AI systems autonomously develop expertise from a corpus, and presents StudyBench for evaluation.
Introduces 'Machine Studying' as a problem in which AI agents must autonomously develop expertise from a corpus, going beyond RAG or long-context approaches, and presents the StudyBench benchmark for evaluation.
This paper systematically studies how inference-time compute (token budgets, context compaction, repeated submissions) affects frontier LLM performance on challenging benchmarks, demonstrating that scores are protocol-dependent and advocating for evaluations that report capability as a function of inference compute.
TMAS introduces a multi-agent framework that enhances large language model reasoning by scaling test-time compute through structured collaboration and hierarchical memory systems. The approach uses specialized agents, cross-trajectory information flow, and hybrid reward reinforcement learning to improve iterative scaling and stability on challenging reasoning benchmarks.