Tag
This paper investigates whether online skill and memory modules for web agents are worth their token cost under a fixed inference budget, finding that a budget-matched vanilla baseline often matches or outperforms augmented methods across three domains and models.
This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.
This paper proposes a staged factorial screening workflow for budget-constrained micro-pretraining, demonstrating that short designed experiments can identify stable hyperparameter penalty directions and support a screen-then-refine strategy.