Tag
This paper formulates adaptive sampling for large language models as a Markov decision process and trains a lightweight RL controller that balances correctness, latency, and computational cost, improving the trade-off among the three.
The NOVA framework models the 'generate, verify, accumulate, retrain' loop as an adaptive sampling process over a knowledge space, identifying failure modes and proving a scaling law for cumulative generation cost under Zipf-like discovery distributions.
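To make the Zipf-like discovery setting in the NOVA summary concrete, the following is a minimal simulation sketch, not the paper's method or its scaling law. It assumes that each generation draws one item independently from a fixed Zipf distribution over a finite knowledge space, that verification always succeeds, and that every draw costs one unit. The names simulate_discovery, cost_to_reach, and zipf_weights, and the parameters n_items and exponent, are hypothetical and chosen only for illustration.

```python
import random
from itertools import accumulate


def zipf_weights(n_items, exponent=1.0):
    """Unnormalized Zipf weights: the item of rank r gets weight r ** -exponent."""
    return [rank ** -exponent for rank in range(1, n_items + 1)]


def simulate_discovery(n_items, n_draws, exponent=1.0, seed=0):
    """Return a list whose entry t is the number of distinct items seen after t + 1 draws.

    Assumption (not from the paper): draws are i.i.d. from a fixed Zipf
    distribution, and each draw costs one unit of generation.
    """
    rng = random.Random(seed)
    cum_weights = list(accumulate(zipf_weights(n_items, exponent)))
    draws = rng.choices(range(n_items), cum_weights=cum_weights, k=n_draws)

    seen = set()
    discovered = []
    for item in draws:
        seen.add(item)  # a repeated item still costs a draw but adds nothing new
        discovered.append(len(seen))
    return discovered


def cost_to_reach(discovered, target):
    """Number of draws spent until `target` distinct items were accumulated, or None."""
    for step, count in enumerate(discovered, start=1):
        if count >= target:
            return step
    return None


if __name__ == "__main__":
    curve = simulate_discovery(n_items=1000, n_draws=20000, exponent=1.0, seed=0)
    for target in (100, 200, 400):
        print(target, cost_to_reach(curve, target))
```

Running the script prints, for each target, the cumulative number of generations needed to accumulate that many distinct items. Under these assumptions, rare items in the tail are drawn infrequently, so the cost per newly discovered item tends to rise as the accumulated set grows; the exact growth rate proved in the paper is not reproduced here.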