Tag
This paper shows that language model reasoning trajectories sampled at test time cluster into 'reasoning basins', so majority voting fails whenever the dominant basin is incorrect. It introduces ARBITER, a model-agnostic method that uses conservative additive evidence from the model's own outputs and hidden states to improve accuracy without external data.
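The failure mode described above can be illustrated with a minimal sketch. This is not ARBITER itself, only plain majority voting over sampled final answers. The sample list, the answer strings, and the `majority_vote` helper are hypothetical and chosen for illustration; they do not come from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled trajectories."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical toy data (an assumption, not results from the paper):
# five of seven sampled trajectories fall into one incorrect basin ("42"),
# while two reach the correct answer ("36").
samples = ["42", "42", "36", "42", "42", "36", "42"]
correct = "36"

voted = majority_vote(samples)
print(voted, voted == correct)  # prints: 42 False
```

Because the vote only counts answers, drawing more samples from the same distribution would not help in this toy case: the incorrect basin keeps its majority.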