This paper investigates whether early-token confidence signals from LLM decoding can predict reasoning quality in multi-agent debate systems, finding that confidence in the first few generated tokens is the strongest predictor of rubric-based essay scores.
This paper studies the relationship between token-level log-probability distributions, LLM-as-judge rubric scores, and final task accuracy in multi-agent debate systems. It reports a consistent four-phase confidence trajectory and a role asymmetry between Constructor and Auditor agents.