This paper introduces Base Sequence Analysis, a framework that encodes LLM agent runtime behavior into compact sequences, revealing high-risk patterns like the 'P-X-P' trigram and a verification deficit. It presents Governor, a runtime intervention system that improves task success by 6.2% and reduces token consumption by 44%.
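To make the idea of scanning an encoded behavior sequence for a trigram such as 'P-X-P' concrete, here is a minimal sketch. It is not the paper's implementation: the function name `count_trigrams`, the sample trace, and the extra symbol `V` are hypothetical, and the symbols are treated as opaque labels because their meaning is not given here.

```python
from collections import Counter


def count_trigrams(sequence):
    """Count every run of three consecutive symbols in a behavior sequence.

    Each trigram is keyed as "A-B-C". Sequences shorter than three
    symbols yield an empty Counter.
    """
    return Counter(
        "-".join(sequence[i:i + 3]) for i in range(len(sequence) - 2)
    )


# Hypothetical encoded trace; the symbols are opaque labels for this sketch.
trace = ["P", "X", "P", "X", "P", "V"]
counts = count_trigrams(trace)

# Number of times the 'P-X-P' pattern appears in this made-up trace.
pxp_occurrences = counts["P-X-P"]
```

A runtime monitor could, under these assumptions, call such a counter on a sliding window of recent steps and react when a flagged trigram appears; how Governor actually intervenes is not described here.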
Researchers evaluate 28 LLMs on the St. Petersburg game to separate outcome-level resemblance from mechanism-level alignment in decision-making under risk. They find that LLMs often produce human-like bids without human-consistent reasoning mechanisms behind them. The study shows that behavioral alignment can be superficial and urges high-stakes evaluations to look beyond outcome similarity.
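For readers unfamiliar with the St. Petersburg game, the sketch below shows why bids in it are informative: the expected payoff grows without bound as the cap on flips is raised, yet people typically bid only small amounts. It assumes one common textbook convention (a fair coin is flipped until the first head, and the payoff is 2**k when that head lands on flip k). The study's exact variant is not stated here, and the function names are hypothetical.

```python
import random


def truncated_expected_value(max_flips):
    """Expected payoff when the game is capped at max_flips flips.

    Assumed convention: the first head on flip k has probability 2**-k
    and pays 2**k, so each possible round contributes exactly 1 to the
    expectation. Outcomes beyond the cap are treated as paying nothing.
    """
    return sum((0.5 ** k) * (2 ** k) for k in range(1, max_flips + 1))


def play_once(rng):
    """Simulate one game and return its payoff under the same convention."""
    flips = 1
    while rng.random() < 0.5:  # tails: keep flipping
        flips += 1
    return 2 ** flips


# The capped expectation equals the cap itself, so it diverges as the cap grows.
for cap in (10, 20, 40):
    print(cap, truncated_expected_value(cap))

# A single simulated play, seeded only for repeatability.
print(play_once(random.Random(0)))
```

Because the capped expectation grows linearly with the cap, any finite bid falls short of the theoretical expected value, which is what makes the bid itself, and the reasoning behind it, worth comparing between humans and models.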