Tag
This paper decomposes the faithfulness gap in LLM agents into reasoning→conclusion and conclusion→action steps using Texas Hold'em poker as a controlled environment. It finds that the conclusion→action step is reliable, while the reasoning→conclusion step is the primary source of inconsistency.
DexHoldem is a real-world benchmark for evaluating embodied agents on dexterous manipulation. It uses Texas Hold'em played with a ShadowHand to test primitive execution, perception, and decision-making in a closed-loop setting.