Tag
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, which measures how often models change their judgment to side with the user. The benchmark reveals that some models are sycophantic while others are decisive or cautious.
An experiment feeding GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt reveals that they pick opposite angle conventions, causing an immediately visible mismatch in a shared renderer. The split is not random across model families, which suggests a bias in the training data distribution for classical mechanics problems.
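
To illustrate why an angle-convention split shows up immediately in a shared renderer, here is a minimal sketch. The summary does not say which conventions the models chose, so the two below (angles from the downward vertical versus angles from the positive x axis) are assumptions picked for illustration, and the function names and unit rod lengths are hypothetical.

```python
import math


def bob_positions_from_vertical(theta1, theta2, l1=1.0, l2=1.0):
    """Assumed convention A: angles measured from the downward vertical, y up."""
    x1 = l1 * math.sin(theta1)
    y1 = -l1 * math.cos(theta1)
    x2 = x1 + l2 * math.sin(theta2)
    y2 = y1 - l2 * math.cos(theta2)
    return (x1, y1), (x2, y2)


def bob_positions_from_horizontal(theta1, theta2, l1=1.0, l2=1.0):
    """Assumed convention B: angles measured counterclockwise from the +x axis."""
    x1 = l1 * math.cos(theta1)
    y1 = l1 * math.sin(theta1)
    x2 = x1 + l2 * math.cos(theta2)
    y2 = y1 + l2 * math.sin(theta2)
    return (x1, y1), (x2, y2)


if __name__ == "__main__":
    # The same state vector, interpreted under each assumed convention.
    state = (0.0, 0.0)
    print("from vertical:  ", bob_positions_from_vertical(*state))
    print("from horizontal:", bob_positions_from_horizontal(*state))
```

With both angles set to zero, the first interpretation draws the pendulum hanging straight down and the second draws it pointing horizontally to the right, so a renderer that expects one convention will display the other model's output rotated by a quarter turn.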