model-benchmark

#model-benchmark

Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models.

Reddit r/singularity ↗ · 2026-05-21

Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, which measures how often models change their judgment to side with the user. The benchmark finds that some models are sycophantic, while others are decisive or cautious.


#model-benchmark

Gave GPT-4o and Claude the exact same double pendulum prompt. They picked opposite angle conventions within seconds.

Reddit r/ArtificialInteligence ↗ · 2026-05-16

An experiment that fed GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt found that they pick opposite angle conventions, causing an immediately visible mismatch in a shared renderer. The convention split is non-random across model families, which suggests a bias in the training data distribution for classical mechanics problems.
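To illustrate why a convention mismatch shows up immediately in a renderer, here is a minimal sketch. The post does not say which two conventions the models chose, so the pair below (angles measured from the downward vertical versus from the positive x-axis) is an assumption, and the function names are hypothetical.

```python
import math

def bobs_from_vertical(theta1, theta2, l1=1.0, l2=1.0):
    """Assumed convention A: angles measured from the downward vertical."""
    x1 = l1 * math.sin(theta1)
    y1 = -l1 * math.cos(theta1)
    x2 = x1 + l2 * math.sin(theta2)
    y2 = y1 - l2 * math.cos(theta2)
    return (x1, y1), (x2, y2)

def bobs_from_horizontal(theta1, theta2, l1=1.0, l2=1.0):
    """Assumed convention B: angles measured from the positive x-axis."""
    x1 = l1 * math.cos(theta1)
    y1 = l1 * math.sin(theta1)
    x2 = x1 + l2 * math.cos(theta2)
    y2 = y1 + l2 * math.sin(theta2)
    return (x1, y1), (x2, y2)

# The same numeric state (both angles zero) drawn under each convention.
hanging = bobs_from_vertical(0.0, 0.0)     # pendulum hangs straight down
sideways = bobs_from_horizontal(0.0, 0.0)  # pendulum points along +x
```

With both angles at zero, convention A places the bobs directly below the pivot, while convention B places them out to the right, so two simulations sharing one renderer disagree from the first frame.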

