Tag
A discussion about DeepSWE benchmarks showing that DeepSeek v4 Pro passes only 8% of tasks, which is surprisingly low compared to its performance on similar tasks.
An experiment feeding GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt reveals they pick opposite angle conventions, causing immediate visible mismatch in a shared renderer. The convention split, non-random across model families, suggests a bias in training data distribution for classical mechanics problems.