Gave GPT-4o and Claude the exact same double pendulum prompt. They picked opposite angle conventions within seconds.

Reddit r/ArtificialInteligence 05/16/26, 02:14 PM News

ai-conventions model-benchmark physics-simulation double-pendulum training-data model-behavior coding-tasks

Summary

An experiment feeding GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt reveals they pick opposite angle conventions, causing immediate visible mismatch in a shared renderer. The convention split, non-random across model families, suggests a bias in training data distribution for classical mechanics problems.

I asked GPT-4o and Claude 3.5 Sonnet to each write a JavaScript double pendulum simulator from the same contract. Same initial conditions, same step size, same host renderer drawing both panels. The pendulums immediately swung in visibly different orientations. Turns out one model measured θ from the upward vertical and the other measured θ from the downward vertical. Both are mathematically valid conventions, but when the host renderer in `public/workers/simulator-host.js` reads `info.theta1` and `info.theta2` and draws both panels the same way, the mismatch is impossible to miss. One pendulum hangs and swings naturally, the other looks like it's doing gymnastics from the ceiling. The thing that surprised me is how fast this surfaces. You don't need to wait for chaotic divergence over thousands of timesteps. The convention split is visible on literally the first frame. A unit test checking the math would pass for both, because both sets of equations of motion are internally consistent. It's only when you force them into the same rendering pipeline that you see they disagree about what "down" means. The setup is a side by side benchmark called Physics Bench where every model implements `step`, `getInfo`, and `reset` from a strict contract defined in `lib/prompt.ts`. Models never write their own `draw` function. The host owns rendering, so any visual difference between panels is a real difference in the physics, not a cosmetic one. I also noticed that when I ran more models (Gemini 1.5 Pro, Llama 3.1 70B, a few others), the convention split wasn't random. Some model families consistently picked one convention over the other, which makes me think this is baked into their training data distribution for classical mechanics problems. The contract is strict on purpose: exactly one fenced code block, first line must start with `function createSimulator(`, no imports, no exports, no DOM access, no drawing. Everything the model returns is pure simulation logic. That constraint is what makes the convention mismatch so clean to observe, because there's nowhere for the model to hide a workaround. Curious if anyone has seen similar convention disagreements when asking GPT-4o to solve other physics or engineering problems where sign conventions matter.

Original Article

Gave GPT-4o and Claude the exact same double pendulum prompt. They picked opposite angle conventions within seconds.

Similar Articles

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

Found a tool that asks GPT, Claude, Gemini, and Grok the same question and gives you one consensus answer

The "One-Size-Fits-All" AI era is dead. I benchmarked GPT-5.5, Claude 4.7, Gemini 3.1 Pro, and DeepSeek V4 Pro here is the actual state of the frontier.

@kapicode: I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the sa…

Same agent, same prompt, different runs. Which output do you ship?

Submit Feedback

Similar Articles

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

Found a tool that asks GPT, Claude, Gemini, and Grok the same question and gives you one consensus answer

The "One-Size-Fits-All" AI era is dead. I benchmarked GPT-5.5, Claude 4.7, Gemini 3.1 Pro, and DeepSeek V4 Pro here is the actual state of the frontier.

@kapicode: I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the sa…

Same agent, same prompt, different runs. Which output do you ship?