Gave GPT-4o and Claude the exact same double pendulum prompt. They picked opposite angle conventions within seconds.

Reddit r/ArtificialInteligence News

Summary

An experiment feeding GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt reveals they pick opposite angle conventions, causing immediate visible mismatch in a shared renderer. The convention split, non-random across model families, suggests a bias in training data distribution for classical mechanics problems.

I asked GPT-4o and Claude 3.5 Sonnet to each write a JavaScript double pendulum simulator from the same contract. Same initial conditions, same step size, same host renderer drawing both panels. The pendulums immediately swung in visibly different orientations. Turns out one model measured θ from the upward vertical and the other measured θ from the downward vertical. Both are mathematically valid conventions, but when the host renderer in `public/workers/simulator-host.js` reads `info.theta1` and `info.theta2` and draws both panels the same way, the mismatch is impossible to miss. One pendulum hangs and swings naturally, the other looks like it's doing gymnastics from the ceiling. The thing that surprised me is how fast this surfaces. You don't need to wait for chaotic divergence over thousands of timesteps. The convention split is visible on literally the first frame. A unit test checking the math would pass for both, because both sets of equations of motion are internally consistent. It's only when you force them into the same rendering pipeline that you see they disagree about what "down" means. The setup is a side by side benchmark called Physics Bench where every model implements `step`, `getInfo`, and `reset` from a strict contract defined in `lib/prompt.ts`. Models never write their own `draw` function. The host owns rendering, so any visual difference between panels is a real difference in the physics, not a cosmetic one. I also noticed that when I ran more models (Gemini 1.5 Pro, Llama 3.1 70B, a few others), the convention split wasn't random. Some model families consistently picked one convention over the other, which makes me think this is baked into their training data distribution for classical mechanics problems. The contract is strict on purpose: exactly one fenced code block, first line must start with `function createSimulator(`, no imports, no exports, no DOM access, no drawing. Everything the model returns is pure simulation logic. That constraint is what makes the convention mismatch so clean to observe, because there's nowhere for the model to hide a workaround. Curious if anyone has seen similar convention disagreements when asking GPT-4o to solve other physics or engineering problems where sign conventions matter.
Original Article

Similar Articles

Same agent, same prompt, different runs. Which output do you ship?

Reddit r/AI_Agents

The author observes that running the same task with Claude Code across different sessions yields varying decision patterns, making it hard to choose outputs that are safe to ship, and highlights the lack of tooling for evaluating agent decision profiles.