Why does your agent still give different answers at temperature 0?
Summary
Setting temperature to 0 does not guarantee deterministic tool calls in agents due to batched inference causing floating-point reduction order shifts, leading to token flips and different actions under load.
Similar Articles
Same agent, same prompt, different runs. Which output do you ship?
The author observes that running the same task with Claude Code across different sessions yields varying decision patterns, making it hard to choose outputs that are safe to ship, and highlights the lack of tooling for evaluating agent decision profiles.
Same agent, same task, wildly different costs per session?
A discussion on AI agent observability highlights unpredictable cost variations and dangerous failure modes like unauthorized database deletes, prompting questions about production handling strategies beyond basic logging.
Your coding agent didn't get worse. You just never measured the first version.
The article argues that perceived degradation in coding agents is often due to untracked changes in agent instances and configuration rather than the underlying model itself, highlighting a critical lack of baseline measurement in current AI agent workflows.
The “same” model increasingly behaves like a different product depending on the inference stack behind it
The article highlights that the same AI model can exhibit different behaviors depending on the inference stack (e.g., scheduling, quantization, speculative decoding), especially in long sessions or agent workflows, making the serving method nearly as important as the model itself.
A right answer from your agent doesn't mean it did the right thing
The article discusses the pitfalls of evaluating AI agents solely based on their final answers, emphasizing the importance of inspecting intermediate steps, tool calls, and reasoning to catch confidently wrong outputs. It suggests using automated scoring and trace replays to measure and improve agent behavior.