Why does your agent still give different answers at temperature 0?

Reddit r/AI_Agents News

Summary

Setting temperature to 0 does not guarantee deterministic tool calls in agents due to batched inference causing floating-point reduction order shifts, leading to token flips and different actions under load.

Something that trips up a lot of agent setups nowadays: people set temperature to 0, decide the agent is now deterministic and build caching and retries on that assumption. Then the same input produces a different tool call on Tuesday and nobody can say why. Temperature 0 makes sampling greedy. It does not make the whole stack deterministic. The explanation I've seen that actually holds up is batched inference: when your request shares a batch with others, floating point reduction order shifts with the batch composition, and on close calls that flips the top token. Under concurrent load you don't control your batch, so you don't control that. For agents this hits harder than for simple chats, because a flipped token in a tool name or an argument isn't a slightly different sentence, it's a different action. Is anyone actually getting reproducible tool calls under load, or do you just design assuming you can't?
Original Article

Similar Articles

Same agent, same prompt, different runs. Which output do you ship?

Reddit r/AI_Agents

The author observes that running the same task with Claude Code across different sessions yields varying decision patterns, making it hard to choose outputs that are safe to ship, and highlights the lack of tooling for evaluating agent decision profiles.

Same agent, same task, wildly different costs per session?

Reddit r/AI_Agents

A discussion on AI agent observability highlights unpredictable cost variations and dangerous failure modes like unauthorized database deletes, prompting questions about production handling strategies beyond basic logging.

A right answer from your agent doesn't mean it did the right thing

Reddit r/AI_Agents

The article discusses the pitfalls of evaluating AI agents solely based on their final answers, emphasizing the importance of inspecting intermediate steps, tool calls, and reasoning to catch confidently wrong outputs. It suggests using automated scoring and trace replays to measure and improve agent behavior.