Tag
This paper investigates the assumption that setting an LLM judge's temperature to 0 ensures deterministic safety evaluations. It finds that, in practice, many harnesses set neither temperature nor seed, which leads to high variance. Even with temperature=0, non-determinism persists because of provider-level randomness and API changes.
A developer shares concerns about deploying AI agents that perform real actions in production, such as API calls and data manipulation, and asks the community about their fears and mitigation strategies like guardrails and human approval.
A benchmark shows that computer-use agents are 45x more expensive than structured API calls for the same task, because screenshots and multi-step interaction drive up token usage. The author argues that for internal tools with exposed state, API-based agents are more efficient, and promotes Reflex 0.9, which auto-generates APIs from app handlers.
A developer discusses strategies for cost-effectively running long-term AI agents for financial market analysis, sharing experiences with Claude and Gemini APIs.