Tag
Setting temperature to 0 does not guarantee deterministic tool calls in agents. Batched inference changes the order of floating-point reductions, which can flip a token and cause the agent to take a different action under load.
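To illustrate the reduction-order effect in the temperature-0 item above, the Python sketch below sums the same three values in two different orders and passes each result to a greedy (temperature-0) pick. The values, `rival_logit`, and both helper names are hypothetical and deliberately exaggerated. Real inference kernels differ by far smaller amounts, which only matters when two logits are nearly tied.

```python
def reduce_in_order(values):
    """Sum values strictly left to right, as a fixed-order reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy_pick(logits):
    """Temperature-0 decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical partial sums feeding one logit; same values, two orders.
contributions = [1e16, 1.0, -1e16]
reordered = [contributions[0], contributions[2], contributions[1]]

# Floating-point addition is not associative: the 1.0 is absorbed when it
# is added to 1e16 first, but survives when the large terms cancel first.
logit_order_a = reduce_in_order(contributions)
logit_order_b = reduce_in_order(reordered)

# Hypothetical competing token whose logit sits between the two results.
rival_logit = 0.5

action_a = greedy_pick([logit_order_a, rival_logit])
action_b = greedy_pick([logit_order_b, rival_logit])

print(logit_order_a, logit_order_b)  # the two sums differ
print(action_a, action_b)            # so the greedy choice differs
```

The explicit loop is intentional: in recent Python versions the built-in `sum` uses compensated summation for floats, which would hide the effect.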
A study by LangChain and Harvey explores methods to reduce the cost of verifying legal agent outputs by batching criteria evaluations and using open models, achieving order-of-magnitude cost savings while maintaining near-frontier performance.
GreptimeDB v1.0 introduces the Pending Rows Batcher, a three-stage pipeline that moves CPU-intensive work off the Datanode's critical section. It raises Prometheus remote write throughput from 1.20M to 2.17M points/sec (about 1.8x) and reduces Datanode CPU usage by 20%.
This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.
Demonstrates running subagents locally on a MacBook Pro M5 for code review and bug detection, using Codex CLI and LM Studio with Qwen 3.6 and MLX batching.
LM Studio announces a beta update to its MLX engine, introducing batching for vision models and improved caching for faster inference.