Tag
The author argues that for live voice agents, STT latency and real-time behavior are more critical than raw transcription accuracy, and proposes a different evaluation scorecard.
The article explains how the author achieved zero-millisecond perceived latency at p99 for autocomplete over 240 million domain names by prefetching suggestions on keyDown and caching them, backed by a fast API built on Tranco and CZDS data.
This blog post explains the inspection paradox in system latency and recovery time measurement, showing why customers experience longer average waits than service metrics suggest. It includes an interactive simulation and emphasizes the importance of understanding the tail of the distribution.
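The inspection paradox in the summary above can be illustrated with a small sketch. It is not the post's interactive simulation; it assumes outages with exponentially distributed durations, and the function names are hypothetical. A service metric averages over outages, while a customer who arrives at a random moment during downtime is more likely to land in a long outage, so the customer-weighted mean is E[D^2] / E[D] rather than E[D].

```python
import random

def mean_duration(durations):
    """Per-outage average: what a service dashboard typically reports."""
    return sum(durations) / len(durations)

def experienced_mean(durations):
    """Average outage length seen by a customer arriving at a uniformly
    random moment of downtime. Each outage is weighted by its own length,
    which gives sum(d^2) / sum(d)."""
    return sum(d * d for d in durations) / sum(durations)

def simulate(n=100_000, mean=1.0, seed=0):
    """Draw n exponential outage durations (an assumption for illustration)
    and return (service view, customer view)."""
    rng = random.Random(seed)
    durations = [rng.expovariate(1.0 / mean) for _ in range(n)]
    return mean_duration(durations), experienced_mean(durations)

if __name__ == "__main__":
    service, customer = simulate()
    print(f"service metric: {service:.3f}  customer experience: {customer:.3f}")
```

For exponential durations the customer-weighted mean is about twice the per-outage mean, and the gap widens as the tail of the distribution gets heavier.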
Liquid AI demonstrates using LFM2.5-ColBERT-350M as a filter to select only the five most relevant tools from 151 options, reducing latency and improving tool selection accuracy.
A blog post analyzes the M/M/c queueing model and shows that increasing the number of servers in a load-balanced system improves latency at constant per-server load, a beneficial and somewhat counterintuitive result for cloud economics.
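The M/M/c result in the summary above can be checked numerically with the standard Erlang C formula. This is a minimal sketch, not code from the post: the function names are hypothetical, and it assumes Poisson arrivals, exponential service times with rate `mu` per server, and a single shared queue.

```python
from math import factorial

def erlang_c(c: int, rho: float) -> float:
    """Probability that an arriving job has to queue in an M/M/c system
    with c servers, each at utilization rho (0 <= rho < 1)."""
    if c < 1 or not 0 <= rho < 1:
        raise ValueError("need c >= 1 and 0 <= rho < 1")
    a = c * rho  # offered load in Erlangs (lambda / mu)
    top = a**c / factorial(c) / (1 - rho)
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(c: int, rho: float, mu: float = 1.0) -> float:
    """Mean time spent waiting in queue: Wq = C(c, a) / (c*mu - lambda),
    where lambda = c * rho * mu."""
    return erlang_c(c, rho) / (c * mu * (1 - rho))

if __name__ == "__main__":
    # Hold per-server utilization at 80% and grow the pool.
    for c in (1, 2, 4, 8, 16):
        print(c, round(mean_wait(c, 0.8), 4))
```

With utilization held at 80% per server, the mean queueing delay falls as `c` grows, which is the effect the post describes: a larger pool handles the same per-server load with less waiting.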
This article summarizes a presentation by Junda Chen on disaggregated inference for LLMs, explaining why goodput (throughput that meets latency SLOs) matters more than raw throughput, and how separating the prefill and decode phases improves performance. It also highlights the work's influence on NVIDIA Dynamo.
A developer debunks the common belief that LLM latency is the primary cause of slow voice agents, explaining that delays often stem from earlier stages like audio capture, VAD, and STT. They recommend logging specific latency metrics and testing various STT/TTS providers and orchestration frameworks to diagnose issues.
This paper proposes a practical evaluation method for long-form simultaneous speech-to-speech translation that uses ASR, forced alignment, and sentence embedding alignment to compute latency and quality metrics on continuous speech, overcoming limitations of prior approaches.
This guide explains the discipline of AI inference engineering, covering the split between prefill and decoding phases, the shift from closed to open models, and optimization techniques for latency, throughput, and cost.
Modal announced several major product updates, including VM Sandboxes with real Linux kernel support, lower-latency regional routing, domain allowlisting for Sandboxes, RBAC, named images, and SDK updates.
Dian Hu draws a parallel between the importance of low latency in search engines and the upcoming need for fast AI inference.
This article from The Old New Thing explains that Windows thread pools are optimized for throughput, not latency, and provides solutions for low-latency scheduling, such as creating a custom thread pool or using a dedicated worker thread, with code examples in C++ and C#.
A blog post revisits the concept of 'Slow Search' in the context of agentic retrieval, arguing that per-query latency can be traded for better retrieval quality to reduce overall task time and cost for AI agents.
Explores why mainstream games have not yet integrated large language models into NPCs, questioning whether latency issues or lack of interest from game studios are the primary obstacles.
Discusses key challenges facing AI voice agents in real-world customer interactions, such as accent handling, latency, and integration, and invites experiences from businesses.
Formula 1 teams invest millions in driver-in-the-loop simulators with ultra-low latency and high fidelity to replicate real car behavior, enabling drivers to train and develop setups.
Hippocratic AI partners with Modular to use the MAX framework for large language model inference, achieving sub-500ms TTFT, ~30% faster P99 latency, and ~22% faster mean latency at scale on NVIDIA B300 GPUs, with portability to AMD.
Coinbase's 10-hour outage postmortem reveals they run global trading from a single region without automated failover, raising concerns about their infrastructure reliability.
This paper presents a game-theoretic analysis of disaggregated inference architectures that separate prefill and decode phases across GPU pools, characterizing how GPU saturation affects performance. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters, reducing the Price of Anarchy significantly in experiments on NVIDIA B200 clusters.
A detailed investigation of Linux gaming latency using a Teensy-based LDAT tool, measuring click-to-photon latency with various settings on Nvidia GPUs under KDE Wayland and comparing the results to Windows.