The article shares practical lessons for building low-latency, high-throughput AI agents, including workload estimation, token reduction, parallelism, microservices, and handling LLM failures.
An open-weights 8B-parameter voice model achieves 110 ms latency, faster than the 200-250 ms average latency of human conversation. It can be run locally and is freely available via a GitHub repository.
Miso One is an open-source 8B parameter text-to-speech model with real emotional range and 110ms latency, designed for voiceover work.
LuMay Voice Agent is gaining attention for its business workflow focus, built-in CRM and automation, enterprise compliance, and reliable real-world calling capabilities.
Google DeepMind released Magenta RealTime 2, an open music generation model for on-device streaming with low-latency control via text, audio examples, and MIDI.
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
This paper presents a stateful inference architecture for multi-agent tool calling that reuses KV cache across turns and employs speculative decoding, achieving 2.1x-4.2x speedup over vLLM and SGLang on agentic workflows.
The article compares LuMay Voice Agent with established platforms Bland AI, Vapi, and Synthflow, highlighting its sub-500ms latency and workflow automation, and seeks community feedback on real-world performance.
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
This work presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on the incoming audio, achieving 1-2 second latency with quality close to non-streaming baselines.
This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. Integrated with TensorRT-LLM, it demonstrates significant throughput improvements on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.
OpenAI releases gpt-realtime-translate, a low-latency speech-to-speech model optimized for live interpretation, accompanied by a developer cookbook for building multilingual browser, phone, and video applications.
Google has made Gemini 3.1 Flash-Lite generally available, offering ultra-low latency and high-volume processing with multimodal capabilities, targeting enterprise applications.
Google's Gemini AI is featured in a new project showcasing low latency and realistic spontaneous collaboration capabilities alongside Sesame.
A technical blog post by a self-described WebRTC expert criticizes OpenAI's use of WebRTC for voice AI, arguing that the protocol is poorly suited because it was designed for real-time conferencing and drops packets aggressively, which conflicts with voice AI use cases where accuracy matters more than minimal latency.
FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.
Researchers introduce 8M-30M parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.
OpenAI is releasing GPT-5.3-Codex-Spark, a smaller, ultra-low-latency coding model optimized for real-time collaboration, delivering over 1000 tokens per second on Cerebras hardware. It is available as a research preview to ChatGPT Pro users and marks the first milestone in OpenAI's partnership with Cerebras.
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
OpenAI partners with Cerebras to integrate 750MW of ultra-low-latency AI compute into its platform, aiming to accelerate inference and enable faster real-time AI responses across various workloads.