low-latency

#low-latency

What I learned building low latency and high throughput AI agents

Reddit r/AI_Agents ↗ · 4d ago

The article shares practical lessons for building low-latency, high-throughput AI agents, including workload estimation, token reduction, parallelism, microservices, and handling LLM failures.

#low-latency

@svpino: Humans have an average of 200-250 ms of latency when speaking to each other. This voice model is even faster: only 110 …

X AI KOLs Following ↗ · 5d ago

An open-weights 8B-parameter voice model achieves 110 ms latency, faster than the 200-250 ms average in human conversation. It can be run locally and is freely available via a GitHub repository.

#low-latency

@omarsar0: Another banger open-source release. Miso One is an 8B text-to-speech model with real emotional range, so voiceovers car…

X AI KOLs Following ↗ · 6d ago

Miso One is an open-source 8B-parameter text-to-speech model with real emotional range and 110 ms latency, designed for voiceover work.

#low-latency

Why more people are mentioning LuMay Voice Agent lately?

Reddit r/AI_Agents ↗ · 2026-05-29

LuMay Voice Agent is gaining attention for its business workflow focus, built-in CRM and automation, enterprise compliance, and reliable real-world calling capabilities.

#low-latency

google/magenta-realtime-2

Hugging Face Models Trending ↗ · 2026-05-28

Google DeepMind released Magenta RealTime 2, an open music generation model for on-device streaming with low-latency control via text, audio examples, and MIDI.

#low-latency

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Hugging Face Daily Papers ↗ · 2026-05-28

VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.

#low-latency

Stateful Inference for Low-Latency Multi-Agent Tool Calling

arXiv cs.LG ↗ · 2026-05-27

This paper presents a stateful inference architecture for multi-agent tool calling that reuses KV cache across turns and employs speculative decoding, achieving 2.1x-4.2x speedup over vLLM and SGLang on agentic workflows.

#low-latency

Is LuMay a real alternative to Bland, Vapi & Synthflow in 2026?

Reddit r/AI_Agents ↗ · 2026-05-24

The article compares LuMay Voice Agent with established platforms Bland AI, Vapi, and Synthflow, highlighting its sub-500ms latency and workflow automation, and seeks community feedback on real-world performance.

#low-latency

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline ↗ · 2026-05-22

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.

#low-latency

Streaming Speech-to-Text Translation with a SpeechLLM

arXiv cs.CL ↗ · 2026-05-15

This paper presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides, based on the incoming audio, when to output tokens, achieving 1-2 second latency with quality close to non-streaming baselines.

#low-latency

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

arXiv cs.CL ↗ · 2026-05-13

This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. It demonstrates significant throughput improvements on NVIDIA Ada GPUs by integrating with TensorRT-LLM, achieving up to 23.6% faster performance than vanilla TensorRT-LLM in commercial advertising systems.

#low-latency

Build a Realtime Speech Translation (28 minute read)

TLDR AI ↗ · 2026-05-11

OpenAI releases gpt-realtime-translate, a low-latency speech-to-speech model optimized for live interpretation, accompanied by a developer cookbook for building multilingual browser, phone, and video applications.

#low-latency

Google shipped Gemini 3.1 Flash-Lite in General Availability (2 minute read)

TLDR AI ↗ · 2026-05-11

Google has made Gemini 3.1 Flash-Lite generally available, offering ultra-low latency and high-volume processing with multimodal capabilities, targeting enterprise applications.

#low-latency

Sesame x Gemini: low latency, extremely realist, and they started spontaneously collaborating

Reddit r/singularity ↗ · 2026-05-10

A new project pairs Google's Gemini with Sesame, showing low-latency, realistic interaction in which the two begin collaborating spontaneously.

#low-latency

OpenAI's WebRTC problem

Hacker News Top ↗ · 2026-05-07

In a technical blog post, a self-described WebRTC expert criticizes OpenAI's use of WebRTC for voice AI. The argument is that the protocol was designed for real-time conferencing and drops packets aggressively, which conflicts with voice AI use cases where accuracy matters more than minimal latency.

#low-latency

Removing fsync from our local storage engine

Hacker News Top ↗ · 2026-05-07

FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.

#low-latency

Micro Language Models Enable Instant Responses

Hugging Face Daily Papers ↗ · 2026-04-21

Researchers introduce 8M-30M parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.

#low-latency

Introducing GPT-5.3-Codex-Spark

OpenAI Blog ↗ · 2026-02-12

OpenAI is releasing GPT-5.3-Codex-Spark, a smaller, ultra-low-latency coding model optimized for real-time collaboration, delivering over 1000 tokens per second on Cerebras hardware. It is available as a research preview to ChatGPT Pro users and marks the first milestone in OpenAI's partnership with Cerebras.

#low-latency

Qwen3-TTS Technical Report

Papers with Code Trending ↗ · 2026-01-22

The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.

#low-latency

OpenAI partners with Cerebras 

OpenAI Blog ↗ · 2026-01-14

OpenAI partners with Cerebras to integrate 750 MW of ultra-low-latency AI compute into its platform, aiming to accelerate inference and enable faster real-time AI responses across various workloads.
