This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. Integrated with TensorRT-LLM, it demonstrates significant throughput gains on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.
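The paper's actual search procedure isn't reproduced here; as a rough, minimal illustration of what a DAG-based schedule search looks like, the sketch below (all op names and byte sizes invented) enumerates topological orders of a tiny op graph and keeps the one with the lowest peak live-buffer footprint, a stand-in for shared-memory pressure:

```python
# Illustrative sketch only, not the Ada-MK algorithm: brute-force search over
# topological orders of a small op DAG, scoring each order by the peak number
# of bytes of intermediate outputs that are live at once.
from itertools import permutations

# op -> (inputs it consumes, bytes its output occupies); sizes are made up
DAG = {
    "ln":   ([], 4096),
    "attn": (["ln"], 8192),
    "proj": (["attn"], 512),
    "gate": (["ln"], 1024),
    "add":  (["proj", "gate"], 256),
}

def is_topological(order):
    seen = set()
    for op in order:
        if any(dep not in seen for dep in DAG[op][0]):
            return False
        seen.add(op)
    return True

def peak_bytes(order):
    # An op's output stays live until its last consumer has run.
    last_use = {op: i for i, op in enumerate(order)}
    for i, op in enumerate(order):
        for dep in DAG[op][0]:
            last_use[dep] = max(last_use[dep], i)
    live = peak = 0
    for i, op in enumerate(order):
        live += DAG[op][1]
        peak = max(peak, live)
        live -= sum(DAG[o][1] for o, last in last_use.items() if last == i)
    return peak

orders = [o for o in permutations(DAG) if is_topological(o)]
best = min(orders, key=peak_bytes)
print(best, peak_bytes(best))  # ('ln', 'attn', 'proj', 'gate', 'add') 12800
```

Running the `attn` branch to completion before starting `gate` lets its large intermediate be freed earlier, which is why that order wins; a real system would search a far larger space with pruning rather than brute force.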
OpenAI releases gpt-realtime-translate, a low-latency speech-to-speech model optimized for live interpretation, accompanied by a developer cookbook for building multilingual browser, phone, and video applications.
Google has made Gemini 3.1 Flash-Lite generally available, offering ultra-low latency and high-volume processing with multimodal capabilities, targeting enterprise applications.
Google's Gemini AI is featured alongside Sesame in a new project showcasing low-latency, realistic spontaneous-collaboration capabilities.
A technical blog post from a self-described WebRTC expert criticizes OpenAI's use of WebRTC for voice AI, arguing the protocol is a poor fit: it was designed for real-time conferencing, where late packets are aggressively dropped to protect latency, which conflicts with voice AI use cases where accuracy matters more than minimal latency.
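As a toy illustration of the tradeoff the post describes (the numbers and drop policy below are invented, not taken from the post): a conferencing-style jitter buffer discards frames that arrive late, while a voice-AI pipeline can afford to wait for stragglers so the speech recognizer sees unbroken audio.

```python
# Simulated tradeoff: a conferencing jitter buffer drops late audio frames to
# keep playout latency low; a voice-AI pipeline keeps every frame and instead
# pays the worst-case wait. Gaps in audio degrade ASR accuracy.
import random

random.seed(7)
# (sequence_number, simulated network delay in ms) for 20 audio frames
frames = [(seq, random.choice([20, 20, 20, 150])) for seq in range(20)]

JITTER_BUFFER_MS = 60  # conferencing target: play out quickly, drop stragglers

conferencing = [seq for seq, delay in frames if delay <= JITTER_BUFFER_MS]
voice_ai     = [seq for seq, delay in frames]  # reliable: keep all frames
worst_wait   = max(delay for _, delay in frames)

print(f"conferencing keeps {len(conferencing)}/20 frames (gaps hurt ASR accuracy)")
print(f"voice AI keeps     {len(voice_ai)}/20 frames, worst-case wait {worst_wait} ms")
```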
FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.
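FractalBits' internals aren't detailed here; the Linux-only sketch below illustrates the general technique of dropping the explicit fsync() per commit in favor of an O_DSYNC file descriptor, where durability is enforced by the write call itself (the engine described reportedly goes further, managing durability directly at the hardware level):

```python
# Generic illustration, not FractalBits' actual design: committing a record
# with an explicit fsync() per write versus opening the file O_DSYNC so each
# write() returns only once the data is durable, with no separate flush call.
# Linux-only; os.O_DSYNC is not available on all platforms.
import os

record = b"key=value\n"

# Conventional commit path: one write() plus one fsync() syscall per record.
fd = os.open("wal-fsync.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, record)
os.fsync(fd)          # full barrier: data and metadata flushed to stable storage
os.close(fd)

# O_DSYNC path: durability is part of the write itself; no fsync() needed.
fd = os.open("wal-dsync.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
os.write(fd, record)  # returns once the data (not all metadata) is durable
os.close(fd)
```

O_DSYNC halves the syscalls per commit and skips flushing unrelated metadata, but whether it wins on a given NVMe drive depends on the device's write-cache behavior, which is presumably why a purpose-built engine would manage the device directly.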
Researchers introduce 8M–30M-parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.
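A minimal sketch of the pattern this implies, with both model calls stubbed out (`micro_lm_draft` and `cloud_lm` are hypothetical placeholders, not the researchers' API): the tiny model's opening words are shown immediately while the cloud request is already in flight.

```python
# Hypothetical sketch of the on-device "first words" pattern: a tiny local
# model emits the opening tokens right away while the cloud request runs in
# the background, and the cloud response takes over once it arrives.
import time
from concurrent.futures import ThreadPoolExecutor

def micro_lm_draft(prompt: str) -> str:
    time.sleep(0.02)                 # ~tens of ms for an 8M-30M model on-device
    return "Sure, here's"            # just the opening words

def cloud_lm(prompt: str) -> str:
    time.sleep(1.0)                  # network round trip + big-model latency
    return "Sure, here's a summary of your meeting notes: ..."

def respond(prompt: str) -> str:
    with ThreadPoolExecutor(max_workers=1) as pool:
        cloud = pool.submit(cloud_lm, prompt)  # fire the cloud call first
        draft = micro_lm_draft(prompt)         # show this to the user now
        print(f"draft shown after ~20 ms: {draft!r}")
        full = cloud.result()
        if full.startswith(draft):
            return draft + full[len(draft):]   # seamless handoff to the cloud text
        return full                            # cloud disagreed: replace the draft

print(respond("Summarize my meeting notes"))
```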
OpenAI is releasing GPT-5.3-Codex-Spark, a smaller, ultra-low-latency coding model optimized for real-time collaboration, delivering over 1000 tokens per second on Cerebras hardware. It is available as a research preview to ChatGPT Pro users and marks the first milestone in OpenAI's partnership with Cerebras.
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
OpenAI partners with Cerebras to integrate 750MW of ultra-low-latency AI compute into its platform, aiming to accelerate inference and enable faster real-time AI responses across various workloads.
OpenAI introduces the Realtime API, enabling developers to build low-latency multimodal speech-to-speech conversational experiences with natural voice interactions powered by GPT-4o. The API supports six preset voices and simplifies development by eliminating the need to integrate multiple models.
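A minimal connection sketch using the third-party `websocket-client` package; the model name and event payloads below follow the launch-era documentation and may have changed since, so check the current docs before relying on them.

```python
# Sketch of connecting to the OpenAI Realtime API over WebSocket, configuring
# one of the preset voices, and requesting a spoken response. Audio arrives
# incrementally as server events (e.g. response.audio.delta).
import json
import os
from websocket import create_connection  # pip install websocket-client

url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
ws = create_connection(url, header=[
    f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta: realtime=v1",
])

# Pick one of the preset voices for this session.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"voice": "alloy", "modalities": ["audio", "text"]},
}))

# Ask the model to speak.
ws.send(json.dumps({
    "type": "response.create",
    "response": {"instructions": "Greet the user in one short sentence."},
}))

for _ in range(10):  # read the first few server events
    event = json.loads(ws.recv())
    print(event["type"])

ws.close()
```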