This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage for LLM inference. Integrated with TensorRT-LLM, it demonstrates significant throughput gains on NVIDIA Ada GPUs, running up to 23.6% faster than vanilla TensorRT-LLM in commercial advertising systems.
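The paper's actual search procedure isn't reproduced here; as a rough, minimal illustration of what a DAG-based schedule search looks like, the sketch below (all op names and byte sizes invented) enumerates topological orders of a tiny op graph and keeps the one with the lowest peak live-buffer footprint, a stand-in for shared-memory pressure:

```python
# Illustrative sketch only, not the Ada-MK algorithm: brute-force search over
# topological orders of a small op DAG, scoring each order by the peak number
# of bytes of intermediate outputs that are live at once.
from itertools import permutations

# op -> (inputs it consumes, bytes its output occupies); sizes are made up
DAG = {
    "ln":   ([], 4096),
    "attn": (["ln"], 8192),
    "proj": (["attn"], 512),
    "gate": (["ln"], 1024),
    "add":  (["proj", "gate"], 256),
}

def is_topological(order):
    seen = set()
    for op in order:
        if any(dep not in seen for dep in DAG[op][0]):
            return False
        seen.add(op)
    return True

def peak_bytes(order):
    # An op's output stays live until its last consumer has run.
    last_use = {op: i for i, op in enumerate(order)}
    for i, op in enumerate(order):
        for dep in DAG[op][0]:
            last_use[dep] = max(last_use[dep], i)
    live = peak = 0
    for i, op in enumerate(order):
        live += DAG[op][1]
        peak = max(peak, live)
        live -= sum(DAG[o][1] for o, last in last_use.items() if last == i)
    return peak

orders = [o for o in permutations(DAG) if is_topological(o)]
best = min(orders, key=peak_bytes)
print(best, peak_bytes(best))  # ('ln', 'attn', 'proj', 'gate', 'add') 12800
```

Running the `attn` branch to completion before starting `gate` lets its large intermediate be freed earlier, which is why that order wins; a real system would search a far larger space with pruning rather than brute force.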
OpenAI releases gpt-realtime-translate, a low-latency speech-to-speech model optimized for live interpretation, accompanied by a developer cookbook for building multilingual browser, phone, and video applications.
Google has made Gemini 3.1 Flash-Lite generally available, offering ultra-low latency and high-volume processing with multimodal capabilities, targeting enterprise applications.
Google's Gemini AI is featured alongside Sesame in a new project showcasing low-latency, realistic spontaneous-collaboration capabilities.
A technical blog post from a self-described WebRTC expert criticizes OpenAI's use of WebRTC for voice AI, arguing the protocol is a poor fit: it was designed for real-time conferencing, where late packets are aggressively dropped to protect latency, which conflicts with voice AI use cases where accuracy matters more than minimal latency.
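As a toy illustration of the tradeoff the post describes (the numbers and drop policy below are invented, not taken from the post): a conferencing-style jitter buffer discards frames that arrive late, while a voice-AI pipeline can afford to wait for stragglers so the speech recognizer sees unbroken audio.

```python
# Simulated tradeoff: a conferencing jitter buffer drops late audio frames to
# keep playout latency low; a voice-AI pipeline keeps every frame and instead
# pays the worst-case wait. Gaps in audio degrade ASR accuracy.
import random

random.seed(7)
# (sequence_number, simulated network delay in ms) for 20 audio frames
frames = [(seq, random.choice([20, 20, 20, 150])) for seq in range(20)]

JITTER_BUFFER_MS = 60  # conferencing target: play out quickly, drop stragglers

conferencing = [seq for seq, delay in frames if delay <= JITTER_BUFFER_MS]
voice_ai     = [seq for seq, delay in frames]  # reliable: keep all frames
worst_wait   = max(delay for _, delay in frames)

print(f"conferencing keeps {len(conferencing)}/20 frames (gaps hurt ASR accuracy)")
print(f"voice AI keeps     {len(voice_ai)}/20 frames, worst-case wait {worst_wait} ms")
```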
FractalBits introduces a specialized single-node KV storage engine that eliminates fsync calls to achieve significantly higher write throughput on NVMe SSDs by managing durability directly at the hardware level.
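FractalBits' internals aren't detailed here; the Linux-only sketch below illustrates the general technique of dropping the explicit fsync() per commit in favor of an O_DSYNC file descriptor, where durability is enforced by the write call itself (the engine described reportedly goes further, managing durability directly at the hardware level):

```python
# Generic illustration, not FractalBits' actual design: committing a record
# with an explicit fsync() per write versus opening the file O_DSYNC so each
# write() returns only once the data is durable, with no separate flush call.
# Linux-only; os.O_DSYNC is not available on all platforms.
import os

record = b"key=value\n"

# Conventional commit path: one write() plus one fsync() syscall per record.
fd = os.open("wal-fsync.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, record)
os.fsync(fd)          # full barrier: data and metadata flushed to stable storage
os.close(fd)

# O_DSYNC path: durability is part of the write itself; no fsync() needed.
fd = os.open("wal-dsync.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
os.write(fd, record)  # returns once the data (not all metadata) is durable
os.close(fd)
```

O_DSYNC halves the syscalls per commit and skips flushing unrelated metadata, but whether it wins on a given NVMe drive depends on the device's write-cache behavior, which is presumably why a purpose-built engine would manage the device directly.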
Researchers introduce 8M–30M-parameter micro language models that instantly generate the first few words on-device before cloud models complete responses, enabling responsive AI on ultra-constrained devices like smartwatches.
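A minimal sketch of the pattern this implies, with both model calls stubbed out (`micro_lm_draft` and `cloud_lm` are hypothetical placeholders, not the researchers' API): the tiny model's opening words are shown immediately while the cloud request is already in flight.

```python
# Hypothetical sketch of the on-device "first words" pattern: a tiny local
# model emits the opening tokens right away while the cloud request runs in
# the background, and the cloud response takes over once it arrives.
import time
from concurrent.futures import ThreadPoolExecutor

def micro_lm_draft(prompt: str) -> str:
    time.sleep(0.02)                 # ~tens of ms for an 8M-30M model on-device
    return "Sure, here's"            # just the opening words

def cloud_lm(prompt: str) -> str:
    time.sleep(1.0)                  # network round trip + big-model latency
    return "Sure, here's a summary of your meeting notes: ..."

def respond(prompt: str) -> str:
    with ThreadPoolExecutor(max_workers=1) as pool:
        cloud = pool.submit(cloud_lm, prompt)  # fire the cloud call first
        draft = micro_lm_draft(prompt)         # show this to the user now
        print(f"draft shown after ~20 ms: {draft!r}")
        full = cloud.result()
        if full.startswith(draft):
            return draft + full[len(draft):]   # seamless handoff to the cloud text
        return full                            # cloud disagreed: replace the draft

print(respond("Summarize my meeting notes"))
```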
OpenAI is releasing GPT-5.3-Codex-Spark, a smaller, ultra-low-latency coding model optimized for real-time collaboration, delivering over 1000 tokens per second on Cerebras hardware. It is available as a research preview to ChatGPT Pro users and marks the first milestone in OpenAI's partnership with Cerebras.
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
OpenAI partners with Cerebras to integrate 750MW of ultra-low-latency AI compute into its platform, aiming to accelerate inference and enable faster real-time AI responses across various workloads.
OpenAI introduces the Realtime API, enabling developers to build low-latency multimodal speech-to-speech conversational experiences with natural voice interactions powered by GPT-4o. The API supports six preset voices and simplifies development by eliminating the need to integrate multiple models.
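A minimal connection sketch using the third-party `websocket-client` package; the model name and event payloads below follow the launch-era documentation and may have changed since, so check the current docs before relying on them.

```python
# Sketch of connecting to the OpenAI Realtime API over WebSocket, configuring
# one of the preset voices, and requesting a spoken response. Audio arrives
# incrementally as server events (e.g. response.audio.delta).
import json
import os
from websocket import create_connection  # pip install websocket-client

url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
ws = create_connection(url, header=[
    f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta: realtime=v1",
])

# Pick one of the preset voices for this session.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"voice": "alloy", "modalities": ["audio", "text"]},
}))

# Ask the model to speak.
ws.send(json.dumps({
    "type": "response.create",
    "response": {"instructions": "Greet the user in one short sentence."},
}))

for _ in range(10):  # read the first few server events
    event = json.loads(ws.recv())
    print(event["type"])

ws.close()
```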