OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Summary
OmniFlatten is a novel GPT-based model that enables real-time, full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original architecture.
Source: https://huggingface.co/papers/2410.17799. Published on Oct 23, 2024.
Abstract
Full-duplex spoken dialogue systems are a significant advance over traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel end-to-end GPT-based model, OmniFlatten, for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex communication capabilities, we propose a multi-stage post-training scheme that progressively adapts a text-based large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. Throughout all training stages, we standardize the data using a flattening operation, which allows us to unify the training methods and the model architecture across different modalities and tasks. Our approach offers a straightforward modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at https://omniflatten.github.io/.
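To illustrate the idea behind the flattening operation, here is a minimal sketch of chunk-wise interleaving: fixed-size chunks from several token streams (e.g. input speech, output text, output speech) are merged into one flat sequence, so a standard decoder-only LLM can model them without architectural changes. The function name, chunk size, and stream ordering are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a chunk-wise "flattening" operation: interleave
# multiple token streams into a single flat sequence. Names and chunking
# scheme are assumptions for illustration, not the paper's exact recipe.
from itertools import zip_longest

def flatten_streams(streams, chunk_size=4):
    """Interleave token streams chunk by chunk into one flat sequence.

    streams: list of token lists (one per modality/channel).
    A trailing chunk shorter than chunk_size is kept as-is.
    """
    # Split each stream into consecutive fixed-size chunks.
    chunked = [
        [s[i:i + chunk_size] for i in range(0, len(s), chunk_size)]
        for s in streams
    ]
    flat = []
    # Round-robin over the streams, taking one chunk from each in turn.
    for group in zip_longest(*chunked, fillvalue=[]):
        for chunk in group:
            flat.extend(chunk)
    return flat

# Example: chunks of a speech-token stream alternate with text-token chunks.
speech = ["s0", "s1", "s2", "s3", "s4", "s5"]
text = ["t0", "t1", "t2", "t3"]
print(flatten_streams([speech, text], chunk_size=2))
# -> ['s0', 's1', 't0', 't1', 's2', 's3', 't2', 't3', 's4', 's5']
```

Because the result is an ordinary token sequence, the same next-token training objective applies across modalities and tasks, which is the property the abstract emphasizes.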
Get this paper in your agent:
hf papers read 2410.17799
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
k2-fsa/OmniVoice
OmniVoice is a massively multilingual zero-shot text-to-speech model supporting over 600 languages, built on a diffusion language model architecture with fast inference and voice cloning capabilities.
GPT-5.3 Instant: Smoother, more useful everyday conversations
OpenAI releases GPT-5.3 Instant, an update to ChatGPT's most-used model that improves conversational flow, reduces unnecessary refusals, and decreases hallucinations by up to 26.8% in high-stakes domains. The update focuses on tone, relevance, and practical usability based on user feedback.
Advancing voice intelligence with new models in the API
OpenAI has announced three new voice models in its API: GPT-Realtime-2 with advanced reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription, aiming to enable more natural and action-oriented voice applications.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.
Hello GPT-4o
OpenAI announces GPT-4o, a flagship multimodal model that processes audio, vision, text, and video in real-time with 232ms average audio response latency. The model matches GPT-4 Turbo on text/code while significantly improving multilingual, audio, and vision capabilities at 50% cheaper API costs.