OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Summary
OmniFlatten is a novel GPT-based model that enables real-time, full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original architecture.
Source: https://huggingface.co/papers/2410.17799. Published on Oct 23, 2024.
Abstract
Full-duplex spoken dialogue systems are a significant advance over traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel end-to-end GPT-based model, OmniFlatten, for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex communication capabilities, we propose a multi-stage post-training scheme that progressively adapts a text-based large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. Throughout all training stages, we standardize the data using a flattening operation, which allows us to unify the training methods and the model architecture across different modalities and tasks. Our approach offers a straightforward modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at https://omniflatten.github.io/.
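To illustrate the idea behind the flattening operation, here is a minimal sketch of chunk-wise interleaving: fixed-size chunks from several token streams (e.g. input speech, output text, output speech) are merged into one flat sequence, so a standard decoder-only LLM can model them without architectural changes. The function name, chunk size, and stream ordering are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a chunk-wise "flattening" operation: interleave
# multiple token streams into a single flat sequence. Names and chunking
# scheme are assumptions for illustration, not the paper's exact recipe.
from itertools import zip_longest

def flatten_streams(streams, chunk_size=4):
    """Interleave token streams chunk by chunk into one flat sequence.

    streams: list of token lists (one per modality/channel).
    A trailing chunk shorter than chunk_size is kept as-is.
    """
    # Split each stream into consecutive fixed-size chunks.
    chunked = [
        [s[i:i + chunk_size] for i in range(0, len(s), chunk_size)]
        for s in streams
    ]
    flat = []
    # Round-robin over the streams, taking one chunk from each in turn.
    for group in zip_longest(*chunked, fillvalue=[]):
        for chunk in group:
            flat.extend(chunk)
    return flat

# Example: chunks of a speech-token stream alternate with text-token chunks.
speech = ["s0", "s1", "s2", "s3", "s4", "s5"]
text = ["t0", "t1", "t2", "t3"]
print(flatten_streams([speech, text], chunk_size=2))
# -> ['s0', 's1', 't0', 't1', 's2', 's3', 't2', 't3', 's4', 's5']
```

Because the result is an ordinary token sequence, the same next-token training objective applies across modalities and tasks, which is the property the abstract emphasizes.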
Get this paper in your agent:
hf papers read 2410.17799
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
k2-fsa/OmniVoice
OmniVoice is a massively multilingual zero-shot text-to-speech model supporting over 600 languages, built on a diffusion language model architecture with fast inference and voice cloning capabilities.
GPT-5.3 Instant: Smoother, more useful everyday conversations
OpenAI releases GPT-5.3 Instant, an update to ChatGPT's most-used model that improves conversational flow, reduces unnecessary refusals, and decreases hallucinations by up to 26.8% in high-stakes domains. The update focuses on tone, relevance, and practical usability based on user feedback.
Advancing voice intelligence with new models in the API
OpenAI has announced three new voice models in its API: GPT-Realtime-2 with advanced reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription, aiming to enable more natural and action-oriented voice applications.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.
Hello GPT-4o
OpenAI announces GPT-4o, a flagship multimodal model that processes audio, vision, text, and video in real-time with 232ms average audio response latency. The model matches GPT-4 Turbo on text/code while significantly improving multilingual, audio, and vision capabilities at 50% cheaper API costs.