StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Hugging Face Daily Papers

Summary

StreamChar is a streaming framework for real-time audio-video generation of character animation, using an LLM orchestrator and joint audio-video DiT with two-stage distillation and memory mechanisms to maintain long-horizon consistency and visual quality.

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
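
The separation between long-horizon orchestration and short-window denoising described above can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the authors' implementation: `orchestrate`, `denoise_chunk`, `stream`, `Chunk`, and `FRAMES_PER_CHUNK` are hypothetical names, both models are replaced by string-producing stubs, and mapping one transcript word to one frame is a deliberate simplification that stands in for real frame-aligned audio conditions.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

FRAMES_PER_CHUNK = 4  # hypothetical chunk length, not taken from the paper


@dataclass
class Chunk:
    index: int
    audio_conditions: List[str]  # one condition per frame
    frames: List[str]            # placeholder "generated" frames


def orchestrate(words: List[str], history: List[Chunk], pointer: int,
                frames_per_chunk: int) -> Tuple[List[str], int]:
    """Stub for the LLM-based orchestrator (assumption).

    Emits one audio condition per frame and advances a transcript pointer.
    A real orchestrator would also read `history`; this toy ignores it.
    """
    conditions = []
    for _ in range(frames_per_chunk):
        if pointer < len(words):
            conditions.append(words[pointer])
            pointer += 1
        else:
            conditions.append("<sil>")  # pad with silence once the transcript ends
    return conditions, pointer


def denoise_chunk(index: int, conditions: List[str], reference: str,
                  motion_frames: List[str]) -> Chunk:
    """Stub for the joint audio-video DiT (assumption).

    Only sees the current chunk's conditions, the reference image, and the
    trailing motion frames of the previous chunk: a short local window.
    """
    frames = [f"frame(ref={reference}, cond={c}, prev={len(motion_frames)})"
              for c in conditions]
    return Chunk(index=index, audio_conditions=conditions, frames=frames)


def stream(transcript: str, reference: str,
           num_motion_frames: int = 1) -> Iterator[Chunk]:
    """Yield chunks one at a time until the transcript is consumed."""
    words = transcript.split()
    history: List[Chunk] = []
    motion_frames: List[str] = []
    pointer = 0
    while pointer < len(words):
        conditions, pointer = orchestrate(words, history, pointer, FRAMES_PER_CHUNK)
        chunk = denoise_chunk(len(history), conditions, reference, motion_frames)
        motion_frames = chunk.frames[-num_motion_frames:]  # carried into the next chunk
        history.append(chunk)
        yield chunk


for chunk in stream("a b c d e f", reference="ref"):
    print(chunk.index, chunk.audio_conditions)
```

In this arrangement the only state that spans the whole sequence (the pointer and the history) lives in the orchestrator side of the loop, while the denoiser stub is called with a bounded amount of context per chunk.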

Cached at: 06/02/26, 03:24 AM


Source: https://huggingface.co/papers/2605.25659

Abstract

StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
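
One way to picture a persistent visual anchor like the sink-chunk memory is a buffer that pins the first chunk and slides a bounded window over the most recent ones. The abstract does not specify how the memory is built, so the following is an assumed design for illustration only: the class name `SinkChunkMemory`, its `window` parameter, and the choice of the first chunk as the sink are all hypothetical, and chunks are represented by plain strings rather than latents.

```python
from collections import deque
from typing import Deque, List, Optional


class SinkChunkMemory:
    """Pins the first chunk (the 'sink') and keeps a sliding window of recent chunks.

    Assumed design for illustration; not taken from the paper.
    """

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.sink: Optional[str] = None
        # deque(maxlen=...) silently drops the oldest entry when full
        self.recent: Deque[str] = deque(maxlen=window)

    def add(self, chunk: str) -> None:
        if self.sink is None:
            self.sink = chunk          # first chunk becomes the persistent anchor
        else:
            self.recent.append(chunk)  # later chunks rotate through the window

    def context(self) -> List[str]:
        """Chunks visible to the generator for the next step."""
        head = [] if self.sink is None else [self.sink]
        return head + list(self.recent)


memory = SinkChunkMemory(window=2)
for i in range(6):
    memory.add(f"chunk{i}")
print(memory.context())  # ['chunk0', 'chunk4', 'chunk5']
```

The context size stays constant no matter how long the stream runs, while the pinned entry gives every later chunk the same early reference to compare against.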




Similar Articles

Streaming Video Generation with Streaming Force Control

Hugging Face Daily Papers

StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture, achieving state-of-the-art performance in force adherence and motion realism.

Stream-T1: Test-Time Scaling for Streaming Video Generation

Hugging Face Daily Papers

Stream-T1 is a proposed framework for test-time scaling in streaming video generation, improving temporal consistency and quality through mechanisms like noise propagation and reward pruning. The paper addresses the high computational costs of existing diffusion-based methods by leveraging chunk-level synthesis.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Hugging Face Daily Papers

HorizonStream introduces a long-horizon attention mechanism for streaming 3D reconstruction that explicitly models geometric propagation via an evidence influence kernel, achieving stable, scalable reconstruction with constant memory and linear time complexity, and generalizing to sequences over 10,000 frames.