StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Hugging Face Daily Papers 05/25/26, 12:00 AM Papers

Summary

StreamChar is a streaming framework for real-time audio-video generation of character animation, using an LLM orchestrator and joint audio-video DiT with two-stage distillation and memory mechanisms to maintain long-horizon consistency and visual quality.

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
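The "strict playback budget" above can be made concrete with a real-time factor: the time spent generating one chunk divided by the time that chunk takes to play back. The sketch below is illustrative only. The function names, the frame rate, and the chunk size are assumptions chosen for the example, not values reported for StreamChar.

```python
def real_time_factor(generation_seconds: float, frames_per_chunk: int, fps: float) -> float:
    """Ratio of generation time to playback time for one chunk.

    Values at or below 1.0 mean the chunk is ready before the previous
    chunk finishes playing, so the stream never stalls.
    """
    if frames_per_chunk <= 0 or fps <= 0:
        raise ValueError("frames_per_chunk and fps must be positive")
    playback_seconds = frames_per_chunk / fps
    return generation_seconds / playback_seconds


def meets_playback_budget(generation_seconds: float, frames_per_chunk: int, fps: float) -> bool:
    """True when a chunk is generated no slower than it plays back."""
    return real_time_factor(generation_seconds, frames_per_chunk, fps) <= 1.0


if __name__ == "__main__":
    # Hypothetical numbers: a 25-frame chunk at 25 fps plays for 1 second.
    print(real_time_factor(0.5, 25, 25.0))
    print(meets_playback_budget(2.0, 25, 25.0))
```

Under this definition, few-step distillation helps by lowering the generation time per chunk, while chunk length and frame rate fix the playback time it has to beat.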

Original Article


Cached at: 06/02/26, 03:24 AM


Source: https://huggingface.co/papers/2605.25659

Abstract

StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
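The abstract does not say how the sink-chunk memory is built. One plausible reading, sketched below, is a buffer that pins an early chunk as a permanent anchor and keeps only a short sliding window of recent chunks alongside it. The class name, the choice of the first chunk as the sink, and the window size are all assumptions made for illustration, not the authors' implementation.

```python
from collections import deque
from typing import Deque, Generic, List, Optional, TypeVar

T = TypeVar("T")


class SinkChunkMemory(Generic[T]):
    """Hypothetical memory: one pinned anchor chunk plus a sliding window.

    Assumption: the first chunk added becomes the sink and is never evicted,
    while older non-sink chunks fall out of the window as new ones arrive.
    """

    def __init__(self, recent_capacity: int) -> None:
        if recent_capacity < 1:
            raise ValueError("recent_capacity must be at least 1")
        self.sink: Optional[T] = None
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.recent: Deque[T] = deque(maxlen=recent_capacity)

    def add(self, chunk: T) -> None:
        """Store a newly generated chunk."""
        if self.sink is None:
            self.sink = chunk
        else:
            self.recent.append(chunk)

    def context(self) -> List[T]:
        """Chunks a generator would condition on: the sink, then recent ones."""
        anchor = [] if self.sink is None else [self.sink]
        return anchor + list(self.recent)


if __name__ == "__main__":
    memory: SinkChunkMemory[int] = SinkChunkMemory(recent_capacity=2)
    for chunk_index in range(6):
        memory.add(chunk_index)
    # The sink (chunk 0) survives; only the two most recent chunks remain.
    print(memory.context())
```

The point of the design is that the conditioning context stays a fixed size however long the stream runs, yet always contains one chunk produced before any drift could accumulate.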




Similar Articles

Streaming Video Generation with Streaming Force Control

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
