StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
Summary
StreamChar is a streaming framework for real-time audio-video generation of character animation, using an LLM orchestrator and joint audio-video DiT with two-stage distillation and memory mechanisms to maintain long-horizon consistency and visual quality.
Cached at: 06/02/26, 03:24 AM
Paper page - StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
Source: https://huggingface.co/papers/2605.25659
Abstract
StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
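The abstract gives no implementation details, so the following is only a toy sketch of how a chunk-wise streaming loop with a persistent anchor and a transcript pointer might be organized. Every name here (SinkChunkMemory, stream, denoise_chunk, tokens_per_chunk, window) is hypothetical and not taken from the paper. In particular, the fixed number of transcript tokens per chunk is a simplifying assumption standing in for the LLM-based orchestrator, and denoise_chunk is a placeholder for the joint audio-video DiT.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SinkChunkMemory:
    """Hypothetical memory: keeps the first chunk permanently (the "sink")
    plus a short rolling window of the most recent chunks."""
    window: int = 2
    sink: object = None
    recent: deque = field(default_factory=deque)

    def add(self, chunk):
        if self.sink is None:
            self.sink = chunk            # first chunk becomes the persistent anchor
            return
        self.recent.append(chunk)
        if len(self.recent) > self.window:
            self.recent.popleft()        # drop the oldest non-sink chunk

    def context(self):
        # The sink always comes first, followed by the rolling window.
        return ([] if self.sink is None else [self.sink]) + list(self.recent)


def stream(transcript_tokens, tokens_per_chunk, denoise_chunk, window=2):
    """Yield (chunk, pointer) pairs until the whole transcript is consumed.

    `pointer` is the index of the next transcript token not yet spoken.
    Assumption: each chunk consumes a fixed number of tokens; a real
    orchestrator would decide this from the audio actually generated.
    """
    memory = SinkChunkMemory(window=window)
    pointer = 0
    while pointer < len(transcript_tokens):
        piece = transcript_tokens[pointer:pointer + tokens_per_chunk]
        chunk = denoise_chunk(piece, memory.context())  # placeholder generator call
        memory.add(chunk)
        pointer += len(piece)
        yield chunk, pointer
```

With a window of one chunk, the fourth chunk in such a loop would be conditioned on the first chunk (the sink) and the third chunk only, which illustrates why a fixed anchor can limit drift even when older context is discarded.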
Similar Articles
Streaming Video Generation with Streaming Force Control
StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture, achieving state-of-the-art performance in force adherence and motion realism.
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
SwanSphere proposes a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies, achieving superior performance in both video-to-spatial and text-to-spatial audio tasks.
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a proposed framework for test-time scaling in streaming video generation, improving temporal consistency and quality through mechanisms like noise propagation and reward pruning. The paper addresses the high computational costs of existing diffusion-based methods by leveraging chunk-level synthesis.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
HorizonStream introduces a long-horizon attention mechanism for streaming 3D reconstruction that explicitly models geometric propagation via an evidence influence kernel, achieving stable, scalable reconstruction with constant memory and linear time complexity, and generalizing to sequences over 10,000 frames.