MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Hugging Face Daily Papers 06/16/26, 12:00 AM Papers

real-time audio-visual autoregressive social-world-model streaming-generation 22b-parameters

Summary

MaineCoon is a 22B-parameter real-time audio-visual autoregressive model for social world modeling, capable of streaming generation at up to 47.5 FPS on a single GPU, introducing novel training techniques and an agentic inference framework.

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

Cached at: 06/18/26, 03:57 PM

Source: https://huggingface.co/papers/2606.17800

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Catnip AI Team

Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. They typically omit critical auditory information or fail to capture the high-engagement pacing, emotional resonance, and rapid conversational flow that define viral social media. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

Highlights

- **⚡ Real-time on a single GPU.** A 22B interactive audio-visual autoregressive model capable of streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS on a single H100. Generation cost drops well below $0.001 per second, and keeps falling.
- **🌍 A new paradigm: social world models.** MaineCoon is positioned as, and serves as, the first generative core for social world models, a technical foundation for next-generation AI-native social platforms.
- **🎓 Forcing-free streaming training.** A multi-stage training paradigm (self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD)) that enables native, efficient streaming audio-visual training at 22B scale.
- **🧠 Agentic streaming inference.** An agentic inference framework that supports thousand-second-scale generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.
- **📊 SocialVideo-Bench.** A new benchmark focused on audio-visual social-video generation, with 9 representative metrics covering visual quality, motion, audio quality, audio-visual alignment, and social-video harmony. MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed, a new state of the art for real-time social video generation.
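
The throughput and cost figures in the first highlight can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the 47.5 FPS figure comes from the text, while the GPU rental price and the playback frame rate are assumptions introduced here (the page states neither), so the resulting cost is an example, not a reported number.

```python
# Back-of-the-envelope real-time budget and cost check.
# ASSUMPTIONS (not from the paper): GPU_PRICE_PER_HOUR and PLAYBACK_FPS
# are hypothetical values chosen only to make the arithmetic concrete.
GENERATION_FPS = 47.5       # reported peak throughput on a single H100
GPU_PRICE_PER_HOUR = 2.50   # hypothetical H100 rental price in USD
PLAYBACK_FPS = 24.0         # hypothetical output frame rate


def frame_budget_ms(generation_fps: float) -> float:
    """Average wall-clock time spent per generated frame, in milliseconds."""
    return 1000.0 / generation_fps


def real_time_factor(generation_fps: float, playback_fps: float) -> float:
    """Above 1.0, frames are produced faster than they are played back."""
    return generation_fps / playback_fps


def cost_per_video_second(generation_fps: float, playback_fps: float,
                          gpu_price_per_hour: float) -> float:
    """USD of GPU time needed to generate one second of played-back video."""
    gpu_seconds = playback_fps / generation_fps
    return gpu_seconds * gpu_price_per_hour / 3600.0


budget_ms = frame_budget_ms(GENERATION_FPS)
rtf = real_time_factor(GENERATION_FPS, PLAYBACK_FPS)
cost = cost_per_video_second(GENERATION_FPS, PLAYBACK_FPS, GPU_PRICE_PER_HOUR)

print(f"per-frame budget: {budget_ms:.1f} ms")
print(f"real-time factor: {rtf:.2f}x")
print(f"cost per second of video: ${cost:.5f}")
```

At 47.5 FPS the per-frame budget is roughly 21 ms. Under the assumed price and playback rate, the cost lands below the $0.001-per-second mark quoted above; a different price or frame rate shifts the result proportionally.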

Showcase

Hand-picked MaineCoon generations (audio-visual, with sound) play directly in the **GitHub repository**.

🎬 Minute-scale, long-form demos are best viewed on our **blog**. 🕹️ Try MaineCoon live at the **experience platform**.

Benchmark: SocialVideo-Bench

Table 2. Main quantitative results on SocialVideo-Bench. 🐱 **MaineCoon (Ours)** achieves the best average score and wins most metrics, including the two most comprehensive ones, Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS), over both streaming and bidirectional baselines.

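| Type | Model | Vis↑ | Mot↑ | Aud↑ | IB-TV↑ | IB-TA↑ | IB-AV↑ | AV-Al↑ | AVH↑ | JAVIS↑ | Average↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bidirectional T2AV | JavisDiT++ | 4.39 | **2.22** | 4.06 | 0.134 | 0.070 | 0.151 | 0.312 | 0.136 | 0.112 | 0.711 |
| Bidirectional T2AV | Ovi | 4.44 | 1.89 | 3.76 | *0.138* | 0.079 | 0.191 | **0.412** | 0.188 | 0.162 | 0.779 |
| Bidirectional T2AV | JoyAI-Echo | 4.61 | 1.17 | 3.47 | **0.147** | 0.088 | 0.226 | 0.319 | 0.196 | 0.173 | 0.749 |
| Bidirectional T2AV | MoVA | *4.66* | 1.68 | 3.69 | 0.133 | 0.105 | 0.258 | *0.359* | 0.245 | 0.216 | 0.842 |
| Bidirectional T2AV | LTX-2.3 | 4.10 | 0.99 | 4.06 | 0.132 | 0.111 | 0.311 | 0.334 | 0.287 | *0.247* | 0.848 |
| Streaming TA2V | LiveAvatar | 4.60 | 1.46 | *4.13* | 0.131 | *0.120* | *0.316* | 0.326 | *0.291* | 0.246 | 0.892 |
| Streaming TA2V | SoulX-FlashTalk | 4.65 | *1.99* | 4.07 | 0.128 | *0.120* | 0.307 | 0.279 | 0.283 | 0.238 | *0.895* |
| Streaming T2AV | 🐱 **MaineCoon (Ours)** | **4.71** | 1.62 | **4.35** | 0.127 | **0.130** | **0.318** | 0.334 | **0.308** | **0.272** | **0.934** 🥇 |

🐱 = our method · **bold** = best, *italic* = second best. Metrics: Vis = visual quality · Mot = motion · Aud = audio quality · IB-TV / IB-TA / IB-AV = ImageBind Text–Video / Text–Audio / Audio–Video alignment · AV-Al = audio–visual alignment · AVH = Audio-Visual Harmony · JAVIS = Joint Audio-Visual Integrated Score. See the technical report for the full benchmark and metric definitions.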
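
The "wins most metrics" claim in the Table 2 caption can be checked directly from the reported numbers. The following is a small sketch that transcribes the scores from Table 2 and finds the top model per column; the helper name `best_per_metric` and the shortened model key `MaineCoon` are choices made here for illustration, not names from the paper.

```python
# Scores transcribed from Table 2; higher is better in every column.
METRICS = ["Vis", "Mot", "Aud", "IB-TV", "IB-TA", "IB-AV",
           "AV-Al", "AVH", "JAVIS", "Average"]

SCORES = {
    "JavisDiT++":      [4.39, 2.22, 4.06, 0.134, 0.070, 0.151, 0.312, 0.136, 0.112, 0.711],
    "Ovi":             [4.44, 1.89, 3.76, 0.138, 0.079, 0.191, 0.412, 0.188, 0.162, 0.779],
    "JoyAI-Echo":      [4.61, 1.17, 3.47, 0.147, 0.088, 0.226, 0.319, 0.196, 0.173, 0.749],
    "MoVA":            [4.66, 1.68, 3.69, 0.133, 0.105, 0.258, 0.359, 0.245, 0.216, 0.842],
    "LTX-2.3":         [4.10, 0.99, 4.06, 0.132, 0.111, 0.311, 0.334, 0.287, 0.247, 0.848],
    "LiveAvatar":      [4.60, 1.46, 4.13, 0.131, 0.120, 0.316, 0.326, 0.291, 0.246, 0.892],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.128, 0.120, 0.307, 0.279, 0.283, 0.238, 0.895],
    "MaineCoon":       [4.71, 1.62, 4.35, 0.127, 0.130, 0.318, 0.334, 0.308, 0.272, 0.934],
}


def best_per_metric(scores, metrics):
    """Map each metric name to the list of models sharing the top score."""
    winners = {}
    for col, metric in enumerate(metrics):
        top = max(row[col] for row in scores.values())
        winners[metric] = [name for name, row in scores.items() if row[col] == top]
    return winners


winners = best_per_metric(SCORES, METRICS)
mainecoon_wins = [m for m in METRICS if "MaineCoon" in winners[m]]

for metric in METRICS:
    print(f"{metric:8s} -> {', '.join(winners[metric])}")
print(f"MaineCoon leads {len(mainecoon_wins)} of {len(METRICS)} columns")
```

On these numbers MaineCoon leads six of the nine metrics plus the Average column. Motion (JavisDiT++), IB-TV (JoyAI-Echo), and AV-Al (Ovi) go to bidirectional baselines.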

Table 3. Latency and model size comparison. Sampling throughput (FPS) is measured for 480P 20-second generation on a single H100 GPU. 🐱 MaineCoon (Ours) is the largest streaming model in the comparison yet by far the fastest: up to 7× faster than other streaming audio-visual generators, and faster even than a 1.3B streaming video model.

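| Type | Model | Params | FPS↑ |
|---|---|---|---|
| Bidirectional T2AV | JavisDiT++ | 1.8B | 0.87 |
| Bidirectional T2AV | Ovi | 11B | 0.58 |
| Bidirectional T2AV | JoyAI-Echo | 23B | 18.0 |
| Bidirectional T2AV | MoVA | 32B | 0.26 |
| Bidirectional T2AV | LTX-2.3 | 22B | 1.40 |
| Bidirectional T2AV | LTX-2.3-Distilled | 22B | *20.7* |
| Streaming T2V | Causal-Forcing | 1.3B | 19.1 |
| Streaming T2V | Helios-Distilled | 14B | 18.2 |
| Streaming T2V | Krea | 14B | 6.1 |
| Streaming TA2V | LiveAvatar | 14B | 6.7 |
| Streaming TA2V | SoulX-FlashTalk | 14B | 6.6 |
| Streaming T2AV | 🐱 **MaineCoon (Ours)** | 22B | **47.5** 🥇 |

🐱 = our method · **bold** = best, *italic* = second best. FPS for 480P-20s on a single H100.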
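
The speedup figures quoted in the Table 3 caption follow from simple ratios of the reported FPS values. Here is a minimal sketch that transcribes the table and computes MaineCoon's speedup over each baseline; the helper names `speedup_over` and `fastest_baseline` and the shortened model key `MaineCoon` are illustrative choices, not part of the paper.

```python
# (type, FPS) transcribed from Table 3; FPS is measured for 480P 20-second
# generation on a single H100, as stated in the caption.
THROUGHPUT = {
    "JavisDiT++":        ("Bidirectional T2AV", 0.87),
    "Ovi":               ("Bidirectional T2AV", 0.58),
    "JoyAI-Echo":        ("Bidirectional T2AV", 18.0),
    "MoVA":              ("Bidirectional T2AV", 0.26),
    "LTX-2.3":           ("Bidirectional T2AV", 1.40),
    "LTX-2.3-Distilled": ("Bidirectional T2AV", 20.7),
    "Causal-Forcing":    ("Streaming T2V", 19.1),
    "Helios-Distilled":  ("Streaming T2V", 18.2),
    "Krea":              ("Streaming T2V", 6.1),
    "LiveAvatar":        ("Streaming TA2V", 6.7),
    "SoulX-FlashTalk":   ("Streaming TA2V", 6.6),
    "MaineCoon":         ("Streaming T2AV", 47.5),
}
OURS = "MaineCoon"


def speedup_over(baseline, table=THROUGHPUT, ours=OURS):
    """Ratio of our FPS to the baseline's FPS."""
    return table[ours][1] / table[baseline][1]


def fastest_baseline(table=THROUGHPUT, ours=OURS):
    """Name of the highest-FPS model other than ours."""
    return max((name for name in table if name != ours),
               key=lambda name: table[name][1])


for name, (kind, fps) in THROUGHPUT.items():
    if name != OURS:
        print(f"{name:18s} {kind:20s} {fps:6.2f} FPS  {speedup_over(name):7.1f}x")
print("fastest baseline:", fastest_baseline())
```

The two streaming TA2V baselines come out at about 7.1× and 7.2×, which matches the "up to 7×" in the caption. The gap to the 1.3B Causal-Forcing model is about 2.5×.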

Paper

The full paper is available on **arXiv:2606.17800**. A PDF copy is also included in this repository: MaineCoon_Technical_Report.pdf. It covers the social-video data infrastructure, the native streaming autoregressive training recipe, the agentic streaming inference framework, SocialVideo-Bench, and a position/outlook on social world models.

Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

- **🎬 LTX-2.3 & the LTX series (Lightricks).** MaineCoon’s audio-visual backbone builds on the excellent open LTX-2.3 model. Huge credit to the LTX team and the broader LTX-Video series.
  - LTX-2 (incl. LTX-2.3): https://github.com/Lightricks/LTX-2
  - LTX-Video: https://github.com/Lightricks/LTX-Video
- **⚡ DMD series & the distribution-matching distillation community.** Our reinforced online-policy distillation (ROPD) builds on the Distribution Matching Distillation (DMD / DMD2) line of work and the wider few-step / real-time distillation community.
  - DMD2: https://github.com/tianweiy/DMD2
  - DMD (project page): https://tianweiy.github.io/dmd/

We thank these projects and their communities for advancing real-time, few-step, and streaming video generation.

Citation

@article{catnip2026mainecoon,
  title        = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author       = {Catnip AI Team},
  year         = {2026},
  journal      = {arXiv preprint arXiv:2606.17800},
  url          = {https://arxiv.org/abs/2606.17800}
}

Similar Articles

@rohanpaul_ai: AI video is moving into its real-time reaction era, with MaineCoon now leading in low-latency AI video. @catnips_ai jus…

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Continuous Audio Thinking for Large Audio Language Models
