netflix/void-model
Summary
Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.
Cached at: 04/20/26, 02:45 PM
netflix/void-model · Hugging Face
Source: https://huggingface.co/netflix/void-model
VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with 40GB+ VRAM (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
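As a rough illustration of the quadmask idea, the sketch below combines two per-pixel boolean grids into one mask frame using the pixel values this card lists under Input Format (0, 63, 127, 255). The function names are hypothetical, and treating "overlap" as "pixel belongs to both the primary object and an affected region" is an assumption, since the card does not define the term precisely. This is not the repository's mask pipeline.

```python
# Pixel values from the Input Format section of this model card.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255


def quadmask_value(is_primary: bool, is_affected: bool) -> int:
    """Map one pixel's membership flags to a quadmask value.

    Assumption: "overlap" means the pixel belongs to both the primary
    object and an affected region; the card does not spell this out.
    """
    if is_primary and is_affected:
        return OVERLAP
    if is_primary:
        return REMOVE
    if is_affected:
        return AFFECTED
    return KEEP


def compose_quadmask(primary, affected):
    """Combine two same-shaped 2D boolean grids into one quadmask frame."""
    if len(primary) != len(affected):
        raise ValueError("masks must have the same number of rows")
    frame = []
    for p_row, a_row in zip(primary, affected):
        if len(p_row) != len(a_row):
            raise ValueError("mask rows must have the same length")
        frame.append([quadmask_value(p, a) for p, a in zip(p_row, a_row)])
    return frame


# One-row example: background, affected only, both, primary only.
primary = [[False, False, True, True]]
affected = [[False, True, True, False]]
print(compose_quadmask(primary, affected))  # [[255, 127, 63, 0]]
```

A real quadmask is a video, so a function like this would run once per frame before the frames are encoded to quadmask_0.mp4.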
Checkpoints

| File | Description | Required? |
|---|---|---|
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
Architecture
- **Base:** CogVideoX 3D Transformer (5B parameters)
- **Input:** Video + quadmask + text prompt describing the scene after removal
- **Resolution:** 384x672 (default)
- **Max frames:** 197
- **Scheduler:** DDIM
- **Precision:** BF16 with FP8 quantization for memory efficiency
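To get a feel for what the default resolution and frame limit mean inside the model, the sketch below estimates the latent grid size. It assumes the compression factors of the original CogVideoX 3D causal VAE (4x in time, 8x in each spatial dimension) carry over unchanged to this checkpoint, which the card does not state. The helper name is hypothetical.

```python
def latent_shape(num_frames=197, height=384, width=672,
                 temporal_factor=4, spatial_factor=8):
    """Estimate (latent_frames, latent_height, latent_width).

    Assumption: a causal video VAE that keeps the first frame and then
    compresses every `temporal_factor` frames into one latent frame, and
    downsamples height and width by `spatial_factor`.
    """
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames should have the form k * temporal_factor + 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width should be multiples of spatial_factor")
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return latent_frames, height // spatial_factor, width // spatial_factor


print(latent_shape())  # (50, 48, 84)
```

Under these assumptions the 197-frame limit fits the 4k + 1 pattern (197 = 4 * 49 + 1), so a full-length clip maps to 50 latent frames of 48x84.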
Usage
From the Notebook
The easiest way is to clone the repo and run notebook.ipynb:
git clone https://github.com/netflix/void-model.git
cd void-model
From the CLI
# Install dependencies
pip install -r requirements.txt

# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
hf download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"
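When running Pass 1 over several sequences, it can help to assemble the command programmatically. The sketch below is a hypothetical wrapper (the function name is not part of the repo) that only reuses the script path and flags shown in the command above; it builds the argument list and prints it without executing anything.

```python
import shlex


def build_pass1_command(seq_name, data_root="./sample", save_path="./outputs",
                        checkpoint="./void_pass1.safetensors"):
    """Return the Pass 1 inference command as an argument list.

    Only flags that appear in the CLI example above are used.
    """
    return [
        "python", "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_root}",
        f"--config.experiment.run_seqs={seq_name}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.transformer_path={checkpoint}",
    ]


cmd = build_pass1_command("lime")
print(shlex.join(cmd))
# To actually run it from the repo root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues if a sequence name or path contains spaces.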
Input Format
Each video needs three files in a folder:
my-video/
    input_video.mp4   # source video
    quadmask_0.mp4    # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
    prompt.json       # {"bg": "description of scene after removal"}
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
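Before launching inference, a quick layout check can catch a missing file or a malformed prompt. The sketch below is a hypothetical helper, not part of the repo; it checks only what the folder layout above states (three file names and a "bg" key in prompt.json) and assumes "bg" should be a string.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def check_input_folder(folder):
    """Return a list of problems; an empty list means the layout looks right."""
    folder = Path(folder)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (folder / name).is_file()
    ]
    prompt_path = folder / "prompt.json"
    if prompt_path.is_file():
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"prompt.json is not valid JSON: {exc}")
        else:
            # Assumption: "bg" holds a plain-text description of the scene.
            if not isinstance(prompt, dict) or not isinstance(prompt.get("bg"), str):
                problems.append('prompt.json needs a string "bg" field')
    return problems


for problem in check_input_folder("my-video"):
    print(problem)
```

The check does not open the videos, so it cannot confirm that the quadmask matches the source video in frame count or resolution.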
Training
Trained on paired counterfactual videos generated from two sources:
- HUMOTO: human-object interactions rendered in Blender with physics simulation
- Kubric: object-only interactions using Google Scanned Objects
Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
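ZeRO Stage 2 partitions optimizer states and gradients across the data-parallel GPUs while each GPU keeps a full copy of the parameters. The repository's actual DeepSpeed configuration is not reproduced here; the sketch below only shows what a minimal Stage 2 config could look like. The batch size and accumulation values are placeholders, and enabling bf16 for training is an assumption.

```python
import json

# Hypothetical values: batch size and accumulation steps are placeholders,
# and bf16 training is an assumption not confirmed by the model card.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    # Stage 2: shard optimizer states and gradients across ranks.
    "zero_optimization": {"stage": 2},
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The printed JSON could be saved to a file and passed to a DeepSpeed launcher; the GitHub repo remains the reference for the settings actually used.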
Citation
@misc{motamed2026void,
title={VOID: Video Object and Interaction Deletion},
author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
year={2026},
eprint={2604.02296},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02296}
}