netflix/void-model
Summary
Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.
Cached at: 04/20/26, 02:45 PM
Source: https://huggingface.co/netflix/void-model
VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with 40GB+ VRAM (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
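The four mask values can be illustrated with a minimal NumPy sketch. The pixel levels (0, 63, 127, 255) are the ones listed in this card's input format; the function names are hypothetical, and treating "overlap" as the intersection of the primary and affected masks is an assumption, not something the card states.

```python
import numpy as np

# Quadmask levels as listed in the model card's input format.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255
LEVELS = np.array([REMOVE, OVERLAP, AFFECTED, KEEP])


def compose_quadmask(primary, affected):
    """Combine two boolean (H, W) masks into one uint8 quadmask frame.

    Assumption: "overlap" means pixels covered by both masks.
    """
    primary = np.asarray(primary, dtype=bool)
    affected = np.asarray(affected, dtype=bool)
    quad = np.full(primary.shape, KEEP, dtype=np.uint8)  # background by default
    quad[affected] = AFFECTED
    quad[primary] = REMOVE
    quad[primary & affected] = OVERLAP
    return quad


def snap_to_levels(frame):
    """Round each pixel to the nearest quadmask level.

    Lossy video encoding can perturb pixel values, so a decoded mask frame
    may need snapping back to the four valid levels.
    """
    frame = np.asarray(frame, dtype=np.int16)
    nearest = np.abs(frame[..., None] - LEVELS).argmin(axis=-1)
    return LEVELS[nearest].astype(np.uint8)
```

For real videos the repository's own mask pipeline is the intended route; this sketch only shows how the four levels relate to one another.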
Checkpoints
| File | Description | Required? |
| --- | --- | --- |
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |
Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
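To give a feel for what flow-warped initialization means in general, here is a rough NumPy sketch that carries one noise frame forward along an optical flow field. It is not VOID's Pass 2 implementation: the function names, the backward-flow convention, and the nearest-neighbour sampling are all assumptions chosen for brevity.

```python
import numpy as np


def warp_noise(prev_noise, flow):
    """Backward-warp a noise frame with nearest-neighbour sampling.

    prev_noise: (H, W, C) noise for the previous frame.
    flow: (H, W, 2) array of (dx, dy) offsets pointing from each pixel in the
          current frame to its source location in the previous frame.
    """
    h, w = prev_noise.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Round to the nearest source pixel and clamp to the frame borders.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_noise[src_y, src_x]


def warped_noise_sequence(flows, shape, seed=0):
    """Start from Gaussian noise and warp it frame by frame along the flows."""
    rng = np.random.default_rng(seed)
    frames = [rng.standard_normal(shape)]
    for flow in flows:
        frames.append(warp_noise(frames[-1], flow))
    return np.stack(frames)
```

The point of such a scheme is that noise attached to a moving surface moves with it, so consecutive frames start denoising from correlated noise. Nearest-neighbour sampling duplicates some noise values, so a production method would need to treat the noise statistics more carefully.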
Architecture
- **Base:** CogVideoX 3D Transformer (5B parameters)
- **Input:** Video + quadmask + text prompt describing the scene after removal
- **Resolution:** 384x672 (default)
- **Max frames:** 197
- **Scheduler:** DDIM
- **Precision:** BF16 with FP8 quantization for memory efficiency
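The figures above translate into some back-of-the-envelope numbers. The sketch below assumes the VAE compression commonly used in the CogVideoX family (4x temporal, 8x spatial, 16 latent channels); those strides are not stated in this card, and the helper names are hypothetical.

```python
def latent_shape(frames=197, height=384, width=672,
                 temporal_stride=4, spatial_stride=8, channels=16):
    """Latent tensor shape (frames, channels, height, width).

    Assumption: the first frame is encoded on its own and the remaining
    frames are compressed in groups of `temporal_stride`.
    """
    latent_frames = (frames - 1) // temporal_stride + 1
    return (latent_frames, channels,
            height // spatial_stride, width // spatial_stride)


def weight_gib(params=5e9, bytes_per_param=2):
    """Approximate weight storage in GiB (BF16 uses 2 bytes per parameter)."""
    return params * bytes_per_param / 2**30


print(latent_shape())                              # (50, 16, 48, 84)
print(round(weight_gib(), 2))                      # 9.31 (BF16)
print(round(weight_gib(bytes_per_param=1), 2))     # 4.66 (FP8)
```

Weights are only part of the footprint: activations, the VAE, and the text encoder account for much of the remaining VRAM requirement.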
Usage
From the Notebook
The easiest way is to clone the repo and run notebook.ipynb:
git clone https://github.com/netflix/void-model.git
cd void-model
From the CLI
# Install dependencies
pip install -r requirements.txt

# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
hf download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"
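When several sample folders need processing, the same command can be assembled from Python. This is a small sketch built only from the flags shown above; the helper names are hypothetical, and because it is not clear from the card whether `run_seqs` accepts more than one sequence per call, the sketch launches one process per sequence.

```python
import subprocess
import sys


def build_pass1_command(seq, data_root="./sample", save_path="./outputs",
                        checkpoint="./void_pass1.safetensors"):
    """Return the Pass 1 inference command as an argument list."""
    return [
        sys.executable, "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_root}",
        f"--config.experiment.run_seqs={seq}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.transformer_path={checkpoint}",
    ]


def run_sequences(seqs, dry_run=True, **kwargs):
    """Print (dry run) or execute one inference command per sequence."""
    commands = [build_pass1_command(seq, **kwargs) for seq in seqs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if inference fails
    return commands


if __name__ == "__main__":
    run_sequences(["lime"])  # "lime" is the sample sequence named above
```

Passing `dry_run=False` from the repository root runs the commands for real; keeping the dry run as the default makes it easy to inspect the commands first.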
Input Format
Each video needs three files in a folder:
my-video/
    input_video.mp4   # source video
    quadmask_0.mp4    # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
    prompt.json       # {"bg": "description of scene after removal"}
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
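Before launching inference it can help to check that a folder matches this layout. The sketch below writes `prompt.json` in the `{"bg": ...}` form shown above and reports any missing files; the helper names and the example description are hypothetical.

```python
import json
from pathlib import Path

# File names taken from the folder layout above.
REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def write_prompt(folder, background_description):
    """Write prompt.json describing the scene after removal."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "prompt.json"
    path.write_text(json.dumps({"bg": background_description}, indent=2))
    return path


def missing_inputs(folder):
    """Return the required files that are absent from the folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    write_prompt("my-video", "An empty kitchen counter")  # example text only
    print(missing_inputs("my-video"))
```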
Training
Trained on paired counterfactual videos generated from two sources:
- HUMOTO: human-object interactions rendered in Blender with physics simulation
- Kubric: object-only interactions using Google Scanned Objects
Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
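For readers unfamiliar with ZeRO Stage 2, a DeepSpeed configuration of that kind looks roughly like the following. Only the stage number comes from this card; every other value is an illustrative assumption, and the repository's actual config may differ.

```python
import json

# Illustrative values only; not the repository's actual training config.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption
    "gradient_accumulation_steps": 1,      # assumption
    "bf16": {"enabled": True},             # assumption: BF16 training
    "zero_optimization": {
        "stage": 2,                        # from the card: ZeRO Stage 2
        "overlap_comm": True,              # assumption
        "contiguous_gradients": True,      # assumption
    },
}

print(json.dumps(deepspeed_config, indent=2))
```

Stage 2 partitions optimizer states and gradients across the GPUs while each GPU keeps a full copy of the model weights.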
Citation
@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}