netflix/void-model
Summary
Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.
Source: https://huggingface.co/netflix/void-model
VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but **physical interactions** like objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with **40GB+ VRAM** (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware **quadmask** conditioning — a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
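To make the quadmask format concrete, here is a minimal sketch (not code from the repo) of how the four regions could be rasterized into a single-channel mask using the pixel values listed under Input Format below; the per-region boolean masks are hypothetical inputs from any segmentation step:

```python
import numpy as np

# Hypothetical (H, W) boolean masks for one frame, e.g. from a segmentation model.
remove_mask   = np.zeros((480, 854), dtype=bool)  # primary object to delete
overlap_mask  = np.zeros((480, 854), dtype=bool)  # regions where object and scene overlap
affected_mask = np.zeros((480, 854), dtype=bool)  # regions changed by the removal (falling/displaced items)

# Encode the four classes with the documented values:
# 0 = remove, 63 = overlap, 127 = affected, 255 = keep.
quadmask = np.full(remove_mask.shape, 255, dtype=np.uint8)  # start from "keep"
quadmask[affected_mask] = 127
quadmask[overlap_mask] = 63
quadmask[remove_mask] = 0
```

Per-frame quadmasks like this are then encoded as quadmask_0.mp4 alongside the source video (see Input Format below).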
Checkpoints
| File | Description | Required? |
| --- | --- | --- |
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |
Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
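The warped-noise idea behind Pass 2 can be illustrated with a short sketch: instead of sampling independent Gaussian noise per frame, each frame's initial latent noise is warped from the previous frame along the optical flow, so denoising starts from temporally correlated noise. This is a generic illustration of the technique (with a hypothetical flow input and latent shape), not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def warp_noise(prev_noise: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (C, H, W) noise tensor along a (2, H, W) optical-flow field."""
    _, h, w = prev_noise.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Displace the base pixel grid by the flow, then normalize to [-1, 1].
    grid_x = 2.0 * (xs.float() + flow[0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys.float() + flow[1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    warped = F.grid_sample(prev_noise.unsqueeze(0), grid,
                           mode="nearest", padding_mode="border",
                           align_corners=True)
    return warped.squeeze(0)

# Hypothetical latent shape and flow fields; frame 0 starts from fresh noise.
noise = [torch.randn(16, 48, 84)]
flows = [torch.zeros(2, 48, 84) for _ in range(12)]
for flow in flows:
    noise.append(warp_noise(noise[-1], flow))
```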
Architecture
- **Base:**CogVideoX 3D Transformer (5B parameters)
- **Input:**Video + quadmask + text prompt describing the scene after removal
- **Resolution:**384x672 (default)
- **Max frames:**197
- **Scheduler:**DDIM
- **Precision:**BF16 with FP8 quantization for memory efficiency
Usage
From the Notebook
The easiest way is to clone the repo and run notebook.ipynb:
git clone https://github.com/netflix/void-model.git
cd void-model
From the CLI
# Install dependencies
pip install -r requirements.txt
# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
--local-dir ./CogVideoX-Fun-V1.5-5b-InP
# Download VOID checkpoints
hf download netflix/void-model \
--local-dir .
# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
--config config/quadmask_cogvideox.py \
--config.data.data_rootdir="./sample" \
--config.experiment.run_seqs="lime" \
--config.experiment.save_path="./outputs" \
--config.video_model.transformer_path="./void_pass1.safetensors"
Input Format
Each video needs three files in a folder:
my-video/
input_video.mp4 # source video
quadmask_0.mp4 # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
prompt.json # {"bg": "description of scene after removal"}
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
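Before launching inference, a quick check that a sample folder matches this layout can save a failed run. The helper below is illustrative, and the `./sample/lime` path simply mirrors the data_rootdir and run_seqs values from the CLI example above (an assumption about how sequences are laid out on disk):

```python
import json
from pathlib import Path

def check_sample_folder(folder: str) -> None:
    """Verify that a folder contains the three files VOID expects per video."""
    root = Path(folder)
    for name in ("input_video.mp4", "quadmask_0.mp4", "prompt.json"):
        if not (root / name).exists():
            raise FileNotFoundError(f"{root / name} is missing")
    prompt = json.loads((root / "prompt.json").read_text())
    if "bg" not in prompt:
        raise ValueError('prompt.json must contain a "bg" scene description')
    print(f"{root.name}: ok (bg prompt: {prompt['bg'][:60]}...)")

# Assumed layout: one sub-folder per sequence under the data root directory.
check_sample_folder("./sample/lime")
```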
Training
Trained on paired counterfactual videos generated from two sources:
- **HUMOTO** — human-object interactions rendered in Blender with physics simulation
- **Kubric** — object-only interactions using Google Scanned Objects
Training was run on **8x A100 80GB GPUs** using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
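For readers unfamiliar with it, ZeRO Stage 2 shards optimizer states and gradients across the data-parallel GPUs. A minimal DeepSpeed config in that spirit might look like the sketch below; only the ZeRO stage comes from this card, while the batch sizes and precision flag are placeholder assumptions:

```python
import json

# Placeholder DeepSpeed config; tune batch sizes for your hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # shard optimizer states and gradients across GPUs
        "overlap_comm": True,  # overlap gradient reduction with the backward pass
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```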
Citation
@misc{motamed2026void,
title={VOID: Video Object and Interaction Deletion},
author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
year={2026},
eprint={2604.02296},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02296}
}