netflix/void-model
Summary
Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.
Source: https://huggingface.co/netflix/void-model
VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but **physical interactions** like objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with **40GB+ VRAM** (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware **quadmask** conditioning — a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
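To make the quadmask format concrete, here is a minimal sketch (not code from the repo) of how the four regions could be rasterized into a single-channel mask using the pixel values listed under Input Format below; the per-region boolean masks are hypothetical inputs from any segmentation step:

```python
import numpy as np

# Hypothetical (H, W) boolean masks for one frame, e.g. from a segmentation model.
remove_mask   = np.zeros((480, 854), dtype=bool)  # primary object to delete
overlap_mask  = np.zeros((480, 854), dtype=bool)  # regions where object and scene overlap
affected_mask = np.zeros((480, 854), dtype=bool)  # regions changed by the removal (falling/displaced items)

# Encode the four classes with the documented values:
# 0 = remove, 63 = overlap, 127 = affected, 255 = keep.
quadmask = np.full(remove_mask.shape, 255, dtype=np.uint8)  # start from "keep"
quadmask[affected_mask] = 127
quadmask[overlap_mask] = 63
quadmask[remove_mask] = 0
```

Per-frame quadmasks like this are then encoded as quadmask_0.mp4 alongside the source video (see Input Format below).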
Checkpoints
| File | Description | Required? |
| --- | --- | --- |
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |
Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
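The warped-noise idea behind Pass 2 can be illustrated with a short sketch: instead of sampling independent Gaussian noise per frame, each frame's initial latent noise is warped from the previous frame along the optical flow, so denoising starts from temporally correlated noise. This is a generic illustration of the technique (with a hypothetical flow input and latent shape), not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def warp_noise(prev_noise: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (C, H, W) noise tensor along a (2, H, W) optical-flow field."""
    _, h, w = prev_noise.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Displace the base pixel grid by the flow, then normalize to [-1, 1].
    grid_x = 2.0 * (xs.float() + flow[0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys.float() + flow[1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    warped = F.grid_sample(prev_noise.unsqueeze(0), grid,
                           mode="nearest", padding_mode="border",
                           align_corners=True)
    return warped.squeeze(0)

# Hypothetical latent shape and flow fields; frame 0 starts from fresh noise.
noise = [torch.randn(16, 48, 84)]
flows = [torch.zeros(2, 48, 84) for _ in range(12)]
for flow in flows:
    noise.append(warp_noise(noise[-1], flow))
```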
Architecture
- **Base:**CogVideoX 3D Transformer (5B parameters)
- **Input:**Video + quadmask + text prompt describing the scene after removal
- **Resolution:**384x672 (default)
- **Max frames:**197
- **Scheduler:**DDIM
- **Precision:**BF16 with FP8 quantization for memory efficiency
Usage
From the Notebook
The easiest way is to clone the repo and run notebook.ipynb:
git clone https://github.com/netflix/void-model.git
cd void-model
From the CLI
# Install dependencies
pip install -r requirements.txt
# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
--local-dir ./CogVideoX-Fun-V1.5-5b-InP
# Download VOID checkpoints
hf download netflix/void-model \
--local-dir .
# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
--config config/quadmask_cogvideox.py \
--config.data.data_rootdir="./sample" \
--config.experiment.run_seqs="lime" \
--config.experiment.save_path="./outputs" \
--config.video_model.transformer_path="./void_pass1.safetensors"
Input Format
Each video needs three files in a folder:
my-video/
input_video.mp4 # source video
quadmask_0.mp4 # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
prompt.json # {"bg": "description of scene after removal"}
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
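Before launching inference, a quick check that a sample folder matches this layout can save a failed run. The helper below is illustrative, and the `./sample/lime` path simply mirrors the data_rootdir and run_seqs values from the CLI example above (an assumption about how sequences are laid out on disk):

```python
import json
from pathlib import Path

def check_sample_folder(folder: str) -> None:
    """Verify that a folder contains the three files VOID expects per video."""
    root = Path(folder)
    for name in ("input_video.mp4", "quadmask_0.mp4", "prompt.json"):
        if not (root / name).exists():
            raise FileNotFoundError(f"{root / name} is missing")
    prompt = json.loads((root / "prompt.json").read_text())
    if "bg" not in prompt:
        raise ValueError('prompt.json must contain a "bg" scene description')
    print(f"{root.name}: ok (bg prompt: {prompt['bg'][:60]}...)")

# Assumed layout: one sub-folder per sequence under the data root directory.
check_sample_folder("./sample/lime")
```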
Training
Trained on paired counterfactual videos generated from two sources:
- **HUMOTO** — human-object interactions rendered in Blender with physics simulation
- **Kubric** — object-only interactions using Google Scanned Objects
Training was run on **8x A100 80GB GPUs** using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
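For readers unfamiliar with it, ZeRO Stage 2 shards optimizer states and gradients across the data-parallel GPUs. A minimal DeepSpeed config in that spirit might look like the sketch below; only the ZeRO stage comes from this card, while the batch sizes and precision flag are placeholder assumptions:

```python
import json

# Placeholder DeepSpeed config; tune batch sizes for your hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # shard optimizer states and gradients across GPUs
        "overlap_comm": True,  # overlap gradient reduction with the backward pass
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```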
Citation
@misc{motamed2026void,
title={VOID: Video Object and Interaction Deletion},
author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
year={2026},
eprint={2604.02296},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02296}
}