netflix/void-model

Summary

Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.

Task: video-to-video Tags: video-inpainting, video-editing, object-removal, cogvideox, diffusion, video-generation, video-to-video, arxiv:2604.02296, license:apache-2.0, region:us

Source: https://huggingface.co/netflix/void-model

VOID: Video Object and Interaction Deletion

VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.

Project Page | Paper | GitHub | Demo

Quick Start

Open in Colab

The included notebook handles setup, downloads the models, runs inference on a sample video, and displays the result. Requires a GPU with 40GB+ VRAM (e.g., an A100).

Model Details

VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
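
To make the conditioning concrete, here is a minimal sketch of composing one quadmask frame from binary region masks. It uses the pixel values documented under Input Format below; the helper itself is illustrative, not code from this repo.

import numpy as np

# Quadmask pixel values (see Input Format below)
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255

def compose_quadmask(primary, overlap, affected):
    """Combine boolean (H, W) region masks into a single quadmask frame.

    primary  - the object to remove
    overlap  - regions where the object overlaps other scene content
    affected - regions physically affected by the removal (falling objects,
               displaced items)
    Everything else is background and is kept.
    """
    quadmask = np.full(primary.shape, KEEP, dtype=np.uint8)
    quadmask[affected] = AFFECTED
    quadmask[overlap] = OVERLAP
    quadmask[primary] = REMOVE  # the primary object takes precedence
    return quadmask

# Tiny example: top-left pixel removed, bottom-right pixel affected
p = np.zeros((2, 2), dtype=bool); p[0, 0] = True
o = np.zeros((2, 2), dtype=bool)
a = np.zeros((2, 2), dtype=bool); a[1, 1] = True
print(compose_quadmask(p, o, a))  # [[  0 255], [255 127]]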

Checkpoints

| File | Description | Required? |
| --- | --- | --- |
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |

Pass 1 is sufficient for most videos. Pass 2 adds optical-flow-warped latent initialization for improved temporal consistency on longer clips.
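
For intuition about Pass 2, below is a hedged sketch of flow-warped noise initialization: each frame's initial latent noise is the previous frame's noise warped along optical flow, so neighboring frames denoise from correlated starting points. The tensor shapes, the flow source, and the absence of re-noising are simplifications, not this repo's actual implementation.

import torch
import torch.nn.functional as F

def warp_with_flow(noise, flow):
    """Warp latent noise (1, C, H, W) by a flow field (1, 2, H, W) in pixels."""
    _, _, h, w = noise.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=noise.dtype),
        torch.arange(w, dtype=noise.dtype),
        indexing="ij",
    )
    grid_x = xs + flow[:, 0]  # displaced x coordinates
    grid_y = ys + flow[:, 1]  # displaced y coordinates
    grid = torch.stack(  # normalize to [-1, 1] for grid_sample
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(noise, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Chain per-frame noise: frame t+1 starts from frame t's noise warped by flow.
# (16, 48, 84) is an illustrative CogVideoX-style latent shape; real warped-noise
# schemes also re-inject fresh high-frequency noise, which is omitted here.
noise = [torch.randn(1, 16, 48, 84)]
for flow in [torch.zeros(1, 2, 48, 84) for _ in range(3)]:  # placeholder flows
    noise.append(warp_with_flow(noise[-1], flow))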

Architecture

  • **Base:** CogVideoX 3D Transformer (5B parameters)
  • **Input:** Video + quadmask + text prompt describing the scene after removal
  • **Resolution:** 384x672 (default)
  • **Max frames:** 197 (a latent-shape sketch follows this list)
  • **Scheduler:** DDIM
  • **Precision:** BF16 with FP8 quantization for memory efficiency
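
For a sense of what those defaults mean for the transformer, here is a back-of-the-envelope latent-shape calculation. It assumes the standard CogVideoX 3D VAE (8x spatial and 4x temporal compression, 16 latent channels) and 2x2 spatial patchification; none of these constants are stated on this card.

# Assumed CogVideoX-style compression; treat every constant here as a guess.
frames, height, width = 197, 384, 672

latent_frames = (frames - 1) // 4 + 1         # 4x temporal: 197 -> 50
latent_h, latent_w = height // 8, width // 8  # 8x spatial: 48 x 84
channels = 16

tokens = latent_frames * (latent_h // 2) * (latent_w // 2)  # 2x2 patches
print((latent_frames, channels, latent_h, latent_w), tokens)
# (50, 16, 48, 84) 50400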

Usage

From the Notebook

The easiest way is to clone the repo and run notebook.ipynb:

git clone https://github.com/netflix/void-model.git
cd void-model

From the CLI

# Install dependencies
pip install -r requirements.txt

# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
hf download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"

Input Format

Each video needs three files in a folder:

my-video/
  input_video.mp4      # source video
  quadmask_0.mp4       # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json          # {"bg": "description of scene after removal"}
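
A short sketch of assembling the folder in Python; the prompt text is a made-up example, and the snapping helper is a hypothetical guard for the fact that lossy video codecs can slightly perturb the four canonical mask values.

import json
from pathlib import Path
import numpy as np

folder = Path("my-video")
folder.mkdir(exist_ok=True)

# prompt.json holds a single "bg" key describing the scene after removal
(folder / "prompt.json").write_text(
    json.dumps({"bg": "an empty kitchen counter in a bright room"})
)

# Lossy codecs can shift pixel values, so snap each decoded quadmask
# pixel to the nearest canonical level before using it.
LEVELS = np.array([0, 63, 127, 255], dtype=np.int16)

def snap_to_levels(frame):
    """Map a decoded (H, W) uint8 quadmask frame to the nearest level."""
    idx = np.abs(frame.astype(np.int16)[..., None] - LEVELS).argmin(axis=-1)
    return LEVELS[idx].astype(np.uint8)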

The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
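
For orientation, here is a hedged sketch of the SAM2 half of such a pipeline: tracking the object to remove through the video with SAM2's video predictor. It follows the public sam2 package API; the checkpoint and config paths, the click coordinates, and mp4 input support are assumptions about your local setup, and the Gemini grounding step is reduced to a comment.

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; use whichever SAM2 checkpoint/config you have locally.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Depending on the sam2 version, this takes an mp4 path or a JPEG folder.
    state = predictor.init_state(video_path="my-video/input_video.mp4")
    # One positive click on the object to remove (placeholder coordinates);
    # in a full pipeline a VLM such as Gemini would supply this grounding.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[300, 200]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    # Propagate the prompt to get a mask for every frame of the video
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()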

Training

Trained on paired counterfactual videos generated from two sources:

  • HUMOTO: human-object interactions rendered in Blender with physics simulation
  • Kubric: object-only interactions using Google Scanned Objects

Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
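
As a rough illustration of that setup, a minimal ZeRO Stage 2 configuration handed to deepspeed.initialize; the model stand-in, batch sizes, and optimizer settings are placeholders, not the training recipe used here.

import torch.nn as nn
import deepspeed

model = nn.Linear(8, 8)  # stand-in for the fine-tuned CogVideoX transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder values throughout
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Launched across 8 GPUs with: deepspeed --num_gpus=8 train.py
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)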

Citation

@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}

Similar Articles

Relit-LiVE: Relight Video by Jointly Learning Environment Video

Hugging Face Daily Papers

This paper introduces Relit-LiVE, a novel video relighting framework that produces physically consistent results without requiring camera pose information by using raw reference images and joint environment video prediction.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Hugging Face Daily Papers

VEFX-Bench introduces a large-scale human-annotated video editing dataset (5,049 examples) with multi-dimensional quality labels and a specialized reward model for standardized evaluation of video editing systems. The paper addresses the lack of comprehensive benchmarks in AI-assisted video creation by providing VEFX-Dataset, VEFX-Reward, and a 300-video-prompt benchmark that reveals gaps in current editing models.

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Hugging Face Daily Papers

HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.