netflix/void-model

Summary

Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.

Task: video-to-video
Tags: video-inpainting, video-editing, object-removal, cogvideox, diffusion, video-generation, video-to-video, arxiv:2604.02296, license:apache-2.0, region:us

Cached at: 04/20/26, 02:45 PM

Source: https://huggingface.co/netflix/void-model

VOID: Video Object and Interaction Deletion

VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.

Project Page | Paper | GitHub | Demo

Quick Start

The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. It requires a GPU with 40GB+ VRAM (e.g., A100).

Model Details

VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).

Checkpoints

| File | Description | Required? |
|------|-------------|-----------|
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
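
To make the idea of flow-warped initialization concrete, here is a minimal NumPy sketch. It is not the VOID implementation: it backward-warps one frame's noise to the next frame with a dense flow field and nearest-neighbour sampling. The helper name `warp_noise`, the `(H, W, C)` array layout, the `(dx, dy)` flow convention, and the sampling scheme are all assumptions chosen for illustration; the repository's Pass 2 code remains the reference.

```python
import numpy as np


def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `noise` (H, W, C) with `flow` (H, W, 2) given as (dx, dy).

    Each output pixel (y, x) copies the noise sample at (y - dy, x - dx),
    rounded to the nearest pixel and clamped to the frame border.
    Hypothetical helper for illustration only.
    """
    height, width = noise.shape[:2]
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, width - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, height - 1)
    return noise[src_y, src_x]


rng = np.random.default_rng(0)
frame_noise = rng.standard_normal((4, 6, 3))

# A uniform flow of one pixel to the right moves every noise sample right by one.
shift_right = np.zeros((4, 6, 2))
shift_right[..., 0] = 1.0
warped = warp_noise(frame_noise, shift_right)
```

Nearest-neighbour sampling copies noise values unchanged, so each pixel stays unit-variance; averaging neighbours (as bilinear interpolation does) would lower the variance of the warped noise.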

Architecture

  • **Base:** CogVideoX 3D Transformer (5B parameters)
  • **Input:** Video + quadmask + text prompt describing the scene after removal
  • **Resolution:** 384x672 (default)
  • **Max frames:** 197
  • **Scheduler:** DDIM
  • **Precision:** BF16 with FP8 quantization for memory efficiency

Usage

From the Notebook

The easiest way is to clone the repo and run notebook.ipynb:

git clone https://github.com/netflix/void-model.git
cd void-model

From the CLI

# Install dependencies
pip install -r requirements.txt

# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
hf download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"
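
When several sequence folders need processing, the same command can be assembled from Python. The sketch below reuses only the script path and flags shown above; the helper names `build_pass1_command` and `run_pass1`, and the choice to launch one process per sequence, are assumptions for illustration rather than part of the repository.

```python
import subprocess
import sys


def build_pass1_command(sequence: str,
                        data_root: str = "./sample",
                        save_path: str = "./outputs",
                        checkpoint: str = "./void_pass1.safetensors") -> list[str]:
    """Assemble the Pass 1 command line for one sequence folder under data_root."""
    return [
        sys.executable, "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_root}",
        f"--config.experiment.run_seqs={sequence}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.transformer_path={checkpoint}",
    ]


def run_pass1(sequences: list[str], dry_run: bool = True) -> None:
    """Print (and, unless dry_run is set, execute) one command per sequence."""
    for name in sequences:
        command = build_pass1_command(name)
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if inference exits non-zero.
            subprocess.run(command, check=True)


run_pass1(["lime"])  # dry run: only prints the command
```

Passing the arguments as a list avoids shell quoting issues; call `run_pass1([...], dry_run=False)` from the repository root to actually launch inference.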

Input Format

Each video needs three files in a folder:

my-video/
  input_video.mp4      # source video
  quadmask_0.mp4       # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json          # {"bg": "description of scene after removal"}
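
A quick pre-flight check can catch a missing file or a malformed prompt before a long GPU run. This is a minimal sketch based only on the layout above; the function name `check_sequence_folder` is hypothetical, and the requirement that "bg" be a string is an assumption inferred from the example.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def check_sequence_folder(folder: str) -> list[str]:
    """Return a list of problems found in one sequence folder (empty if none)."""
    root = Path(folder)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    prompt_path = root / "prompt.json"
    if prompt_path.is_file():
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as err:
            problems.append(f"prompt.json is not valid JSON: {err}")
        else:
            # Assumed schema: a JSON object with a string "bg" description.
            if not isinstance(prompt, dict) or not isinstance(prompt.get("bg"), str):
                problems.append('prompt.json needs a string "bg" field')
    return problems
```

For example, `check_sequence_folder("my-video")` returns an empty list when all three files are present and the prompt parses.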

The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
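
For readers preparing masks by hand, the four gray levels can be composed from two binary masks. The following is a sketch under stated assumptions, not the repository's pipeline: `compose_quadmask` is a hypothetical helper, and it assumes "overlap" means pixels covered by both the primary object and the affected region.

```python
import numpy as np

# Gray levels from the input-format listing: remove, overlap, affected, keep.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255


def compose_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into one uint8 quadmask frame.

    Assumption: "overlap" marks pixels covered by both the primary object
    and the affected region.
    """
    quad = np.full(primary.shape, KEEP, dtype=np.uint8)
    quad[affected] = AFFECTED
    quad[primary] = REMOVE
    quad[primary & affected] = OVERLAP
    return quad


primary = np.zeros((4, 6), dtype=bool)
affected = np.zeros((4, 6), dtype=bool)
primary[1:3, 1:4] = True    # object to remove
affected[1:3, 3:5] = True   # region the object influences
frame = compose_quadmask(primary, affected)
```

Each frame produced this way would still need to be encoded into quadmask_0.mp4; lossy video codecs can shift gray levels, so the encoding settings matter and are not covered here.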

Training

Trained on paired counterfactual videos generated from two sources:

  • HUMOTO: human-object interactions rendered in Blender with physics simulation
  • Kubric: object-only interactions using Google Scanned Objects

Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.

Citation

@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
