netflix/void-model

Summary

Netflix releases VOID, a video inpainting model that removes objects from videos while realistically simulating physical interactions (e.g., objects falling when a person is removed), built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning.

Task: video-to-video
Tags: video-inpainting, video-editing, object-removal, cogvideox, diffusion, video-generation, video-to-video, arxiv:2604.02296, license:apache-2.0, region:us

Cached at: 04/20/26, 02:45 PM

Source: https://huggingface.co/netflix/void-model

VOID: Video Object and Interaction Deletion

VOID removes objects from videos along with all the interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.

Project Page | Paper | GitHub | Demo

Quick Start

The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. It requires a GPU with 40GB+ VRAM (e.g., A100).

Model Details

VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
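The four mask classes can be illustrated with a small sketch. The pixel values (0, 63, 127, 255) are the ones listed under Input Format; the function names are hypothetical, and treating "overlap" as the intersection of the primary object and an affected region is an assumption, since this card does not define it further.

```python
# Pixel values taken from the Input Format section of this card.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255


def quadmask_value(is_primary: bool, is_affected: bool) -> int:
    """Map two per-pixel flags to one quadmask value.

    Assumption: "overlap" means the pixel belongs to both the primary
    object and an affected region.
    """
    if is_primary and is_affected:
        return OVERLAP
    if is_primary:
        return REMOVE
    if is_affected:
        return AFFECTED
    return KEEP


def build_quadmask_frame(primary, affected):
    """Combine two same-sized 2D boolean grids into one quadmask frame."""
    return [
        [quadmask_value(p, a) for p, a in zip(p_row, a_row)]
        for p_row, a_row in zip(primary, affected)
    ]


# One row of four pixels: primary only, both, affected only, neither.
primary = [[True, True, False, False]]
affected = [[False, True, True, False]]
print(build_quadmask_frame(primary, affected))  # [[0, 63, 127, 255]]
```

In practice the same rule would be applied per frame to segmentation masks and the result encoded as a grayscale video; the repo's own mask pipeline remains the reference for how the masks are actually produced.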

Checkpoints

| File | Description | Required? |
|---|---|---|
| void_pass1.safetensors | Base inpainting model | Yes |
| void_pass2.safetensors | Warped-noise refinement for temporal consistency | Optional |

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.

Architecture

  • **Base:** CogVideoX 3D Transformer (5B parameters)
  • **Input:** Video + quadmask + text prompt describing the scene after removal
  • **Resolution:** 384x672 (default)
  • **Max frames:** 197
  • **Scheduler:** DDIM
  • **Precision:** BF16 with FP8 quantization for memory efficiency
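For a rough sense of scale, the sketch below estimates the latent tensor shape for the default resolution and maximum frame count. The compression factors (4x temporal, 8x spatial) and the 16 latent channels are assumptions based on the CogVideoX VAE as commonly configured; this card does not state them, so treat the result as an estimate rather than a specification.

```python
def latent_shape(
    num_frames: int = 197,
    height: int = 384,
    width: int = 672,
    temporal_factor: int = 4,   # assumed CogVideoX VAE temporal compression
    spatial_factor: int = 8,    # assumed CogVideoX VAE spatial compression
    latent_channels: int = 16,  # assumed CogVideoX VAE latent channels
) -> tuple:
    """Return (channels, frames, height, width) of the latent video."""
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (
        latent_channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


print(latent_shape())  # (16, 50, 48, 84)
```

Under these assumptions, 197 frames fit the 4k + 1 pattern (k = 49) and map to 50 latent frames.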

Usage

From the Notebook

The easiest way is to clone the repo and run notebook.ipynb:

git clone https://github.com/netflix/void-model.git
cd void-model

From the CLI

# Install dependencies
pip install -r requirements.txt

# Download the base model
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
hf download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"
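To run several sequences, the Pass 1 command above can be assembled programmatically. This is a minimal sketch: `build_pass1_command` is a hypothetical helper, it uses only the script path and flags shown above, and it assumes each sequence name is passed through `run_seqs` one at a time.

```python
import shlex


def build_pass1_command(
    sequence: str,
    data_root: str = "./sample",
    save_path: str = "./outputs",
    checkpoint: str = "./void_pass1.safetensors",
) -> list:
    """Return the Pass 1 inference command as an argument list."""
    return [
        "python",
        "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_root}",
        f"--config.experiment.run_seqs={sequence}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.transformer_path={checkpoint}",
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_pass1_command("lime")))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repo root would execute it; building a list rather than a shell string avoids quoting problems with paths that contain spaces.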

Input Format

Each video needs three files in a folder:

my-video/
  input_video.mp4      # source video
  quadmask_0.mp4       # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json          # {"bg": "description of scene after removal"}

The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
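Before launching inference, it can help to confirm that a folder matches the layout above. The following is a sketch only: `check_sample_dir` is a hypothetical helper, not part of the repo, and it checks just the three file names and the "bg" key described in this section.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def check_sample_dir(folder) -> list:
    """Return a list of problems found in one input folder (empty if OK)."""
    folder = Path(folder)
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (folder / name).is_file()
    ]

    prompt_path = folder / "prompt.json"
    if prompt_path.is_file():
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"prompt.json is not valid JSON: {exc}")
        else:
            bg = prompt.get("bg") if isinstance(prompt, dict) else None
            if not isinstance(bg, str) or not bg.strip():
                problems.append('prompt.json needs a non-empty "bg" string')
    return problems


if __name__ == "__main__":
    # "./my-video" mirrors the example folder name used above.
    for problem in check_sample_dir("./my-video"):
        print(problem)
```

The check does not open the videos, so it will not catch a quadmask whose pixel values fall outside 0, 63, 127, and 255.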

Training

Trained on paired counterfactual videos generated from two sources:

  • HUMOTO: human-object interactions rendered in Blender with physics simulation
  • Kubric: object-only interactions using Google Scanned Objects

Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.

Citation

@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
