VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
Summary
VGGT-Edit proposes a feed-forward framework for text-conditioned native 3D scene editing using depth-synchronized text injection and residual field prediction, achieving superior quality and efficiency over 2D-lifting approaches.
Cached at: 05/15/26, 04:23 AM
Source: https://huggingface.co/papers/2605.15186
Abstract
VGGT-Edit enables text-conditioned 3D scene editing through depth-synchronized text injection and direct geometric displacement prediction, achieving superior quality and efficiency over 2D-lifting approaches.
High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone’s spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
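The abstract gives no equations, so the following is only an illustrative sketch of what "residual field prediction" could look like on a toy point set, not the authors' implementation. It assumes, hypothetically, that an edit is expressed as a per-point displacement `d` gated by a soft edit mask `m`, so that the edited point is `p + m * d` and background points (`m = 0`) stay fixed. The function names, the mask, and the mean-distance error are all assumptions introduced for illustration; the error is one possible geometric term, not the paper's stated objective.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_residual_field(
    points: Sequence[Point],
    displacements: Sequence[Point],
    edit_mask: Sequence[float],
) -> List[Point]:
    """Deform points as p' = p + m * d (hypothetical formulation, not from the paper)."""
    if not (len(points) == len(displacements) == len(edit_mask)):
        raise ValueError("points, displacements and edit_mask must have equal length")
    return [
        (p[0] + m * d[0], p[1] + m * d[1], p[2] + m * d[2])
        for p, d, m in zip(points, displacements, edit_mask)
    ]


def mean_point_error(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding points (an assumed geometric term)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum(math.dist(a, b) for a, b in zip(pred, target)) / len(pred)


# Toy scene: the first point belongs to the edited object, the second to the background.
scene = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
residual = [(0.0, 0.5, 0.0), (9.0, 9.0, 9.0)]  # a large residual on the background point
mask = [1.0, 0.0]                              # the mask suppresses the background residual

edited = apply_residual_field(scene, residual, mask)
drift = mean_point_error(edited, scene)        # average movement relative to the input scene
```

In this toy case only the masked-in point moves, which is one simple way to read "preserving background stability"; how VGGT-Edit actually achieves this is not specified in the abstract.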
Similar Articles
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 is a unified framework that integrates 3D mesh as a native modality into multimodal language models via a Mixture-of-Transformers architecture, enabling state-of-the-art text-to-3D generation and long-context multi-turn geometric editing.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework that aligns diffusion-based image editing models with human preferences via RLHF, using a new 50K real-world dataset and an automatic VLM-based evaluator.
See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
This paper introduces OmniManim, a render-feedback-aware framework for generating educational animations from natural language descriptions using large language models. It addresses visual defects like element overlap and misalignment by incorporating explicit visual planning, post-render diagnostics, and localized repair, demonstrating improved render quality on newly constructed datasets.