UniMesh: Unifying 3D Mesh Understanding and Generation
Summary
UniMesh introduces a single model that jointly handles 3D mesh generation and understanding via a Mesh Head, Chain-of-Mesh iterative editing, and a self-reflection error-correction mechanism.
Source: https://huggingface.co/papers/2604.17472
Abstract
UniMesh presents a unified framework that combines 3D generation and understanding tasks through novel components including a Mesh Head, Chain of Mesh for iterative editing, and a self-reflection mechanism for error correction.
Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D generation (e.g., synthesis, completion, and editing). However, these tasks are often tackled in isolation, resulting in fragmented architectures and representations that hinder knowledge transfer and holistic scene modeling. To address these challenges, we propose UniMesh, a unified framework that jointly learns 3D generation and understanding within a single architecture. First, we introduce a novel Mesh Head that acts as a cross-model interface, bridging diffusion-based image generation with implicit shape decoders. Second, we develop Chain of Mesh (CoM), a geometric instantiation of iterative reasoning that enables user-driven semantic mesh editing through a closed-loop latent, prompting, and re-generation cycle. Third, we incorporate a self-reflection mechanism based on an Actor-Evaluator-Self-reflection triad to diagnose and correct failures in high-level tasks like 3D captioning. Experimental results demonstrate that UniMesh not only achieves competitive performance on standard benchmarks but also unlocks novel capabilities in iterative editing and mutual enhancement between generation and understanding. Code: https://github.com/AIGeeksGroup/UniMesh. Website: https://aigeeksgroup.github.io/UniMesh.
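The abstract names an Actor-Evaluator-Self-reflection triad but does not say how the three roles interact. The following is a minimal sketch of one common way such a loop is wired, not the UniMesh implementation. Every name in it (`reflect_loop`, `ReflectionTrace`, the toy actor, evaluator, and reflector, and the `max_rounds` parameter) is hypothetical and chosen only for illustration. The toy functions stand in for real models: the actor drafts a caption, the evaluator accepts or rejects it, and the reflector turns the rejection into a hint for the next attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases; the paper's abstract does not define these interfaces.
Actor = Callable[[str, List[str]], str]          # (task, hints) -> output
Evaluator = Callable[[str, str], Tuple[bool, str]]  # (task, output) -> (ok, critique)
Reflector = Callable[[str, str, str], str]       # (task, output, critique) -> hint


@dataclass
class ReflectionTrace:
    """Records every attempt and every hint produced along the way."""
    attempts: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def reflect_loop(actor: Actor, evaluator: Evaluator, reflector: Reflector,
                 task: str, max_rounds: int = 3) -> Tuple[str, ReflectionTrace]:
    """Retry the actor until the evaluator accepts or the round budget runs out."""
    trace = ReflectionTrace()
    hints: List[str] = []
    output = ""
    for _ in range(max_rounds):
        output = actor(task, hints)
        trace.attempts.append(output)
        ok, critique = evaluator(task, output)
        if ok:
            break
        # Turn the critique into a hint that the next attempt can condition on.
        hint = reflector(task, output, critique)
        hints.append(hint)
        trace.feedback.append(hint)
    return output, trace


# Toy stand-ins for real models (assumptions for demonstration only).
def toy_actor(task: str, hints: List[str]) -> str:
    caption = "a chair"
    if hints:  # any feedback makes the toy actor add the missing detail
        caption += " with four legs"
    return caption


def toy_evaluator(task: str, output: str) -> Tuple[bool, str]:
    if "legs" in output:
        return True, ""
    return False, "caption omits the legs"


def toy_reflector(task: str, output: str, critique: str) -> str:
    return f"Previous caption '{output}' failed: {critique}."


if __name__ == "__main__":
    final, trace = reflect_loop(toy_actor, toy_evaluator, toy_reflector,
                                "caption the mesh")
    print(final)
    print(trace.attempts)
```

In this sketch the loop stops as soon as the evaluator accepts an output, so the round budget only matters when the actor keeps failing. A real system would replace the toy callables with model calls and would likely need a richer evaluator than a substring check.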
Similar Articles
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 is a unified framework that integrates 3D mesh as a native modality into multimodal language models via a Mixture-of-Transformers architecture, enabling state-of-the-art text-to-3D generation and long-context multi-turn geometric editing.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit proposes using intelligent image editing as a single general task to simultaneously improve unified multimodal models' understanding, generation, and editing capabilities, with an automated data synthesis pipeline creating complex editing instructions.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath proposes a framework for adaptive coordination of understanding and generation in unified multimodal models, leveraging coordination-path diversity to improve performance over fixed strategies.
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
MeshWeaver presents an autoregressive mesh generation framework that directly predicts vertices using a multi-level sparse-voxel encoder, achieving state-of-the-art compression and geometric fidelity for high-poly meshes.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.