FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Hugging Face Daily Papers

Summary

FLAT proposes a method to decode explicit triangle splats directly from video diffusion latents for geometrically accurate 3D scene generation. It introduces a ray-centered rotation parameterization and a product window function to improve gradient flow, achieving better geometric accuracy than prior feedforward methods while supporting real-time rendering.

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging because it is highly sensitive to primitive orientation, which often leads to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io
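The orientation sensitivity mentioned above follows from basic geometry: the projected area of a flat primitive scales with the absolute cosine of the angle between its normal and the viewing ray, so a primitive seen edge-on covers no pixels and receives no image gradient. The sketch below illustrates only that effect. It is not the paper's parameterization. It assumes, for illustration, that "ray-centered" means the rotation is expressed relative to a frame aligned with the pixel's viewing ray, and the names `ray_frame`, `tilted_normal`, and `foreshortening` are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ray_frame(ray_dir):
    """Axes (x, y, z) of an orthonormal frame whose z-axis points back along the ray.

    Assumption: this stands in for a "ray-centered" frame; the paper may define it differently.
    """
    d = normalize(ray_dir)
    z = tuple(-c for c in d)
    # Pick a helper axis that is not parallel to z.
    helper = (0.0, 1.0, 0.0) if abs(z[1]) < 0.9 else (1.0, 0.0, 0.0)
    x = normalize(cross(helper, z))
    y = cross(z, x)
    return x, y, z

def tilted_normal(axes, tilt):
    """Normal obtained by tilting the frame's z-axis by `tilt` radians about its x-axis."""
    _, y, z = axes
    return tuple(-math.sin(tilt) * yc + math.cos(tilt) * zc for yc, zc in zip(y, z))

def foreshortening(normal, ray_dir):
    """|cos| of the angle between normal and ray; projected area scales with this factor."""
    return abs(dot(normal, normalize(ray_dir)))

WORLD_AXES = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

ray = (1.0, 0.0, 0.0)  # a viewing ray along world +x

# Zero rotation in a world-aligned frame: the primitive is edge-on to this ray.
world_default = foreshortening(tilted_normal(WORLD_AXES, 0.0), ray)

# Zero rotation in the ray-aligned frame: the primitive faces the camera.
ray_default = foreshortening(tilted_normal(ray_frame(ray), 0.0), ray)

print(world_default, ray_default)
```

Under these assumptions, a zero rotation in the ray-aligned frame always yields a camera-facing primitive, whereas a zero rotation in a world-aligned frame can yield one that is invisible from the given ray. This is one plausible reason a ray-relative parameterization would ease regression.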

Cached at: 06/24/26, 09:47 AM

Source: https://huggingface.co/papers/2606.24876

Abstract

Video diffusion models are adapted to decode explicit surface primitives directly from latent space, enabling high-quality 3D scene generation with improved geometric accuracy and real-time rendering capabilities.

Similar Articles

TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Hugging Face Daily Papers

TriSplat is a feed-forward 3D reconstruction network that uses oriented triangle primitives to directly generate simulation-ready meshes from single images, bypassing expensive post-processing steps. It achieves geometry-faithful reconstructions while maintaining competitive novel-view rendering quality.

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

Hugging Face Daily Papers

GlobalSplat introduces an efficient feed-forward framework for 3D Gaussian splatting that uses global scene tokens to produce compact, consistent scene reconstructions, reducing computational overhead and bringing inference time under 78 ms. A coarse-to-fine training approach prevents representation bloat while maintaining competitive novel-view synthesis performance with far fewer Gaussians (16K) than dense baselines.