SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Hugging Face Daily Papers

Summary

Presents SceneAligner, a deep learning approach for floorplan localization that uses 3D scene reconstruction and cross-modal correspondence learning to work in real-world environments with limited data.

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and on rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as few as a single input image. Our code and data will be publicly available.
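The two geometric steps named in the abstract, projecting a gravity-aligned 3D scene into a 2D density map and aligning it to the floorplan with a 2D similarity transform, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `density_map` and `fit_similarity_2d`, the `cell` size, and the assumption that the z axis points up are all hypothetical. The learned cross-modal matching is not reproduced here; the sketch assumes 2D point correspondences are already available and fits the transform with the standard closed-form least-squares solution (Umeyama), which may differ from the estimator used in the paper.

```python
import numpy as np


def density_map(points, cell=0.1):
    """Project gravity-aligned 3D points (z assumed up) onto the ground
    plane and count points per grid cell (hypothetical helper)."""
    xy = points[:, :2]
    origin = xy.min(axis=0)                       # world position of cell (0, 0)
    idx = np.floor((xy - origin) / cell).astype(int)
    width, height = idx.max(axis=0) + 1
    grid = np.zeros((height, width), dtype=np.float64)
    # np.add.at accumulates correctly when several points share a cell.
    np.add.at(grid, (idx[:, 1], idx[:, 0]), 1.0)
    return grid, origin


def fit_similarity_2d(src, dst):
    """Least-squares scale s, rotation R, translation t such that
    dst ≈ s * R @ src + t, for matched (N, 2) point arrays."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                    # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid a reflection
        D[1, 1] = -1.0
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t


# Usage: recover a known transform from synthetic, noise-free matches.
rng = np.random.default_rng(0)
proxy_pts = rng.uniform(-5.0, 5.0, size=(20, 2))  # points in density-map frame
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])
plan_pts = 2.5 * proxy_pts @ R_true.T + np.array([40.0, 12.0])
s_est, R_est, t_est = fit_similarity_2d(proxy_pts, plan_pts)
```

With real matches, which contain outliers, a robust wrapper such as RANSAC around `fit_similarity_2d` would typically be needed; that choice is likewise an assumption and not something the abstract specifies.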

Cached at: 05/22/26, 06:34 AM

Source: https://huggingface.co/papers/2605.22581
