SceneAligner: 3D-Grounded Floorplan Localization in the Wild
Summary
Presents SceneAligner, a deep learning approach for floorplan localization that uses 3D scene reconstruction and cross-modal correspondence learning to work in real-world environments with limited data.
Cached at: 05/22/26, 06:34 AM
Source: https://huggingface.co/papers/2605.22581
Abstract
Many public buildings provide floorplans with a “you are here” indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
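The two geometric steps named in the abstract, projecting a gravity-aligned 3D scene into a 2D density map and aligning it to the floorplan with a 2D similarity transform, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (density_map, fit_similarity_2d, apply_similarity_2d) are hypothetical, the z axis is assumed to be the gravity direction, and the matched point pairs are assumed to be given already (in the paper they would come from the learned cross-modal correspondences). The fit is a plain least-squares solution with no outlier rejection.

```python
import math


def density_map(points, cell_size, min_xy, shape):
    """Project gravity-aligned 3D points (x, y, z) onto the ground plane.

    Assumption: z is the gravity (up) axis, so dropping it gives a top-down
    view. Returns a rows x cols grid of point counts per cell.
    """
    rows, cols = shape
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        col = math.floor((x - min_xy[0]) / cell_size)
        row = math.floor((y - min_xy[1]) / cell_size)
        if 0 <= row < rows and 0 <= col < cols:  # ignore points off the grid
            grid[row][col] += 1
    return grid


def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst.

    Points are treated as complex numbers, so the transform is q = a * p + b:
    abs(a) is the scale, the phase of a is the rotation, b is the translation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form minimizer of sum |q_i - a * p_i - b|^2 over complex a, b.
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("source points are all identical")
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, point):
    """Map one (x, y) point through q = a * p + b."""
    z = a * complex(*point) + b
    return z.real, z.imag


# Usage: three matched points related by scale 2, rotation 90 degrees,
# and translation (3, 4).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b = fit_similarity_2d(src, dst)
scale, angle_deg = abs(a), math.degrees(math.atan2(a.imag, a.real))
```

Writing the transform as a single complex multiplication keeps it to four degrees of freedom (scale, rotation, and a 2D translation) and rules out reflections by construction. With noisy real-world matches, a robust wrapper such as RANSAC around the same fit would likely be needed; the abstract does not specify which estimator the authors use.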
Similar Articles
OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs
OSMGraphCLIP is a model that learns global location embeddings from OpenStreetMap data using a graph-based encoder and contrastive alignment with a spherical-harmonics location encoder. It achieves strong performance across diverse geospatial tasks, often matching or exceeding satellite-based methods.
Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.
LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling
LDARNet is a 120M-parameter hierarchical genomic foundation model that introduces learnable adaptive tokenization (inspired by H-Net's dynamic chunking) for masked language modeling on DNA sequences. It achieves state-of-the-art results on 5 histone modification tasks and outperforms models up to 20× larger on several genomic benchmarks, with learned token boundaries aligning with biological features like promoter motifs and splice junctions.
@AdinaYakup: JD just released JoyAI-Echo An interesting long video generation model 5 minute multi shot video generation Cross modal…
JD released JoyAI-Echo, a long video generation model capable of 5-minute multi-shot video with cross-modal memory for character and voice consistency, native audio+video generation, and 7.5x speed improvement via DMD distillation.
ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
ChatHealthAI is a multimodal reasoning framework that aligns structured EHR representations with a frozen LLM to enable grounded clinical reasoning while maintaining predictive performance.