RemoteZero: Geospatial Reasoning with Zero Human Annotations
Summary
RemoteZero is a framework that eliminates the need for human-annotated box supervision in geospatial reasoning by leveraging the semantic verification capabilities of multimodal large language models (MLLMs) to enable self-evolving localization from unlabeled remote sensing data.
Cached at: 05/08/26, 06:56 AM
Source: https://huggingface.co/papers/2605.04451
Abstract
RemoteZero enables geospatial reasoning without box supervision by leveraging semantic verification capabilities of MLLMs for self-evolving localization from unlabeled remote sensing data.
Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning. RemoteZero is motivated by a simple asymmetry: an MLLM is typically better at verifying whether a region satisfies a query than at directly generating precise coordinates. Leveraging this stronger discriminative ability, RemoteZero replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations. The resulting framework further supports iterative self-evolution, allowing the model to improve from unlabeled remote sensing imagery through its own verification signal. Experiments show that RemoteZero achieves competitive performance against strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.
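The abstract does not say how the verification signal is turned into a training reward, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a verifier has already scored each sampled box proposal for one query with a value in [0, 1] (the scores below are made up for illustration) and shows the group-relative normalization commonly used in GRPO-style training, where each sample's reward is compared against the others in its group. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style advantage).

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verification scores for four box proposals sampled for one
# query. In a box-supervision-free setup these would come from the model's
# own judgment of whether the region satisfies the query, not from
# ground-truth coordinates.
verifier_scores = [0.9, 0.2, 0.6, 0.3]
advantages = group_relative_advantages(verifier_scores)

# Proposals scored above the group mean receive positive advantages.
best_index = max(range(len(advantages)), key=lambda i: advantages[i])
```

Under these assumptions no box annotation enters the computation: the only supervision is the relative ranking of the verifier's scores within each group.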
Get this paper in your agent:
hf papers read 2605.04451
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
TRN-R1-Zero introduces a post-training framework that enables LLMs to perform zero-shot reasoning on text-rich networks using only reinforcement learning, without supervised fine-tuning or chain-of-thought data.
Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models
This paper investigates using large vision-language models for built environment reasoning tasks, such as design suggestions and risk identification, leveraging remote sensing imagery. It evaluates models like InternVL and Qwen, highlighting their potential for supporting smart city decision-making and quantitative reasoning.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
G-Zero: Self-Play for Open-Ended Generation from Zero Data
This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
DiZiNER is a framework that uses disagreement between multiple LLMs to refine task instructions for zero-shot named entity recognition, achieving state-of-the-art results on 14 out of 18 benchmarks and significantly reducing the performance gap between zero-shot and supervised systems.