Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Hugging Face Daily Papers 05/21/26, 12:00 AM Papers

Summary

SAMOSA adapts SAM 2 for visual object tracking by incorporating motion prediction, semantic detection, and geometric constraints to improve robustness and generalization in complex scenarios with distractors, occlusion, and nonlinear motion.

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
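To make the idea of motion-guided mask selection concrete, here is a minimal sketch. It is not SAMOSA's implementation: the abstract does not describe the predictor's design, so a simple constant-acceleration extrapolation stands in for the "lightweight nonlinear motion predictor", and the names `extrapolate_box`, `iou`, `select_candidate`, and the `motion_weight` parameter are all hypothetical. The sketch assumes each candidate mask has already been reduced to a bounding box with a confidence score.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def extrapolate_box(history: Sequence[Box]) -> Box:
    """Predict the next box from the last three boxes.

    Stand-in predictor (an assumption, not the paper's model):
    constant-acceleration extrapolation applied per coordinate,
    p_next = 3*p_t - 3*p_{t-1} + p_{t-2}.
    With fewer than three boxes, fall back to the most recent one.
    """
    if len(history) < 3:
        return history[-1]
    a, b, c = history[-3], history[-2], history[-1]
    return tuple(3 * ci - 3 * bi + ai for ai, bi, ci in zip(a, b, c))


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_candidate(
    candidates: List[Tuple[Box, float]],
    predicted: Box,
    motion_weight: float = 0.5,
) -> int:
    """Return the index of the candidate with the best combined score.

    Each candidate is (box, confidence). The score blends the model's
    confidence with agreement (IoU) against the motion-predicted box.
    """
    scores = [
        (1.0 - motion_weight) * conf + motion_weight * iou(box, predicted)
        for box, conf in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Target accelerating to the right; a distractor sits at the start position.
history = [(0, 0, 2, 2), (1, 0, 3, 2), (4, 0, 6, 2)]
predicted = extrapolate_box(history)
candidates = [((0, 0, 2, 2), 0.9), ((9, 0, 11, 2), 0.8)]
best = select_candidate(candidates, predicted)
```

In this toy case the distractor has the higher confidence, but the candidate that agrees with the predicted motion wins once the motion term is included. The same agreement score could plausibly be used to decide which frames are kept in memory, though how SAMOSA does this is not stated in the abstract.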

Original Article

Cached at: 05/22/26, 06:30 AM

Source: https://huggingface.co/papers/2605.22538
