Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Summary
SAMOSA adapts SAM 2 for visual object tracking by incorporating motion prediction, semantic detection, and geometric constraints to improve robustness and generalization in complex scenarios with distractors, occlusion, and nonlinear motion.
Source: https://huggingface.co/papers/2605.22538
Abstract
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
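The abstract does not say what form the motion predictor takes or how it scores candidate masks. As a rough illustration of motion-guided mask selection in general, the sketch below assumes a constant-acceleration extrapolation of box coordinates and a weighted sum of IoU and mask confidence. The names `predict_next_box`, `select_candidate`, and `alpha` are hypothetical and are not taken from the SAMOSA code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def predict_next_box(history: Sequence[Box]) -> Box:
    """Extrapolate the next box from the last three boxes.

    Assumption: constant acceleration per coordinate, which gives
    p[t+1] = 3*p[t] - 3*p[t-1] + p[t-2]. This is a stand-in for a
    nonlinear motion predictor, not the one used in the paper.
    """
    if len(history) < 3:
        return history[-1]  # too little history: fall back to the last box
    b0, b1, b2 = history[-3], history[-2], history[-1]
    return tuple(3 * c2 - 3 * c1 + c0 for c0, c1, c2 in zip(b0, b1, b2))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_candidate(
    history: Sequence[Box],
    candidates: Sequence[Tuple[Box, float]],
    alpha: float = 0.5,
) -> int:
    """Return the index of the candidate with the best combined score.

    Each candidate is (box of the mask, mask confidence in [0, 1]).
    alpha weights motion agreement against confidence (hypothetical knob).
    """
    predicted = predict_next_box(history)
    scores = [
        alpha * iou(predicted, box) + (1.0 - alpha) * conf
        for box, conf in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# A target accelerating to the right, plus a distractor at its old position.
history = [(0, 0, 10, 10), (2, 0, 12, 10), (6, 0, 16, 10)]
candidates = [((6, 0, 16, 10), 0.9), ((12, 0, 22, 10), 0.8)]
best = select_candidate(history, candidates)
```

In this toy case the distractor has the higher mask confidence, but the second candidate agrees with the extrapolated position, so the combined score prefers it. With `alpha=0.0` the choice reduces to confidence alone.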