Selective Synergistic Learning for Video Object-Centric Learning
Summary
Selective Synergistic Learning (SSync) improves video object-centric learning by selectively distilling reliable cues via pseudo-labeling and transitive merging, avoiding error propagation from indiscriminate dense alignment.
Cached at: 06/20/26, 02:29 PM
Source: https://huggingface.co/papers/2606.15527
Abstract
Selective Synergistic Learning (SSync) addresses limitations in video object-centric learning by selectively distilling reliable cues through pseudo-labeling and transitive merging to improve object decomposition quality and robustness.
Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
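The abstract does not spell out how the transitive merging is computed, so the following is only a minimal sketch of one plausible reading: slots whose spatio-temporal activations overlap strongly are linked, and links are closed transitively with a union-find structure. The function names, the use of IoU as the consistency measure, and the 0.5 threshold are all assumptions made for illustration; they are not taken from the paper or the SSync repository.

```python
from itertools import combinations


def iou(a: set, b: set) -> float:
    """Intersection-over-union of two sets of (frame, patch) indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def transitive_merge(slot_masks, threshold=0.5):
    """Return one group label per slot.

    Two slots are linked when their activation sets overlap with
    IoU >= threshold (an assumed criterion). Links are closed
    transitively, so A~B and B~C put A, B and C in the same group
    even if A and C overlap only weakly.
    """
    parent = list(range(len(slot_masks)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(slot_masks)), 2):
        if iou(slot_masks[i], slot_masks[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the representative of the group.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(slot_masks))]


# Toy example: each slot is the set of (frame, patch) cells where it is active.
masks = [
    {(0, 0), (0, 1), (1, 0), (1, 1)},  # slot 0
    {(0, 1), (1, 0), (1, 1), (1, 2)},  # slot 1: IoU 0.6 with slot 0
    {(1, 0), (1, 1), (1, 2), (1, 3)},  # slot 2: IoU 0.6 with slot 1, 0.33 with slot 0
    {(0, 5), (1, 5)},                  # slot 3: disjoint from the rest
]
labels = transitive_merge(masks)
print(labels)  # [0, 0, 0, 3]
```

In this toy case slots 0 and 2 fall below the threshold when compared directly, yet they end up in the same group because both overlap with slot 1. The pairwise loop here runs over slots, whose count is typically small, not over spatio-temporal patches.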
Get this paper in your agent:
hf papers read 2606.15527
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
WJ0830/SSync (updated 1 day ago)
Similar Articles
SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning
The article introduces SynIB, a scalable objective based on the information bottleneck principle that targets synergistic information in multimodal learning by penalizing confident predictions when a modality is masked, improving performance on tasks requiring cross-modal reasoning.
Weakly Supervised Concept Learning for Object-centric Visual Reasoning
This paper introduces a two-stage neuro-symbolic framework that uses weak supervision (as little as 1% labels) with a slot-based VAE to learn interpretable symbols for object-centric visual reasoning, outperforming foundation models in domain generalization.
Automatic Combination of Sample Selection Strategies for Few-Shot Learning
This paper proposes ACSESS, a method for automatically combining multiple sample selection strategies to improve few-shot learning across both in-context learning and gradient-based approaches. The work demonstrates that combining strategies consistently outperforms individual selection methods across 14 datasets with both text and image modalities.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM is a novel training strategy that aligns vision and language representations for fine-grained object understanding using only textual prompts, leveraging mask supervision during training to improve cross-modal attention. It introduces the NL-Refer dataset and achieves superior performance over visual-prompt-based methods.
SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
SOCO benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while demonstrating strong prediction of downstream task performance.