Selective Synergistic Learning for Video Object-Centric Learning

Hugging Face Daily Papers 06/14/26, 12:00 AM Papers

Summary

Selective Synergistic Learning (SSync) improves video object-centric learning by selectively distilling reliable cues via pseudo-labeling and transitive merging, avoiding error propagation from indiscriminate dense alignment.

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
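The selective idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how boundary and interior regions are identified, so the sketch assumes a 1-D strip of patches, per-patch slot scores given as plain lists, and a boundary defined as any patch whose decoder label differs from a neighbour's. The function names `select_pseudo_labels`, `dense_pair_count`, and `pseudo_label_ops` are hypothetical. The two counting helpers only contrast the quadratic number of patch pairs with a per-patch cost that grows linearly in the number of patches.

```python
def argmax(scores: list[float]) -> int:
    """Index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def select_pseudo_labels(enc_attn: list[list[float]],
                         dec_maps: list[list[float]]) -> list[int]:
    """Pick one slot label per patch (illustrative assumption, see lead-in).

    Near a decoder label change (a "boundary" patch) the encoder label is
    used; everywhere else (an "interior" patch) the decoder label is used.
    Each patch inspects at most two neighbours, so the cost is linear in
    the number of patches.
    """
    enc_labels = [argmax(row) for row in enc_attn]
    dec_labels = [argmax(row) for row in dec_maps]
    n = len(dec_labels)
    labels = []
    for p in range(n):
        neighbours = [q for q in (p - 1, p + 1) if 0 <= q < n]
        is_boundary = any(dec_labels[q] != dec_labels[p] for q in neighbours)
        labels.append(enc_labels[p] if is_boundary else dec_labels[p])
    return labels


def dense_pair_count(num_frames: int, patches_per_frame: int) -> int:
    """Number of patch pairs compared by an all-pairs alignment."""
    n = num_frames * patches_per_frame
    return n * n


def pseudo_label_ops(num_frames: int, patches_per_frame: int, num_slots: int) -> int:
    """Number of patch-slot scores read by per-patch labeling."""
    return num_frames * patches_per_frame * num_slots


# Toy example: 8 patches, 2 slots. The encoder is noisy at patch 1,
# and the decoder places the object boundary one patch too late.
enc = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.7, 0.3],
       [0.2, 0.8], [0.1, 0.9], [0.1, 0.9], [0.2, 0.8]]
dec = [[0.8, 0.2], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
       [0.55, 0.45], [0.45, 0.55], [0.3, 0.7], [0.2, 0.8]]
print(select_pseudo_labels(enc, dec))   # [0, 0, 0, 0, 1, 1, 1, 1]
print(dense_pair_count(8, 196))         # 2458624
print(pseudo_label_ops(8, 196, 7))      # 10976
```

In the toy example the decoder suppresses the encoder's noisy label at patch 1, while the encoder corrects the decoder's shifted boundary at patch 4. The frame, patch, and slot counts in the last two lines are arbitrary illustrative values, not figures from the paper.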

Original Article

Cached at: 06/20/26, 02:29 PM

Source: https://huggingface.co/papers/2606.15527

Abstract

Selective Synergistic Learning (SSync) addresses limitations in video object-centric learning by selectively distilling reliable cues through pseudo-labeling and transitive merging to improve object decomposition quality and robustness.

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
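The transitive merging step can be sketched with a union-find structure. This is a hedged illustration, not the paper's procedure: the abstract only says that overlapping slots are consolidated based on spatio-temporal activation consistency, so the overlap measure (IoU between sets of active patch indices), the `threshold` value, and the function names `iou` and `transitive_merge` are all assumptions made here.

```python
from itertools import combinations


def iou(a: set[int], b: set[int]) -> float:
    """Intersection-over-union of two sets of active patch indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transitive_merge(slot_patches: list[set[int]], threshold: float = 0.5) -> list[int]:
    """Return, for each slot, the id of the group it ends up in.

    Slots whose activation sets overlap at or above `threshold` (an assumed
    criterion) are linked, and links are closed transitively: if A links
    to B and B links to C, all three share one group.
    """
    parent = list(range(len(slot_patches)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # The pair loop is quadratic in the number of slots, not in patches.
    for i, j in combinations(range(len(slot_patches)), 2):
        if iou(slot_patches[i], slot_patches[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)  # smaller id becomes the root

    return [find(i) for i in range(len(slot_patches))]


# Toy example: slots 0-1 and 1-2 overlap strongly, slots 0-2 only weakly,
# and slot 3 covers a separate region.
slots = [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}]
print(transitive_merge(slots))  # [0, 0, 0, 3]
```

Slots 0 and 2 have an IoU of only 2/6, below the threshold, yet they land in the same group because each overlaps slot 1 with an IoU of 3/5. That chaining is what "transitive" refers to in this sketch.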

Get this paper in your agent:

hf papers read 2606.15527

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

WJ0830/SSync (updated 1 day ago)

Similar Articles

SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

Automatic Combination of Sample Selection Strategies for Few-Shot Learning

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
