Selective Synergistic Learning for Video Object-Centric Learning

Hugging Face Daily Papers

Summary

Selective Synergistic Learning (SSync) improves video object-centric learning by selectively distilling reliable cues via pseudo-labeling and transitive merging, avoiding error propagation from indiscriminate dense alignment.

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it leverages the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. In addition, to prevent the reinforcement of architectural biases such as slot redundancy, we introduce transitive pseudo-label merging, which consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
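
The abstract does not spell out how the transitive merging is computed, so the following is only a minimal sketch of one way the idea could look, not the authors' implementation. It assumes binarized per-slot activation maps of shape `[K, T, H, W]`, measures "spatio-temporal activation consistency" as IoU over the whole clip, and merges slots with a union-find so that overlap chains (A overlaps B, B overlaps C) collapse into one group. The function name `merge_slots_transitively`, the IoU criterion, and both thresholds are hypothetical choices made for illustration.

```python
import numpy as np


def merge_slots_transitively(masks, iou_threshold=0.5):
    """Group redundant slots by transitive overlap (illustrative sketch).

    masks: bool array of shape [K, T, H, W], one binarized activation
           volume per slot (assumed input format, not from the paper).
    Returns an int array of length K; slots sharing a value are merged,
    and the value is the smallest slot index in the group.
    """
    num_slots = masks.shape[0]
    flat = masks.reshape(num_slots, -1).astype(np.float64)

    # Pairwise IoU between slots, computed over all frames at once.
    inter = flat @ flat.T
    area = flat.sum(axis=1)
    union = area[:, None] + area[None, :] - inter
    iou = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    # Union-find: linking every pair above the threshold yields the
    # transitive closure of the "overlaps" relation.
    parent = list(range(num_slots))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(num_slots):
        for j in range(i + 1, num_slots):
            if iou[i, j] >= iou_threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the group representative.
                    parent[max(ri, rj)] = min(ri, rj)

    return np.array([find(i) for i in range(num_slots)])


# Toy clip: 4 slots, 2 frames, 4x4 patches (all values are made up).
maps = np.zeros((4, 2, 4, 4))
maps[0, :, 0:2, 0:4] = 0.9  # slot 0: top half
maps[1, :, 0:2, 0:3] = 0.8  # slot 1: most of the top half
maps[2, :, 0:2, 0:2] = 0.7  # slot 2: top-left quadrant
maps[3, :, 2:4, :] = 0.9    # slot 3: bottom half

masks = maps > 0.5  # hypothetical activation threshold
labels = merge_slots_transitively(masks, iou_threshold=0.6)
print(labels)
```

In this toy clip, slots 0 and 2 have an IoU of only 0.5, below the 0.6 threshold, yet they end up in the same group because each overlaps slot 1 strongly enough (0.75 and about 0.67). The resulting labels are `[0, 0, 0, 3]`. Note that the pairwise comparison here runs over the K slots rather than over patches, so the cost of computing the overlaps grows linearly with the number of spatio-temporal patches.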

Cached at: 06/20/26, 02:29 PM


Source: https://huggingface.co/papers/2606.15527

Abstract

Selective Synergistic Learning (SSync) addresses limitations in video object-centric learning by selectively distilling reliable cues through pseudo-labeling and transitive merging to improve object decomposition quality and robustness.




