Tag
This paper presents 3D masked autoencoders for volumetric microscopy data, demonstrating that 3D modeling outperforms 2D max-projection and slice-based variants on downstream single-cell tasks, with cross-modal alignment to a protein language model further improving performance.
This paper uses layer-wise probing to investigate how wav2vec 2.0 and Whisper encode consonant cluster reduction in African American English, finding that both models distinguish reduced and canonical forms and preserve cues to underlying stops.
Microsoft's NextLat paper proposes a self-supervised training method where transformers predict their next hidden state instead of just the next token, leading to more compact world models, better planning and reasoning, and up to 3.3x faster generation.
Microsoft's NextLat introduces a training objective that rewards belief-state representations instead of relying solely on next-token prediction, pushing models toward compact world models for better generalization.
UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types, achieving strong results on classification and segmentation benchmarks.
PragReST is a self-supervised framework that improves LLM pragmatic reasoning by generating counterfactual reasoning traces and training models via supervised fine-tuning and reinforcement learning, achieving significant gains on pragmatic benchmarks without human-labeled data.
This paper presents a self-supervised transfer learning approach for parking spot occupancy recognition that achieves high accuracy (up to 97.8%) with minimal labeled data using a two-stage training strategy with SimCLR and ResNet-50.
This paper proposes Adaptive Binning, a learning-coupled feature-wise coarse-to-fine curriculum for tabular self-supervised learning that adaptively discretizes features, improving representations on medical datasets and establishing a unified benchmark.
A robotics researcher compares current robotics approaches to the language model landscape of 2023, arguing that representation prediction (JEPA) is the most scalable method as it can leverage action-free video data like YouTube, unlike other methods that require action-labeled data.
Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.
This paper investigates whether the wav2vec 2.0 architecture exhibits perceptual compensation for tonal context in Mandarin Chinese, finding limited evidence in the self-supervised model compared to human listeners and suggesting that supervised fine-tuning may be necessary for such phonological abstraction.
Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.
Introduces Temporal Difference in Vision (TDV), a novel visual representation learning paradigm that learns useful representations without augmentations, masking, cropping, or reconstruction, and matches state-of-the-art methods on dense spatial tasks.
RECTOR is a self-supervised framework that learns joint region-channel-temporal representations from EEG/sEEG signals for affective and cognitive state classification, achieving state-of-the-art results on emotion recognition and task-engagement benchmarks.
ProtoX-AD is a prototype-based self-explainable framework for self-supervised time series anomaly detection that provides interpretable explanations for detected anomalies by learning transformation-aware prototypes, achieving performance comparable to black-box methods while offering semantic anomaly characterization.
The paper introduces Temporal Difference in Vision (TDV), a self-supervised learning method for video that relies only on a causal assumption that past causes future, avoiding strong inductive biases while matching state-of-the-art on dense spatial tasks.
ViT-Up introduces a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and improving dense prediction and semantic correspondence benchmarks. It outperforms prior state-of-the-art upsamplers, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
UR-BERT proposes a Romanized transcription-based text encoder for massively multilingual TTS, scaling to 495 languages by using universal Romanization and a speech token prediction objective to enhance phonetic alignment and generalization to unseen languages.
Investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how fine-tuning for tasks like ASR or speaker identification affects this encoding.
A curated list of papers, models, code, datasets, and learning resources for Joint Embedding Predictive Architectures (JEPA), the self-supervised approach to world models proposed by Yann LeCun.