MERIT: Learning Disentangled Music Representations for Audio Similarity
Summary
MERIT is a framework that learns disentangled music representations for melody, rhythm, and timbre using conditional audio generation and source-separated stems, enabling nuanced and factor-specific audio similarity queries.
Cached at: 06/03/26, 07:36 AM
Source: https://huggingface.co/papers/2605.27346
Abstract
MERIT framework learns disentangled music representations for melody, rhythm, and timbre through conditional audio generation and source-separated stems, enabling nuanced musical queries.
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
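To make the idea of a factor-specific query concrete, the following is a minimal sketch, not the MERIT API. It assumes each track has already been mapped to three separate embeddings (one per head) and shows how a user-chosen weighting could replace a single monolithic score. The dictionary layout, the `factor_similarity` helper, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical factor names, mirroring the three dimensions in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def factor_similarity(query, candidate, weights):
    """Weighted mean of per-factor cosine similarities.

    `query` and `candidate` map each factor name to its embedding.
    `weights` maps factor names to non-negative weights; missing
    factors count as zero, so {"melody": 1.0} is a melody-only query.
    """
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total <= 0:
        raise ValueError("at least one factor weight must be positive")
    score = sum(
        weights.get(f, 0.0) * cosine(query[f], candidate[f]) for f in FACTORS
    )
    return score / total


# Toy 2-D embeddings (illustrative values, not model outputs).
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
same_melody = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [-1.0, 1.0]}
same_timbre = {"melody": [0.0, 1.0], "rhythm": [1.0, 0.0], "timbre": [1.0, 1.0]}

# A melody-only query ranks `same_melody` above `same_timbre`;
# a timbre-only query reverses the order.
for name, weights in (("melody", {"melody": 1.0}), ("timbre", {"timbre": 1.0})):
    ranked = sorted(
        (("same_melody", same_melody), ("same_timbre", same_timbre)),
        key=lambda item: factor_similarity(query, item[1], weights),
        reverse=True,
    )
    print(name, [label for label, _ in ranked])
```

Keeping the factors in separate vectors is what lets the weighting be chosen at query time; with a single entangled embedding the same two candidates would receive one fixed ordering.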
Models citing this paper: 1
#### amaai-lab/merit (Feature Extraction)
Datasets citing this paper: 1
#### amaai-lab/merit
Similar Articles
Multimodal Music Recommendation System using LLMs
Proposes a multimodal framework integrating audio, lyric, and semantic signals with LLM-based sequential reasoning for session-based music recommendation, achieving up to 95% recall improvement over ID-only baselines.
AudioMosaic: Contrastive Masked Audio Representation Learning
AudioMosaic introduces a contrastive learning-based audio encoder that uses structured time-frequency masking on spectrogram patches for efficient large-batch training, achieving state-of-the-art performance on audio benchmarks and improving audio-language models.
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
GEM reformulates LLM data curation as a variational problem on the hypersphere, using geometric entropy mixing and a minorize-maximize algorithm to discover balanced semantic clusters, achieving state-of-the-art improvements in data mixing strategies by up to 1.2% average downstream accuracy.
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
MusTBench is a benchmark for evaluating temporal grounding in Large Audio-Language Models (LALMs) for music understanding. The authors propose MusT, a four-stage training recipe that significantly improves temporal grounding performance over existing models.
Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation
SAM Audio is introduced as the first unified multimodal model for audio separation, enabling users to isolate specific sounds from complex mixtures using text, visual, or temporal prompts.