MERIT: Learning Disentangled Music Representations for Audio Similarity

Hugging Face Daily Papers 05/26/26, 12:00 AM Papers

Summary

MERIT is a framework that learns disentangled music representations for melody, rhythm, and timbre using conditional audio generation and source-separated stems, enabling nuanced and factor-specific audio similarity queries.

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced, factor-specific queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. Because real-world audio rarely varies in only one musical factor at a time, we use a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
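To make the idea of a factor-specific query concrete, here is a minimal sketch in plain Python. It assumes each track has already been mapped to one embedding per factor (melody, rhythm, timbre); the function names, the weighting scheme, and the toy vectors are hypothetical illustrations and are not taken from the MERIT code or paper. Per-factor cosine similarities are combined with user-chosen weights, so a query can focus on one dimension and ignore the others.

```python
import math
from typing import Dict, List, Tuple

# Factor names follow the three dimensions named in the summary.
FACTORS = ("melody", "rhythm", "timbre")

# Hypothetical container: one embedding vector per factor for a track.
FactorEmbeddings = Dict[str, List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def factor_similarity(
    query: FactorEmbeddings,
    candidate: FactorEmbeddings,
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-factor cosine similarities.

    Factors missing from `weights` get weight 0, i.e. they are ignored.
    """
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total <= 0.0:
        raise ValueError("at least one factor needs a positive weight")
    score = sum(
        weights.get(f, 0.0) * cosine(query[f], candidate[f]) for f in FACTORS
    )
    return score / total


def rank_by_factors(
    query: FactorEmbeddings,
    library: Dict[str, FactorEmbeddings],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Return (track name, score) pairs sorted from most to least similar."""
    scored = [
        (name, factor_similarity(query, emb, weights))
        for name, emb in library.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): 2-D vectors standing in for real embeddings.
query = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, 0.0]}
library = {
    "same_melody_other_timbre": {
        "melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [0.0, 1.0],
    },
    "same_timbre_other_melody": {
        "melody": [0.0, 1.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 0.0],
    },
}

melody_ranking = rank_by_factors(query, library, {"melody": 1.0})
timbre_ranking = rank_by_factors(query, library, {"timbre": 1.0})
```

With these toy vectors, a melody-only query ranks the track sharing the query's melody first, while a timbre-only query reverses the order. A single monolithic score could not express that distinction, which is the limitation the summary describes.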

Source: https://huggingface.co/papers/2605.27346

Models citing this paper: 1

#### amaai-lab/merit (Feature Extraction)

Datasets citing this paper: 1

#### amaai-lab/merit

Similar Articles

Multimodal Music Recommendation System using LLMs

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Taste-aware music retrieval from audio embeddings

AudioMosaic: Contrastive Masked Audio Representation Learning

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy
