MERIT: Learning Disentangled Music Representations for Audio Similarity

Hugging Face Daily Papers

Summary

MERIT is a framework that learns disentangled music representations for melody, rhythm, and timbre using conditional audio generation and source-separated stems, enabling nuanced and factor-specific audio similarity queries.

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
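
To make the idea of factor-specific similarity concrete, the sketch below compares tracks with one score per factor instead of a single monolithic score. It is an illustration only, not the MERIT implementation or its API: the function names (`cosine`, `factor_similarity`, `rank_by_factor`), the two-dimensional embeddings, and the track names are all hypothetical, and cosine similarity is an assumed choice of metric that the abstract does not specify.

```python
import math

# The three factors named in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def factor_similarity(track_a, track_b):
    """Return one similarity score per factor rather than a single combined score."""
    return {f: cosine(track_a[f], track_b[f]) for f in FACTORS}


def rank_by_factor(query, catalog, factor):
    """Order catalog track names by similarity to the query on one factor only."""
    scores = {name: cosine(query[factor], emb[factor]) for name, emb in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hand-written toy embeddings (hypothetical). In practice each vector would come
# from the corresponding head of a trained model.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
catalog = {
    "same_melody_new_timbre": {
        "melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0],
    },
    "same_timbre_new_melody": {
        "melody": [0.0, 1.0], "rhythm": [1.0, 0.0], "timbre": [1.0, 1.0],
    },
}

print(rank_by_factor(query, catalog, "melody"))
print(rank_by_factor(query, catalog, "timbre"))
```

With these toy vectors, the melody query ranks the track that shares the query's melody first, while the timbre query reverses the order. That is the kind of factor-specific query a single similarity score cannot express.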

Cached at: 06/03/26, 07:36 AM

Source: https://huggingface.co/papers/2605.27346

Abstract

The MERIT framework learns disentangled music representations for melody, rhythm, and timbre through conditional audio generation and source-separated stems, enabling nuanced musical queries.

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.

Get this paper in your agent:

hf papers read 2605.27346

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### amaai-lab/merit Feature Extraction • Updated about 3 hours ago • 1

Datasets citing this paper: 1

#### amaai-lab/merit Preview • Updated about 3 hours ago • 95 • 3

Similar Articles

Multimodal Music Recommendation System using LLMs

Hugging Face Daily Papers

Proposes a multimodal framework integrating audio, lyric, and semantic signals with LLM-based sequential reasoning for session-based music recommendation, achieving up to 95% recall improvement over ID-only baselines.

AudioMosaic: Contrastive Masked Audio Representation Learning

arXiv cs.LG

AudioMosaic introduces a contrastive learning-based audio encoder that uses structured time-frequency masking on spectrogram patches for efficient large-batch training, achieving state-of-the-art performance on audio benchmarks and improving audio-language models.

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv cs.LG

GEM reformulates LLM data curation as a variational problem on the hypersphere, using geometric entropy mixing and a minorize-maximize algorithm to discover balanced semantic clusters. It reports state-of-the-art results, improving on data mixing strategies by up to 1.2% in average downstream accuracy.