MuSViT: A Foundation Vision Model for Sheet Music Representation

Hugging Face Daily Papers 06/30/26, 12:00 AM Papers

vision-transformer sheet-music music-ai foundation-model masked-autoencoders imslp

Summary

MuSViT is the first foundation vision model for sheet music, pre-trained on millions of pages via Masked Autoencoders, achieving superior performance in score recognition and symbol detection tasks.

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
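The Masked Autoencoder pre-training mentioned above hides a random subset of image patches and trains the model to reconstruct them from the visible ones. The masking step alone can be sketched as follows. This is a minimal illustration, not MuSViT's code: the 224x224 input, the 16-pixel patch size, the 0.75 mask ratio (the default in the original MAE work), and the function names `patch_grid` and `random_mask` are all assumptions, since the abstract does not state these settings.

```python
import random

def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping square patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch

def random_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # hidden from the encoder, to be reconstructed
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked

# Assumed settings for illustration only (not taken from the paper).
rows, cols = patch_grid(224, 224, 16)      # 14 x 14 = 196 patches
visible, masked = random_mask(rows * cols, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))           # 49 147
```

In an actual MAE setup the encoder processes only the visible patches and a lightweight decoder predicts the pixels of the masked ones; the sketch covers only the index selection.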

Original Article

Cached at: 07/01/26, 11:42 AM

Source: https://huggingface.co/papers/2606.31811

Abstract

MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches.

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
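Linear probing, as used in the evaluation above, keeps the encoder frozen and trains only a linear classifier on its output features. The toy sketch below illustrates that protocol under loud assumptions: `frozen_encoder` is a hypothetical stand-in (a fixed hand-written feature map, not MuSViT), the data is synthetic, and the probe is a plain logistic regression trained by gradient descent.

```python
import math
import random

def frozen_encoder(x):
    """Hypothetical stand-in for a pre-trained encoder; it is never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(features, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression probe; only w and b are trained."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y                                  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Synthetic two-class data (illustrative only).
rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
labels = [1 if x0 + x1 > 0 else 0 for x0, x1 in inputs]

features = [frozen_encoder(x) for x in inputs]         # computed once; encoder stays fixed
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(f"probe accuracy: {accuracy:.2f}")
```

Because the encoder never changes, probe accuracy reflects how linearly separable the task is in the encoder's feature space, which is why the protocol is used to compare representations rather than training recipes.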

Get this paper in your agent:

hf papers read 2606.31811

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2

PRAIG/musvit-light • 39.4M • Updated 25 minutes ago
PRAIG/musvit • 0.1B • Updated 25 minutes ago • 124

Similar Articles

ViMU: Benchmarking Video Metaphorical Understanding

@AdinaYakup: MOSS-VL Vision model from @Open_MOSS Model: https://huggingface.co/collections/OpenMOSS-Team/moss-vl… Demo: https://hug…

MVEB: Massive Video Embedding Benchmark

MetaphorVU: Towards Metaphorical Video Understanding

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation