MuSViT: A Foundation Vision Model for Sheet Music Representation
Summary
MuSViT is the first foundation vision model for sheet music, pre-trained on millions of pages via Masked Autoencoders, achieving superior performance in score recognition and symbol detection tasks.
Source: https://huggingface.co/papers/2606.31811
Abstract
MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches.
Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
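The abstract names Masked Autoencoders as the pre-training objective but gives no configuration details. As a rough illustration of the general idea only, the sketch below shows MAE-style random patch masking: the page is split into a grid of patches, a random subset stays visible to the encoder, and the rest is held out for reconstruction. The 224x224 input size, the 16-pixel patch size, and the 0.75 mask ratio are assumptions taken from common ViT/MAE setups, not values confirmed for MuSViT, and the helper names `patch_grid` and `random_masking` are hypothetical.

```python
import random


def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) lists, MAE-style.

    In a masked autoencoder the encoder would process only the visible
    patches, and a decoder would be trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Assumed (not from the paper): 224x224 input, 16-pixel patches, 75% masking.
rows, cols = patch_grid(224, 224, 16)
visible, masked = random_masking(rows * cols, 0.75, random.Random(0))
print(len(visible), len(masked))  # 49 visible patches, 147 masked patches
```

With these assumed settings the encoder would see only a quarter of the patches per page, which is the property that makes this style of pre-training comparatively cheap at the scale of millions of pages.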
Models citing this paper: 2
PRAIG/musvit-light (39.4M)
PRAIG/musvit (0.1B)