MuseNet

OpenAI Blog

Summary

OpenAI released MuseNet, a deep neural network based on the GPT-2 architecture that generates 4-minute musical compositions with 10 instruments by learning patterns from hundreds of thousands of MIDI files. The model can combine multiple musical styles and blend them in novel ways.

We’ve created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.

Cached at: 04/20/26, 02:55 PM

# MuseNet

Source: [https://openai.com/index/musenet/](https://openai.com/index/musenet/)

We’ve created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as [GPT-2](https://openai.com/index/better-language-models/), a large-scale [transformer](https://arxiv.org/abs/1706.03762) model trained to predict the next token in a sequence, whether audio or text.

Since MuseNet knows many different styles, we can blend generations in novel ways.[A](https://openai.com/index/musenet/#citation-bottom-A) Here the model is given the first 6 notes of a Chopin Nocturne, but is asked to generate a piece in a pop style with piano, drums, bass, and guitar. The model manages to blend the two styles convincingly, with the full band joining in at around the 30-second mark.

We collected training data for MuseNet from many different sources. [ClassicalArchives](https://www.classicalarchives.com/) and [BitMidi](https://bitmidi.com/) donated their large collections of MIDI files for this project, and we also found several collections online, including jazz, pop, African, Indian, and Arabic styles. Additionally, we used the [MAESTRO dataset](https://arxiv.org/abs/1810.12247).

The transformer is trained on sequential data: given a set of notes, we ask it to predict the upcoming note. We experimented with several different ways to encode the MIDI files into tokens suitable for this task. First, we tried a chordwise approach that considered every combination of notes sounding at one time as an individual “chord” and assigned a token to each chord. Second, we tried condensing the musical patterns by focusing only on the starts of notes, and tried further compressing that using a byte pair encoding scheme. We also tried two different methods of marking the passage of time: either tokens scaled according to the piece’s tempo (so that the tokens represented a musical beat or fraction of a beat), or tokens that marked absolute time in seconds. We landed on an encoding that combines expressivity with conciseness: combining the pitch, volume, and instrument information into a single token, as sketched in the example below.
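To make that final encoding concrete, here is a minimal Python sketch of packing pitch, volume, and instrument into a single token ID, with time-advance tokens kept in a separate vocabulary range. The bin counts, vocabulary layout, and function names (`encode_note`, `decode_note`, `encode_time_step`) are illustrative assumptions; OpenAI has not published MuseNet's exact tokenizer.

```python
# Sketch of a MuseNet-style event encoding (illustrative only).
# The bin counts, vocabulary layout, and function names are assumptions
# for demonstration; this is not OpenAI's actual implementation.

N_INSTRUMENTS = 10   # MuseNet compositions use 10 instruments
N_PITCHES = 128      # MIDI pitch range 0-127
N_VOLUME_BINS = 32   # quantized velocity buckets (assumed bin count)

def encode_note(instrument: int, pitch: int, velocity: int) -> int:
    """Pack instrument, pitch, and quantized volume into one token ID."""
    volume_bin = velocity * N_VOLUME_BINS // 128  # 0-127 -> 0-31
    return (instrument * N_PITCHES + pitch) * N_VOLUME_BINS + volume_bin

def decode_note(token: int) -> tuple[int, int, int]:
    """Invert encode_note, returning (instrument, pitch, volume_bin)."""
    token, volume_bin = divmod(token, N_VOLUME_BINS)
    instrument, pitch = divmod(token, N_PITCHES)
    return instrument, pitch, volume_bin

# Time-advance tokens occupy a separate range after the note tokens;
# the article describes both tempo-relative and absolute-time variants.
NOTE_VOCAB_SIZE = N_INSTRUMENTS * N_PITCHES * N_VOLUME_BINS

def encode_time_step(step: int) -> int:
    """Map a time-advance step (beat fraction or seconds bucket) to a token."""
    return NOTE_VOCAB_SIZE + step

# Example: a moderately loud middle C (pitch 60) on instrument 3:
# encode_note(3, 60, 100) -> 14233; decode_note(14233) -> (3, 60, 25)
```

Packing all three fields into one token keeps sequences short, one token per note event, which is the conciseness the article mentions, while still letting the model distinguish, say, a loud piano middle C from a quiet guitar one.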

Similar Articles

Music AI Sandbox, now with new features and broader access

Google DeepMind Blog

Google DeepMind expands Music AI Sandbox with new features, including the Lyria 2 music generation model, and broader access for musicians in the U.S., enabling AI-assisted music creation through tools for generating, extending, and editing musical content.

Jukebox

OpenAI Blog

OpenAI's Jukebox is a generative model that produces music as raw audio, including vocals and instruments, using a VQ-VAE for compression and hierarchical Sparse Transformer priors to handle long-range musical structure. It represents a significant step beyond symbolic music generation by operating directly in the raw audio domain.

GPT-4

OpenAI Blog

OpenAI releases GPT-4, a large multimodal model that accepts image and text inputs and demonstrates human-level performance on professional and academic benchmarks, significantly outperforming GPT-3.5 across various evaluation metrics.

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

Hugging Face Daily Papers

ArtifactNet is a lightweight neural network framework that detects AI-generated music by analyzing codec-specific artifacts in audio signals, achieving F1=0.9829 on a new 6,183-track benchmark (ArtifactBench) with 49x fewer parameters than competing methods. The approach uses forensic physics principles to extract codec residuals through a bounded-mask UNet and compact CNN, with codec-aware training reducing cross-codec drift by 83%.