@junmingong: Khala 1.0 just dropped — a music generation model from the Central Conservatory of Music in Beijing. Paper, code, weigh…
Summary
Khala 1.0 is an open-source music generation model for high-fidelity full-song generation from text and lyrics, using a unified acoustic-token pipeline. It was released by the Central Conservatory of Music in Beijing with paper, code, weights, and demo.
Cached at: 05/16/26, 01:18 PM
Khala 1.0 just dropped — a music generation model from the Central Conservatory of Music in Beijing. Paper, code, weights, and demo all open-sourced.
I gave a talk there recently on ACE-Step and got an early look at Khala. Excited to see it officially out. Open-source music gen is thriving.
https://github.com/Khala-Music-AI/Khala… https://arxiv.org/abs/2605.01790 https://khala-music-ai.github.io/Khala-demo/
Khala-Music-AI/Khala
Source: https://github.com/Khala-Music-AI/Khala
✨ What Is Khala?
Khala is an open-source system for high-fidelity song generation, capable of generating complete songs from text descriptions and lyric conditions. Unlike approaches built around semantic tokens, diffusion models, or multi-stage audio generation stacks, Khala follows a unified acoustic-token route and generates both coarse musical structure and fine acoustic detail within the same discrete audio representation space.
The core characteristics of Khala include:
- Full-song generation: designed for complete song generation rather than short clips or loop-style accompaniment.
- Text and lyric control: supports natural-language prompts and lyrics to control style, mood, vocals, and content.
- Unified acoustic-token representation: built on a 64-layer RVQ acoustic token hierarchy that represents audio as coarse-to-fine discrete acoustic tokens.
- Two-stage generation pipeline: a backbone first generates coarse acoustic tokens, then a super-resolution model completes higher RVQ token layers, and finally a decoder reconstructs the waveform.
- Complete system implementation: includes a frontend UI, a FastAPI backend dispatcher, a single-GPU inference worker, model loading, and the end-to-end audio generation path rather than just standalone inference scripts.
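To make the coarse-to-fine idea behind residual vector quantization (RVQ) concrete, here is a toy sketch in plain Python. It is not Khala's tokenizer: the codebooks are random, the dimensions are tiny, and the names `rvq_encode`, `rvq_decode`, and `COARSE_LAYERS` are hypothetical. The source does not state how many of the 64 layers the backbone generates, so the split below is only an illustration.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries of the first len(tokens) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
DIM, LAYERS, ENTRIES = 4, 8, 16  # toy sizes; Khala uses a 64-layer hierarchy

# Random toy codebooks whose scale shrinks with depth. Each one contains the
# zero vector, so adding a layer can never increase the reconstruction error.
codebooks = []
for layer in range(LAYERS):
    scale = 0.5 ** layer
    entries = [[0.0] * DIM]
    entries += [[rng.uniform(-scale, scale) for _ in range(DIM)]
                for _ in range(ENTRIES - 1)]
    codebooks.append(entries)

frame = [rng.uniform(-1, 1) for _ in range(DIM)]
tokens = rvq_encode(frame, codebooks)

# Hypothetical split: a backbone would emit the coarse layers, and a
# super-resolution model would fill in the remaining fine layers.
COARSE_LAYERS = 2
coarse_tokens, fine_tokens = tokens[:COARSE_LAYERS], tokens[COARSE_LAYERS:]

# Reconstruction error when decoding from only the first k layers.
errors = [sq_error(frame, rvq_decode(tokens[:k], codebooks))
          for k in range(1, LAYERS + 1)]
print(errors)
```

Decoding from more layers gives an error that never grows, which is the sense in which early layers carry coarse structure and later layers add fine detail.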
📰 News
⚠️ [2026-05-07] We have identified a potential issue that may significantly affect inference quality. The problem is currently under investigation and may be related to numerical precision. Until this notice is removed, please treat current generation quality as unstable.
✅ Updated
- [2026-05-16] The online audio demo page is now available: Khala Demo
- [2026-05-11] Backend inference launch now supports single-GPU safe startup by default, plus multi-GPU and runtime-mode overrides for deployment compatibility.
- [2026-05-05] The arXiv paper is now available: Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
- [2026-05-01] The codebase, environment documentation, and Dockerfile have been cleaned up for release.
⏳ TODOs
- [Coming Soon] A full deployment guide for musicians and beginner users.
- [Coming Soon] Discord community server.
🖥️ Web UI
[Screenshot: Prompt Mode]
[Screenshot: Tag Mode]
🎧 Audio Samples
Listen to generated samples on the online demo page: Khala Demo (https://khala-music-ai.github.io/Khala-demo/)
✅ Runtime Requirements
The current release is mainly intended for researchers and developers who are already familiar with GPU servers.
- NVIDIA GPU, with 24GB or more VRAM recommended for the full inference pipeline, such as an RTX 4090 or a higher-tier GPU.
- Docker and NVIDIA Container Toolkit.
- A CUDA-compatible NVIDIA driver.
- Python and Node.js are already included in the prebuilt image.
- Model weights need to be downloaded into the checkpoints/ directory at the repository root.
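As a pre-flight check for the VRAM recommendation above, a small script can ask `nvidia-smi` for each GPU's total memory. This is a sketch, not part of the Khala repository: the function names and the `MIN_VRAM_MIB` threshold are assumptions. The threshold is set a little under 24 × 1024 MiB because cards sold as 24 GB can report slightly less than that.

```python
import shutil
import subprocess

# Assumed threshold: roughly 24 GB, with headroom for cards that report
# slightly under 24 * 1024 MiB.
MIN_VRAM_MIB = 24_000

def parse_vram_mib(nvidia_smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines() if line.strip()]

def query_vram_mib():
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_vram_mib(result.stdout)

def gpus_meeting_recommendation(vram_mib, minimum=MIN_VRAM_MIB):
    """Return the indices of GPUs whose total memory is at least `minimum`."""
    return [i for i, mib in enumerate(vram_mib) if mib >= minimum]

if __name__ == "__main__":
    vram = query_vram_mib()
    print("GPU memory (MiB):", vram)
    print("GPUs meeting the recommendation:", gpus_meeting_recommendation(vram))
```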
🚀 Quick Start
This section is intended for researchers and developers who are already comfortable with basic Docker and CUDA workflows, and provides the shortest path to running the system.
If you want to configure the environment step by step from a clean NGC container, please read ENVIRONMENT_SETUP.md.
If you want to understand the backend structure and runtime logic, please read backend/README_backend.md.
1. Prepare the runtime environment
The currently available prebuilt image is:
docker pull ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24
docker run --gpus all -it --rm \
  --name khala \
  -p 30869:30869 \
  -p 8889:8889 \
  ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24
Note: the command above uses --rm, so files created inside the container will be removed after the container exits. If you want a long-lived development container or want to keep downloaded model weights, use a mounted directory or remove --rm.
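The two options in the note can be sketched as a small helper that assembles the `docker run` arguments. The image name, ports, and container name come from the command above. The host path `/data/khala` and the container path `/workspace/persist` are placeholders chosen for illustration, and the helper itself is hypothetical.

```python
import shlex

IMAGE = "ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24"

def docker_run_argv(host_dir=None, container_dir=None, remove_on_exit=True):
    """Build a `docker run` argument list, optionally with a bind mount."""
    argv = ["docker", "run", "--gpus", "all", "-it"]
    if remove_on_exit:
        argv.append("--rm")
    argv += ["--name", "khala", "-p", "30869:30869", "-p", "8889:8889"]
    if host_dir and container_dir:
        # Files written under container_dir are stored in host_dir and
        # survive container removal.
        argv += ["-v", f"{host_dir}:{container_dir}"]
    argv.append(IMAGE)
    return argv

# Option 1: keep --rm but persist data through a mounted directory.
print(shlex.join(docker_run_argv("/data/khala", "/workspace/persist")))

# Option 2: drop --rm so the container itself is kept after exit.
print(shlex.join(docker_run_argv(remove_on_exit=False)))
```

With the first option, anything that should persist, such as downloaded weights, has to live under the mounted path.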
2. Clone the repository
After entering the container, run:
cd /workspace
git clone https://github.com/Khala-Music-AI/Khala.git
cd Khala
3. Download the model checkpoints
Model repository: https://huggingface.co/liujiafeng/Khala-MusicGeneration-v1.0
From the repository root, run:
mkdir -p checkpoints
hf download liujiafeng/Khala-MusicGeneration-v1.0 --local-dir checkpoints
This command downloads the model repository contents into the local checkpoints/ directory.
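After the download finishes, a quick check can confirm that the checkpoints/ directory actually contains files before the backend is started. The helper below is a hypothetical sketch using only the standard library. It makes no assumption about which files the model repository contains, and simply counts regular files and sums their sizes.

```python
from pathlib import Path

def summarize_checkpoints(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    root = Path(root)
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

if __name__ == "__main__":
    count, size = summarize_checkpoints("checkpoints")
    if count == 0:
        print("checkpoints/ is missing or empty; the download may have failed.")
    else:
        print(f"Found {count} files, {size / 1e9:.2f} GB in checkpoints/")
```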
4. Start the backend
cd /workspace/Khala/backend
bash run_backend.sh
The default launcher now starts in a single-GPU safe mode. Advanced users can also select specific GPU ids and switch between one_shot and keep_loaded runtime modes from the same script; see backend/README_backend.md for details.
5. Start the frontend
In another terminal, run:
cd /workspace/Khala/frontend
npm install
npm run dev
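Before opening the UI, it can help to confirm that the two ports published by the `docker run` command (30869 and 8889) are accepting connections. The source does not say which port serves which component, so this sketch only tests TCP reachability. `port_is_open` is a hypothetical helper.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for port in (30869, 8889):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {state}")
```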
6. Open the web UI
Default URL:
🧠 System Overview
The current system has three layers:
- Frontend: accepts prompts, lyrics, and generation settings, and displays results.
- API dispatcher: receives requests, creates jobs, queues them, and dispatches them to idle workers.
- Inference worker: runs backbone, super-resolution, and decoder inference.
The request path is:
flowchart LR
    A["Frontend UI"] --> B["backend_api.py"]
    B --> C["backend_worker.py"]
    C --> D["Backbone"]
    D --> E["Super-resolution"]
    E --> F["Decoder"]
    F --> G["Generated Audio"]
    G --> B
    B --> A
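The request path above can be sketched as a job queue feeding a worker that runs three stages in order. This is a minimal illustration under stated assumptions, not the code in backend_api.py or backend_worker.py: the `Dispatcher` class and the three stage functions are hypothetical stubs that return strings instead of tokens or audio.

```python
import queue
import threading
import uuid

# Stub stages standing in for the real models.
def backbone(prompt, lyrics):
    return f"coarse({prompt!r}, {lyrics!r})"

def super_resolution(coarse_tokens):
    return f"fine({coarse_tokens})"

def decoder(fine_tokens):
    return f"audio({fine_tokens})"

class Dispatcher:
    """Queue jobs and hand each one to the next idle worker thread."""

    def __init__(self, num_workers=1):
        self.jobs = queue.Queue()
        self.results = {}
        self.lock = threading.Lock()
        for _ in range(num_workers):
            threading.Thread(target=self._worker_loop, daemon=True).start()

    def submit(self, prompt, lyrics):
        """Create a job, enqueue it, and return its id."""
        job_id = uuid.uuid4().hex
        self.jobs.put((job_id, prompt, lyrics))
        return job_id

    def _worker_loop(self):
        while True:
            # An idle worker blocks here until a job is available.
            job_id, prompt, lyrics = self.jobs.get()
            try:
                audio = decoder(super_resolution(backbone(prompt, lyrics)))
                with self.lock:
                    self.results[job_id] = audio
            finally:
                self.jobs.task_done()

    def wait_all(self):
        """Block until every submitted job has been processed."""
        self.jobs.join()

dispatcher = Dispatcher(num_workers=1)
job = dispatcher.submit("lo-fi ballad", "la la la")
dispatcher.wait_all()
print(dispatcher.results[job])
```

A single worker matches the single-GPU default described in the Quick Start; raising `num_workers` would correspond to running one worker per additional GPU.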
🔗 Project Resources
- Demo page: Khala Demo (https://khala-music-ai.github.io/Khala-demo/)
- arXiv paper: Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation (https://arxiv.org/abs/2605.01790)
- Model weights: https://huggingface.co/liujiafeng/Khala-MusicGeneration-v1.0
- Environment setup: ENVIRONMENT_SETUP.md
- Backend docs: backend/README_backend.md
🗂 Repository Structure
Khala/
├── backend/
├── frontend/
├── core/
├── models/
├── checkpoints/
├── assets/
├── Dockerfile
├── requirements.txt
├── ENVIRONMENT_SETUP.md
└── ENVIRONMENT_SETUP_zh.md
Main directories:
- frontend/: frontend pages and the Vite project.
- backend/: backend API, worker, and launcher scripts.
- core/: project-specific core modules.
- models/: Megatron, decoder, and tokenizer related code.
- checkpoints/: model checkpoint directory.
- assets/: images used by the README and demo materials.
📚 Citation
If this project is helpful to your research or development work, you are welcome to cite our paper. The final BibTeX information will be added later to both the paper page and the repository documentation.
🙏 Acknowledgements
The current implementation builds on a number of excellent open-source projects and tools, including but not limited to:
- NVIDIA NGC
- Megatron / Megatron Core
- Hugging Face
- FastAPI
- Vite / React
📜 License
The model weights are currently intended to be released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
💬 Contact
Feel free to join the WeChat group for discussion, usage questions, and future updates: