@junmingong: Khala 1.0 just dropped — a music generation model from the Central Conservatory of Music in Beijing. Paper, code, weigh…


Summary

Khala 1.0 is an open-source music generation model for high-fidelity full-song generation from text and lyrics, using a unified acoustic-token pipeline. It was released by the Central Conservatory of Music in Beijing with paper, code, weights, and demo.

Khala 1.0 just dropped — a music generation model from the Central Conservatory of Music in Beijing. Paper, code, weights, and demo all open-sourced. I gave a talk there recently on ACE-Step and got an early look at Khala. Excited to see it officially out. Open-source music gen is thriving. https://github.com/Khala-Music-AI/Khala… https://arxiv.org/abs/2605.01790 https://khala-music-ai.github.io/Khala-demo/

Cached at: 05/16/26, 01:18 PM



Khala-Music-AI/Khala

Source: https://github.com/Khala-Music-AI/Khala


High-Fidelity Song Generation With a Unified Acoustic-Token Pipeline



✨ What Is Khala?

Khala is an open-source system for high-fidelity song generation, capable of generating complete songs from text descriptions and lyric conditions. Unlike approaches built around semantic tokens, diffusion models, or multi-stage audio generation stacks, Khala follows a unified acoustic-token route and generates both coarse musical structure and fine acoustic detail within the same discrete audio representation space.

The core characteristics of Khala include:

  • Full-song generation: designed for complete song generation rather than short clips or loop-style accompaniment.
  • Text and lyric control: supports natural-language prompts and lyrics to control style, mood, vocals, and content.
  • Unified acoustic-token representation: built on a 64-layer RVQ acoustic token hierarchy that represents audio as coarse-to-fine discrete acoustic tokens.
  • Two-stage generation pipeline: a backbone first generates coarse acoustic tokens, then a super-resolution model completes higher RVQ token layers, and finally a decoder reconstructs the waveform.
  • Complete system implementation: includes a frontend UI, a FastAPI backend dispatcher, a single-GPU inference worker, model loading, and the end-to-end audio generation path rather than just standalone inference scripts.
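To make the coarse-to-fine idea behind an RVQ token hierarchy concrete, here is a toy residual vector quantizer in plain Python. It is only a sketch: the dimensions, the random codebooks, and the 8-layer depth are illustrative assumptions, not Khala's tokenizer (which uses 64 layers and learned codebooks). Each layer quantizes the residual left by the layers before it, so decoding only the first few layers gives a coarse reconstruction and decoding all layers adds detail.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries for however many layers of tokens are given."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy setup (all values hypothetical): 8 layers, 16 entries each, 4-dim frames.
rng = random.Random(0)
DIM, LAYERS, ENTRIES = 4, 8, 16
codebooks = []
for layer in range(LAYERS):
    scale = 0.5 ** layer  # later layers model smaller residuals
    # An all-zero entry guarantees a layer never increases the error.
    cb = [[0.0] * DIM]
    cb += [[rng.uniform(-scale, scale) for _ in range(DIM)] for _ in range(ENTRIES - 1)]
    codebooks.append(cb)

frame = [rng.uniform(-1, 1) for _ in range(DIM)]
tokens = rvq_encode(frame, codebooks)
coarse_error = sq_error(frame, rvq_decode(tokens[:2], codebooks))  # first 2 layers only
full_error = sq_error(frame, rvq_decode(tokens, codebooks))        # all layers
```

In this picture, a backbone would predict the first few token layers and a super-resolution model would fill in the remaining ones before decoding; how Khala splits the layers between the two stages is not stated here.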

📰 News

  • ⚠️ [2026-05-07] We have identified a potential issue that may significantly affect inference quality. The problem is currently under investigation and may be related to numerical precision. Until this notice is removed, please treat current generation quality as unstable.

✅ Updated

⏳ TODOs

  • [Coming Soon] A full deployment guide for musicians and beginner users.
  • [Coming Soon] Discord community server.

🖥️ Web UI

Prompt Mode

[Screenshot: Khala frontend, Prompt Mode]

Tag Mode

[Screenshot: Khala frontend, Tag Mode]

🎧 Audio Samples

Listen to generated samples on the online demo page: https://khala-music-ai.github.io/Khala-demo/

✅ Runtime Requirements

The current release is mainly intended for researchers and developers who are already familiar with GPU servers.

  • NVIDIA GPU, with 24GB or more VRAM recommended for the full inference pipeline, such as an RTX 4090 or a higher-tier GPU.
  • Docker and NVIDIA Container Toolkit.
  • A CUDA-compatible NVIDIA driver.
  • Python and Node.js are already included in the prebuilt image.
  • Model weights need to be downloaded into the checkpoints/ directory at the repository root.
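The requirements above can be checked before pulling the image. The following is a minimal preflight sketch, not part of the Khala repository: the helper names and the 24,000 MiB threshold are assumptions. The threshold is set slightly below 24 × 1024 MiB because cards marketed as 24 GB typically report a little less than that.

```python
import shutil

# Assumption: treat "24GB" loosely, since 24 GB cards usually report
# slightly under 24576 MiB of total memory.
MIN_VRAM_MIB = 24_000

def parse_vram_mib(nvidia_smi_output):
    """Parse the output of
    `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`,
    which prints one integer (MiB) per GPU, one GPU per line."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]

def gpus_meeting_requirement(nvidia_smi_output, min_mib=MIN_VRAM_MIB):
    """Return the indices of GPUs whose total memory is at least min_mib."""
    return [i for i, mib in enumerate(parse_vram_mib(nvidia_smi_output)) if mib >= min_mib]

def missing_tools(tools=("docker", "nvidia-smi")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

To use it on a real host, run the `nvidia-smi` query shown in the docstring (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass its stdout to `gpus_meeting_requirement`.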

🚀 Quick Start

This section is intended for researchers and developers who are already comfortable with basic Docker and CUDA workflows, and provides the shortest path to running the system.

If you want to configure the environment step by step from a clean NGC container, please read ENVIRONMENT_SETUP.md.

If you want to understand the backend structure and runtime logic, please read backend/README_backend.md.

1. Prepare the runtime environment

The currently available prebuilt image is:

docker pull ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24

docker run --gpus all -it --rm \
  --name khala \
  -p 30869:30869 \
  -p 8889:8889 \
  ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24

Note: the command above uses --rm, so files created inside the container will be removed after the container exits. If you want a long-lived development container or want to keep downloaded model weights, use a mounted directory or remove --rm.
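As one way to follow the note above, the sketch below assembles a `docker run` command that bind-mounts a host directory into the container so that the cloned repository and downloaded weights survive container exit. The host path `/data/khala`, the helper name, and the choice to mount at `/workspace` are assumptions for illustration; mounting there hides whatever the image itself ships under `/workspace`, so pick a different target if you need those files.

```python
import shlex

IMAGE = "ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24"

def docker_run_command(host_dir, container_dir="/workspace", keep_container=False):
    """Build the docker run argument list with a bind mount for persistence.

    host_dir is a hypothetical host directory; keep_container=True drops --rm
    so the stopped container itself is kept as well.
    """
    cmd = ["docker", "run", "--gpus", "all", "-it"]
    if not keep_container:
        cmd.append("--rm")
    cmd += [
        "--name", "khala",
        "-p", "30869:30869",
        "-p", "8889:8889",
        "-v", f"{host_dir}:{container_dir}",  # host files outlive the container
        IMAGE,
    ]
    return cmd

# Print a copy-pasteable shell command.
print(shlex.join(docker_run_command("/data/khala")))
```

With the bind mount in place, `--rm` can stay: the container is removed on exit, but everything written under the mounted directory remains on the host.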

2. Clone the repository

After entering the container, run:

cd /workspace
git clone https://github.com/Khala-Music-AI/Khala.git
cd Khala

3. Download the model checkpoints

Model repository: liujiafeng/Khala-MusicGeneration-v1.0 (Hugging Face)

From the repository root, run:

mkdir -p checkpoints
hf download liujiafeng/Khala-MusicGeneration-v1.0 --local-dir checkpoints

This command downloads the model repository contents into the local checkpoints/ directory.
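The same download can be scripted from Python with `huggingface_hub.snapshot_download`, which is handy when a setup script should skip the download if weights are already present. This is a sketch under assumptions: the helper names are hypothetical, and the presence check only tests that the directory contains at least one file, so it cannot detect an incomplete or interrupted download.

```python
from pathlib import Path

REPO_ID = "liujiafeng/Khala-MusicGeneration-v1.0"

def checkpoints_present(directory="checkpoints"):
    """True if the directory exists and holds at least one regular file.

    This is a coarse check; it does not verify that every file is complete.
    """
    root = Path(directory)
    return root.is_dir() and any(p.is_file() for p in root.rglob("*"))

def ensure_checkpoints(directory="checkpoints"):
    """Download the model repository unless files are already there.

    Returns True if a download was started, False if it was skipped.
    """
    if checkpoints_present(directory):
        return False
    # Imported lazily so the presence check works without huggingface_hub.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=REPO_ID, local_dir=directory)
    return True
```

Calling `ensure_checkpoints()` from the repository root mirrors the `hf download` command above.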

4. Start the backend

cd /workspace/Khala/backend
bash run_backend.sh

The default launcher now starts in a single-GPU safe mode. Advanced users can also select specific GPU ids and switch between one_shot and keep_loaded runtime modes from the same script; see backend/README_backend.md for details.
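The mode names `one_shot` and `keep_loaded` come from the launcher, but their exact behavior is documented only in backend/README_backend.md. The sketch below illustrates one plausible reading, stated here as an assumption: `one_shot` loads the model for each job and releases it afterwards (lower idle memory, slower per job), while `keep_loaded` loads once and reuses the model. The `Worker` class and `load_model` callback are hypothetical and are not Khala's code.

```python
class Worker:
    """Toy worker contrasting two assumed runtime modes."""

    MODES = ("one_shot", "keep_loaded")

    def __init__(self, mode, load_model):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self._load_model = load_model  # callable returning a model callable
        self._model = None
        self.load_count = 0            # how many times the model was loaded

    def _acquire(self):
        # Load lazily; in keep_loaded mode this happens only on the first job.
        if self._model is None:
            self._model = self._load_model()
            self.load_count += 1
        return self._model

    def run(self, prompt):
        model = self._acquire()
        try:
            return model(prompt)
        finally:
            if self.mode == "one_shot":
                # Drop the reference so the memory can be reclaimed between jobs.
                self._model = None
```

Under this reading, `keep_loaded` suits a dedicated GPU serving many requests, and `one_shot` suits a shared GPU where memory should be freed between jobs.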

5. Start the frontend

In another terminal, run:

cd /workspace/Khala/frontend
npm install
npm run dev

6. Open the web UI

Default URL:

🧠 System Overview

The current system has three layers:

  • Frontend: accepts prompts, lyrics, and generation settings, and displays results.
  • API dispatcher: receives requests, creates jobs, queues them, and dispatches them to idle workers.
  • Inference worker: runs backbone, super-resolution, and decoder inference.

The request path is:

flowchart LR
    A["Frontend UI"] --> B["backend_api.py"]
    B --> C["backend_worker.py"]
    C --> D["Backbone"]
    D --> E["Super-resolution"]
    E --> F["Decoder"]
    F --> G["Generated Audio"]
    G --> B
    B --> A
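The dispatcher role described above (receive requests, create jobs, queue them, hand them to idle workers) can be sketched with a pair of queues. This is an in-memory illustration only: the `Job` and `Dispatcher` classes, their fields, and the status strings are assumptions, not the interface of backend_api.py or backend_worker.py.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class Job:
    job_id: int
    prompt: str
    lyrics: str
    status: str = "queued"        # queued -> running -> done
    worker: Optional[str] = None  # name of the worker running the job

class Dispatcher:
    """FIFO job queue that assigns jobs to idle workers."""

    def __init__(self, workers):
        self.idle = deque(workers)  # workers with nothing to do
        self.pending = deque()      # jobs waiting for a worker
        self.jobs = {}
        self._ids = count(1)

    def submit(self, prompt, lyrics):
        """Create a job, queue it, and start it if a worker is idle."""
        job = Job(next(self._ids), prompt, lyrics)
        self.jobs[job.job_id] = job
        self.pending.append(job)
        self._dispatch()
        return job.job_id

    def _dispatch(self):
        # Pair the oldest pending job with an idle worker until one side runs out.
        while self.pending and self.idle:
            job = self.pending.popleft()
            job.worker = self.idle.popleft()
            job.status = "running"

    def finish(self, job_id):
        """Mark a job done and return its worker to the idle pool."""
        job = self.jobs[job_id]
        job.status = "done"
        self.idle.append(job.worker)
        self._dispatch()
```

With a single worker, as in the single-GPU setup, a second submission stays queued until the first job finishes; in a real worker, the running step would execute backbone, super-resolution, and decoder inference in sequence.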

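🔗 Project Resources

  • Code: https://github.com/Khala-Music-AI/Khala
  • Paper: https://arxiv.org/abs/2605.01790
  • Demo: https://khala-music-ai.github.io/Khala-demo/
  • Model weights: liujiafeng/Khala-MusicGeneration-v1.0 (Hugging Face)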

🗂 Repository Structure

Khala/
├── backend/
├── frontend/
├── core/
├── models/
├── checkpoints/
├── assets/
├── Dockerfile
├── requirements.txt
├── ENVIRONMENT_SETUP.md
└── ENVIRONMENT_SETUP_zh.md

Main directories:

  • frontend/: frontend pages and the Vite project.
  • backend/: backend API, worker, and launcher scripts.
  • core/: project-specific core modules.
  • models/: Megatron, decoder, and tokenizer related code.
  • checkpoints/: model checkpoint directory.
  • assets/: images used by the README and demo materials.

📚 Citation

If this project is helpful to your research or development work, you are welcome to cite our paper (https://arxiv.org/abs/2605.01790).

The final BibTeX information will be added later to both the paper page and the repository documentation.

🙏 Acknowledgements

The current implementation builds on a number of excellent open-source projects and tools, including but not limited to:

  • NVIDIA NGC
  • Megatron / Megatron Core
  • Hugging Face
  • FastAPI
  • Vite / React

📜 License

The model weights are currently intended to be released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

💬 Contact

Feel free to join the WeChat group for discussion, usage questions, and future updates:

[Image: Khala WeChat Group QR Code]
