jaaari/kokoro-82m

Summary

Kokoro-82M is an efficient, high-quality text-to-speech model available on Replicate, supporting multiple languages and voices with low inference cost.

Cached at: 05/08/26, 06:26 AM

# Kokoro-82M: High-Quality, Efficient Text-to-Speech on Replicate

Source: [https://replicate.com/jaaari/kokoro-82m](https://replicate.com/jaaari/kokoro-82m)

## Run time and cost

This model costs approximately $0.00022 to run on Replicate, or 4545 runs per $1, but this varies depending on your inputs. It is also open source and you can [run it on your own computer with Docker](https://replicate.com/jaaari/kokoro-82m/api).

This model runs on [Nvidia T4 GPU hardware](https://replicate.com/docs/billing). Predictions typically complete within 1 second.

## Readme

- license: apache-2.0
- language: en
- base_model: yl4579/StyleTTS2-LJSpeech
- pipeline_tag: text-to-speech

---

## Disclaimer

This is a fork of the original Kokoro repo, created to provide easy inference on Replicate. I am not affiliated with the original Kokoro authors, and this is not an official release of the Kokoro model. Similar to the Hugging Face Space, this implementation provides automatic text splitting to support long-form text inputs. See the original README below for more details.

---

## Voices

**Training Duration** - How much audio was seen during training? Smaller durations result in a lower overall grade.
- 10 hours <= HH hours < 100 hours
- 1 hour <= H hours < 10 hours
- 10 minutes <= MM minutes < 100 minutes
- 1 minute <= M minutes < 10 minutes

### American English 🇺🇸

- [`misaki[en]`](https://github.com/hexgrad/misaki) `lang_code='a'` with `en-us` espeak-ng fallback

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| af_alloy | 🚺 | B | MM minutes | C | `6d877149` |
| af_aoede | 🚺 | B | H hours | C+ | `c03bd1a4` |
| af_bella | 🚺🔥 | **A** | **HH hours** | **A-** | `8cb64e02` |
| af_jessica | 🚺 | C | MM minutes | D | `cdfdccb8` |
| af_kore | 🚺 | B | H hours | C+ | `8bfbc512` |
| af_nicole | 🚺🎧 | B | **HH hours** | B- | `c5561808` |
| af_nova | 🚺 | B | MM minutes | C | `e0233676` |
| af_river | 🚺 | C | MM minutes | D | `e149459b` |
| af_sarah | 🚺 | B | H hours | C+ | `49bd364e` |
| af_sky | 🚺 | B | M minutes | C- | `c799548a` |
| am_adam | 🚹 | D | H hours | F+ | `ced7e284` |
| am_echo | 🚹 | C | MM minutes | D | `8bcfdc85` |
| am_eric | 🚹 | C | MM minutes | D | `ada66f0e` |
| am_fenrir | 🚹 | B | H hours | C+ | `98e507ec` |
| am_liam | 🚹 | C | MM minutes | D | `c8255075` |
| am_michael | 🚹 | B | H hours | C+ | `9a443b79` |
| am_onyx | 🚹 | C | MM minutes | D | `e8452be1` |
| am_puck | 🚹 | B | H hours | C+ | `dd1d8973` |

### British English 🇬🇧

- [`misaki[en]`](https://github.com/hexgrad/misaki) `lang_code='b'` with `en-gb` espeak-ng fallback

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| bf_alice | 🚺 | C | MM minutes | D | `d292651b` |
| bf_emma | 🚺 | B | **HH hours** | B- | `d0a423de` |
| bf_isabella | 🚺 | B | MM minutes | C | `cdd4c370` |
| bf_lily | 🚺 | C | MM minutes | D | `6e09c2e4` |
| bm_daniel | 🚹 | C | MM minutes | D | `fc3fce4e` |
| bm_fable | 🚹 | B | MM minutes | C | `d44935f3` |
| bm_george | 🚹 | B | MM minutes | C | `f1bc8122` |
| bm_lewis | 🚹 | C | H hours | D+ | `b5204750` |

### French 🇫🇷

- espeak-ng `fr-fr`
- Total French training data: <11 hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| ff_siwis | 🚺 | B | <11 hours | B- | `8073bf2d` | [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) |

### Hindi 🇮🇳

- espeak-ng `hi`
- Total Hindi training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| hf_alpha | 🚺 | B | MM minutes | C | `06906fe0` |
| hf_beta | 🚺 | B | MM minutes | C | `63c0a1a6` |
| hm_omega | 🚹 | B | MM minutes | C | `b55f02a8` |
| hm_psi | 🚹 | B | MM minutes | C | `2f0f055c` |

### Italian 🇮🇹

- espeak-ng `it`
- Total Italian training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| if_sara | 🚺 | B | MM minutes | C | `6c0b253b` |
| im_nicola | 🚹 | B | MM minutes | C | `234ed066` |

### Japanese 🇯🇵

- [`misaki[ja]`](https://github.com/hexgrad/misaki)
- Total Japanese training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| jf_alpha | 🚺 | B | H hours | C+ | `1bf4c9dc` | |
| jf_gongitsune | 🚺 | B | MM minutes | C | `1b171917` | [gongitsune](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__gongitsune.txt) |
| jf_nezumi | 🚺 | B | M minutes | C- | `d83f007a` | [nezuminoyomeiri](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__nezuminoyomeiri.txt) |
| jf_tebukuro | 🚺 | B | MM minutes | C | `0d691790` | [tebukurowokaini](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__tebukurowokaini.txt) |
| jm_kumo | 🚹 | B | M minutes | C- | `98340afd` | [kumonoito](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__kumonoito.txt) |

### Mandarin Chinese 🇨🇳

- [`misaki[zh]`](https://github.com/hexgrad/misaki)
- Total Mandarin Chinese training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| zf_xiaobei | 🚺 | C | MM minutes | D | `9b76be63` |
| zf_xiaoni | 🚺 | C | MM minutes | D | `95b49f16` |
| zf_xiaoxiao | 🚺 | C | MM minutes | D | `cfaf6f2d` |
| zf_xiaoyi | 🚺 | C | MM minutes | D | `b5235dba` |
| zm_yunjian | 🚹 | C | MM minutes | D | `76cbf8ba` |
| zm_yunxi | 🚹 | C | MM minutes | D | `dbe6e1ce` |
| zm_yunxia | 🚹 | C | MM minutes | D | `bb2b03b0` |
| zm_yunyang | 🚹 | C | MM minutes | D | `5238ac22` |

---

✨ You can now [`pip install kokoro`](https://github.com/hexgrad/kokoro)! See [Usage](https://huggingface.co/hexgrad/Kokoro-82M#usage).

**Kokoro** is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
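The voice names in the tables above appear to follow a two-letter prefix pattern: the first letter tracks the language section and the second tracks the 🚺/🚹 trait. The sketch below decodes a name on that basis. Only `lang_code='a'` and `lang_code='b'` are stated in the source; the other letters are inferred from the name prefixes, and `describe_voice` and `VoiceInfo` are hypothetical helpers, not part of the `kokoro` library.

```python
from dataclasses import dataclass

# First prefix letter -> language section heading in the voice tables.
# Only 'a' and 'b' are given as lang_code values in the source; the other
# letters are inferred from the voice-name prefixes (an assumption).
LANGUAGE_BY_PREFIX = {
    "a": "American English",
    "b": "British English",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}

# Second prefix letter -> gender trait shown in the tables (🚺 / 🚹).
GENDER_BY_PREFIX = {"f": "female", "m": "male"}


@dataclass(frozen=True)
class VoiceInfo:
    """Hypothetical container for the decoded parts of a voice name."""
    name: str
    lang_prefix: str
    language: str
    gender: str


def describe_voice(voice: str) -> VoiceInfo:
    """Decode a voice name such as 'af_bella' using the table naming pattern."""
    prefix, separator, rest = voice.partition("_")
    if not separator or len(prefix) != 2 or not rest:
        raise ValueError(f"unexpected voice name format: {voice!r}")
    lang_letter, gender_letter = prefix[0], prefix[1]
    if lang_letter not in LANGUAGE_BY_PREFIX:
        raise ValueError(f"unknown language prefix in {voice!r}")
    if gender_letter not in GENDER_BY_PREFIX:
        raise ValueError(f"unknown gender prefix in {voice!r}")
    return VoiceInfo(
        name=voice,
        lang_prefix=lang_letter,
        language=LANGUAGE_BY_PREFIX[lang_letter],
        gender=GENDER_BY_PREFIX[gender_letter],
    )


if __name__ == "__main__":
    for voice_name in ("af_bella", "bm_george", "jf_alpha"):
        print(describe_voice(voice_name))
```

A lookup like this can be handy for checking that a chosen voice matches the language of the input text before submitting a prediction.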
- [Releases](https://huggingface.co/hexgrad/Kokoro-82M#releases)
- [Usage](https://huggingface.co/hexgrad/Kokoro-82M#usage)
- [Voices and Languages](https://huggingface.co/hexgrad/Kokoro-82M#voices-and-languages)
- [Model Facts](https://huggingface.co/hexgrad/Kokoro-82M#model-facts)
- [Training Details](https://huggingface.co/hexgrad/Kokoro-82M#training-details)
- [Creative Commons Attribution](https://huggingface.co/hexgrad/Kokoro-82M#creative-commons-attribution)
- [Acknowledgements](https://huggingface.co/hexgrad/Kokoro-82M#acknowledgements)

### Releases

| Model | Published | Training Data | Compute (A100 80GB) | Langs & Voices | SHA256 |
| --- | --- | --- | --- | --- | --- |
| **v1.0** | **2025 Jan 27** | **Few hundred hrs** | **$1000 for 1000 hrs** | [**6 & 46**](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | `496dba11` |
| [v0.19](https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19) | 2024 Dec 25 | <100 hrs | $400 for 500 hrs | 1 & 10 | `3b0c392f` |

### Usage

[`pip install kokoro`](https://pypi.org/project/kokoro/) installs the inference library at [https://github.com/hexgrad/kokoro](https://github.com/hexgrad/kokoro).

Under the hood, `kokoro` uses [`misaki`](https://pypi.org/project/misaki/), a G2P library at [https://github.com/hexgrad/misaki](https://github.com/hexgrad/misaki).

### Model Facts

**Architecture:**

- StyleTTS 2: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
- ISTFTNet: [https://arxiv.org/abs/2203.02395](https://arxiv.org/abs/2203.02395)
- Decoder only: no diffusion, no encoder release

**Architected by:** Li et al @ [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)

**Trained by**: `@rzvzn` on Discord

**Languages:** American English, British English, French, Hindi

**Model SHA256 Hash:** `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

### Training Details

**Compute:** About $1000 for 1000 hours of A100 80GB vRAM

**Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels.
Examples of permissive/non-copyrighted audio include:

- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers

[1] [https://copyright.gov/ai/ai_policy_guidance.pdf](https://copyright.gov/ai/ai_policy_guidance.pdf)

[2] No synthetic audio from open TTS models or "custom voice clones"

**Total Dataset Size:** A few hundred hours of audio

### Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| --- | --- | --- | --- |
| [Koniwa](https://github.com/koniwa/koniwa) `tnc` | <1h | [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/deed.ja) | v0.19 / 22 Nov 2024 |
| [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) | <11h | [CC BY 4.0](https://datashare.ed.ac.uk/bitstream/handle/10283/2353/license_text) | v0.19 / 22 Nov 2024 |

### Acknowledgements

- 🛠️ [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2.
- 🏆 [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena.
- 📊 Thank you to everyone who contributed synthetic training data.
- ❤️ Special thanks to all compute sponsors.
- 👾 Discord server: [https://discord.gg/QuGxSWBfQy](https://discord.gg/QuGxSWBfQy)
- 🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an [AI in the Terminator franchise](https://terminator.fandom.com/wiki/Kokoro).

![kokoro](https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg)

Model created over 1 year ago
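The SHA256 columns in the voice and release tables list eight hex characters, and the v1.0 value `496dba11` matches the start of the full model hash given under Model Facts. Assuming the other short values are likewise leading prefixes of each file's full SHA-256 digest, a downloaded file could be checked roughly as follows. `sha256_hex`, `matches_listed_hash`, and the file path in the usage comment are hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_hex(path, chunk_size=1 << 20):
    """Return the full SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_hash(path, listed):
    """Check a file against a full or truncated hash as listed in the tables.

    Assumption: a short listed value is the leading prefix of the full digest.
    """
    listed = listed.strip().lower()
    if not listed or any(ch not in "0123456789abcdef" for ch in listed):
        raise ValueError(f"not a hex string: {listed!r}")
    return sha256_hex(path).startswith(listed)


# Usage sketch (the path is a placeholder, not a confirmed file name):
# ok = matches_listed_hash("path/to/model-weights-file", "496dba11")
```

An eight-character prefix is enough to catch a corrupted or mismatched download, but comparing against the full 64-character hash is the stronger check where one is published.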
