Aratako/Irodori-TTS-500M-v3


Summary

Irodori-TTS-500M-v3 is a Japanese TTS model based on Rectified Flow Diffusion Transformer, supporting zero-shot voice cloning and unique emoji-based style/sound effect control.

Task: text-to-speech
Tags: safetensors, speech, voice, tts, text-to-speech, ja, base_model:Aratako/Irodori-TTS-500M-v2, base_model:finetune:Aratako/Irodori-TTS-500M-v2, license:mit, region:us

Cached at: 05/18/26, 12:28 AM


Source: https://huggingface.co/Aratako/Irodori-TTS-500M-v3

Irodori-TTS-500M-v3


Irodori-TTS-500M-v3 is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. The architecture and training design largely follow Echo-TTS, using continuous latents as the generation target. It supports zero-shot voice cloning from reference audio.

A unique feature of this model is emoji-based style and sound effect control: by inserting specific emojis into the input text, you can control speaking styles, emotions, and even sound effects in the generated audio.

🌟 Key Features

  • **Flow Matching TTS:** Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  • **Voice Cloning:** Zero-shot voice cloning from a short reference audio clip.
  • **Emoji-based Style Control:** Control speaking styles, emotions, and sound effects by embedding emojis directly in the input text. See EMOJI_ANNOTATIONS.md for the full list of supported emojis and their effects.
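
To make the flow-matching idea concrete, the following is a minimal sketch of how a rectified-flow sampler can integrate from noise to a latent with Euler steps. It is not the model's code: `toy_velocity` is a hypothetical stand-in for the Diffusion Transformer that already "knows" the target latent, the step count is arbitrary, and the 32-element vector only mirrors the codec's latent width.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    For the straight path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(noise, target, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target_latent = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # one 32-dim latent frame
noise = [rng.gauss(0.0, 1.0) for _ in range(32)]
sample = euler_sample(noise, target_latent)
max_error = max(abs(a - b) for a, b in zip(sample, target_latent))
```

In a real sampler the velocity comes from the trained network conditioned on text and reference audio, so the trajectory is only approximately straight and the number of steps trades quality against speed.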

✨ What’s New in v3

This version brings several major improvements over the previous v2 model:

  • **Variable-length Training & Duration Predictor:** Transitioned from fixed-length to variable-length training and introduced a Duration Predictor. This makes training more efficient and improves the Real-Time Factor (RTF) at inference.
  • **Expanded Training Data:** The model was trained on a larger dataset, resulting in more natural speech synthesis and improved robustness across various speaking styles.
  • **Integrated Watermarking:** Integrated SilentCipher to apply robust, invisible audio watermarks directly to the generated outputs, promoting responsible AI usage.
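
For readers unfamiliar with the metric, RTF is conventionally the wall-clock synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic only; the timing and length are hypothetical, since the model card reports no measured RTF.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=48_000):
    """Return RTF = wall-clock synthesis time / duration of generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# Hypothetical numbers: 1.2 s of compute for 6 s of 48 kHz audio (RTF of about 0.2).
rtf = real_time_factor(1.2, num_samples=6 * 48_000)
```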

🏗️ Architecture

The model (approximately 500M parameters) consists of four main components:

  1. **Text Encoder:** Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
  2. **Reference Latent Encoder:** Encodes patched reference audio latents for speaker/style conditioning via self-attention + SwiGLU layers.
  3. **Diffusion Transformer:** Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs.
  4. **Duration Predictor:** Predicts audio duration from encoded text and speaker conditioning using stacked SwiGLU MLP blocks.
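
SwiGLU appears in all four components, so a small sketch of that feed-forward block may help. This is the generic SwiGLU formulation in plain Python, not code from the Irodori-TTS repository; the layer sizes and weights are hypothetical, and real layers operate on batched tensors with learned parameters.

```python
import math

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)

# Tiny hypothetical sizes: model width 2, hidden width 3.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = swiglu_mlp(x, w_gate, w_up, w_down)
```

The gate branch decides, per hidden unit, how much of the linear "up" projection passes through, which is the difference from a plain two-layer MLP.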

Audio is represented as continuous latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec (32-dim), enabling high-quality 48kHz waveform reconstruction.

🎧 Audio Samples

1. Standard TTS

Basic Japanese text-to-speech generation (without reference audio).

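| Case | Text | English translation | Generated Audio |
| --- | --- | --- | --- |
| Sample 1 | “お電話ありがとうございます。ただいま電話が大変混み合っております。恐れ入りますが、発信音のあとに、ご用件をお話しください。” | “Thank you for calling. Our lines are very busy at the moment. We apologize for the inconvenience; please leave your message after the tone.” | (audio on model page) |
| Sample 2 | “その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。” | “There was an old legend about that forest. It was said that on the night the moon rises highest, if you listen quietly, you can hear the wind singing. I was half in doubt, but that night I clearly heard someone calling me.” | (audio on model page) |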

2. Emoji Annotation Control

Examples of controlling speaking style and effects with emojis. For the full list of supported emojis, see EMOJI_ANNOTATIONS.md.

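| Case | Text (with Emoji) | English translation | Generated Audio |
| --- | --- | --- | --- |
| Sample 1 | なーに、どうしたの?…え?もっと近づいてほしい?…👂😮‍💨👂😮‍💨こういうのが好きなんだ? | What is it, what's wrong? ... Huh? You want me to come closer? ... 👂😮‍💨👂😮‍💨 So this is the kind of thing you like? | (audio on model page) |
| Sample 2 | うぅ…😭そんなに酷いこと、言わないで…😭 | Ugh... 😭 Don't say such cruel things... 😭 | (audio on model page) |
| Sample 3 | 🤧🤧ごめんね、風邪引いちゃってて🤧…大丈夫、ただの風邪だからすぐ治るよ🥺 | 🤧🤧 Sorry, I've caught a cold 🤧 ... It's fine, it's just a cold, so I'll get better soon 🥺 | (audio on model page) |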
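
When preparing prompts like these, it can be useful to see where the annotations sit in the text. The sketch below splits an input string into plain-text and emoji segments. It is an illustration only: `SAMPLE_EMOJI` contains just the emoji that appear in the samples above, not the authoritative list in EMOJI_ANNOTATIONS.md, and `split_annotations` is a hypothetical helper rather than part of the model's tooling.

```python
import re

# Emoji taken from the samples above; the authoritative list lives in
# EMOJI_ANNOTATIONS.md, so treat this set as an illustrative assumption.
SAMPLE_EMOJI = [
    "\U0001F62E\u200D\U0001F4A8",  # face exhaling (a multi-codepoint ZWJ sequence)
    "\U0001F442",  # ear
    "\U0001F62D",  # loudly crying face
    "\U0001F927",  # sneezing face
    "\U0001F97A",  # pleading face
]

# Longest alternatives first so ZWJ sequences are matched as one unit.
_EMOJI_RE = re.compile(
    "(" + "|".join(re.escape(e) for e in sorted(SAMPLE_EMOJI, key=len, reverse=True)) + ")"
)

def split_annotations(text):
    """Split text into (kind, value) pairs, where kind is 'text' or 'emoji'."""
    parts = []
    for piece in _EMOJI_RE.split(text):
        if not piece:
            continue  # re.split yields empty strings between adjacent emoji
        kind = "emoji" if _EMOJI_RE.fullmatch(piece) else "text"
        parts.append((kind, piece))
    return parts

segments = split_annotations("うぅ…\U0001F62Dそんなに酷いこと、言わないで…\U0001F62D")
```

Matching whole sequences matters because some annotations, such as the exhaling face, are several code points joined by a zero-width joiner; splitting them per character would corrupt the annotation.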

3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

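| Case | Reference Audio | Generated Audio |
| --- | --- | --- |
| Example 1 | (audio on model page) | (audio on model page) |
| Example 2 | (audio on model page) | (audio on model page) |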

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub: Aratako/Irodori-TTS
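
The inference entry points live in that repository and are not reproduced here. As a hedged starting point, the sketch below only fetches the model files from the Hub with `huggingface_hub.snapshot_download`; `download_checkpoint` is a hypothetical helper name, and whether the repository's scripts expect a local path or download the weights themselves is not stated in this model card.

```python
REPO_ID = "Aratako/Irodori-TTS-500M-v3"

def download_checkpoint(local_dir=None):
    """Download the model repository files and return the local path.

    Requires `pip install huggingface_hub`; the import is lazy so this
    module still loads when the package is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

# Example (performs a network download, so it is left commented out):
# path = download_checkpoint()
```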

📊 Training Data & Annotation

The model was trained on an expanded, high-quality Japanese speech dataset. To enable the emoji-based style control, the training texts were enriched with emoji annotations. These annotations were automatically generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct.

⚠️ Limitations

  • **Japanese Only:** This model currently supports Japanese text input only.
  • **Emoji Control:** While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  • **Audio Quality:** Quality depends on training data characteristics. Performance may vary for voices or speaking styles underrepresented in the training data.
  • **Kanji Reading Accuracy:** The model’s ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
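
One simple way to apply the Kanji workaround is to keep a small dictionary of troublesome words and substitute their kana readings before synthesis. The sketch below assumes a user-maintained mapping; `apply_readings` and `READINGS` are hypothetical names, and a morphological analyzer or kana-conversion library could automate the lookup instead.

```python
def apply_readings(text, readings):
    """Replace each listed word with its kana reading, longest words first."""
    for word in sorted(readings, key=len, reverse=True):
        text = text.replace(word, readings[word])
    return text

# Hypothetical user-maintained dictionary of hard-to-read words.
READINGS = {"半信半疑": "はんしんはんぎ", "言い伝え": "いいつたえ"}
converted = apply_readings("私は半信半疑でしたが", READINGS)
```

Replacing longer entries first keeps a short entry from rewriting part of a longer compound that has its own reading.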

📜 License & Ethical Restrictions

License

This model is released under the **MIT** License.

Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

  1. **No Impersonation:** Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
  2. **No Misinformation:** Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
  3. **Voice Generation Disclaimer:** When generating speech purely from text without reference audio, the generated voice may coincidentally resemble that of a real person. This is strictly a probabilistic artifact within the latent space. The model was not trained with the intent of reproducing specific individuals.
  4. **Liability Disclaimer:** The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

🙏 Acknowledgments

This project builds upon the following works:

We would also like to extend our special thanks to **Respair** for the inspiration behind the emoji annotation feature, and to gabrielclark3330 for supporting this project.

🖊️ Citation

If you use Irodori-TTS-v3 in your research or project, please cite it as follows:

@misc{irodori-tts-v3,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v3}}
}
