@lxfater: NetEase Youdao open-sourced ZiYue 4 model, within 27B parameters, SOTA in math and science. But what really interests me is its voice feature!! Cloning a voice is nothing new, ElevenLabs could do it long ago. But they all share a common flaw: cross-language accent. Take your Chinese voice and use it to speak Japanese — it has a Chinese accent, you can tell it's a foreigner struggling...

X AI KOLs Timeline 05/22/26, 07:35 AM Models

open-source large-language-model voice-cloning cross-lingual netease-youdao ziyue4 ai

Summary

NetEase Youdao open-sourced the ZiYue 4 model, reported as SOTA in math and science within 27B parameters. Its voice feature supports 3-second cross-language voice cloning across 14 languages without carrying over the speaker's native accent. Youdao also open-sourced its all-scenario intelligent agent 'Longxia' (Lobster).

Cached at: 05/22/26, 11:49 AM

NetEase Youdao has open-sourced the ZiYue 4 model, reported as SOTA in math and science among models within 27B parameters.

But what really caught my interest is its voice feature!!

Cloning a voice is nothing new; ElevenLabs could already do it.
But they all share a common flaw: cross-language accent interference.
If you use your Chinese voice to speak Japanese, it carries a Chinese accent, and you sound like a foreigner struggling with the language.
ZiYue 4’s main selling point is removing that accent.

Your voice speaking Japanese will sound like a native Japanese speaker.
3-second cloning, supports 14 languages (see video demo).
For dubbing, voiceovers, and digital humans, it saves money and delivers great results, right?

Moreover, the ZiYue 4 model is truly open-source; it provides weights, not just an API.
The 54GB of weights are available for direct download and can run locally.
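
As a rough sanity check on the 54GB figure: assuming the weights are stored at 16 bits (2 bytes) per parameter, which the post does not state, 27B parameters would come to about 54GB. A minimal Python sketch of that arithmetic; the function name `weight_size_gb` and the 4-bit case are hypothetical illustrations, not anything Youdao has published.

```python
def weight_size_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: 27B parameters stored as 16-bit values (2 bytes each).
size_16bit = weight_size_gb(27e9)

# Hypothetical 4-bit quantization (0.5 bytes per parameter), for comparison only.
size_4bit = weight_size_gb(27e9, 0.5)

print(f"{size_16bit:.0f} GB at 16-bit, {size_4bit:.1f} GB at 4-bit")
```

The 4-bit line only illustrates how precision changes the footprint; the post does not say whether quantized weights are offered.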

The Youdao Lobster all-scenario agent is also 100% open-source.
The large model is open-sourced, and Lobster is open-sourced too — everything is given away.

What is Youdao after? How do they make money?
I actually found the secret to their revenue below:

Similar Articles

@Honcia13: Open-source TTS is going crazy! New weapons for industrial park scams? Tsinghua OpenBMB just released VoxCPM2: 2 billion parameters + 2 million hours of multilingual training data, 48kHz studio-quality sound! The most intense part: no tokenizer needed at all, performing diffusion autoregression directly in continuous latent space, maximizing detail retention!

X AI KOLs Timeline

Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 2 billion parameters. It performs diffusion autoregressive generation in continuous latent space without a tokenizer, offering 48kHz studio-quality audio and powerful voice cloning and voice design capabilities.

@aigclink: Alibaba Tongyi Lab just dropped Fun-ASR 1.5: one industrial-grade model handles 30 languages, all 7 major Chinese dialect families + 20+ regional accents, even classical-poetry recitation. Dialect CER down 56.2% vs last gen; 5 dialects top 90% accuracy…

X AI KOLs Timeline

Alibaba Tongyi Lab releases Fun-ASR 1.5: a single model covering 30 languages, seven Chinese dialect groups and 20+ local accents; character error rate (CER) in key dialect scenarios falls 56.2%, with five dialects exceeding 90% accuracy.

@billtheinvestor: Shanghai Jiao Tong University open-sources F5-TTS speech generation model. The model is trained on 100,000 hours of data and supports bilingual synthesis in Chinese and English. Technical features include zero-shot voice cloning, total-duration-based speed control, emotion expression control, and long text synthesis. Commercial use is allowed.

X AI KOLs Timeline

Shanghai Jiao Tong University has open-sourced the F5-TTS speech generation model, trained on 100,000 hours of data, supporting bilingual synthesis in Chinese and English and zero-shot voice cloning, and allowing commercial use.

@FeitengLi: Actually, these problems can be well solved: 1. Ditch Whisper, switch to a dedicated ASR model. Qwen3-ASR is great with few hallucinations, and there are other ASR options. Whisper has many hallucinations and requires 30s segments. Qwen3-ASR gets more accurate with longer audio, supporting up to 20…

X AI KOLs Timeline

Recommends using Qwen3-ASR instead of Whisper to reduce hallucinations, using LattifAI tools for precise audio-text alignment and subtitle generation, and introducing their own OmniVAD-Kit project for voice activity detection.

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
