@gkxspace: I spend two to three thousand on AI subscriptions every month, some for TTS, ASR, etc. The mainstream ones are expensive and their API protocols differ. I kept thinking: is there a single plan that covers voice cloning, meeting transcription, AI podcast generation, real-time voice Q&A, voice input, and coding? Finally found a godsend—StepFun's S...

X AI KOLs Timeline Products

Summary

StepFun launches Step Plan subscription at $6.99/month, integrating LLM, TTS, ASR, image generation, and other AI models. Supports direct OpenAI SDK connection, applicable for voice cloning, meeting transcription, AI podcast generation, etc.

I spend two to three thousand on AI subscriptions every month, some for TTS, ASR, etc. The mainstream ones are expensive and their API protocols differ. I kept thinking: is there a single plan that covers: Voice cloning, meeting transcription, AI podcast generation, real-time voice Q&A, voice input, and coding Finally found a godsend—StepFun's Step Plan, $6.99/month, more than enough. So I gradually canceled the others. One subscription includes top-tier models across categories: 1. LLM: Step 3.5 Flash, extremely low latency, also compatible with Claude / Cursor / Cline 2. TTS: stepaudio-2.5-tts (ranked higher than ElevenLabs, as I checked) 3. ASR: Real-time voice dialogue, voice cloning supported 4. Image generation: text-to-image + image editing, 0.7 seconds per image All directly connectable via OpenAI SDK, just change the base_url. Here are some use cases (details in comments): 1. English recording → Chinese notes in 54 seconds 2. English long text → dual-speaker mp3 for commute listening 3. Same text → TTS performs 7 emotions 4. Lu Xun's "Kong Yiji" → audiobook with auto-split characters 5. English podcast → end-to-end Chinese remake @StepFun_ai
Original Article
View Cached Full Text

Cached at: 05/20/26, 04:35 PM

I used to spend two to three thousand yuan a month on AI subscriptions, some of which were for TTS, ASR, etc. The mainstream services are quite expensive, and their API protocols are all different.

I’ve always been looking for a single plan that could do it all: voice cloning, meeting transcription, AI podcast generation, real-time voice Q&A, voice input, and code writing.

Finally found a true lifesaver — Step Plan by StepFun. It costs $6.99 per month and I can never use it all up. So I gradually canceled all the others.

One subscription gets you access to top-tier models of all kinds:

  1. LLM: Step 3.5 Flash — incredibly low latency, and you can also integrate it with Claude / Cursor / Cline.
  2. TTS: stepaudio-2.5-tts (I checked; its ranking is higher than ElevenLabs).
  3. ASR: Real-time voice conversations with voice cloning support.
  4. Image generation: Text-to-image + image editing, generating images in 0.7 seconds.

All accessible directly via the OpenAI SDK — just change the base URL.

Here are some use cases (details in the comments):

  1. English audio recording → Chinese notes in 54 seconds
  2. Long English article → dual‑speaker MP3 for commuting
  3. Same text → TTS with 7 different emotions
  4. Lu Xun’s Kong Yiji → automatic role‑based audiobook
  5. English podcast → end-to-end Chinese remake

@StepFun_ai

Similar Articles

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.

@yhslgg: Old Yang shares another gem open-source tool—KrillinAI, 10,000 stars on GitHub, a must-see for multilingual audio/video content! In a nutshell: from video download to subtitle translation, AI dubbing, video compositing, the entire pipeline is covered, and it can even auto-generate platform covers, supporting Bilibili, Douyin, Xiaohongshu, YouTube…

X AI KOLs Timeline

KrillinAI is an open-source tool that integrates the entire workflow of video downloading, subtitle translation, AI dubbing, and video compositing. It supports context-aware translation, voice cloning, auto layout, and cover generation, and is compatible with multiple AI models, suitable for multilingual audio/video content creation and distribution.

@denziideng: Another AI voice cloning 'dimensional reduction attack'... The CosyVoice I shared before can clone in 3 seconds, which I thought was already scary enough. But today's tool is even more lethal — after casually recording 1 minute of my own voice for training, it directly replicates tone, mannerisms, emotions, breathing, and pauses. It's almost like the soul of the original person possessed it! C...

X AI KOLs Timeline

GPT-SoVITS is an open-source AI voice cloning tool that supports zero-shot (5-second voice) and few-shot (1-minute training) high-fidelity voice cloning, cross-lingual inference, and comes with a complete WebUI toolchain. It has garnered 57.8k stars on GitHub, becoming the leading open-source project in the voice cloning field.