I built Derpy Turtle: The Kokoro Trainer, a GUI for training better Kokoro voices with RVC

Reddit r/LocalLLaMA 05/12/26, 10:24 PM Tools

voice-synthesis text-to-speech gui open-source kokoro rvc audio-processing

Summary

Derpy Turtle is a Windows GUI tool designed to enhance Kokoro voice outputs by integrating voice search, RVC model training, and post-generation voice conversion into a unified workflow.

I’ve been working on a tool called Derpy Turtle: The Kokoro Trainer. It started as a random-walk experiment for Kokoro voices, but it has grown into its own thing: a Windows GUI for creating better local voice outputs by combining Kokoro voice search with RVC voice conversion.

The short version: Kokoro is good at generating speech. RVC is good at matching a target voice. Derpy Turtle connects the two.

The app lets you:

- Load a target voice clip.
- Search/refine Kokoro `.pt` voices against that target.
- Train an RVC model from your target audio.
- Generate Kokoro speech.
- Automatically pass the output through your trained RVC model.
- Save the final converted `_rvc.wav`.

The important lesson I learned is that chasing a super high Kokoro similarity score alone is not enough. I was stuck around the low/mid 80% range even after very long runs. The output improved, but it still did not sound close enough. The better approach was to use Kokoro as the clean speech source, then let RVC handle the final voice identity.

So the current workflow is:

1. Train an RVC model from clean target audio.
2. Run a short Kokoro search/refinement to get stable speech.
3. Enable “Use Latest RVC”.
4. Generate the line.
5. Listen to the `_rvc.wav`, not just the optimizer score.

The GUI has presets, queue management, ETA logging, extra target audio support, per-audio transcript mapping, CUDA support, and a launcher `.exe` that handles first-time setup.

A few practical notes:

- You need clean training audio. A smaller clean dataset beats a larger noisy one.
- RVC helps with timbre/identity, but it does not magically fix bad pacing or pronunciation.
- The Kokoro similarity score is pre-RVC, so the final converted audio can sound much better even if the score does not change.
- CUDA makes a huge difference. On my RTX 3060, GPU mode cut one run from roughly 26 hours on CPU to about 4 hours.

It's 100% free for non-commercial use. Personal/research use is allowed, but anyone wanting commercial use would need to contact me.

The goal is to make local voice experimentation more accessible. I made everything as user-friendly as possible. I wanted something where a non-technical user could run an .exe, load target audio, train/refine, and actually get usable output without manually wiring together a bunch of tools.

I've added this process to [my game here](https://store.steampowered.com/app/4312840/Detective_Turtletop_Who_Killed_Harambe/), if anyone wants to experience it in practice. All the voices are trained using this trainer. Enjoy!

Original Article

Cached at: 05/13/26, 12:19 AM

BovineOverlord/Derpy-Turtle-The-Kokoro-Trainer

Source: https://github.com/BovineOverlord/Derpy-Turtle-The-Kokoro-Trainer

Derpy Turtle: The Kokoro Trainer

Derpy Turtle is a Windows-first GUI for building better local Kokoro voice outputs. It combines Kokoro voice search, target-audio scoring, RVC model training, and post-generation voice conversion into one queue-based workflow.

The practical goal is simple: generate clear Kokoro speech, then use a target-trained RVC model to move the final audio closer to the desired voice.

This project wouldn’t be possible without…

Kokoro — the TTS model and voice tensor system that everything here is built around.
kvoicewalk — the original Kokoro random-walk voice search that this project grew out of.
RVC — Retrieval-based Voice Conversion — the voice conversion technique used to push generated speech toward a target voice identity.
rvc-python — the Python wrapper that makes RVC straightforward to integrate.
Applio — the RVC training backend used by Train Target RVC Model.
faster-whisper — the transcription engine powering Transcribe Many (built on OpenAI’s Whisper).
resemblyzer — speaker embedding and voice similarity scoring used to evaluate candidate voices against your target.
librosa — audio analysis and resampling used throughout the audio processing pipeline.
so-vits-svc-fork — the alternative voice conversion backend (sovits mode).
PyTorch — the ML backbone under all of the above.

What It Does

Runs Kokoro random-walk and hybrid voice searches against one or more target clips.
Trains a target RVC model from your own clean reference audio.
Applies RVC automatically after generation and writes a final _rvc.wav.
Provides a queue GUI with presets, ETA, progress logging, text mapping for extra audio clips, and playback for generated WAV files.
Bootstraps its own Python environment from a single launcher executable.

Recommended Workflow

Train an RVC model from the target voice.
Run a short Kokoro search or refinement to get stable source speech.
Enable Use Latest RVC.
Generate and listen to the _rvc.wav output.

The optimizer score is measured before RVC conversion. If the _rvc.wav sounds better, trust the audio over the pre-RVC score.
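If you want a rough number to set beside your own listening, one option is to score both the pre-RVC WAV and the `_rvc.wav` against the target with a speaker embedding. The sketch below assumes resemblyzer is installed. The function names are illustrative, and this is not necessarily how Derpy Turtle computes its own optimizer score.

```python
import math
from pathlib import Path


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speaker_similarity(target_wav, candidate_wav):
    """Compare two clips by speaker embedding. Requires resemblyzer."""
    # Imported lazily so cosine_similarity works without resemblyzer installed.
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()
    target = encoder.embed_utterance(preprocess_wav(Path(target_wav)))
    candidate = encoder.embed_utterance(preprocess_wav(Path(candidate_wav)))
    return cosine_similarity(target.tolist(), candidate.tolist())
```

Calling `speaker_similarity` once on the plain WAV and once on the `_rvc.wav` gives two comparable numbers, but they are a sanity check only. The audio itself remains the deciding factor.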

First Run

Run:

.\derpy-turtle-kokoro-trainer.exe

On first launch, the executable creates .venv, installs the Python dependencies, prepares the selected voice-conversion backend, and opens the GUI.

Setup logs are written to:

derpy-turtle-launcher.log

Simplified Instructions

Note: you do not always need to let Random Walk run all of its steps. If it has made enough progress, check the .wav files in out to confirm.
Load Target Audio (at least 30 seconds recommended)
Write the Target Audio’s text under “Target Text”
Select a preset: “Balanced” or “Fast Iterate”
Click “Add Task”
Click “Start Queue”
When finished, load “Starting Voice” (the most recent .pt in /out)
Change mode to “Train Target RVC Model” (in top left)
Change VC Preset to RVC
Click “Add Task”
Click “Start Queue”
When finished, change mode back to Random Walk
Change preset to “Accent Focus” or “Similarity Recovery”
Click “Use Latest RVC”
Click “Add Task”
Click “Start Queue”
The final .pt should be very close to your target voice. If not, you’ll need to experiment with the settings, but this should cover most cases.
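The “Starting Voice” step asks for the most recent .pt in out. If the folder has filled up with candidates, a small helper can find it. This is a sketch that assumes “most recent” means latest modification time and that result folders may be nested under out. The function name is illustrative.

```python
from pathlib import Path


def newest_voice_tensor(out_dir="out"):
    """Return the most recently modified .pt file under out_dir, or None if there is none."""
    candidates = list(Path(out_dir).rglob("*.pt"))
    if not candidates:
        return None
    # Pick the file with the latest modification timestamp.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Printing `newest_voice_tensor()` from the project folder shows which file to load in the GUI.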

Requirements

Windows 10/11.
Python 3.10, 3.11, or 3.12 available through py or python.
NVIDIA GPU recommended.
CUDA-capable PyTorch is installed for the RVC backend when available.
CPU mode works, but long searches and RVC conversion are much slower.
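Before the first launch, the requirements above can be checked from any Python prompt. This is a minimal sketch, not the launcher's actual setup logic. The helper names are illustrative, and `cuda_available` simply reports False when PyTorch is not installed yet.

```python
import sys

# Minor versions listed under Requirements.
SUPPORTED_VERSIONS = {(3, 10), (3, 11), (3, 12)}


def python_is_supported(version_info=sys.version_info):
    """True if the interpreter's major.minor version is 3.10, 3.11, or 3.12."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


def cuda_available():
    """True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()
```

If `python_is_supported()` returns False, install one of the listed versions before running the launcher. If `cuda_available()` returns False after setup, expect the slower CPU mode.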

Observed local performance: CUDA mode used about 4 GB VRAM and reduced a run from roughly 26 hours on CPU to about 4 hours on an RTX 3060.

GUI Modes

Random Walk

Searches Kokoro voice tensors and writes .pt plus .wav candidates.
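To make the idea concrete, here is a toy greedy random walk over a plain list of floats. It is only an illustration of the search pattern, not kvoicewalk's or Derpy Turtle's actual algorithm. Real Kokoro voices are PyTorch tensors, and scoring a candidate means synthesizing audio and comparing it with the target. `score_fn`, `step_size`, and `seed` are hypothetical parameters.

```python
import random


def random_walk_search(start, score_fn, steps=500, step_size=0.05, seed=0):
    """Greedy random walk: perturb the vector, keep the change only if the score improves."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score_fn(best)
    for _ in range(steps):
        # Add small Gaussian noise to every component of the current best vector.
        candidate = [v + rng.gauss(0.0, step_size) for v in best]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because a candidate is kept only when it scores higher, the best score never gets worse. That is also why progress can flatten out into a plateau on long runs.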

Train Target RVC Model

Builds an RVC model from the primary target audio plus any extra target clips. The result is exported under vc_models/rvc/trained/<model_name>/ and becomes available through Use Latest RVC.

Test Voice

Generates a quick WAV from a selected .pt voice and the current target text. Output is written to out/<output_name>_test.wav.

Transcribe Many

Transcribes a file or folder of audio clips to text files.
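For reference, a batch transcription along these lines can be sketched with faster-whisper directly. This is an assumption about the general approach, not the project's code. The suffix list, the model size, and the choice to write a .txt file next to each clip are all illustrative.

```python
from pathlib import Path

# Illustrative set of audio extensions; adjust to match your clips.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}


def find_audio_files(path):
    """Return audio files for a single file or a folder (non-recursive), sorted by name."""
    p = Path(path)
    if p.is_file():
        return [p] if p.suffix.lower() in AUDIO_SUFFIXES else []
    if not p.is_dir():
        return []
    return sorted(f for f in p.iterdir() if f.is_file() and f.suffix.lower() in AUDIO_SUFFIXES)


def transcribe_many(path, model_size="small", device="cpu", compute_type="int8"):
    """Write a .txt transcript next to each audio file. Requires faster-whisper."""
    # Imported lazily so find_audio_files works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    written = []
    for audio in find_audio_files(path):
        segments, _info = model.transcribe(str(audio))
        text = " ".join(segment.text.strip() for segment in segments)
        out_path = audio.with_suffix(".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Automatic transcripts are worth a quick proofread before they are used as target text, since transcription mistakes carry over into scoring.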

Export Voices Bin

Exports source voice data for faster startup paths.

RVC Training

Use Train Target RVC Model when random walk has plateaued or when voice identity matters more than the Kokoro similarity score.

Recommended starting settings:

RVC Epochs: 250-350
RVC Batch: 4
RVC Sample Rate: 48000
Prepare Dataset Only: off

Data quality matters more than huge step counts:

Use 10-30 minutes of clean target speech as a minimum.
45-90 minutes is better when the style is consistent.
Remove music, heavy reverb, clipping, background speakers, and noisy sections.
Include the emotional range, pitch range, pace, and accent you want the model to reproduce.
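A quick way to see where a dataset stands against these durations is to total the length of its WAV files. The sketch below uses only the standard library `wave` module, so it assumes uncompressed PCM WAV files in a single folder. The function names and verdict strings are illustrative, and it says nothing about how clean the audio is.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_minutes(folder):
    """Total duration, in minutes, of all .wav files directly inside folder."""
    return sum(wav_seconds(p) for p in Path(folder).glob("*.wav")) / 60.0


def dataset_verdict(minutes):
    """Map a total duration onto the guidance above (10 minutes minimum, 45+ preferred)."""
    if minutes < 10:
        return "below the suggested minimum"
    if minutes < 45:
        return "meets the minimum"
    return "in or above the preferred range"
```

For example, `dataset_verdict(dataset_minutes("my_target_clips"))` gives a one-line summary for a hypothetical folder of clips.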

After training, click Use Latest RVC before adding the generation task.

Kokoro Search Settings

For RVC-based output, do not spend days chasing a higher random-walk score. The Kokoro voice is now the source performance, not the final identity.

Good defaults:

Preset: Fast Iterate or Similarity Recovery
Steps: 500-1500 for source checks
Device: cuda
Post VC: enabled
VC: Use Latest RVC

Run longer searches only when the pre-RVC source has pronunciation, pacing, stability, or accent problems.

Outputs

Random-walk and hybrid runs write result folders under:

out/

Typical files:

*.pt          Kokoro voice tensor candidate
*.wav         pre-RVC Kokoro output
*_rvc.wav     final RVC-converted output

Use the _rvc.wav as the final audio when RVC is enabled.
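If you post-process results with a script, the same rule can be applied automatically. This sketch assumes the converted file sits next to the pre-RVC WAV and shares its name plus an `_rvc` suffix. That naming is inferred from the file patterns above, so check it against your own out folder.

```python
from pathlib import Path


def final_audio_for(wav_path):
    """Prefer the sibling *_rvc.wav when it exists, else fall back to the pre-RVC WAV."""
    wav = Path(wav_path)
    if wav.stem.endswith("_rvc"):
        # Already the converted file.
        return wav
    converted = wav.with_name(wav.stem + "_rvc.wav")
    return converted if converted.exists() else wav
```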

Extra Target Audio

Extra target clips can improve scoring and RVC training. Each extra clip can have its own transcript through Map Texts.

Different text is useful. Multiple clips with different words, pacing, and emotion give the trainer more information than repeating the same phrase.
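One way to check a set of extra clips for repeated phrases before mapping them is to group clips by normalized transcript. This is a hypothetical helper, not part of the GUI. It assumes you have the clip-to-text mapping available as a plain dictionary.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def repeated_transcripts(clip_texts):
    """Return groups of clip names that share the same normalized transcript."""
    groups = defaultdict(list)
    for clip, text in clip_texts.items():
        groups[normalize(text)].append(clip)
    return [sorted(clips) for clips in groups.values() if len(clips) > 1]
```

An empty result means every clip carries different text, which is the situation the paragraph above recommends.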

Playback

Use Play Latest WAV in the GUI to open the newest generated WAV under out with the Windows default audio player.

Building The Launcher

To rebuild the launcher executable:

.\build-launcher.cmd

The output is:

derpy-turtle-kokoro-trainer.exe

Safety

Only train on and clone voices you have permission to use. Do not use this project to impersonate people without consent.