Removing 'um' from a recording is harder than it sounds

Hacker News Top Tools

Summary

A local CLI tool that uses OpenAI's Whisper to detect and remove filler words (um, uh, erm) from audio recordings, employing techniques to avoid audio artifacts like clicks and background hiss.

Cached at: 06/12/26, 05:51 AM

# erm: A Local CLI That Strips Ums, Uhs, and Erms From Speech

Source: [https://doug.sh/posts/erm-a-local-cli-that-strips-ums-uhs-and-erms-from-speech/](https://doug.sh/posts/erm-a-local-cli-that-strips-ums-uhs-and-erms-from-speech/)

May 2, 2026 · 1547 words · 8 minute read

Linguists have a word for the `um`s, `uh`s, `er`s, and elongated versions (`ummmm`, `uhhhhh`) that pad spoken English: *disfluencies*. I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built [erm](https://github.com/dougcalobrisi/erm) to do it.

That’s the whole interface for the common case. It writes a cleaned `.wav` and a JSON cut list next to the input.

This post walks through how it works, because the obvious approach doesn’t sound very good and most of the code is the stuff that fixes that.

## The naive version doesn’t work

You’d expect the job to be: transcribe with word-level timestamps, find tokens like `um` and `uh`, cut those ranges with ffmpeg. That gets you maybe 60% of the way, and the result sounds worse than the original. Three reasons:

- Whisper quietly leaves a lot of fillers out of the transcript, so there’s no `um` token to match in the first place.
- Slicing audio at an arbitrary point in time produces a tiny step in the waveform. Your ear hears it as a click.
- Even when the splice itself is clean, the background hiss before and after the cut doesn’t quite match, so you hear a faint shift at every edit.

Most of erm is the work of fixing those three things.

## A quick word on Whisper

[Whisper](https://github.com/openai/whisper) is OpenAI’s open-source speech-to-text model.
You hand it audio, it hands you back a transcript, and with the right flag it’ll also tell you the start and end timestamp of every word. It runs locally, which is what makes a tool like this possible without sending your recordings anywhere.

erm uses [`faster-whisper`](https://github.com/SYSTRAN/faster-whisper), a reimplementation that’s several times faster than the reference one and uses less memory. Same model weights, same output, just a better runtime.

The default is the `medium.en` model, which is a good speed/accuracy balance. You can override with `--model` if you want `small.en` (faster), but I’d actually reach for `large-v3`. It’s noticeably better at picking up fillers and worth the extra compute.

## Detection

First, run Whisper. erm asks for word-level timestamps and gives it a small instruction up front telling it not to clean up the transcript. Whisper, left alone, will edit out fillers because most of its training transcripts are clean prose.

Any word that comes back as a known filler (`um`, `uh`, `er`, etc.) is flagged for cutting. Elongated versions like `ummmm` get matched against the `um` stem on the fly.

Whisper still misses things, so three more passes look at the audio directly:

**Gap fillers.** If there’s an unusually long pause between two transcribed words (more than 350ms by default), erm checks whether somebody is actually making a sound during that “pause.” If a chunk of voice is sitting inside what Whisper marked as silence, that’s a filler Whisper deleted entirely. *It really does just drop them. No token at all, just a hole in the transcript where an `um` used to be.*

**Fillers hiding inside a word.** Whisper sometimes glues a filler onto an adjacent word, so `"in, uhhhhh"` comes back as a single `in` token.
erm looks at long single-token words, splits them at brief dips in the audio, figures out which chunk is the actual word (based on how long that word should reasonably take to say), and treats the rest as filler.

**Words that are much too long.** If a word lasts way longer than its text could plausibly take to pronounce, the tail end is suspicious. erm scans the tail for voiced sound, and optionally double-checks with a pitch test: does the suspicious chunk sound like someone holding a vowel (`uhhhhh`), or like someone just speaking slowly? A held vowel has a steady, simple acoustic shape; real speech is constantly changing as you move between sounds. The pitch test keeps the tool from trimming slow talkers.

All four passes (the Whisper one and the three audio ones) produce candidate cuts independently, and the lists get merged before the next step.

## Refining the cut points

A cut at exactly `t = 1.234s` lands wherever the waveform happens to be at that instant, almost never at zero. Stitching two arbitrary points together leaves a step in the waveform, and that step is the click you hear.

Two small fixes, in order.

First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech.

Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero. Two zero points stitched together produce a continuous waveform with no step, and no click.

After all that, very short surviving fragments get cleaned up: if two adjacent cuts would leave a sliver of audio shorter than about 120ms between them, the sliver gets merged into one bigger cut.
A fragment that small can’t survive the smoothing on either side anyway and just sounds like a blip.

## Splicing

ffmpeg does the actual stitching using a *crossfade*. Instead of butting the two pieces of audio together, it briefly overlaps them and fades one out as the other fades in. That smooths over any remaining mismatch.

The trick is picking how *long* to overlap. A fixed length (most tutorials say 80ms or so) sounds wrong both ways: short cuts get smeared together, long cuts still pop. erm scales the length to the size of the cut: a tiny clip of `uh` gets a short crossfade, a long `ummmmm` gets a longer one. There’s a floor and ceiling (50ms to 120ms), and the crossfade is never allowed to reach back across the start of a real word, which would muddy the speech on either side.

## Room tone

Even after all of the above, the background hiss of the recording (the ambient sound of the room when nobody’s talking) doesn’t perfectly match across cuts. Every room has a slightly different “silence,” and stitching two near-silences together still produces a faint shift you can hear.

The fix is dumb but it works. Find a quiet stretch in the original recording (a real piece of “this room when nobody’s talking”) and loop it underneath the entire output at low volume. Now the background is identical everywhere, because it’s the same loop everywhere. Any small mismatch at each splice gets covered up by the steady tone sitting on top.

By default the quiet stretch is found automatically. You can also point it at a specific time range if you know a good one.
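
The two splicing ideas above (a crossfade scaled to the size of the cut, and a quiet stretch picked automatically for room tone) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not erm’s actual code: the function names, the `fraction=0.25` scaling factor, and the window and hop sizes are all hypothetical. Only the 50ms to 120ms bounds and the “never reach into a real word” rule come from the description above.

```python
import math


def crossfade_ms(cut_ms, fraction=0.25, floor_ms=50.0, ceil_ms=120.0, room_ms=None):
    """Pick a crossfade length that grows with the cut, clamped to [floor, ceil].

    `fraction` is an assumed scaling factor. `room_ms` is the distance to the
    nearest real word; when given, it caps the fade (possibly below the floor)
    so the overlap never reaches into speech.
    """
    length = min(max(cut_ms * fraction, floor_ms), ceil_ms)
    if room_ms is not None:
        length = min(length, room_ms)
    return length


def quietest_window(samples, sample_rate, window_s=0.5, hop_s=0.1):
    """Return (start_s, end_s) of the window with the lowest RMS energy.

    A brute-force scan: slide a fixed-size window over the recording and
    keep the one with the smallest root-mean-square amplitude.
    """
    win = int(window_s * sample_rate)
    hop = max(1, int(hop_s * sample_rate))
    if win <= 0 or len(samples) < win:
        raise ValueError("recording is shorter than the window")

    best_start, best_rms = 0, math.inf
    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win]
        rms = math.sqrt(sum(x * x for x in chunk) / win)
        if rms < best_rms:
            best_start, best_rms = start, rms
    return best_start / sample_rate, (best_start + win) / sample_rate
```

A short `uh` of 100ms would get the 50ms floor under these assumptions, while a two-second `ummmmm` would hit the 120ms ceiling unless a neighboring word is closer than that. The quiet-window scan is deliberately naive; a real implementation would likely also reject windows that contain breaths or clicks.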
## The denoiser is sneaky

ffmpeg has a built-in noise reducer, and you can run it on the audio at various points in the pipeline. The catch: denoising smooths out the very details (volume bumps and pitch wiggles) that the detectors rely on to find fillers. So it matters *when* you do it. erm has four modes:

| Mode | Detection looks at | The output is cut from |
| --- | --- | --- |
| `none` | the original | the original |
| `pre` | a denoised copy | the denoised copy |
| `post` | the original | the original; denoised at the end |
| `hybrid` | the original | a denoised copy |

`hybrid` is the default, and the one you want: detection runs on the original audio (so it can see all the cues), but the actual cuts come from a clean, denoised copy (so the splices sound nice).

`pre` looks sensible but is the worst option, because running the detectors on denoised audio hides the very things they’re looking for.

## Validation

Audio renders can break in subtle ways, so there’s a `validate` subcommand:

```
uvx erm validate input.wav cleaned.wav --cuts cuts.json
```

It runs three checks:

- The output file actually opens.
- The output is shorter than the input by roughly the total length of the cuts (within a small margin).
- When you transcribe the cleaned file back to text, no fillers come back.

That last one is the useful one. It’s end-to-end: it tells you the tool actually did what it claimed.

## What it won’t touch

It leaves `like`, `you know`, and `I mean` alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they *are* the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

## Try it

The quickest way is with [uv](https://github.com/astral-sh/uv), which fetches and runs the tool in one step without a permanent install:

```
uvx erm input.wav --dry-run   # see what would be cut
uvx erm input.wav             # render
```

If you’d rather install it the usual way:

```
pip install erm     # or: pipx install erm
erm input.wav
```

You’ll also need `ffmpeg` and `ffprobe` on your `PATH` (`brew install ffmpeg` on macOS).

[github.com/dougcalobrisi/erm](https://github.com/dougcalobrisi/erm). Audio stays local. If you record voice notes or podcasts and your every other word is `um`, give it a try.
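
For readers who want to see the earlier detection and cut-refinement steps in code, here is a minimal sketch of three of them: matching elongated fillers against a stem, snapping a cut endpoint to a zero crossing, and merging cuts that would leave a sliver shorter than about 120ms. It is an approximation for illustration, not erm’s implementation. The stem list, the function names, and the collapse-repeated-letters trick are assumptions, and the “slide to the quietest spot within 60ms” step is left out for brevity.

```python
import re

# Assumed stem list; the post only names `um`, `uh`, `er`, "etc."
FILLER_STEMS = {"um", "uh", "er", "erm"}


def is_filler(token):
    """True if a transcript token is a filler, including elongated forms.

    Lowercases, strips non-letters, then collapses runs of a repeated
    letter so that "Ummmm," becomes "um" before the stem lookup.
    """
    letters = re.sub(r"[^a-z]", "", token.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    return collapsed in FILLER_STEMS


def nearest_zero_crossing(samples, index, max_shift):
    """Return the sample index closest to `index` where the waveform
    touches or crosses zero, searching at most `max_shift` samples away.
    Falls back to `index` if no crossing is found in range."""
    def crosses(i):
        if not 0 < i < len(samples):
            return False
        return samples[i] == 0 or (samples[i - 1] < 0) != (samples[i] < 0)

    for offset in range(max_shift + 1):
        for i in (index - offset, index + offset):
            if crosses(i):
                return i
    return index


def merge_close_cuts(cuts, min_gap_s=0.120):
    """Merge (start, end) cuts, in seconds, whenever the audio left
    between two neighbors would be shorter than `min_gap_s`."""
    merged = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

One known weakness of the letter-collapsing approach is that it would also flag the real word "err". A production matcher would need to be more careful about cases like that.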
