@svpino: I've built two voice pipelines for two different companies. They both look like this: Audio → STT → Clean transcript → …
Summary
Santiago highlights a limitation of traditional STT pipelines, which keep the words but lose tone and emotion. He then introduces Velma, a voice-native AI model from Modulate that analyzes raw audio to capture intent, emotion, and other acoustic signals. It is available via API at roughly 10x lower cost than pushing audio through an LLM.
View Cached Full Text
Cached at: 06/05/26, 03:18 PM
I’ve built two voice pipelines for two different companies.
They both look like this:
Audio → STT → Clean transcript → NLP → Classify → Act
This works, but there’s still a problem I can’t solve.
Every time I convert audio to text, I’m keeping the words but throwing away the meaning. Tone, hesitation, sarcasm, and stress are all gone. I have the text, but miss its soul.
The folks at @modulate_ai reached out and showed me how to solve this.
Velma is the voice model that’s been running inside Call of Duty and GTA Online to catch toxicity in real time.
This model skips the transcript entirely and works directly on the raw audio. This allows the model to take into account the “invisible clues” other models miss.
It can detect up to 150 invisible clues that no one else does!
You can access Velma through an API, and it’s ~10x cheaper than pushing audio through an LLM.
If you want to give it a try, use this link to get 1,000 free credits:
http://modulate.ai/api/velma?utm_source=x&utm_medium=influencer&utm_campaign=velmaapi&utm_term=socialpost&utm_content=santiago…
Thanks to the team for partnering with me on this post.
Velma API
Understand the true meaning of every conversation
Transcription discards signals like emotion, tone and other audio cues that carry what a conversation actually means. Velma is a voice-native model that listens to the audio itself.
Velma turns voice conversations into signals and behaviors you can act on — out of the box, no LLM needed. The future of voice AI is built with Velma.
MEET VELMA
Audio-native AI that identifies and escalates your risks
THE VELMA DIFFERENCE
Transcription captures words. Velma captures meaning.
Words are just the surface. Velma hears the full picture.
Word-based transcription discards the true meaning of a conversation. Velma leverages acoustic signals to understand conversations like a human.
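The point about transcription discarding meaning can be illustrated with a toy example. This is only a sketch: the `Utterance` class, its two acoustic fields, and the `transcribe` function are hypothetical stand-ins invented for illustration, not part of Velma or any STT library. It shows two clips that differ acoustically but become indistinguishable once only the words are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a spoken clip: words plus two made-up acoustic features."""
    words: str
    pause_before_ms: int      # hypothetical hesitation measure
    pitch_variation: float    # hypothetical prosody measure


def transcribe(utterance: Utterance) -> str:
    """Mimic an STT step: only the words survive."""
    return utterance.words


calm = Utterance("Sure, that works for me.", pause_before_ms=150, pitch_variation=0.2)
strained = Utterance("Sure, that works for me.", pause_before_ms=1800, pitch_variation=0.9)

# The two clips differ acoustically...
print(calm == strained)                          # False
# ...but any text-only stage downstream sees identical input.
print(transcribe(calm) == transcribe(strained))  # True
```

Anything that runs after `transcribe` has no way to tell the two clips apart, which is the gap an audio-native model is meant to close.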
THE INDUSTRY STANDARD
Transcription + LLM pipeline
Voice signals discarded
Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm and many more
WHAT TRANSCRIPTION CAPTURES
1 layer
Misunderstands intent and vulnerability
Loses anger, frustration, fear, joy, sarcasm
Ignores pauses or unique delivery
Overlooks interruptions and side comments
Deception and stress cues: Lost. Misses hesitation and vocal anxiety.
Acoustic authenticity: Lost. Cannot catch deepfakes or spoofing.
VELMA BY MODULATE
Voice-native AI
Voice signals analyzed
Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.
WHAT VELMA CAPTURES
7 layers
Best-in-class transcription accuracy
Intent and behavior: Captured. Any behavior detectable in real time.
20+ emotions from the acoustic signal
Pitch, rhythm, emphasis, pacing
Multi-speaker diarization and patterns
Deception and stress cues: Captured. Vocal stress, lying, coercion signals.
Acoustic authenticity: Captured. #1 deepfake detection on Hugging Face.
BEHAVIORS
Define the risks that matter to your business. Velma hears them in the audio.
Tell Velma what matters — edit any behavior or write your own, all in plain language. Velma uses every audio signal to detect them accurately.
You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.
Velma vs. industry standard
Audio-native capabilities from a better architecture.
Voice-native Ensemble Listening Model (ELM)
Transcription + LLM pipeline
100+ specialized sub-models, each optimized for a specific signal or task
A transcript without audio signals + text-based LLM
Understands emotion from audio, not word choice. 20+ emotions.
None built-in. Requires a separate SER model.
Tone, emotion, prosody, rhythm, vocal stress.
Laughing, shouting, crying, hesitation, pitch, pacing
98.9% accuracy, #1 on Hugging Face, same API call
Not a feature. Separate model + pipeline stage.
Describe in plain English. Velma uses audio + text together for higher accuracy.
Possible via prompt engineering. Accuracy limited to what words alone can reveal.
50 by default, 100 more as templates: fraud, churn, compliance & escalation
None. Each requires prompt engineering + ongoing maintenance.
Industry-leading, handles overlap and noise
Varies; overlap is a common failure
Drop-in. Send audio, receive structured JSON. A few lines of code.
Manage STT + LLM separately, plus custom logic to enrich context.
Build with Velma
Build on top of audio understanding, not transcription
Smarter voice agents
AI agents that understand voice signals for better responses.
AI voice guardrails
Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.
Emotion-driven apps
Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.
Conversation analytics
Replace your STT/ASR layer with better conversational insights.
Live coaching tools
Real-time agent assist that surfaces what to say next, based on how the call is going.
Anything you can imagine
Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
Where Velma fits
A drop-in layer for your voice stack
Understanding layer
Velma API
REST + WebSocket
Drop Velma into any voice pipeline. The underlying model handles the rest.
Velma is the #1 model for Conversation Understanding
Conversation Understanding Benchmark: Accuracy vs. Cost
Evaluates a model’s ability to identify conversation types, topics, speaker roles and key behaviors. (Methodology is linked on the original page.)
[Chart: accuracy score (axis ticks 0 to 10) plotted against inference cost (axis ticks $0.01 to $1.50), with a region labeled "Highest accuracy, lowest cost". Models plotted: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]
Get started in minutes
Drop-in by design — three steps, one API
Send audio
Point Velma at a file or a live stream — or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.
Velma analyzes
A single voice-native model does all the work — no separate transcription, LLM, or enrichment services to wire together and keep in sync.
Output, where and how you like it
A structured JSON — stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.
It really is this short — streaming, start to finish:
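# 1 · open a connection  2 · stream audio  3 · read results
# Assumes the `websockets` package for connect(); config, audio_chunk, and
# handle are placeholders you supply.
from websockets.sync.client import connect

ws = connect("wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=…")
ws.send(config)        # what to detect, or just use the default package
ws.send(audio_chunk)   # stream your audio
for event in ws:       # clips, behaviors, topics, summary…
    handle(event)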
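The snippet above leaves `audio_chunk` and `handle` undefined. Here is one way they could be filled in, as a minimal sketch under stated assumptions: the `iter_chunks` and `make_handler` helpers, the 3200-byte chunk size, and the `"type"` and `"behavior"` keys in the event JSON are all hypothetical and not taken from Velma's documentation. Check the real event schema before relying on any of them.

```python
import json
from typing import Callable, Iterator


def iter_chunks(audio: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield fixed-size slices of raw audio; the last one may be shorter."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def make_handler(on_alert: Callable[[dict], None]) -> Callable[[str], None]:
    """Build a handle(event) callback that forwards flagged events.

    The "type" and "behavior" keys are hypothetical placeholders for
    whatever the real event schema uses.
    """
    def handle(raw_event: str) -> None:
        event = json.loads(raw_event)
        if event.get("type") == "behavior":
            on_alert(event)
    return handle


# Collect alerts in a list here; in practice this could page someone
# or write to a warehouse.
alerts: list[dict] = []
handle = make_handler(alerts.append)

handle('{"type": "behavior", "behavior": "Unauthorized Data Disclosure"}')
handle('{"type": "topic", "topic": "billing"}')

print(len(list(iter_chunks(b"\x00" * 8000))))  # 3
print(alerts[0]["behavior"])                   # Unauthorized Data Disclosure
```

Passing the alert sink in as a callback keeps the handler independent of where results end up, which matches the page's "you decide where it goes" framing.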
Start building with Velma.
Grab an API key or try the playground to see Velma understand a real conversation.
More from Modulate
Explore Modulate’s other leading voice models
Audio-native APIs built for real-time performance — designed to drop right into your stack.
Deepfake Detection
Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.
Transcription
Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.
PII/PHI Redaction
Auto-redact sensitive content from both transcripts and audio. Compliance-ready.
Music Detection
Detect music vs. speech in any audio stream. Real-time and batch.