@svpino: I've built two voice pipelines for two different companies. They both look like this: Audio → STT → Clean transcript → …


Summary

Santiago highlights the limitation of traditional STT pipelines, which lose tone and emotion, then introduces Velma, a voice-native AI model from Modulate that analyzes raw audio to capture intent, emotion, and other acoustic signals. It is available via an API and is described as roughly 10x cheaper than pushing audio through an LLM.


Cached at: 06/05/26, 03:18 PM

I’ve built two voice pipelines for two different companies.

They both look like this:

Audio → STT → Clean transcript → NLP → Classify → Act

This works, but there’s still a problem I can’t solve.

Every time I convert audio to text, I’m keeping the words but throwing away the meaning. Tone, hesitation, sarcasm, and stress are all gone. I have the text, but miss its soul.
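To make that loss concrete, here is a minimal sketch of the pipeline in Python. Every function is a hypothetical stand-in (no real STT or NLP library is called), and the `Utterance` type is invented for illustration. The point is only that once `transcribe` returns a string, everything downstream sees the same input for a sincere and a sarcastic delivery.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical stand-in for a chunk of audio: words plus delivery."""
    words: str
    tone: str  # e.g. "sincere" or "sarcastic"; lives only in the audio


def transcribe(utterance: Utterance) -> str:
    # Stand-in for the STT step: only the words survive.
    return utterance.words


def clean(transcript: str) -> str:
    # Normalize case and whitespace.
    return " ".join(transcript.lower().split())


def classify(transcript: str) -> str:
    # Toy keyword classifier standing in for the NLP step.
    return "positive" if "great" in transcript else "neutral"


def run_pipeline(utterance: Utterance) -> str:
    return classify(clean(transcribe(utterance)))


sincere = Utterance("Great, thanks for the help", tone="sincere")
sarcastic = Utterance("Great, thanks for the help", tone="sarcastic")

# Tone never reaches classify(), so both deliveries get the same label.
labels = (run_pipeline(sincere), run_pipeline(sarcastic))
```

Any fix applied after the transcript step inherits this blind spot, which is why the rest of the post looks at models that work on the audio itself.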

The folks at @modulate_ai reached out and showed me how to solve this.

Velma is the voice model that’s been running inside Call of Duty and GTA Online to catch toxicity in real time.

This model skips the transcript entirely and works directly on the raw audio. This allows the model to take into account the “invisible clues” other models miss.

It can detect up to 150 invisible clues that no one else does!

You can access Velma through an API, and it’s ~10x cheaper than pushing audio through an LLM.

If you want to give it a try, use this link to get 1,000 free credits:

http://modulate.ai/api/velma?utm_source=x&utm_medium=influencer&utm_campaign=velmaapi&utm_term=socialpost&utm_content=santiago…

Thanks to the team for partnering with me on this post.


Velma API

Source: https://www.modulate.ai/api/velma?utm_source=x&utm_medium=influencer&utm_campaign=velmaapi&utm_term=socialpost&utm_content=santiago

Understand the true meaning of every conversation

Transcription discards signals like emotion, tone and other audio cues that carry what a conversation actually means. Velma is a voice-native model that listens to the audio itself.

Velma turns voice conversations into signals and behaviors you can act on — out of the box, no LLM needed. The future of voice AI is built with Velma.

MEET VELMA

Audio-native AI that identifies and escalates your risks

THE VELMA DIFFERENCE

Transcription captures words. Velma captures meaning.

Words are just the surface. Velma hears the full picture.

Word-based transcription discards the true meaning of a conversation. Velma leverages acoustic signals to understand conversations like a human.

THE INDUSTRY STANDARD

Transcription + LLM pipeline

Voice signals discarded

Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm and many more

WHAT TRANSCRIPTION CAPTURES

1 layer

Misunderstands intent and vulnerability

Loses anger, frustration, fear, joy, sarcasm

Ignores pauses or unique delivery

Overlooks interruptions and side comments

Deception and stress cues

Lost

Misses hesitation and vocal anxiety

Acoustic authenticity

Lost

Cannot catch deepfakes or spoofing

VELMA BY MODULATE

Voice-native AI

Voice signals analyzed

Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.

WHAT VELMA CAPTURES

7 layers

Best-in-class transcription accuracy

Intent and behavior

Captured

Any behavior detectable in real time

20+ emotions from the acoustic signal

Pitch, rhythm, emphasis, pacing

Multi-speaker diarization and patterns

Deception and stress cues

Captured

Vocal stress, lying, coercion signals

Acoustic authenticity

Captured

#1 deepfake detection on Hugging Face

BEHAVIORS

Define the risks that matter to your business. Velma hears them in the audio.

Tell Velma what matters — edit any behavior or write your own, all in plain language. Velma uses every audio signal to detect them accurately.


You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.
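The page does not show the request format, so the following is only a sketch of how plain-language behaviors might be packaged into a JSON config. The field names (`behaviors`, `name`, `description`) and the example description are assumptions for illustration, not the documented Velma schema.

```python
import json


def build_config(behaviors: dict[str, str]) -> str:
    """Serialize plain-language behavior descriptions into a JSON config.

    The schema is hypothetical; consult the Velma API docs for the real one.
    """
    for name, description in behaviors.items():
        if not description.strip():
            raise ValueError(f"behavior {name!r} needs a description")
    payload = {
        "behaviors": [
            {"name": name, "description": description.strip()}
            for name, description in behaviors.items()
        ]
    }
    return json.dumps(payload)


config = build_config({
    # Behavior name taken from the page; the description is made up.
    "Unauthorized Data Disclosure":
        "The agent shares account details before verifying the caller.",
})
```

Validating descriptions before sending keeps an empty or whitespace-only behavior from reaching the service, where it would be harder to diagnose.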

Velma vs. industry standard

Audio-native capabilities from a better architecture.

Voice-native Ensemble Listening Model (ELM)

Transcription + LLM pipeline

100+ specialized sub-models, each optimized for a specific signal or task

A transcript without audio signals + text-based LLM

Understands emotion from audio, not word choice. 20+ emotions.

None built-in. Requires a separate SER model.

Tone, emotion, prosody, rhythm, vocal stress.

Laughing, shouting, crying, hesitation, pitch, pacing

98.9% accuracy, #1 on Hugging Face, same API call

Not a feature. Separate model + pipeline stage.

Describe in plain English. Velma uses audio + text together for higher accuracy.

Possible via prompt engineering. Accuracy limited to what words alone can reveal.

50 by default, 100 more as templates: fraud, churn, compliance & escalation

None. Each requires prompt engineering + ongoing maintenance.

Industry-leading, handles overlap and noise

Varies; overlap is a common failure

Drop-in. Send audio, receive structured JSON. A few lines of code.

Manage STT + LLM separately, plus custom logic to enrich context.

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.

Where Velma fits

A drop-in layer for your voice stack

Understanding layer

Velma API

REST + WebSocket

Drop Velma into any voice pipeline. The underlying model handles the rest.

Velma is the #1 model for Conversation Understanding

Conversation Understanding Benchmark: Accuracy vs. Cost

Evaluates a model’s ability to identify conversation types, topics, speaker roles and key behaviors.

[Chart: accuracy score versus inference cost, labeled "Highest accuracy, lowest cost". Models shown: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2. Individual data points are not recoverable from the extracted text.]

Get started in minutes

Drop-in by design — three steps, one API

Send audio

Point Velma at a file or a live stream — or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.

Velma analyzes

A single voice-native model does all the work — no separate transcription, LLM, or enrichment services to wire together and keep in sync.

Output, where and how you like it

A structured JSON — stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.
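As a rough illustration of the "you decide where it goes" step, the sketch below routes one JSON event to an alert, a warehouse, or nowhere. The event shape (`type` and `confidence` fields) and the threshold are assumptions made for this example; the page does not document the actual output schema.

```python
import json


def route_event(raw: str, alert_threshold: float = 0.8) -> str:
    """Return 'alert', 'warehouse', or 'ignore' for one JSON event.

    Assumes a hypothetical event shape: {"type": ..., "confidence": ...}.
    """
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return "ignore"  # not JSON at all
    if not isinstance(event, dict):
        return "ignore"  # valid JSON, but not an object
    confidence = event.get("confidence")
    is_confident = (
        isinstance(confidence, (int, float)) and confidence >= alert_threshold
    )
    if event.get("type") == "behavior" and is_confident:
        return "alert"
    return "warehouse"


destination = route_event('{"type": "behavior", "confidence": 0.93}')
```

Keeping the routing decision in one pure function makes it easy to unit test and to swap the destinations for real queues or webhooks later.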

It really is this short — streaming, start to finish:

# 1 · open a connection  2 · stream audio  3 · read results
ws = connect("wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=…")
ws.send(config)  # what to detect, or just use the default package
ws.send(audio_chunk)  # stream your audio
for event in ws:  # clips, behaviors, topics, summary…
    handle(event)
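A slightly fuller sketch of the same three steps follows. It is written against any connection object that has `send()` and can be iterated for incoming messages, so it runs here with an in-memory fake instead of a network call. The config contents, the chunk size, and the assumption that results arrive as JSON text are all illustrative rather than documented Velma behavior.

```python
import json
from typing import Iterator, Protocol


class Connection(Protocol):
    def send(self, message) -> None: ...
    def __iter__(self) -> Iterator: ...


def chunks(audio: bytes, size: int = 3200) -> Iterator[bytes]:
    """Split raw audio bytes into fixed-size chunks (size is arbitrary)."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def stream(ws: Connection, config: dict, audio: bytes) -> list[dict]:
    ws.send(json.dumps(config))          # 1. say what to detect
    for chunk in chunks(audio):          # 2. stream the audio
        ws.send(chunk)
    return [json.loads(m) for m in ws]   # 3. read results (assumed JSON)


class FakeConnection:
    """In-memory stand-in so the sketch runs without a network."""

    def __init__(self, replies: list[str]):
        self.sent = []
        self._replies = replies

    def send(self, message) -> None:
        self.sent.append(message)

    def __iter__(self) -> Iterator[str]:
        return iter(self._replies)


fake = FakeConnection(['{"type": "summary"}'])
events = stream(fake, {"package": "default"}, b"\x00" * 8000)
```

With a real service, a synchronous WebSocket client such as the one in the third-party `websockets` package should fit the same `Connection` shape. A live integration would likely read events while still sending audio rather than after, and may need an explicit end-of-audio signal; neither detail is shown on the page.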

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.

More from Modulate

Explore Modulate’s other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.


Transcription

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.


PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.


Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.

