Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Reddit r/MachineLearning 06/05/26, 01:01 PM News

benchmark onnx-runtime hf-transformers gguf speech-recognition cpu-inference asr

Summary

A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime FP32 running about 37% faster than HF Transformers bfloat16 (RTF 0.328 vs 0.519) at roughly 2.7GB peak memory, while GGUF Q6_K cuts peak memory to 928MB at a slower RTF of 0.708.

Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in.

**Setup:** 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono.

**Results:**

|Inference path|RTF|Peak Memory|CPU utilization|
|:-|:-|:-|:-|
|HF Transformers bfloat16|0.519|~430MB delta|n/a|
|ONNX Runtime FP32 (onnx-asr)|0.328|2,667MB|49.9%|
|GGUF Q6_K (parakeet.cpp)|0.708|928MB|99.8%|

ONNX Runtime is 37% faster than HF Transformers bfloat16 on this hardware (RTF 0.328 vs 0.519). The gap comes from operator fusion and AVX2-optimized execution providers in ONNX Runtime that the PyTorch CPU path doesn't exploit as aggressively. Memory cost is the tradeoff: FP32 weights load at ~2.7GB peak.

GGUF Q6_K trades throughput for memory efficiency. Peak memory is 928MB vs 2.7GB, but RTF roughly doubles relative to ONNX (0.708 vs 0.328) and CPU utilization hits 99.8%. For memory-constrained deployments it's the right call. For sustained throughput on a box with headroom, ONNX wins.

One methodological note worth flagging for anyone doing ASR benchmarking with synthetic audio: espeak-ng inflated WER to 20.9% on a sentence set where gTTS got 4.65%. Both runtimes got identical WER within each run, confirming it's the TTS distribution mismatch rather than model or quantization quality. NVIDIA reports 1.93% on LibriSpeech; the gTTS number is a much more honest CPU-only proxy.

GitHub repo with code, raw results, and evaluation scripts in comments below.

*Disclosure: benchmark was run using Neo, a local AI engineering agent inside Claude Code using its MCP. Mentioning because the runtime and audio choices came from its research phase, not prior knowledge on my end.*
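For anyone who wants to sanity-check the arithmetic behind the RTF and WER figures, here is a minimal Python sketch. It is not the code from the linked repo: the function names are hypothetical, and the WER normalization (lowercasing and splitting on whitespace) is an assumption, since the post doesn't say how transcripts were normalized. RTF is taken as processing time divided by audio duration, and WER as word-level edit distance divided by reference length.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rtf_reduction(baseline_rtf: float, candidate_rtf: float) -> float:
    """Fractional drop in RTF of the candidate relative to the baseline."""
    return (baseline_rtf - candidate_rtf) / baseline_rtf


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: text is normalized by lowercasing and whitespace splitting only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    audio_s = 16.78  # clip length from the post
    reported = [("HF bf16", 0.519), ("ONNX FP32", 0.328), ("GGUF Q6_K", 0.708)]
    for name, rtf in reported:
        # Implied wall-clock time is just RTF times the clip duration.
        print(f"{name}: about {rtf * audio_s:.2f}s for {audio_s}s of audio")
    print(f"ONNX vs HF RTF reduction: {rtf_reduction(0.519, 0.328):.1%}")
```

Read this way, the "37% faster" figure is a reduction in RTF; expressed as throughput, the same numbers give ONNX Runtime roughly 1.58x the speed of the HF path (0.519 / 0.328).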


Similar Articles

Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P]

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
