Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]
Summary
A benchmark of ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware finds that ONNX Runtime runs inference 37% faster than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.
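The summary does not say how the 37% figure is defined or measured. As a minimal sketch, assuming it refers to a reduction in wall-clock latency relative to the baseline, a comparison of this kind could be computed as follows. The helper names (`median_latency`, `percent_faster`, `real_time_factor`) are hypothetical and do not come from the benchmark itself.

```python
import time
from statistics import median


def median_latency(fn, warmup=2, runs=5):
    """Median wall-clock seconds per call of fn(), measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def percent_faster(baseline_s, candidate_s):
    """Latency reduction of the candidate relative to the baseline, in percent.

    Assumption: "X% faster" means X% less time per inference,
    not X% more throughput.
    """
    return (baseline_s - candidate_s) / baseline_s * 100.0


def real_time_factor(latency_s, audio_s):
    """Processing time divided by audio duration; below 1.0 beats real time."""
    return latency_s / audio_s
```

Here `fn` would be a zero-argument callable that wraps one transcription call for a given backend. With made-up timings of 10.0 s for the baseline and 6.3 s for the candidate, `percent_faster` gives roughly 37%; under the alternative throughput reading, the same pair of timings would give a different percentage, so the definition should be stated alongside any such number.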
Similar Articles
I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU
The founder of Omi Health fine-tuned NVIDIA's Parakeet TDT 0.6B for medical ASR and released Omi Med STT v1, an open-weights model that achieves competitive medical WER while running locally on Mac, CUDA, or CPU.
Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]
The author describes hitting diminishing returns with FP16 + ONNX + pruning on a 162 MB transformer and asks which step to try next: quantization, distillation, low-rank factorization, or hardware-specific tricks.
I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
NVIDIA's Parakeet speech-to-text models have been ported to pure C++/ggml, achieving byte-identical output to NeMo, up to 5x faster inference on GPU, and quantized GGUF variants for efficient deployment anywhere without Python or PyTorch.
Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same
A developer benchmarks Gemma 4 E4B using Google's LiteRT engine against a Q4 GGUF quant, finding a ~2.4x speedup in text generation due to multi-token prediction (MTP), but only 1.1x in image captioning. The post provides a Python wrapper for an OpenAI-compatible endpoint, though with limitations such as deterministic output and a single-session engine.
Forge-UGC: FX optimization and register-graph engine for universal graph compiler
Forge-UGC is a four-phase universal graph compiler that speeds up transformer deployment on NPUs, cutting compilation time by 6.9-9.2×, inference latency by 18-36%, and energy by 30-41% versus OpenVINO/ONNX Runtime.