21 GPUs benchmarked running a small TTS model (VRAM peak: 5 GB)

Reddit r/LocalLLaMA 05/18/26, 09:46 PM News

gpu-benchmark tts voice-cloning omni-voice inference consumer-gpus

Summary

A user benchmarks 21 mostly consumer GPUs rented on vast.ai running a small TTS model (OmniVoice) with a peak VRAM usage of about 5 GB, reporting each GPU's speed relative to real time and to an RTX 3090.

I rented different GPUs on vast.ai for a few minutes each to benchmark a small TTS model, OmniVoice, with a peak VRAM usage of about 5 GB. I wanted to see how various, mostly consumer, GPUs would stack up against my own RTX 3090. This is by no means an extensive or scientific analysis, but I think it gives a rough estimate of how these GPUs perform relative to each other. xRT means "times real-time": it shows how many times faster than real time the GPU generates audio. Each figure is the average of 3 runs on a short paragraph, with reference audio provided (voice cloning).
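To make the xRT metric concrete, here is a minimal sketch of how such a figure could be computed. It is not the script used for the benchmark. The function names and all timings are made up for illustration, and it assumes xRT is audio duration divided by generation time, averaged per run (the post does not say whether runs were averaged this way or pooled).

```python
from statistics import mean


def xrt(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def mean_xrt(runs):
    """Average xRT over a list of (audio_seconds, generation_seconds) pairs."""
    return mean(xrt(audio, gen) for audio, gen in runs)


def relative_to_baseline(gpu_xrt: float, baseline_xrt: float) -> float:
    """How many times faster a GPU is than the baseline GPU."""
    return gpu_xrt / baseline_xrt


# Hypothetical timings, NOT measurements from the post:
# three runs each, generating a 20-second clip.
baseline_runs = [(20.0, 2.0), (20.0, 2.5), (20.0, 2.0)]   # stand-in for the RTX 3090
other_runs = [(20.0, 1.0), (20.0, 1.0), (20.0, 1.25)]     # stand-in for another GPU

baseline = mean_xrt(baseline_runs)
other = mean_xrt(other_runs)
ratio = relative_to_baseline(other, baseline)

print(f"baseline: {baseline:.2f} xRT")
print(f"other:    {other:.2f} xRT ({ratio:.2f}x the baseline)")
```

An xRT of 10 means 10 seconds of audio are generated per second of compute, so anything above 1 is faster than real time.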

Similar Articles

Benchmarks of 20 small LLMs on a 6GB RTX 4050

Reddit r/LocalLLaMA

A detailed benchmark of 20 small LLMs quantized for a 6GB GPU, measuring speed and VRAM usage at various context lengths, with qualitative probing for tool-use and instruction following. The report aims to help users with modest hardware choose models for local, private automation tasks.

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

Reddit r/LocalLLaMA

The author ran 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends, revealing that memory bandwidth dominates decode speed, the RTX 5070 beats the 3090 on small models, and reasoning models appear ~5x slower due to hidden reasoning content.

Qwen 35B-A3B is very usable with 12GB of VRAM

Reddit r/LocalLLaMA

A user benchmarks Qwen 35B-A3B (a 35B MoE model) on a 12GB RTX 3060, finding that 12GB VRAM is a practical sweet spot for running the model with 32k context, achieving ~47 t/s generation.

RTX Pro 4500 Blackwell Performance Numbers

Reddit r/LocalLLaMA

A user shares performance benchmarks comparing the Nvidia RTX Pro 4500 Blackwell 32GB GPU against the RTX 5060 Ti 16GB for AI inference, showing 1.6-6x speed improvements depending on model size and quantization.

@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …

X AI KOLs Timeline

A technique called luce spark allows the Qwen 35B-A3B MoE model to run on a 16GB GPU (like RTX 3090) by learning which experts are frequently used and streaming the rest from RAM, achieving ~100 tok/s without a VRAM bottleneck.
