21 GPUs benchmarked running a small TTS model (VRAM peak: 5GB)
Summary
A user benchmarks 21 consumer GPUs on vast.ai running a small TTS model (OmniVoice) with a peak VRAM use of 5GB, and compares each card's speed relative to real time and to an RTX 3090.
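The two comparisons named in the summary can be sketched as follows. This is a minimal illustration, not the benchmark author's script: the function names are hypothetical, the timings are made-up placeholders rather than results from the post, and it assumes "relative to real time" means seconds of audio produced per second of wall-clock time (some reports use the inverse ratio instead).

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time.

    Values above 1.0 mean the model synthesizes faster than real time.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def relative_to_baseline(factors: dict[str, float], baseline: str) -> dict[str, float]:
    """Express every GPU's real-time factor as a multiple of the baseline GPU's."""
    base = factors[baseline]
    return {gpu: factor / base for gpu, factor in factors.items()}


# Hypothetical measurements: (audio seconds generated, wall-clock seconds taken).
# "GPU A" is a placeholder name, not a card from the benchmark.
runs = {
    "RTX 3090": (60.0, 6.0),
    "GPU A": (60.0, 12.0),
}

factors = {gpu: realtime_factor(audio, wall) for gpu, (audio, wall) in runs.items()}
relative = relative_to_baseline(factors, "RTX 3090")

for gpu in runs:
    print(f"{gpu}: {factors[gpu]:.1f}x real time, {relative[gpu]:.2f}x RTX 3090")
```

Normalizing against one fixed card makes results comparable across rented machines, since the baseline absorbs differences in clip length and test text.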
Similar Articles
Benchmarks of 20 small LLMs on a 6GB RTX 4050
A detailed benchmark of 20 small LLMs quantized for a 6GB GPU, measuring speed and VRAM usage at various context lengths, with qualitative probing for tool-use and instruction following. The report aims to help users with modest hardware choose models for local, private automation tasks.
Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers
The author ran 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends, revealing that memory bandwidth dominates decode speed, the RTX 5070 beats the 3090 on small models, and reasoning models appear ~5x slower due to hidden reasoning content.
Qwen 35B-A3B is very usable with 12GB of VRAM
A user benchmarks Qwen 35B-A3B (a 35B MoE model) on a 12GB RTX 3060, finding that 12GB VRAM is a practical sweet spot for running the model with 32k context, achieving ~47 t/s generation.
RTX Pro 4500 Blackwell Performance Numbers
A user shares performance benchmarks comparing the Nvidia RTX Pro 4500 Blackwell 32GB GPU against the RTX 5060 Ti 16GB for AI inference, showing 1.6-6x speed improvements depending on model size and quantization.
@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …
A technique called luce spark lets the Qwen 35B-A3B MoE model run on a 16GB GPU (like RTX 3090) by learning which experts are used most often and streaming the rest from RAM, achieving ~100 tok/s without a VRAM bottleneck.