Tag
A deep benchmark of 8 tiny LLMs (135M to 1B parameters) on a $250 Jetson Orin Nano Super across four power modes finds 25 W to be Pareto-optimal, with SmolLM2-135M achieving 165.1 tok/s and the best efficiency.
A user benchmarks RTX 5090 and RTX PRO 6000 GPUs on AI diffusion tasks, comparing performance at different power limits and showing the tradeoff between speed and power consumption.
A user shares power-limit testing on a 4x RTX 3090 setup running Qwen3.6-27B with vLLM, finding 220 W to be the sweet spot for peak efficiency with minimal throughput loss.
A user benchmarks the Nvidia RTX 5090 for LLM inference with llama.cpp, measuring prompt processing and token generation at various power limits. Prompt processing proves more sensitive to the power limit than token generation, and the post notes differences from the RTX 4090.
A user benchmarks a dual Asus GX10 (DGX Spark) setup running MiniMax-M2.7-AWQ-4bit, achieving 30–40 tok/s while each unit draws only ~100 W, replacing noisy multi-GPU rigs.