[Benchmark] 5090RTX: Prompt Parsing, Token Generation and Power Level

Reddit r/LocalLLaMA 05/14/26, 11:38 AM News

gpu-benchmarking power-efficiency llm-inference nvidia-5090 llama-cpp token-generation prompt-processing

Summary

A user benchmarks the Nvidia 5090 RTX GPU for LLM inference using llama.cpp, measuring prompt processing and token generation at various power levels, finding that prompt processing is more sensitive to power limits than token generation, and noting differences from the 4090 RTX.

Inspired by [https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop\_wasting\_electricity/](https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/), I decided to put my 5090 to the test and see what the curves look like for this device and whether there are any obvious sweet spots (apart from setting it to the 400 W minimum).

**Graphs and outcomes:**

https://preview.redd.it/t0icb8j7831h1.png?width=1700&format=png&auto=webp&s=f787b987c14ff1670d26171304dbdfc6e9fc3a69

https://preview.redd.it/6pe7k7j7831h1.png?width=1700&format=png&auto=webp&s=62b08ebab967f7af6dc8a7a865b2d22856d54a0c

https://preview.redd.it/vya398j7831h1.png?width=1700&format=png&auto=webp&s=d7f4330159964e5373266c717a1cde7c491df3f3

https://preview.redd.it/o7inv8j7831h1.png?width=1700&format=png&auto=webp&s=0baced5e3ffd1b33558bf9085d7ffea0622ce3f2

**Inputs:**

Backend: llama.cpp in a docker container, FA on, batch 2048, max context 122k.

Model: [https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced](https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced)

Quant: Q6\_K\_P

Hardware: Threadripper 6970, 2 channel RAM 64GB, 5090RTX

Prompt: a 30k prompt composed of three 10k copies of the same benchmark text covering heavy reasoning, math and computations (available on request). It was generated by QWEN 3.6 specifically for benchmarking.

**Methodology:**

Generation was stopped after 2 minutes to keep the sessions short and because the TG metric levels off asymptotically beyond that point.

Measurements were performed on a warm card, because letting the card cool down between sessions would have taken too long.

Between measurements the server was restarted completely to reset the KV cache, so that PP was measured properly on the same input each time.

**Power Level Range:** 400 W to 600 W in 25 W steps

**Notes:**

The maximum power draw registered was 592 W with the PL set to 600 W; sustained load never reached 600 W, stabilizing at 580 W even when uncapped.
In all other runs, peak values exceeded the set PL by 10-12 W, reflecting the sharp power spikes the 5090RTX is already known for.

A cold card is faster than a warm card by 2-3%, which makes sustained-load tasks naturally slower than interactive, human-driven ones.

Prompt Processing is much more sensitive to the power limit, while Token Generation is almost linear at these numbers.

This is not exactly an apples-to-apples comparison with the setup used in the [https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop\_wasting\_electricity/](https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/) post, but the difference between the 4090RTX and the 5090RTX seems to go beyond more power, and it does not apply equally to PP and TG:

|PL|PP 5090|PP 4090|PP ratio (5090/4090)|TG 5090|TG 4090|TG ratio (5090/4090)|
|:-|:-|:-|:-|:-|:-|:-|
|450 W|2273|2113|1.076|49.3|41|1.202|
|425 W|2248|2093|1.074|48.9|41.6|1.175|
|400 W|2135|2061|1.036|48.7|42.5|1.146|
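
For anyone who wants to check the comparison columns or extend them with their own runs, here is a minimal Python sketch that recomputes the 5090/4090 ratios from the table values and compares how much PP and TG drop on the 5090 between 450 W and 400 W. The names `rows`, `ratios` and `relative_drop` are made up for this illustration and are not part of the original benchmark setup.

```python
# Values copied from the comparison table in the post.
# Each row: (power limit in W, PP 5090, PP 4090, TG 5090, TG 4090)
rows = [
    (450, 2273, 2113, 49.3, 41.0),
    (425, 2248, 2093, 48.9, 41.6),
    (400, 2135, 2061, 48.7, 42.5),
]


def ratios(rows):
    """Return (power limit, PP ratio, TG ratio), each ratio being 5090 / 4090."""
    return [(pl, pp50 / pp40, tg50 / tg40) for pl, pp50, pp40, tg50, tg40 in rows]


def relative_drop(high, low):
    """Fraction lost when going from the higher value to the lower one."""
    return (high - low) / high


for pl, pp_ratio, tg_ratio in ratios(rows):
    print(f"{pl} W: PP ratio {pp_ratio:.3f}, TG ratio {tg_ratio:.3f}")

# 5090 only: how much each metric loses going from 450 W down to 400 W.
pp_drop = relative_drop(rows[0][1], rows[-1][1])
tg_drop = relative_drop(rows[0][3], rows[-1][3])
print(f"5090, 450 W -> 400 W: PP drops {pp_drop:.1%}, TG drops {tg_drop:.1%}")
```

On these three rows the 5090 loses about 6% of PP but only about 1% of TG between 450 W and 400 W, which is consistent with the note that Prompt Processing is the more power-sensitive metric.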

Similar Articles

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks

@Snixtp: More efficiency tests on a single 3090 TL;DR: - I tested 8 local LLMs on a single RTX 3090, power limit from 100W to 45…

Small comparison on full compute performance (Anima) of 5090 (600,475 and 400W) vs 6000 PRO MaxQ (325W), and 6000 PRO WS/SE (600W).

Benchmarks of 20 small LLMs on a 6GB RTX 4050

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp
