RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Reddit r/LocalLLaMA News

Summary

A developer shares local inference benchmarks and a systemd configuration for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post asks for throughput-optimization tips, whether a larger model would fit, and what use cases the setup suits.

I have a server running a 4500 Blackwell on CUDA 13.1 and NVIDIA driver 595.58.03, with 48 GB of memory assigned to it. I have llama.cpp build dcad77cc3 (8933) with Qwen3.6-27B UD-Q5_K_XL loaded and connected it to Roo Code. Seems OK. Is there anything I am missing, or can I run a larger model? I guess I am looking for it to run a little better / smarter. I'm building stuff in UE5 now but using Codex and Claude mostly. What uses can I put this to?

These are API tests. First, the llama-bench output:

```
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32126 MiB):
  Device 0: NVIDIA RTX PRO 4500 Blackwell, compute capability 12.0, VMM: yes, VRAM: 32126 MiB
```

| model                    | size      | params  | backend | ngl | fa | test  | t/s             |
| ------------------------ | --------: | ------: | ------- | --: | -: | ----: | --------------: |
| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | CUDA    | 999 |  1 | pp512 | 1751.21 ± 54.18 |
| qwen35 27B Q5_K - Medium | 18.65 GiB | 26.90 B | CUDA    | 999 |  1 | tg128 | 35.83 ± 0.02    |

build: dcad77cc3 (8933)

And these are the results from the server's timings:

```
"prompt_n": 31,
"prompt_per_second": 166.60307087079664,
"predicted_n": 300,
"predicted_ms": 8429.475,
"predicted_per_second": 35.58940503412134
```

This is the systemd unit:

```ini
[Unit]
Description=llama.cpp server — Qwen3.6-27B UD-Q5_K_XL (thinking, precise coding)

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /opt/llama.cpp/models/Qwen3.6-27B/Qwen3.6-27B-UD-Q5_K_XL.gguf \
  --alias Qwen3.6-27B \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --jinja \
  --threads 16 \
  --batch-size 512 \
  --ubatch-size 512 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
Restart=on-failure
RestartSec=10
TimeoutStartSec=300
```
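For anyone wanting to deploy a unit like the one above, a minimal sketch of installing and smoke-testing it (the filename `llama-qwen.service` is my own placeholder; llama-server serves an OpenAI-compatible API on port 8080 by default):

```sh
# Install and start the unit (filename is a placeholder, not from the post).
cp llama-qwen.service /etc/systemd/system/llama-qwen.service
systemctl daemon-reload
systemctl enable --now llama-qwen.service

# Smoke test against llama-server's OpenAI-compatible endpoint (default port 8080).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B", "messages": [{"role": "user", "content": "Say hello."}]}'
```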
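And to reproduce numbers like the benchmark table above, an invocation along these lines should do it (model path assumed from the unit file; pp512 and tg128 are llama-bench's default tests):

```sh
# -ngl 999 offloads all layers, matching --n-gpu-layers 999;
# -fa 1 enables flash attention, matching --flash-attn on.
/opt/llama.cpp/build/bin/llama-bench \
  -m /opt/llama.cpp/models/Qwen3.6-27B/Qwen3.6-27B-UD-Q5_K_XL.gguf \
  -ngl 999 -fa 1
```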

Similar Articles

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

Reddit r/LocalLLaMA

A benchmark analysis of Qwen 3.6 27B MTP on 4x RTX 3090 GPUs, demonstrating that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.