6x P40 running Minimax M2.7_Q3_XL

Reddit r/LocalLLaMA 07/02/26, 06:49 PM News

home-lab benchmark p40 minimax llama.cpp gpu-inference quantization

Summary

A detailed home lab setup with 6x P40 GPUs running a quantized MiniMax M2.7 model, including hardware specs, benchmark results, and optimal configuration using llama.cpp.

I've been a lurker for a while and have been building my own home lab with P40s and MI50s. I've learned so much from the community and I just felt like it's time to give back. Even though I'm still learning, I'm sure this information will be valuable to someone out there. I'll be posting the MI50 details once I'm done fine-tuning my P40 box.

Hardware:

- Asus X99-E-WS (modded BIOS to support a large number of GPUs)
- Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- 128GB DDR4 RAM (mixed batch of non-ECC sticks)
- SSD
- 6x P40, 144GB VRAM total (Gen3 x8,x8,x8,x8,x8,x8)

[Image: memory distribution during benchmark]

The table below shows the benchmarks I ran, with my findings (throughput in tokens/s; "-" means not measured or not applicable):

| Test configuration | Context | pp512 | tg128 | pp512+tg128 | pp4096+tg128 | Result |
|---|---|---|---|---|---|---|
| F16 KV, FA on, batch 2048, ubatch 512 | 32,768 | 73.20 | 10.45 | 33.50 | 129.51 | Original baseline |
| F16 KV, FA on, batch 2048, ubatch 512 | 65,536 | 42.68 | 6.43 | 19.49 | 77.22 | Original baseline |
| F16 KV, FA on, batch 2048, ubatch 512 | 126,720 | 24.16 | 3.51 | 10.90 | 44.22 | Fits |
| Q8 KV, FA on, batch 2048, ubatch 512 | 65,536 | 42.53 | 6.14 | - | - | Slower than F16 |
| Q8 KV, FA on, batch 2048, ubatch 512 | 126,720 | 23.91 | 3.06 | - | - | Generation −12.8% |
| F16 KV, FA on, batch 1024, ubatch 256 | 32,768 | 105.76 | 10.70 | 37.34 | 128.94 | Strong improvement |
| F16 KV, FA on, batch 1024, ubatch 256 | 65,536 | 66.00 | 6.18 | 22.63 | 79.39 | Strong improvement |
| F16 KV, FA on, batch 2048, ubatch 256 | 32,768 | 105.91 | 10.50 | 37.41 | 129.42 | Selected |
| F16 KV, FA on, batch 2048, ubatch 256 | 65,536 | 65.86 | 6.38 | 22.63 | 79.37 | Selected |
| F16 KV, FA off, batch 1024, ubatch 256 | 32,768 | 34.16 | 2.72 | - | - | Major regression |
| F16 KV, FA off, batch 1024, ubatch 256 | 65,536 | 19.34 | 1.50 | - | - | Major regression |
| F16 KV, FA off, batch 1024, ubatch 256 | 126,720 | - | - | - | - | Context creation failed |
| F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 | 32,768 | 105.76 | 10.68 | 37.38 | 129.40 | No measurable gain |
| F16 KV, FA on, 2048/256, GGML_CUDA_P2P=1 | 65,536 | 66.00 | 6.18 | 22.63 | 79.35 | No measurable gain |
| F16 KV, FA on, 2048/256, launch queues 4× | 32,768 | 105.53 | 10.69 | 37.36 | 129.34 | No measurable gain |
| F16 KV, FA on, 2048/256, launch queues 4× | 65,536 | 66.03 | 6.18 | 22.63 | 79.34 | No measurable gain |
| Tensor split | - | - | - | - | - | Crashed / unsupported |
| Layer split, equal 1/1/1/1/1/1 | - | - | - | - | - | Stable and selected |

Here is where I ended up as far as optimal configuration is concerned:

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
    "$HOME/llama.cpp/build-cuda/bin/llama-server" \
      -m "$HOME/.lmstudio/models/unsloth/MiniMax-M2.7-GGUF/MiniMax-M2.7-UD-Q3_K_XL-00001-of-00004.gguf" \
      -dev CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5 \
      -ngl 999 \
      --fit off \
      --split-mode layer \
      --tensor-split 1,1,1,1,1,1 \
      --ctx-size 131072 \
      --parallel 1 \
      --cache-type-k f16 \
      --cache-type-v f16 \
      --batch-size 2048 \
      --ubatch-size 256 \
      --flash-attn on \
      --jinja \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 40 \
      --min-p 0.01 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --n-predict 8192 \
      --host 0.0.0.0 \
      --port 8080 \
      --timeout 30000
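
For anyone who wants to sanity-check the comparisons in the results table, here is a minimal sketch that recomputes relative changes from the raw throughput numbers. It assumes a POSIX shell with `awk` available. `pct_change` is a hypothetical helper name used only for illustration and is not part of llama.cpp. The input values are copied from the table.

```shell
#!/bin/sh
# pct_change BASE NEW
# Prints the relative change from BASE to NEW as a signed percentage
# with one decimal place. Hypothetical helper, not a llama.cpp tool.
pct_change() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%+.1f%%\n", (new - base) / base * 100 }'
}

# tg128 at 126,720 context: F16 KV (3.51) vs Q8 KV (3.06)
pct_change 3.51 3.06

# pp512 at 32,768 context, batch 2048: ubatch 512 (73.20) vs ubatch 256 (105.91)
pct_change 73.20 105.91

# tg128 at 32,768 context, batch 1024, ubatch 256: FA on (10.70) vs FA off (2.72)
pct_change 10.70 2.72
```

The first call reproduces the "Generation −12.8%" entry for Q8 KV. The other two put numbers on "Strong improvement" (about +44.7% prompt processing from the smaller ubatch) and "Major regression" (about −74.6% generation with flash attention off).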
