Someone just ran a 744B parameter model at 30 tok/s across 6 consumer GPUs in 6 different US states over the open internet

Reddit r/ArtificialInteligence 06/20/26, 10:45 AM Tools

decentralized-ai distributed-inference speculative-decoding open-source large-language-model cuda-graph shard

Summary

A researcher debuted Shard, achieving 30 tok/s inference on a 744B parameter model distributed across 6 consumer GPUs over the open internet, a 15-20x improvement over previous methods.

A researcher named leyten published a project called Shard this week and the results are genuinely exciting. They split GLM-5.2 (744B parameters) across 6 RTX Pro 6000 GPUs in Nevada, Texas, Washington, Minnesota, Missouri, and Utah, connected over regular WAN with 22-75ms latency between nodes, and achieved ~30 tokens/second. For context, the previous best attempt at this (Petals, 2022) got 1-2 tok/s on much smaller models. This is a 15-20x improvement and a meaningful moment for decentralized AI.

How they did it: three techniques combined.

Speculative decoding over WAN: a small draft model proposes K tokens, and the distributed large model verifies them all in one network round-trip. WAN latency is the scarce resource, so you amortize it.

Ring pipelining with direct return: the final node sends results directly back to the coordinator instead of relaying through every stage.

CUDA-graphed draft model: pre-compiling the draft model as a CUDA graph gave a 3.8-5.3x speedup.

Baseline to final:

Plain WAN decode: 1.87 tok/s
Async pipelining: 16.6 tok/s
CUDA-graphed draft: ~30 tok/s

Shard is the infrastructure powering c0mpute.ai, a network where anyone can contribute their GPU and earn USDC for running inference jobs. The network has its own token, $ZERO, which accrues value as the network grows. This result shows the foundation is real and the engineering is serious.

Every run has a published receipt with GPU UUIDs, IP addresses, latency measurements and output hashes. Code is open source.

Repo: github.com/leyten/shard
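
To make the latency argument behind speculative decoding and direct return concrete, here is a rough back-of-envelope model in Python. It is not taken from the Shard repo. The function names, the independent-acceptance assumption, and every number in the usage section are hypothetical, chosen only to illustrate why verifying K drafted tokens per round trip and skipping the relay back through the ring both help when network latency dominates.

```python
from typing import Optional, Sequence


def expected_tokens_per_round(k: int, accept_rate: float) -> float:
    """Expected tokens committed per verification round.

    Assumption (not from the post): each drafted token is accepted
    independently with probability accept_rate, and the large model always
    contributes one token of its own (a correction or a bonus token).
    This gives the standard closed form (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def round_trip_ms(hop_ms: Sequence[float],
                  direct_return_ms: Optional[float] = None) -> float:
    """Network time for one pass through the pipeline.

    hop_ms lists the forward hops (coordinator -> stage 1 -> ... -> last).
    Without direct return, the result is relayed back over the same hops.
    With direct return, the last stage answers the coordinator in one hop.
    """
    forward = sum(hop_ms)
    if direct_return_ms is None:
        return 2.0 * forward
    return forward + direct_return_ms


def tokens_per_second(k: int, accept_rate: float, network_ms: float,
                      verify_compute_ms: float,
                      draft_ms_per_token: float) -> float:
    """Throughput when each round costs draft time + network + verification."""
    round_ms = network_ms + verify_compute_ms + k * draft_ms_per_token
    return expected_tokens_per_round(k, accept_rate) * 1000.0 / round_ms


if __name__ == "__main__":
    # Hypothetical numbers: six forward hops of 40 ms each, 40 ms direct return.
    hops = [40.0] * 6
    relay = round_trip_ms(hops)
    direct = round_trip_ms(hops, direct_return_ms=40.0)
    for k in (0, 4, 8):
        tps = tokens_per_second(k, accept_rate=0.8, network_ms=direct,
                                verify_compute_ms=50.0, draft_ms_per_token=5.0)
        print(f"K={k}: relay={relay:.0f} ms, direct={direct:.0f} ms, "
              f"~{tps:.1f} tok/s")
```

In this model, K=0 is plain one-token-per-round-trip decoding. A faster draft model (a smaller draft_ms_per_token, which is what CUDA graph capture targets) shrinks the only per-token cost that grows with K.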

Similar Articles

@sudoingX: this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 o…

@rumgewieselt: Now its getting crazy ... 3x 1080 Ti (Pascal, 33GB VRAM) Qwen 3.6 27B MTP with 196K TurboQuant ~28-30 t/s consistently

@onusoz: 16x parallel Gemma-4-26B-A4B-NVFP4 runs 18 output tokens/s, aggregate 300 tok/s 🫪 1 DGX Spark with 128 GB unified memo…

$1800 (in GPU cost running with P2P running Qwen/Qwen3.6-27b-FP8 with 262K context and BF16 KV cache at 55 tok/s

Xiaomi & TileRT just hit 1,000+ TPS on a 1-Trillion Parameter model… on standard commodity GPUs. It’s over for custom silicon?
