Someone just ran a 744B parameter model at 30 tok/s across 6 consumer GPUs in 6 different US states over the open internet
Summary
A researcher debuted Shard, which reached 30 tok/s inference on a 744B-parameter model distributed across 6 consumer GPUs in 6 different US states over the open internet, a reported 15-20x improvement over previous methods.
Similar Articles
@sudoingX: this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 o…
A 31B parameter model runs locally on a laptop via Hermes agent at 15 tok/s, using 22.8 GB VRAM and 94 W power, highlighting fully autonomous, private AI inference without cloud dependencies.
@rumgewieselt: Now its getting crazy ... 3x 1080 Ti (Pascal, 33GB VRAM) Qwen 3.6 27B MTP with 196K TurboQuant ~28-30 t/s consistently
A user demonstrates successful local inference of a 27B parameter Qwen model across three GTX 1080 Ti GPUs, achieving approximately 28-30 tokens per second using TurboQuant optimization.
@onusoz: 16x parallel Gemma-4-26B-A4B-NVFP4 runs 18 output tokens/s, aggregate 300 tok/s 1 DGX Spark with 128 GB unified memo…
@onusoz demonstrates running 16 parallel instances of NVIDIA's quantized Gemma-4-26B-A4B-NVFP4 model on a single DGX Spark with 128GB unified memory, achieving 300 tok/s aggregate, showcasing high concurrency without flashinfer.
$1800 in GPU cost, running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context, and BF16 KV cache at 55 tok/s
A user shares a configuration of 4x RTX 5060 Ti 16GB with P2P to run Qwen3.6-27B-FP8 at 55 tok/s with 262K context, highlighting the low cost of about $1800 for single-user inference.
Xiaomi & TileRT just hit 1,000+ TPS on a 1-Trillion Parameter model… on standard commodity GPUs. It’s over for custom silicon?
Xiaomi and TileRT achieved over 1,000 tokens per second of inference on a 1-trillion-parameter model using standard commodity GPUs, suggesting a possible alternative to custom silicon.