Antirez says a branch implementing GLM 5.2 in DwarfStar will very likely be merged; the model could become the best option for a 512GB Mac Studio and might run across distributed 128GB MacBooks with 2-bit quantization.
A researcher debuted Shard, which achieves 30 tok/s inference on a 744B-parameter model distributed across six consumer GPUs over the open internet, a 15-20x improvement over previous methods.
A pull request to vLLM adds support for a tensor-parallel degree of 3 for MiniMax M3 with its NVFP4 quantization, enabling the model to run on 3x DGX Sparks with 87GB memory each.
vLLM integrates Mooncake Store for distributed KV cache reuse, enabling cross-node prefix caching to efficiently serve agentic workloads with high token reuse.
A tweet recommends a paper described as the bible of distributed inference.
A blog post guides readers through setting up a Raspberry Pi cluster for distributed training and inference, part of a series aimed at making distributed AI accessible using affordable hardware.
antirez announces that he has received an M5 Max 128GB MacBook Pro from audreyt, which he will use to develop DwarfStar4 and to experiment with distributed inference across M3 Max and M5 Max hardware.
Federation of Experts (FoE) restructures mixture-of-experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks and improving inference throughput and latency by up to 5.2x while maintaining generation quality.
A user shares their $25k hardware setup of two 512GB RAM M3 Ultra Mac Studios for running large language models locally, having tested DeepSeek V3 Q8 and GLM 5.1 Q4 via the exo distributed inference backend, while awaiting Kimi 2.6 MLX optimization.