ZeroGPU
Summary
ZeroGPU is a compute-efficient layer for AI inference that aims to optimize GPU usage and reduce costs.
Similar Articles
General Compute
General Compute is a product offering an inference cloud optimized for speed to run AI models.
@MaxForAI: http://Z.ai and this ZCube paper from Tsinghua—worth a read for anyone in Infra. Many people's first reaction when talking about AI infra is still GPU, memory, quantization, and inference frameworks. But once you get into long context and Prefill-Decode separation, the network is no longer just a 'supporting role' in the data center. Every...
ZCube is a new network architecture that flattens the topology and combines single-rail and multi-rail access to optimize KV cache transmission in long-context and Prefill-Decode (PD) separation scenarios. In the GLM-5.1 production cluster, it achieved a 33% reduction in switch and optical module costs, a 15% increase in GPU inference throughput, and a 40.6% decrease in P99 time to first token (TTFT).
Popping the GPU Bubble
Moondream's Photon inference engine eliminates GPU bubbles through pipelined decoding, achieving near-real-time VLM inference with up to 35% higher decode throughput.
How to achieve truly serverless GPUs (20 minute read)
Modal explains the four key ingredients they developed to spin up serverless GPU inference replicas in seconds instead of minutes, enabling efficient GPU allocation for variable AI workloads.
The GPUless Revolution: How Efficient AI Models Are Democratizing Artificial Intelligence
A quiet revolution is making powerful AI models runnable on consumer hardware without expensive GPUs. Breakthroughs in quantization and optimized implementations, such as llama.cpp's Gemma4 MTP support, are opening up access for hobbyists, small businesses, and edge computing use cases.