@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…


Summary

We're the first to run the full GLM-5.2 (753B, FP8) on RTX 4090s, achieved by porting its sparse-attention kernels to Ada GPUs, which puts a frontier open-weights model on commodity hardware.

We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and it officially ships for datacenter GPUs only: H100, H200, B200. We ported its sparse-attention kernel stack to consumer hardware. A frontier open model, off the scarce GPUs and onto the abundant kind. http://github.com/renning22/glm-5.2-4090… @Zai_org

Cached at: 06/18/26, 08:21 PM


renning22/glm-5.2-4090

Source: https://github.com/renning22/glm-5.2-4090

GLM-5.2 on RTX 4090

Running the full GLM-5.2 — the 753B-parameter SOTA open-weights model — in native FP8 on consumer NVIDIA RTX 4090 GPUs. As far as we know, the first time a DeepSeek-Sparse-Attention (DSA) model has run correctly on Ada (sm_89). The stock sglang / vLLM stack hard-requires Hopper (H100/H200) or Blackwell (B200) for GLM-5.2 — its sparse-attention kernels are gated to sm_90 / sm_100 with no Ada fallback. This repo is a drop-in ada_dsa.py that ports that whole kernel stack to the 4090.

prompt:  "The capital of France is"
output:  "Paris. Distance from Paris to Lyon is 391 km, while direct flight time is 1 h 5 min"

Highlights

  • Full 753B model, full FP8 — not a distilled or int4 variant. The complete GLM-5.2-FP8 weights.
  • 24× RTX 4090-48GB (3 nodes × 8), pipeline + tensor parallel — proven, coherent chat / reasoning / code.
  • ~10 tokens/sec single-stream (CUDA-graph) — interactive speed for the full 753B on commodity cards.
  • Every ported kernel is validated against a reference implementation, with error down to ~1e-6, including 0.999999 cosine similarity on the live model’s real tensors.
  • Open — the kernels, the one-call installer, and the verification scripts are all here.
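The validation claim above can be illustrated with a minimal sketch of a reference check. This is not the repo's verification script: the function names, the toy data, and the choice to treat ~1e-6 as an absolute tolerance are all assumptions made for illustration.

```python
import math

def max_abs_error(out, ref):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(out, ref))

def cosine_similarity(out, ref):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(out, ref))
    norm = math.sqrt(sum(a * a for a in out)) * math.sqrt(sum(b * b for b in ref))
    return dot / norm

def matches_reference(out, ref, atol=1e-6, min_cosine=0.999999):
    """Hypothetical acceptance rule: both thresholds are assumptions, not the repo's."""
    return max_abs_error(out, ref) <= atol and cosine_similarity(out, ref) >= min_cosine

# Toy data standing in for flattened kernel outputs (hypothetical values).
reference = [0.5, -1.25, 2.0, 3.75]
ported = [x + 1e-7 for x in reference]
print(matches_reference(ported, reference))
```

Checking both metrics is deliberate: the absolute error catches a single badly wrong element, while cosine similarity summarizes agreement across a whole tensor.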

Hardware sizing

The FP8 weights are ~753 GB, so the model has to be split across enough cards to hold the weights plus KV-cache and activations. Roughly:

| GPU | VRAM/card | GPUs needed | Layout | Status |
|---|---|---|---|---|
| RTX 4090 | 48 GB | 24 | TP=8 × PP=3 (3 nodes) | ✅ proven (this repo) |
| RTX 4090 | 24 GB | ~40–48 | TP=8 × PP=5–6 (5–6 nodes) | sizing estimate |
| RTX 5090 | 32 GB | ~32 | TP=8 × PP=4 (4 nodes) | sizing estimate¹ |

¹ The RTX 5090 is sm_120 (consumer Blackwell), which the stock sm_90/sm_100 DSA kernels also don’t cover — so it needs this same port (widen the capability guard to include sm_120). Only the 4090-48GB config is tested here; the others are VRAM-fit estimates (assume ~6–8 GB/card reserved for KV + activations + CUDA context, more for larger context windows).
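The VRAM-fit arithmetic behind these estimates can be sketched as follows. The function name and the rule of rounding up to whole 8-GPU nodes are assumptions for illustration; the inputs (~753 GB of weights, ~6–8 GB/card reserved, TP=8) come from the text above.

```python
import math

def gpus_needed(weights_gb, vram_gb, reserved_gb, tp_size=8):
    """Estimate (total GPUs, nodes) for a VRAM fit, rounding up to whole TP groups.

    Assumes one tensor-parallel group of `tp_size` GPUs per node and one
    pipeline stage per node, so the node count equals the PP degree.
    """
    usable_gb = vram_gb - reserved_gb
    if usable_gb <= 0:
        raise ValueError("reserved memory exceeds card VRAM")
    min_cards = math.ceil(weights_gb / usable_gb)   # cards needed to hold the weights
    nodes = math.ceil(min_cards / tp_size)          # round up to whole nodes
    return nodes * tp_size, nodes

# Conservative case: 8 GB/card reserved for KV-cache, activations, CUDA context.
for name, vram in [("RTX 4090-48GB", 48), ("RTX 4090-24GB", 24), ("RTX 5090-32GB", 32)]:
    total, nodes = gpus_needed(753, vram, reserved_gb=8)
    print(f"{name}: {total} GPUs, TP=8 x PP={nodes}")
```

With 8 GB reserved this lands on 24, 48, and 32 GPUs respectively, that is, the table's figures at the upper end of the 24 GB range. It is a fit estimate only and says nothing about throughput.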

Usage

Tested against an sglang build with the nsa / tilelang DSA backend, in an environment where tilelang is available (we grafted tilelang 0.1.11 + tvm-ffi from KTransformers).

  1. Put ada_dsa.py on PYTHONPATH.

  2. Apply the two small sglang edits (idempotent, backs up originals):

    python apply_sglang_patches.py
    

    This (a) calls ada_dsa.apply_patches() from nsa_indexer.py to swap the SM90+/SM100 DSA kernels, and (b) adds the one-line deep_gemm guard needed for CUDA-graph. (Both edits are documented in TECHNICAL.md if you’d rather do them by hand.)

  3. Launch (example: 24× RTX 4090-48GB, TP=8 × PP=3):

    export SGLANG_NSA_FUSE_TOPK=1            # use the ported fused page-mapping transforms
    export SGLANG_ENABLE_JIT_DEEPGEMM=0
    
    # --disable-shared-experts-fusion is REQUIRED on Ada (the MoE fix), see TECHNICAL.md.
    # Replace <rank0-ip> with the rank-0 node's IP and <0|1|2> with this node's rank.
    python -m sglang.launch_server \
      --model-path zai-org/GLM-5.2-FP8 \
      --tp-size 8 --pp-size 3 --nnodes 3 --dist-init-addr <rank0-ip>:30200 \
      --trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.85 \
      --attention-backend nsa --nsa-decode-backend tilelang --nsa-prefill-backend tilelang \
      --fp8-gemm-backend triton \
      --disable-shared-experts-fusion \
      --tool-call-parser glm47 --reasoning-parser glm45 \
      --node-rank <0|1|2> --host 0.0.0.0 --port 8000
    

--disable-shared-experts-fusion is required on Ada. CUDA-graph is on by default here and gives the full speed (about 10 tok/s single-stream vs about 2.5 in eager); it needs one extra one-line guard in deep_gemm_wrapper/entrypoint.py (see TECHNICAL.md). If you’d rather not patch that, add --disable-cuda-graph and run eager. Configure NCCL transport (NCCL_P2P_DISABLE / NCCL_IB_DISABLE) to match your fabric.
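Since the same command runs on every node with only --node-rank changing, a small helper can generate the per-node command lines. This is a convenience sketch, not part of the repo: the function name and the example IP 10.0.0.1 are hypothetical, and the flags are copied from the launch example above.

```python
import shlex

def launch_command(node_rank, rank0_ip, nnodes=3, tp_size=8, pp_size=3):
    """Build the sglang launch command line for one node (flags from the README example)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes})")
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "zai-org/GLM-5.2-FP8",
        "--tp-size", str(tp_size), "--pp-size", str(pp_size), "--nnodes", str(nnodes),
        "--dist-init-addr", f"{rank0_ip}:30200",
        "--trust-remote-code", "--kv-cache-dtype", "fp8_e4m3",
        "--mem-fraction-static", "0.85",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "tilelang", "--nsa-prefill-backend", "tilelang",
        "--fp8-gemm-backend", "triton",
        "--disable-shared-experts-fusion",  # required on Ada per the README
        "--tool-call-parser", "glm47", "--reasoning-parser", "glm45",
        "--node-rank", str(node_rank), "--host", "0.0.0.0", "--port", "8000",
    ]
    return shlex.join(args)  # shell-quoted, safe to paste into a terminal

# 10.0.0.1 is a placeholder for the rank-0 node's address.
for rank in range(3):
    print(launch_command(rank, "10.0.0.1"))
```

The two `export` lines from the example still need to be set in each node's environment before running the generated command.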

How it works

ada_dsa.py monkeypatches GLM-5.2’s SM90+/SM100-only DSA kernels — the lightning-indexer GEMM, the top-k + page-mapping, and the MLA sparse decode — with portable Triton + a non-WGMMA tilelang path, only on sub-Hopper GPUs. Plus one config fix for the MoE. Full write-up, kernel-by-kernel walkthrough, and the verification table: TECHNICAL.md.
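The general pattern of a capability-gated monkeypatch can be sketched as below. It illustrates the technique only and is not the code in ada_dsa.py: the function, the module stand-in, and the kernel name `indexer_gemm` are all hypothetical.

```python
from types import SimpleNamespace

HOPPER = (9, 0)  # compute capability of sm_90

def apply_patches_sketch(module, replacements, capability):
    """Swap attributes on `module` only when running on a sub-Hopper GPU.

    `capability` is a (major, minor) tuple, e.g. (8, 9) for Ada. In a real
    setup it could come from torch.cuda.get_device_capability().
    Returns the list of attribute names that were replaced.
    """
    if capability >= HOPPER:
        return []  # stock kernels already cover this GPU; leave them alone
    patched = []
    for name, portable_fn in replacements.items():
        if hasattr(module, name):  # only replace kernels that actually exist
            setattr(module, name, portable_fn)
            patched.append(name)
    return patched

# Hypothetical stand-in for a module exposing an sm_90-only kernel.
fake_module = SimpleNamespace(indexer_gemm=lambda: "stock sm_90 kernel")
done = apply_patches_sketch(
    fake_module, {"indexer_gemm": lambda: "portable Triton kernel"}, capability=(8, 9)
)
print(done, fake_module.indexer_gemm())
```

As the hardware-sizing footnote notes, covering the RTX 5090 (sm_120) would mean widening the guard rather than using a plain "below Hopper" comparison.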

Status

GLM-5.2 runs correctly on consumer hardware where the stock stack hard-crashes, at about 10 tokens/sec single-stream (CUDA-graph; about 2.5 in eager mode). That’s interactive speed for the full 753B model on commodity cards. The portable indexer / top-k / page-transform stack is model-agnostic and should apply to other DSA models (e.g. DeepSeek-V3.2-style) with minor adjustment.

License

Apache-2.0.


Built by @renning22.
