@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

X AI KOLs Timeline 06/18/26, 12:46 PM Models

glm-5.2 open-weights rtx-4090 sparse-attention consumer-gpu deepseek-sparse-attention port

Summary

We're the first to run the full GLM-5.2 (753B FP8) on RTX 4090s by porting sparse-attention kernels to Ada GPUs, enabling frontier open-weights model on commodity hardware.

We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and it officially ships for datacenter GPUs only: H100, H200, B200. We ported its sparse-attention kernel stack to consumer hardware. A frontier open model, off the scarce GPUs and onto the abundant kind. http://github.com/renning22/glm-5.2-4090… @Zai_org

Original Article

View Cached Full Text

Cached at: 06/18/26, 08:21 PM

We’re the first to make the full GLM-5.2 (FP8) run on RTX 4090s.

GLM-5.2 is the new 753B SOTA open-weights model, and it officially ships for datacenter GPUs only: H100, H200, B200. We ported its sparse-attention kernel stack to consumer hardware.

A frontier open model, off the scarce GPUs and onto the abundant kind.

http://github.com/renning22/glm-5.2-4090…

@Zai_org

renning22/glm-5.2-4090

Source: https://github.com/renning22/glm-5.2-4090

GLM-5.2 on RTX 4090

Running the full GLM-5.2 — the 753B-parameter SOTA open-weights model — in native FP8 on consumer NVIDIA RTX 4090 GPUs. As far as we know, the first time a DeepSeek-Sparse-Attention (DSA) model has run correctly on Ada (sm_89). The stock sglang / vLLM stack hard-requires Hopper (H100/H200) or Blackwell (B200) for GLM-5.2 — its sparse-attention kernels are gated to sm_90 / sm_100 with no Ada fallback. This repo is a drop-in ada_dsa.py that ports that whole kernel stack to the 4090.

prompt:  "The capital of France is"
output:  "Paris. Distance from Paris to Lyon is 391 km, while direct flight time is 1 h 5 min"

Highlights

Full 753B model, full FP8 — not a distilled or int4 variant. The complete GLM-5.2-FP8 weights.
24× RTX 4090-48GB (3 nodes × 8), pipeline + tensor parallel — proven, coherent chat / reasoning / code.
~10 tokens/sec single-stream (CUDA-graph) — interactive speed for the full 753B on commodity cards.
Every ported kernel validated against a reference, down to ~1e-6 — including 0.999999 cosine on the live model’s real tensors.
Open — the kernels, the one-call installer, and the verification scripts are all here.

Hardware sizing

The FP8 weights are ~753 GB, so the model has to be split across enough cards to hold the weights plus KV-cache and activations. Roughly:

GPU	VRAM/card	GPUs needed	Layout	Status
RTX 4090	48 GB	24	TP=8 × PP=3 (3 nodes)	✅ proven (this repo)
RTX 4090	24 GB	~40–48	TP=8 × PP=5–6 (5–6 nodes)	sizing estimate
RTX 5090	32 GB	~32	TP=8 × PP=4 (4 nodes)	sizing estimate¹

¹ The RTX 5090 is sm_120 (consumer Blackwell), which the stock sm_90/sm_100 DSA kernels also don’t cover — so it needs this same port (widen the capability guard to include sm_120). Only the 4090-48GB config is tested here; the others are VRAM-fit estimates (assume ~6–8 GB/card reserved for KV + activations + CUDA context, more for larger context windows).

Usage

Tested against an sglang build with the nsa / tilelang DSA backend, in an environment where tilelang is available (we grafted tilelang 0.1.11 + tvm-ffi from KTransformers).

Put ada_dsa.py on PYTHONPATH.
Apply the two small sglang edits (idempotent, backs up originals):
```
python apply_sglang_patches.py
```
This (a) calls ada_dsa.apply_patches() from nsa_indexer.py to swap the SM90+/SM100 DSA kernels, and (b) adds the one-line deep_gemm guard needed for CUDA-graph. (Both edits are documented in TECHNICAL.md if you’d rather do them by hand.)

Launch (example: 24× RTX 4090-48GB, TP=8 × PP=3):

export SGLANG_NSA_FUSE_TOPK=1            # use the ported fused page-mapping transforms
export SGLANG_ENABLE_JIT_DEEPGEMM=0

python -m sglang.launch_server \
  --model-path zai-org/GLM-5.2-FP8 \
  --tp-size 8 --pp-size 3 --nnodes 3 --dist-init-addr <rank0-ip>:30200 \
  --trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.85 \
  --attention-backend nsa --nsa-decode-backend tilelang --nsa-prefill-backend tilelang \
  --fp8-gemm-backend triton \
  --disable-shared-experts-fusion \      # REQUIRED on Ada (the MoE fix), see TECHNICAL.md
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --node-rank <0|1|2> --host 0.0.0.0 --port 8000

--disable-shared-experts-fusion is required on Ada. CUDA-graph is on by default here and gives the full speed (about 10 tok/s single-stream vs about 2.5 in eager); it needs one extra one-line guard in deep_gemm_wrapper/entrypoint.py (see TECHNICAL.md). If you’d rather not patch that, add --disable-cuda-graph and run eager. Configure NCCL transport (NCCL_P2P_DISABLE / NCCL_IB_DISABLE) to match your fabric.

How it works

ada_dsa.py monkeypatches GLM-5.2’s SM90+/SM100-only DSA kernels — the lightning-indexer GEMM, the top-k + page-mapping, and the MLA sparse decode — with portable Triton + a non-WGMMA tilelang path, only on sub-Hopper GPUs. Plus one config fix for the MoE. Full write-up, kernel-by-kernel walkthrough, and the verification table: TECHNICAL.md.

Status

GLM-5.2 runs correctly on consumer hardware where the stock stack hard-crashes, at about 10 tokens/sec single-stream (CUDA-graph; about 2.5 in eager mode). That’s interactive speed for the full 753B model on commodity cards. The portable indexer / top-k / page-transform stack is model-agnostic and should apply to other DSA models (e.g. DeepSeek-V3.2-style) with minor adjustment.

License

Apache-2.0.

Built by @renning22.

@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

renning22/glm-5.2-4090

GLM-5.2 on RTX 4090

Highlights

Hardware sizing

Usage

How it works

Status

License

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

GLM 5.2 API is live, weights are on HF, and ollama has it already

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

Submit Feedback

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

GLM 5.2 API is live, weights are on HF, and ollama has it already

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding