@FeitengLi: Built a ReAct agent system by hand: Doing agent systems with LLMs. While walking this evening, I was thinking about how to train an LLM's agentic capabilities, data preparation, model training, constructing RL training with agent trajectory actions, and also about Claude's progress over the past year…

X AI KOLs Following 05/20/26, 02:02 PM Papers

react-agent llm glm-5 reinforcement-learning agentic-engineering coding zhipu-ai

Summary

The author shares their experience building a ReAct agent system and introduces the GLM-5 technical report released by Zhipu AI, which achieves breakthroughs in agentic, reasoning, and coding capabilities.

Built a ReAct agent system by hand, doing agent systems with LLMs. While walking this evening, I was thinking about how to train an LLM's agentic capabilities: data preparation, model training, and constructing RL training from agent trajectories and actions. I was also thinking about which SFT and RL advances contributed to Claude's progress over the past year. After dinner, I read Zhipu AI's "GLM-5: from Vibe Coding to Agentic Engineering". It is a real technical report with very rich details, and it was pretty similar to what I had thought, but I was surprised that they used 9T code/data. Multiple inference frameworks' top-k implementations actually have randomness. https://arxiv.org/html/2602.15763v2…

Cached at: 05/20/26, 04:32 PM

I put together a ReAct agent system: thinking about LLM-based agent systems during an evening walk — how to train LLM agentic capabilities, data preparation, model training, constructing agent trajectories and actions for RL training; also thinking about which SFT/RL improvements drove Claude’s progress over the past year. After dinner, I read Zhipu’s “GLM-5: from Vibe Coding to Agentic Engineering” — it’s a real technical report, very rich in detail; pretty close to what I had in mind, though seeing 9T code or data surprised me. Also, in multiple inference frameworks the top-k implementations still have randomness. https://arxiv.org/html/2602.15763v2…

GLM-5: from Vibe Coding to Agentic Engineering

Source: https://arxiv.org/html/2602.15763v2
GLM-5 Team, Zhipu AI & Tsinghua University (for the complete list of authors, please refer to the Contribution (https://arxiv.org/html/2602.15763v2#S9) section)

Abstract

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

Figure 1: Results of GLM-5, GLM-4.7, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) on 8 agentic, reasoning, and coding benchmarks: Humanity’s Last Exam, SWE-bench Verified, SWE-bench Multilingual, Terminal-Bench 2.0, BrowseComp, MCP-Atlas, τ²-Bench, Vending Bench 2.

1 Introduction

The pursuit of Artificial General Intelligence (AGI) requires not only scaling model parameters but also fundamentally rethinking the efficiency of intelligence and the architecture of autonomous improvement. With the release of GLM-4.5, we demonstrated that uniting Agentic, Reasoning, and Coding (ARC) capabilities into a single Mixture-of-Experts (MoE) architecture could yield state-of-the-art results across diverse benchmarks. However, as Large Language Models (LLMs) transition from passive knowledge repositories to active problem solvers, the dual challenges of computational cost and real-world adaptability, particularly in complex software engineering, have become the primary bottlenecks.

We present GLM-5, our next-generation flagship model designed to overcome these barriers. GLM-5 represents a paradigm shift in both performance and efficiency, achieving state-of-the-art status on major open leaderboards, including ArtificialAnalysis.ai, the LMArena Text, and the LMArena Code. More significantly, GLM-5 redefines the standard for real-world coding, demonstrating an unprecedented ability to handle complex, end-to-end software development tasks that go far beyond the scope of traditional static benchmarks like SWE-bench.

Results. Figure 1 (https://arxiv.org/html/2602.15763v2#S0.F1) shows the results of GLM-5, GLM-4.7, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 (xhigh) on 8 agentic, reasoning, and coding benchmarks: Humanity’s Last Exam [34 (https://arxiv.org/html/2602.15763v2#bib.bib20)], SWE-bench Verified [19 (https://arxiv.org/html/2602.15763v2#bib.bib27)], SWE-bench Multilingual [53 (https://arxiv.org/html/2602.15763v2#bib.bib69)], Terminal-Bench 2.0 [45 (https://arxiv.org/html/2602.15763v2#bib.bib19)], BrowseComp [50 (https://arxiv.org/html/2602.15763v2#bib.bib17)], MCP-Atlas [6 (https://arxiv.org/html/2602.15763v2#bib.bib66)], τ²-Bench [55 (https://arxiv.org/html/2602.15763v2#bib.bib11); 7 (https://arxiv.org/html/2602.15763v2#bib.bib65)], Vending Bench 2 [3 (https://arxiv.org/html/2602.15763v2#bib.bib68)]. On average, GLM-5 achieves about 20% improvement over our last version GLM-4.7, and is comparable to Claude Opus 4.5 and GPT-5.2 (xhigh), and better than Gemini 3 Pro.

GLM-5 scores 50 on the Intelligence Index v4.0 and is the new open weights leader (cf. Figure 2 (https://arxiv.org/html/2602.15763v2#S1.F2)), up from GLM-4.7’s score of 42, an 8 point jump driven by improvements across agentic performance and knowledge/hallucination. This is the first time an open weights model has achieved a score of 50 on the Artificial Analysis Intelligence Index v4.0.

Figure 2: Artificial Analysis Intelligence Index v4.0 incorporates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.

Figure 3: On LMArena, GLM-5 is the #1 open model in both Text Arena and Code Arena.

Figure 4: Results on several long-horizon tasks. Left: Vending-Bench 2; Right: CC-Bench-V2.

LMArena, initiated by UC Berkeley, is a transparent, shared space to evaluate and compare frontier AI capabilities by human judgment with millions of real tasks, including writing, coding, reasoning, designing, searching, and creating. The large volume of human interactions generates signals of real-world utility, making it different from the other static benchmarks. Figure 3 (https://arxiv.org/html/2602.15763v2#S1.F3) shows that GLM-5 again is the #1 open model in both Text Arena and Code Arena, and overall on par with Claude-Opus-4.5 and Gemini-3-pro.

Long-term coherence in agents becomes more and more important. Coding agents can now write code autonomously for hours, and the length and breadth of tasks AI models are able to complete are likely to increase. We use two benchmarks, Vending-Bench 2 and CC-Bench-V2, to evaluate how GLM-5 is able to complete long-horizon tasks. Vending-Bench 2 is a benchmark for measuring AI model performance in running a business over long time horizons. Models are tasked with running a simulated vending machine business over a year and are scored on their bank account balance at the end. Figure 4 (https://arxiv.org/html/2602.15763v2#S1.F4) (left) shows that GLM-5 ranks #1 among all open-source models, finishing with a final account balance of $4,432. It approaches Claude Opus 4.5, demonstrating strong long-term planning and resource management. Figure 4 (https://arxiv.org/html/2602.15763v2#S1.F4) (right) further shows results on our internal evaluation suite CC-Bench-V2. GLM-5 significantly outperforms GLM-4.7 across frontend, backend, and long-horizon tasks, narrowing the gap with Claude Opus 4.5.

Figure 5: Overall training pipeline of GLM-5.

Methods. Figure 5 (https://arxiv.org/html/2602.15763v2#S1.F5) shows the overall training pipeline of GLM-5. Our Base Model training began with a massive 27 trillion token corpus, prioritizing code and reasoning early on. We then employed a distinct Mid-training phase to progressively extend context length from 4K to 200K, focusing specifically on long-context agentic data to ensure stability in complex workflows. In Post-Training, we moved beyond standard SFT. We implemented a sequential Reinforcement Learning pipeline: starting with Reasoning RL, followed by Agentic RL, and finishing with General RL. Crucially, we utilized On-Policy Cross-Stage Distillation throughout this process to prevent catastrophic forgetting, ensuring the model retains its sharp reasoning edge while becoming a robust generalist.

In summary, the leap in GLM-5’s performance is driven by the following technical contributions:

First, we adopt DSA (DeepSeek Sparse Attention) [9 (https://arxiv.org/html/2602.15763v2#bib.bib1)], a novel architectural innovation that significantly reduces both training and inference costs. While GLM-4.5 improved efficiency through a standard MoE architecture, DSA allows GLM-5 to dynamically allocate attention resources based on token importance, drastically lowering the computational overhead without compromising long-context understanding or reasoning depth. With DSA, we scale the model parameters up to 744B and extend the training token budget to 28.5T tokens.

Second, we have engineered a new asynchronous reinforcement learning infrastructure. Building on the “slime” framework and the decoupled rollout engines initialized in GLM-4.5, our new infrastructure further decouples generation from training to maximize GPU utilization. This system allows for massive-scale exploration of agent trajectories without the synchronization bottlenecks that previously hampered iteration speed, significantly improving the efficiency of our RL post-training pipeline.

Third, we present novel asynchronous Agent RL algorithms designed to enhance the quality of autonomous decision-making. In GLM-4.5, we utilized iterative self-distillation and outcome supervision to train agents. For GLM-5, we have developed asynchronous algorithms that allow the model to learn from diverse, long-horizon interactions continuously. These algorithms are specifically optimized to improve the model’s planning and self-correction capabilities in dynamic environments, directly contributing to our dominance in real-world coding scenarios.

Last, one more technical contribution lies in the fact that, from the first day, GLM-5 is full-stack adapted to Chinese GPU ecosystems. We have successfully completed deep optimization, spanning from underlying kernels to upper-level inference frameworks, across seven mainstream domestic chip platforms, including Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, MetaX, and Enflame.

With these advancements, GLM-5 stands not just as a more powerful model but as a more efficient and practical foundation for the next generation of AI agents. We release GLM-5 to the community to further advance the frontier of efficient, agentic general intelligence.
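The decoupling of generation from training described in the second contribution can be pictured as a producer-consumer loop. The following is a toy sketch under stated assumptions, not the slime framework or the GLM-5 infrastructure: the names `rollout_worker`, `trainer`, and `run` are hypothetical, a trajectory is just a dictionary, and a "training step" only bumps a version counter. It shows that the rollout side never waits for a full synchronous step, at the price of trajectories that may come from a slightly older policy version.

```python
import queue
import threading


def rollout_worker(policy, out_q, num_trajectories):
    """Generate trajectories and tag each with the policy version that produced it."""
    for i in range(num_trajectories):
        out_q.put({"id": i, "policy_version": policy["version"]})
    out_q.put(None)  # sentinel: no more trajectories


def trainer(policy, in_q, batch_size):
    """Consume trajectories as they arrive and update the policy once per full batch."""
    batch, staleness = [], []
    while True:
        item = in_q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            # Staleness: how many updates happened since the trajectory was generated.
            staleness.extend(policy["version"] - t["policy_version"] for t in batch)
            policy["version"] += 1  # stand-in for one gradient update
            batch = []
    return staleness


def run(num_trajectories=8, batch_size=2):
    policy = {"version": 0}
    buffer = queue.Queue(maxsize=4)  # bounded buffer between generation and training
    worker = threading.Thread(
        target=rollout_worker, args=(policy, buffer, num_trajectories)
    )
    worker.start()
    staleness = trainer(policy, buffer, batch_size)
    worker.join()
    return policy["version"], staleness


if __name__ == "__main__":
    version, staleness = run()
    print("final policy version:", version)
    print("per-trajectory staleness:", staleness)
```

The staleness values vary from run to run because the two threads interleave freely; an asynchronous RL algorithm has to tolerate or correct for that off-policy gap.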

2 Pre-Training

Similar to GLM-4.5, the base model of GLM-5 goes through two stages: pre-training for general language and coding capacity, and mid-training for agentic and long-context capacity. We extend the training token budget for all the training stages of GLM-5, totaling 28.5 trillion tokens for the base model.

2.1 Architecture

Model size scaling. GLM-5 scales to 256 experts and reduces its layer count to 80 to minimize expert parallelism communication overhead. This results in a 744B parameter model (40B active parameters), doubling the total size of GLM-4.5, which utilized 355B total and 32B active parameters.
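As a quick check of the scaling claim, the arithmetic below uses only the parameter counts quoted above. The variable names are illustrative.

```python
# Parameter counts quoted in the text (in billions).
glm5_total, glm5_active = 744, 40
glm45_total, glm45_active = 355, 32

size_ratio = glm5_total / glm45_total            # total-size growth factor
glm5_active_frac = glm5_active / glm5_total      # share of parameters active per token
glm45_active_frac = glm45_active / glm45_total

print(f"total size ratio: {size_ratio:.2f}x")
print(f"GLM-5 active fraction: {glm5_active_frac:.1%}")
print(f"GLM-4.5 active fraction: {glm45_active_frac:.1%}")
```

The total size roughly doubles while the active parameter count grows by only a quarter, so GLM-5 is the sparser of the two models per token.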

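Table 1: Evaluation results for GQA-8 and variants of MLA.

| Dataset | Hellaswag | MMLU | C-Eval | RACE | BBH | GSM8K | HumanEval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GQA-8 | 77.3 | 61.2 | 60.0 | 79.6 | 53.3 | 47.6 | 38.5 |
| MLA | 77.3 | 61.5 | 59.7 | 77.8 | 48.9 | 46.2 | 33.5 |
| MLA + Muon Split | 77.8 | 62.5 | 62.1 | 79.9 | 51.8 | 45.0 | 36.7 |
| MLA-256 + Muon Split | 77.4 | 62.0 | 59.9 | 79.6 | 51.3 | 47.5 | 36.6 |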

Multi-latent Attention. By employing reduced key-value vectors, Multi-latent attention (MLA) [24 (https://arxiv.org/html/2602.15763v2#bib.bib38)] matches the effectiveness of Grouped-Query Attention (GQA) but offers superior GPU memory savings and faster processing for long-context sequences. However, in our experiments with Muon optimizer, we find that MLA with a 576-dimension latent KV-cache cannot match the performance of GQA with 8 query groups (denoted as GQA-8, 2048-dimension KV-cache). To overcome the performance gap, we propose an adaptation to the recipe of Muon optimizer in GLM-4.5. In the original recipe, we apply matrix orthogonalization to the up-projection matrices $W^{UQ}, W^{UK}, W^{UV}$ for multi-head queries, keys, and values. Instead, we split these matrices into smaller matrices for different heads and apply matrix orthogonalization to these independent matrices. The method, denoted as Muon Split, enables projection weights for different attention heads to update at different scales. As shown in Table 1 (https://arxiv.org/html/2602.15763v2#S2.T1), the method effectively improves the performance of MLA to match that of GQA-8. In practice, we also find that with Muon Split, the scale of attention logits of GLM-5 remains stable during pre-training without any clipping strategy.

Another disadvantage of MLA is its high computational cost during decoding. In decoding, MLA performs a 576-dimensional dot product, higher than the 128-dimensional computation of GQA. While the number of attention heads in DeepSeek-V3 is selected according to the roofline of H800 [60 (https://arxiv.org/html/2602.15763v2#bib.bib103)], it is inappropriate for other hardware. Given the Multi-head Attention (MHA) style of MLA during training and prefilling, we increase the head dimension from 192 to 256 and decrease the number of attention heads by 1/3. This keeps the training computation and the number of parameters constant while decreasing the decoding computation. The variant, denoted as MLA-256 in Table 1 (https://arxiv.org/html/2602.15763v2#S2.T1), matches the performance of MLA under Muon Split.
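To make the Muon Split idea concrete, here is a minimal NumPy sketch, not the GLM-5 implementation. It assumes the gradient of an up-projection matrix stacks the heads along the row axis, and it swaps in an exact SVD polar factor for the approximate Newton-Schulz iteration that Muon implementations typically use. The function names `orthogonalize`, `muon_update_full`, and `muon_update_split` are hypothetical.

```python
import numpy as np


def orthogonalize(g):
    """Return the semi-orthogonal polar factor U @ Vt of g.

    Assumption: exact SVD is used here for clarity; it is a stand-in for the
    Newton-Schulz approximation used in practical Muon optimizers.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def muon_update_full(grad):
    """Original recipe: orthogonalize the whole up-projection gradient at once."""
    return orthogonalize(grad)


def muon_update_split(grad, num_heads):
    """Muon Split: orthogonalize each head's block of rows independently."""
    blocks = np.split(grad, num_heads, axis=0)
    return np.concatenate([orthogonalize(b) for b in blocks], axis=0)


# Toy shapes (assumptions): 4 heads of dimension 8, latent dimension 16.
rng = np.random.default_rng(0)
num_heads, head_dim, latent_dim = 4, 8, 16
grad = rng.standard_normal((num_heads * head_dim, latent_dim))

full = muon_update_full(grad)
split = muon_update_split(grad, num_heads)

for h in range(num_heads):
    rows = slice(h * head_dim, (h + 1) * head_dim)
    print(
        f"head {h}: full-update norm {np.linalg.norm(full[rows]):.3f}, "
        f"split-update norm {np.linalg.norm(split[rows]):.3f}"
    )
```

In this sketch the joint orthogonalization couples all heads through one decomposition, while the split variant treats each head's block as its own matrix, so each block of the update is orthogonal by itself.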

Table 2: Comparison of accept lengths of DeepSeek-V3.2 and GLM-5.

| Model | Accept Length |
| --- | --- |
| DeepSeek-V3.2 | 2.55 |
| GLM-5 | 2.76 |

Multi-token Prediction with Parameter Sharing. Multi-token prediction (MTP) [13 (https://arxiv.org/html/2602.15763v2#bib.bib40); 25 (https://arxiv.org/html/2602.15763v2#bib.bib37)] increases the performance of base models and acts as draft models for speculative decoding [20 (https://arxiv.org/html/2602.15763v2#bib.bib102)]. However, during training, to predict the next $n$ tokens, $n$ MTP layers are required. As a result, the memory usage of MTP parameters and the kv cache scales linearly with the number of speculative steps. Instead, DeepSeek-V3 is trained with a single MTP layer and predicts the next 2 tokens during inference. The training-inference discrepancy reduces the acceptance rate of the second token. Therefore, we propose sharing the parameters of 3 MTP layers during training. This keeps the memory cost of the draft model consistent with DeepSeek-V3 while increasing the acceptance rate. In Table 2 (https://arxiv.org/html/2602.15763v2#S2.T2), we show that the acceptance length of GLM-5 is longer than DeepSeek-V3.2, given the same number of speculative steps (4) in our private prompt set.
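For readers unfamiliar with the metric, the sketch below shows one way an accept length can be computed under greedy verification. It is an illustration under assumptions, not the report's evaluation code: the excerpt does not say whether the extra token emitted by the target model at each step is counted, so that choice is exposed as the hypothetical `count_bonus` flag, and the token IDs are made up.

```python
def accepted_tokens(draft, target):
    """Count leading draft tokens that match the target model's own choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # everything after the first mismatch is discarded
        n += 1
    return n


def mean_accept_length(steps, count_bonus=True):
    """Average tokens gained per verification step.

    steps: list of (draft_tokens, target_tokens) pairs of equal length.
    count_bonus: assumption; if True, also count the one token the target
    model always contributes at each step.
    """
    total = 0
    for draft, target in steps:
        total += accepted_tokens(draft, target) + (1 if count_bonus else 0)
    return total / len(steps)


# Made-up example with 4 speculative steps per verification.
steps = [
    ([5, 9, 2, 7], [5, 9, 2, 7]),  # all 4 draft tokens accepted
    ([5, 9, 2, 7], [5, 9, 3, 7]),  # 2 accepted, mismatch at the third
    ([1, 1, 1, 1], [4, 1, 1, 1]),  # 0 accepted
    ([8, 6, 4, 2], [8, 0, 4, 2]),  # 1 accepted
]

print(mean_accept_length(steps, count_bonus=False))
print(mean_accept_length(steps, count_bonus=True))
```

Because acceptance stops at the first mismatch, the quality of the later draft positions matters only when the earlier ones are right, which is why a draft trained for one step and used for several loses acceptance on the later tokens.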

2.1.1 Continued Pre-Training with DeepSeek Sparse Attention (DSA)

Table 3: Comparison of long-context benchmarks between MLA and DSA base models.

| Model | MQ-NIAH-128k | MV-NIAH-128k | SQuAD-128k | HotpotQA-128k |
| --- | --- | --- | --- | --- |
| MLA | 100.0 | 95.5 | 79.7 | 66.3 |
| DSA | 100.0 | 97.0 | 86.0 | 63.0 |

We use DSA in our training. The core philosophy of DSA [9 (https://arxiv.org/html/2602.15763v2#bib.bib1)] is to replace the traditional dense $O(L^2)$ attention, which becomes prohibitively expensive at 128K contexts, with a dynamic, fine-grained selection mechanism. Unlike fixed patterns (like sliding windows), DSA “looks” at the content to decide which tokens are important.

What makes DSA particularly interesting from a researcher’s perspective is how it was introduced via Continued Pre-Training from a dense base model. This avoided the “astronomical” cost of training from scratch. The transition follows a two-stage “dense warm-up and sparse training adaptation” strategy. DeepSeek-V3.2-Exp maintains the same benchmark performance as its dense predecessor, proving that 90% of attention entries in long contexts are indeed redundant. DSA reduces the attention computation by roughly 1.5-2× for long sequences, which is very important for the reasoning-heavy agents we are building, being able to handle 128K contexts at half the GPU cost.

Figure 6: SFT lo
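The content-based selection described above can be illustrated with a small NumPy sketch for a single query. This is a simplified picture under assumptions, not the DSA kernel: the per-token `index_scores` are taken as given (here they are set to the true attention logits, an idealized indexer), and the names `dense_attention` and `sparse_attention` are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max()  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()


def dense_attention(q, K, V):
    """Standard attention for one query over all keys: cost grows with len(K)."""
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ V


def sparse_attention(q, K, V, index_scores, top_k):
    """Attend only to the top_k tokens ranked by index_scores."""
    keep = np.argpartition(index_scores, -top_k)[-top_k:]  # indices of the top_k scores
    return dense_attention(q, K[keep], V[keep])


# Toy sizes (assumptions): 64 context tokens, dimension 16, keep 8 tokens.
rng = np.random.default_rng(0)
L, d, top_k = 64, 16, 8
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
index_scores = K @ q  # assumption: an idealized indexer that sees the true logits

out_dense = dense_attention(q, K, V)
out_sparse = sparse_attention(q, K, V, index_scores, top_k)
print("max abs difference, top-8 vs dense:", np.abs(out_dense - out_sparse).max())
```

With `top_k` equal to the context length the two functions agree exactly; shrinking `top_k` trades a small approximation error for attention work that no longer grows with the full context.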

Similar Articles

@FeitengLi: Just said this morning: The intelligence of embodied intelligence should copy the homework of LLM + RL + Agentic. Here it is: Agentic VLA crushes the models of leading embodied companies across the board https://x.com/FeitengLi/status/205909864717506193...

@dongxi_nlp: I saw discussions about whether to use Python for building Agents. Go check out Shunyu Yao's ReAct source code – just a few notebooks. I remember running those simple lines of code and collapsing into my chair; it was one of the rare experiences in life. No exaggeration, these note…

This article systematically reviews AI Agent architecture and engineering practices, covering control flow, context engineering, tool design, memory, multi-agent organization, evaluation, tracing, and security. It is based on the OpenClaw implementation and emphasizes the critical role of Harness (testing and validation infrastructure) for system stability.

@teach_fireworks: AI Coding is now entering a very interesting phase. In the past, discussions focused heavily on model capabilities, context length, Agent Loops, Tool Use, and automated programming. However, once Agents are placed in real-world development environments for extended periods, many teams realize the issue isn't just about 'whether code can be generated...',

@vintcessun: Tonight I came across a learning roadmap project that redefined where to start learning Agent. I used to think Agent was just a pile of tools and frameworks, but its core is the "observe-think-execute" loop and the harness engineering's organization of permissions, state, and backtracking. It breaks down learning into building a minimal Agent loop from scratch all the way to deploying a real Agent, with 8 stages, each with clear deliverables and recommended resources — not just links but an actionable todo list. This systematic approach made me realize my previous learning was too fragmented.
