@0x_kaize: https://x.com/0x_kaize/status/2068775813785506091
Summary
A guide on avoiding rate limits and reducing costs when using the GLM 5.2 model, covering prompt batching, caching, free model alternatives, effort levels, context window management, and self-hosting.
Cached at: 06/22/26, 03:35 AM
How To Never Hit GLM 5.2 Limits
GLM 5.2, a direct competitor to Fable 5, was released on June 13, and people immediately ran into a problem: hitting limits way too fast.
But many users don’t understand one important thing:
- GLM’s Coding Plan meters prompts - NOT TOKENS!
- API meters tokens - BUT PEOPLE SPEND THEM TOO QUICKLY!
I was spending everything too quickly, too, until I changed these 10 things.
1. Why you hit limits
First you need to understand what you’re actually paying for.
GLM has two completely different meters:
1. The Coding Plan (subscription) - counts prompts, not tokens.
Lite (~$18/mo): ~80 prompts per 5-hour cycle
Pro: ~600 prompts per 5-hour cycle
Max/Team: much higher
One massive prompt = one tiny prompt - same price.
People burn their quota sending 50 one-line questions when they could batch them into 5 well-structured ones.
2. The API (pay-per-token) - counts tokens.
input: $1.40 / 1M
output: $4.40 / 1M
cached input: $0.26 / 1M
In other words, if you’re on a subscription plan, you should never, under any circumstances, send pointless prompts - otherwise, you’ll hit your limits too quickly.
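To make the token meter concrete, here is a minimal sketch of the API-side arithmetic using the per-million-token prices listed above. The function name and the sample token counts are illustrative assumptions, not part of any official SDK.

```python
# Prices in USD per 1M tokens, taken from the price list above.
PRICE_INPUT = 1.40
PRICE_OUTPUT = 4.40
PRICE_CACHED_INPUT = 0.26


def api_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one pay-per-token API call.

    cached_tokens is the part of input_tokens billed at the cached rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * PRICE_INPUT
        + cached_tokens * PRICE_CACHED_INPUT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical call: 30k input tokens, 2k output tokens, nothing cached.
    print(f"${api_cost_usd(30_000, 2_000):.4f}")
```

On the Coding Plan the same call would simply count as one prompt, regardless of these token counts.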
2. Caching - the 81% discount
When you send a long stable prefix repeatedly - a system prompt, tool definitions, a big file you keep referencing - the provider caches the processed prefix.
The next call bills that part at $0.26/M instead of $1.40/M: ~81% discount on the repeated part of every prompt.
Rules to make it work:
1/ Keep reused content at the FRONT of the prompt.
2/ Keep variable content at the END (caches key off the prefix).
3/ Caches expire - the discount applies to calls that land close together, not once an hour.
Coding agents like Claude Code, Cline, and Cursor resend a huge stable preamble every single turn: instructions, tool schemas, repo context.
Caching that preamble cuts your per-turn bill dramatically. If you’re not caching, you’re paying full price to resend the same tokens over and over.
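A rough sketch of where the ~81% figure comes from and what it does to a single turn, using the two input prices quoted above. The helper names and the 20k/500 token split are assumptions made for illustration; real cache behavior depends on the provider.

```python
PRICE_INPUT = 1.40         # USD per 1M uncached input tokens (from the article)
PRICE_CACHED_INPUT = 0.26  # USD per 1M cached input tokens (from the article)


def cached_discount() -> float:
    """Fraction saved on the cached part of a prompt."""
    return 1 - PRICE_CACHED_INPUT / PRICE_INPUT


def turn_input_cost(prefix_tokens: int, suffix_tokens: int, cache_hit: bool) -> float:
    """Input cost of one turn: a stable prefix plus a variable suffix."""
    prefix_rate = PRICE_CACHED_INPUT if cache_hit else PRICE_INPUT
    return (prefix_tokens * prefix_rate + suffix_tokens * PRICE_INPUT) / 1_000_000


def build_prompt(stable_prefix: str, variable_part: str) -> str:
    """Put reused content first and changing content last, so the prefix can match a cache."""
    return stable_prefix + "\n\n" + variable_part


if __name__ == "__main__":
    print(f"discount on cached part: {cached_discount():.1%}")
    print(f"cold turn: ${turn_input_cost(20_000, 500, cache_hit=False):.4f}")
    print(f"warm turn: ${turn_input_cost(20_000, 500, cache_hit=True):.4f}")
```

Only the prefix gets the cheaper rate, which is why anything that changes per call belongs at the end.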
3. Free models - for everything that doesn’t need 5.2 version
Most of your tasks don’t need a frontier model (GLM 5.2). Zhipu gives you two genuinely free models, no trial limit:
1/ GLM-4.7-Flash: free, 203k context, formatting + simple completions.
2/ GLM-4.5-Flash: free, lightweight general purpose.
Use Flash for formatting, renaming, quick syntax questions, and boilerplate code snippets.
Use GLM 5.2 for tasks that require an analytical approach.
This habit alone can make the “Lite” plan last about twice as long.
4. Effort levels - stop running Max
GLM 5.2 has two thinking presets: High & Max.
Zhipu says Max should be the default for coding, but Max burns more quota and more tokens on every call - and most tasks don’t need maximum reasoning depth.
- High: routine edits, drafts, simple logic.
- Max: complex refactors, architecture, hard bugs.
Be deliberate about when you use Max, and never use it to fix a single line of code - otherwise you’ll hit your limits very quickly.
5. The 1M trap
The 1M context window is the headline feature and it’s a trap if you use it wrong.
The full window is enabled with the [1m] suffix on the model name (glm-5.2[1m]), but loading a massive context means processing massive input every turn - even when the model only needs a fraction of it.
Rules:
- Don’t load the entire 50k-line repo for a one-file fix.
- Load the 1M window only when the task genuinely needs it.
Use the big window when you need it.
For everything else, keep your context tight - the model re-reads everything you give it, every turn.
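One way to keep context tight is to hand the model only the files a task touches, capped by a token budget. The sketch below is an assumption-heavy illustration: `estimate_tokens` uses a crude 4-characters-per-token guess rather than a real tokenizer, and `pick_context` is a hypothetical helper, not a feature of any GLM tool.

```python
def estimate_tokens(text: str) -> int:
    """Very rough size estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pick_context(files: dict[str, str], wanted: list[str], budget_tokens: int) -> list[str]:
    """Return the wanted paths that fit inside budget_tokens, in the order given.

    files maps path -> file contents; paths not present in files are ignored.
    """
    chosen: list[str] = []
    used = 0
    for path in wanted:
        if path not in files:
            continue
        cost = estimate_tokens(files[path])
        if used + cost > budget_tokens:
            continue  # skip anything that would exceed the budget
        chosen.append(path)
        used += cost
    return chosen


if __name__ == "__main__":
    repo = {"a.py": "x" * 400, "b.py": "y" * 4000, "c.py": "z" * 40}
    print(pick_context(repo, ["a.py", "b.py", "c.py"], budget_tokens=200))
```

The point is the habit rather than the heuristic: decide what the task needs before sending, instead of loading everything by default.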
6. Self-hosting - zero per-token cost forever
GLM 5.2 ships under an MIT license - the weights are FREE.
If your volume is high enough and you have the hardware, you can run the model yourself and pay zero per token.
It turns a metered bill into a fixed compute cost:
- 753B MoE (~40B active)
- 1M context, MIT weights
- run it on your own infra = no quota, no token fees
The community is already quantizing the weights into 4-bit and 2-bit variants.
**The realistic play for most people:**
Stay on the hosted plan for now, watch for a single-node config, then re-evaluate self-hosting once your volume justifies it.
For heavy users this is the real “free GLM 5.2.”
7. Config - the exact setup
Wiring GLM 5.2 into Claude Code through the Coding Plan:
export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_API_KEY="your-glm-coding-plan-key"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
export API_TIMEOUT_MS=3000000
The API_TIMEOUT_MS value matters. Without a long timeout, Claude Code kills long large-context calls before GLM 5.2 finishes.
Set it high or you’ll waste quota on calls that never complete.
Note:
The Coding Plan key is a different credential from the standard API key. And calls outside the supported tools fall back to normal API billing.
8. Batch your prompts
This is the biggest fix for Coding Plan users specifically.
Remember: the plan counts prompts, not tokens.
- 10 separate one-line questions cost 10 prompts.
- 10 questions in one structured message cost 1.
**Don’t make 10 prompts:** “rename this variable”, “now fix the import”, “add a type here”...
**Make 1 prompt:** “do all of these: rename X to Y, fix the import on line 4, add types to the function params, and update the test”
Batching related work into single prompts can stretch your quota 5-10x.
This one habit changes everything if you’re on Lite.
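Batching is easy to mechanize. As a minimal sketch, assuming you collect small requests in a list before sending anything, a helper like the hypothetical `batch_prompt` below folds them into one numbered message:

```python
def batch_prompt(tasks: list[str], header: str = "Do all of these in one pass:") -> str:
    """Combine several small requests into a single structured prompt."""
    cleaned = [task.strip() for task in tasks if task.strip()]
    if not cleaned:
        raise ValueError("no tasks to batch")
    numbered = [f"{i}. {task}" for i, task in enumerate(cleaned, start=1)]
    return "\n".join([header] + numbered)


if __name__ == "__main__":
    pending = [
        "rename X to Y",
        "fix the import on line 4",
        "add types to the function params",
        "update the test",
    ]
    # Four requests, one prompt against the Coding Plan quota.
    print(batch_prompt(pending))
```

Numbering the items makes it easier to check that the reply addressed every one of them.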
9. Compact long sessions
A growing chat history is a growing bill on every single turn.
By message 40, the model is re-reading thousands of tokens of context every time you send something:
- On the API that’s input tokens you pay for over and over.
- On the Coding Plan it eats into your effective throughput.
Rules:
- compact or start a fresh session every 30-40 messages
- don’t keep one giant session running all day
- start clean when you switch tasks
The model has no reason to carry your morning’s context into an afternoon task.
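The cost of a long session grows faster than the message count, because every turn resends the history so far. The sketch below is a simplified model under stated assumptions: each message adds a fixed number of tokens (500 here is made up), the whole history is resent on every turn, and caching is ignored.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int, base_tokens: int = 0) -> int:
    """Total input tokens sent over a session that resends its full history each turn.

    Turn k sends base_tokens plus k messages' worth of tokens, so the total
    grows roughly with the square of the number of turns.
    """
    return sum(base_tokens + k * tokens_per_message for k in range(1, turns + 1))


if __name__ == "__main__":
    one_long = cumulative_input_tokens(40, 500)
    two_short = 2 * cumulative_input_tokens(20, 500)
    print(one_long, two_short)
```

Under these assumptions, one 40-message session sends 410,000 input tokens in total, while two fresh 20-message sessions send 210,000 for the same number of messages.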
10. Drop to GLM-4.7 when you don’t need 5.2
5.2 is the flagship, but 4.7 still hits 73.8% on SWE-bench and costs less per call.
- GLM 4.7: most everyday coding, edits, standard features.
- GLM 5.2: complex reasoning, 1M context tasks, hard bugs.
Most coding work doesn’t need the absolute frontier.
Reserve 5.2 for the tasks that genuinely need its reasoning, and let 4.7 handle the bulk.
Between 4.7 for mid-tier work and Flash for simple work, your 5.2 stops being the bottleneck entirely.
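The three-tier split can be written down as a tiny router. This is a sketch only: the task labels are invented for illustration, and the model identifier strings are assumptions that should be checked against the provider's actual model list.

```python
from enum import Enum


class Tier(Enum):
    # Identifier strings are assumptions, not confirmed API model names.
    FLASH = "glm-4.7-flash"
    MID = "glm-4.7"
    FRONTIER = "glm-5.2"


# Hypothetical task labels, grouped the way the article groups the work.
SIMPLE_TASKS = {"format", "rename", "syntax", "boilerplate"}
HARD_TASKS = {"refactor", "architecture", "hard-bug", "long-context"}


def pick_model(task_kind: str) -> Tier:
    """Route a task to the cheapest tier that should handle it."""
    kind = task_kind.strip().lower()
    if kind in SIMPLE_TASKS:
        return Tier.FLASH
    if kind in HARD_TASKS:
        return Tier.FRONTIER
    return Tier.MID  # everyday coding, edits, standard features


if __name__ == "__main__":
    for kind in ("rename", "feature", "architecture"):
        print(kind, "->", pick_model(kind).value)
```

Defaulting to the middle tier reflects the claim that most coding work does not need the frontier model; only explicitly hard tasks escalate.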
The honest part
GLM 5.2 isn’t free. The “free tokens” framing floating around is mostly wrong - the only genuinely free paths are the Flash models and self-hosting the open weights.
But the gap between someone who hits their limit in an hour and someone who codes all day on the same plan isn’t the plan - it’s these 10 habits.
Do that, and the cheapest frontier coding model on the planet gets even cheaper.
And most importantly, it’s not your tokens that are being spent - it’s your prompts.