@jinchenma_ai: https://x.com/jinchenma_ai/status/2061835131107860582
Summary
The article proposes an engineering methodology based on AI Agent (Skill), suggesting that deterministic tasks be solidified into scripts to reduce new decisions made by the large model at runtime, thereby improving stability and token efficiency. Taking video subtitle processing as an example, it demonstrates a four-step engineering process.
View Cached Full Text
Cached at: 06/03/26, 05:44 AM
Your Skill is Unstable and Token-Intensive? Then You Should Learn How to Engineer It (Prompt Included)
Many friends, after creating their own Skill, gradually discover a problem: the same Skill works fine one time, but goes haywire the next. They ask it to handle a simple task, but it drags on for ages, burns through a pile of Tokens, and finally throws an error. Staring at that long stream of output in the terminal, you have no idea at which step it started to go astray.
This isn’t because your Skill is poorly written.
The way large models work is fundamentally different from traditional programs. They don’t execute step-by-step according to fixed rules; instead, they select a most likely answer from probabilities. For the same requirement, it might get it right this time, but misjudge next time. For the same function, the code it generates this time might work, but next time it could miss an edge case, write a wrong parameter, or call the wrong tool.
Large models inherently carry uncertainty and hallucinations. This is a fundamental mechanism, not that it deliberately acts differently each time.
Moreover, as your Skill becomes more complex, there are more places where the large model needs to make on-the-fly judgments, generate code on the spot, and fill in details ad hoc. Every time it has to improvise, the uncertainty gets amplified repeatedly.
That’s why you need to learn how to engineer your Skill.
This article is for people who have already created a Skill but want to make it more stable and more Token-efficient.
I will share how to apply engineering principles on top of a Skill.
If you haven’t built a Skill yet, you can bookmark this for later.
01 | Sources of Instability
When a simple Skill is first built, it usually does one simple thing: read a set of rules, execute them. At that point, it’s very stable, because you’ve already fixed everything you can in the rules.
But when you start having it handle complex tasks, problems arise.
The same Skill: the first run works. The second time, with a different input file, the parameters or format change, and it doesn’t know which one to call. Then you run it in batch, and some detail in the middle differs from the previous item; it starts skipping steps, missing steps, repeating steps. You watch the Token consumption skyrocket, and in the end, the output doesn’t match.
Breaking down the sources of instability, there are actually just five categories:
Unstable logical judgment. Each time, the large model infers the next step based on context (all chat history and file content it can see when processing a task). The more complex the task, the longer the judgment chain, the higher the probability of going astray.
Unstable code generation. Each time it temporarily generates code for the same function, variable names, dependencies, exception handling, and edge conditions can vary. More troublesome are hallucinated calls: it might call a function that doesn’t exist, or fabricate a parameter that looks reasonable but doesn’t actually exist.
Unstable detail completion. Paths, parameters, formats, tool names, API usage — if these are not explicitly solidified, the large model has to guess. Sometimes it guesses correctly, but if it guesses one letter wrong, it will repeatedly try again.
Unstable process execution. Without steps solidified into a process file or Skill scheduling logic, as tasks get longer, there’s a risk of skipping, missing, or repeating steps. It’s not that it doesn’t want to follow order; the longer the context, the more likely it is to “reinterpret” what it should do somewhere in the middle.
Unstable context carrying capacity. The longer the context, the more distracting information, the easier for the large model to grasp the wrong key points. Rules stated earlier might be forgotten later. Not because the rules are poorly written, but because it has too much to remember simultaneously.
The root cause of all five types is the same: you’ve handed too much over to the large model to improvise on the spot.
Prompts try to guide it “this time” to drift a little less. Engineering is about reducing the number of times it needs to improvise “each time.”
02 | What is Engineering?
Don’t be intimidated by the term “engineering.” It’s not the exclusive jargon of senior programmers.
In the context of Agents and Skills, engineering solves a very straightforward problem: which things can we continue to let the large model judge, and which things should not be left to it to improvise every time.
Put simply: Hand over uncertainty to the large model; solidify certainty into the system.
Large models are good at understanding, judging, decomposing, rewriting, reviewing — things that require flexibility, leaving them to the model is fine. But fixed processes, fixed code, fixed parameters, fixed input/output, fixed acceptance criteria — these things should not be guessed anew every time.
Specifically, you can break it down into four steps:
- First, run the task successfully once to confirm the thing can actually be done.
- Break the task into fixed steps, clearly specifying input, output, and success criteria for each step.
- Solidify the parts that can be done stably with code into scripts; don’t let the Agent generate them temporarily each time.
- Finally, let the Skill only be responsible for scheduling: read rules, call scripts, check results, and fix if problems occur.
Engineering isn’t about making AI smarter. It’s about preventing AI from having to rethink things that have already been figured out.
03 | A Student’s Video Subtitle Case
The four-step engineering approach described above might still sound a bit abstract. Let’s look at a real case to make it clear.
A student wanted to create a Skill for video processing.
Her requirement was clear: feed in a batch of English videos, and automatically produce a batch of videos with bilingual Chinese-English subtitles.
If you directly gave this task to an Agent with “Help me make a video subtitle Skill,” what would it do? It would start judging from scratch: how to process the video, how to extract subtitles, how to do translation, how to write back. Every step requires on-the-spot thinking, on-the-spot code writing, on-the-spot debugging. Running it once might work; changing videos would likely require redoing everything.
The engineering mindset is completely different. Instead of starting with “how to write this Skill,” you first ask, “what is the loop when this task is run repeatedly?”
Breaking it down, the loop for this task is actually quite clear:
Input English video → Transcribe to English subtitles → Proofread English → Translate to Chinese → Write subtitles back to video → Output video with bilingual subtitles.
After confirming the loop, examine each step: which ones are suitable for the large model to judge, and which should be solidified into scripts.
04 | Four-Step Engineering
Breaking the loop into four steps, the processing method for each is different.
Video to subtitles. This step is suitable to be fixed with a script, directly calling a transcription tool or local capability. No large model judgment needed. Input video file, output subtitle file. Pure execution.
English proofreading. This step requires understanding semantics, so it does need a large model. But the main Agent doesn’t need to do it personally. The best approach is to write a script that calls an external large model API — for example, DeepSeek — send the subtitle text, get the proofread result back. The main Agent is only responsible for invoking the script and checking the result.
English to Chinese translation. Same logic. Translation is a typical text processing task, better handled by an external model. The script acts as a “messenger”: send English to DeepSeek, get Chinese back, continue the flow.
Subtitle write-back to video. This step is back to pure execution. The subtitle rendering logic is fixed in a script. Input video and bilingual subtitles, output final video. No large model involvement needed.
There are three practical reasons to use external large models for the middle two steps:
First, separate the main Agent’s workflow from the specific text processing tasks. Codex or Claude Code is responsible for viewing the process, calling scripts, checking results. Don’t let it carry the full context and simultaneously do proofreading and translation in the main session. Proofreading and translation are long-text processing tasks that will directly overload the main Agent’s context.
Second, save Tokens and costs. English proofreading and translation don’t necessarily require the most expensive flagship model. Cheap models called via API like DeepSeek are sufficient. The code is just a messenger and doesn’t consume the main Agent’s Tokens.
Third, easier troubleshooting. The main Agent only needs to check if the script was called successfully and if the output was generated. If translation quality is off, modify the translation script’s prompt or switch to a different external model. This won’t disrupt the entire main flow.
05 | How to Engineer Specifically
We’ve clarified the loop and how to handle the four steps. First, run a single video successfully. Don’t jump in trying to batch process 20. Running a single point is to verify the process, scripts, input/output are all fine. Once one video runs, the rest is just looping.
Then solidify each of the four steps into scripts: transcription script, proofreading script, translation script, subtitle write-back script. Each script’s input and output should be clearly documented:
- Transcription script: Input video file → Output English subtitle file.
- Proofreading script: Input English subtitles → Call DeepSeek API for proofreading → Output proofread English subtitles.
- Translation script: Input proofread English subtitles → Call DeepSeek API for translation → Output Chinese subtitles.
- Write-back script: Input video + bilingual subtitles → Output final video with bilingual subtitles.
The proofreading and translation scripts call external large model APIs; the main Agent does not do proofreading or translation itself. It only invokes the scripts.
Finally, write the Skill to call these four scripts in order, checking for output at each step. If a step fails, the Skill should be able to tell you which step failed, which script, which input, or which external model call, instead of rerunning the entire workflow.
There’s a cost to consider: main Agents like Codex, Claude Code are meant for scheduling and fixing the workflow, not for running all text processing. Tasks that consume a lot of text, like proofreading and translation, are better handled by cheaper external models.
06 | Judgment Criteria
Not every Skill needs engineering. If it does something simple, rules fixed, input/output stable, prompts are sufficient.
To decide whether to engineer, the core question isn’t “should I make this into a Skill?” but rather: “Does this Skill have too many things that should be fixed but are still left for the large model to improvise each time?”
Four quick diagnostic questions:
Is there a fixed process? If it’s always the same sequence of steps, just with different inputs, it’s suitable for engineering.
Is there fixed code? If each time the Agent writes the same type of script ad hoc — video processing, batch file operations, API calls — it’s suitable for engineering.
Is there fixed input/output? If the output of each step can cleanly feed into the next, it’s suitable for engineering.
Is there clear division of labor? If some steps are better done by scripts, some by external models, with the main Agent only scheduling, it’s suitable for engineering.
Whether to make a Skill depends on whether the task will be done repeatedly. Whether to engineer depends on whether the Skill has too many parts that shouldn’t be left to the large model to improvise.
07 | Directly Reusable Prompt
The following prompt can be fed directly to Codex or Claude Code to help you engineer your existing Skill.
textPlease help me engineer this Skill. Principle: anything that can be fixed with code, generate a script; anything that cannot be fixed with code, write into the Skill’s rules and scheduling logic. Don’t just give a plan; try to land it in executable files and testable workflows.
Requirements:
- First, list the complete loop of this task.
- Break the loop into fixed steps.
- Specify the input and output of each step.
- Determine if each step can be scripted with code.
- For steps that can be scripted, directly generate or update the corresponding script, and clearly state the script name, script responsibility, input/output, and how to run it.
- For steps that cannot be scripted, write them into the Skill’s rules, judgment criteria, or scheduling logic.
- Generate or update the Skill file so it calls scripts, reads rules, and checks results in a fixed order.
- First run through the flow with a minimal example; don’t start batch processing.
- If a step fails, indicate which script, which input, or which external model call caused the failure; don’t rewrite the entire workflow.
Output should be divided into three parts:
- Task loop
- Generated / suggested scripts
- Skill rules, scheduling logic, and testing approach
Final Words
The next time your Skill becomes unstable, consumes too many Tokens, or is hard to debug, don’t just keep adding to the prompt.
The real question to ask is: In this Skill, what things should no longer be left to the large model to improvise each time?
For things that can be scripts, let AI help you write them as scripts. For things that can call cheaper external models, split them out of the main Agent. For things that can have fixed input/output, document the format and path. Finally, let the Skill handle scheduling, checking, and fixing.
You’ll gradually feel that running the Skill is no longer a gamble.
You’ll know which step is run by a script, which step is judged by a model, and where to fix when something goes wrong. This feeling will spill over into other work: anything with a fixed process, you’ll subconsciously ask: Can I engineer this too?
Golden Dust Horse | Big Tech Programmer | 30 days ×10,000 followers, monetization over 10,000 | Continuously sharing AI money-making, programmer transition, OPC insights | Contact info on profile: https://x.com/jinchenma_ai
Similar Articles
This article systematically reviews AI Agent architecture and engineering practices, covering control flow, context engineering, tool design, memory, multi-agent organization, evaluation, tracing, and security. It is based on the OpenClaw implementation and emphasizes the critical role of Harness (testing and validation infrastructure) for system stability.
This article systematically reviews AI Agent architecture and engineering practices, covering control flow, context engineering, tool design, memory, multi-agent organization, evaluation, tracing, and security. It is based on the OpenClaw implementation and emphasizes the critical role of Harness (testing and validation infrastructure) for system stability.
@wsl8297: When running complex tasks with AI agents, the most painful thing is often not that the model isn't strong enough, but that as the conversation gets longer, the context starts to overflow. You have to keep filling in background details, re-explaining the process, plus the redundant logs from tool calls — tokens just gush out like a broken pipe. Recently, I saw TencentDB Agent Memory open-sourced by Tencent...
Tencent has open-sourced TencentDB Agent Memory, which solves the AI agent long-context overflow problem through hierarchical memory management (symbolic short-term memory + hierarchical long-term memory). Benchmarks show token consumption reduced by up to 61% and task success rate improved by over 50%.
@axichuhai: https://x.com/axichuhai/status/2062146611472400461
Shares 8 curated AI skills, covering basic configuration, product development, and content creation, to boost AI productivity for agents such as Claude Code and CodeX.
@XAMTO_AI: There is an open-source project called narrator-ai-cli-skill. Plug it into an agent like Claude Code, and just say 'Help me make a commentary video for The Shawshank Redemption', and the AI handles everything: automatically generates commentary script without you lifting a finger, precisely matches corresponding movie clips, …
Introducing an open-source project called narrator-ai-cli-skill, which can be integrated into AI agents like Claude Code. With just one sentence, users can automatically complete the entire process of creating a movie commentary video — script, voiceover, editing, background music, etc. — greatly reducing the production cost of film commentary channels.
@vintcessun: I always thought AI agents could only write ordinary code. Turns out MIT HAN Lab is directly using an agent workflow to design and optimize CUDA kernels. Hand-tuning is time-consuming and easy to miss solutions. They came up with a workflow of "task contract + agent loop + small-step verification", letting the agent research, implement, verify...
MIT HAN Lab proposes a method to automatically design and optimize CUDA kernels using an AI agent workflow. Through a process of task contracts, agent loops, and small-step verification, the agent can autonomously iterate and optimize within a specialized toolchain, replacing manual tuning.