@shao__meng: The Internal Design, Iteration, and Maintenance of Agent Skills at Perplexity. The public version of Perplexity Agents' internal standards presents a counter-intuitive core argument: writing a Skill is not about writing code, but about building context for the model. Applying the instinct of engineers writing code directly to Skills...
Summary
The Perplexity team has published guidelines for the design, iteration, and maintenance of Agent Skills, emphasizing that writing Skills is not traditional coding but rather constructing context for the model. The article proposes a counter-intuitive methodology focused on evaluation-first approaches, progressive loading, and optimizing Agent behavior by handling edge cases (Gotchas).
The Perplexity Agent Skills Design, Iteration, and Maintenance Guide: The Official Internal Norms
The core argument is counter-intuitive: Writing Skills is not writing code; it is building context for models. Applying an engineer’s coding instincts directly to Skills will almost certainly lead to failure.
Source: https://research.perplexity.ai/articles/designing-refining-and-maintaining-agent-skills-at-perplexity…
Skill ≠ Code: Python Zen vs. Skill Anti-Patterns
- Python: Simple is better than complex. Skill: A Skill is a folder; complexity is a feature.
- Python: Explicit is better than implicit. Skill: Activation relies on implicit pattern matching + progressive disclosure.
- Python: Sparse is better than dense. Skill: Every token must yield maximum signal.
- Python: Special cases aren’t special enough to break the rules. Skill: Gotchas are the highest-value content.
- Python: If the implementation is easy to explain, it may be a good idea. Skill: If it’s easy to explain, the model already knows it → Delete it.
The Four Definitions of a Skill
1. A Skill is a Directory (not a single file)
- Standard structure: SKILL.md + scripts/ + references/ + assets/ + config.json.
- Complex domains require multi-level hierarchies. For example, U.S. tax law involves 1,945 IRC sections; flat loading performs worse than no loading at all, and the Skill only becomes usable with three levels of nesting. Hierarchy has a cost, however, requiring navigation tools (quick references, custom search) to mitigate the indirection.
2. A Skill is a Format
- Frontmatter must include name (lowercase, hyphenated, matching the directory name) and description.
- The description is a routing trigger, not documentation. Common error: writing “This Skill does X”. Correct approach: “Load when…”.
- depends: is used for cascading dependencies. Runtime metadata can be isolated in auxiliary JSON/YAML files to avoid polluting the context.
3. A Skill is Invocable
- Loading flow: load_skill() → copy the directory into the sandbox → recursively load dependencies → strip the frontmatter, exposing only the body and associated files.
4. A Skill is Progressive (this is the most critical cost model)
- Index: name + description for all Skills; ~100 tokens/Skill; paid in every session, for every user, always.
- Load: the SKILL.md body; ~5,000 tokens; loaded once, occupied for the duration of the session.
- Runtime: scripts/, references/, sub-Skills; unbounded cost; paid only when the model actually reads them.
- Rule: the higher the layer, the more expensive every word is. The Index is a “luxury boutique”; Runtime is an “infinite warehouse”.
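To make the format concrete, here is what the top of a SKILL.md might look like for a hypothetical CI-babysitting Skill (the name, description wording, and depends target are all invented for illustration, not taken from Perplexity’s library):

```markdown
---
name: watch-ci
description: Load when the user asks to babysit a PR, watch CI, or make sure a change lands.
depends:
  - git-conventions
---

Watch the referenced PR until checks pass. If a check fails twice for the
same reason, stop retrying and explain the failure instead.
```

Note that the frontmatter serves routing and dependency resolution only; per the loading flow above, the model never sees it once the Skill is loaded.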
When You Don’t Need a Skill
The phrase “Every Skill is a tax” is emphasized repeatedly. Three typical cases of abuse:
- What the model already knows: Writing a string of git commands is good documentation, but a bad Skill.
- Repeating system prompts: General knowledge should go into global context, not conditional loading.
- Too volatile: Remote MCP tool versions change frequently → Skills will drift, leading to hallucinations.
The litmus test: “Would the agent make a mistake without this sentence?” If the answer is no, delete it.
As Pascal said: “This letter is long only because I had no time to make it short.” Writing concise Skills is hard; Skills written quickly are likely problematic. Research also shows that LLM-generated Skills offer no average benefit because models cannot reliably write out “procedural knowledge useful for self-consumption.”
The Five-Step Construction Method (Order is Fixed)
Step 0 — Write Evals First: Derived from real queries, known failures, and neighborhood confusion. Negative examples are often more important than positive ones.
Step 1 — Write the Description (The Hardest Line):
- Start with “Load when…”, ≤ 50 words.
- Describe user intent (use real complaint language: “babysit”, “watch CI”, “make sure this lands”).
- Do not summarize the workflow.
- Single Goal: Accurate routing, minimizing regression impact on other Skills.
Step 2 — Write the Body: Skip the obvious; do not list command sequences; use intent statements instead of process scripts.
- Bad: git log; git checkout main; git checkout -b; git cherry-pick
- Good: “Cherry-pick to a clean branch, resolve conflicts while preserving intent, explain reasons if it fails.”
- Focus on gotchas / negative examples.
Step 3 — Use Hierarchy: Move conditional, heavy, or template-like content into scripts/, references/, assets/, config.json.
Step 4 — Iterate: Use an eval suite for fine-grained tuning (a single word difference in the description can trigger a routing cascade).
Step 5 — Ship.
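Step 0 can be as lightweight as a small file of labeled queries checked in alongside the Skill. A hypothetical layout (the file name, keys, and queries are all invented for illustration):

```yaml
# evals/watch-ci.yaml — hypothetical loading-eval cases for a "watch-ci" Skill
positive:                      # queries that SHOULD load the Skill
  - "babysit this PR until CI is green"
  - "make sure my change lands"
negative:                      # queries that must NOT load it
  - "explain what CI/CD means"
neighbors:                     # nearby Skills whose routing must not regress
  - git-conventions
```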
Maintenance: The Gotcha Flywheel
Skills are “append-only”:
- Agent makes a mistake → Add a gotcha.
- False positive load → Tighten description + Add negative examples.
- Should load but didn’t → Add keywords + Add positive examples.
- System prompt changes → Check for conflicts and duplication.
The journey from 80/20 to 99.9% relies almost entirely on the growth of the gotcha list, not on rewriting descriptions or adding longer instructions. If a PR changes the description without attaching evals, “it has already gone astray.”
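In practice the flywheel amounts to appending lines like the following to the Skill body after each observed failure (the gotchas shown are invented examples, not Perplexity’s):

```markdown
## Gotchas
- Never `git push --force` to a shared branch; use `--force-with-lease`.
- The staging deploy reports success before the service is actually live;
  poll the health endpoint before declaring the task done.
- If CI has been queued for more than 10 minutes, report the queue state
  instead of waiting silently.
```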
Four Categories of Eval Suites:
- Loading Evals: Precision, recall, and prohibition of loading (to avoid polluting the neighborhood).
- Progressive Loading Evals: Does the Skill correctly read associated files (e.g., FORMATTING.md) after loading?
- End-to-End Task Evals: Run the full agent loop, scored by an LLM judge based on a rubric.
- Cross-Model Evals: Run simultaneously on GPT / Opus / Sonnet (behavioral differences are significant).
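A minimal loading-eval harness along these lines can be sketched in a few lines of Python. This is a sketch under assumptions: `route` stands in for whatever mechanism selects Skills from the index, and every name below is invented, not Perplexity's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class LoadingCase:
    query: str
    expected: set[str] = field(default_factory=set)   # Skills that SHOULD load
    forbidden: set[str] = field(default_factory=set)  # neighbors that must NOT load


def score_loading(cases: list[LoadingCase], route) -> dict[str, float]:
    """Score a router: route(query) -> set of skill names it chose."""
    tp = fp = fn = violations = 0
    for case in cases:
        chosen = route(case.query)
        tp += len(chosen & case.expected)          # correct loads
        fp += len(chosen - case.expected)          # false-positive loads
        fn += len(case.expected - chosen)          # should-have-loaded misses
        violations += len(chosen & case.forbidden) # prohibited loads
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"precision": precision, "recall": recall,
            "forbidden_loads": float(violations)}
```

Tracking forbidden loads separately from precision matches the "pollution of the neighborhood" concern: a router can have decent aggregate precision while still stepping on one specific neighbor.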
Key Takeaways
- Evals first, then Skills; negative examples and “prohibited loads” are as important as positive examples.
- The Description is the hardest line; start with “Load when…”.
- Gotchas are the highest-value content; start thin, grow through failure.
- Action at a Distance: Adding a new Skill can silently degrade existing Skills — this is a default risk, not an edge case.
- The ability to write Skills grows with compound interest; any workflow repeated daily/weekly/quarterly is a potential Skill.
Designing, Refining, and Maintaining Agent Skills at Perplexity
Source: https://research.perplexity.ai/articles/designing-refining-and-maintaining-agent-skills-at-perplexity
Perplexity’s frontier agent products rest on a foundation of know-how and domain expertise packaged in modular Agent Skills. We maintain a carefully curated library of Skills across our technical environments. These Skills include many of the general-purpose utilities powering Perplexity Computer; vertical-specific capabilities in areas such as finance, law, and health; and a very long tail of modules for addressing user needs. Some Skills are infrequently invoked but critical when invoked. To ensure a consistently excellent user experience, Perplexity’s Agents team prioritizes Skill quality just as much as code quality.
The intuitions and best practices required to develop a high-quality Skill differ significantly from those required to build traditional software. The Agents team reviews many pull requests from excellent engineers who develop Skills in the course of their work. The result is almost always numerous comments and suggestions for revision. This is because many useful patterns for writing code become antipatterns in Skill creation. For example, if you take some of the aphorisms from PEP 20 – The Zen of Python, it quickly becomes clear that writing good Python code is unlike writing good Skills. Of the 20 lines of wisdom, at least half are fully wrong or actively misleading when writing Skills.
Here are five of them:
| Zen of Python | Zen of Skills |
|---|---|
| Simple is better than complex | A Skill is a folder, not a file. Complexity is the feature. |
| Explicit is better than implicit | Activation is implicit pattern matching. Progressive disclosure. |
| Sparse is better than dense | Context is expensive. Maximum signal per token. |
| Special cases aren’t special enough to break the rules | Gotchas ARE the special cases (they’re the highest-value content). |
| If the implementation is easy to explain, it may be a good idea | If it’s easy to explain, the model already knows it. Delete it. |
This guide is the document that engineers across Perplexity use when developing and reviewing Skills. We’re also releasing this guide to the public so that our discoveries and learnings can benefit the broader community. Whether you’re an engineer designing production Skills in your day-to-day work, a Computer user looking to develop your own Skill in an area you know best, or both, this guide is for you.
What is a Skill?
When you write a Skill, you aren’t writing plain old software (even though Skills are now part of the main logical engines for agent systems). Rather, you’re building context for models and their environments. A Skill has different constraints and different design principles. If you write a Skill like you do code, you will fail.
A Skill is at least four things, especially in the context of how we build them at Perplexity.
A Skill is a Directory
A Skill is not just a single SKILL.md file. In many cases, a Skill includes several files. Under the directory named after your Skill, you might have:
- SKILL.md: frontmatter and instructions
- scripts/: code the agent runs, not reinvents
- references/: heavy docs, loaded conditionally
- assets/: templates, schemas, and data
- config.json: first-run user setup
This hub-and-spoke pattern allows you to keep Skills focused and tight, and the folder structure can be used creatively. Particularly intricate Skills sometimes benefit from multiple levels of hierarchy to help the model navigate. Suppose a Skill requires knowledge across 300 topics, groupable into 20 subject-matter areas. Reliably choosing the right topic among 300 is an unsolved challenge even for today’s best frontier models; it is a much easier choice problem to first home in on one of the 20 areas and then pick among the ~15 topics within it.
As one example of how multilevel hierarchy provides value, our team employed three levels of topical nesting within the Skills powering Computer’s U.S. income tax capabilities this past tax season. This hierarchy was absolutely indispensable given the complexity of tax law: in our early tests, presenting the model with a single folder containing all 1,945 sections of the U.S. Internal Revenue Code resulted in worse performance than not loading the Skill at all. Organizing the information into logical subdivisions was indispensable for ensuring high-precision read operations.
Yet this hierarchy did not come free. Increasing levels of hierarchy require increasing levels of curation across the information architecture to manage the resulting indirection. We devised quick reference guides, custom search utilities, and other tools to support the model in locating information with a minimum of indirection. In this case, doing the hard work of curation ultimately produced a positive end result: a Skill that allowed models to perform tax-related tasks much more capably than using general tools alone.
A Skill is a Format
A Skill is a format. The root SKILL.md file must have both a name and a description in its frontmatter. The name must exactly match the name of the directory in which the Skill is located; it must be all lowercase, contain no spaces, and may use hyphens.
The description is the routing trigger. This is a common failure point: the description is not internal documentation for what the Skill does. It amounts to instructions for the model for when to load the Skill. So, you will frequently see “Load when,” not “This Skill does.” This is important because of the way that most implementations inject the description into the model context.
Within the frontmatter, there is also “depends:”, which allows you to create hierarchical Skill dependencies, and “metadata:”, which is used for reviews and evaluations. Different agent systems can even define their own frontmatter fields, to be used in a manner specific to those systems. As an alternative, Skill-specific metadata can be packaged in an auxiliary JSON or YAML configuration file. This is desirable when building agent systems that need to facilitate different types of runtime behavior per Skill without polluting the model’s context with minutiae.
Finally, similar behavior is obtainable through stripping Skill frontmatter on read. Computer employs this methodology, which allows configuration to be preserved in the root SKILL.md file. Careful attention to detail is required in the parsing logic, and one might wish to implement conditional stripping if there are certain fields that are useful to have within the model context.
A Skill is Invocable
A Skill is invocable. The agent loads a Skill at runtime. Importantly, Skills aren’t always bundled into the context. By default, most agent systems unfold Skills progressively upon specific need. There are at least three tiers of context costs in the way that we’ve implemented Skills in Computer. Here is the process:
1. Computer calls load_skill(name="...")
2. Computer copies the Skill directory into the isolated execution sandbox
3. Computer recursively auto-loads the dependencies listed in the “depends:” tag
4. Computer strips the frontmatter, so the agent sees only the body and the additional files
Different agent systems can choose to expose Skill content in different ways. As an example, some systems might choose not to expose the file hierarchy at all, leaving it to the model to discover the hierarchy through filesystem operations. Other systems may choose to give the model a mapping of the entire filetree up to a certain truncation and/or depth limit. To keep context clean, Computer omits full file hierarchies from the invocation context; however, this is overridable on a per-Skill basis.
A Skill is Progressive
Skills are progressive. In Computer, there are three different tiers of context costs, and we incur all three at various stages:
| Tier | What loads | Budget | When you pay |
|---|---|---|---|
| Index | name: description for every non-hidden Skill | ~100 tokens per Skill | Every session, every user, always paid |
| Load | Full SKILL.md body | ~5,000 tokens | Once per load; held for the rest of the session |
| Runtime | Files in scripts/, references/, assets/, subskills, FORMATTING.md, SPECIAL_CASES.md | Unbounded | Only when the agent reads them |
Computer builds a Skill index that has the name and the description for every available Skill. The budget for this is around 100 tokens per Skill (shorter is even better). It’s so tight because you’re paying this cost in every session, for every user. This is injected into the system prompt at the very beginning of the conversation. The model has access to a bunch of named Skills and descriptions so that it can decide whether to call “load_skill()”. The bar to getting into this index is extremely high. Your Skill needs to be very useful, and the description needs to be extremely dense and terse because everyone is paying the cost all the time.
After the agent system loads the Skill, there’s the full SKILL.md body. Ideally, the body text does not exceed 5,000 tokens. Even then, you want every sentence to matter because once you load a Skill, the rest of the conversation has to pay that until you hit the compaction boundary. Many threads load anywhere between three and five different Skills, multiplying this cost. Skills with a lot of fluff will almost certainly degrade other Skills as well as overall agentic capabilities. In short, if your Skill loads and it doesn’t do the right thing, that’s wasted context.
The final level of progression is scripts or special cases, like subskills or formatting. This is where you want to put unbounded conditional branched logic. The agent will only use it when it needs to, meaning there’s a much lower bar for what you want to put in here. In the index, every token is important. The loaded Skill body is more relaxed, and the runtime is the most relaxed. This could be 20,000 tokens or zero tokens. This is the level at which you might think about expanding the context of the model in a progressive fashion.
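The three-tier cost model reduces to simple arithmetic. A back-of-envelope estimator using the article’s ~100 and ~5,000 token figures (the runtime figure per thread is whatever the agent actually reads, so it is a caller-supplied assumption here):

```python
def session_context_cost(n_indexed_skills: int,
                         n_loaded_skills: int,
                         runtime_tokens: int = 0) -> dict[str, int]:
    """Estimate context tokens spent on Skills in one session.

    Index:   ~100 tokens per Skill, paid by every session unconditionally.
    Load:    ~5,000 tokens per loaded Skill, held until compaction.
    Runtime: whatever the agent actually reads (unbounded, often zero).
    """
    index = 100 * n_indexed_skills
    load = 5_000 * n_loaded_skills
    return {
        "index": index,
        "load": load,
        "runtime": runtime_tokens,
        "total": index + load + runtime_tokens,
    }
```

With 50 indexed Skills and a thread that loads four of them, the fixed bill is already 25,000 tokens before the agent reads a single runtime file, which is why the index and body tiers demand the most ruthless editing.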
When do you need a Skill?
The Agents team is often asked to opine on whether a Skill is truly needed for a given domain or use case. Very rarely do we have a definitive answer from first principles alone. The only way to really figure this out is to start with your agent without the Skill, run several hero queries, and then figure out whether the agent is doing a good job.
When you need a Skill
There are many tasks that are in distribution for trained models. You only need to apply a Skill if you want to change that behavior in some specific way that you can’t with, say, one sentence in your prompt. So, you need a Skill when the agent will get it wrong without special context, or if there’s some inconsistency or non-determinism that you need to be extremely consistent across runs.
It could be that your knowledge is durable but not in the training data. There could be cutoffs or enterprise-specific workflows, or it could be a matter of taste. For example, we have several design-related Skills in Computer written by Henry Modisett (our head of design). The reason that every token exists in those Skills is because Henry has very good taste when it comes to designing websites and PDFs. Henry specifies which fonts to use and which fonts not to use, how those fonts feel, and other matters of judgment that the model can’t learn from training data alone.
When you don’t need a Skill
We see many Skills in which engineers have written a series of git commands that need to be executed in order. That’s unnecessary because the model already knows how to do that, meaning it makes for great documentation but a poor Skill. We see examples where Skills recapitulate instructions from the system prompt. You don’t need a Skill for that. Knowledge relevant for the majority of requests should be included in global context, not in a conditionally loaded Skill.
If there’s something that’s changing faster than you can maintain it, you don’t need a Skill. For example, if you’re hitting some remote MCP endpoint and its tools or the versions of those tools are changing frequently, you shouldn’t inject those into a Skill. If you do, you’ll just end up with drift and the model will make mistakes.
Every Skill is a tax
Here’s a useful test you can apply to every sentence in your Skill: “Would the agent get this wrong without this instruction?” If the sentence does not need to be there, it cannot afford to be there because everyone is paying this cost every single time. When you are deciding whether to add a Skill or not, remember this tax wherein every session and every user costs tokens.
The following famous quote, which sounds much better in French, roughly translates to “I have only made this letter longer because I have not had the time to make it shorter.”
« Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. » — Blaise Pascal, Les Provinciales