@shao__meng: The Internal Design, Iteration, and Maintenance of Agent Skills at Perplexity. The public version of Perplexity Agents' internal standards presents a counter-intuitive core argument: writing a Skill is not about writing code, but about building context for the model. Applying the instinct of engineers writing code directly to Skills...


Summary

The Perplexity team has published guidelines for the design, iteration, and maintenance of Agent Skills, emphasizing that writing Skills is not traditional coding but rather constructing context for the model. The article proposes a counter-intuitive methodology focused on evaluation-first approaches, progressive loading, and optimizing Agent behavior by handling edge cases (Gotchas).


Cached at: 05/09/26, 06:11 PM

The Perplexity Agent Skills Design, Iteration, and Maintenance Guide: The Official Internal Standards

The core argument is counter-intuitive: Writing Skills is not writing code; it is building context for models. Applying an engineer’s coding instincts directly to Skills will almost certainly lead to failure.

Source: https://research.perplexity.ai/articles/designing-refining-and-maintaining-agent-skills-at-perplexity…

Skill ≠ Code: Python Zen vs. Skill Anti-Patterns

  • Python: Simple is better than complex. Skill: A Skill is a folder; complexity is a feature.
  • Python: Explicit is better than implicit. Skill: Activation relies on implicit pattern matching + progressive disclosure.
  • Python: Sparse is better than dense. Skill: Every token must yield maximum signal.
  • Python: Special cases aren’t special enough to break the rules. Skill: Gotchas are the highest-value content.
  • Python: If the implementation is easy to explain, it may be a good idea. Skill: If it’s easy to explain, the model already knows it → Delete it.

The Four Definitions of a Skill

  1. A Skill is a Directory (not a single file)

    • Standard Structure: SKILL.md + scripts/ + references/ + assets/ + config.json
    • Complex domains require multi-level hierarchies. For example, U.S. tax law involves 1,945 IRC sections; flat loading performs worse than no loading at all. It only becomes usable after three levels of nesting. However, hierarchy has a cost, requiring navigation tools (quick references, custom search) to mitigate indirection.
  2. A Skill is a Format

    • Frontmatter must include name (lowercase, hyphenated, matching the directory name) and description.
    • The description is a routing trigger, not documentation. Common error: Writing “This Skill does X”; Correct approach: “Load when…”.
    • depends: is used for cascading dependencies. Runtime metadata can be isolated in auxiliary JSON/YAML files to avoid polluting the context.
  3. A Skill is Invocable

    • Loading Flow: load_skill() → Copy directory into sandbox → Recursively load dependencies → Strip frontmatter, exposing only the body and associated files.
  4. A Skill is Progressive (This is the most critical cost model)

    • Index: Name + Description for all Skills; ~100 tokens/Skill; Paid in every session, for every user, always.
    • Load: SKILL.md body; ~5,000 tokens; Loaded once, occupied for the duration of the session.
    • Runtime: scripts/, references/, sub-Skills; Unbounded cost; Paid only when the model actually reads them.
    • Rule: The higher the layer, the more expensive every word is. The Index is a “luxury boutique”; Runtime is an “infinite warehouse”.

When You Don’t Need a Skill

The phrase “Every Skill is a tax” is emphasized repeatedly. Three typical cases of abuse:

  • What the model already knows: Writing a string of git commands is good documentation, but a bad Skill.
  • Repeating system prompts: General knowledge should go into global context, not conditional loading.
  • Too volatile: Remote MCP tool versions change frequently → Skills will drift, leading to hallucinations.

The litmus test for every sentence: “Would the Agent make a mistake without this sentence?” If the answer is no, delete it.

As Pascal said: “This letter is long only because I had no time to make it short.” Writing concise Skills is hard; Skills written quickly are likely problematic. A cited study also finds that LLM self-generated Skills offer no average benefit, because models cannot reliably write out “procedural knowledge useful for self-consumption.”

The Five-Step Construction Method (Order is Fixed)

Step 0 — Write Evals First: Derived from real queries, known failures, and neighborhood confusion. Negative examples are often more important than positive ones.

Step 1 — Write the Description (The Hardest Line):

  • Start with “Load when…”, ≤ 50 words.
  • Describe user intent (use real complaint language: “babysit”, “watch CI”, “make sure this lands”).
  • Do not summarize the workflow.
  • Single Goal: Accurate routing, minimizing regression impact on other Skills.
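The Step 1 constraints are mechanical enough to lint. The sketch below is one way to enforce them; the function name and checks are assumptions for illustration, not Perplexity's tooling.

```python
def lint_description(description: str, max_words: int = 50) -> list[str]:
    """Flag a Skill description that breaks the Step 1 rules above."""
    issues = []
    # The description is a routing trigger, not documentation.
    if not description.startswith("Load when"):
        issues.append("should start with 'Load when...' (routing trigger, not docs)")
    # Keep it terse: the index is paid in every session, for every user.
    words = len(description.split())
    if words > max_words:
        issues.append(f"{words} words; keep it to {max_words} or fewer")
    # The classic anti-pattern: summarizing what the Skill does.
    if description.lower().startswith("this skill"):
        issues.append("'This Skill does X' summarizes the workflow instead of routing")
    return issues
```

A description written in real complaint language ("babysit a PR", "watch CI") passes cleanly; a workflow summary is flagged.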

Step 2 — Write the Body: Skip the obvious; do not list command sequences; use intent statements instead of process scripts.

  • Bad: git log; git checkout main; git checkout -b; git cherry-pick
  • Good: “Cherry-pick to a clean branch, resolve conflicts while preserving intent, and explain the reason if it fails to land.”
  • Focus on gotchas / negative examples.

Step 3 — Use Hierarchy: Move conditional, heavy, or template-like content into scripts/, references/, assets/, config.json.

Step 4 — Iterate: Use an eval suite for fine-grained tuning (a single word difference in the description can trigger a routing cascade).

Step 5 — Ship.

Maintenance: The Gotcha Flywheel

Skills are “append-only”:

  • Agent makes a mistake → Add a gotcha.
  • False positive load → Tighten description + Add negative examples.
  • Should load but didn’t → Add keywords + Add positive examples.
  • System prompt changes → Check for conflicts and duplication.
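The flywheel above is essentially a lookup from failure mode to append-only fix. A minimal sketch, assuming hypothetical failure-mode keys of my own naming; the actions are the ones listed in the flywheel.

```python
# Maintenance flywheel: each observed failure maps to one append-only action.
# The dictionary keys are illustrative labels, not Perplexity terminology.
FLYWHEEL = {
    "agent_mistake":   "append a gotcha to the Skill",
    "false_positive":  "tighten the description and add negative examples",
    "missed_load":     "add keywords and positive examples",
    "prompt_changed":  "check the Skill for conflicts and duplication",
}

def next_action(failure_mode: str) -> str:
    """Return the maintenance action for an observed failure mode."""
    return FLYWHEEL.get(failure_mode, "no change: re-run evals before editing")
```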

The journey from 80/20 to 99.9% relies almost entirely on the growth of the gotcha list, not on rewriting descriptions or adding longer instructions. If a PR changes the description without attaching evals, “it has already gone astray.”

Four Categories of Eval Suites:

  1. Loading Evals: Precision, recall, and prohibition of loading (to avoid polluting the neighborhood).
  2. Progressive Loading Evals: Does the agent correctly read associated files (e.g., FORMATTING.md) after the Skill loads?
  3. End-to-End Task Evals: Run the full agent loop, scored by an LLM judge based on a rubric.
  4. Cross-Model Evals: Run simultaneously on GPT / Opus / Sonnet (behavioral differences are significant).
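The first category (loading evals) can be sketched as a scoring function over labelled cases. The dataset shape and metric definitions below are assumptions, not Perplexity's eval harness; note that prohibited loads are tracked separately because they pollute neighboring Skills.

```python
def loading_eval(cases, route):
    """Score a routing function against labelled queries.

    Each case is (query, expected_skill_or_None); `route` maps a query to the
    Skill it would load, or None. A load on a None-labelled query counts as a
    prohibited load.
    """
    tp = fp = fn = prohibited = 0
    for query, expected in cases:
        got = route(query)
        if expected is None:
            if got is not None:
                prohibited += 1          # loaded when nothing should load
        elif got == expected:
            tp += 1                      # correct load
        else:
            fn += 1                      # missed the right Skill
            if got is not None:
                fp += 1                  # ...and loaded the wrong one
    denom_p = tp + fp + prohibited
    denom_r = tp + fn
    return {
        "precision": tp / denom_p if denom_p else 1.0,
        "recall": tp / denom_r if denom_r else 1.0,
        "prohibited": prohibited,
    }
```

A toy keyword router over three cases (two positives, one prohibited-load probe) shows the shape of the output.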

Key Takeaways

  1. Evals first, then Skills; negative examples and “prohibited loads” are as important as positive examples.
  2. The Description is the hardest line; start with “Load when…”.
  3. Gotchas are the highest-value content; start thin, grow through failure.
  4. Action at a Distance: Adding a new Skill can silently degrade existing Skills — this is a default risk, not an edge case.
  5. The ability to write Skills grows with compound interest; any workflow repeated daily/weekly/quarterly is a potential Skill.

Designing, Refining, and Maintaining Agent Skills at Perplexity

Source: https://research.perplexity.ai/articles/designing-refining-and-maintaining-agent-skills-at-perplexity

Perplexity’s frontier agent products rest on a foundation of know-how and domain expertise packaged in modular Agent Skills. We maintain a carefully curated library of Skills across our technical environments. These Skills include many of the general-purpose utilities powering Perplexity Computer; vertical-specific capabilities in areas such as finance, law, and health; and a very long tail of modules for addressing user needs. Some Skills are infrequently invoked but critical when invoked. To ensure a consistently excellent user experience, Perplexity’s Agents team prioritizes Skill quality just as much as code quality.

The intuitions and best practices required to develop a high-quality Skill differ significantly from those required to build traditional software. The Agents team reviews many pull requests from excellent engineers who develop Skills in the course of their work. The result is almost always numerous comments and suggestions for revision. This is because many useful patterns for writing code become antipatterns in Skill creation. For example, if you take some of the aphorisms from PEP 20 – The Zen of Python, it quickly becomes clear that writing good Python code is unlike writing good Skills. Of the 20 lines of wisdom, at least half are fully wrong or actively misleading when writing Skills.

Here are five of them:

Zen of Python vs. Zen of Skills:

  • Python: Simple is better than complex. Skills: A Skill is a folder, not a file. Complexity is the feature.
  • Python: Explicit is better than implicit. Skills: Activation is implicit pattern matching. Progressive disclosure.
  • Python: Sparse is better than dense. Skills: Context is expensive. Maximum signal per token.
  • Python: Special cases aren’t special enough to break the rules. Skills: Gotchas ARE the special cases (they’re the highest-value content).
  • Python: If the implementation is easy to explain, it may be a good idea. Skills: If it’s easy to explain, the model already knows it. Delete it.

This guide is the document that engineers across Perplexity use when developing and reviewing Skills. We’re also releasing this guide to the public so that our discoveries and learnings can benefit the broader community. Whether you’re an engineer designing production Skills in your day-to-day work, a Computer user looking to develop your own Skill in an area you know best, or both, this guide is for you.

What is a Skill?

When you write a Skill, you aren’t writing plain old software (even though Skills are now part of the main logical engines for agent systems). Rather, you’re building context for models and their environments. A Skill has different constraints and different design principles. If you write a Skill like you do code, you will fail.

A Skill is at least four things, especially in the context of how we build them at Perplexity.

A Skill is a Directory

A Skill is not just a single SKILL.md file. In many cases, a Skill includes several files. Under the directory named after your Skill, you might have:

  • SKILL.md: frontmatter and instructions
  • scripts/: code the agent runs, not reinvents
  • references/: heavy docs, loaded conditionally
  • assets/: templates, schemas, and data
  • config.json: first-run user setup

This hub-and-spoke pattern allows you to keep Skills very focused and tight, and one can use the folder structure in a very creative way. Sometimes, particularly intricate Skills benefit from multiple levels of hierarchy to help the model navigate better. Suppose a Skill requires knowledge across 300 topics, groupable into 20 subject matter areas. Reliably choosing the right topic among 300 is an unsolved challenge even for today’s best frontier models. It’s a much easier choice problem for a model to first home in on one of the 20 areas, and then choose among the roughly 15 topics within that area.
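The intuition that two small choices beat one large one can be made concrete with a toy model. The decay function below is purely an illustrative assumption (not a measured curve): per-choice accuracy degrades as the option count grows, so a 20-way pick followed by a 15-way pick beats a flat 300-way pick.

```python
def p_correct(k: int, base: float = 0.999) -> float:
    """Toy model: probability of picking correctly among k options,
    assuming accuracy decays geometrically with the option count.
    The decay rate is an assumption for illustration only."""
    return base ** k

flat = p_correct(300)                          # one pick among 300 topics
hierarchical = p_correct(20) * p_correct(15)   # area pick, then topic pick
# Under this toy model, the hierarchical route is the more reliable one.
```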

As one example of how multilevel hierarchy provides value, our team employed three levels of topical nesting within the Skills powering Computer’s U.S. income tax capabilities this past tax season. This hierarchy was absolutely indispensable given the complexity of tax law: in our early tests, presenting the model with a single folder containing all 1,945 sections of the U.S. Internal Revenue Code resulted in worse performance than not loading the Skill at all. Organizing the information into logical subdivisions was indispensable for ensuring high-precision read operations.

Yet this hierarchy did not come free. Increasing levels of hierarchy require increasing levels of curation across the information architecture to manage the resulting indirection. We devised quick reference guides, custom search utilities, and other tools to support the model in locating information with a minimum of indirection. In this case, doing the hard work of curation ultimately produced a positive end result: a Skill that allowed models to perform tax-related tasks much more capably than using general tools alone.

A Skill is a Format

A Skill is a format. The core root SKILL.md file must have both a name and a description. Furthermore, the name needs to map exactly to the name of the directory in which the Skill is located. The name must be all lower-case, contain no spaces, and may use hyphens.

The description is the routing trigger. This is a common failure point: the description is not internal documentation for what the Skill does. It amounts to instructions for the model for when to load the Skill. So, you will frequently see “Load when,” not “This Skill does.” This is important because of the way that most implementations inject the description into the model context.

Within the frontmatter, there is also “depends:”, which allows you to create hierarchical Skill dependencies, and “metadata:”, which is used for reviews and evaluations. Different agent systems can even define their own frontmatter fields, to be used in a manner specific to those systems. As an alternative, Skill-specific metadata can be packaged in an auxiliary JSON or YAML configuration file. This is desirable when building agent systems that need to facilitate different types of runtime behavior per Skill without polluting the model’s context with minutiae.

Finally, similar behavior is obtainable through stripping Skill frontmatter on read. Computer employs this methodology, which allows configuration to be preserved in the root SKILL.md file. Careful attention to detail is required in the parsing logic, and one might wish to implement conditional stripping if there are certain fields that are useful to have within the model context.
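The naming and format rules above can be sketched as a small frontmatter check. This assumes frontmatter has already been parsed into a dict; the function name and message wording are illustrative, not Perplexity's implementation.

```python
import re

# Lowercase, hyphen-separated, no spaces -- the naming rule described above.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def check_skill_format(directory_name: str, frontmatter: dict) -> list[str]:
    """Check parsed SKILL.md frontmatter against the format rules.
    Returns a list of human-readable violations (empty if valid)."""
    problems = []
    name = frontmatter.get("name", "")
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name {name!r} must be all lower-case, hyphenated, with no spaces")
    if name != directory_name:
        problems.append(f"name {name!r} must exactly match directory {directory_name!r}")
    if "description" not in frontmatter:
        problems.append("frontmatter must include a description (the routing trigger)")
    return problems
```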

A Skill is Invocable

A Skill is invocable. The agent loads a Skill at runtime. Importantly, Skills aren’t always bundled into the context. By default, most agent systems unfold Skills progressively upon specific need. There are at least three tiers of context costs in the way that we’ve implemented Skills in Computer. Here is the process:

  1. Computer calls load_skill(name="...")
  2. Computer copies the Skill directory into the isolated execution sandbox
  3. Computer recursively auto-loads dependencies in the “depends:” tag
  4. Computer then strips the frontmatter and the agent thus only sees the body and the additional files
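The four steps above can be sketched as a minimal loader. The copy-into-sandbox, recursive `depends:` resolution, and frontmatter stripping follow the process described; the function signatures and the naive line-based frontmatter scan are assumptions, not Computer's actual code.

```python
import shutil
from pathlib import Path

def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' ... '---' YAML block, exposing only the body."""
    if text.startswith("---"):
        _, _, rest = text.partition("\n---\n")
        return rest.lstrip("\n")
    return text

def load_skill(name, skills_root, sandbox, loaded=None):
    """Copy a Skill directory into the sandbox, recursively load its
    'depends:' list, and return the frontmatter-stripped body."""
    loaded = loaded if loaded is not None else set()
    if name in loaded:                  # guard against dependency cycles
        return ""
    loaded.add(name)
    src = Path(skills_root) / name
    dst = Path(sandbox) / name
    shutil.copytree(src, dst, dirs_exist_ok=True)
    text = (dst / "SKILL.md").read_text()
    # Naive scan for a 'depends:' line (an assumption, not a real YAML parser).
    for line in text.splitlines():
        if line.startswith("depends:"):
            for dep in line.removeprefix("depends:").replace(",", " ").split():
                load_skill(dep, skills_root, sandbox, loaded)
    return strip_frontmatter(text)
```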

Different agent systems can choose to expose Skill content in different ways. As an example, some systems might choose not to expose the file hierarchy at all, leaving it to the model to discover the hierarchy through filesystem operations. Other systems may choose to give the model a mapping of the entire filetree up to a certain truncation and/or depth limit. To keep context clean, Computer omits full file hierarchies from the invocation context; however, this is overridable on a per-Skill basis.
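One of the exposure options mentioned, giving the model a filetree map up to a depth limit, can be sketched as follows; the rendering format and depth convention are assumptions.

```python
import os

def render_tree(root: str, max_depth: int = 2) -> list[str]:
    """Return relative file paths under a Skill directory, descending at
    most max_depth directory levels, as a bounded map for the model."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []            # prune: don't descend past the limit
        for name in filenames:
            lines.append(name if rel == "." else os.path.join(rel, name))
    return sorted(lines)
```

With `max_depth=1`, files in `scripts/` still appear, but anything nested deeper (e.g. under `references/deep/`) is truncated away.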

A Skill is Progressive

Skills are progressive. In Computer, there are three different tiers of context costs, and we incur all three at various stages:

Tier breakdown (what loads, budget, when you pay):

  • Index: name: description for every non-hidden Skill; ~100 tokens per Skill; paid every session, for every user, always.
  • Load: the full SKILL.md body; ~5,000 tokens; paid once at load, held until the compaction boundary.
  • Runtime: files in scripts/, references/, assets/, subskills, FORMATTING.md, SPECIAL_CASES.md; unbounded; paid only when the agent reads them.

Computer builds a Skill index that has the name and the description for every available Skill. The budget for this is around 100 tokens per Skill (shorter is even better). It’s so tight because you’re paying this cost in every session, for every user. This is injected into the system prompt at the very beginning of the conversation. The model has access to a bunch of named Skills and descriptions so that it can decide whether to call “load_skill()”. The bar to getting into this index is extremely high. Your Skill needs to be very useful, and the description needs to be extremely dense and terse because everyone is paying the cost all the time.

After the agent system loads the Skill, there’s the full SKILL.md body. Ideally, the body text does not exceed 5,000 tokens. Even then, you want every sentence to matter because once you load a Skill, the rest of the conversation has to pay that until you hit the compaction boundary. Many threads load anywhere between three and five different Skills, multiplying this cost. Skills with a lot of fluff will almost certainly degrade other Skills as well as overall agentic capabilities. In short, if your Skill loads and it doesn’t do the right thing, that’s wasted context.

The final level of progression is scripts or special cases, like subskills or formatting. This is where you want to put unbounded conditional branched logic. The agent will only use it when it needs to, meaning there’s a much lower bar for what you want to put in here. In the index, every token is important. The loaded Skill body is more relaxed, and the runtime is the most relaxed. This could be 20,000 tokens or zero tokens. This is the level at which you might think about expanding the context of the model in a progressive fashion.
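The three-tier cost model lends itself to back-of-envelope arithmetic. The per-tier figures (~100 and ~5,000 tokens) come from the text above; the 200-Skill library size and the four-Skill thread are illustrative assumptions.

```python
# Token figures per tier are taken from the text; library size is assumed.
INDEX_TOKENS_PER_SKILL = 100    # paid every session, every user, always
LOAD_TOKENS_PER_BODY = 5_000    # paid per loaded Skill until compaction

def skill_context_cost(library_size: int, loaded_skills: int,
                       runtime_tokens: int = 0) -> int:
    """Tokens a single session spends on Skills across all three tiers."""
    return (library_size * INDEX_TOKENS_PER_SKILL
            + loaded_skills * LOAD_TOKENS_PER_BODY
            + runtime_tokens)

# A 200-Skill index with 4 loaded Skills already costs
# 200 * 100 + 4 * 5,000 = 40,000 tokens before any runtime file is read.
```

This is why the bar for the index is so high: the first term is paid even in sessions that never load a single Skill.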

When do you need a Skill?

The Agents team is often asked to opine on whether a Skill is truly needed for a given domain or use case. Very rarely do we have a definitive answer from first principles alone. The only way to really figure this out is to start with your agent without the Skill, run several hero queries, and then figure out whether the agent is doing a good job.

When you need a Skill

There are many tasks that are in distribution for trained models. You only need to apply a Skill if you want to change that behavior in some specific way that you can’t with, say, one sentence in your prompt. So, you need a Skill when the agent will get it wrong without special context, or if there’s some inconsistency or non-determinism that you need to be extremely consistent across runs.

It could be that your knowledge is durable but not in the training data. There could be cutoffs or enterprise-specific workflows, or it could be a matter of taste. For example, we have several design-related Skills in Computer written by Henry Modisett (our head of design). The reason that every token exists in those Skills is because Henry has very good taste when it comes to designing websites and PDFs. Henry specifies which fonts to use and which fonts not to use, how those fonts feel, and other matters of judgment that the model can’t learn from training data alone.

When you don’t need a Skill

We see many Skills in which engineers have written a series of git commands that need to be executed in order. That’s unnecessary because the model already knows how to do that, meaning it makes for great documentation but a poor Skill. We see examples where Skills recapitulate instructions from the system prompt. You don’t need a Skill for that. Knowledge relevant for the majority of requests should be included in global context, not in a conditionally loaded Skill.

If there’s something that’s changing faster than you can maintain it, you don’t need a Skill. For example, if you’re hitting some remote MCP endpoint and its tools or the versions of those tools are changing frequently, you shouldn’t inject those into a Skill. If you do, you’ll just end up with drift and the model will make mistakes.

Every Skill is a tax

Here’s a useful test you can apply to every sentence in your Skill: “Would the agent get this wrong without this instruction?” If the sentence does not need to be there, it cannot afford to be there because everyone is paying this cost every single time. When you are deciding whether to add a Skill or not, remember this tax wherein every session and every user costs tokens.

The following famous quote, which sounds much better in French, roughly translates to “I have only made this letter longer because I have not had the time to make it shorter.”

« Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. » — Blaise Pascal, Les Provinciales
