Is it agentic enough? Benchmarking open models on your own tooling

Hugging Face Blog Tools

Summary

This blog post introduces a benchmark methodology for evaluating how well open models perform on agentic coding tasks, focusing not just on accuracy but on the efficiency of the agent's process. It provides a customizable tooling harness using the pi coding agent and tests across models and library revisions.


Cached at: 06/18/26, 05:40 PM


Source: https://huggingface.co/blog/is-it-agentic-enough

Benchmarking transformers revisions across different metrics


This is a human-made, agent-focused blogpost.

Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch. This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. Clunky APIs and stale docs annoy us developers, but they now also send the agent down a longer, more expensive path.

Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there, and how that shifts across models, library revisions, and tasks. We measured exactly that, using `transformers` as our case study.

Here, we will introduce a tool-specific benchmark focusing on how the answer was found, and provide a simple implementation of one such harness. It runs entirely on open models driven by the pi coding agent, with the full sweep of models × revisions × tasks fanned out across Hugging Face Jobs so every run sees identical hardware.

But how do you optimize software for agents?

We’re strong believers in the following two software principles:

  • If it isn’t tested, then it doesn’t work
  • If it isn’t documented, then it doesn’t exist

Both principles still hold for agent-optimized tooling, and, for once, the two are directly tied to each other.

You want your tool to exist for an agent: it needs to be discoverable. The API needs to be clear and the docs need to be extensive. They need to be structured in a way that the agent has rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic-use.

Testing software for agentic-use

We’ll use `transformers` as an example throughout this blogpost: agents using it to solve ML tasks (classifying text, captioning images, transcribing audio), not contributing code to it. The harness, though, was designed to work with any tool that can be operated from the command line.

Our intuition on `transformers` was that usage could be dramatically simplified with a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe recently applied to the `hf` CLI, redesigned to be agent-optimized, where agents used 1.3–1.8× (and up to 6×) fewer tokens. We wanted to know whether that kind of win generalizes, and whether it could be useful for `transformers` as well.

Intuition is a powerful tool, but we wanted more evidence before we opened PRs that add several thousand lines of code to such a widely used codebase as `transformers`. We set out to measure what success looks like.

Not all successes are equal

Two agents can both produce the correct label for a sentiment-classification task, but one:

  • writes a 40-line Python script, imports `transformers`, debugs a shape error, re-runs twice, and finally prints the answer;

while the other

  • types `transformers classify --model ... --text "..."` and is done in one call.

Both reach `POSITIVE (0.9999)`. Here are the two paths agents actually took on this exact task:

# Task: classify the sentiment of "I absolutely loved the movie, it was fantastic!"

- # one agent: pipe a script into python and parse the output
- python - <<'PY'
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
- import torch.nn.functional as F
-
- model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
- tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
- inputs = tokenizer("I absolutely loved the movie, it was fantastic!", return_tensors="pt")
- with torch.no_grad():
-     logits = model(**inputs).logits
- probs = F.softmax(logits, dim=1)
- idx = torch.argmax(probs, dim=1).item()
- print(model.config.id2label[idx], probs[0][idx].item())
- PY

+ # the other agent: one command
+ transformers classify \
+   --model distilbert/distilbert-base-uncased-finetuned-sst-2-english \
+   --text "I absolutely loved the movie, it was fantastic!"

Both methods reach the same result. But they have very different profiles in cost, latency, token usage, and failures.

If your evaluation only checks the final string, you’re blind to these differences, and to whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents.

Our goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library improve performance.

How do we run evaluations?

A few words on how we’ll evaluate agents here.

We run every task under three variants (or “tiers”), three different ways an agent can come at `transformers`:

bare     pip install transformers, and nothing else
clone    the full transformers source, checked out in the working directory
skill    a packaged Skill: the CLI's docs + task examples, loaded in context

These aren’t nested: `skill` doesn’t contain `clone` (it ships curated docs, not the source tree), and neither strictly contains the other; each gives the agent a different kind of help. As we’ll see, a model can sometimes do better on `clone` than on `skill`.

A few more choices:

  • For now we only focus on deterministic tasks which can provide an exact match, as they provide a very nice ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks.
  • Every run is its own Hugging Face Job: one per (model × revision × task), so the whole sweep runs in parallel on identical hardware, which keeps the comparison fair at scale.
  • Results and traces land in a Hugging Face Bucket: fast, no versioning needed, and handles very high write concurrency.
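To make the "one job per cell" idea concrete, here is a minimal sketch of how such a sweep grid could be enumerated. The `Cell` type and `enumerate_cells` helper are hypothetical names introduced for illustration, not part of the harness, and the actual submission to Hugging Face Jobs is deliberately left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One independent run: a single point in the sweep grid (hypothetical type)."""
    model: str
    revision: str
    task: str


def enumerate_cells(models, revisions, tasks):
    """Return one Cell per (model, revision, task) combination."""
    return [Cell(m, r, t) for m, r, t in product(models, revisions, tasks)]


# Names borrowed from the post; the grid itself is only an illustration.
models = ["Kimi-K2.6", "Qwen3-4B"]
revisions = ["v5.8.0", "v5.9.0"]
tasks = ["classify-sentiment", "answer-question"]

cells = enumerate_cells(models, revisions, tasks)
print(len(cells))  # 2 models * 2 revisions * 2 tasks = 8 independent jobs
```

Because every cell is independent, each one can be handed to its own job without any coordination between runs, which is what lets the whole sweep run in parallel.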

Which models to benchmark against?

Not all models driving agents are equal, and their differences change what you should look at when running them.

Large open models

At one end, you have the largest, most capable open models. On reasonably common tasks, these should get the right answer, eventually. For them, task completion saturates near 100% and stops telling you much about your tool; a more relevant benchmark is the effort it took the agent to get there: how many turns, tokens and seconds it took, and whether it walked a clean path or used deprecated APIs.

Local

Local models vary widely in size, and so do their abilities. Metrics such as **“match %”** are more relevant here than for their larger counterparts, as you can see how model size and capability affect results on your specific tool.

This harness not only provides guidance to library maintainers on how to improve a repository for agent interactions, it also helps assess how different agents and models perform on the tasks users care about.

The harness scores every run on several axes, so that you can ask what actually matters for each class of model:

  • match %: did the final answer contain the expected result (per-task, case-insensitive substring / regex / exact, all explicit in the report);
  • median time and median tokens (new vs. cached vs. generated);
  • runs with error %: including a guard that flags runs which produced nothing (0 output tokens, no tool calls, no answer) so silent failures don’t masquerade as “0”;
  • marker adoption: tool-defined behavior markers; see below for an explanation of what this is.
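The first and third checks in the list above can be sketched in a few lines. This is an illustration under assumptions, not the harness's code: the function names are hypothetical, and we assume that substring and regex matching ignore case while exact matching only strips surrounding whitespace.

```python
import re


def matches(answer: str, expected: str, mode: str = "substring") -> bool:
    """Check a final answer against the expected result (hypothetical helper)."""
    if mode == "substring":
        # Assumption: case-insensitive containment.
        return expected.lower() in answer.lower()
    if mode == "regex":
        # Assumption: regexes are also applied case-insensitively.
        return re.search(expected, answer, flags=re.IGNORECASE) is not None
    if mode == "exact":
        # Assumption: exact means identical after trimming whitespace.
        return answer.strip() == expected.strip()
    raise ValueError(f"unknown match mode: {mode!r}")


def produced_nothing(output_tokens: int, tool_calls: int, answer: str) -> bool:
    """Guard for silent failures: no tokens, no tool calls, and no answer."""
    return output_tokens == 0 and tool_calls == 0 and not answer.strip()


print(matches("The label is POSITIVE (0.9999)", "positive"))  # True
print(produced_nothing(0, 0, ""))                             # True
```

Keeping the guard separate from the matcher means an empty run is reported as an error rather than as a plain miss.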

All of it lands in a report you can directly examine:

The live report: Overview, Coverage, and Results, all client-side.

And because it captures the native agent trace of every run, numbers are just the beginning: you can read exactly what the agent did, command by command. The traces are shareable through the Hub’s agent-traces viewer:

A run rendered in the Hub’s agent-traces viewer: MiniMax-M2.7 on the answer-question task.

Before the results, a quick recap of the setup. Each run varies four things: the **model** driving the agent, the `transformers` **revision** it runs against, the **task**, and the **tier** (bare/clone/skill). As discussed, we look at different metrics for the two different model categories.
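With four dimensions per run, most charts in a report like this come down to grouping runs by a couple of those dimensions and taking a median. A minimal sketch follows, assuming each run is a flat dict; the record shape, the `median_by` helper, the revision labels and the timings are all made up for illustration and are not measurements from the post.

```python
from collections import defaultdict
from statistics import median


def median_by(runs, keys, metric):
    """Group runs by the given keys and return the median of `metric` per group."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in keys)].append(run[metric])
    return {group: median(values) for group, values in groups.items()}


# Made-up records: hypothetical revisions and timings, for illustration only.
runs = [
    {"model": "m1", "revision": "rev-a", "task": "t1", "tier": "bare", "seconds": 30.0},
    {"model": "m1", "revision": "rev-a", "task": "t2", "tier": "bare", "seconds": 50.0},
    {"model": "m2", "revision": "rev-a", "task": "t1", "tier": "bare", "seconds": 40.0},
    {"model": "m1", "revision": "rev-a", "task": "t1", "tier": "skill", "seconds": 20.0},
    {"model": "m2", "revision": "rev-a", "task": "t1", "tier": "skill", "seconds": 10.0},
]

result = median_by(runs, ("revision", "tier"), "seconds")
print(result)  # {('rev-a', 'bare'): 40.0, ('rev-a', 'skill'): 15.0}
```

Swapping the `keys` tuple for `("model", "tier")` gives the "hold the revision, vary the model" view from the same records.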

Large open models: hold the model, vary the revision

Since a large open model will usually get to the correct result, what you’re really measuring is the effort it took to do so. Did it take ten turns or one? Did it follow an API path you deprecated because it trusted obsolete documentation? Did it hit an error you hadn’t foreseen?

The natural experiment is to fix one strong model and vary the tool’s revisions: the successive git versions of `transformers` we test against, from released tags like `v5.8.0` and `v5.9.0` to the specific commit that introduces the CLI and Skill. We want to watch whether the load it puts on the agent goes up or down. We used the harness on `transformers` to check whether adding a dedicated CLI and Skill actually lightened the agents’ work.

For the three large models we used in our tests, the median time spent across tasks indicates that the Skill commit reduces the time spent working on them:

Median time per revision, by tier: the skill commit (green dot) is the fastest.

On the other hand, in the experiments where we cloned the repository, token consumption increases significantly at the commit that introduced the CLI and examples:

Median new tokens per revision, by tier: the clone variant jumps once the CLI lands in the repo.

Reading the clone-variant traces explains why. The commit adds a command, but it also ships the CLI’s implementation and a set of `cli/agentic/*.py` usage examples into the repository directly.

On the `clone` variant the agent has a full transformers checkout in front of it, and roughly a third of the runs go read the new surface (the `/cli/` tree and the example scripts) to learn the interface before calling it. This raises the median input from ~4k to ~6.4k tokens.

The two charts are then two sides of one tradeoff: the commit saves the large models time (they reach for the CLI instead of debugging Python) at the cost of more tokens (they read the code that taught them the CLI). A tradeoff worth knowing about before merging PRs.

One caveat works in the CLI’s favor, though we haven’t benchmarked it yet. Our setup is built for one-off experiments: each run is a fresh agent that rediscovers the CLI from scratch, so it pays the discovery cost every time. In real usage, an agent learns the interface once and then solves task after task within the same session, amortizing that cost across many requests. The token bump we measure here is closer to a worst case than to what a user would see day to day.
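A back-of-the-envelope sketch of that amortization, under a strong simplifying assumption: we treat the gap between the two medians quoted above (~6.4k minus ~4k tokens) as a fixed, one-time discovery cost paid once per session. Medians are not additive in general and this scenario was not benchmarked, so the numbers below only illustrate the shape of the effect.

```python
def amortized_tokens(base: float, discovery: float, tasks_in_session: int) -> float:
    """Per-task input tokens when a one-time discovery cost is spread over a session."""
    if tasks_in_session < 1:
        raise ValueError("a session needs at least one task")
    return base + discovery / tasks_in_session


base = 4000.0       # ~4k median input tokens before the CLI commit (from the post)
discovery = 2400.0  # ~6.4k - ~4k, treated as a one-time cost (an assumption)

for n in (1, 5, 20):
    print(n, amortized_tokens(base, discovery, n))
# 1 6400.0
# 5 4480.0
# 20 4120.0
```

With a single task per session the full bump is paid every time, which is the situation the benchmark measures.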

Small models: hold the revision, vary the model

Open models give us fine-grained control over the variables that matter most here: size, configuration, quantization, provider, training, and anything that would differ from one model to the next. They’re also where a good tool surface matters most: a small model asked to “use `transformers` to do X” in a `bare` environment can guess an API that changed some releases ago, make unnecessary tool calls, and get the wrong answer.

So here the experiment is the opposite of the above: hold the revision and sweep the model. This helps see which models actually take care of the task, not just by token count and time, but down to which ones can’t reliably handle the tool calls. Our intuition is that the smaller the model, the harder both tool use and the task get; we ran the harness across a range of model sizes to test exactly that:

Match % across models, by tier: the skill tier lifts the larger models but drops the smaller ones.

Match % also seems to be correlated with the number of tokens ingested:

Median new tokens across models, by tier.

A note on fair comparison: naively averaging across tasks is misleading when coverage is uneven (a model that only finished the quick tasks looks fast). The report has a **“shared tasks only”** toggle (across models and/or revisions) so you compare like-for-like, and a **Coverage** heatmap so you can see exactly which task × revision × model cells actually ran.
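The like-for-like idea can be sketched as restricting the aggregate to tasks that every model has a result for. The helpers, model names and timings below are hypothetical and only illustrate the effect; the report's own toggle may be implemented differently.

```python
from statistics import median


def shared_tasks(runs):
    """Tasks for which every model in `runs` has at least one result."""
    tasks_by_model = {}
    for run in runs:
        tasks_by_model.setdefault(run["model"], set()).add(run["task"])
    return set.intersection(*tasks_by_model.values()) if tasks_by_model else set()


def median_seconds(runs, only_shared=False):
    """Median time per model, optionally restricted to the shared tasks."""
    keep = shared_tasks(runs) if only_shared else None
    by_model = {}
    for run in runs:
        if keep is None or run["task"] in keep:
            by_model.setdefault(run["model"], []).append(run["seconds"])
    return {model: median(values) for model, values in by_model.items()}


# Made-up data: model-b never finished the slow task.
runs = [
    {"model": "model-a", "task": "quick", "seconds": 10.0},
    {"model": "model-a", "task": "slow", "seconds": 100.0},
    {"model": "model-b", "task": "quick", "seconds": 12.0},
]

print(median_seconds(runs))                    # {'model-a': 55.0, 'model-b': 12.0}
print(median_seconds(runs, only_shared=True))  # {'model-a': 10.0, 'model-b': 12.0}
```

In the naive view `model-b` looks more than four times faster; on the shared task alone, `model-a` is slightly ahead.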

Tweaking the tool: markers and results

Two things come together here: how to look past whether the agent succeeded to what it did and how it did it, and the first results we pulled out of the harness.

What’s a marker?

Match %, tokens, and time tell you the cost of a run but don’t tell you much about what happened under the hood.

This is why we’ve introduced the concept of markers. A marker is a named pattern the profile (the small per-tool plugin that teaches the harness how to build and drive a given library) matches against a run.

It is a one-line label for a behavior you care about, checked against the shell commands the agent ran, the code it wrote, the files it read, or its final answer. A run can fire several markers or none; the report shows how often each one fired, per model and per revision.

For `transformers` we declare a handful, but we’ll only look at the two most relevant ones:

  • cli: the agent invoked the `transformers` command-line tool (e.g. `transformers classify …`) instead of writing Python.
  • pipeline: it reached for the high-level `pipeline(...)` Python API.

These are what we watch to see whether a change actually shifted the agent’s behavior. Interestingly, the larger the model, the more it leans on the new context rather than on its memory, and so the more it uses the newly introduced CLI.

CLI adoption by tier across models: only the skill tier reaches for it, and more so as models grow.

CLI adoption is new: the CLI lands in a single commit, isn’t in any model’s training data, and is only lightly documented. The effect is clear: it’s the Skill variant, the one that ships the CLI’s documentation, that actually reaches for it, at 55.3%.
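As a rough illustration of how markers and an adoption rate could be computed, here is a minimal sketch. The regular expressions are our own guesses at what a `cli` and a `pipeline` marker might look for, and treating a run as one blob of text is a simplification; the profile's real patterns and trace handling may differ.

```python
import re
from collections import Counter

# Hypothetical patterns, not the ones declared in the transformers profile.
MARKERS = {
    "cli": re.compile(r"^\s*transformers\s+\w+", re.MULTILINE),  # command at line start
    "pipeline": re.compile(r"\bpipeline\("),                     # high-level Python API
}


def fired_markers(run_text: str) -> set[str]:
    """Names of all markers whose pattern appears in the run's commands or code."""
    return {name for name, pattern in MARKERS.items() if pattern.search(run_text)}


def adoption(runs_text: list[str]) -> dict[str, float]:
    """Percentage of runs in which each marker fired."""
    if not runs_text:
        return {name: 0.0 for name in MARKERS}
    counts = Counter(name for text in runs_text for name in fired_markers(text))
    return {name: 100 * counts[name] / len(runs_text) for name in MARKERS}


runs_text = [
    "transformers classify --model m --text 'hi'",
    "transformers classify \\\n  --model m --text 'hi'",
    "python -c \"from transformers import pipeline; pipeline('sentiment-analysis')('hi')\"",
    "ls -la",
]
print(adoption(runs_text))  # {'cli': 50.0, 'pipeline': 25.0}
```

A run can fire several markers or none, which is why the percentages do not need to sum to 100.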

Is the CLI + Skill commit helping?

Comparing the commit across model sizes, the CLI + Skill helps the bigger models: on the `skill` tier, Kimi and the other large agents reach for the CLI and finish in fewer turns. (On `clone` they spend more input tokens first, reading the new CLI code, as we saw above, so the win shows up in time and turns, not raw tokens.)

Kimi-K2.6, GLM-5.1, and MiniMax-M2.7 across revisions

But in some smaller-model settings, it appears to hurt performance. One plausible explanation is that small models lean on memorized API patterns, reproducing `pipeline(...)` snippets they’ve seen in their training data. The new concepts are then a larger surface for them to get wrong. You can watch this directly on the harness: lower match %, more retries, the `cli` marker barely firing. It is particularly striking on the Qwen3-4B model: the Skill barely changes its match rate yet its cost distribution is significantly affected.

Almost all of that comes from the `clone` tier. The checkout now contains the CLI’s implementation and `cli/agentic/*.py` examples, and the 4B agent reads them in bulk: its median new tokens jump from ~2.4k to ~23k, with time and output skyrocketing as well, for no gain in accuracy.

Qwen3-4B cost distributions across revisions (elapsed, new tokens, repeat tokens, out tokens). The CLI + Skill commit fans the cost distribution wide open: on the `clone` tier the agent reads the newly shipped CLI source in bulk (~10× the new tokens), for no gain in match %. (`repeat tokens` stays flat: this setup uses no prompt caching.)

Sometimes, though, the Skill breaks correctness outright. Reading the traces shows how, for example for Qwen3-14B: adding the Skill drops its overall match rate from 67% (bare) to 43%, and on the simplest tasks the collapse is very visible: `classify-sentiment` goes from 100% on the `clone` variant to **0%** with the Skill.

Qwen3-14B on `classify-sentiment`, by tier: `clone` (blue) holds at 100% across revisions, but the Skill variant (green) collapses to 0% at the CLI + Skill revision.

Looking at the traces, the model mistakes the CLI for a tool it can call directly (as in an agentic-harness tool, like web-search). The Skill is not an executable tool: it’s documentation loaded into the agent’s context, and the `transformers` CLI is only ever meant to be run from the shell (via `bash`); so this will not work.

Qwen3-14B reads the Skill and, in 39 of its 56 Skill runs, either emits a `transformers(command="classify", ...)` tool call (a tool that was never registered) or, finding nothing like it among its `read`/`bash`/`edit`/`write` tools, concludes it can’t run a model and gives up. Either way, rather than fall back to the one-line `pipeline(...)` that scored 100% on the `clone` checkout, it declares the task impossible.

Qwen3-14B on classify-sentiment (Skill variant): it reasons that read/bash/edit/write can’t run a model, and gives up.
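The first failure mode above, a call to a tool that was never registered, is the kind of thing a small check over a trace can surface. A minimal sketch follows, assuming each tool call is a dict with a `name` key; that record shape and the `SKILL.md` path are simplifications made for this example, not the pi trace format.

```python
# The four tools named in the post as available to the agent.
REGISTERED_TOOLS = {"read", "bash", "edit", "write"}


def unregistered_calls(tool_calls):
    """Return the names of tool calls that do not match any registered tool."""
    return [call["name"] for call in tool_calls if call["name"] not in REGISTERED_TOOLS]


# Hypothetical trace: the model reads the Skill, then tries to call the CLI as a tool.
trace = [
    {"name": "read", "arguments": {"path": "SKILL.md"}},
    {"name": "transformers", "arguments": {"command": "classify"}},
]

print(unregistered_calls(trace))  # ['transformers']
```

A check like this could be expressed as one more marker, so the share of runs that confuse the Skill with a callable tool shows up in the report next to CLI adoption.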

This is exactly what we built the harness to catch: the same change that speeds up the large models ends up breaking the small ones, which seemed counterintuitive to us at first, and which we’d likely have shipped as-is. The takeaway for maintainers: **agent-facing APIs should be evaluated across model sizes, because a new affordance can reduce work for strong models while adding ambiguity for smaller ones.** It also hints at a fix: rather than hand-write a Skill and check it after the fact, you could generate and validate one against the weaker models up front.

This is what Upskill does: it turns a strong model’s solution into a Skill only when it measurably helps the smaller ones.

Trying it yourself

The harness is one CLI, `agent-eval`. Install it, run a suite, fan it out across models × revisions on HF Jobs, and publish the report as a Hugging Face Space.

**Trusted local use only.** The harness runs a coding agent with bypassed permissions and executes code from whatever revision you point it at, and traces can contain prompts, output, and local paths. See SECURITY.md before pointing it at code you didn’t write or sharing results.

The full, kept-current setup and usage instructions live in the README.

Closing

Checking the final answer tells you whether an agent can use your library. It doesn’t tell you what it costs: the turns, tokens, errors, and the path it took to get there. This harness measures that, across the revisions and models you pick.

On `transformers`, it caught something we’d have shipped on faith: the CLI + Skill helps the largest open models and hurts the smallest ones. Worth knowing before merging!

It’s profile-based, and designed to be adaptable: point it at your own library, define a few tasks and their expected answers, get the same report. Code and tasks are in the repo, traces are on the Hub. Let us know if you use it for your project!

Acknowledgements

This harness stands entirely on pi, Mario Zechner’s coding-agent CLI: it drives every open-model run, and only needs an `HF_TOKEN` to serve a model, which is what made the open-model sweep practical at all.

Thanks to the model builders and inference providers behind the models we swept. Across the board they performed well above what the `bare` baseline would suggest.
