@tom_doerr: Replaces 90% of LLM classification calls with traditional ML https://github.com/adrida/tracer

Summary

TRACER is a tool that replaces up to 90% of LLM classification calls with lightweight traditional ML by learning from LLM traces, reducing cost while maintaining accuracy.

Cached at: 05/15/26, 02:56 AM

adrida/tracer

Source: https://github.com/adrida/tracer

TRACER

Trace-Based Adaptive Cost-Efficient Routing

Most LLM-based classification pipelines call a large language model for every single input. In practice, the vast majority of that traffic is predictable: on those inputs, a lightweight traditional ML model (logistic regression, gradient-boosted trees, or a small neural net) can match the LLM’s output with near-perfect agreement.

TRACER learns the decision boundary between “easy” and “hard” inputs directly from your LLM’s own classification traces. It fits a fast, non-LLM surrogate on the easy partition, gates it with a calibrated acceptor, and defers only the uncertain inputs back to the LLM. Every deferred call produces a new trace, which feeds the next refit - coverage grows automatically over time. The result: 90%+ of classification calls routed to traditional ML, with formal parity guarantees against the teacher LLM and a self-improving routing policy.

pip install tracer-llm

See it work

tracer demo
  TRACER  Demo - Banking77 (77 intents · 1,500 traces)

  Routing Policy
  method      l2d
  coverage    91.4%   of traffic handled by surrogate
  teacher TA  0.920   surrogate matches teacher on handled traffic

  Cost Projection (10k queries/day)
      Without TRACER   10,000 LLM calls/day   $20.00/day
      With TRACER         863 LLM calls/day   $ 1.73/day   $6,670 saved/yr
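
The cost projection in the demo output is simple arithmetic, and it can be reproduced with a short sketch. The function name `project_costs` is hypothetical, and the per-call price is not stated in the source; it is inferred here from the demo's own figures ($20.00 for 10,000 calls).

```python
def project_costs(queries_per_day: int, llm_calls_per_day: int, price_per_call: float) -> dict:
    """Compare daily LLM spend with and without routing, then annualise the gap."""
    baseline = queries_per_day * price_per_call   # every query goes to the LLM
    routed = llm_calls_per_day * price_per_call   # only deferred queries go to the LLM
    return {
        "baseline_per_day": baseline,
        "routed_per_day": routed,
        "saved_per_year": (baseline - routed) * 365,
    }


# Assumption: price per call implied by the demo output ($20.00 / 10,000 calls).
PRICE_PER_CALL = 20.00 / 10_000

costs = project_costs(queries_per_day=10_000, llm_calls_per_day=863, price_per_call=PRICE_PER_CALL)
print(f"without routing: ${costs['baseline_per_day']:.2f}/day")
print(f"with routing:    ${costs['routed_per_day']:.2f}/day")
print(f"saved per year:  ${costs['saved_per_year']:,.2f}")
```

With the demo's numbers this gives $1.73/day and about $6,670 saved per year, matching the output above. Swapping in your own per-call price and traffic volume gives a first-order estimate for your pipeline.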

Quickstart

Input: a JSONL file where each line contains the original text (input) and the label your LLM assigned (teacher).
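
As a minimal sketch of producing and checking such a file, the helpers below append one trace per line and validate that each line carries both fields. The helper names `append_trace` and `read_traces` are hypothetical and are not part of the TRACER API; only the `input` and `teacher` keys come from the format described above.

```python
import json
from pathlib import Path


def append_trace(path, text, label):
    """Append one LLM classification as a single JSON line."""
    record = {"input": text, "teacher": label}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_traces(path):
    """Load a traces file, failing loudly on lines that lack a required field."""
    traces = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"input", "teacher"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            traces.append(record)
    return traces
```

Validating the file before fitting is cheap and catches truncated writes or renamed fields early.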

import tracer

# 1. Fit - learn a routing policy from your LLM's classification traces
result = tracer.fit(
    "traces.jsonl",                  # {"input": "...", "teacher": "label"} per line
    embeddings=X,                    # np.ndarray (n, dim) - precomputed text embeddings
    config=tracer.FitConfig(target_teacher_agreement=0.95),
)

# 2. Route - surrogate handles easy inputs, LLM handles the rest
# `embedder` is a tracer Embedder (see "Embedder options" below)
router = tracer.load_router(".tracer", embedder=embedder)
out = router.predict("What is my balance?")
# {"label": "check_balance", "decision": "handled", "accept_score": 0.96}

# 3. Fallback - only invokes the LLM when the surrogate declines
text = "Some edge case"
out = router.predict(text, fallback=lambda: call_my_llm(text))  # call_my_llm: your own LLM wrapper

Want to go deeper? The concepts guide explains the full pipeline, model zoo, and parity gate. The API reference covers every parameter. The CLI reference covers tracer fit, tracer serve, and more.

Using from JavaScript / Node.js

TRACER works with JS pipelines without any Python in your application code. The pattern: log traces from JS → fit offline with the CLI → run tracer serve as a sidecar → call it via fetch.

const fs = require('fs')

// 1. Log every LLM classification (wherever you currently call the LLM)
fs.appendFileSync('traces.jsonl', JSON.stringify({ input: text, teacher: label }) + '\n')

// 2. At inference: embed → POST to TRACER → fallback to LLM only if deferred
// `let`, not `const`: label is reassigned below when the request is deferred
let { label, decision } = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  body: JSON.stringify({ embedding }),  // same model you used at fit time
}).then(r => r.json())

if (decision === 'deferred') label = await callYourLLM(text)

See the JavaScript integration guide for the full setup including embeddings, docker-compose, batch prediction, and continual learning.

How it works

User query → [Embedder] → [ML Surrogate] → [Acceptor Gate]
                                                |          |
                                            score >= t   score < t
                                                |          |
                                          Local answer   Defer to LLM
                                          (traditional ML)

The surrogate is not another LLM. It is a classical ML or shallow DL model (the model zoo includes logistic regression, SGD, LightGBM, random forests, and small feed-forward nets). This is what makes the cost reduction real: inference is CPU-bound, sub-millisecond, and effectively free compared with an LLM call.
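
The gate in the diagram can be sketched in a few lines. This is an illustration of the routing logic only, not TRACER's implementation: `route`, `Routed`, and the toy callables are hypothetical names, and the only thing taken from the source is the rule that the surrogate answers when the acceptor score is at least the threshold t and the LLM answers otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Routed:
    label: str
    decision: str        # "handled" (surrogate) or "deferred" (LLM)
    accept_score: float


def route(
    embedding: Sequence[float],
    surrogate: Callable[[Sequence[float]], str],
    acceptor: Callable[[Sequence[float]], float],
    threshold: float,
    call_llm: Callable[[], str],
) -> Routed:
    """Answer locally when the acceptor is confident enough, else defer to the LLM."""
    score = acceptor(embedding)
    if score >= threshold:
        return Routed(surrogate(embedding), "handled", score)
    # The LLM is only invoked on this branch, which is where the savings come from.
    return Routed(call_llm(), "deferred", score)


# Toy stand-ins for a trained surrogate and acceptor (assumptions for illustration).
result = route(
    embedding=[0.1, 0.2, 0.3],
    surrogate=lambda e: "check_balance",
    acceptor=lambda e: 0.96,
    threshold=0.90,
    call_llm=lambda: "label_from_llm",
)
print(result)
```

Passing the LLM call as a zero-argument callable mirrors the `fallback=lambda: ...` pattern in the Quickstart: nothing is spent unless the gate declines.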

  1. Fit - train a suite of candidate surrogates on your LLM’s classification traces; select the best via cross-validated teacher agreement
  2. Gate - attach a learned acceptor that estimates, per-input, whether the surrogate will agree with the teacher
  3. Calibrate - sweep the acceptor threshold to maximise coverage at your target parity (e.g. ≥ 95% teacher agreement)
  4. Guard - block deployment if the best candidate cannot clear the parity bar on held-out data
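
The Calibrate and Guard steps above can be sketched as a threshold sweep over held-out examples. This is a simplified illustration under stated assumptions, not TRACER's actual procedure: `calibrate_threshold` is a hypothetical name, `scores` are acceptor scores, and `agrees` marks whether the surrogate matched the teacher label on each held-out input.

```python
def calibrate_threshold(scores, agrees, target):
    """Pick the acceptance threshold that maximises coverage at a target agreement.

    Returns (threshold, coverage, agreement) for the rule `score >= threshold`,
    or None when no threshold reaches the target (the "guard" case).
    """
    ranked = sorted(zip(scores, agrees), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    best = None
    accepted = 0
    agreed = 0
    for i, (score, agree) in enumerate(ranked):
        accepted += 1
        agreed += int(agree)
        # Only cut between distinct scores so tied inputs are accepted together.
        if i + 1 < n and ranked[i + 1][0] == score:
            continue
        agreement = agreed / accepted
        if agreement >= target:
            # Later candidates accept more inputs, so the last hit has the highest coverage.
            best = (score, accepted / n, agreement)
    return best


# Made-up held-out data, for illustration only.
scores = [0.99, 0.97, 0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
agrees = [True, True, True, True, False, True, False, False]

print(calibrate_threshold(scores, agrees, target=0.95))
print(calibrate_threshold(scores, agrees, target=0.80))
```

A stricter target lowers coverage because fewer inputs can be accepted before a disagreement drags the agreement rate below the bar. A `None` result corresponds to the Guard step: no threshold clears the parity bar, so nothing should be deployed.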

Benchmark results (Banking77 - 77-class intent classification)

| Metric | Value |
| --- | --- |
| Coverage | 92.2% of traffic handled locally |
| Teacher agreement (handled) | 96.1% |
| End-to-end accuracy | 96.4% |
| Annual savings (10k queries/day) | $302,850 |

Continual learning flywheel

TRACER is not a one-shot fit. Every deferred input that reaches the LLM produces a new labeled trace, which feeds back into the next refit. As the surrogate sees more of the input distribution, its coverage grows, which means fewer LLM calls and lower cost, while the quality guarantee holds at every iteration.

Day 1:  2,000 traces → 84% coverage → 8,400 calls/day saved
Day 3:  6,000 traces → 90% coverage → 9,000 calls/day saved
Day 5: 10,000 traces → 92% coverage → 9,200 calls/day saved

tracer.update("new_traces.jsonl", embeddings=X_new)  # refit with new production traces

The parity gate re-calibrates on each update, so coverage only increases when the surrogate actually earns it.
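
One way to feed the flywheel is to log a trace at the moment a request is deferred. The sketch below is an assumption about how that could be wired into application code, not part of the TRACER API: `classify_and_log` is a hypothetical helper, and `try_surrogate` is a stand-in that returns `None` when the gate declines.

```python
import json


def classify_and_log(text, try_surrogate, call_llm, trace_path):
    """Return (label, decision); on deferral, call the LLM and append a fresh trace."""
    label = try_surrogate(text)  # hypothetical: returns None when the gate declines
    if label is not None:
        return label, "handled"

    label = call_llm(text)
    # Same {"input", "teacher"} shape as the original traces file.
    with open(trace_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"input": text, "teacher": label}) + "\n")
    return label, "deferred"
```

Only deferred inputs are logged here, since handled inputs never reach the LLM and so have no teacher label. The resulting file has the same shape as the one passed to the refit call shown earlier in this section; embeddings for the new inputs still need to be computed with the same model used at fit time.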

Embedder options

from tracer import Embedder

embedder = Embedder.from_sentence_transformers("BAAI/bge-small-en-v1.5")  # local
embedder = Embedder.from_endpoint("https://api.example.com/embed", headers={...})  # API
embedder = Embedder.from_callable(my_fn)  # any function
# or skip the embedder and pass raw np.ndarray embeddings directly
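
Whichever option you choose, the embedding function has to be deterministic and identical at fit time and at inference time, or the surrogate will see vectors it was never trained on. The toy function below illustrates that property with a hashed bag-of-words vector. It is a stand-in for experimentation only: `hashed_bow_embed` is a hypothetical name, and the batch-in, vectors-out signature is an assumption, since the signature expected by `Embedder.from_callable` is not specified here.

```python
import hashlib
import math


def hashed_bow_embed(texts, dim=64):
    """Toy embedder: one L2-normalised hashed bag-of-words vector per input text."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # A stable hash (unlike built-in hash()) gives the same bucket across runs.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors


vectors = hashed_bow_embed(["What is my balance?", "what is my balance?"])
print(len(vectors), len(vectors[0]))
```

A real deployment would use a sentence-embedding model as in the options above; the point of the sketch is that identical text must always map to an identical vector.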

Need to compute embeddings at fit time?

pip install tracer-llm[embeddings]   # adds sentence-transformers

X = tracer.embed(texts)  # default: all-MiniLM-L6-v2 (384-dim)

CLI

| Command | What it does |
| --- | --- |
| tracer demo | Zero-setup demo on real data |
| tracer fit traces.jsonl --target 0.95 | Fit a routing policy |
| tracer update new_traces.jsonl | Refit with new traces |
| tracer report-html | Open the HTML audit report |
| tracer serve .tracer --port 8000 | HTTP prediction server |

What’s in .tracer/

| File | Contents |
| --- | --- |
| manifest.json | Method, coverage, teacher agreement, label space |
| pipeline.joblib | Surrogate + acceptor + calibrated thresholds |
| frontier.json | All candidates at each quality target |
| qualitative_report.json | Per-label slices, boundary pairs, examples |
| report.html | Visual audit report |

Install

pip install tracer-llm                # core (numpy + sklearn + joblib)
pip install tracer-llm[embeddings]    # + sentence-transformers
pip install tracer-llm[all]           # everything

Docs

Concepts - Pipeline internals, model zoo, parity gate
API reference - Every function, parameter, and return type
CLI reference - tracer fit, tracer serve, tracer demo, and more
JavaScript / Node.js - Full integration guide for JS pipelines
Artifacts - .tracer/ directory schema
AGENTS.md - Integration guide for AI coding assistants

Paper

TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification
Adam Rida — arXiv 2026

@article{rida2026tracer,
  title   = {TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification},
  author  = {Rida, Adam},
  journal = {arXiv preprint arXiv:2604.14531},
  year    = {2026}
}

License

MIT
