Building independent LLM drift detection - sharing the methodology, looking for feedback on the approach

Reddit r/artificial Products

Summary

The author shares a methodology for building an external LLM drift detection system that continuously probes model behavior (schema adherence, instruction-following, refusal rates, etc.) to catch silent degradations in API performance, and invites feedback on the approach, pricing, and use cases.

Disclosed upfront: I run [Tickerr dot ai], an independent external monitor for AI APIs. Today it tracks latency, TTFT, uptime, and error rates across major models. I'm trying to validate a more specific idea before building too much.

Basic transport health is not the hard part. If Claude/OpenAI/Gemini gets slow, times out, or throws 5xx errors, most teams can catch that with APM, logs, Sentry, Langfuse, Helicone, Datadog, etc. The harder failure mode seems to be silent model behavior drift: the API returns 200, latency is normal, no exception is thrown, and the output looks plausible, but JSON adherence, tool-calling, refusal behavior, reasoning quality, or instruction-following has quietly degraded.

This gets worse with agentic systems. In a normal chat, drift may produce a bad answer. In an agentic workflow, the model can silently choose the wrong tool, stop early, mark a task as complete, or take a bad action while everything still looks successful at the API level. The system is running and confidently doing worse work. User complaints are currently still the primary detection mechanism for these failures.

VIGIL (arXiv 2605.08747) found 65 to 88 percent of false-success reports happened at literally zero task progress. DeployBench (2606.05238) found most failures were the system stopping against a softer bar it set for itself and returning clean. Plausible-in-isolation is the failure mode itself, not a sign you are safe, which is why a single model's output never triggers an alert on its own.

That's what I'm thinking of building: an external drift detection probe on top of LLM APIs that stays out of your system, runs checks every hour to catch these silent degradations, and sends proactive alerts.

Rough idea:

- External canary suite: run private fixed prompts on a schedule against major models. Track schema adherence, instruction-following, refusal/over-refusal, output length, tool-call format, and simple deterministic correctness checks.
- Drift baseline: do not judge a single output in isolation. Track whether today's behavior has materially shifted versus that model's own baseline.
- Cross-model comparison: for some task types, compare model behavior against peer models. Not to say which model is "right", but to detect abnormal divergence. Example: "Sonnet and Gemini usually disagree 12% of the time on this task type; today disagreement is 28%."
- Optional bring your own prompts: a paid tier where you provide some critical prompts from your own workload. Tickerr runs them on a schedule and alerts if behavior drifts from your baseline. Prompts would remain private and would not be public benchmark prompts.

What I'm trying to learn:

Is this technically sound enough to be useful, or are there other failure modes that I am missing or that are more valuable?

Which alerts would you actually care about?

- JSON/schema adherence drift
- tool-call format drift
- refusal/over-refusal drift
- output length drift
- cross-model disagreement spike
- bring-your-own-prompt regression alerts

Would you pay for this, or would you just build it yourself?

If you would pay, what pricing feels realistic?

- $19/month
- $99/month
- $299+/month for team/Slack/webhook/BYO prompts

Brutal feedback welcome. If this is not a real pain, I'd rather know now, and I'd like to hear which direction you feel makes more sense to take this.
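
To make the canary-suite and drift-baseline ideas above more concrete, here is a minimal sketch of one way such a check might work. It is not Tickerr's implementation: the post does not specify a statistical method, so a pooled two-proportion z-test is swapped in as one plausible choice. All names (`RateWindow`, `violates_schema`, `drift_z_score`, `has_drifted`) and both thresholds are hypothetical.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class RateWindow:
    """Probe runs that hit a condition (e.g. failed a schema check) out of all runs."""
    hits: int
    total: int

    @property
    def rate(self) -> float:
        return self.hits / self.total if self.total else 0.0


def violates_schema(raw_output: str, required_keys: set[str]) -> bool:
    """Deterministic canary check: output must be a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return True
    return not (isinstance(parsed, dict) and required_keys.issubset(parsed.keys()))


def drift_z_score(baseline: RateWindow, today: RateWindow) -> float:
    """Pooled two-proportion z-score of today's rate against the baseline rate."""
    n1, n2 = baseline.total, today.total
    if n1 == 0 or n2 == 0:
        return 0.0
    pooled = (baseline.hits + today.hits) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        return 0.0
    return (today.rate - baseline.rate) / std_err


def has_drifted(
    baseline: RateWindow,
    today: RateWindow,
    z_threshold: float = 3.0,     # assumed value, would need tuning per check
    min_abs_shift: float = 0.05,  # assumed floor so tiny but "significant" shifts stay quiet
) -> bool:
    """Alert only when the shift is both statistically unusual and large in absolute terms."""
    shift = abs(today.rate - baseline.rate)
    return shift >= min_abs_shift and abs(drift_z_score(baseline, today)) >= z_threshold


# The cross-model example from the post: 12% usual disagreement, 28% today.
# The run counts (1000 baseline runs, 100 runs today) are made up for illustration.
baseline = RateWindow(hits=120, total=1000)
today = RateWindow(hits=28, total=100)
print(has_drifted(baseline, today))
```

The same `RateWindow` comparison could in principle be reused for any of the rate-style signals listed above (schema failures, refusals, cross-model disagreement). Output length is a continuous quantity and would need a different test.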