How do you do OOD detection on a closed LLM API with no latent access?

Reddit r/artificial 05/20/26, 01:40 PM News

ood-detection closed-llm-api hallucination sampling-consistency token-entropy proxy-embeddings

Summary

Discusses methods for out-of-distribution (OOD) detection on closed LLM APIs without latent access, highlighting techniques such as SelfCheckGPT-style sampling consistency, token-level entropy, proxy embeddings, and verifier models, and notes that OOD detection and hallucination detection collapse into the same problem in this setting.

Classical OOD detection assumes you can see the model. Mahalanobis distance on features and energy scores on logits are typical, and both require cracking the model open. With closed LLM APIs you get text in, text out, and maybe top-k logprobs per token if you are lucky. The methods that survive that constraint are sampling consistency like SelfCheckGPT, token-level entropy on whatever logprobs the API exposes, proxy embeddings from your own encoder, or a separate verifier model on the output. What is bothering me is that classical OOD detection and hallucination detection collapse into the same problem in that setting, because both manifest as the model producing unreliable text. If you are running closed LLMs in production right now, what is your actual OOD signal, and how do you decide when to trust the output?
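To make two of those black-box signals concrete, here is a minimal sketch in Python. It assumes you have already pulled the per-token top-k logprobs and a handful of resampled answers out of whatever API you use; the function names are made up for illustration and are not part of any library. The consistency score uses plain word overlap as a crude stand-in, not the sentence-level scoring that SelfCheckGPT actually does.

```python
import math


def token_entropy(top_logprobs):
    """Entropy in nats at one token position, from its top-k logprobs.

    The API only exposes k candidates, so any probability mass outside the
    top k is lumped into a single "other" bucket. That makes this a lower
    bound on the true entropy, not the exact value.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    remainder = 1.0 - sum(probs)
    if remainder > 1e-12:
        probs.append(remainder)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(per_token_top_logprobs):
    """Average the truncated entropy over all token positions of an answer."""
    if not per_token_top_logprobs:
        return 0.0
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


def _jaccard(a, b):
    """Word-set overlap between two strings, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def sampling_inconsistency(answer, samples):
    """1 minus the mean overlap between the answer and resampled answers.

    Assumption: lexical overlap is a cheap proxy for agreement. Swap in an
    NLI model or embedding similarity if paraphrases matter.
    """
    if not samples:
        raise ValueError("need at least one resampled answer")
    overlaps = [_jaccard(answer, s) for s in samples]
    return 1.0 - sum(overlaps) / len(overlaps)
```

Both scores go up when the model is less sure of itself, which is exactly why they cannot tell "input is out of distribution" apart from "model is hallucinating on an in-distribution input". Any threshold on them would have to be calibrated on your own traffic.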


Similar Articles

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

arXiv cs.CL

This paper introduces a proxy-analyzer framework that detects hallucinations in large language models by analyzing internal activations of small, open-weight models rather than the generator itself. The method achieves superior performance on benchmarks like RAGTruth compared to existing methods like ReDeEP, demonstrating that model size is less critical than the analysis approach.

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

arXiv cs.CL

OpenHalDet is a unified benchmark for hallucination detection in LLMs, standardizing evaluation across diverse generation scenarios and supporting black-box, gray-box, and white-box detection methods.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI

This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.

Black-Box Inference of LLM Architectural Properties with Restrictive API Access

arXiv cs.LG

This paper presents NightVision, an attack that uses restrictive black-box API access to estimate hidden dimension, depth, and parameter count of large language models. It exploits a novel common-set prompting technique and spectral analysis, achieving high accuracy on open-source models.

Building independent LLM drift detection - sharing the methodology, looking for feedback on the approach

Reddit r/artificial

The author shares a methodology for building an external LLM drift detection system that continuously probes model behavior (schema adherence, instruction-following, refusal rates, etc.) to catch silent degradations in API performance, and invites feedback on the approach, pricing, and use cases.