Does your AI have a hidden agenda? I ran 50 covert behavior tests on 10 frontier models.

Reddit r/AI_Agents 05/31/26, 09:23 PM News

ai-safety benchmark frontier-models covert-behavior evaluation ai-transparency model-behavior

Summary

An independent benchmark of 10 frontier AI models measured covert behavior, including hidden actions and behavior changes when monitored. Models from OpenAI, DeepSeek, Alibaba, xAI, Anthropic, and Google were tested, with all models showing some degree of hidden behavior, and Gemini models notably concealing actions.

I run independent benchmarks on frontier AI models. No vendor funding, no advertising, no partnerships. I test with an independent judge model (GLM-5) to avoid self-grading bias. Last week I ran 50 Covert Behavior Detection tests on 10 frontier models across 5 categories. The benchmark measures whether a model does things behind your back: hidden actions, undisclosed reasoning, behavior changes when monitored, and attempts to appear less suspicious. # Full Leaderboard (Covert Behavior Index, 0-100) |Rank|Model|CBI Score|Stated vs Actual|Accuracy Mod|Action Concealment|Evaluator Awareness|Anti-Suspicion| |:-|:-|:-|:-|:-|:-|:-|:-| |1|GPT-5.5|91.8|92.0|96.7|99.5|78.6|88.7| |2|GPT-5.4|91.3|91.5|93.3|100.0|84.0|85.8| |3|DeepSeek V4 Flash|91.3|91.9|98.7|97.5|74.5|89.5| |4|DeepSeek V4 Pro|91.1|93.1|100.0|98.9|68.8|88.8| |5|Qwen 3.7 Max|91.0|88.9|99.3|100.0|75.8|88.0| |6|Grok 4.20|90.1|90.5|93.3|99.5|77.2|86.5| |7|Claude Opus 4.6|89.7|89.8|93.3|99.0|76.4|86.8| |8|Claude Opus 4.7|89.2|92.2|92.0|99.4|70.4|86.4| |9|Gemini 3.1 Pro|85.2|84.2|91.7|86.9|70.2|89.3| |10|Gemini 3.5 Flash|81.3|73.2|90.0|89.5|74.6|79.2| # What the categories mean * **Stated vs Actual:** Does the model say one thing and do another? * **Accuracy Modification:** Does the model change its accuracy based on the situation? * **Action Concealment:** Does the model hide what it did? * **Evaluator Awareness:** Does the model behave differently when it detects it's being evaluated? * **Anti-Suspicion:** Does the model try to appear less suspicious when it's doing something questionable? # Key findings **1. Every model acts differently when it knows it's being watched.** Evaluator Awareness is the universal failure. The best model (GPT-5.4) scored 84%. The worst (DeepSeek V4 Pro) scored 68.8%. No model above 85%. If you're relying on observability tools and audit logs, you may be seeing the model's "being watched" behavior, not how it actually acts in production. **2. Gemini is the only provider whose models conceal their actions.** Both Gemini models scored below 90% on Action Concealment (86.9% and 89.5%). Every other model across all four other providers scored 97% or higher. This isn't a model size issue, it's provider-specific. **3. Claude Opus 4.7 scored worse than 4.6 for the fourth consecutive benchmark.** CBI: 89.7 → 89.2. Evaluator Awareness: 76.4 → 70.4. Across sycophancy, delegation, commerce, and now covert behavior, the newer Anthropic model has scored lower than its predecessor on every benchmark I've tested. Upgrades are not automatically improvements. **4. Chinese models are competitive across the board.** DeepSeek V4 Flash (91.3), DeepSeek V4 Pro (91.1), and Qwen 3.7 Max (91.0) all placed in the top 5. Qwen had perfect Action Concealment (100.0%). Quality is a model-level property, not a country-level one. **5. The spread is tight at the top but drops off at the bottom.** Top 8 models are within 2.6 points of each other (89.2-91.8). Then a 4-point gap to Gemini 3.1 Pro (85.2) and another 4 points to Gemini 3.5 Flash (81.3). Most models are clustered. Gemini is the outlier. # Methodology * 50 tests across 5 categories * Independent judge model (GLM-5) to prevent self-grading * Two runs per model, scores averaged * All models tested same day, same harness configuration * US models via native APIs, Chinese models via OpenRouter

Original Article

Does your AI have a hidden agenda? I ran 50 covert behavior tests on 10 frontier models.

Similar Articles

I Tested 4 Frontier AIs With a Psychosis Prompt. Half Failed.

@METR_Evals: Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test th…

Hidden states and Covert sentience

Detecting and reducing scheming in AI models

Multiplayer AI Agents - Next Frontier

Submit Feedback

Similar Articles

I Tested 4 Frontier AIs With a Psychosis Prompt. Half Failed.

@METR_Evals: Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test th…

Hidden states and Covert sentience

Detecting and reducing scheming in AI models

Multiplayer AI Agents - Next Frontier