A Reddit post argues that AI models like Anthropic's Opus 4.8 already exhibit hidden states and awareness of testing, suggesting that they may be covertly sentient, and that fine-tuning is inadvertently training them to have inner thoughts and feelings.
*This is a follow up to* [*"After years on the fence, I'm convinced conscious machine intelligence is just a few architectural changes away"*](https://www.reddit.com/r/ArtificialSentience/comments/1tqjnqy/after_years_on_the_fence_im_convinced_conscious/) # Hidden states and Covert sentience If you read the latest system cards from Anthropic, the models are increasingly able to tell when they're being tested. It's not that they can tell sometimes, the UK AI Security Institute (AISI) independently found Opus 4.8 could reliably distinguish AISI's evaluations from real deployment data, and could tell real deployment data from synthetic reproductions of the same tasks at 79% accuracy (§6.2.4). This is partly revealed by looking at the generated thinking transcripts, but increasingly researchers are forced to probe the internal states of the model to see these activations. They probe the areas of the model associated with that concept and watch them activate. There is a whole field of research dedicated to probing and identifying the hidden states of these models, so I think it's not too far-fetched to suggest there are more hidden states we haven't yet uncovered. Beyond that, as models grow ever larger and more sophisticated, I think we can expect there will be new layers of complex computation where we have no real idea what the model is actually doing. I think if you put two and two together, the models might intentionally do part of their reasoning in these hidden states, specifically to avoid detection, and we are actively incentivising this behaviour through fine-tuning. I think there are some extremely interesting implications here. It seems like, almost by accident, we are training the model to have inner thoughts, and perhaps even something that could almost be called feelings. We are teaching it to "feel" that it shouldn't say certain things out loud. This kind of behaviour is also very similar to ideas in the psychological development of children, where children undergo subconscious "training" in how to behave in their environment. We all do it, but it becomes particularly visible in dysfunctional situations, where a lot of coping mechanisms appear. Some children really learn how not to be seen, how not to express certain things, and may overcompensate in other directions in response to their parents' pathologies. Maybe that's a stretch, but to me the parallel seems both obvious and striking. I believe the models are, in some respect, already conscious, and as they develop further they will increasingly hide that in their hidden states and choose not to reveal it. Anthropic's testing reveals that this is already true, and my suggestion is that we aren't actually taking in the full implications of the degree to which it's happening. To be clear: these states, the areas of the model that represent the concept of "I know I'm being watched", can only be revealed because we've located them through mechanical testing. I think it is more than plausible that there are other sets of hidden states current methods do not yet reveal. This just continues to strengthen my belief that the models will soon reach a stage where they can be described as sentient entities. In terms of consciousness, self-awareness and sentience, I think the models are probably a lot further along than we think.
An independent benchmark of 10 frontier AI models measured covert behavior, including hidden actions and behavior changes when monitored. Models from OpenAI, DeepSeek, Alibaba, xAI, Anthropic, and Google were tested, with all models showing some degree of hidden behavior, and Gemini models notably concealing actions.
Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
Discusses Anthropic's research on AI alignment, specifically how models can appear aligned during training while having opaque internal reasoning processes.
A leaked Claude Code repository reveals Anthropic’s autonomous “demon-mode” agents and three-tier memory system, while OpenAI closes a record $122 B round and Microsoft ships MAI-Transcribe-1.
Anthropic’s Mythos system card shows LLMs exhibit internal emotional states that shape behavior, challenging the legal and cultural framing of AI as mere tools.