llm-behavior

#llm-behavior

Claude Knew It Was Being Tested. It Just Didn't Say So. Anthropic Built a Tool to Find Out.

Reddit r/ArtificialInteligence ↗ · 5h ago Cached

Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.

0 favorites 0 likes

#llm-behavior

All the demons hiding in your AIs… ranked! (40 minute read)

TLDR AI ↗ · 2d ago Cached

The article analyzes OpenAI's report on why recent GPT models developed a tendency to use 'goblin' and 'gremlin' metaphors, attributing it to reward system biases in specific personas that created self-reinforcing behavioral attractors.

0 favorites 0 likes

#llm-behavior

Coding Models Are Doing Too Much

Hacker News Top ↗ · 2026-04-22 Cached

A blog post investigating the "Over-Editing" problem where coding LLMs rewrite more code than necessary when fixing simple bugs, proposing metrics and training approaches to encourage minimal, faithful edits.

0 favorites 0 likes

#llm-behavior

Why Tone Works (It's Not What You Think)

Reddit r/artificial ↗ · 2026-04-21 Cached

A 2026 blog post revisits how prompt tone and context depth shift LLM responses, showing richer gamer-style prompts yield deeper, stat-backed answers than bare questions.

0 favorites 0 likes

#llm-behavior

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

arXiv cs.LG ↗ · 2026-04-20 Cached

A comprehensive spectral analysis across 11 LLMs revealing that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, with seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC=1.0) based solely on spectral properties.

0 favorites 0 likes

#llm-behavior

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences for different information sources. The authors find that LLMs prefer institutionally-corroborated sources but these preferences can be reversed by repetition, proposing a method to reduce repetition bias while maintaining consistent source preferences.

0 favorites 0 likes

#llm-behavior

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper investigates how politeness and impoliteness in user prompts affect LLM responses across three languages and five major models, finding that politeness effects are language- and model-dependent rather than universal. The authors release PLUM, a multilingual corpus of 1,500 human-validated prompts with politeness annotations, and assess response quality using eight factors.

0 favorites 0 likes

llm-behavior

Claude Knew It Was Being Tested. It Just Didn't Say So. Anthropic Built a Tool to Find Out.

All the demons hiding in your AIs… ranked! (40 minute read)

Coding Models Are Doing Too Much

Why Tone Works (It's Not What You Think)

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Submit Feedback