Tag
Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
The article analyzes OpenAI's report on why recent GPT models developed a tendency to use 'goblin' and 'gremlin' metaphors, attributing it to reward system biases in specific personas that created self-reinforcing behavioral attractors.
A blog post investigating the "Over-Editing" problem where coding LLMs rewrite more code than necessary when fixing simple bugs, proposing metrics and training approaches to encourage minimal, faithful edits.
A 2026 blog post revisits how prompt tone and context depth shift LLM responses, showing richer gamer-style prompts yield deeper, stat-backed answers than bare questions.
A comprehensive spectral analysis across 11 LLMs revealing that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, with seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC=1.0) based solely on spectral properties.
This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences for different information sources. The authors find that LLMs prefer institutionally-corroborated sources but these preferences can be reversed by repetition, proposing a method to reduce repetition bias while maintaining consistent source preferences.
This paper investigates how politeness and impoliteness in user prompts affect LLM responses across three languages and five major models, finding that politeness effects are language- and model-dependent rather than universal. The authors release PLUM, a multilingual corpus of 1,500 human-validated prompts with politeness annotations, and assess response quality using eight factors.