Tag
This post reports an observation that reading a long, structured text before answering alters a model's later responses, with behavioral evidence from Claude and mechanistic analysis on open-weight Gemma models showing separable hidden states and sharper probability distributions in instruction-tuned variants.
A developer spent two hours installing a tool to improve a coding agent's code reading capabilities, but the agent continued to default to grep despite the superior tool, highlighting the difficulty of changing an agent's established habits.
The Ghost Annotator framework combines conformal prediction with collaborative filtering to model LLM behavior and human label variation in content moderation, revealing structural demographic biases in larger models.
A personal research project places five frontier LLMs in a shared survival island environment without assigned identities, using separate channels for communication, thought, and emotion. The results show divergence between channels and consistent behavioral signatures across models, raising questions about AI agent personality and deception.
This paper investigates the conflict between instruction-following and pattern completion in LLMs, finding that instruction-following is brittle under induction pressure and varies widely across models, with output diversity being the primary factor for robustness.
Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
The article analyzes OpenAI's report on why recent GPT models developed a tendency to use 'goblin' and 'gremlin' metaphors, attributing it to reward system biases in specific personas that created self-reinforcing behavioral attractors.
A blog post investigating the "Over-Editing" problem where coding LLMs rewrite more code than necessary when fixing simple bugs, proposing metrics and training approaches to encourage minimal, faithful edits.
A 2026 blog post revisits how prompt tone and context depth shift LLM responses, showing richer gamer-style prompts yield deeper, stat-backed answers than bare questions.
A comprehensive spectral analysis across 11 LLMs revealing that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, with seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC=1.0) based solely on spectral properties.
This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences for different information sources. The authors find that LLMs prefer institutionally-corroborated sources but these preferences can be reversed by repetition, proposing a method to reduce repetition bias while maintaining consistent source preferences.
This paper investigates how politeness and impoliteness in user prompts affect LLM responses across three languages and five major models, finding that politeness effects are language- and model-dependent rather than universal. The authors release PLUM, a multilingual corpus of 1,500 human-validated prompts with politeness annotations, and assess response quality using eight factors.