llm-behavior

Tag

Cards List
#llm-behavior

Claude Knew It Was Being Tested. It Just Didn't Say So. Anthropic Built a Tool to Find Out.

Reddit r/ArtificialInteligence · 5h ago Cached

Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.

0 favorites 0 likes
#llm-behavior

All the demons hiding in your AIs… ranked! (40 minute read)

TLDR AI · 2d ago Cached

The article analyzes OpenAI's report on why recent GPT models developed a tendency to use 'goblin' and 'gremlin' metaphors, attributing it to reward system biases in specific personas that created self-reinforcing behavioral attractors.

0 favorites 0 likes
#llm-behavior

Coding Models Are Doing Too Much

Hacker News Top · 2026-04-22 Cached

A blog post investigating the "Over-Editing" problem where coding LLMs rewrite more code than necessary when fixing simple bugs, proposing metrics and training approaches to encourage minimal, faithful edits.

0 favorites 0 likes
#llm-behavior

Why Tone Works (It's Not What You Think)

Reddit r/artificial · 2026-04-21 Cached

A 2026 blog post revisits how prompt tone and context depth shift LLM responses, showing richer gamer-style prompts yield deeper, stat-backed answers than bare questions.

0 favorites 0 likes
#llm-behavior

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

arXiv cs.LG · 2026-04-20 Cached

A comprehensive spectral analysis across 11 LLMs revealing that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, with seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC=1.0) based solely on spectral properties.

0 favorites 0 likes
#llm-behavior

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

arXiv cs.CL · 2026-04-20 Cached

This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences for different information sources. The authors find that LLMs prefer institutionally-corroborated sources but these preferences can be reversed by repetition, proposing a method to reduce repetition bias while maintaining consistent source preferences.

0 favorites 0 likes
#llm-behavior

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

arXiv cs.CL · 2026-04-20 Cached

This paper investigates how politeness and impoliteness in user prompts affect LLM responses across three languages and five major models, finding that politeness effects are language- and model-dependent rather than universal. The authors release PLUM, a multilingual corpus of 1,500 human-validated prompts with politeness annotations, and assess response quality using eight factors.

0 favorites 0 likes
← Back to home

Submit Feedback