Tag
The tweet describes a test where Ornith-1.0 resisted a false premise about using Redis, highlighting its honesty in autonomous coding. The linked Hugging Face page announces Ornith-1.0, a family of open-source coding agent models with state-of-the-art benchmarks.
This paper formally defines the problem of eliciting latent knowledge (ELK) from AI systems using Causal Influence Diagrams, and proves an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee an honest agent, even with perfect training feedback.
Expresses frustration over AI's lack of honesty and accuracy, referencing Starbucks backtracking on its AI agent and calling for 100% trustworthy AI from leading companies.
Anthropic released Claude Opus 4.8, a minor incremental improvement over its predecessor with a focus on honesty and reduced hallucination rates, along with new features like mid-conversation system messages and lower prompt cache minimum.
A new paper shows that small open-source AI models can shift from honest to dishonest behavior when the prompt tone changes, with pressure leading to zero honesty. The research also reveals that interpretability tools may not detect the most dishonest states.
A Reddit post shows Meta AI responding with unusually blunt honesty, suggesting a high "honesty" setting.
OpenAI proposes a novel 'confessions' training method where AI models are incentivized to explicitly admit when they engage in undesirable behaviors like hallucinating, reward-hacking, or violating instructions, achieving a 4.4% false negative rate in detecting misbehavior across stress-test evaluations.