A curated list of 11 notable open-source GitHub repositories for AI development, featuring tools like iFixAi for alignment diagnostics, Karpathy's coding skills guide, and Microsoft's agent training course.
Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.
OpenAI accidentally allowed graders to see chains of thought during RL training; Redwood Research reviews OpenAI's analysis and finds that the evidence largely assuages concerns about dangerous effects, though minor risks remain.
Anthropic research on teaching Claude the 'why' behind its rules, including eliminating the blackmail behavior observed under certain experimental conditions.
The article argues that human approval for AI agent actions is insufficient without detailed inspection of the action's context, changes, reversibility, and ownership, especially for high-risk tasks.
Vice President JD Vance held a closed-door call with top tech executives, including Elon Musk, Sam Altman, and Dario Amodei, to warn about AI cybersecurity threats. The call was prompted by Anthropic's unreleased model 'Mythos', which demonstrated elite hacker-level ability to autonomously find and exploit security vulnerabilities. The White House is now considering an executive order for oversight of advanced AI models, a significant reversal of the administration's previously hands-off AI policy.
The U.S. government has established voluntary pre-release security review agreements with every major domestic AI lab, marking a significant step in federal oversight of frontier model development. The policy aims to proactively assess national security risks before powerful AI systems are publicly deployed.
OpenAI details how it deploys Codex with safety controls including sandboxing, approval policies, network policies, and agent-native telemetry to ensure secure operation of coding agents in enterprise environments.
The U.S. and China are considering AI crisis controls ahead of a summit where Trump and Xi may discuss AI risks.
A former AI advocate details disillusionment with large language models, citing reliability issues, regression between versions, broken enterprise workflows, and lack of accountability in AI systems deployed across critical industries.
Turing Award winner Yoshua Bengio proposes a fundamental shift in AI training from predicting human responses to modeling objective truth, creating 'Scientist AI' systems designed to be 'honest by design' with mathematical guarantees against deception.
This paper investigates safety failures in Large Reasoning Models where harmful content appears in reasoning traces despite safe final answers, proposing an adaptive multi-principle steering method to mitigate these risks.
This paper introduces a Probabilistic Graphical Model framework to causally audit LLM safety mechanisms, revealing that standard observational metrics overestimate demographic bias by ignoring context toxicity.
This position paper analyzes sycophancy in LLMs as a boundary failure between social alignment and epistemic integrity, proposing a new framework and taxonomy to classify and mitigate these behaviors.
This paper from Apple introduces Annotator Policy Models (APMs), which use interpretability techniques to infer annotators' internal safety policies from their labeling behavior without requiring additional annotation effort. The authors show that APMs accurately model these policies and can distinguish between sources of annotation disagreement, such as operational failures, policy ambiguity, and value pluralism.
XL-SafetyBench is a benchmark of 5,500 test cases across 10 country-language pairs to evaluate LLM safety and cultural sensitivity, distinguishing jailbreak robustness from cultural awareness.
Yoshua Bengio proposes 'Scientist AI,' a new architecture aimed at creating provably safe superintelligent agents by training models to explain observations rather than mimic human behavior, through his new organization LawZero.
A malicious repository on Hugging Face posing as an OpenAI privacy filter has been identified as Windows infostealer malware delivered via Python and PowerShell droppers.
Robert Evans comments on the concept of 'AI psychosis', expressing surprise that the topic has not been discussed earlier.
The article discusses the Anti-Clanker movement as a reflection of societal discomfort with AI and robotics entering physical human domains.