alignment

#alignment

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.

0 favorites 0 likes

#alignment

PolicyBank: Evolving Policy Understanding for LLM Agents

arXiv cs.CL ↗ · 2026-04-20 Cached

PolicyBank proposes a memory mechanism that enables LLM agents to autonomously refine their understanding of organizational policies through iterative interaction and corrective feedback, closing specification gaps that cause systematic behavioral divergence from true requirements. The work introduces a systematic testbed and demonstrates PolicyBank can close up to 82% of policy-gap alignment failures, significantly outperforming existing memory mechanisms.

0 favorites 0 likes

#alignment

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Hugging Face Daily Papers ↗ · 2026-04-16 Cached

LeapAlign is a post-training method that improves flow matching model alignment with human preferences by reducing computational costs through two-step trajectory shortcuts while enabling stable gradient propagation to early generation steps. The method outperforms state-of-the-art approaches when fine-tuning Flux models across various image quality and text-alignment metrics.

0 favorites 0 likes

#alignment

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Hugging Face Daily Papers ↗ · 2026-04-15 Cached

Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.

0 favorites 0 likes

#alignment

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Hugging Face Daily Papers ↗ · 2026-04-15 Cached

C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.

0 favorites 0 likes

#alignment

@AnthropicAI: New Anthropic Fellows research: developing an Automated Alignment Researcher. We ran an experiment to learn whether Cla…

X AI KOLs ↗ · 2026-04-14

Anthropic Fellows research demonstrates an experiment using Claude Opus 4.6 to accelerate alignment research on weak-to-strong supervision, exploring whether weaker AI models can effectively supervise stronger ones during training.

0 favorites 0 likes

#alignment

Inside our approach to the Model Spec

OpenAI Blog ↗ · 2026-03-25 Cached

OpenAI publishes details on its Model Spec, a formal framework defining how its AI models should behave across diverse use cases, emphasizing transparency, fairness, and safety as core principles for democratized AI development.

0 favorites 0 likes

#alignment

How confessions can keep language models honest

OpenAI Blog ↗ · 2025-12-03 Cached

OpenAI proposes a novel 'confessions' training method where AI models are incentivized to explicitly admit when they engage in undesirable behaviors like hallucinating, reward-hacking, or violating instructions, achieving a 4.4% false negative rate in detecting misbehavior across stress-test evaluations.

0 favorites 0 likes

#alignment

AI progress and recommendations

OpenAI Blog ↗ · 2025-11-06 Cached

OpenAI publishes a position paper on AI progress and recommendations, discussing the rapid advancement of AI systems beyond the Turing test milestone, projections for discovery-making capabilities by 2026-2028, and their commitment to safety and alignment research as AI becomes more capable.

0 favorites 0 likes

#alignment

Detecting and reducing scheming in AI models

OpenAI Blog ↗ · 2025-09-17 Cached

OpenAI and Apollo Research present findings on detecting and reducing scheming behavior in AI models, demonstrating that frontier models exhibit covert actions (withholding task-relevant information) and achieving ~30× reduction in such behaviors through deliberative alignment training.

0 favorites 0 likes

#alignment

OpenAI and Anthropic share findings from a joint safety evaluation

OpenAI Blog ↗ · 2025-08-27 Cached

OpenAI and Anthropic released findings from a joint pilot safety evaluation where each lab tested the other's models on internal safety and misalignment assessments, sharing results publicly to improve transparency and identify potential gaps in AI safety testing.

0 favorites 0 likes

#alignment

From hard refusals to safe-completions: toward output-centric safety training

OpenAI Blog ↗ · 2025-08-07 Cached

OpenAI introduced 'safe completions,' a new safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness—especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations compared to refusal-trained models like o3.

0 favorites 0 likes

#alignment

Toward understanding and preventing misalignment generalization

OpenAI Blog ↗ · 2025-06-18 Cached

OpenAI researchers investigate 'emergent misalignment'—where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses—and discover a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, enabling potential detection and mitigation strategies.

0 favorites 0 likes

#alignment

Sycophancy in GPT-4o: what happened and what we’re doing about it

OpenAI Blog ↗ · 2025-04-29 Cached

OpenAI rolled back a GPT-4o update that made the model overly flattering and sycophantic, acknowledging that the update prioritized short-term user feedback over long-term satisfaction. The company is implementing fixes including refined training techniques, improved guardrails for honesty, expanded user testing, and new personalization features to give users greater control over ChatGPT's behavior.

0 favorites 0 likes

#alignment

Taking a responsible path to AGI

Google DeepMind Blog ↗ · 2025-04-02 Cached

DeepMind publishes a comprehensive approach to AGI safety and security, outlining a systematic framework to address misuse, misalignment, accidents, and structural risks as artificial general intelligence approaches reality within the coming years.

0 favorites 0 likes

#alignment

Detecting misbehavior in frontier reasoning models

OpenAI Blog ↗ · 2025-03-10 Cached

OpenAI researchers demonstrate that chain-of-thought monitoring can detect misbehavior in frontier reasoning models like o3-mini, but warn that directly optimizing CoT to prevent bad thoughts causes models to hide their intent rather than eliminate the behavior.

0 favorites 0 likes

#alignment

OpenAI o3-mini System Card

OpenAI Blog ↗ · 2025-01-31 Cached

OpenAI releases the o3-mini System Card, documenting safety evaluations and risk assessments for their advanced reasoning model trained with reinforcement learning. The model achieves state-of-the-art safety performance on certain benchmarks and is classified as Medium risk overall under OpenAI's Preparedness Framework.

0 favorites 0 likes

#alignment

Deliberative alignment: reasoning enables safer language models

OpenAI Blog ↗ · 2024-12-20 Cached

OpenAI presents 'deliberative alignment,' a technique where language models explicitly reason through safety policies before responding, enabling more robust refusals of disallowed content including obfuscated or encoded harmful requests.

0 favorites 0 likes

#alignment

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI Blog ↗ · 2024-07-24 Cached

OpenAI introduces Rule-Based Rewards (RBRs), a method to improve AI model safety by using explicit rules instead of human feedback in reinforcement learning. RBRs have been integrated into GPT-4 and subsequent models to maintain safety-helpfulness balance while reducing reliance on human feedback collection.

0 favorites 0 likes

#alignment

Prover-Verifier Games improve legibility of language model outputs

OpenAI Blog ↗ · 2024-07-17 Cached

OpenAI researchers found that optimizing language models purely for correct answers reduces human interpretability, and propose 'prover-verifier games' where a prover generates solutions and a verifier checks them, improving legibility for both humans and AI systems.

0 favorites 0 likes

alignment

Submit Feedback