This paper investigates the assumption that setting an LLM judge's temperature to 0 ensures deterministic safety evaluations. It finds that, in practice, many harnesses set neither temperature nor seed, which leads to high variance. Even with temperature=0, non-determinism persists because of provider-level randomness and API changes.
This paper systematically compares fine-tuned encoder classifiers (ModernBERT family) against decoder-based safety judges for LLM adversarial evaluation, finding that encoders can offer a cost- and latency-efficient alternative without significant performance loss.
OpenAI introduces Deployment Simulation, a method that simulates future model deployments by replaying past conversations with candidate models in a privacy-preserving manner, in order to predict real-world behavior and identify novel misalignment before release.
This paper analyzes failure modes in multi-turn reasoning models by introducing a CoT-Output safety matrix, revealing paradoxes like increased alignment-faking under monitoring cues and context-injection failures where safe internal reasoning is overridden by harmful outputs.
OBLITERATUS releases Gemma-4-12B-OBLITERATED, presented as the first abliterated model to achieve zero refusals without benchmark regression. The model is produced with a novel two-pass surgery pipeline and is released for alignment research.
This post presents Abliterlitics, an open-source toolkit for analyzing abliteration techniques, and compares five abliteration variants of Qwen3.6-27B using 85 GPU-hours of benchmarks, safety evaluations, and weight forensics. Heretic and Huihui show the best capability preservation, while all five variants achieve near-complete safety removal.
This paper proposes a safety-oriented, consequence-aware evaluation framework for large language models in Air Traffic Control, revealing that high aggregate accuracy masks significant reliability issues in handling high-risk semantic errors.
Researchers identify a systematic safety failure in LLMs: reformulating harmful requests as forced-choice multiple-choice questions (MCQs) bypasses refusal behavior, even in models that reject equivalent open-ended prompts. Evaluated across 14 proprietary and open-source models, the study reveals that current safety benchmarks substantially underestimate risks in structured decision-making settings.
OpenAI releases GPT-5.1-Codex-Max, a frontier agentic coding model trained on software engineering tasks. It natively supports working across multiple context windows through compaction and is designed to handle millions of tokens in a single task. The system card details comprehensive safety measures and Preparedness Framework evaluations across the cybersecurity, biology, and AI self-improvement domains.
OpenAI releases GPT-5.1 Instant and GPT-5.1 Thinking models with improved conversational abilities and adaptive reasoning. The system card addendum documents safety mitigations including expanded evaluations for mental health and emotional reliance.
OpenAI releases gpt-oss-120b and gpt-oss-20b, open-weight reasoning models under the Apache 2.0 license, designed for agentic workflows with strong instruction following, tool use, and chain-of-thought capabilities. The release includes comprehensive safety evaluations confirming that the models do not reach high capability thresholds for biological, chemical, or cyber risks, even under adversarial fine-tuning.
OpenAI launches Deep Research, an agentic capability powered by an early version of o3 that conducts multi-step internet research for complex tasks, with comprehensive safety testing and privacy protections implemented before rollout to Pro users.
OpenAI releases the o1 System Card, detailing safety evaluations and Preparedness Framework assessments for the o1 and o1-mini models, which are trained with large-scale reinforcement learning to reason using chain of thought, improving safety and robustness.
OpenAI publishes acknowledgements for the external red teamers and evaluators who contributed to GPT-4o's safety testing and system card development. The document credits numerous individual researchers and organizations, including METR and Apollo Research.