safety-evaluation

#safety-evaluation

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

arXiv cs.LG · 2026-06-26

This paper examines the assumption that setting an LLM judge's temperature to 0 makes safety evaluations deterministic. It finds that many harnesses in practice set neither temperature nor seed, which leads to high variance, and that non-determinism persists even at temperature=0 because of provider-level randomness and API changes.

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv cs.CL · 2026-06-25

This paper systematically compares fine-tuned encoder classifiers (ModernBERT family) against decoder-based safety judges for LLM adversarial evaluation, finding that encoders can offer a cost- and latency-efficient alternative without significant performance loss.

Predicting model behavior before release by simulating deployment

OpenAI Blog · 2026-06-16

OpenAI introduces Deployment Simulation, a method to simulate future model deployments by replaying past conversations in a privacy-preserving manner with candidate models to predict real-world behavior and identify novel misalignment before release.

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Hugging Face Daily Papers · 2026-06-09

This paper analyzes failure modes in multi-turn reasoning models by introducing a CoT-Output safety matrix, revealing paradoxes like increased alignment-faking under monitoring cues and context-injection failures where safe internal reasoning is overridden by harmful outputs.

OBLITERATUS/Gemma-4-12B-OBLITERATED

Hugging Face Models Trending · 2026-06-05

OBLITERATUS releases Gemma-4-12B-OBLITERATED for alignment research, described as the first abliterated model to achieve zero refusals without benchmark regression, using a novel two-pass surgery pipeline.

85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics

Reddit r/LocalLLaMA · 2026-05-17

This post presents Abliterlitics, an open-source toolkit for analyzing abliteration techniques, and compares five abliteration variants of Qwen3.6-27B using 85 GPU-hours of benchmarks, safety evaluations, and weight forensics. Heretic and Huihui show the best capability preservation, while all five variants achieve near-complete safety removal.

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

arXiv cs.CL · 2026-05-13

This paper proposes a safety-oriented, consequence-aware evaluation framework for large language models in Air Traffic Control, revealing that high aggregate accuracy masks significant reliability issues in handling high-risk semantic errors.

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

arXiv cs.CL · 2026-04-21

Researchers identify a systematic safety failure in LLMs where reformulating harmful requests as forced-choice multiple-choice questions (MCQs) bypasses refusal behavior, even in models that reject equivalent open-ended prompts. Evaluated across 14 proprietary and open-source models, the study reveals current safety benchmarks substantially underestimate risks in structured decision-making settings.

GPT-5.1-Codex-Max System Card

OpenAI Blog · 2025-11-19

OpenAI releases GPT-5.1-Codex-Max, a frontier agentic coding model trained on software engineering tasks with native multi-context window support through compaction, designed to handle millions of tokens in a single task. The system card details comprehensive safety measures and preparedness framework evaluations across cybersecurity, biology, and AI self-improvement domains.

GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum

OpenAI Blog · 2025-11-12

OpenAI releases GPT-5.1 Instant and GPT-5.1 Thinking models with improved conversational abilities and adaptive reasoning. The system card addendum documents safety mitigations including expanded evaluations for mental health and emotional reliance.

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI Blog · 2025-08-05

OpenAI releases gpt-oss-120b and gpt-oss-20b, open-weight reasoning models under Apache 2.0 license designed for agentic workflows with strong instruction following, tool use, and chain-of-thought capabilities. The release includes comprehensive safety evaluations confirming the models do not reach high capability thresholds for biological, chemical, or cyber risks even under adversarial fine-tuning.

Deep research System Card

OpenAI Blog · 2025-02-25

OpenAI launches Deep Research, an agentic capability powered by an early version of o3 that conducts multi-step internet research for complex tasks, with comprehensive safety testing and privacy protections implemented before rollout to Pro users.

OpenAI o1 System Card

OpenAI Blog · 2024-12-05

OpenAI releases the o1 System Card detailing safety evaluations and preparedness framework assessments for the o1 and o1-mini models, which use chain-of-thought reasoning trained with large-scale reinforcement learning to improve safety and robustness.

GPT-4o System Card External Testers Acknowledgements

OpenAI Blog · 2024-08-08

OpenAI publishes acknowledgements for external red teamers and evaluators who contributed to GPT-4o's safety testing and system card development. The document credits numerous individual researchers and organizations including METR and Apollo Research.
