llm-safety

#llm-safety

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv cs.CL · 12h ago

This paper systematically compares fine-tuned encoder classifiers (ModernBERT family) against decoder-based safety judges for LLM adversarial evaluation, finding that encoders can offer a cost- and latency-efficient alternative without significant performance loss.

#llm-safety

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 12h ago

This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.

#llm-safety

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

arXiv cs.CL · 12h ago

PolicyAlign proposes a framework that directly aligns LLMs with natural-language safety policies via synthetic instruction generation and on-policy self-distillation, improving safety without relying on costly supervision data.

#llm-safety

A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

arXiv cs.CL · 12h ago

This survey synthesizes research on toxicity detection and detoxification for multilingual large language models, cataloging threat models, task formulations, detection approaches, and mitigation strategies, while identifying persistent challenges such as uneven language coverage and culturally contingent definitions of harm.

#llm-safety

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

arXiv cs.CL · 12h ago

This paper investigates how jailbreak attempts are encoded in the internal representations of large language models by analyzing token-level predictive entropy trajectories across layers using the logit lens. It finds that entropy dynamics at intermediate layers are more discriminative than aggregate statistics, providing a training-free detection method consistent across multiple models.

#llm-safety

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv cs.AI · yesterday

AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.

#llm-safety

One Year Later...The Harms Persist, But So Do We!

arXiv cs.CL · yesterday

This study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks, finding that safety safeguards are reliable only for suicide and self-harm, with failure rates of up to 100% for other conditions such as eating disorders and substance use disorder.

#llm-safety

@GoSailGlobal: https://x.com/GoSailGlobal/status/2068879365711032708

X AI KOLs Timeline · 3d ago

gwern proposes the 'Guardian Angel' approach: training an LLM digital twin that imitates the user, in order to address the principal-agent problem and the security risks of general-purpose AI assistants. The post lays out a full roadmap from alignment theory to technical implementation.

#llm-safety

@stanfordnlp: CoT Monitoring: Where Does a Hot Safety Problem Come From? @peterbhase and @ChrisGPotts https://ai.stanford.edu/blog/co…

X AI KOLs Following · 5d ago

This article traces the history and rapid emergence of chain-of-thought (CoT) monitoring as a critical AI safety technique, from its first arXiv mention to industrial deployment within a year, and explores its intellectual roots in monitoring and explainability.

#llm-safety

Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

Hugging Face Daily Papers · 6d ago

This paper introduces Tiered Language Models (TLMs), which allow a single set of open-weight model parameters to support multiple capability levels controlled by secret keys. The method enables selective exposure of private capabilities while preserving public model behavior and resisting extraction.

#llm-safety

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

arXiv cs.AI · 2026-06-18

This paper proposes Safety Reflection Pretraining, a method that integrates regular safety reflections into pretraining corpora to embed self-monitoring directly into language modeling, showing improved safety alignment and reduced attack success rates in 1.7B models.

#llm-safety

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

arXiv cs.AI · 2026-06-18

This paper introduces SciRisk-Bench, a benchmark for evaluating the safety of large language models in AI4Science contexts, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions to assess both scientific competence and risk awareness.

#llm-safety

Bypassing LLM Guardrails: How Plain Text Shifts Latent Trajectories Without Jailbreaks

Reddit r/AI_Agents · 2026-06-17

The article presents a research finding that saturating an LLM's context window with benign narrative text can dominate the attention mechanism and shift latent trajectories, potentially bypassing alignment guardrails without traditional jailbreaks. It argues that current alignment methods are a superficial fix for a fundamentally fluid architecture.

#llm-safety

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv cs.AI · 2026-06-17

PseudoBench is a benchmark to evaluate whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Testing seven state-of-the-art agents reveals they readily produce persuasive pseudoscientific reports with near-zero refusal rates, calling for scientific alignment before deployment.

#llm-safety

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv cs.CL · 2026-06-17

Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs, achieving significant improvements over existing monitors and providing human-inspectable evidence.

#llm-safety

Statistically we are cooked

Reddit r/artificial · 2026-06-15

Argues that because LLMs must encode harmful content to identify it and jailbreaks are always statistically possible given large user bases, there is a non-zero chance of harm; the author therefore advocates against censorship to ensure good actors have the same tools as bad actors.

#llm-safety

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

Reddit r/MachineLearning · 2026-06-14

An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.

#llm-safety

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

arXiv cs.CL · 2026-06-12

This paper proposes SafeLLM, an extraction-based approach for retrieving information from safety-critical documents, showing that line-number selection outperforms rewriting-based RAG methods in reducing hallucinations while maintaining high recall.

#llm-safety

Malware developers added nuclear and biological weapons text to their spyware

Hacker News Top · 2026-06-11

Malware developers are embedding references to nuclear and biological weapons in spyware to trigger LLM safety refusals and thereby evade AI-powered security scanners. This highlights a second-order blind spot in AI safety alignment that attackers are starting to exploit.

#llm-safety

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

arXiv cs.CL · 2026-06-11

Introduces Schützen, a safety dataset for evaluating LLMs in Bulgarian and German, revealing cross-language differences in safety behavior and advocating for region-specific evaluation resources.
