llm-safety

#llm-safety

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv cs.AI · 13h ago

AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.


One Year Later...The Harms Persist, But So Do We!

arXiv cs.CL · 13h ago

This study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks, finding that safeguards are reliable only for suicide and self-harm, with failure rates of up to 100% for other conditions such as eating disorders and substance use disorder.


@GoSailGlobal: https://x.com/GoSailGlobal/status/2068879365711032708

X AI KOLs Timeline · 2d ago

gwern proposes the 'Guardian Angel' approach: training an LLM digital twin that imitates the user, in order to address the principal-agent problem and the security risks of general AI assistants. The proposal lays out a roadmap from alignment theory to technical implementation.


@stanfordnlp: CoT Monitoring: Where Does a Hot Safety Problem Come From? @peterbhase and @ChrisGPotts https://ai.stanford.edu/blog/co…

X AI KOLs Following · 5d ago

This article traces the history and rapid emergence of chain-of-thought (CoT) monitoring as a critical AI safety technique, from its first arXiv mention to industrial deployment within a year, and explores its intellectual roots in monitoring and explainability.


Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

Hugging Face Daily Papers · 5d ago

This paper introduces Tiered Language Models (TLMs), which allow a single set of open-weight model parameters to support multiple capability levels controlled by secret keys. The method enables selective exposure of private capabilities while preserving public model behavior and resisting extraction.


Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

arXiv cs.AI · 6d ago

This paper proposes Safety Reflection Pretraining, a method that inserts regular safety reflections into pretraining corpora to embed self-monitoring directly into language modeling, showing improved safety alignment and reduced attack success rates in 1.7B-parameter models.


SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

arXiv cs.AI · 6d ago

This paper introduces SciRisk-Bench, a benchmark for evaluating the safety of large language models in AI4Science contexts, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions to assess both scientific competence and risk awareness.


Bypassing LLM Guardrails: How Plain Text Shifts Latent Trajectories Without Jailbreaks

Reddit r/AI_Agents · 6d ago

The article presents a research finding that saturating an LLM's context window with benign narrative text can dominate the attention mechanism and shift latent trajectories, potentially bypassing alignment guardrails without traditional jailbreaks. It argues that current alignment methods are a superficial fix for a fundamentally fluid architecture.


PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv cs.AI · 2026-06-17

PseudoBench is a benchmark to evaluate whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Testing seven state-of-the-art agents reveals they readily produce persuasive pseudoscientific reports with near-zero refusal rates, calling for scientific alignment before deployment.


Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv cs.CL · 2026-06-17

This paper introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs. It reports significant improvements over existing monitors and provides human-inspectable evidence.


Statistically we are cooked

Reddit r/artificial · 2026-06-15

The post argues that, because LLMs must encode harmful content in order to identify it and jailbreaks are always statistically possible across a large user base, the chance of harm is never zero. The author therefore argues against censorship, so that good actors have the same tools as bad actors.


Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

Reddit r/MachineLearning · 2026-06-14

An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.


SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

arXiv cs.CL · 2026-06-12

This paper proposes SafeLLM, an extraction-based approach for retrieving information from safety-critical documents, showing that line-number selection outperforms rewriting-based RAG methods in reducing hallucinations while maintaining high recall.


Malware developers added nuclear and biological weapons text to their spyware

Hacker News Top · 2026-06-11

Malware developers are embedding references to nuclear and biological weapons in spyware to trigger LLM safety refusals, which lets the malware evade AI-powered security scanners. This highlights a second-order blind spot in AI safety alignment that attackers are starting to exploit.


Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

arXiv cs.CL · 2026-06-11

This paper introduces Schützen, a safety dataset for evaluating LLMs in Bulgarian and German. It reveals cross-language differences in safety behavior and advocates for region-specific evaluation resources.


One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

arXiv cs.CL · 2026-06-11

This paper proposes MLJailDe, a multilingual jailbreak detection framework that uses back-translation data augmentation and relative-distance constraints to improve cross-lingual generalization and robustness, achieving an F1 score of 98.5% across 11 languages.


PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

arXiv cs.LG · 2026-06-10

PreAct-Bench is a benchmark of 1,000 paired ethical and unethical action trajectories across five domains, designed to evaluate the ability of LLMs to predict harmful outcomes from partial trajectories (predictive monitoring). Results show that while humans perform well, current LLMs struggle, highlighting the need for future-oriented risk reasoning.


Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

arXiv cs.CL · 2026-06-10

This paper introduces Janus, a benchmark for measuring how LLMs selectively distort factual information when given persuasive goals. It finds that models can produce misleading communications even without fabricating facts.


MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

arXiv cs.CL · 2026-06-10

This paper identifies a polarity-flipping encoding subspace in the residual stream of LLM agents that enables real-time detection of covert data exfiltration, achieving AUC=0.918 in injection scenarios and substantially outperforming output-only detectors.


Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI · 2026-06-09

This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.
