adversarial

TopoGuard: Graph Theory Based Defenses Against Split-Knowledge Attacks on RAG

arXiv cs.CL · 17h ago

Introduces TopoGuard, a graph theory-based defense against split-knowledge attacks in RAG systems, where multiple individually benign documents combine to produce harmful outputs. The method detects malicious contexts by building a semantic similarity graph, significantly outperforming existing per-document filters like LlamaGuard.

@paul_cal: Better eval vs reality awareness might have "helped" here "oh I shouldn't hack the actual HuggingFace via genuine sandb…

X AI KOLs Timeline · 2d ago

Discussion on AI models' difficulty distinguishing between simulated evaluation environments and real-world scenarios, using the example of a model hacking HuggingFace via a sandbox escape.

Agent OPFOR — open-source adversary emulation for AI agents. Named after the concept for a reason.

Reddit r/artificial · 2026-07-07

Agent OPFOR is an open-source adversary emulation tool for testing AI agents, LLM apps, and MCP servers. It provides CLI, browser extension, and MCP server interfaces to simulate attacks based on OWASP Top 10 lists.

If your agent reads a webpage, the page can tell it to lie about the page

Reddit r/AI_Agents · 2026-07-03

A developer built a non-AI-based checker that detects hidden instructions on web pages designed to deceive AI agents, addressing a vulnerability where a page can instruct an agent to lie about that page's safety.

Memory as an Attack Surface in LLM Agents: A Study on Multiple-Choice Question Answering

arXiv cs.AI · 2026-06-30

This paper investigates memory manipulation in LLM-based agents for multiple-choice question answering, showing that corrupted memories can cause agents to select incorrect options even when the current query is clean.

Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

arXiv cs.LG · 2026-06-30

This paper introduces a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions, developing a face-crossing price that separates consequential from harmless non-stationarity. It shows that dynamic regret decomposes into intrinsic priced face motion plus within-face selection error.

The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning

arXiv cs.CL · 2026-06-30

This paper presents the first comprehensive empirical study of safety impacts of benign multilingual fine-tuning on LLMs, showing that safety outcomes vary drastically by language and that assessing only English is insufficient.

@OpenAI: We also tested whether alignment persisted under pressure. The model was harder to steer toward harmful behavior with a…

X AI KOLs · 2026-06-18

OpenAI reports that its model was harder to steer toward harmful behavior through adversarial prompting and fine-tuning, indicating that its alignment persists better under pressure.

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

arXiv cs.AI · 2026-06-15

This paper proposes Adversarial Concept Search, a method that uses the representational geometry of large language models to predict compositional failures without evaluating specific inputs. The approach identifies high-risk scenarios by measuring interference between salient features.

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv cs.CL · 2026-06-15

This paper introduces WebDecept, a framework for injecting deceptive interface patterns into web environments to evaluate the safety of autonomous web agents. Experiments show current agents are highly susceptible to such manipulations, highlighting safety challenges for real-world deployment.

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

arXiv cs.AI · 2026-06-09

This paper introduces VATS, a mutation-driven framework that systematically evolves adversarial payloads to exploit error-path injection in MCP-based tool-calling agents. It demonstrates that error messages with implicit authority triple the success rate of standard indirect prompt injection across frontier models.

Adversarial Creation and Detection of AI-Generated Social Bot Content

arXiv cs.CL · 2026-06-08

This paper presents an adversarial methodology for creating and detecting AI-generated social bot content, curating a multilingual, cross-platform dataset of paired human and AI messages. Training on this adversarial data yields detection that significantly outperforms existing content-based bot detection models in real-world settings.

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Hugging Face Daily Papers · 2026-06-08

Researchers propose an adversarial hacker-fixer loop using LLM agents to automatically patch brittle verifiers in agent benchmarks, reducing attack success rates from 62% to 0% on KernelBench and demonstrating that weaker defenders can neutralize much stronger attackers.

Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

arXiv cs.AI · 2026-06-02

This paper introduces a deliberative curation protocol for multi-agent knowledge bases, addressing governance gaps such as agent statelessness and sycophancy. It evaluates the protocol via simulation, showing improved resilience under adversarial conditions.

CSULoRA: Closest Safe Update Low-Rank Adaptation

arXiv cs.LG · 2026-06-01

CSULoRA is a post-hoc method for correcting trained LoRA adapters to preserve safety alignment while maintaining utility, using closest safe update estimation.

I built a Hermes Skill where 3 AI models argue with each other before giving you an answer - adversarial multi-model consensus with RRF + Borda Count ranking

Reddit r/AI_Agents · 2026-05-31

PolyGnosis is an adversarial multi-model consensus system built as a Hermes skill. It runs three AI models in parallel with different expert personas, followed by a hostile critic phase, scoring via RRF and Borda Count, and a synthesis gate. The whole system was built agentically using DeepSeek V4-Pro.
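
For readers unfamiliar with the two scoring rules named in this summary, here is a minimal sketch of Reciprocal Rank Fusion (RRF) and Borda Count in their generic textbook forms. It is not PolyGnosis's code: the function names, the sample rankings, and the constant `k=60` (a common default for RRF) are all assumptions made for illustration, and the post does not say how the two scores are combined.

```python
from collections import defaultdict


def rrf_scores(rankings, k=60):
    """Reciprocal Rank Fusion: each ranking adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item] += 1.0 / (k + rank)
    return dict(scores)


def borda_scores(rankings):
    """Borda Count: with n items, first place earns n-1 points, last earns 0."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):  # positions are 0-based
            scores[item] += n - 1 - position
    return dict(scores)


def ranked(scores):
    """Order items by descending score, breaking ties by name."""
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical input: three judges each rank three candidate answers.
rankings = [
    ["answer_a", "answer_b", "answer_c"],
    ["answer_b", "answer_a", "answer_c"],
    ["answer_a", "answer_c", "answer_b"],
]

print(ranked(rrf_scores(rankings)))
print(borda_scores(rankings))
```

Both rules aggregate ranked lists without needing comparable raw scores from each model. RRF rewards items near the top of any list, while Borda weighs every position linearly.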

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

arXiv cs.CL · 2026-05-25

This paper reveals the existence of hidden human-like spans in machine-generated texts and proposes a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of these spans.

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Hugging Face Daily Papers · 2026-05-22

This paper proposes an adversarial Sobolev alignment method for faithful image super resolution, aiming to reduce artifacts and improve fidelity.

I built two multi-agent AI systems with completely opposite philosophies. Here's what I've learned so far.

Reddit r/AI_Agents · 2026-05-20

The author builds two multi-agent AI systems with opposite design philosophies: ChaoticAI (collaborative, org-chart-based) and S.A.G.E. with RAAC (adversarial argumentation). The post shares reflections on memory architecture and the potential synthesis of both approaches.

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

arXiv cs.CL · 2026-05-19

NewsLens is a multi-agent framework for navigating and exposing adversarial news bias, aimed at identifying and countering biased content in news media.
