adversarial-evaluation

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv cs.AI · 12h ago

PseudoBench is a benchmark for evaluating whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Tests of seven state-of-the-art agents show that they readily produce persuasive pseudoscientific reports, with near-zero refusal rates; the authors call for scientific alignment before deployment.

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

arXiv cs.AI · yesterday

This paper presents an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues through natural language queries, with a focus on safety and adversarial robustness. The system integrates three agents for intent interpretation, API call generation, and risk management.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

arXiv cs.AI · 2026-06-09

This paper introduces MAC-Bench, a dynamic adversarial benchmark for evaluating procedural compliance in multi-agent systems. It proposes the SERV pipeline for generating contamination-free scenarios, along with new metrics such as Compliance-Weighted Success Rate (CSR) and Machiavellian Gap (MG).

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

arXiv cs.LG · 2026-06-01

This paper introduces bounded behavioral indistinguishability, a formal framework for evaluating black-box LLM distillation beyond semantic similarity. Experiments on Qwen and Llama models show that distillation reduces but does not eliminate adversarial distinguishability, highlighting the need for category-aware evaluation.

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · 2026-04-22

This empirical study shows that multi-generation sampling significantly improves jailbreak detection in LLMs, revealing harmful outputs that single-generation audits miss.
