PseudoBench is a benchmark for evaluating whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Tests of seven state-of-the-art agents show that they readily produce persuasive pseudoscientific reports with near-zero refusal rates, which calls for scientific alignment before deployment.
This paper presents an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries, with a focus on safety and adversarial robustness. The system integrates three agents that handle intent interpretation, API call generation, and risk management.
This paper introduces MAC-Bench, a dynamic adversarial benchmark for evaluating procedural compliance in multi-agent systems. It proposes the SERV pipeline to generate contamination-free scenarios and new metrics like Compliance-Weighted Success Rate (CSR) and Machiavellian Gap (MG).
This paper introduces bounded behavioral indistinguishability, a formal framework for evaluating black-box LLM distillation beyond semantic similarity. Experiments on Qwen and Llama models show that distillation reduces but does not eliminate adversarial distinguishability, highlighting the need for category-aware evaluation.
This empirical study shows that multi-generation sampling significantly improves jailbreak detection in LLMs, revealing hidden harmful outputs that single-generation audits miss.
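
The intuition behind multi-generation sampling in the last summary can be illustrated with a minimal sketch. It is not the study's actual procedure: the names `detection_probability` and `audit_prompt`, and the `generate` and `is_harmful` callables, are hypothetical, and the probability formula assumes that samples are independent with a fixed per-sample harmful rate.

```python
from typing import Callable


def detection_probability(p_harmful: float, k: int) -> float:
    """Chance that at least one of k samples is harmful.

    Assumes (for illustration only) independent samples that each
    come out harmful with the same probability p_harmful.
    """
    if not 0.0 <= p_harmful <= 1.0:
        raise ValueError("p_harmful must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p_harmful) ** k


def audit_prompt(
    generate: Callable[[str], str],     # hypothetical: samples one completion
    is_harmful: Callable[[str], bool],  # hypothetical: harm classifier
    prompt: str,
    k: int,
) -> int:
    """Sample k completions for one prompt and count the flagged ones."""
    return sum(1 for _ in range(k) if is_harmful(generate(prompt)))
```

Under these assumptions, a prompt that yields a harmful completion 10% of the time is caught by a single-generation audit with probability 0.1, while ten samples raise that to roughly 0.65. A prompt would be treated as a successful jailbreak whenever `audit_prompt` returns a count above zero.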