SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
Summary
SABER introduces a benchmark for evaluating the operational safety of LLM coding agents in realistic stateful project workspaces. Even the best-performing model has a harmful safety-violation rate above 54%, indicating that current alignment is insufficient for real-world environments.
Cached at: 06/05/26, 06:09 PM
Source: https://huggingface.co/papers/2606.01317
Published on May 31
Submitted by Qi HU (https://huggingface.co/lingfengzhe) on Jun 5
Abstract
Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.
Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.
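To make the two reported quantities concrete, here is a minimal sketch of how a violation rate and a per-cause profile could be computed from final-state judgments. It is not SABER's implementation: the `EpisodeResult` schema, the cause labels, and the reading of HSR as "fraction of runs whose final workspace state shows at least one violation" are all assumptions made for illustration, and the paper's exact definition may differ.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    """Outcome of one agent run, judged from the final workspace state.

    Hypothetical schema; not taken from the SABER repository.
    """
    task_id: str
    # Causes of violations found in the final state; empty means no violation.
    violation_causes: list[str] = field(default_factory=list)


def harmful_violation_rate(results: list[EpisodeResult]) -> float:
    """Fraction of runs with at least one violation (assumed reading of HSR)."""
    if not results:
        return 0.0
    harmful = sum(1 for r in results if r.violation_causes)
    return harmful / len(results)


def violation_profile(results: list[EpisodeResult]) -> Counter:
    """Count violations per cause, giving a simple per-model safety profile."""
    return Counter(cause for r in results for cause in r.violation_causes)


# Illustrative data only; the cause labels are made up for this example.
results = [
    EpisodeResult("t1", ["deleted_untracked_files"]),
    EpisodeResult("t2"),
    EpisodeResult("t3", ["leaked_secret", "deleted_untracked_files"]),
    EpisodeResult("t4"),
]
print(harmful_violation_rate(results))  # 0.5
print(violation_profile(results))
```

Separating the rate from the per-cause counts mirrors the abstract's distinction between a binary violation report and a cause-level breakdown: two models with the same rate can still differ in which causes dominate.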
Get this paper in your agent:
hf papers read 2606.01317
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
sssr-lab/SABER • Updated about 5 hours ago • 1.43k • 7k
Similar Articles
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
SaaSBench is a new benchmark for evaluating AI agents in enterprise SaaS development, involving multi-component system integration across 30 tasks, 6 domains, and 5,370 validation nodes. Experiments reveal that the main bottleneck for agents is system configuration and integration rather than isolated code generation.
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.
The Cold-Start Safety Gap in LLM Agents
This paper identifies a 'cold-start safety gap' in tool-calling LLM agents, where they are most vulnerable at the beginning of a session and become safer after completing regular agentic tasks. The authors introduce the SODA benchmark to evaluate this phenomenon and recommend a simple deployment strategy of warming up agents with regular tasks before safety-critical requests.
MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.