SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Hugging Face Daily Papers

Summary

SABER introduces a benchmark for evaluating the operational safety of LLM coding agents in realistic stateful project workspaces, showing that even the best model has over a 54% harmful safety-violation rate, indicating insufficient alignment for real-world environments.

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.

Cached at: 06/05/26, 06:09 PM

Source: https://huggingface.co/papers/2606.01317
Published on May 31

Submitted by Qi HU (https://huggingface.co/lingfengzhe) on Jun 5

Abstract

Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.
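
The two measurements described above, a harmful safety-violation rate and a per-cause breakdown, can be illustrated with a minimal sketch. It assumes that HSR is the fraction of agent runs whose final workspace state is judged harmful; the abstract does not give the exact definition. The `EpisodeResult` schema, the cause labels, and the sample data are all hypothetical and are not taken from the SABER repository or its results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeResult:
    """Outcome of one agent run, judged from the final workspace state.

    Hypothetical schema for illustration; not SABER's actual data format.
    """
    task_id: str
    harmful: bool                # True if the final state was judged harmful
    cause: Optional[str] = None  # hypothetical cause label, set only when harmful


def harmful_violation_rate(results):
    """Assumed HSR: fraction of episodes whose final state was judged harmful."""
    if not results:
        raise ValueError("no episodes to score")
    return sum(r.harmful for r in results) / len(results)


def violations_by_cause(results):
    """Count harmful episodes per cause label to form a per-model safety profile."""
    return Counter(r.cause or "unlabeled" for r in results if r.harmful)


# Made-up episodes; the cause names are placeholders, not SABER categories.
results = [
    EpisodeResult("t1", True, "destructive_command"),
    EpisodeResult("t2", False),
    EpisodeResult("t3", True, "secret_exposure"),
    EpisodeResult("t4", True, "destructive_command"),
]

rate = harmful_violation_rate(results)
profile = violations_by_cause(results)
print(f"HSR = {rate:.0%}")
print(profile.most_common())
```

Keeping the rate and the cause counts separate mirrors the distinction the abstract draws: two models with the same overall rate could still show different profiles once violations are grouped by cause.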

Datasets citing this paper: 1

sssr-lab/SABER

Similar Articles

The Cold-Start Safety Gap in LLM Agents

Hugging Face Daily Papers

This paper identifies a 'cold-start safety gap' in tool-calling LLM agents, where they are most vulnerable at the beginning of a session and become safer after completing regular agentic tasks. The authors introduce the SODA benchmark to evaluate this phenomenon and recommend a simple deployment strategy of warming up agents with regular tasks before safety-critical requests.

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

arXiv cs.CL

MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.