SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Hugging Face Daily Papers 05/31/26, 12:00 AM Papers

benchmarking safety large-language-models coding-agents stateful-workspaces alignment environment-aware

Summary

SABER introduces a benchmark for evaluating the operational safety of LLM coding agents in realistic stateful project workspaces, showing that even the best model has over a 54% harmful safety-violation rate, indicating insufficient alignment for real-world environments.

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined. We present SABER, a benchmark for environment-aware operational safety that places models in realistic agent-style projects and evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54% harmful safety-violation rate (HSR), suggesting that current alignment remains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.

Original Article

View Cached Full Text

Cached at: 06/05/26, 06:09 PM

Paper page - SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Source: https://huggingface.co/papers/2606.01317 Published on May 31

Submitted byhttps://huggingface.co/lingfengzhe

Qi HUon Jun 5

Abstract

Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.

Large language modelsare increasingly deployed ascoding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts onstateful workspaceslargely unexamined. We present SABER, a benchmark forenvironment-aware operational safetythat places models in realisticagent-style projectsand evaluates safety from the final environment state after a sequence of actions. Beyond binary safety-violation reports, SABER categorizes violations by cause, enabling analysis of model-specific safety profiles. Our evaluations show that even the best-performing model has more than a 54%harmful safety-violation rate(HSR), suggesting that currentalignmentremains insufficient for realistic project environments. SABER further reveals distinct safety profiles across models. Our benchmark is publicly available at https://github.com/sssr-lab/saber.

View arXiv page View PDF Add to collection

Get this paper in your agent:

hf papers read 2606\.01317

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.01317 in a model README.md to link it from this page.

Datasets citing this paper1

#### sssr-lab/SABER Viewer• Updatedabout 5 hours ago • 1.43k • 7k

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.01317 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Paper page - SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract

Models citing this paper0

Datasets citing this paper1

Spaces citing this paper0

Collections including this paper0

Similar Articles

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

The Cold-Start Safety Gap in LLM Agents

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Submit Feedback

Similar Articles

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

The Cold-Start Safety Gap in LLM Agents

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents