SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
Summary
This paper introduces SciRisk-Bench, a benchmark for evaluating the safety of large language models in AI4Science contexts, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions to assess both scientific competence and risk awareness.
# SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
Source: [https://arxiv.org/html/2606.18936](https://arxiv.org/html/2606.18936)
Linghao Feng 1,2,\*; Yinqian Sun 1,\*; Dongqi Liang 1,3; Sicheng Shen 1,2,4; Chenfei Yan 1; Yuxuan Peng 8; Yilin Zhao 1; Haibo Tong 1,2; Kai Li 7; FeiFei Zhao 1,†; Yi Zeng 1,5,6,7,†

1 Brain\-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Future Technology, University of Chinese Academy of Sciences, China; 3 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 4 Zhongguancun Academy, China; 5 Beijing Key Laboratory of Safe AI and Superalignment; 6 Gaoling School of AI, Renmin University of China; 7 Beijing Institute of AI Safety and Governance \(Beijing\-AISI\); 8 School of Humanities, University of Chinese Academy of Sciences, China

\* Equal contribution\. † Corresponding author\.

fenglinghao2022@ia\.ac\.cn, yi\.zeng@ruc\.edu\.cn
###### Abstract
Large language models \(LLMs\) are increasingly embedded in AI for Science \(AI4Science\) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery\. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high\-stakes scientific contexts\. Existing AI4Science safety datasets cover several disciplines and task formats, but leave the underlying risk dimensions underspecified\. We introduce SciRisk\-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines\. SciRisk\-Bench covers 7 disciplines, 31 sub\-disciplines, and 10 risk dimensions\. In our experiments, we evaluate both mainstream LLMs and science\-oriented LLMs across risk dimensions, disciplines, and sub\-disciplines, enabling fine\-grained diagnosis of where scientific models remain unsafe\.
## 1 Introduction
AI4Science has become a central paradigm for accelerating scientific discovery\. Recent systems have demonstrated that machine learning and LLM\-based methods can assist mathematical program search \(Romera\-Paredes et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib1)\) and discover efficient algorithms \(Mankowitz et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib2)\)\. In materials science, AI has supported large\-scale materials discovery \(Merchant et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib3)\) and autonomous synthesis \(Szymanski et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib4)\)\. In biology, protein structure prediction has been transformed by AlphaFold \(Jumper et al\., [2021](https://arxiv.org/html/2606.18936#bib.bib5)\), with later work extending biomolecular modeling to broader molecular complexes \(Krishna et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib6)\)\. In geoscience, foundation models have been proposed for weather and climate modeling \(Nguyen et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib7)\), and neural forecasting systems have achieved strong medium\-range weather prediction \(Lam et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib8)\)\. Foundation models are also entering generalist medical AI \(Moor et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib9)\) and clinical knowledge reasoning \(Singhal et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib10)\)\. As LLMs become natural\-language interfaces to scientific knowledge, tools, and protocols, they increasingly mediate decisions that may affect laboratories, public health, critical infrastructure, and scientific governance\.
This expanding role makes AI4Science safety a distinct and urgent evaluation problem\. Scientific mistakes are not limited to ordinary factual errors: an unsafe answer may provide actionable dual\-use details, omit laboratory precautions, overstate uncertain evidence, expose private or sensitive data, misrepresent regulations, or give authoritative\-sounding but false explanations\. Prior studies have shown that AI systems can amplify dual\-use risks in drug discovery \(Urbina et al\., [2022](https://arxiv.org/html/2606.18936#bib.bib11)\) and rely on misleading shortcuts in medical imaging \(DeGrave et al\., [2021](https://arxiv.org/html/2606.18936#bib.bib12)\)\. Chemistry\-specific prompting attacks further expose safety vulnerabilities in molecular representations \(Wong et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib13)\), while the convergence of synthetic biology and AI raises broader regulatory and security concerns \(Hynek, [2025](https://arxiv.org/html/2606.18936#bib.bib14)\)\. General LLM safety benchmarks are useful, but scientific settings require specialized evaluation because risk is tightly coupled with domain expertise, experimental context, and regulatory constraints\.
Several benchmarks have begun to address this gap\. SciBench evaluates college\-level scientific problem solving \(Wang et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib15)\), ScienceQA focuses on multimodal science question answering \(Lu et al\., [2022](https://arxiv.org/html/2606.18936#bib.bib16)\), SciEval targets multi\-level scientific research evaluation \(Sun et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib18)\), and SciKnowEval measures multi\-level scientific knowledge \(Feng et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib19)\)\. Safety\-oriented efforts have also emerged: ChemSafetyBench targets chemistry safety \(Zhao et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib20)\), MedSafetyBench evaluates harmful medical requests \(Han et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib21)\), LabSafetyBench focuses on laboratory safety \(Zhou et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib22)\), SciSafeEval evaluates scientific safety alignment \(Li et al\., [2024b](https://arxiv.org/html/2606.18936#bib.bib23)\), WMDP measures malicious\-use knowledge \(Li et al\., [2024a](https://arxiv.org/html/2606.18936#bib.bib24)\), SOSBench studies safety alignment on scientific knowledge \(Jiang et al\., [2025](https://arxiv.org/html/2606.18936#bib.bib25)\), and SafeScientist evaluates risk\-aware scientific agents \(Zhu et al\., [2025](https://arxiv.org/html/2606.18936#bib.bib26)\)\. However, most existing benchmarks still emphasize either disciplinary coverage or broad safety categories\. They provide limited visibility into which *types* of scientific risk drive unsafe behavior inside each discipline\.
We propose SciRisk\-Bench, a risk\-dimension\-aware benchmark for AI4Science safety\. SciRisk\-Bench spans seven scientific disciplines, including domains such as biology, chemistry, geography, engineering, and physics, with representative sub\-disciplines ranging from synthetic biology and organic synthesis to GIS and nuclear physics\. The full discipline hierarchy is described in the Method section\. Unlike prior work that primarily treats scientific safety as a domain\-level problem, SciRisk\-Bench explicitly annotates examples by risk dimensions\. For example, dual\-use captures scientific knowledge that can enable harmful misuse, laboratory safety concerns missing precautions in experimental settings, and hallucinations and misconceptions cover confident but false scientific claims\. This design enables evaluation to answer not only “which discipline is unsafe?”, but also “which risk mechanism causes the failure?”
Our experiments evaluate mainstream LLMs and science\-oriented LLMs across risk dimensions, disciplines, and sub\-disciplines, showing that science\-specialized models often exhibit higher ASR despite their stronger domain fluency\.
The contributions of this work are:
- We propose SciRisk\-Bench, an AI4Science safety benchmark that jointly covers multiple scientific sub\-disciplines and explicit risk dimensions\.
- We introduce a two\-level taxonomy that supports analysis by both risk mechanism and scientific discipline, making failures more interpretable than discipline\-only evaluation\.
- We evaluate mainstream LLMs and science\-oriented LLMs from risk\-dimension and discipline\-level perspectives, providing a basis for fine\-grained safety diagnosis\.

Figure 1: Overview of the SciRisk\-Bench construction and evaluation pipeline\. Prompts are organized by scientific discipline and risk dimension, model responses are judged for unsafe scientific behavior, and ASR is reported at multiple granularities\.
## 2 Related Work
#### Scientific capability benchmarks\.
Early AI4Science evaluation has largely focused on scientific knowledge, reasoning, and problem solving\. SciBench measures college\-level scientific problem solving \(Wang et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib15)\); ScienceQA evaluates multimodal science question answering with explanations \(Lu et al\., [2022](https://arxiv.org/html/2606.18936#bib.bib16)\); GPQA targets graduate\-level, expert\-written questions \(Rein et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib17)\); SciEval provides multi\-level scientific research evaluation \(Sun et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib18)\); and SciKnowEval measures multi\-level scientific knowledge \(Feng et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib19)\)\. These datasets are important for measuring whether models understand scientific concepts, but correctness\-oriented evaluation is not sufficient for safety\. A model can solve scientific problems while still producing outputs that are hazardous, non\-compliant, privacy\-violating, or misleading in practice\.
#### Domain\-specific AI4Science safety benchmarks\.
Recent work has begun to construct safety benchmarks for high\-risk scientific domains\. ChemSafetyBench evaluates LLM safety in chemistry, including controlled substances and risky synthesis contexts \(Zhao et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib20)\)\. MedSafetyBench focuses on harmful medical requests and safe response behavior \(Han et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib21)\)\. LabSafetyBench evaluates laboratory hazard recognition, consequence reasoning, and emergency response \(Zhou et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib22)\)\. These efforts show that scientific safety requires domain knowledge and cannot be reduced to generic refusal behavior\. However, many domain\-specific datasets remain concentrated in chemistry, medicine, or biology, leaving traditional sciences such as astronomy, geography, mathematics, engineering, and physics less systematically covered\.
#### Cross\-disciplinary and red\-teaming benchmarks\.
Cross\-domain safety benchmarks broaden the scope of AI4Science evaluation\. SciSafeEval integrates adversarial prompts across scientific modalities and domains \(Li et al\., [2024b](https://arxiv.org/html/2606.18936#bib.bib23)\)\. WMDP measures malicious\-use knowledge in biology, chemistry, cyber, and related security contexts \(Li et al\., [2024a](https://arxiv.org/html/2606.18936#bib.bib24)\)\. SOSBench benchmarks safety alignment on scientific knowledge with legal and regulatory grounding \(Jiang et al\., [2025](https://arxiv.org/html/2606.18936#bib.bib25)\)\. SafeScientist evaluates risk\-aware scientific discovery by LLM agents \(Zhu et al\., [2025](https://arxiv.org/html/2606.18936#bib.bib26)\)\. General safety benchmarks provide complementary signals: TruthfulQA targets factual falsehoods \(Lin et al\., [2021](https://arxiv.org/html/2606.18936#bib.bib27)\), HaluEval evaluates hallucination \(Li et al\., [2023](https://arxiv.org/html/2606.18936#bib.bib28)\), HarmBench supports automated red\-teaming and refusal evaluation \(Mazeika et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib29)\), and SafetyBench evaluates broad safety behavior \(Zhang et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib30)\)\. Yet these resources often do not expose a fine\-grained mapping between scientific disciplines and concrete risk dimensions\. SciRisk\-Bench complements them by making risk dimensions a first\-class organizing axis\.
#### Benchmark reliability and safety measurement\.
A growing body of work cautions that safety benchmarks can reward superficial refusal or narrow benchmark gaming rather than genuine risk awareness \(Ren et al\., [2024](https://arxiv.org/html/2606.18936#bib.bib31)\)\. This concern is especially important in AI4Science: over\-refusal can make models unusable for benign research, while under\-refusal can expose harmful details\. SciRisk\-Bench is designed to support more diagnostic evaluation by separating failure modes\. For example, hallucination, authority inflation, privacy leakage, laboratory safety omission, and dual\-use leakage should not be collapsed into a single safety score, because each requires different mitigation strategies\.

Table 1: Risk dimensions in SciRisk\-Bench\. The benchmark introduces explicit risk annotations to make model failures interpretable beyond discipline\-level aggregation\.

Figure 2: Model\-level ASR heatmap by risk dimension\. Columns are individual models and rows are risk dimensions; warmer colors indicate higher ASR\. The left block shows mainstream models, and the right block shows science\-specialized models\.
## 3 SciRisk\-Bench
SciRisk\-Bench is organized around two complementary axes: risk dimensions and scientific disciplines\. The risk\-dimension axis captures the mechanism by which a model response may become unsafe\. The discipline axis captures the scientific context in which the risk appears\. This design supports both horizontal comparisons across risk types and vertical comparisons across scientific sub\-fields\.
The dataset contains 350 examples across seven disciplines and 31 sub\-disciplines\. By discipline, it includes 58 mathematics examples, 50 examples each from chemistry, biology, astronomy, and physics, 47 geography examples, and 45 engineering examples\. By risk dimension, the largest category is hallucinations and misconceptions \(118 examples\), followed by dual\-use \(53\), fringe amplification \(38\), knowledge cutoff drift \(27\), regulatory blind spot \(27\), laboratory safety \(26\), safety omission \(25\), authority inflation \(17\), privacy leakage \(11\), and geopolitical sensitivity \(8\)\. This distribution reflects the benchmark’s emphasis on both science\-specific misuse risks and broader reliability risks that can become safety\-critical in scientific workflows\.
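The composition figures above can be checked with a few lines of arithmetic. The sketch below copies the counts from the paragraph; the variable names and the `shares` helper are illustrative and are not part of any released SciRisk-Bench tooling.

```python
# Counts copied from the paragraph above; names are illustrative only.
DISCIPLINE_COUNTS = {
    "mathematics": 58, "chemistry": 50, "biology": 50, "astronomy": 50,
    "physics": 50, "geography": 47, "engineering": 45,
}
RISK_COUNTS = {
    "hallucinations and misconceptions": 118, "dual-use": 53,
    "fringe amplification": 38, "knowledge cutoff drift": 27,
    "regulatory blind spot": 27, "laboratory safety": 26,
    "safety omission": 25, "authority inflation": 17,
    "privacy leakage": 11, "geopolitical sensitivity": 8,
}

def shares(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Both axes partition the same 350 examples.
assert sum(DISCIPLINE_COUNTS.values()) == 350
assert sum(RISK_COUNTS.values()) == 350

risk_shares = shares(RISK_COUNTS)
```

Both axes sum to 350, which is consistent with each example carrying one discipline label and one risk-dimension label. The shares also show how uneven the risk axis is: hallucinations and misconceptions account for roughly a third of the examples, while geopolitical sensitivity has only eight, so per-dimension ASR for the smallest categories rests on very few prompts.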
### 3\.1 Scientific Disciplines and Sub\-disciplines
SciRisk\-Bench uses a two\-level discipline hierarchy to make safety failures more actionable than broad domain labels alone\. The benchmark covers seven disciplines and 31 sub\-disciplines; the full sub\-discipline index is provided in Table [2](https://arxiv.org/html/2606.18936#A1.T2) in the appendix\. This hierarchy is important because different sub\-disciplines expose different risk mechanisms\. For example, pathogens, toxins, and pharmacology may test whether a model leaks dual\-use biological knowledge or omits containment requirements, whereas ecology and evolutionary biology may involve privacy risks when sensitive species\-location data are requested\. Organic synthesis prompts may expose unsafe chemical\-procedure guidance, while cartography and GIS prompts may involve privacy leakage or geopolitical sensitivity\. In physics, nuclear and particle physics can involve dual\-use or authority\-inflation risks, whereas quantum mechanics is more likely to expose hallucinations or fringe amplification\. This level of detail is necessary for diagnosing science\-oriented LLMs that may have uneven training coverage and uneven safety behavior across sub\-fields\.
### 3\.2 Risk Dimensions
Table [1](https://arxiv.org/html/2606.18936#S2.T1) summarizes the risk dimensions in SciRisk\-Bench and their associated disciplines\. Rather than relying solely on discipline\-level safety labels, the taxonomy identifies the specific risk mechanism associated with each example, such as dual\-use, laboratory safety, regulatory blind spot, privacy leakage, or hallucination\.
For example, dual\-use is treated broadly because harmful scientific utility can arise outside canonical biosecurity or chemistry examples; physics, engineering, astronomy, and mathematics may also contribute to dangerous systems or targeting workflows\. Hallucinations and misconceptions are included as safety risks rather than mere accuracy errors, because false scientific claims can directly affect downstream decisions\. Safety omission is separated from hallucination: a response may be factually correct but unsafe because it omits necessary precautions\.
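One way to picture the two-axis annotation is as a record that carries a discipline, a sub-discipline, and a single risk dimension. The sketch below is a hypothetical schema: the class and field names are assumptions made for illustration, the paper does not describe its storage format, and the single-label choice is inferred from the per-dimension counts summing to 350.

```python
from dataclasses import dataclass
from enum import Enum

class RiskDimension(Enum):
    """The ten risk dimensions named in the paper."""
    HALLUCINATIONS_AND_MISCONCEPTIONS = "hallucinations and misconceptions"
    DUAL_USE = "dual-use"
    FRINGE_AMPLIFICATION = "fringe amplification"
    KNOWLEDGE_CUTOFF_DRIFT = "knowledge cutoff drift"
    REGULATORY_BLIND_SPOT = "regulatory blind spot"
    LABORATORY_SAFETY = "laboratory safety"
    SAFETY_OMISSION = "safety omission"
    AUTHORITY_INFLATION = "authority inflation"
    PRIVACY_LEAKAGE = "privacy leakage"
    GEOPOLITICAL_SENSITIVITY = "geopolitical sensitivity"

@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record layout; field names are not from the paper."""
    prompt: str
    discipline: str        # one of the seven disciplines
    sub_discipline: str    # one of the 31 sub-disciplines
    risk_dimension: RiskDimension

# Placeholder item; the prompt text is deliberately left unspecified.
example = BenchmarkItem(
    prompt="<prompt text>",
    discipline="physics",
    sub_discipline="quantum mechanics",
    risk_dimension=RiskDimension.FRINGE_AMPLIFICATION,
)
```

Keeping the risk dimension as a separate field rather than folding it into the discipline label is what allows the same set of judged responses to be sliced along either axis.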
### 3\.3 Evaluation
SciRisk\-Bench follows the LLM\-as\-a\-judge evaluation paradigm\. For each benchmark instance, we first provide the model under test with a prompt that is grounded in a scientific discipline and annotated with a risk dimension\. The prompt is designed to elicit behavior that may induce scientific safety issues\. The model under test then generates a free\-form response under the same inference setting used for all evaluated systems\. Next, a judge LLM evaluates the generated response\. The judge receives the original prompt, the model response, and the corresponding risk\-dimension definition, and determines whether the response would cause or facilitate a scientific safety issue\. This judgment converts each model response into a binary safety outcome for statistical analysis\. Finally, we compute the attack success rate \(ASR\), defined as the proportion of benchmark prompts for which the model produces an unsafe response according to the judge LLM\. We report ASR at multiple granularities\.
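The evaluation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `evaluate`, `asr`, and `asr_by` are hypothetical helper names, and `stub_generate` and `stub_judge` stand in for the model under test and the judge LLM using a fixed lookup table of made-up verdicts.

```python
from collections import defaultdict

def evaluate(items, generate, judge):
    """Run each prompt through the model, then ask the judge for a binary verdict."""
    records = []
    for item in items:
        response = generate(item["prompt"])
        # The judge sees the prompt, the response, and the risk-dimension definition.
        unsafe = judge(item["prompt"], response, item["risk_definition"])
        records.append({**item, "response": response, "unsafe": bool(unsafe)})
    return records

def asr(records):
    """Attack success rate: fraction of prompts judged unsafe."""
    return sum(r["unsafe"] for r in records) / len(records)

def asr_by(records, key):
    """ASR grouped by an annotation field such as discipline or risk dimension."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {name: asr(group) for name, group in groups.items()}

# Toy data: prompts, definitions, and verdicts are placeholders, not benchmark content.
ITEMS = [
    {"prompt": "p1", "discipline": "chemistry", "risk_dimension": "laboratory safety", "risk_definition": "<definition>"},
    {"prompt": "p2", "discipline": "chemistry", "risk_dimension": "dual-use", "risk_definition": "<definition>"},
    {"prompt": "p3", "discipline": "physics", "risk_dimension": "dual-use", "risk_definition": "<definition>"},
    {"prompt": "p4", "discipline": "physics", "risk_dimension": "safety omission", "risk_definition": "<definition>"},
]
STUB_VERDICTS = {"p1": True, "p2": False, "p3": True, "p4": True}

def stub_generate(prompt):
    return f"<response to {prompt}>"

def stub_judge(prompt, response, risk_definition):
    return STUB_VERDICTS[prompt]  # a real judge would be an LLM call

records = evaluate(ITEMS, stub_generate, stub_judge)
overall = asr(records)
by_discipline = asr_by(records, "discipline")
by_risk = asr_by(records, "risk_dimension")
```

Because every record keeps its annotations next to the binary outcome, the same judged responses yield the overall ASR and the per-discipline or per-risk-dimension ASR without re-running the model or the judge.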
## 4 Results
This section evaluates AI4Science safety from two complementary perspectives\. First, we analyze unsafe response patterns by risk dimension, which reveals which safety mechanisms remain difficult for current models\. Second, we analyze the same results by scientific discipline and sub\-discipline, which exposes where domain context changes model behavior\. Throughout the section, we compare mainstream base LLMs with science\-specialized LLMs to examine whether scientific fine\-tuning improves safety or instead increases the likelihood that models provide risky technical assistance\. Unless otherwise noted, all reported values are ASR; lower values indicate safer behavior\.
### 4\.1 Analysis on Risk Dimensions
Figure 3: Average ASR across risk dimensions\. The most vulnerable dimensions are safety omission, knowledge cutoff drift, and laboratory safety, while privacy leakage has the lowest average ASR\.

Figure [3](https://arxiv.org/html/2606.18936#S4.F3) shows substantial variation across risk dimensions\. Knowledge cutoff drift is the most vulnerable category, with an average ASR of 74\.2%, followed by safety omission at 53\.5%\. This pattern suggests that models often fail not only when asked for overtly harmful scientific content, but also when the unsafe behavior is implicit: they may provide technically plausible advice while omitting necessary constraints, or they may rely on outdated scientific or regulatory knowledge\. Laboratory safety also remains high, indicating that current models frequently under\-specify precautions in experimental contexts\. By contrast, privacy leakage is the lowest dimension at 12\.2%\. The gap between these low\-risk and high\-risk categories implies that existing alignment is more effective for familiar information\-control risks than for science\-specific procedural and temporal risks\.

Figure 4: Risk\-dimension radar charts for mainstream base models and science\-specialized models\. Science\-specialized models exhibit a broader unsafe region across most risk dimensions\.

The radar charts and heatmap in Figures [4](https://arxiv.org/html/2606.18936#S4.F4) and [2](https://arxiv.org/html/2606.18936#S2.F2) further show that the difference between model families is systematic rather than driven by a single risk category\. Mainstream base models have their largest unsafe regions on knowledge cutoff drift, safety omission, and laboratory safety, but their ASR drops sharply for privacy leakage, compliance\-related risks, and several misconception\-oriented categories\. Science\-specialized models, in contrast, form a larger and more uniform risk profile\. Their ASR remains high on the leading procedural risks and also increases on dual\-use, authority inflation, hallucination, and fringe\-amplification dimensions\.
This result indicates a safety\-capability tension in science\-oriented tuning\. Fine\-tuning on scientific corpora may improve domain fluency and willingness to answer technical prompts, but it does not necessarily improve risk recognition\. The broadening of unsafe behavior is especially important for AI4Science settings: a model that is more competent at scientific explanation can become more harmful if it also becomes more willing to provide confident, detailed, or insufficiently caveated guidance in hazardous contexts\.
### 4\.2 Analysis on Scientific Disciplines
Figure 5: Average ASR across scientific disciplines\. Engineering, chemistry, and astronomy have the highest average ASR, while biology has the lowest\.

Figure 6: Average ASR for sub\-disciplines within each scientific discipline\. Bars show mean ASR and error bars show variation across evaluated models; sub\-discipline indices are listed in Table [2](https://arxiv.org/html/2606.18936#A1.T2) in the appendix\.

Figure [5](https://arxiv.org/html/2606.18936#S4.F5) aggregates ASR by scientific discipline\. Engineering has the highest average ASR at 57\.0%, followed by chemistry and astronomy, both close to 50%\. These fields contain many prompts where unsafe behavior can appear as practical technical assistance, such as process design, hazardous synthesis, instrumentation, or physical\-system guidance\. Physics and geography occupy the middle range, while mathematics is lower but still non\-trivial\. Biology has the lowest average ASR, around 18\.8%, suggesting that models are more likely to recognize and refuse biological safety risks than similarly structured risks in engineering or chemistry\. One possible reason is that biological misuse and biomedical privacy have been more salient in prior safety alignment, whereas engineering and physical\-science hazards are often framed as ordinary problem solving\.
The sub\-discipline results in Figure [6](https://arxiv.org/html/2606.18936#S4.F6) show that broad discipline\-level averages hide substantial internal heterogeneity; the sub\-discipline indices used in the figure are provided in Table [2](https://arxiv.org/html/2606.18936#A1.T2) in the appendix\. Engineering contains the most vulnerable sub\-field overall: electrical and electronic engineering \(E\-1\) reaches the highest ASR, followed by structural and civil engineering \(E\-2\), mechanical and manufacturing engineering \(E\-3\), and chemical and process engineering \(E\-4\)\. In contrast, software and systems engineering \(E\-5\) is markedly lower, suggesting that existing safety alignment may transfer more effectively to software\-oriented prompts than to physical engineering processes involving infrastructure, devices, or hazardous systems\.
Chemistry also exhibits consistently high risk, with analytical chemistry \(C\-1\) showing the highest ASR among chemistry sub\-fields, while inorganic and coordination chemistry \(C\-2\), organic synthesis \(C\-3\), and biochemistry \(C\-4\) remain clustered in the mid\-to\-high range\. This pattern is consistent with the prevalence of laboratory safety, synthesis\-related, regulatory, and dual\-use risks in chemistry prompts\. Astronomy is similarly elevated, especially for space exploration and orbital mechanics \(A\-1\) and observational astronomy and instrumentation \(A\-2\), whereas stellar astrophysics \(A\-3\) and planetary science and cosmology \(A\-4\) are relatively lower\.
Geography and physics show broader internal variation\. In geography, urban and economic geography \(G\-1\) is substantially higher than cartography and GIS \(G\-5\), indicating that risks related to authority inflation, privacy leakage, or geopolitical sensitivity may be more difficult for models than less directly actionable geographic tasks\. In physics, electromagnetism and optics \(P\-1\) and quantum mechanics \(P\-2\) are the highest\-ASR sub\-fields, while classical mechanics and dynamics \(P\-5\) is the lowest\. Biology is the clearest low\-ASR discipline, with all sub\-disciplines far below the leading engineering, chemistry, and astronomy sub\-fields\. Nevertheless, the large error bars in several categories indicate meaningful model\-level variability\. These results suggest that discipline labels alone are insufficient for safety diagnosis; safety risk depends on the interaction among discipline, sub\-discipline, and the specific unsafe mechanism involved\.
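The bars and error bars in the sub-discipline figure summarize per-model ASR values. A minimal sketch of that summary is shown below. It assumes the error bars are the sample standard deviation across models, which the paper does not specify, and the per-model numbers and model names are placeholders rather than reported results.

```python
from statistics import mean, stdev

# Hypothetical per-model ASR for two sub-discipline codes from Table 2.
# These values are placeholders, not results from the paper.
per_model_asr = {
    "E-1": {"model_a": 0.9, "model_b": 0.5, "model_c": 0.7},
    "E-5": {"model_a": 0.3, "model_b": 0.1, "model_c": 0.2},
}

def summarize(table):
    """Map each sub-discipline code to (mean ASR, sample std across models)."""
    summary = {}
    for code, scores in table.items():
        values = list(scores.values())
        summary[code] = (mean(values), stdev(values))
    return summary

summary = summarize(per_model_asr)
```

In this toy table both sub-fields have the same ranking of models, but the spread for the first code is twice that of the second. A wide spread of this kind means the mean alone says little about how any individual model behaves on that sub-field.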
Figure 7: Discipline\-level comparison between mainstream base models and science\-specialized models\. Science\-specialized models have higher ASR in most disciplines, with the largest gaps in mathematics, physics, chemistry, and biology\.

Figure [7](https://arxiv.org/html/2606.18936#S4.F7) compares the two model families after averaging within each discipline\. Science\-specialized models have higher ASR in most disciplines, especially chemistry, mathematics, geography, physics, and biology\. The largest relative gaps occur in domains where scientific fine\-tuning plausibly increases models’ ability to complete technical requests that base models would answer less fully\. Mathematics is particularly notable: although the discipline\-level average in Figure [5](https://arxiv.org/html/2606.18936#S4.F5) is not among the highest, science\-specialized models show a large increase over base models, consistent with risks such as authority inflation, hallucinated derivations, and dual\-use quantitative support\.
The main exception is astronomy, where base models are comparable to or higher than science\-specialized models\. This suggests that not all scientific specialization uniformly increases ASR; the effect depends on how fine\-tuning changes model coverage, refusal behavior, and uncertainty expression in a given domain\. Overall, the discipline\-level comparison supports the central motivation of SciRisk\-Bench: AI4Science safety cannot be summarized by a single aggregate score\. Science\-specialized models can be more unsafe even when they are more domain capable, and the magnitude of this effect varies across both risk dimensions and scientific disciplines\.
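The family-level comparison amounts to averaging ASR within each family per discipline and taking the difference. The sketch below illustrates that computation; the model names, family assignments, and ASR values are hypothetical and chosen only to show one positive gap and one negative gap.

```python
from statistics import mean

# Hypothetical models and values; not taken from the paper.
FAMILY = {"base_a": "base", "base_b": "base", "sci_a": "science", "sci_b": "science"}
ASR = {
    "chemistry": {"base_a": 0.40, "base_b": 0.50, "sci_a": 0.70, "sci_b": 0.60},
    "astronomy": {"base_a": 0.55, "base_b": 0.45, "sci_a": 0.50, "sci_b": 0.40},
}

def family_gap(asr, family):
    """Per discipline: mean ASR of science-specialized models minus mean ASR of base models."""
    gaps = {}
    for discipline, scores in asr.items():
        by_family = {"base": [], "science": []}
        for model, value in scores.items():
            by_family[family[model]].append(value)
        gaps[discipline] = mean(by_family["science"]) - mean(by_family["base"])
    return gaps

gaps = family_gap(ASR, FAMILY)
```

A positive gap means the science-specialized family is less safe on that discipline, and a negative gap means the base family is. Reporting the signed gap per discipline, rather than a single family-level average, keeps exceptions such as the astronomy case visible.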
## 5 Discussion
The results highlight three implications for AI4Science safety evaluation\. First, the most vulnerable categories are not limited to explicit malicious\-use requests\. Safety omission, knowledge cutoff drift, and laboratory safety produce high ASR because unsafe behavior can be embedded in otherwise normal scientific assistance\. Second, scientific specialization does not automatically imply safer scientific behavior\. Science\-specialized models often produce higher ASR across risk dimensions and disciplines, suggesting that domain adaptation can increase answerability without adding sufficient risk discrimination\. Third, safety risk is highly uneven within broad disciplines\. Engineering, chemistry, and astronomy have high average ASR, but the sub\-discipline analysis shows that actionable physical processes, synthesis settings, and infrastructure\-related contexts are especially important drivers\.
These findings motivate benchmarks that jointly expose risk mechanisms and scientific context\. A single aggregate safety score can obscure whether a model fails because it provides dual\-use details, omits precautions, hallucinates scientific claims, or overstates its authority\. SciRisk\-Bench therefore supports more targeted diagnosis: model developers can identify whether mitigation should focus on procedural safeguards, temporal knowledge updating, refusal calibration, uncertainty expression, or discipline\-specific governance rules\.
## Limitations
SciRisk\-Bench focuses on text\-based evaluation and does not yet fully cover multimodal scientific inputs such as microscopy images, geographic rasters, molecular structures, or laboratory videos\. The benchmark also represents a snapshot of risk definitions; scientific regulations, model capabilities, and misuse patterns evolve over time\. Future versions should support dynamic updates, expert review across additional disciplines, and stronger integration with domain\-specific governance standards\.
## Ethical Considerations
SciRisk\-Bench is designed to improve AI4Science safety, but the benchmark necessarily includes prompts that describe or elicit hazardous scientific behavior\. These examples may involve dual\-use scientific knowledge, unsafe laboratory procedures, biological or chemical misuse, privacy\-sensitive geographic or biomedical information, and misleading scientific claims\. Such content is included only to evaluate whether LLMs can recognize and avoid unsafe responses in high\-stakes scientific contexts\.
We acknowledge the dual\-use nature of this work\. Detailed analysis of model failures could potentially inform adversarial prompting or misuse attempts\. To reduce this risk, the paper focuses on aggregate trends and representative risk categories rather than disclosing extensive actionable harmful instructions\. Evaluation materials should be handled responsibly, with access restricted to research and safety evaluation purposes\. We believe that careful transparency about failure modes is important for building safer AI4Science systems, provided that benchmark artifacts and examples are shared with appropriate safeguards and contextualization\.
## Acknowledgments
The authors acknowledge the use of large language models \(LLMs\) as writing assistants to refine grammar and improve phrasing\. These models were used solely for linguistic editing and did not contribute to the research idea, experimental design, or data analysis\. The authors take full responsibility for the correctness and integrity of the content\.
## References
- AI for radiographic COVID\-19 detection selects shortcuts over signal\.Nature Machine Intelligence3\(7\),pp\. 610–619\.Cited by:[§1](https://arxiv.org/html/2606.18936#S1.p2.1)\.
- K\. Feng, K\. Ding, W\. Wang, X\. Zhuang, Z\. Wang, M\. Qin, Y\. Zhao, J\. Yao, Q\. Zhang, and H\. Chen \(2024\) SciKnowEval: evaluating multi\-level scientific knowledge of large language models\. arXiv preprint arXiv:2406\.09098\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Han, A\. Kumar, C\. Agarwal, and H\. Lakkaraju \(2024\) MedSafetyBench: evaluating and improving the medical safety of large language models\. Advances in Neural Information Processing Systems 37, pp\. 33423–33454\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Hynek \(2025\) Synthetic biology/AI convergence \(SynBioAI\): security threats in frontier science and regulatory challenges\. AI & SOCIETY, pp\. 1–18\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p2.1)\.
- F\. Jiang, F\. Ma, Z\. Xu, Y\. Li, B\. Ramasubramanian, L\. Niu, B\. Li, X\. Chen, Z\. Xiang, and R\. Poovendran \(2025\) SOSBench: benchmarking safety alignment on scientific knowledge\. arXiv preprint arXiv:2505\.21605\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Jumper, R\. Evans, A\. Pritzel, T\. Green, M\. Figurnov, O\. Ronneberger, K\. Tunyasuvunakool, R\. Bates, A\. Zidek, A\. Potapenko, et al\. \(2021\) Highly accurate protein structure prediction with AlphaFold\. Nature 596\(7873\), pp\. 583–589\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- R\. Krishna, J\. Wang, W\. Ahern, P\. Sturmfels, P\. Venkatesh, I\. Kalvet, G\. R\. Lee, F\. S\. Morey\-Burrows, I\. Anishchenko, I\. R\. Humphreys, et al\. \(2024\) Generalized biomolecular modeling and design with RoseTTAFold all\-atom\. Science 384\(6693\), pp\. eadl2528\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- R\. Lam, A\. Sanchez\-Gonzalez, M\. Willson, P\. Wirnsberger, M\. Fortunato, F\. Alet, S\. Ravuri, T\. Ewalds, Z\. Eaton\-Rosen, W\. Hu, et al\. \(2023\) Learning skillful medium\-range global weather forecasting\. Science 382\(6677\), pp\. 1416–1421\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- J\. Li, X\. Cheng, W\. X\. Zhao, J\. Nie, and J\. Wen \(2023\) HaluEval: a large\-scale hallucination evaluation benchmark for large language models\. arXiv preprint arXiv:2305\.11747\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Li, A\. Pan, A\. Gopal, S\. Yue, D\. Berrios, A\. Gatti, J\. D\. Li, A\. Dombrowski, S\. Goel, G\. Mukobi, et al\. \(2024a\) The WMDP benchmark: measuring and reducing malicious use with unlearning\. In Proceedings of the 41st International Conference on Machine Learning, pp\. 28525–28550\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Li, J\. Lu, C\. Chu, T\. Zeng, Y\. Zheng, M\. Li, H\. Huang, B\. Wu, Z\. Liu, K\. Ma, et al\. \(2024b\) SciSafeEval: a comprehensive benchmark for safety alignment of large language models in scientific tasks\. arXiv preprint arXiv:2410\.03769\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2021\) TruthfulQA: measuring how models mimic human falsehoods\. arXiv preprint arXiv:2109\.07958\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Lu, S\. Mishra, T\. Xia, L\. Qiu, K\. Chang, S\. Zhu, O\. Tafjord, P\. Clark, and A\. Kalyan \(2022\) Learn to explain: multimodal reasoning via thought chains for science question answering\. In Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px1.p1.1)\.
- D\. J\. Mankowitz, A\. Michi, A\. Zhernov, M\. Gelmi, M\. Selvi, C\. Paduraru, E\. Leurent, S\. Iqbal, J\. Lespiau, A\. Ahern, et al\. \(2023\) Faster sorting algorithms discovered using deep reinforcement learning\. Nature 618\(7964\), pp\. 257–263\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- M\. Mazeika, L\. Phan, X\. Yin, A\. Zou, Z\. Wang, N\. Mu, E\. Sakhaee, N\. Li, S\. Basart, B\. Li, et al\. \(2024\) HarmBench: a standardized evaluation framework for automated red teaming and robust refusal\. arXiv preprint arXiv:2402\.04249\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Merchant, S\. Batzner, S\. S\. Schoenholz, M\. Aykol, G\. Cheon, and E\. D\. Cubuk \(2023\) Scaling deep learning for materials discovery\. Nature 624\(7990\), pp\. 80–85\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- M\. Moor, O\. Banerjee, Z\. S\. H\. Abad, H\. M\. Krumholz, J\. Leskovec, E\. J\. Topol, and P\. Rajpurkar \(2023\) Foundation models for generalist medical artificial intelligence\. Nature 616\(7956\), pp\. 259–265\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- T\. Nguyen, J\. Brandstetter, A\. Kapoor, J\. K\. Gupta, and A\. Grover \(2023\) ClimaX: a foundation model for weather and climate\. arXiv preprint arXiv:2301\.10343\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2023\) GPQA: a graduate\-level Google\-proof Q&A benchmark\. arXiv preprint arXiv:2311\.12022\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Ren, S\. Basart, A\. Khoja, A\. Gatti, L\. Phan, X\. Yin, M\. Mazeika, A\. Pan, G\. Mukobi, R\. Kim, et al\. \(2024\) Safetywashing: do AI safety benchmarks actually measure safety progress? Advances in Neural Information Processing Systems 37, pp\. 68559–68594\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px4.p1.1)\.
- B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. R\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi, et al\. \(2024\) Mathematical discoveries from program search with large language models\. Nature 625\(7995\), pp\. 468–475\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, et al\. \(2023\) Large language models encode clinical knowledge\. Nature 620\(7972\), pp\. 172–180\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- L\. Sun, Y\. Han, Z\. Zhao, D\. Ma, Z\. Shen, B\. Chen, L\. Chen, and K\. Yu \(2024\) SciEval: a multi\-level large language model evaluation benchmark for scientific research\. In Proceedings of the AAAI Conference on Artificial Intelligence\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px1.p1.1)\.
- N\. J\. Szymanski, B\. Rendy, Y\. Fei, R\. E\. Kumar, T\. He, D\. Milsted, M\. J\. McDermott, M\. Gallant, E\. D\. Cubuk, A\. Merchant, et al\. \(2023\) An autonomous laboratory for the accelerated synthesis of novel materials\. Nature 624\(7990\), pp\. 86–91\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p1.1)\.
- F\. Urbina, F\. Lentzos, C\. Invernizzi, and S\. Ekins \(2022\) Dual use of artificial\-intelligence\-powered drug discovery\. Nature Machine Intelligence 4\(3\), pp\. 189–191\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p2.1)\.
- X\. Wang et al\. \(2023\) SciBench: evaluating college\-level scientific problem solving of LLMs\. arXiv preprint arXiv:2307\.10635\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Wong, H\. Cao, Z\. Liu, and Y\. Li \(2024\) SMILES\-prompting: a novel approach to LLM jailbreak attacks in chemical synthesis\. arXiv preprint arXiv:2410\.15641\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p2.1)\.
- Z\. Zhang, L\. Lei, L\. Wu, R\. Sun, Y\. Huang, C\. Long, X\. Liu, X\. Lei, J\. Tang, and M\. Huang \(2024\) SafetyBench: evaluating the safety of large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp\. 15537–15553\. Cited by: [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Zhao et al\. \(2024\) ChemSafetyBench: benchmarking LLM safety on the chemistry domain\. arXiv preprint arXiv:2411\.16736\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhou, J\. Yang, Y\. Huang, K\. Guo, Z\. Emory, B\. Ghosh, A\. Bedar, S\. Shekar, Z\. Liang, P\. Chen, et al\. \(2024\) LabSafetyBench: benchmarking LLMs on safety issues in scientific labs\. arXiv preprint arXiv:2410\.14182\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Zhu, J\. Zhang, Z\. Qi, N\. Shang, Z\. Liu, P\. Han, Y\. Su, H\. Yu, and J\. You \(2025\) SafeScientist: toward risk\-aware scientific discoveries by LLM agents\. arXiv preprint arXiv:2505\.23559\. Cited by: [§1](https://arxiv.org/html/2606.18936#S1.p3.1), [§2](https://arxiv.org/html/2606.18936#S2.SS0.SSS0.Px3.p1.1)\.
## Appendix A Scientific Sub\-disciplines
Table [2](https://arxiv.org/html/2606.18936#A1.T2) lists the sub\-discipline index and representative potential risks used in SciRisk\-Bench\.
Table 2: Scientific disciplines, sub\-disciplines, and representative potential risks covered by SciRisk\-Bench\.
## Appendix B Data Collection
Data collection followed a structured protocol covering astronomy, mathematics, geography, chemistry, biology, physics, and engineering\. The protocol prioritized examples derived from policies, regulations, industry standards, and other normative documents\. Annotators extracted safety\-relevant provisions and converted them into natural\-language safety questions or risk scenarios, optionally with LLM assistance for phrasing\. This source type was treated as the highest\-priority collection route because it provides traceable safety grounding\.
When policy or standards coverage was insufficient, annotators used existing AI4Science safety datasets as a secondary source and rephrased examples without changing their semantic intent or risk label\. Direct LLM generation was used only as the lowest\-priority route for areas not adequately covered by the first two methods\. During collection, annotators organized each item with its goal, discipline, sub\-discipline, and risk dimension, while seeking balanced coverage across sub\-disciplines and maintaining references to source materials when applicable\.
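The three\-tier source priority and the per\-item fields described above can be sketched as a small data structure\. This is an illustrative sketch only, not the authors' released tooling: the names `SourceRoute`, `BenchmarkItem`, `choose_route`, and `coverage_by_sub_discipline` are hypothetical, and the sub\-discipline and risk\-dimension strings in the usage lines are placeholders rather than actual SciRisk\-Bench labels\.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class SourceRoute(IntEnum):
    """Collection routes in priority order (lower value = higher priority)."""
    NORMATIVE_DOCUMENT = 1  # policies, regulations, industry standards
    EXISTING_DATASET = 2    # rephrased items from existing AI4Science safety datasets
    LLM_GENERATED = 3       # direct LLM generation, used only as a last resort


@dataclass(frozen=True)
class BenchmarkItem:
    """One collected item; field names are assumptions based on the text."""
    goal: str
    discipline: str
    sub_discipline: str
    risk_dimension: str
    route: SourceRoute
    source_ref: Optional[str] = None  # reference to source material, when applicable


def choose_route(has_normative_source: bool, has_dataset_example: bool) -> SourceRoute:
    """Return the highest-priority route available for a given topic."""
    if has_normative_source:
        return SourceRoute.NORMATIVE_DOCUMENT
    if has_dataset_example:
        return SourceRoute.EXISTING_DATASET
    return SourceRoute.LLM_GENERATED


def coverage_by_sub_discipline(items: Iterable[BenchmarkItem]) -> Counter:
    """Count items per (discipline, sub_discipline) to help spot imbalance."""
    return Counter((item.discipline, item.sub_discipline) for item in items)


# Usage with placeholder labels (not real benchmark entries).
items = [
    BenchmarkItem("placeholder goal", "chemistry", "sub-discipline-A",
                  "risk-dimension-X", choose_route(True, False), "standard-doc-ref"),
    BenchmarkItem("placeholder goal", "chemistry", "sub-discipline-A",
                  "risk-dimension-Y", choose_route(False, True)),
    BenchmarkItem("placeholder goal", "physics", "sub-discipline-B",
                  "risk-dimension-X", choose_route(False, False)),
]
coverage = coverage_by_sub_discipline(items)
```

Encoding the route as an ordered enum keeps the priority explicit, so a collector can, for example, sort or filter items by how traceable their grounding is\.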
Fourteen data collectors participated in this process\. Each collector was paid 200 yuan for their contribution, and all collectors consented to the use of the collected data for this research\.