Benchmarking Large Language Models for Safety Data Extraction
Summary
This paper benchmarks four large language models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, Llama 3.1-70B) for extracting structured information from Safety Data Sheets, finding that text-based extraction with chain-of-thought prompting yields the highest accuracy (84% by Gemini 1.5 Pro) but no model surpasses the 90% threshold required for reliable industrial deployment.
# Benchmarking Large Language Models for Safety Data Extraction
Source: [https://arxiv.org/html/2606.11204](https://arxiv.org/html/2606.11204)
Thomas Bayer \[2\]
\[1\] SAP SE, Germany; \[2\] Institute for Digital Transformation, Ravensburg\-Weingarten University\. ORCID: 0009\-0007\-4373\-7933
###### Abstract
Accurate extraction of structured information from Safety Data Sheets \(SDS\) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule\-based methods\. This study benchmarks state\-of\-the\-art Large Language Models \(LLMs\) for automated SDS data extraction, comparing text\-based and multimodal processing pipelines\. We systematically evaluate four models—Gemini 1\.5 Pro, GPT\-4o, Claude 3\.7 Sonnet, and Llama 3\.1\-70B—across three prompting strategies: zero\-shot, few\-shot, and chain\-of\-thought\. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields\. Results show that text\-based extraction consistently outperforms multimodal processing across all metrics\. Gemini 1\.5 Pro combined with a Chain\-of\-Thought prompt achieved the highest accuracy \(84 %\), outperforming GPT\-4o \(81 %\) and Claude 3\.7 Sonnet \(79 %\)\. However, no model surpassed the 90 % accuracy threshold commonly required for reliable real\-world deployment\. These findings indicate that general\-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task\-specific fine\-tuning\. Future research should focus on domain\-adapted training, model calibration, and the integration of Human\-in\-the\-Loop verification to ensure safety\-critical reliability\.
###### keywords:
Large Language Models, Safety Data Sheets, Information Extraction, Benchmark Evaluation, Prompt Engineering
This manuscript is currently under review at Applied Intelligence\.
## 1 Introduction
Safety Data Sheets \(SDS\) are critical regulatory documents in industrial environments, providing authoritative information on hazardous substances, chemical compositions, and protective measures\. They form the basis for compliance with international standards such as the Globally Harmonized System \(GHS\) and the European REACH and CLP regulations\[EuropeanCommission2020,OccupationalSafetyHealth2012\]\. Despite this standardization, SDSs vary considerably in structure, terminology, and completeness across manufacturers\. As a result, manual extraction of relevant fields remains costly, time\-consuming, and error\-prone, limiting scalability and increasing the risk of safety\-critical misinterpretation\[Khan2025,Matlhare2024\]\. An example of an SDS is given in Fig\. [1](https://arxiv.org/html/2606.11204#S1.F1)\.
Figure 1: Example page from the Safety Data Sheet \(SDS\) for SULFURIC ACID 1\-51%, showing annotated regions corresponding to extracted data fields\. Source: Univar Solutions USA, Inc\.\[univar2023sds\]
Recent advances in large language models \(LLMs\) offer promising opportunities to automate SDS data extraction\. Transformer\-based LLMs\[10\.5555/3295222\.3295349\] exhibit strong semantic and contextual reasoning capabilities and can process semi\-structured technical documents beyond the capabilities of traditional rule\-based or OCR\-centric systems\[Zhang2024,Dagdelen2024\]\. Their in\-context learning capabilities allow task specialization without retraining, making them attractive for industrial applications where document heterogeneity is high\.
However, their suitability for safety\-critical information extraction remains largely unexplored\. Existing studies provide limited evidence on how reliably different LLMs extract structured SDS fields, how text\-based and multimodal processing compare in performance, and how prompting techniques such as Zero\-Shot, Few\-Shot, and Chain\-of\-Thought influence extraction robustness\[Opitz2024,Vatsal2024,Sahoo2025,Schulhoff2025,Cheng2025\]\. A systematic, controlled benchmark addressing these aspects has not yet been conducted\.
This paper closes this gap by benchmarking four state\-of\-the\-art LLMs across text\-based and multimodal pipelines: Gemini 1\.5 Pro, GPT\-4o, Claude 3\.7 Sonnet, and Llama 3\.1\-70B\. Using a schema\-constrained protocol, we compare accuracy, latency, and token\-normalized cost, and analyze how prompting strategies influence extraction quality\. The remainder is structured as follows: Section 2 reviews related work; Section 3 details the methodology; Section 4 presents results and discussion; Section 5 concludes with implications and future work\.
## 2 Related Work
The automatic extraction of structured information from Safety Data Sheets \(SDS\) has been addressed using rule\-based, machine learning, and hybrid approaches\. Early large\-scale systems rely on hybrid pipelines combining OCR, pattern matching, and neural networks to achieve robust field extraction for regulatory compliance\[fenton2021doceng,fenton2023acs,Khan2025\]\. Recent machine learning approaches focus on high\-precision extraction of standardized SDS fields such as product identifiers, suppliers, and revision dates\[khan2025heliyon\], as well as detailed composition data including CAS numbers and concentration values\[suman2024scirep\]\.
With the emergence of large language models \(LLMs\), generative approaches have been explored for SDS and technical document understanding, showing strong performance in flexible text and table extraction\[pekel2025gpt,moreira2024knime\]\. Survey work confirms the growing dominance of LLMs for chemical text mining while emphasizing the need for domain adaptation and validation\[schilling2025csr\]\. In addition, benchmark datasets such as ChemTEB highlight the challenges of SDS\-specific language and demonstrate the benefits of domain\-specialized embeddings\[mansouri2024chemteb\]\. Complementary to statistical methods, ontology\-based representations using SHACL and SKOS enable semantic validation and integration of extracted SDS data\[lu2025shacl\]\.
## 3 Methodology
We present two variants of an SDS data extraction pipeline, a text\-based and an image\-based approach, designed to improve the efficiency and accuracy of structured information extraction from PDF documents\. We further introduce a systematic evaluation framework for benchmarking four state\-of\-the\-art LLMs across text\-only and multimodal extraction pipelines\. The framework explicitly disentangles the effects of model architecture, preprocessing strategy, and prompting methodology on extraction performance, while controlling for dataset composition and output schema consistency\. All configurations are evaluated on identical SDS documents using standardized metrics for accuracy, latency, and computational cost\. The following sections describe the methodological design of each approach, highlighting their respective strengths, limitations, and technical trade\-offs\.
### 3\.1 Prompt Design
Prompt engineering plays a central role in controlling extraction behavior\. A unified template combines role specification, task description, schema definition, extraction rules, and strict formatting constraints\. We evaluate three prompting strategies \(Zero\-Shot, Few\-Shot, and Chain\-of\-Thought\); cf\. Table [1](https://arxiv.org/html/2606.11204#S3.T1)\. This setup isolates how prompting techniques influence accuracy, false positives, and extraction stability across models and modalities\.
Table 1: Overview of prompting strategies evaluated in the extraction pipeline\.
Each prompt combines a system instruction \(task definition, role, output format, error handling\), which depends on the prompting strategy, with a JSON schema specifying the structured output of the LLM\. An example of a Zero\-Shot prompt is given in Fig\. [2](https://arxiv.org/html/2606.11204#S3.F2), and a JSON schema for the First Aid measures of the SDS SULFURIC ACID 1\-51% shown in Fig\. [1](https://arxiv.org/html/2606.11204#S1.F1) can be found in Fig\. [3](https://arxiv.org/html/2606.11204#S3.F3)\.
Figure 2: Zero\-shot prompt\. The prompt specifies the extraction task, input format, schema constraints, and output requirements\.

    # Role
    You are a highly capable and precise data extraction specialist for Safety Data Sheets (SDS).

    # Task
    Extract structured data from the provided Safety Data Sheet (SDS), strictly following the JSON schema below.

    # Schema
    {schema}

    # Input
    You will receive:
    - A Safety Data Sheet (SDS) document, provided either directly as a PDF file or as plain text.

    # Extraction Rules
    - Extract ONLY information from the correct SDS section. DO NOT use information from other sections.
    - Match each value precisely to the fields defined in the schema.
    - If information is explicitly missing:
      - Use "null" for scalar fields (strings, numbers, booleans).
    - Convert all Unicode characters to their closest ASCII equivalent
    - Always use double quotes for JSON keys and values.

    # Output Format
    - Output must be ONLY the valid JSON object.
    - Start your output with {{ and end with }}.
    - DO NOT include markdown, backticks, explanations, schema repetition, comments, or additional text.

Figure 3: Example JSON schema structure for the First Aid section of the SDS SULFURIC ACID 1\-51%\.

    {
      "type": "object",
      "properties": {
        "World_First_Aid_Measures": {
          "type": "object",
          "properties": {
            "First_Aid_Measures": {
              "type": "object",
              "properties": {
                "General_Information": {
                  "type": "array",
                  "items": { "type": "string" }
                },
                ...,
                "Protection_of_First_Aid_Responders": {
                  "type": "array",
                  "items": { "type": "string" }
                }
              }
            },
            ...,
            "treatments": {
              "type": "array",
              "items": { "type": "string" }
            }
          }
        }
      }
    }

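The `{schema}` placeholder and the doubled braces in the prompt of Fig\. 2 suggest a Python `str.format`\-style template, although the paper does not state how the template is filled\. The following is a minimal sketch under that assumption; the template is abridged, and `render_prompt` and `example_schema` are hypothetical names introduced for illustration only\.

```python
import json

# Abridged version of the zero-shot template from Fig. 2.
# Assumption: the template is filled with str.format, so "{{" and "}}"
# are escapes that render as single literal braces.
PROMPT_TEMPLATE = (
    "# Task\n"
    "Extract structured data from the provided Safety Data Sheet (SDS), "
    "strictly following the JSON schema below.\n\n"
    "# Schema\n"
    "{schema}\n\n"
    "# Output Format\n"
    "- Output must be ONLY the valid JSON object.\n"
    "- Start your output with {{ and end with }}.\n"
)


def render_prompt(schema: dict) -> str:
    """Insert the JSON schema for one SDS section into the template."""
    return PROMPT_TEMPLATE.format(schema=json.dumps(schema, indent=2))


# Hypothetical single-field schema, used only for illustration.
example_schema = {
    "type": "object",
    "properties": {
        "General_Information": {"type": "array", "items": {"type": "string"}}
    },
}
prompt = render_prompt(example_schema)
```

Under this reading, the braces inside the serialized schema need no escaping, because `str.format` only interprets braces in the template itself and not in the substituted value\.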
### 3\.2 Data Processing Pipeline
The first approach applies a PDF preprocessing pipeline in which PDF documents are transformed into a structured Markdown representation\. This intermediate format enables subsequent text\-based information extraction using LLMs\. The second approach leverages multimodal LLMs, allowing the system to directly process and interpret textual content from the original PDF without prior format conversion\.
Both approaches produce structured metadata and a JSON document containing the extracted fields\. Their objective is to streamline the extraction workflow while addressing the architectural and operational characteristics of different AI model classes\.
The extraction workflow consists of five sequential stages that transform raw SDS documents into structured JSON outputs while maintaining reproducibility across all configurations\.
1. Input and Preprocessing: For text\-based extraction, native PDF text is extracted using PyMuPDF4LLM and converted to Markdown, preserving structural elements \(headings, lists, section boundaries\)\. For multimodal extraction, PDFs are passed directly to vision\-capable models via provider\-specific APIs: as binary files \(Claude 3\.7 Sonnet\), cloud URIs \(Gemini 1\.5 Pro\), or base64\-encoded JPEG images \(GPT\-4o\)\.
2. Prompt Generation: Prompt engineering constitutes a central control mechanism for guiding extraction behavior\. A standardized prompt template is employed across all experimental settings\. This template includes role definition, task specification, schema declaration, extraction constraints, and strict output formatting requirements\. Details can be found in Section [3\.1](https://arxiv.org/html/2606.11204#S3.SS1)\.
3. Extraction: The prompt and preprocessed document are submitted to the LLM API\. Text\-based requests route through SAP AI Core’s Orchestration Service; multimodal requests call provider endpoints directly\. Processing time is logged for latency measurement\.
4. Output Handling and Post\-processing: The LLM returns a structured JSON object conforming to the predefined schema together with metadata including input/output token counts and processing timestamps\. The response is parsed to extract valid JSON content\. Post\-processing includes removal of Markdown fences, schema validation, and Unicode normalization\. The cleaned output and associated metadata are stored as uniquely identified files to enable subsequent evaluation and cost computation\.
5. Data Processing: The returned JSON undergoes field\-by\-field comparison against manually validated ground truth\. For each field, a binary match indicator \(true/false\) is computed and aggregated at the section level to calculate accuracy\. Results are stored per SDS document, enabling computation of per\-document accuracy across all extracted sections\. The results are then aggregated across all ten SDS documents to compute the final accuracy for each model–prompt–method configuration\. Token metadata are used to calculate per\-document costs based on provider\-specific pricing\. Finally, a normalized cost function combines accuracy \(0\.7 weight\), processing time \(0\.2 weight\), and cost \(0\.1 weight\) into a unified performance score for systematic comparison across all 21 configurations\.
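Stages 4 and 5 can be sketched as follows\. This is an illustrative reading and not the authors' implementation: fence removal by regular expression, NFKD\-based ASCII folding, path\-based flattening, and exact\-match comparison are all assumptions, and every function name is hypothetical\.

```python
import json
import re
import unicodedata

FENCE = "`" * 3  # Markdown code fence marker


def parse_model_output(raw: str) -> dict:
    """Strip Markdown code fences, if any, and parse the remaining JSON."""
    text = raw.strip()
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)  # opening fence
    text = re.sub(r"\s*" + FENCE + r"$", "", text)           # closing fence
    return json.loads(text)


def to_ascii(value: str) -> str:
    """Map Unicode text to a close ASCII form (accents are dropped)."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def flatten(node, path=""):
    """Turn nested JSON into {path: leaf} pairs; lists are kept as leaves."""
    if isinstance(node, dict):
        flat = {}
        for key, child in node.items():
            flat.update(flatten(child, f"{path}/{key}"))
        return flat
    if isinstance(node, list):
        return {path: [to_ascii(v) if isinstance(v, str) else v for v in node]}
    return {path: to_ascii(node) if isinstance(node, str) else node}


def compare_fields(predicted: dict, truth: dict) -> dict:
    """Binary match indicator per ground-truth field (exact match assumed)."""
    pred_flat, truth_flat = flatten(predicted), flatten(truth)
    return {p: pred_flat.get(p) == v for p, v in truth_flat.items()}


# Toy response and ground truth, for illustration only.
raw = FENCE + 'json\n{"Symptoms": ["Irritation", "Nausea"], "Treatment": ["none"]}\n' + FENCE
truth = {"Symptoms": ["Irritation", "Nausea"], "Treatment": []}
matches = compare_fields(parse_model_output(raw), truth)
accuracy = sum(matches.values()) / len(matches)
```

In this toy case the model fills `Treatment` although the ground truth leaves it empty, so one of the two fields fails to match\. Exact matching is the strictest possible choice; a tolerant comparison for free\-text values would raise the measured accuracy\.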
The extraction schema covers a wide range of SDS fields across multiple information types, including textual, numeric, tabular, and graphical elements\. SDS sections introduce different structural patterns, from simple key–value pairs \(e\.g\., product identifiers or signal word\) to nested lists describing chemical compositions, exposure limits, and regulatory classifications \(Figure [1](https://arxiv.org/html/2606.11204#S1.F1)\)\.
Many fields are semantically dense or multi\-part, such as tabular data that intermixes quantitative values, units, and qualifiers within a single cell\. Examples include concentration ranges, exposure thresholds, and transport information tables that encode several regulatory systems simultaneously\. These tabular structures required consistent normalization during parsing to preserve relational meaning across columns\. Other elements, such as GHS hazard pictograms, present an additional layer of complexity because they are rendered as images rather than text\.
Beyond these special cases, many SDS fields exhibit free\-text variability \(long procedural instructions, embedded lists, and incomplete sentences\), which challenges deterministic schema alignment\. Certain sections, such as “First Aid Measures” or “Handling and Storage,” frequently contain semi\-structured instructions with conditional clauses and implicit references\. To handle this, the extraction framework enforced strict JSON conformity while allowing for natural\-language variation in values\.
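The idea of strict structure with free\-form values can be illustrated with a small conformity check\. The paper does not name the validator it uses, so the sketch below is an assumption: it covers only the `object`, `array`, and `string` types that appear in Fig\. 3, and `conforms` and `first_aid_schema` are hypothetical names\. A full JSON Schema validator would be the natural choice in practice\.

```python
def conforms(value, schema: dict) -> bool:
    """Check a value against a schema limited to object, array, and string types."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Reject keys the schema does not declare; missing keys are tolerated here.
        if any(key not in props for key in value):
            return False
        return all(conforms(value[k], props[k]) for k in value)
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(item, schema.get("items", {})) for item in value
        )
    if kind == "string":
        return isinstance(value, str)  # any wording is accepted
    return True  # types outside this sketch are not checked


# Two fields taken from the First Aid example; the schema is abridged.
first_aid_schema = {
    "type": "object",
    "properties": {
        "General_Information": {"type": "array", "items": {"type": "string"}},
        "Following_Inhalation": {"type": "array", "items": {"type": "string"}},
    },
}
ok = conforms({"General_Information": ["Take off contaminated clothing."]}, first_aid_schema)
bad = conforms({"General_Information": "Take off contaminated clothing."}, first_aid_schema)
```

The second call fails because the value is a bare string where the schema requires an array of strings, while the wording of each string is never constrained\.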
Overall, the benchmark encompasses 235 fields, covering all major SDS sections while explicitly including structured, tabular, and visual information\. This heterogeneity ensures that the benchmark reflects realistic industrial complexity and tests model robustness across multiple data modalities\.
### 3\.3 Evaluation
We evaluate on a curated set of ten SDS, comprising approximately 50,000 labeled fields\. The documents originate from the public ChemicalSafety\.com database and represent a diverse range of manufacturers, including BASF, Sigma\-Aldrich, and Merck\. Only recent revisions were selected to ensure alignment with current GHS and REACH/CLP standards\. Each SDS includes ten of the sixteen standardized sections commonly used in industrial workflows\. Figure [1](https://arxiv.org/html/2606.11204#S1.F1) illustrates a representative SDS page with annotated extraction regions\. An example of the fields extracted from the SDS SULFURIC ACID 1\-51% of Fig\. [1](https://arxiv.org/html/2606.11204#S1.F1), following the JSON schema outlined in Fig\. [3](https://arxiv.org/html/2606.11204#S3.F3), is provided in Fig\. [4](https://arxiv.org/html/2606.11204#S3.F4)\.
Figure 4: Structured output for the SDS SULFURIC ACID 1\-51% of Fig\. [1](https://arxiv.org/html/2606.11204#S1.F1)\.

    {
      "World_First_Aid_Measures": {
        "First_Aid_Measures": {
          "General_Information": [
            "Take off contaminated clothing."
          ],
          "Following_Inhalation": [
            "Provide fresh air.",
            "In all cases of doubt, or when symptoms persist, seek medical advice."
          ],
          "Following_Skin_Contact": [
            "Rinse skin with water/shower."
          ],
          "Following_Eye_Contact": [
            "Irrigate copiously with clean, fresh water for at least 10 minutes, holding the eyelids apart.",
            "In case of eye irritation consult an ophthalmologist."
          ],
          "Following_Ingestion": [
            "Rinse mouth.",
            "Call a doctor if you feel unwell."
          ],
          "Protection_of_First_Aid_Responders": []
        },
        "Symptoms": [
          "Irritation",
          "Nausea",
          "Vomiting",
          "Gastrointestinal complaints",
          "Headache",
          "Vertigo",
          "Dizziness",
          "Drowsiness",
          "Narcosis"
        ],
        "Treatment": [
          "none"
        ]
      }
    }

Evaluation is based on the quality of the extracted SDS fields, the processing time, and the number of tokens consumed by the corresponding LLM\. The evaluation methodology is provided below\.
1. Extraction Quality: $TP$ and $TN$ denote correct extractions and correct omissions, respectively, while $FP$ and $FN$ denote hallucinated and missing fields\. A high FP rate indicates hallucinations, which are particularly critical in safety\-relevant domains\. To capture extraction errors more granularly, we additionally report three complementary quality metrics: Not\-Found Rate, False\-Positive Rate, and BERTScore for semantic similarity between extracted and reference text\[bert\-score\]\. These metrics provide a more differentiated view of omission errors, hallucinations, and semantic similarity\.
   - Accuracy is the proportion of correctly extracted fields relative to all expected fields: $\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}$\.
   - Not\-Found Rate \(NF Rate\): The proportion of required fields that were present in the SDS but not extracted by the model: $\mathrm{NF\ Rate}=\frac{FN}{TP+FN}$\.
   - False\-Positive Rate \(FP Rate\): The proportion of fields that were extracted despite not being present in the ground truth: $\mathrm{FP\ Rate}=\frac{FP}{FP+TN}$\.
   - BERTScore \(Semantic Similarity\): Measures semantic similarity between extracted and reference text by comparing contextual token embeddings using cosine similarity, capturing meaning\-level agreement beyond exact string matching\[bert\-score\]\.
2. Processing Time \(Latency\): The end\-to\-end runtime per document \(in seconds\), measured from prompt submission to receipt of the final model output\.
3. Cost \(Token Usage\): Computed from token counts and model pricing: $\mathrm{Cost}=\frac{Tokens_{in}}{10^{6}}\cdot Price_{in}+\frac{Tokens_{out}}{10^{6}}\cdot Price_{out}$\.
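The count\-based metrics and the cost formula translate directly into code\. The sketch below is illustrative: the `Counts` container, the function names, the counts, and the prices are made up for the example and do not come from the paper or from any provider's price list\.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # field present in the SDS and extracted correctly
    tn: int  # field absent from the SDS and correctly left empty
    fp: int  # field extracted although absent from the ground truth
    fn: int  # field present in the SDS but not extracted


def accuracy(c: Counts) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def nf_rate(c: Counts) -> float:
    return c.fn / (c.tp + c.fn)


def fp_rate(c: Counts) -> float:
    return c.fp / (c.fp + c.tn)


def cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Prices are per one million tokens, as in the cost formula."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


# Made-up counts (235 fields in total) and prices, for illustration only.
counts = Counts(tp=160, tn=30, fp=20, fn=25)
acc = accuracy(counts)   # 190 / 235
nf = nf_rate(counts)     # 25 / 185
fpr = fp_rate(counts)    # 20 / 50
usd = cost(tokens_in=40_000, tokens_out=5_000, price_in=2.0, price_out=8.0)
```

The example shows why the False\-Positive Rate is reported separately: accuracy stays near 0\.81 even though 40% of the fields that should have been left empty were filled in\. The rate functions divide by zero when a section has no applicable fields, so a real evaluation would need to guard those cases\.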
For a unified comparison across configurations, we report a normalized weighted performance score that emphasizes accuracy \(0\.7\), latency \(0\.2\), and cost \(0\.1\):
$$\mathrm{Score}=0.7\cdot Accuracy_{\text{norm}}+0.2\cdot Time_{\text{norm}}+0.1\cdot Cost_{\text{norm}}$$
Normalization is performed across all configurations using min–max scaling\.
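A minimal sketch of the weighted score follows\. One detail is an assumption: the text does not say how latency and cost, where lower is better, enter a score where higher is better\. The sketch inverts both after min–max scaling; `min_max` and `weighted_scores` are hypothetical names, and the input values are illustrative rather than measured\.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(accuracies, times, costs, weights=(0.7, 0.2, 0.1)):
    """Score per configuration; higher is better.

    Assumption: time and cost are inverted after scaling so that faster and
    cheaper configurations score higher. The paper does not spell this out.
    """
    w_acc, w_time, w_cost = weights
    acc_n = min_max(accuracies)
    time_n = [1.0 - t for t in min_max(times)]
    cost_n = [1.0 - c for c in min_max(costs)]
    return [
        w_acc * a + w_time * t + w_cost * c
        for a, t, c in zip(acc_n, time_n, cost_n)
    ]


# Three hypothetical configurations (values are illustrative, not measured).
scores = weighted_scores(
    accuracies=[0.84, 0.80, 0.66],
    times=[100.0, 73.0, 150.0],
    costs=[0.10, 0.30, 0.05],
)
```

Because min–max scaling is relative to the set of configurations being compared, the resulting scores rank configurations against each other and are not absolute measures of quality\.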
Table 2: Weighted factors used for model evaluation\.
### 3\.4 Experimental Procedure
Four state\-of\-the\-art LLMs were selected to represent major providers: Gemini 1\.5 Pro \(Google DeepMind\), GPT\-4o \(OpenAI\), Claude 3\.7 Sonnet \(Anthropic\), and Llama 3\.1–70B \(Meta\)\.
Gemini and Claude support native multimodal input; GPT\-4o processes image input via base64 encoding; Llama 3\.1–70B is text\-only and open\-source\. This selection enables cross\-vendor, cross\-architecture, and cross\-modality comparison under identical extraction conditions\. All model–prompt–method combinations were executed on the full set of ten SDS documents\. Each SDS section was processed independently to avoid cross\-section leakage and to ensure consistent adherence to the defined JSON schema\. Model outputs were compared against a manually validated ground truth, with all reference labels independently verified for consistency and correctness\.
For every configuration, field\-level metrics—accuracy, false\-positive rate, and not\-found rate—were computed per section and then averaged across documents\. Processing time and token usage were logged for each run to enable cost and latency comparisons\. In total, 21 configurations were evaluated under identical conditions to ensure reproducibility and controlled comparison\.
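The text does not break down the total of 21 configurations\. One reading consistent with it, given that Llama 3\.1–70B is text\-only, is four models times three prompts for the text pipeline plus three models times three prompts for the multimodal pipeline\. The enumeration below is a sketch of that reading; the identifiers are hypothetical\.

```python
from itertools import product

MODELS = ["Gemini 1.5 Pro", "GPT-4o", "Claude 3.7 Sonnet", "Llama 3.1-70B"]
PROMPTS = ["zero-shot", "few-shot", "chain-of-thought"]
METHODS = ["text", "multimodal"]
TEXT_ONLY = {"Llama 3.1-70B"}  # assumption: no multimodal runs for this model

# All model-prompt-method triples, minus multimodal runs of text-only models.
configurations = [
    (model, prompt, method)
    for model, prompt, method in product(MODELS, PROMPTS, METHODS)
    if not (method == "multimodal" and model in TEXT_ONLY)
]
```

The full grid has 4 x 3 x 2 = 24 entries, and removing the three multimodal runs for the text\-only model leaves 21\.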
## 4 Results and Discussion
### 4\.1 Comparison of Text\-Based and Multimodal Approaches
The first part of the analysis compared the text\-based and multimodal extraction pipelines across all models\. Results indicate that the text\-based approach consistently outperformed the multimodal method in every metric\. While multimodal models can process image\-based PDFs directly, their optical character recognition \(OCR\) introduces additional uncertainty and latency\. Across models, text\-based extraction achieved between four and nine percentage points higher accuracy\. As shown in Figure [5](https://arxiv.org/html/2606.11204#S4.F5), Gemini 1\.5 Pro reached an accuracy of 82 percent in the text\-based setup compared to 73 percent in multimodal mode\. Similarly, GPT\-4o achieved 80 percent text\-based accuracy versus 76 percent in multimodal processing\.
Average processing times were also lower for text\-based extraction, with GPT\-4o completing text\-based extractions in approximately 73 seconds compared to nearly 300 seconds in multimodal mode \(Figure [6](https://arxiv.org/html/2606.11204#S4.F6)\)\. These results confirm that, for digitally available SDS documents, text\-based processing offers a more efficient and reliable approach\.
Figure 5: Accuracy comparison between text\-based and multimodal extraction across all models\.
Figure 6: Processing time comparison between text\-based and multimodal extraction across all models\.
### 4\.2 Model Performance
A comparative evaluation of the four Large Language Models \(Table [3](https://arxiv.org/html/2606.11204#S4.T3)\) reveals clear performance differences\. Gemini 1\.5 Pro achieves the highest overall accuracy in the benchmark \(up to 0\.84 with Chain\-of\-Thought prompting\) while maintaining one of the lowest total costs, resulting in the most favorable overall efficiency profile\. GPT\-4o ranks second, reaching accuracies up to 0\.81 and achieving the lowest average processing time \(73 s\) among all models, albeit at higher token\-related cost\. Claude 3\.7 Sonnet performs slightly below GPT\-4o with accuracies around 0\.79, offering competitive extraction quality but at moderate latency and cost\.
In contrast, the open\-source Llama 3\.1–70B model shows substantially lower accuracy \(0\.66–0\.71 across prompting techniques\) and consistently higher false\-positive and not\-found rates\. The performance gap between the strongest and weakest model configurations spans approximately 18 percentage points \(0\.84 versus 0\.66\), confirming that model choice is the dominant factor affecting SDS extraction quality\.
Table 3: Overall evaluation results across all model–prompt–method combinations\.
Table 4: Accuracy by model, method, and prompting technique\.
Figure 7: Model accuracy by prompting technique \(Zero\-Shot, Few\-Shot, Chain\-of\-Thought\)\.
Prompting effects\. Across models, Zero\-Shot matched or exceeded Few\-Shot accuracy while using fewer tokens and incurring lower latency\. This pattern is consistent with the *lost in the middle* effect\[liu2023lost\], where models attend less reliably to information positioned mid\-prompt; in our setup, adding exemplars increased cost and runtime without improving extraction quality\. Chain\-of\-Thought helped for Gemini 1\.5 Pro but did not close the gap to the 90% reliability threshold\.
### 4\.4 Aggregated Cost\-Function Evaluation
The weighted cost\-function, combining accuracy \(0\.7\), processing time \(0\.2\), and cost \(0\.1\), provides a holistic efficiency metric\. The highest overall score \(0\.88\) was achieved by Gemini 1\.5 Pro using the Chain\-of\-Thought technique, followed by GPT\-4o \(0\.81\) and Claude 3\.7 Sonnet \(0\.72\)\. Figure [8](https://arxiv.org/html/2606.11204#S4.F8) visualizes the trade\-off between accuracy, latency, and cost across all model–prompt configurations and highlights the relative efficiency of the evaluated models\.
Figure 8: Accuracy–processing time–cost trade\-off across all model–prompt configurations\. Bubble size encodes total cost; the green band indicates the desired accuracy region \($\geq 0.90$\)\.
Llama 3\.1–70B achieved the lowest composite score \(0\.62\)\. Contrary to initial assumptions, model efficiency did not strictly correlate with model size or multimodal capabilities; instead, optimized text\-based inference produced the most balanced performance\. Taken together, these results indicate that text\-based extraction, combined with well\-structured prompting, currently represents the most viable strategy for SDS information processing\.
### 4\.5 Discussion of Findings
The experimental results demonstrate that current state\-of\-the\-art LLMs can extract structured information from SDS documents with promising but still limited accuracy\. Although the best configuration achieved 84 percent accuracy, no model surpassed the 90 percent reliability threshold required for autonomous deployment in industrial safety applications\. High false\-positive rates, particularly among smaller or multimodal models, remain a key limitation\. Nevertheless, the observed performance suggests that domain\-specific fine\-tuning or hybrid workflows incorporating Human\-in\-the\-Loop validation could bridge this gap\. Moreover, continuous benchmarking across model versions will be essential to track future progress as foundation models evolve rapidly\. Overall, this study confirms the potential of LLMs for semi\-automated safety data extraction while underscoring the need for further optimization and rigorous validation before operational integration\.
#### Limitations and Responsible Use\.
Limitations include residual false positives that matter in safety\-critical settings, sub\-90% accuracy across all configurations, OCR\-induced artifacts in multimodal runs, and uncertain generalizability beyond the ten SDS\. To mitigate risk, we recommend a human\-in\-the\-loop validator for high\-impact fields \(e\.g\., hazards, first aid, PPE\) and audit logging for traceability\. Future work should expand the dataset, report inter\-annotator agreement, stress\-test degraded scans, and evaluate domain\-adapted fine\-tuning and calibration for reliable confidence estimates\.
## 5 Conclusion and Future Work
The benchmark of four state\-of\-the\-art Large Language Models—Gemini 1\.5 Pro, GPT\-4o, Claude 3\.7 Sonnet, and Llama 3\.1–70B—showed that text\-based processing consistently outperforms multimodal approaches across accuracy, runtime, and cost\. The best configuration, Gemini 1\.5 Pro with Chain\-of\-Thought prompting, achieved 84% accuracy; however, no model reached the 90% reliability threshold required for autonomous use in safety\-critical settings\. False positives remain the most consequential failure mode, and multimodal OCR artefacts further limit extraction robustness\. Model choice exerted a far stronger influence on performance than prompting strategies, indicating that architectural capabilities dominate over interaction design\. Given these limitations and the narrow scope of the ten\-document dataset, practical deployments should incorporate a Human\-in\-the\-Loop validation step to mitigate risks and ensure the trustworthy handling of chemical safety information\. The presented framework provides a reproducible basis for future evaluation, fine\-tuning, and optimization of LLM\-based extraction pipelines in regulated industrial environments\.
## Declarations
### Conflict of Interest
The authors declare that they have no competing interests\. The first and third authors are affiliated with SAP; these affiliations did not influence the design, results, or interpretation of the research\.
### Data availability
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request\. The code used for the analysis is available from the corresponding author upon reasonable request\.