Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

arXiv cs.LG Papers

Summary

This paper proposes a deterministic climate-risk intelligence framework integrating orchestration, anomaly detection, and imbalance-aware ensemble learning for auditable ESG validation, addressing fragmented Scope 1-3 reporting data.

arXiv:2606.02604v1 Announce Type: new Abstract: ESG and climate risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional validation pipelines lack provenance aware auditability, hidden drift detection, and reproducibility oriented governance. This paper proposes a deterministic climate risk intelligence framework integrating single source of truth orchestration, temporal anomaly detection, imbalance aware ensemble learning, and explainability oriented governance for auditable ESG validation. To support open reproducibility, we construct and release a synthetic ESG validation benchmark calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB standards. The methodology incorporates temporal drift analysis, SMOTE based rare event optimization, ensemble learning, provenance aware orchestration, and TreeSHAP based interpretability for governance inspection and audit reconstruction. We evaluate the framework against statistical classifiers, anomaly detection methods, temporal forecasting baselines, and a threshold based system using classification metrics (recall, F1, ROC AUC), calibration metrics (ECE, Brier score), and a governance oriented audit trace completeness metric measuring the fraction of flagged anomalies for which a deterministic source to escalation provenance chain can be reconstructed. Results are reported as mean and standard deviation across stratified five fold cross validation with paired significance testing. The framework reframes ESG reporting toward deterministic climate risk governance infrastructure supporting reproducibility, explainability, and operational auditability.

Cached at: 06/03/26, 09:38 AM

# Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1–3 Validation
Source: [https://arxiv.org/html/2606.02604](https://arxiv.org/html/2606.02604)
Karan Sehgal, Kent Business School, University of Kent, Canterbury, United Kingdom, K\.Sehgal@kent\.ac\.uk

Khawar Naveed Bhatti, Kent Business School, University of Kent, Canterbury, United Kingdom, K\.Bhatti@kent\.ac\.uk

###### Abstract

Corporations pursuing net\-zero commitments increasingly face a systemic engineering challenge: ESG and climate\-risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional validation pipelines lack mechanisms for provenance\-aware auditability, hidden drift detection, and reproducibility\-oriented governance\.

This paper proposes a deterministic climate\-risk intelligence framework integrating single\-source\-of\-truth orchestration, temporal anomaly detection, imbalance\-aware ensemble learning, and explainability\-oriented governance for auditable ESG validation infrastructure\.

To support open reproducibility, we construct and release a synthetic ESG validation benchmark whose disclosure distributions, anomaly prevalence, and missingness structure are calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB reporting standards\. The proposed workflow combines this benchmark with public climate\-risk hazard datasets, proxy emissions features, and accounting\-aligned data\-quality logic within a deterministic event\-driven orchestration pipeline designed to improve validation traceability and governance reliability\.

The methodology incorporates temporal drift analysis, SMOTE\-based rare\-event optimization, ensemble learning architectures, provenance\-aware orchestration semantics, and TreeSHAP\-based interpretability for governance inspection and audit reconstruction\.

Comparative experimentation evaluates the framework against statistical classifiers, anomaly\-detection methods, temporal forecasting baselines, and a threshold\-based validation system using classification metrics \(recall, F1, ROC\-AUC\), calibration metrics \(Expected Calibration Error, Brier score\), and a governance\-oriented *audit trace completeness* metric measuring the fraction of flagged anomalies for which a deterministic source\-to\-escalation provenance chain can be reconstructed\. Results are reported as the mean and standard deviation across stratified five\-fold cross\-validation, with paired significance testing against the strongest baseline\.

The proposed workflow reframes ESG reporting from passive disclosure aggregation toward deterministic climate\-risk governance infrastructure supporting reproducibility, explainability, and operational auditability within regulated enterprise environments\.

## 1 Introduction

Corporations pursuing net\-zero commitments increasingly operate under fragmented ESG reporting environments involving heterogeneous Scope 1, Scope 2, and Scope 3 disclosure systems distributed across suppliers, financial systems, sustainability platforms, and geographically diverse reporting infrastructures\.

Conventional ESG validation workflows frequently rely upon passive extraction, threshold\-based reconciliation procedures, and manually intensive audit processes lacking deterministic orchestration semantics, provenance\-aware traceability, and reproducibility\-oriented governance engineering\.

These limitations become increasingly significant under regulated enterprise environments where climate\-risk disclosures influence institutional reporting obligations, financed\-emissions accounting, operational risk evaluation, and sustainability\-linked financial decision making\.

Enterprise climate\-risk environments additionally exhibit sparse anomaly distributions and highly imbalanced operational conditions, represented as:

$$P(y=1) \ll P(y=0)$$

where governance\-critical anomalies \($y=1$\) constitute only a small fraction of total reporting observations\.

Under such conditions, conventional optimization procedures may produce deceptively strong aggregate performance while exhibiting weak minority\-event sensitivity and elevated false\-negative behavior under governance\-critical edge cases\.
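The effect can be illustrated with a minimal sketch. The labels below are hypothetical (a 2% anomaly rate chosen for illustration, not the benchmark's prevalence), and the "classifier" is an assumed degenerate one that never flags anything:

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks an anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_and_recall(y_true, y_pred):
    """Aggregate accuracy and minority-class recall."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical data: 20 anomalies among 1,000 observations.
y_true = [1] * 20 + [0] * 980
y_majority = [0] * 1000  # always predicts "no anomaly"

acc, rec = accuracy_and_recall(y_true, y_majority)
print(f"accuracy={acc:.3f} recall={rec:.3f}")  # accuracy=0.980 recall=0.000
```

Accuracy is 98% while every anomaly is missed, which is why recall and F1 on the anomaly class are more informative than aggregate accuracy in this setting.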

Additionally, ESG reporting systems remain vulnerable to:

- provenance inconsistencies,
- delayed supplier disclosures,
- climate\-transition drift,
- hidden reconciliation conflicts,
- and incomplete emissions reporting\.

This paper proposes a deterministic climate\-risk intelligence framework integrating:

- deterministic orchestration,
- provenance\-aware governance,
- temporal drift detection,
- imbalance\-aware ensemble learning,
- and explainability\-oriented audit infrastructure\.

The proposed workflow combines heterogeneous ESG reporting environments, climate\-risk hazard datasets, temporal anomaly modeling, and governance\-oriented explainability mechanisms within a reproducibility\-aware orchestration architecture designed to improve auditability and operational reliability under fragmented enterprise disclosure conditions\.

The broader objective is to transition ESG reporting from passive disclosure aggregation toward deterministic governance engineering supporting trustworthy climate\-risk intelligence infrastructure\.

The primary contributions of this work are summarized as follows:

- A deterministic orchestration framework for provenance\-aware ESG validation under fragmented enterprise reporting environments\.
- An imbalance\-aware climate\-risk intelligence workflow integrating SMOTE optimization, ensemble learning, and temporal anomaly detection\.
- A governance\-oriented explainability architecture combining TreeSHAP\-based interpretability with audit reconstruction and provenance\-aware traceability\.
- An *audit trace completeness* metric that quantifies deterministic provenance reconstruction for flagged anomalies, making the governance\-engineering objective directly measurable\.
- A reproducibility\-oriented experimentation framework incorporating a calibrated synthetic ESG benchmark, climate\-risk enrichment, temporal drift modeling, cross\-validated evaluation, and governance\-sensitive metrics\.
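As a rough illustration of how the audit trace completeness metric could be computed, the sketch below counts the flagged anomalies whose provenance events form an unbroken source-to-escalation chain. The stage names, the event schema (`id`, `stage`, `parent`), and the helper `make_chain` are assumptions made for this example; the paper's own trace format may differ.

```python
# Assumed stage names for a source-to-escalation chain (hypothetical).
REQUIRED_STAGES = ("source", "ingest", "validate", "escalate")


def chain_is_reconstructable(events):
    """True if every required stage is present and linked to its predecessor."""
    by_stage = {e["stage"]: e for e in events}
    prev_id = None
    for stage in REQUIRED_STAGES:
        event = by_stage.get(stage)
        if event is None or event.get("parent") != prev_id:
            return False
        prev_id = event["id"]
    return True


def audit_trace_completeness(flagged):
    """Fraction of flagged anomalies with a reconstructable provenance chain."""
    if not flagged:
        return 0.0  # undefined without flagged anomalies; 0.0 by convention here
    ok = sum(1 for events in flagged.values() if chain_is_reconstructable(events))
    return ok / len(flagged)


def make_chain(record_id, stages):
    """Build a linked list of provenance events for one record (test helper)."""
    events, parent = [], None
    for stage in stages:
        event = {"id": f"{record_id}:{stage}", "stage": stage, "parent": parent}
        events.append(event)
        parent = event["id"]
    return events


flagged = {
    "rec-001": make_chain("rec-001", REQUIRED_STAGES),
    "rec-002": make_chain("rec-002", REQUIRED_STAGES[:3]),  # escalation missing
}
print(audit_trace_completeness(flagged))  # 0.5
```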

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig1.png)

Figure 1: Deterministic climate\-risk intelligence architecture integrating provenance\-aware orchestration, imbalance\-aware learning, temporal provenance graph modeling, and governance\-oriented explainability for ESG validation infrastructure\.

## 2 Related Work

### 2\.1 ESG Reporting and Climate\-Risk Governance

Enterprise climate\-risk reporting has evolved rapidly under governance frameworks including the GHG Protocol, PCAF, TCFD, ISSB, and CSRD Greenhouse Gas Protocol \([2023](https://arxiv.org/html/2606.02604#bib.bib18)\); Partnership for Carbon Accounting Financials \([2022](https://arxiv.org/html/2606.02604#bib.bib12)\); International Sustainability Standards Board \([2023](https://arxiv.org/html/2606.02604#bib.bib13)\)\.

These frameworks establish accounting\-oriented disclosure requirements but provide comparatively limited technical guidance regarding deterministic orchestration, provenance\-aware validation, and reproducibility\-oriented governance engineering\.

Recent climate\-risk systems increasingly incorporate NGFS transition scenarios, Copernicus climate infrastructure, WRI Aqueduct hazard datasets, and ThinkHazard environmental\-risk intelligence for enterprise stress\-testing and climate\-exposure modeling Network for Greening the Financial System \([2023](https://arxiv.org/html/2606.02604#bib.bib14)\); European Centre for Medium\-Range Weather Forecasts \([2023](https://arxiv.org/html/2606.02604#bib.bib15)\); World Resources Institute \([2023](https://arxiv.org/html/2606.02604#bib.bib16)\); Global Facility for Disaster Reduction and Recovery \([2023](https://arxiv.org/html/2606.02604#bib.bib17)\)\.

However, existing ESG validation workflows remain heavily dependent upon static reconciliation procedures lacking temporal anomaly awareness, governance\-oriented orchestration semantics, and deterministic audit reconstruction\.

### 2\.2 Machine Learning for Climate\-Risk Intelligence

Machine learning architectures including Random Forests Breiman \([2001](https://arxiv.org/html/2606.02604#bib.bib1)\), Gradient Boosting, XGBoost Chen and Guestrin \([2016](https://arxiv.org/html/2606.02604#bib.bib3)\), LightGBM Ke et al\. \([2017](https://arxiv.org/html/2606.02604#bib.bib4)\), and CatBoost Prokhorenkova et al\. \([2018](https://arxiv.org/html/2606.02604#bib.bib5)\) have demonstrated strong predictive capability under nonlinear enterprise\-risk environments\.

Recent work additionally explores emissions estimation, climate\-risk forecasting, and governance\-sensitive anomaly detection using ensemble\-learning architectures\.

However, governance\-critical ESG failures remain comparatively sparse and operationally imbalanced, frequently producing unstable minority\-event sensitivity and elevated false\-negative behavior under conventional optimization workflows\.

Imbalance\-aware learning procedures including SMOTE Chawla et al\. \([2002](https://arxiv.org/html/2606.02604#bib.bib2)\) therefore remain important for governance\-oriented anomaly detection under fragmented enterprise reporting environments\.
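The core idea of SMOTE, synthesizing new minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration using only the standard library, not a reference implementation; the function name `smote_like` and its parameters are assumptions for this example.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward a random
    one of the k nearest minority neighbours (needs >= 2 minority points)."""
    rng = random.Random(seed)  # fixed seed keeps the oversampling reproducible
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical two-feature minority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_like(minority, n_new=3, k=2, seed=42))
```

In a cross-validated evaluation, oversampling of this kind is normally applied to the training folds only, so that synthetic points never leak into the held-out fold.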

### 2\.3 Concept Drift and Temporal Instability

Enterprise ESG reporting systems additionally exhibit temporal instability involving delayed disclosures, supplier volatility, evolving sustainability standards, and climate\-transition uncertainty\.

Prior research on concept drift and adaptive learning highlights the importance of time\-aware validation and non\-stationary anomaly modeling within continuously evolving enterprise systems Gama et al\. \([2014](https://arxiv.org/html/2606.02604#bib.bib8)\)\.

Temporal governance degradation may emerge through:

- concept drift,
- covariate drift,
- label drift,
- and reporting\-frequency instability\.

These forms of temporal instability become particularly significant within climate\-risk intelligence systems operating under evolving regulatory and environmental conditions\.

### 2\.4 Explainability and Trustworthy AI

Explainability\-oriented methodologies including SHAP Lundberg and Lee \([2017](https://arxiv.org/html/2606.02604#bib.bib6)\) and its tree\-ensemble specialization TreeSHAP Lundberg et al\. \([2020](https://arxiv.org/html/2606.02604#bib.bib19)\) have emerged as important mechanisms for improving enterprise AI transparency, audit reconstruction, and governance inspection\. TreeSHAP is particularly relevant here because it computes exact Shapley values for tree\-based models in low\-order polynomial time, making attribution tractable across the high\-dimensional feature space used in this work\.
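To make concrete what an exact Shapley attribution is, the following sketch enumerates every feature coalition for a small hypothetical scoring function. This brute-force form is exponential in the number of features and replaces absent features with a fixed baseline vector; both are simplifications assumed here. It is not TreeSHAP, which reaches exact values for tree ensembles far more efficiently by exploiting tree structure. `toy_model` and its inputs are illustrative only.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are replaced by their baseline value.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical scorer: a linear term plus a two-feature interaction."""
    return 2.0 * z[0] + (1.0 if z[1] > 0.5 and z[2] > 0.5 else 0.0)


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 0.5, 0.5]
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction credit is split evenly between the two features that jointly trigger it.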

Recent trustworthy\-AI governance frameworks including the NIST AI Risk Management Framework National Institute of Standards and Technology \([2023](https://arxiv.org/html/2606.02604#bib.bib10)\), OECD trustworthy\-AI principles Organisation for Economic Co\-operation and Development \([2022](https://arxiv.org/html/2606.02604#bib.bib11)\), and reproducibility\-oriented machine\-learning guidance Pineau et al\. \([2021](https://arxiv.org/html/2606.02604#bib.bib9)\) additionally emphasize calibration reliability, operational reproducibility, governance\-aware oversight, and audit consistency\.

However, explainability alone does not guarantee deterministic orchestration, provenance\-aware governance, or reproducibility\-oriented audit infrastructure\.

This paper therefore positions explainability as one component within a broader deterministic governance\-engineering architecture for trustworthy climate\-risk intelligence systems\.

## 3 Dataset and Problem Setup

### 3\.1 A Calibrated Synthetic ESG Benchmark

To enable open, fully reproducible experimentation without disclosing commercially sensitive enterprise records, we construct and release a*synthetic*ESG validation benchmark\. The benchmark is generated by a documented data\-generating process \(DGP\) rather than extracted from a single proprietary system; this design choice trades away real\-world idiosyncrasy in exchange for full reproducibility, public availability, and controllable anomaly prevalence\. Synthetic benchmarking is a standard practice for governance\-sensitive domains in which production data cannot be shared\.

The DGP proceeds in four stages\. First, base disclosure records are sampled across Scope 1, Scope 2, and Scope 3 fields, with marginal distributions and missingness rates calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB reporting standards Greenhouse Gas Protocol \([2023](https://arxiv.org/html/2606.02604#bib.bib18)\); Partnership for Carbon Accounting Financials \([2022](https://arxiv.org/html/2606.02604#bib.bib12)\); International Sustainability Standards Board \([2023](https://arxiv.org/html/2606.02604#bib.bib13)\)\. Second, provenance metadata, confidence bands, and temporal reporting signals are attached to each record\. Third, the six governance failure modes of Section [4](https://arxiv.org/html/2606.02604#S4) are injected at a controlled prevalence to produce labelled governance anomalies\. Fourth, records are enriched with public climate\-risk indicators \(Section [3\.3](https://arxiv.org/html/2606.02604#S3.SS3)\)\. The anomaly label is therefore generated jointly with the failure\-injection process, giving an exact ground truth for evaluation\. The benchmark, generation code, and fixed random seeds are released to support reproducibility \(Section [5\.11](https://arxiv.org/html/2606.02604#S5.SS11)\)\.
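The four stages can be sketched as a seeded generator. Everything specific in this example is an assumption made for illustration: the log-normal marginals, the field names, the placeholder failure-mode names, and the rate parameters are not the released benchmark's calibrated values.

```python
import random

# Placeholder names; the text specifies six failure modes but not these labels.
FAILURE_MODES = [f"mode_{i}" for i in range(1, 7)]


def generate_benchmark(n_records, anomaly_rate, missing_rate, seed):
    """Generate labelled synthetic ESG records from a fixed random seed."""
    rng = random.Random(seed)
    records = []
    for i in range(n_records):
        rec = {"id": i}
        # Stage 1: base Scope 1-3 values with controlled missingness.
        for scope in ("scope1", "scope2", "scope3"):
            value = rng.lognormvariate(8.0, 1.0)  # assumed marginal distribution
            rec[scope] = None if rng.random() < missing_rate else value
        # Stage 2: provenance metadata, confidence band, temporal signal.
        rec["source_id"] = f"src-{rng.randrange(50)}"
        rec["confidence"] = rng.uniform(0.5, 1.0)
        rec["period"] = rng.randrange(1, 13)
        # Stage 3: failure injection; the label is set jointly with the injection.
        if rng.random() < anomaly_rate:
            rec["failure_mode"] = rng.choice(FAILURE_MODES)
            rec["label"] = 1
        else:
            rec["failure_mode"] = None
            rec["label"] = 0
        # Stage 4: climate-risk enrichment (assumed ordinal hazard score).
        rec["flood_risk"] = rng.randrange(0, 5)
        records.append(rec)
    return records


data = generate_benchmark(n_records=1000, anomaly_rate=0.05, missing_rate=0.1, seed=123)
print(sum(r["label"] for r in data), "anomalies in", len(data), "records")
```

Because the label is written at the same point as the injected failure mode, the ground truth is exact by construction, and rerunning with the same seed reproduces the same records.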

### 3\.2 Enterprise ESG Validation Layer

The enterprise validation layer of the benchmark models heterogeneous ESG reporting systems involving:

- Scope 1 emissions,
- Scope 2 disclosures,
- Scope 3 supplier reporting,
- provenance metadata,
- confidence bands,
- and audit reconciliation states\.

Core fields include emissions variables, governance metadata, provenance identifiers, temporal reporting signals, regional indicators, and climate\-risk severity attributes\.

### 3\.3 Public Climate\-Risk Layer

To support reproducibility\-oriented experimentation, public climate\-risk datasets are integrated including:

- WRI Aqueduct 4\.0,
- ThinkHazard,
- Copernicus Climate Change Service,
- NGFS transition scenarios,
- and PCAF\-aligned emissions logic\.

These datasets provide flood risk, drought exposure, heat stress, transition risk, and sectoral climate vulnerability indicators\.

Table 1: Synthetic ESG Benchmark Summary Statistics

## 4 Governance Failure Modes

Fragmented ESG reporting environments frequently exhibit governance\-critical inconsistencies capable of degrading audit reliability, climate\-risk visibility, and enterprise reporting integrity\.

The proposed framework therefore models anomaly detection as a governance\-oriented validation problem involving multiple operational failure categories\. These categories also define the failure\-injection process used to label the synthetic benchmark \(Table [1](https://arxiv.org/html/2606.02604#S3.T1)\)\.

Table 2: Governance Failure Modes

These governance\-failure categories provide operational grounding for anomaly detection, audit reconstruction, and provenance\-aware orchestration within climate\-risk intelligence systems\.

## 5 Framework Architecture

This section describes the components of the proposed deterministic climate\-risk intelligence framework\. We first present the natural\-language ingestion layer and the temporal drift layer, then the deterministic orchestration, provenance reasoning, symbolic\-neural governance, imbalance\-aware learning, and explainability components\.

### 5\.1 Natural Language ESG Ingestion

Real\-world ESG reporting environments frequently involve heterogeneous unstructured disclosures including sustainability reports, earnings statements, supplier disclosures, regulatory filings, climate\-risk narratives, PDF\-based governance documents, and cross\-jurisdictional reporting statements\.

Unlike structured financial reporting systems, enterprise ESG disclosures often exhibit inconsistent terminology, incomplete metadata, fragmented provenance lineage, and substantial semantic ambiguity across suppliers, subsidiaries, and geographically distributed reporting environments\.

The proposed framework therefore incorporates natural\-language ingestion procedures designed to support governance\-aware semantic extraction and reproducibility\-oriented disclosure reconciliation under fragmented enterprise reporting conditions\.

The ingestion workflow incorporates:

- OCR\-based document extraction,
- semantic normalization,
- named\-entity recognition,
- disclosure reconciliation,
- retrieval\-oriented climate parsing,
- and provenance\-aware semantic alignment\.

Recent advances in transformer\-based language architectures including encoder\-based semantic representations, retrieval\-augmented generation workflows, and domain\-adapted document intelligence systems have substantially improved enterprise document understanding under heterogeneous reporting environments\. Such architectures enable semantic extraction of climate\-risk disclosures, supplier reporting narratives, governance statements, and sustainability\-oriented financial commentary across fragmented enterprise ecosystems\.

Retrieval\-augmented ESG parsing workflows further improve disclosure\-grounding consistency by associating extracted emissions claims with provenance\-aware enterprise metadata, climate\-risk indicators, temporal reporting trajectories, and historical reconciliation states\. Such retrieval\-oriented semantic architectures become particularly important under enterprise ESG environments where reporting disclosures frequently exhibit linguistic ambiguity, inconsistent sustainability terminology, and incomplete governance metadata\.

Future extensions may additionally incorporate multimodal climate\-risk ingestion involving geospatial flood maps, satellite\-derived environmental indicators, raster\-based climate intelligence, and visual climate embeddings integrated alongside textual sustainability disclosures and structured ESG records\.

Enterprise disclosure embeddings are represented as:

$$z_i = f_\theta(d_i)$$

where:

- $d_i$ denotes ESG disclosure text,
- $z_i$ denotes semantic embedding representations,
- and $f_\theta$ denotes parameterized language encoders\.

The resulting semantic representations are incorporated into:

- anomaly scoring,
- provenance confidence estimation,
- governance reconciliation,
- climate\-risk validation,
- and audit\-oriented disclosure inspection workflows\.

#### Ingestion evaluation corpus\.

The transformer benchmarking and parsing results below are evaluated on a held\-out corpus of 1,200 annotated disclosure passages drawn from publicly available sustainability reports and regulatory filings, split 70/15/15 into train/validation/test partitions\. Ground\-truth labels \(scope assignment, provenance source, climate entity spans\) were produced by two annotators with disagreements adjudicated by a third; reported metrics are computed on the held\-out test partition\. Encoder models are used off\-the\-shelf with a lightweight task head fine\-tuned on the training partition\.

Table 3: Transformer Benchmarking for ESG Disclosure Intelligence \(held\-out test partition\)

On the held\-out test partition, ClimateBERT achieved the highest ESG\-F1 \(0\.88\) and parsing accuracy \(0\.85\), while SBERT achieved the highest retrieval recall \(0\.89\)\. Both outperformed the general\-purpose BERT baseline across all four measures, indicating that domain\-adapted and retrieval\-optimized encoders better handle cross\-source disclosure variation and provenance\-sensitive reconciliation\.

Retrieval\-oriented ESG parsing workflows were additionally evaluated using retrieval precision, retrieval recall, semantic grounding consistency, and disclosure\-alignment accuracy\. ClimateBERT and SBERT\-based retrieval workflows produced the strongest semantic alignment under fragmented climate\-risk disclosures involving inconsistent sustainability terminology, delayed supplier reporting, and heterogeneous disclosure structures\. These results indicate that retrieval\-augmented ESG parsing improves provenance\-aware disclosure reconciliation and governance\-oriented anomaly interpretation under fragmented enterprise climate\-risk environments\.

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig5_esg_embedding_space.png)

Figure 2: Semantic embedding visualization of ESG disclosure representations illustrating governance\-oriented clustering behavior under fragmented climate\-risk reporting environments\.

Table 4: ESG Disclosure Parsing Accuracy \(held\-out test partition\)

Scope 3 extraction was the hardest task \(84\.1% accuracy\), reflecting the greater heterogeneity and supplier\-dependence of Scope 3 disclosures, while Scope 1 extraction was the most reliable \(91\.2%\)\. The broader objective is not solely automated disclosure extraction, but governance\-aware semantic infrastructure capable of supporting deterministic audit reconstruction and reproducibility\-oriented enterprise climate\-risk intelligence\.

### 5\.2 Temporal Drift Layer

Enterprise ESG reporting environments exhibit substantial temporal instability arising from evolving disclosure standards, delayed supplier reporting, climate\-transition uncertainty, and non\-stationary operational behavior across geographically distributed reporting ecosystems\.

To improve governance\-oriented anomaly detection under dynamically evolving enterprise environments, the proposed framework incorporates temporal drift modeling designed to identify:

- seasonal ESG reporting behavior,
- supplier disclosure lag,
- climate\-volatility propagation,
- confidence degradation trajectories,
- and anomalous reporting instability\.

Temporal instability within ESG infrastructure may additionally emerge through multiple drift mechanisms including concept drift, covariate drift, label drift, and governance degradation behavior\. Concept drift occurs when the relationship between reporting features and anomaly labels, $P(y \mid x)$, changes over time due to changing climate\-risk exposure, regulatory requirements, or enterprise disclosure practices\. Covariate drift emerges when the feature distribution $P(x)$ shifts while that relationship is unchanged, and label drift reflects changes in anomaly prevalence $P(y)$ under changing operational conditions\.

The proposed framework therefore models ESG reporting sequences as time\-indexed governance trajectories designed to support reproducibility\-aware temporal anomaly detection\. Residual\-based temporal instability is represented as:

$$r_t = y_t - \hat{y}_t$$

where $y_t$ denotes observed ESG reporting behavior and $\hat{y}_t$ denotes expected temporal behavior\. Anomalous reporting trajectories are flagged using adaptive threshold estimation:

$$\mathrm{flag}(t) = \mathbb{I}\left(|r_t| > \tau_t\right)$$

where $\tau_t$ denotes dynamic governance\-sensitive anomaly thresholds\.
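A minimal sketch of this residual thresholding follows. Two choices here are assumptions, since the text does not fix them: the expected behavior $\hat{y}_t$ is a trailing moving average (a stand-in for a fitted temporal model), and the adaptive threshold $\tau_t$ is `k` times the standard deviation of the same trailing window.

```python
from statistics import mean, pstdev


def flag_residuals(y, window=4, k=3.0):
    """Return 0/1 flags where |y_t - y_hat_t| exceeds an adaptive threshold."""
    flags = []
    for t, y_t in enumerate(y):
        history = y[max(0, t - window):t]
        if len(history) < window:
            flags.append(0)  # not enough history to estimate y_hat_t or tau_t
            continue
        y_hat = mean(history)         # assumed expected behavior
        tau_t = k * pstdev(history)   # assumed adaptive threshold
        r_t = y_t - y_hat             # residual r_t = y_t - y_hat_t
        flags.append(int(abs(r_t) > tau_t))
    return flags


# Hypothetical reported emissions series with one spike at index 5.
series = [10.0, 10.5, 9.5, 10.0, 10.2, 25.0, 10.1, 9.9]
print(flag_residuals(series))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

Because the spike enters later windows, it inflates the following thresholds; a robust scale estimate or excluding flagged points from the history are possible refinements.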

Comparative experimentation evaluates both SARIMA and SARIMA\-LSTM temporal baselines in order to model seasonal ESG reporting dynamics and non\-linear climate\-risk instability under fragmented disclosure environments\.

The temporal drift layer therefore functions as a governance\-oriented monitoring mechanism supporting audit reconstruction, reporting stability analysis, confidence degradation tracking, and enterprise governance resilience under evolving climate\-risk conditions\.

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig4_drift_heatmap.png)

Figure 3: Temporal governance\-drift heatmap illustrating anomaly\-intensity propagation across enterprise ESG reporting sequences under evolving climate\-risk conditions\.

### 5\.3 Deterministic ESG Orchestration

The orchestration workflow is modeled as a state\-transition structure $(V, E)$, where $V$ denotes orchestration states and $E$ denotes event\-triggered transitions\. Each orchestration transition generates governance\-aware audit metadata supporting reproducibility\-oriented validation workflows\.

Table 5: Deterministic Climate\-Risk Intelligence Pipeline

Algorithm 1: Deterministic ESG Governance Workflow

1. Require: enterprise ESG disclosures $D$, public climate\-risk datasets $C$, governance constraint set $\mathcal{C}$, fixed random seed $s$
2. Ensure: governance\-annotated anomaly labels $\hat{y}$ and reconstructable audit trace $A$
3. Initialize deterministic state with seed $s$ ▷ reproducibility
4. // Ingestion and normalization
5. $D' \leftarrow \mathrm{Ingest}(D)$ ▷ OCR, NER, semantic alignment
6. $X \leftarrow \mathrm{Normalize}(D')$ ▷ schema and provenance alignment
7. $\mathrm{ValidateProvenance}(X)$ ▷ lineage consistency
8. // Enrichment and temporal monitoring
9. $X \leftarrow X \cup \mathrm{ClimateEnrich}(X, C)$
10. $\textit{drift} \leftarrow \mathrm{DetectDrift}(X)$ ▷ residual thresholding, Eq\. in Sec\. [5\.2](https://arxiv.org/html/2606.02604#S5.SS2)
11. // Inference under governance constraints
12. $p \leftarrow \mathrm{EnsembleInfer}(X)$ ▷ imbalance\-aware ensemble
13. $\hat{y} \leftarrow p \cap \mathcal{C}(X)$ ▷ symbolic\-neural governance, Sec\. [5\.5](https://arxiv.org/html/2606.02604#S5.SS5)
14. // Explanation, persistence, escalation
15. $\Phi \leftarrow \mathrm{TreeSHAP}(X, \hat{y})$
16. $A \leftarrow \mathrm{PersistAuditTrace}(X, \textit{drift}, \hat{y}, \Phi)$
17. if $\hat{y}$ contains governance\-critical anomalies then
18. $\mathrm{TriggerGovernanceReview}(A)$
19. end if
20. return $\hat{y}, A$
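One way to realize "each transition generates audit metadata" deterministically is a hash-chained trace, sketched below. This is an assumed design rather than the paper's implementation: the step functions `normalize` and `flag_missing`, the record fields, and the trace schema are hypothetical, and the seed only anchors the chain because the example steps contain no randomness.

```python
import hashlib
import json


def digest(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def run_pipeline(records, steps, seed):
    """Run steps in order, recording one hash-chained audit entry per transition."""
    state, trace = records, []
    prev = digest({"seed": seed})  # genesis hash ties the trace to the seed
    for name, fn in steps:
        out = fn(state)
        entry = {"step": name, "input": digest(state), "output": digest(out), "prev": prev}
        prev = digest(entry)
        trace.append({**entry, "entry_hash": prev})  # no timestamps: replayable
        state = out
    return state, trace


def verify(trace, seed):
    """Check that every entry links to its predecessor and is unmodified."""
    prev = digest({"seed": seed})
    for entry in trace:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev"] != prev or digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


def normalize(records):
    """Hypothetical step: keep the fields of interest in a canonical order."""
    rows = [{"id": r["id"], "scope3": r.get("scope3")} for r in records]
    return sorted(rows, key=lambda r: r["id"])


def flag_missing(records):
    """Hypothetical step: flag records with no Scope 3 value."""
    return [{**r, "flag": int(r["scope3"] is None)} for r in records]


STEPS = [("normalize", normalize), ("flag_missing", flag_missing)]
records = [{"id": 2, "scope3": None}, {"id": 1, "scope3": 120.5}]
result, trace = run_pipeline(records, STEPS, seed=7)
print(result, verify(trace, seed=7))
```

Replaying the same inputs with the same seed yields a byte-identical trace, and altering any entry breaks verification from that point onward.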

### 5\.4 Proposed Graph\-Based Provenance Architecture

This subsection specifies a proposed architectural extension\. The graph\-based provenance component is architecturally defined and dimensioned but is not empirically evaluated in this paper; its validation is left for future work \(Section [12](https://arxiv.org/html/2606.02604#S12)\)\. The statistics in Table [6](https://arxiv.org/html/2606.02604#S5.T6) describe the design target of the proposed provenance graph for the synthetic benchmark, not measured experimental outputs\.

Enterprise ESG reporting environments exhibit highly interconnected disclosure dependencies involving suppliers, subsidiaries, climate\-risk entities, audit events, reporting pipelines, and reconciliation workflows\.

The proposed extension positions provenance validation as a graph\-structured governance problem in which enterprise reporting entities and disclosure relationships are represented as:

$$\mathcal{G} = (\mathcal{V}, \mathcal{E})$$

where $\mathcal{V}$ denotes ESG entities including suppliers, disclosures, audit states, and climate\-risk observations, and $\mathcal{E}$ denotes provenance\-aware reporting relationships and governance dependencies\.

Table 6: Proposed Graph Provenance Infrastructure \(design targets, not measured\)

The design targets indicate the large\-scale relational complexity associated with fragmented ESG reporting environments involving supplier dependencies, climate\-risk propagation pathways, audit relationships, and governance escalation workflows\. The proposed provenance graph infrastructure is intended to enable lineage\-aware anomaly reconstruction, governance\-sensitive dependency tracing, and temporal relationship analysis under heterogeneous enterprise climate\-risk environments\.

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig6_temporal_provenance_graph.jpg)

Figure 4: Proposed temporal provenance\-aware orchestration graph illustrating intended anomaly propagation, governance degradation, and deterministic audit reconstruction under fragmented ESG reporting environments\.

Anomalous reporting behavior may propagate across interconnected supplier and governance subgraphs through delayed disclosures, inconsistent climate\-risk declarations, provenance conflicts, and temporally evolving reporting dependencies\. These propagation dynamics motivate graph\-aware orchestration semantics capable of reconstructing relational disclosure lineage under fragmented enterprise reporting environments\. In particular, provenance\-aware graph reconstruction is intended to improve anomaly traceability, audit replay consistency, governance escalation analysis, and temporal disclosure dependency modeling under non\-stationary ESG reporting conditions characterized by evolving supplier relationships and climate\-transition uncertainty\.

The proposed provenance graph evolves temporally under changing supplier relationships, delayed reporting sequences, governance escalation workflows, and climate\-transition dependencies\. To support temporal provenance analysis, the orchestration workflow is designed to maintain time\-indexed graph snapshots enabling reconstruction of evolving disclosure dependencies and governance\-sensitive anomaly propagation pathways across heterogeneous reporting ecosystems\. Under fragmented reporting environments, graph\-structured reasoning enables modeling of supplier dependency propagation, disclosure lineage, climate\-exposure linkage, audit reconstruction pathways, and temporal governance transitions\.

A graph\-message propagation abstraction is represented as:

$$h_v^{(l+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v)} W^{(l)} h_u^{(l)}\right)$$
where $h_v^{(l)}$ denotes node representations, $\mathcal{N}(v)$ denotes neighboring governance entities, and $W^{(l)}$ denotes learnable propagation parameters.
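As an illustration only, one layer of the propagation rule above can be sketched in NumPy. The adjacency-list format, the choice of ReLU for $\sigma$, and the name `propagate` are assumptions of this sketch; no graph model is evaluated in this study.

```python
import numpy as np

def propagate(h, neighbors, W, activation=lambda z: np.maximum(z, 0.0)):
    """One layer of h_v^{(l+1)} = sigma(sum_{u in N(v)} W h_u^{(l)}).

    h          : (num_nodes, d_in) array of node representations
    neighbors  : dict mapping node index -> list of neighbor indices (assumed format)
    W          : (d_out, d_in) propagation matrix
    activation : elementwise nonlinearity; ReLU is an assumption of this sketch
    """
    h_next = np.zeros((h.shape[0], W.shape[0]))
    for v in range(h.shape[0]):
        agg = np.zeros(W.shape[0])
        for u in neighbors.get(v, []):
            agg += W @ h[u]  # the sum runs over neighbors only, as in the formula
        h_next[v] = activation(agg)
    return h_next

# Toy graph: a reporting entity (node 0) linked to two suppliers (nodes 1 and 2).
h0 = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]])
nbrs = {0: [1, 2], 1: [0], 2: [0]}
h1 = propagate(h0, nbrs, np.eye(2))
```

With the identity matrix as `W`, node 0 simply receives the rectified sum of its two suppliers' vectors, which makes the neighbor-only aggregation easy to inspect by hand.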

Full graph\-neural optimization remains outside the scope of the present study; graph\-structured provenance reasoning provides a future pathway toward governance\-aware anomaly propagation and enterprise audit intelligence\.

### 5\.5Symbolic\-Neural Governance Constraints

Statistical anomaly\-detection systems alone remain insufficient within regulated enterprise environments because governance workflows frequently require deterministic rule enforcement and audit\-consistent operational constraints\. The proposed framework therefore combines statistical learning procedures with symbolic governance validation mechanisms\.

Governance constraints are represented as:

$$\mathcal{C}(x_i) \rightarrow \{0,1\}$$
where $\mathcal{C}$ denotes governance-oriented validation constraints and binary outputs denote compliance-consistent workflow states. The resulting governance-aware prediction workflow is represented as:

$$y_i = f_\theta(x_i) \cap \mathcal{C}(x_i)$$
where $f_\theta(x_i)$ denotes statistical model inference and $\mathcal{C}(x_i)$ denotes deterministic governance enforcement. This formulation integrates probabilistic anomaly scoring, provenance-aware validation, symbolic audit constraints, and governance-oriented orchestration semantics.

In operational enterprise environments, governance constraints may include mandatory emissions\-field validation, provenance completeness requirements, regulatory disclosure consistency checks, supplier\-reporting dependency validation, and audit\-sensitive escalation triggers\. The deterministic orchestration workflow therefore combines probabilistic anomaly inference with rule\-based governance enforcement to reduce unresolved disclosure conflicts, improve audit reconstruction consistency, and support reproducibility\-oriented climate\-risk validation under fragmented reporting conditions\.
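One possible reading of the combined workflow is sketched below: the model output is binarized into an accept decision and the intersection is treated as a logical AND with the rule outcome. This interpretation, the field names, the 0.5 threshold, and the two example rules are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Constraint = Callable[[Record], bool]

def has_required_emissions(x: Record) -> bool:
    # Mandatory emissions-field validation (field names are hypothetical).
    return all(x.get(k) is not None for k in ("scope1", "scope2", "scope3"))

def has_provenance(x: Record) -> bool:
    # Provenance completeness: a source identifier must be present.
    return x.get("source_id") is not None

def governed_decision(anomaly_score: float, x: Record,
                      constraints: List[Constraint],
                      threshold: float = 0.5) -> int:
    """Return 1 only if the model accepts the record AND every rule passes."""
    model_accepts = anomaly_score < threshold    # f_theta(x_i), binarized
    rules_pass = all(c(x) for c in constraints)  # C(x_i) in {0, 1}
    return int(model_accepts and rules_pass)

rules = [has_required_emissions, has_provenance]
rec_ok = {"scope1": 10.0, "scope2": 4.0, "scope3": 120.0, "source_id": "erp-7"}
rec_gap = {"scope1": 10.0, "scope2": None, "scope3": 120.0, "source_id": "erp-7"}
```

Under this reading a low anomaly score cannot override a failed rule, so `rec_gap` is rejected regardless of the model output.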

### 5\.6ESG Record Standardization

Enterprise ESG records are transformed into unified schema representations:

$$x_i^{\prime} = f_{norm}(x_i)$$
where $x_i$ denotes raw ESG records and $x_i^{\prime}$ denotes normalized representations.
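A minimal sketch of what such a normalization map might look like is given below, restricted to a single Scope 1 field. The alias table, the unit factors, and the `NormalizedRecord` schema are hypothetical and are not taken from the released benchmark.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Hypothetical alias table and unit factors; real reporting schemas will differ.
FIELD_ALIASES = {"scope_1": "scope1", "Scope1": "scope1", "s1_emissions": "scope1"}
TO_TONNES = {"t": 1.0, "kg": 1e-3, "kt": 1e3}

@dataclass(frozen=True)
class NormalizedRecord:
    entity_id: str
    year: int
    scope1_tco2e: Optional[float]

def f_norm(raw: Dict[str, Any]) -> NormalizedRecord:
    """Map a raw record onto one schema: canonical field names, tonnes CO2e."""
    canon = {FIELD_ALIASES.get(k, k): v for k, v in raw.items()}
    unit = str(canon.get("unit", "t")).lower()
    if unit not in TO_TONNES:
        raise ValueError(f"unknown unit: {unit!r}")
    value = canon.get("scope1")
    # Missing values stay missing; imputation is a separate concern.
    scope1 = None if value is None else float(value) * TO_TONNES[unit]
    return NormalizedRecord(str(canon["entity_id"]).strip(), int(canon["year"]), scope1)

rec = f_norm({"entity_id": " ACME ", "year": "2023", "s1_emissions": "2500", "unit": "kg"})
```

Keeping missing values as `None` rather than zero preserves the distinction between an unreported field and a reported zero, which matters for the missing-value checks used later.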

### 5\.7Imbalance\-Aware Learning

SMOTE-based oversampling Chawla et al. ([2002](https://arxiv.org/html/2606.02604#bib.bib2)) is incorporated to improve minority-event representation:

$$x_{new} = x_i + \lambda (x_{nn} - x_i)$$
where $x_i$ denotes minority samples, $x_{nn}$ denotes neighboring minority observations, and $\lambda \sim U(0,1)$. Ensemble prediction is represented as:

$$p(x) = \sum_{k=1}^{K} w_k f_k(x)$$
where $f_k$ denotes ensemble learners and $w_k$ denotes ensemble weights.
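The interpolation rule and the weighted ensemble above can be sketched in NumPy as follows. This is a didactic version, not the implementation used in the experiments: it builds a dense pairwise distance matrix, so it only suits small examples, and the normalization of the weights to sum to one is an assumption.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Dense pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                          # a point is not its own neighbor
    nn_idx = np.argsort(d, axis=1)[:, :k]                # k nearest minority neighbors
    base = rng.integers(0, n, size=n_new)                # indices of x_i
    pick = nn_idx[base, rng.integers(0, k, size=n_new)]  # indices of x_nn
    lam = rng.random((n_new, 1))                         # lambda ~ U(0, 1)
    return X_min[base] + lam * (X_min[pick] - X_min[base])

def ensemble_predict(probas, weights):
    """p(x) = sum_k w_k f_k(x), with weights rescaled to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(probas, dtype=float), axes=1)

X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(X_minority, n_new=10, k=2, rng=0)
blended = ensemble_predict([[0.2, 0.8], [0.6, 0.4]], weights=[1, 3])
```

Every synthetic point lies on a segment between two existing minority points, so the oversampled class never leaves the convex hull of the observed minority samples.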

#### Choice of resampling ratio\.

SMOTE is applied to the training partition only, and never to validation or test partitions, in order to preserve evaluation integrity and the true 4.7% anomaly prevalence at test time. We oversample minority instances to full balance (1:1) as our primary configuration, and we report a sensitivity analysis over target ratios $\{5{:}1, 2{:}1, 1{:}1\}$ in Section [7](https://arxiv.org/html/2606.02604#S7). Full balancing was selected because, on the validation partition, it maximized minority-event recall without a statistically significant degradation in precision; more conservative ratios reduced recall on the rare governance-critical classes that motivate this work. Because resampling is confined to training folds, the reported test metrics reflect performance under the original imbalanced distribution.
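To give a sense of scale, the number of synthetic minority samples implied by each target ratio can be estimated as below. The fold size is an approximation (four fifths of roughly 68,000 records at 4.7% prevalence), and `synthetic_needed` is a helper introduced only for this illustration.

```python
import math

def synthetic_needed(n_majority: int, n_minority: int, ratio: float) -> int:
    """Synthetic minority samples needed to reach majority:minority = ratio:1."""
    target_minority = math.ceil(n_majority / ratio)
    return max(0, target_minority - n_minority)

# Approximate training fold: 4/5 of ~68,000 records at 4.7% anomaly prevalence.
n_train = 54_400
n_min = round(n_train * 0.047)  # about 2,557 minority records
n_maj = n_train - n_min         # about 51,843 majority records
needed = {r: synthetic_needed(n_maj, n_min, r) for r in (5, 2, 1)}
```

Under these approximate counts, full balancing adds roughly 49,000 synthetic records per training fold, against roughly 8,000 for the 5:1 target, while the test fold is left untouched.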

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig2_smote.png)
Figure 5: Class-distribution comparison before and after SMOTE-based minority oversampling on the training partition under governance-sensitive ESG anomaly environments.

### 5\.8Adversarial Reporting Robustness

Enterprise ESG reporting systems remain vulnerable to manipulated disclosure values, delayed emissions reporting, fabricated sustainability claims, provenance inconsistency, and adversarial omission behavior\. To improve governance\-oriented resilience, the proposed framework incorporates anomaly\-sensitive validation procedures designed to identify reporting irregularities under fragmented enterprise environments\.

Adversarial perturbations are represented as:

$$x_i^{adv} = x_i + \delta$$
where $x_i$ denotes enterprise ESG observations and $\delta$ denotes adversarial perturbation behavior. The deterministic orchestration workflow therefore prioritizes provenance traceability, anomaly reconstruction, governance consistency, and audit-aware reporting resilience.

### 5\.9Explainability and Governance

SHAP\-based explainability analysis estimates feature contribution behavior through marginal Shapley attribution:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$$
where $F$ denotes the feature space, $S$ denotes feature subsets, and $\phi_i$ measures marginal feature contribution. Because exact evaluation of this expression is exponential in $|F|$ and therefore intractable for the 231-feature space used here, we use TreeSHAP Lundberg et al. ([2020](https://arxiv.org/html/2606.02604#bib.bib19)), which computes exact Shapley values for tree-ensemble models in low-order polynomial time. Outputs are mapped to provenance confidence, reconciliation severity, audit reconstruction states, and governance-oriented review workflows.
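To make the attribution formula concrete, the sketch below evaluates it by brute-force subset enumeration on a three-feature toy value function. This is not TreeSHAP: it is the exponential-time definition, usable only for tiny feature sets, and the feature names and `toy_value` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset S of F without i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! from the formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                S = frozenset(subset)
                total += weight * (value(S | {i}) - value(S))
        phi[i] = total
    return phi

def toy_value(S):
    # Hypothetical model output: two additive effects plus one interaction.
    v = 0.0
    if "lag" in S:
        v += 2.0
    if "nulls" in S:
        v += 1.0
    if "lag" in S and "nulls" in S:
        v += 1.0
    return v

phi = shapley_values(["lag", "nulls", "hazard"], toy_value)
```

The interaction term is split equally between the two interacting features, the unused feature receives zero, and the attributions sum to the difference between the full-set and empty-set values.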

### 5\.10Computational Characteristics

Table 7: Approximate Computational Characteristics

where $k$ denotes ensemble learners, $d$ denotes feature dimensionality, $T$ denotes the number of trees, $L$ denotes maximum tree depth, $D$ denotes the maximum number of leaves, $n$ denotes temporal sequence length, and $h$ denotes hidden-state dimensionality.

The computational analysis highlights the trade\-off between governance\-aware auditability, temporal anomaly sensitivity, explainability overhead, and operational scalability under fragmented enterprise ESG reporting environments\. Although the proposed framework prioritizes reproducibility\-oriented governance engineering over raw inference throughput, the orchestration workflow remains computationally tractable under bounded enterprise deployment conditions\.

The deterministic orchestration workflow was additionally designed to support modular deployment across distributed enterprise validation environments involving heterogeneous reporting infrastructures, asynchronous supplier disclosures, and governance\-sensitive audit pipelines\. Operational scalability therefore depends not solely on model inference throughput, but also on orchestration replayability, provenance persistence, and audit\-trace reconstruction under evolving enterprise reporting conditions\.

### 5\.11Reproducibility Protocol

Reproducibility remains a foundational requirement for trustworthy climate-risk intelligence systems operating under governance-sensitive enterprise environments Pineau et al. ([2021](https://arxiv.org/html/2606.02604#bib.bib9)). The proposed framework therefore incorporates reproducibility-aware orchestration semantics designed to improve audit consistency, deterministic workflow reconstruction, and governance-oriented experiment traceability.

To reduce operational variability across repeated experimentation procedures, the workflow incorporates deterministic seed control, immutable experiment logging, orchestration replayability, dataset hashing, governance lineage persistence, and audit\-trace versioning\.

Each orchestration transition generates structured governance metadata supporting deterministic replay of validation workflows under repeated execution conditions\. Dataset integrity is additionally preserved through cryptographic hashing enabling reproducibility\-aware dataset verification and lineage inspection across experimental environments\. The framework further incorporates version\-controlled orchestration states designed to improve audit reconstruction, provenance consistency, experiment reproducibility, and governance\-aware operational inspection\. Deterministic orchestration replayability additionally enables reconstruction of enterprise anomaly\-validation pathways under historical reporting conditions\.
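Two of these mechanisms, dataset hashing and tamper-evident transition logging, can be sketched with the Python standard library. The canonical-JSON serialization, the hash-chained log layout, and all function names are assumptions of this sketch rather than a description of the released code.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry in a chain

def dataset_fingerprint(rows):
    """SHA-256 over a canonical JSON serialization of the rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def _entry_hash(stage, payload, prev):
    body = json.dumps({"stage": stage, "payload": payload, "prev": prev},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(log, stage, payload):
    """Append a transition whose hash also covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"stage": stage, "payload": payload, "prev": prev,
                "hash": _entry_hash(stage, payload, prev)})
    return log

def verify(log):
    """Recompute the chain; any edited entry breaks every later link."""
    prev = GENESIS
    for entry in log:
        expected = _entry_hash(entry["stage"], entry["payload"], prev)
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Sorting keys before hashing makes the fingerprint independent of dictionary key order, while row order still matters; whether row order should be significant is a design choice left open here.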

The broader objective is not solely predictive optimization, but reproducibility\-oriented governance engineering capable of supporting trustworthy enterprise AI infrastructure under fragmented ESG reporting environments\.

## 6Experimental Setup

### 6\.1Baselines

The proposed governance-oriented climate-risk intelligence framework was comparatively evaluated against statistical, ensemble-learning, anomaly-detection, and temporal forecasting baselines under the synthetic ESG benchmark, characterized by sparse anomaly distributions and provenance-sensitive validation conditions. To improve minority-event representation during training, SMOTE-based oversampling was applied exclusively to training folds in order to reduce imbalance-induced optimization instability while preserving evaluation integrity Chawla et al. ([2002](https://arxiv.org/html/2606.02604#bib.bib2)).

The comparative experimentation framework evaluated:

- Statistical / ensemble classifiers: Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost.
- Unsupervised anomaly detection: Isolation Forest.
- Temporal forecasting baselines: SARIMA and SARIMA-LSTM, applied to the residual-thresholding formulation of Section [5.2](https://arxiv.org/html/2606.02604#S5.SS2).
- Threshold-based ESG validation system: a rule-based reconciliation baseline that flags a record as anomalous when any monitored field violates a fixed acceptance band (defined below).

#### Threshold\-based validation baseline\.

The threshold baseline encodes conventional reconciliation practice\. A record is flagged when \(i\) a Scope 1–3 emissions field deviates from its sector\-year median by more than three median\-absolute\-deviations, \(ii\) the per\-record missing\-value ratio exceeds 25%, or \(iii\) the provenance\-confidence band falls below a fixed acceptance value\. Thresholds were tuned on the validation partition to maximize F1\. This baseline contains no learned interaction terms and therefore isolates the contribution of learned ensemble structure\.
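The three rules can be sketched as a single function. Field names, the `peer_values` layout, and in particular the `min_confidence` default are placeholders: the text does not state the provenance acceptance value, and the tuned thresholds are not reproduced here.

```python
from statistics import median

def mad(values):
    """Median absolute deviation around the median."""
    m = median(values)
    return median(abs(v - m) for v in values)

def threshold_flag(record, peer_values, emission_fields,
                   mad_k=3.0, max_missing=0.25, min_confidence=0.5):
    """Flag a record if any of the three reconciliation rules fires.

    peer_values maps each emissions field to the values observed for the same
    sector and year. min_confidence is a placeholder acceptance value.
    """
    # (i) deviation from the sector-year median beyond mad_k MADs
    for field in emission_fields:
        x = record.get(field)
        if x is None:
            continue
        center, spread = median(peer_values[field]), mad(peer_values[field])
        if abs(x - center) > mad_k * spread:
            return True
    # (ii) per-record missing-value ratio
    missing = sum(v is None for v in record.values()) / len(record)
    if missing > max_missing:
        return True
    # (iii) provenance-confidence band below the acceptance value
    conf = record.get("provenance_confidence")
    return conf is not None and conf < min_confidence

peers = {"scope1": [10, 11, 12, 13, 14]}  # median 12, MAD 1
normal = {"scope1": 12.5, "scope2": 5.0, "scope3": 1.0, "provenance_confidence": 0.9}
outlier = dict(normal, scope1=20.0)
sparse = {"scope1": 12.0, "scope2": None, "scope3": None, "provenance_confidence": 0.9}
```

Each rule inspects one field or one ratio in isolation, which is why this baseline cannot capture interactions such as a moderate deviation combined with low provenance confidence.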

### 6\.2Evaluation Protocol

All supervised models are evaluated using stratified five-fold cross-validation, preserving the 4.7% anomaly prevalence in every fold. SMOTE is fit on the training folds only. For each metric we report the mean and standard deviation across folds. To assess whether the strongest model improves on the strongest baseline, we apply a paired Wilcoxon signed-rank test across per-fold scores and report the resulting $p$-value. For the temporal baselines, fold construction respects chronological order so that future observations never inform past predictions; random splitting alone is avoided to reduce information leakage under temporal reporting environments.
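The paired test on per-fold scores can be illustrated with an exact enumeration of the signed-rank null distribution. The text does not state whether the test is one-sided or two-sided; this sketch assumes the one-sided alternative, and the fold scores below are invented for illustration and are not the paper's per-fold values.

```python
from itertools import product

def wilcoxon_exact_greater(x, y):
    """Exact one-sided signed-rank p-value for H1: x tends to exceed y.

    Assumes no zero differences and no tied absolute differences.
    """
    d = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    # Under H0 every sign pattern is equally likely: enumerate all 2^n of them.
    as_extreme = sum(
        1 for signs in product((0, 1), repeat=len(d))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return as_extreme / 2 ** len(d)

# Invented per-fold F1 scores for a model and a baseline (five folds).
f1_model = [0.71, 0.73, 0.70, 0.72, 0.69]
f1_baseline = [0.60, 0.66, 0.57, 0.64, 0.64]
p_value = wilcoxon_exact_greater(f1_model, f1_baseline)
```

With only five paired folds there are 32 sign patterns, so the smallest attainable one-sided p-value is 1/32 (about 0.031) and the smallest two-sided value is 2/32 (0.0625). A result below 0.05 therefore requires the one-sided test and all five folds favoring the model. `scipy.stats.wilcoxon(x, y, alternative="greater")` provides the same test in library form.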

#### Hardware and software environment\.

All experiments were run on a single workstation with an Intel Xeon W-2295 (18 cores, 3.0 GHz) and 128 GB RAM, no GPU, under Ubuntu 22.04, Python 3.11, scikit-learn 1.4, XGBoost 2.0, LightGBM 4.3, CatBoost 1.2, and the `shap` 0.45 TreeExplainer. Reported runtimes (Table [10](https://arxiv.org/html/2606.02604#S7.T10)) are measured on the full $\sim$68k-record synthetic benchmark.

## 7Experimental Results

Comparative experimentation evaluated deterministic orchestration workflows on the synthetic ESG benchmark involving sparse anomaly distributions, provenance\-sensitive validation conditions, and temporally unstable disclosure behavior\. Table[8](https://arxiv.org/html/2606.02604#S7.T8)reports the mean and standard deviation of each metric across stratified five\-fold cross\-validation\.

Table 8: Comparative Performance Evaluation (mean $\pm$ std over stratified 5-fold CV)

#### Significance.

XGBoost achieved the highest recall ($0.74 \pm 0.03$), F1 ($0.71 \pm 0.02$), and ROC-AUC ($0.93 \pm 0.02$). A paired Wilcoxon signed-rank test across folds confirms that XGBoost significantly exceeds the strongest non-gradient-boosting baseline, SARIMA-LSTM, on F1 ($p < 0.05$) and on recall ($p < 0.05$). Relative to the threshold baseline, XGBoost improves recall by 38 percentage points (0.74 vs. 0.36) and ROC-AUC by 0.27 (0.93 vs. 0.66). Relative to Logistic Regression, XGBoost improves recall by 33 percentage points (0.74 vs. 0.41).

Table 9: SMOTE Target-Ratio Sensitivity (XGBoost, mean over 5-fold CV)

The sensitivity analysis in Table [9](https://arxiv.org/html/2606.02604#S7.T9) shows that moving from mild (5:1) to full (1:1) oversampling increases recall from 0.61 to 0.74 while precision declines from 0.78 to 0.69, so F1 changes only slightly (0.68 to 0.71). Because governance-critical false negatives are the dominant operational cost in this setting, the 1:1 configuration is preferred: it maximizes recall on rare anomalies at an acceptable precision cost.
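As a rough consistency check, the reported F1 values can be compared with the harmonic mean of the reported precision and recall. This treats the fold-averaged precision and recall as a single operating point, which is an approximation, since the mean of per-fold F1 scores need not equal the F1 of the mean precision and recall.

```latex
F_1 = \frac{2PR}{P + R}, \qquad
\text{5:1: } \frac{2(0.78)(0.61)}{0.78 + 0.61} = \frac{0.9516}{1.39} \approx 0.68, \qquad
\text{1:1: } \frac{2(0.69)(0.74)}{0.69 + 0.74} = \frac{1.0212}{1.43} \approx 0.71
```

Both values agree with the reported figures to two decimal places, so the recall gain and the precision loss roughly offset each other in F1.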

Table 10: Operational Runtime Characteristics (full 68k-record benchmark; see Section [6.2](https://arxiv.org/html/2606.02604#S6.SS2) for hardware)

### 7\.1Audit Trace Completeness

Because the central claim of this work is deterministic, replayable auditability rather than predictive accuracy alone, we introduce a governance-oriented metric that directly measures audit quality. We define *audit trace completeness* (ATC) as the fraction of flagged anomalies for which a complete deterministic provenance chain, spanning the four stages source $\rightarrow$ transformation $\rightarrow$ detection $\rightarrow$ escalation, can be reconstructed from persisted orchestration metadata:

$$\mathrm{ATC} = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \mathbb{I}\big(\textsc{ChainComplete}(a)\big)$$
where $\mathcal{A}$ is the set of flagged anomalies and $\textsc{ChainComplete}(a)$ returns true when all four provenance stages for anomaly $a$ are present and consistent in the audit trace.
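The metric can be sketched as follows. The trace layout (one dictionary per flagged anomaly, keyed by stage) and the parent-link rule used to stand in for "consistent" are assumptions of this sketch; the persisted metadata format is not specified in the text.

```python
STAGES = ("source", "transformation", "detection", "escalation")

def chain_complete(trace):
    """True when all four stages are present and linked in order.

    `trace` maps a stage name to {"id": ..., "parent": ...}. Requiring each
    stage's parent to equal the previous stage's id is an assumed notion of
    consistency, chosen for illustration.
    """
    if any(stage not in trace for stage in STAGES):
        return False
    for prev, cur in zip(STAGES, STAGES[1:]):
        if trace[cur].get("parent") != trace[prev].get("id"):
            return False
    return True

def audit_trace_completeness(traces):
    """ATC: fraction of flagged anomalies with a complete provenance chain."""
    if not traces:
        raise ValueError("ATC is undefined when no anomalies are flagged")
    return sum(chain_complete(t) for t in traces) / len(traces)

full = {"source": {"id": "s1", "parent": None},
        "transformation": {"id": "t1", "parent": "s1"},
        "detection": {"id": "d1", "parent": "t1"},
        "escalation": {"id": "e1", "parent": "d1"}}
no_escalation = {k: v for k, v in full.items() if k != "escalation"}
broken_link = dict(full, detection={"id": "d1", "parent": "t9"})
```

Raising an error on an empty set of flagged anomalies avoids reporting a misleading score of zero or one when nothing was flagged.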

Table 11: Audit Trace Completeness with and without Deterministic Orchestration

Under the full deterministic orchestration workflow, 94.7% of flagged anomalies admit a complete four-stage provenance chain, with a mean reconstruction depth of 3.9 of 4 stages. Removing orchestration replay reduces completeness to 61.2% and mean depth to 2.4 stages. This isolates the contribution of deterministic orchestration to auditability and provides a concrete, reproducible measure of the governance-engineering objective that distinguishes this framework from prediction-only ESG pipelines.

### 7\.2Discussion

SMOTE\-based minority oversampling improved governance\-anomaly sensitivity by reducing class\-distribution imbalance during ensemble optimization; on the validation partition it raised minority\-event recall under sparse anomaly conditions involving delayed supplier disclosures, provenance inconsistencies, and fragmented Scope 1–3 reporting structures\. Because false negatives in this setting drive unresolved disclosure inconsistencies, hidden climate\-risk exposure, and reduced audit reconstruction reliability, imbalance\-aware optimization is important for governance\-oriented ESG validation\.

Gradient\-boosting ensembles outperformed the threshold baseline and the classical statistical baselines across all reported metrics\. XGBoost and CatBoost in particular delivered the highest minority\-event recall and calibration stability under provenance\-sensitive reporting conditions involving delayed supplier disclosures, heterogeneous Scope 1–3 reporting structures, and climate\-risk volatility propagation\. This advantage follows from the ability of gradient\-boosting architectures to model nonlinear feature interactions across temporally unstable ESG environments\.

The 33\-percentage\-point recall gap between Logistic Regression \(0\.41\) and XGBoost \(0\.74\) indicates that linear decision boundaries are insufficient for modeling the climate\-risk dependencies present in this benchmark — provenance inconsistency, disclosure lag, confidence degradation, and multi\-source reconciliation conflicts\. Feature\-interaction complexity was most pronounced under fragmented Scope 3 reporting, where delayed supplier disclosures and heterogeneous reporting standards produced the highest anomaly uncertainty \(consistent with the 84\.1% Scope 3 parsing accuracy in Table[4](https://arxiv.org/html/2606.02604#S5.T4)\)\. These findings support governance\-aware ensemble architectures capable of modeling high\-dimensional reconciliation dependencies under non\-stationary enterprise climate\-risk conditions\.

## 8Explainability Analysis

TreeSHAP-based explainability analysis Lundberg and Lee ([2017](https://arxiv.org/html/2606.02604#bib.bib6)); Lundberg et al. ([2020](https://arxiv.org/html/2606.02604#bib.bib19)) computed on the held-out test folds identified the strongest governance-anomaly drivers as provenance inconsistency, supplier reporting latency, reporting-confidence degradation, missing-value (null-inflation) density, and regional climate-risk intensity. These are governance and provenance features rather than generic financial ratios, consistent with the failure-mode taxonomy of Table [2](https://arxiv.org/html/2606.02604#S4.T2).

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/fig3_shap.png)
Figure 6: TreeSHAP governance-anomaly attribution (beeswarm) for the XGBoost model, computed on the held-out test folds. Features are ESG governance and provenance signals, namely provenance-consistency score, supplier reporting-lag (days), reporting-confidence band, null-inflation density, and regional climate-hazard index, ordered by mean absolute SHAP value.

The explainability workflow supported governance-aware attribution inspection across temporally unstable ESG environments involving provenance inconsistency, supplier volatility, delayed reporting behavior, and fragmented climate-risk disclosures. SHAP attribution trajectories enabled identification of feature-interaction behavior associated with governance escalation, anomaly propagation, reconciliation instability, and audit-sensitive reporting inconsistencies.

Unlike passive feature\-importance estimation, the proposed framework positions explainability as an operational governance mechanism integrated directly within deterministic orchestration workflows supporting audit reconstruction, anomaly prioritization, and provenance\-aware validation inspection\. The framework prioritizes attribution stability and explanation reproducibility because unstable interpretability degrades audit reliability, governance escalation consistency, and operational anomaly reconstruction under fragmented enterprise reporting conditions\.

Stable TreeSHAP attribution therefore improves audit reproducibility, governance inspection consistency, deterministic anomaly reconstruction, provenance\-aware traceability, and operational validation reliability\.

## 9Calibration Analysis

Calibration reliability is particularly important within governance\-oriented enterprise environments because false confidence degrades audit prioritization, anomaly escalation consistency, and operational governance decision\-making under fragmented ESG reporting conditions\. Under governance\-sensitive infrastructures, calibration can be more operationally significant than raw predictive accuracy, because governance workflows depend on confidence\-aware anomaly prioritization rather than binary classification alone\. Poorly calibrated systems produce unstable governance escalation behavior despite strong aggregate predictive performance\.

Expected Calibration Error \(ECE\) is represented as:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
where $B_m$ denotes confidence bins, $\mathrm{acc}(B_m)$ denotes empirical bin accuracy, $\mathrm{conf}(B_m)$ denotes average prediction confidence, and $n$ denotes total observations.
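A minimal NumPy sketch of the ECE computation, together with the Brier score, is shown below. It assumes equal-width bins over the confidence of the predicted class (the top-label variant) and a 0.5 decision threshold; the text does not specify the binning scheme or the number of bins used in the experiments.

```python
import numpy as np

def expected_calibration_error(y_true, p_pos, n_bins=10):
    """ECE with equal-width bins over predicted-class confidence."""
    y_true = np.asarray(y_true, dtype=int)
    p_pos = np.asarray(p_pos, dtype=float)
    pred = (p_pos >= 0.5).astype(int)
    conf = np.where(pred == 1, p_pos, 1.0 - p_pos)  # confidence in the predicted class
    correct = (pred == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_id = np.digitize(conf, edges[1:-1], right=True)  # bins of the form (a, b]
    ece = 0.0
    for m in range(n_bins):
        mask = bin_id == m
        if mask.any():
            # mask.mean() is |B_m| / n; the gap is |acc(B_m) - conf(B_m)|.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

def brier_score(y_true, p_pos):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    return float(np.mean((p_pos - y_true) ** 2))

ece_ok = expected_calibration_error([1, 1, 0, 0], [0.9, 0.9, 0.1, 0.1])
ece_over = expected_calibration_error([1, 1, 1, 0], [0.9, 0.9, 0.9, 0.9])
```

In the first example every prediction is correct at confidence 0.9, so the gap is 0.1 (underconfidence); in the second, three of four predictions are correct at the same confidence, giving a gap of 0.15 (overconfidence).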

![Refer to caption](https://arxiv.org/html/2606.02604v1/figures/calibration_curve.png)
Figure 7: Calibration reliability curves comparing probabilistic confidence alignment across governance-sensitive ESG anomaly detection models on the held-out test folds.

The framework evaluates calibration alongside conventional predictive metrics to improve trustworthy anomaly assessment, because poorly calibrated anomaly probabilities produce unstable governance escalation despite strong predictive accuracy Guo et al. ([2017](https://arxiv.org/html/2606.02604#bib.bib7)). Gradient-boosting ensembles achieved stronger calibration consistency under sparse anomaly conditions than the threshold baseline and the classical statistical baselines. Brier-score and ECE analysis indicated improved confidence stability under provenance-aware orchestration workflows involving temporal anomaly detection and governance-sensitive validation.

The broader objective is therefore not solely predictive optimization, but confidence\-aware governance engineering supporting operationally reliable climate\-risk intelligence under uncertain enterprise reporting environments\. This positions confidence estimation not as a statistical post\-processing step, but as a governance\-critical operational requirement for trustworthy climate\-risk intelligence systems\.

## 10Ablation Analysis

Ablation experiments evaluated the contribution of deterministic orchestration, provenance\-aware validation, the explainability layer, climate\-risk enrichment, and SMOTE\-based imbalance optimization\. Recall and ECE deltas are measured relative to the full XGBoost pipeline; governance stability is a qualitative summary of audit reconstruction behavior\.

Table 12: Ablation Impact on Governance-Oriented ESG Validation

#### Interpreting the explainability ablation.

TreeSHAP is a post-hoc attribution method and does not alter the model’s predictions; removing it therefore leaves recall essentially unchanged ($\approx 0\%$). Its effect is on the *governance* dimension: without the explainability layer, audit interpretability and escalation-justification consistency degrade (governance stability drops to “Weak”) and downstream calibration-aware review is less reliable, which is reflected in the small ECE increase from reduced human-in-the-loop correction. The recall-affecting components are SMOTE (largest effect, $-18\%$), the provenance layer ($-11\%$), and the drift layer ($-9\%$).

Removing provenance\-aware orchestration reduced audit traceability and governance reconstruction consistency\. Removing SMOTE reduced minority\-event sensitivity under sparse anomaly conditions\. Removing climate\-risk enrichment reduced anomaly\-calibration stability under regional hazard drift\. Removing the explainability layer reduced audit interpretability consistency under provenance\-sensitive anomaly conditions\. Removing deterministic orchestration replay reduced governance replayability and weakened lineage reconstruction across delayed reporting sequences and cross\-source reconciliation workflows — the same effect quantified by the audit\-trace\-completeness drop in Table[11](https://arxiv.org/html/2606.02604#S7.T11)\.

## 11Ethical Considerations

Several governance\-oriented ethical considerations remain relevant within climate\-risk intelligence systems, including disclosure inequality across regions and sectors, proxy\-feature estimation bias, under\-reporting within fragmented supplier ecosystems, climate uncertainty propagation, and over\-reliance on automated governance workflows\.

Recent trustworthy-AI governance frameworks emphasize operational accountability, human oversight, transparency, and governance-oriented risk management under high-impact enterprise environments National Institute of Standards and Technology ([2023](https://arxiv.org/html/2606.02604#bib.bib10)); Organisation for Economic Co-operation and Development ([2022](https://arxiv.org/html/2606.02604#bib.bib11)). The proposed framework therefore positions deterministic orchestration as decision-support infrastructure requiring continued human governance oversight rather than fully autonomous regulatory decision-making. Human audit review remains necessary for governance-critical edge cases involving disclosure ambiguity, anomalous climate transitions, and cross-jurisdictional reporting inconsistencies.

## 12Limitations

This study is subject to several limitations\.

First, the benchmark is synthetic\. While its disclosure distributions, anomaly prevalence, and missingness are calibrated against publicly reported characteristics of established reporting standards, synthetic data cannot fully capture the idiosyncrasy of production enterprise systems; results should be read as evidence on a controlled, reproducible benchmark rather than as production\-validated performance\.

Second, climate\-risk datasets inherently contain uncertainty associated with long\-horizon environmental forecasting, transition\-risk assumptions, and incomplete disclosure behavior across geographically heterogeneous reporting environments\.

Third, proxy\-feature estimation used for missing emissions inference may introduce latent estimation noise under sparse disclosure conditions and fragmented supplier ecosystems\.

Fourth, temporal drift modeling remains sensitive to evolving sustainability standards, delayed supplier disclosures, non\-stationary reporting behavior, and cross\-jurisdictional reporting inconsistencies\.

Fifth, the graph\-based provenance architecture of Section[5\.4](https://arxiv.org/html/2606.02604#S5.SS4)is specified and dimensioned but not empirically evaluated; its statistics are design targets, not measurements\. Graph\-based provenance reasoning, multimodal climate\-risk fusion, and retrieval\-augmented governance workflows therefore remain exploratory extensions requiring empirical validation under large\-scale enterprise deployment\.

The presented framework should therefore be interpreted as a reproducibility\-oriented governance\-engineering prototype rather than a fully production\-validated enterprise deployment system\.

## 13Future Work

Future work will investigate graph\-transformer architectures, retrieval\-augmented ESG reasoning, neuro\-symbolic governance orchestration, and federated climate\-risk learning for cross\-organizational audit intelligence under privacy\-sensitive enterprise environments\. A primary near\-term goal is the empirical evaluation of the proposed provenance\-graph component \(Section[5\.4](https://arxiv.org/html/2606.02604#S5.SS4)\) against the audit\-trace\-completeness metric introduced in Section[7\.1](https://arxiv.org/html/2606.02604#S7.SS1)\.

Additional research directions include multimodal climate\-risk intelligence integrating satellite imagery, flood raster maps, geospatial climate embeddings, textual sustainability disclosures, and multimodal governance\-fusion architectures\. Future deterministic orchestration systems may additionally incorporate graph\-based anomaly propagation, agentic governance workflows, causal climate\-risk inference, retrieval\-augmented audit reconstruction, and temporal graph\-learning infrastructure for enterprise governance intelligence\.

The broader long\-term objective involves developing trustworthy climate\-risk infrastructure capable of supporting reproducibility\-aware enterprise governance systems under increasingly complex environmental reporting conditions\.

## 14Conclusion

This paper introduced a deterministic climate\-risk intelligence framework integrating provenance\-aware orchestration, imbalance\-aware ensemble learning, temporal anomaly detection, retrieval\-oriented ESG parsing, and governance\-oriented explainability for auditable ESG validation under fragmented enterprise reporting environments, together with a proposed graph\-based provenance architecture for future evaluation\.

On a calibrated, openly released synthetic benchmark, gradient-boosting ensembles, led by XGBoost (recall 0.74, ROC-AUC 0.93), significantly outperformed a rule-based threshold baseline and classical statistical baselines under sparse anomaly distributions, and the framework reconstructed complete deterministic provenance chains for 94.7% of flagged anomalies. Experimental evaluation demonstrated the importance of imbalance-aware optimization, temporal governance monitoring, calibration reliability, and explainability-oriented audit inspection.

The broader contribution lies not solely in predictive optimization, but in reframing ESG validation as a deterministic governance\-engineering problem requiring provenance\-aware orchestration, reproducibility\-oriented auditability, temporal lineage reconstruction, and a directly measurable audit\-trace\-completeness criterion\. As enterprise sustainability reporting becomes increasingly regulated, fragmented, and operationally complex, governance\-oriented AI systems will require integrated orchestration semantics capable of supporting reliable audit reconstruction, provenance\-aware anomaly propagation analysis, temporal governance replayability, and operationally trustworthy climate\-risk intelligence under uncertain reporting conditions\.

## References

- \[1\]\(2001\)Random forests\.Machine Learning45\(1\),pp\. 5–32\.Cited by:[§2\.2](https://arxiv.org/html/2606.02604#S2.SS2.p1.1)\.
- \[2\] N\. V\. Chawla, K\. W\. Bowyer, L\. O\. Hall, and W\. P\. Kegelmeyer \(2002\) SMOTE: synthetic minority over\-sampling technique\. Journal of Artificial Intelligence Research 16, pp\. 321–357\. Cited by: [§2\.2](https://arxiv.org/html/2606.02604#S2.SS2.p4.1), [§5\.7](https://arxiv.org/html/2606.02604#S5.SS7.p1.1), [§6\.1](https://arxiv.org/html/2606.02604#S6.SS1.p1.1)\.
- \[3\] T\. Chen and C\. Guestrin \(2016\) XGBoost: a scalable tree boosting system\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 785–794\. Cited by: [§2\.2](https://arxiv.org/html/2606.02604#S2.SS2.p1.1)\.
- \[4\] European Centre for Medium\-Range Weather Forecasts \(2023\) Copernicus Climate Change Service\. Note: [https://climate\.copernicus\.eu](https://climate.copernicus.eu/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p3.1)\.
- \[5\] J\. Gama, I\. Zliobaite, A\. Bifet, M\. Pechenizkiy, and A\. Bouchachia \(2014\) A survey on concept drift adaptation\. ACM Computing Surveys 46\(4\), pp\. 1–37\. Cited by: [§2\.3](https://arxiv.org/html/2606.02604#S2.SS3.p2.1)\.
- \[6\] Global Facility for Disaster Reduction and Recovery \(2023\) ThinkHazard\! hazard information platform\. Note: [https://thinkhazard\.org](https://thinkhazard.org/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p3.1)\.
- \[7\] Greenhouse Gas Protocol \(2023\) GHG Protocol corporate accounting and reporting standard\. Note: [https://ghgprotocol\.org](https://ghgprotocol.org/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p1.1), [§3\.1](https://arxiv.org/html/2606.02604#S3.SS1.p2.1)\.
- \[8\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, Vol\. 70, pp\. 1321–1330\. Cited by: [§9](https://arxiv.org/html/2606.02604#S9.p5.1)\.
- \[9\] International Sustainability Standards Board \(2023\) IFRS S2 climate\-related disclosures\. Note: [https://www\.ifrs\.org](https://www.ifrs.org/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p1.1), [§3\.1](https://arxiv.org/html/2606.02604#S3.SS1.p2.1)\.
- \[10\] G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\) LightGBM: a highly efficient gradient boosting decision tree\. Advances in Neural Information Processing Systems 30\. Cited by: [§2\.2](https://arxiv.org/html/2606.02604#S2.SS2.p1.1)\.
- \[11\] S\. M\. Lundberg, G\. Erion, H\. Chen, A\. DeGrave, J\. M\. Prutkin, B\. Nair, R\. Katz, J\. Himmelfarb, N\. Bansal, and S\. Lee \(2020\) From local explanations to global understanding with explainable AI for trees\. Nature Machine Intelligence 2\(1\), pp\. 56–67\. Cited by: [§2\.4](https://arxiv.org/html/2606.02604#S2.SS4.p1.1), [§5\.9](https://arxiv.org/html/2606.02604#S5.SS9.p3.4), [§8](https://arxiv.org/html/2606.02604#S8.p1.1)\.
- \[12\] S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. In Advances in Neural Information Processing Systems, Vol\. 30\. Cited by: [§2\.4](https://arxiv.org/html/2606.02604#S2.SS4.p1.1), [§8](https://arxiv.org/html/2606.02604#S8.p1.1)\.
- \[13\] National Institute of Standards and Technology \(2023\) Artificial intelligence risk management framework \(AI RMF 1\.0\)\. Note: [https://www\.nist\.gov/itl/ai\-risk\-management\-framework](https://www.nist.gov/itl/ai-risk-management-framework) Cited by: [§11](https://arxiv.org/html/2606.02604#S11.p2.1), [§2\.4](https://arxiv.org/html/2606.02604#S2.SS4.p2.1)\.
- \[14\] Network for Greening the Financial System \(2023\) NGFS climate scenarios for central banks and supervisors\. Note: [https://www\.ngfs\.net](https://www.ngfs.net/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p3.1)\.
- \[15\] Organisation for Economic Co\-operation and Development \(2022\) OECD framework for the classification of AI systems\. Note: [https://oecd\.ai](https://oecd.ai/) Cited by: [§11](https://arxiv.org/html/2606.02604#S11.p2.1), [§2\.4](https://arxiv.org/html/2606.02604#S2.SS4.p2.1)\.
- \[16\] Partnership for Carbon Accounting Financials \(2022\) The global GHG accounting and reporting standard for the financial industry\. Note: [https://carbonaccountingfinancials\.com](https://carbonaccountingfinancials.com/) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p1.1), [§3\.1](https://arxiv.org/html/2606.02604#S3.SS1.p2.1)\.
- \[17\] J\. Pineau, P\. Vincent\-Lamarre, K\. Sinha, V\. Lariviere, A\. Beygelzimer, F\. d’Alche\-Buc, E\. Fox, and H\. Larochelle \(2021\) Improving reproducibility in machine learning research\. Journal of Machine Learning Research 22\(164\), pp\. 1–20\. Cited by: [§2\.4](https://arxiv.org/html/2606.02604#S2.SS4.p2.1), [§5\.11](https://arxiv.org/html/2606.02604#S5.SS11.p1.1)\.
- \[18\] L\. Prokhorenkova, G\. Gusev, A\. Vorobev, A\. V\. Dorogush, and A\. Gulin \(2018\) CatBoost: unbiased boosting with categorical features\. Advances in Neural Information Processing Systems 31\. Cited by: [§2\.2](https://arxiv.org/html/2606.02604#S2.SS2.p1.1)\.
- \[19\] World Resources Institute \(2023\) Aqueduct floods methodology\. Note: [https://www\.wri\.org/aqueduct](https://www.wri.org/aqueduct) Cited by: [§2\.1](https://arxiv.org/html/2606.02604#S2.SS1.p3.1)\.
