What Do Language Priors Contribute to Darcy-Flow Inversion? A Mechanistic Audit
Summary
This paper investigates whether sentence embeddings can serve as an inference-time interface for injecting geological knowledge into a learned Darcy-flow inverse solver, finding that text conditioning reduces reconstruction error by 81% relative to a no-text counterfactual, with most gains from categorical class-level constraints.
# What Do Language Priors Contribute to Darcy-Flow Inversion? A Mechanistic Audit
Source: [https://arxiv.org/html/2606.24967](https://arxiv.org/html/2606.24967)
###### Abstract
In ill-posed inverse problems, the recovered solution depends as much on the prior as on the data, yet much of the engineering knowledge that could serve as that prior is recorded qualitatively rather than in formal mathematical form. Here we test whether sentence embeddings can act as an inference-time interface for injecting geological descriptions into a learned Darcy-flow inverse solver. Across six synthetic geological classes and an exploratory transfer to a benchmark reservoir model (SPE10), we vary only the conditioning representation and find that text conditioning reduces reconstruction error by 81% relative to a no-text counterfactual. Most of this gain comes from a categorical, class-level constraint whose value concentrates where the hydraulic head leaves the conductivity field underdetermined, while within-class geometric detail is secondary and pattern-dependent. Compared with a discrete class label, sentence embeddings add little dense-observation accuracy but improve training stability and enable paraphrase-based sensitivity analysis and open-vocabulary inputs. These results show that language priors can serve as an engineering-informatics interface for injecting geological knowledge into learned inverse solvers, while clarifying when they help and what signal they actually carry.
###### keywords:
Sentence embeddings , Soft priors , Text conditioning , Hydraulic conductivity , Site characterisation , Knowledge representation
Journal: Advanced Engineering Informatics

Affiliation: Department of Civil and Environmental Engineering, Tohoku University, 6-6-06 Aramaki Aoba, Aoba-ku, Sendai, Miyagi 980-8579, Japan
## 1 Introduction
Across engineering, many of the quantities that govern system behaviour cannot be observed directly\. Spatially varying material properties, internal states, and subsurface structures must instead be inferred from sparse, indirect observations through the physics that links them to the measured response, an inverse mapping that is typically underdetermined or ill\-posed\[[45](https://arxiv.org/html/2606.24967#bib.bib2),[24](https://arxiv.org/html/2606.24967#bib.bib19),[44](https://arxiv.org/html/2606.24967#bib.bib31)\]\. Which of the many configurations consistent with the data is recovered is then decided largely by the prior \(the regularisation term, the correlation model, or the training ensemble that encodes which solutions are considered plausible\[[2](https://arxiv.org/html/2606.24967#bib.bib1)\]\)\. The practical question is therefore one of prior selection: what knowledge should be used to choose among the plausible configurations, and in what form should it enter the solver? Much of the knowledge engineers actually hold, however, is not stored as equations or covariances\. It exists instead as qualitative, documented expertise, and turning that expertise into a computable prior is a central concern of engineering informatics\.
Subsurface site characterisation makes this concern especially concrete\. The performance of civil and geotechnical systems is strongly influenced by the heterogeneity of the surrounding ground\[[27](https://arxiv.org/html/2606.24967#bib.bib36)\], which has long motivated random\-field modelling for reliability\-based design\[[31](https://arxiv.org/html/2606.24967#bib.bib37),[32](https://arxiv.org/html/2606.24967#bib.bib38),[3](https://arxiv.org/html/2606.24967#bib.bib39)\], yet that heterogeneity is only partially observed through site data that are often multivariate, uncertain, sparse, and incomplete\[[30](https://arxiv.org/html/2606.24967#bib.bib27)\]\. The prior knowledge practitioners bring to this task is rarely limited to numerical smoothness or correlation structure\. It is often expressed in heterogeneous, partly qualitative forms such as borehole logs, stratigraphic interpretations, depositional\-setting narratives, and expert judgement about recurring structural patterns, whose classification and representation are long\-standing organising principles of the field\[[6](https://arxiv.org/html/2606.24967#bib.bib28),[7](https://arxiv.org/html/2606.24967#bib.bib29)\]\. In this sense, site characterisation is inherently multimodal\. Here we focus on language as one controllable source of this prior information, rather than proposing a general multimodal fusion system\.
Classical regularised inversion encodes smoothness, sparsity, or spatial correlation through mechanisms such as Tikhonov-type penalties on model magnitude or roughness\[[2](https://arxiv.org/html/2606.24967#bib.bib1)\], total-variation penalties on gradient sparsity\[[41](https://arxiv.org/html/2606.24967#bib.bib11)\], and variogram-based geostatistics\[[18](https://arxiv.org/html/2606.24967#bib.bib21),[55](https://arxiv.org/html/2606.24967#bib.bib22)\]. It does not, however, encode the categorical, morphological structure a geologist reasons with: that clay layers tend to be continuous and inclined, that lenses and channels have recognisable shapes, and that a site exhibits only a few such patterns. Conceptual geological models express this categorical structure\[[12](https://arxiv.org/html/2606.24967#bib.bib23),[20](https://arxiv.org/html/2606.24967#bib.bib24)\], and learned priors can absorb it implicitly from training ensembles, including convolutional encoder–decoders\[[57](https://arxiv.org/html/2606.24967#bib.bib14),[23](https://arxiv.org/html/2606.24967#bib.bib15)\] and spatial generative adversarial networks\[[19](https://arxiv.org/html/2606.24967#bib.bib16),[14](https://arxiv.org/html/2606.24967#bib.bib13)\], while physics-informed networks constrain solutions through the governing residual\[[37](https://arxiv.org/html/2606.24967#bib.bib54),[46](https://arxiv.org/html/2606.24967#bib.bib55),[10](https://arxiv.org/html/2606.24967#bib.bib56),[4](https://arxiv.org/html/2606.24967#bib.bib57)\]. Knowledge-representation approaches likewise structure geological information explicitly, for example through entity–relationship networks for three-dimensional geomodelling\[[48](https://arxiv.org/html/2606.24967#bib.bib49)\].
Recent engineering\-informatics work pursues related data\-driven approaches: graph networks infer subsurface stratigraphy from sparse, multi\-source exploration data\[[56](https://arxiv.org/html/2606.24967#bib.bib45)\], sensor\-selection frameworks build on qualitative physical models\[[9](https://arxiv.org/html/2606.24967#bib.bib46)\], diffusion\-based models bring uncertainty evaluation to geotechnical displacement reconstruction\[[35](https://arxiv.org/html/2606.24967#bib.bib47)\], and mixture\-of\-experts surrogates are fused with data assimilation for dam\-deformation modelling\[[54](https://arxiv.org/html/2606.24967#bib.bib48)\]\. Yet none of these routes accepts free\-form, case\-specific knowledge at inference time: a practitioner whose core log describes exactly such a pattern cannot straightforwardly communicate it to a trained network\. The missing element is a plug\-in semantic conditioning interface that injects such knowledge into the trained solver\.
Recent progress in language models makes such an interface plausible\. Large language models make free\-form technical text computationally accessible\[[25](https://arxiv.org/html/2606.24967#bib.bib17)\]\. More broadly, natural language has matured into a first\-class conditioning signal in machine learning: vision–language alignment and text\-conditioned generative models routinely steer image synthesis with a sentence\[[36](https://arxiv.org/html/2606.24967#bib.bib9),[39](https://arxiv.org/html/2606.24967#bib.bib10)\]\. Engineering uses, however, remain predominantly text\-to\-text or text\-to\-structured\-knowledge \(classification, information extraction, retrieval, ontology and knowledge\-graph construction\[[22](https://arxiv.org/html/2606.24967#bib.bib43),[43](https://arxiv.org/html/2606.24967#bib.bib44)\], and drafting\) rather than making a free\-form description an inference\-time prior for the inverse problem itself\. Within engineering informatics, the missing step is therefore not language processing itself, but letting that description act directly on a physics\-governed inverse solver\. The step examined here is text\-to\-physics: a free\-form description, encoded as a continuous embedding, directly conditioning a learned solver for a physics\-governed inverse problem\. The open questions are what such text actually contributes to the recovered field, and through which mechanism\.
Engineering informatics has also begun using LLM\-driven systems to expose domain knowledge under practical data constraints, from multi\-agent geotechnical analysis to physics\-informed evaluation of language\-model outputs and multimodal landslide interpretation\[[34](https://arxiv.org/html/2606.24967#bib.bib40),[17](https://arxiv.org/html/2606.24967#bib.bib41),[1](https://arxiv.org/html/2606.24967#bib.bib42)\]; position papers chart this potential across the wider geotechnical workflow\[[49](https://arxiv.org/html/2606.24967#bib.bib33),[50](https://arxiv.org/html/2606.24967#bib.bib34)\]\. To our knowledge, however, this adjacent engineering\-informatics literature does not yet address the question studied here: how a free\-form description functions as an inference\-time prior for a learned inverse solver, namely what signal it contributes and through which pathway\.
The closest direct precedent comes from outside engineering informatics, where semantic information has recently been used to guide physics-governed inverse problems. Zhang et al.\[[53](https://arxiv.org/html/2606.24967#bib.bib3)\] showed that sentence embeddings of natural-language scene descriptions can regularise an electromagnetic inverse problem: the text enters as a semantic latent-space penalty that pulls the inferred latent toward an LLM encoding of the description. Follow-up work extends the idea to multimodal semantic priors in the same electromagnetic setting\[[5](https://arxiv.org/html/2606.24967#bib.bib4)\]. These studies establish that semantic descriptions can constrain inverse recovery, but they leave open the engineering-informatics question addressed here: what part of the textual prior a learned solver actually uses, whether a discrete class label would carry the same signal, and when language adds value under different observation regimes. We bring this question to Darcy flow, a representative ill-posed inverse problem of subsurface engineering. Hydraulic conductivity governs groundwater flow and contaminant transport yet is rarely observed directly; it must be inferred from the routinely accessible hydraulic head $h$ through Darcy's law,

$$-\nabla\cdot(K\nabla h)=0, \tag{1}$$

an inverse problem that is ill-posed in the classical sense that distinct $K$ fields produce nearly indistinguishable head responses\[[24](https://arxiv.org/html/2606.24967#bib.bib19)\]. In contrast to that precedent, which supplies text to improve recovery, we ask what the text itself contributes, and we design the main synthetic experiments so that the conditioning representation, rather than an auxiliary penalty, is the sole experimental variable (§[3.1](https://arxiv.org/html/2606.24967#S3.SS1)). Rather than proposing a new architecture, we contribute a controlled audit: a systematic evaluation of what information the text conditioning carries, treating the sentence embedding as a plug-in interface and characterising the capabilities it provides.
This question bears on what a practitioner would need to provide at inference time: a detailed narrative, a simple label, or a flexible open-vocabulary description. To answer it mechanistically, we use controlled synthetic descriptions as an analysis instrument, varying their content in isolation to reveal which aspect of the prior aids recovery. What is intended to carry over to practice is not the synthetic text itself, but the channel it probes: an entry point through which qualitative site observations can constrain the inversion alongside quantitative data. In a Bayesian inverse-problem view\[[44](https://arxiv.org/html/2606.24967#bib.bib31)\], we interpret the text $t$ as supplying a soft prior in $p(K\mid h,t)$, and read the trained text-conditioned generator with its paraphrase ensemble as a low-cost, posterior-inspired sensitivity proxy rather than as a posterior sampler (§[3.2](https://arxiv.org/html/2606.24967#S3.SS2)). The resulting mechanism-first study centres on the claim that this prior, language-encoded engineering information, is conditional: how much language contributes to the recovered field depends on how strongly the head observation already constrains $K$. Language is expected to matter most when the forward map collapses geologically distinct fields onto nearly indistinguishable observations, to add finer information when observations are sparse and the missing structure is describable, and to become largely redundant when the observation nearly determines $K$ once the broad geological class is known. The study therefore characterises this regime dependence rather than reporting a single accuracy gain, and is organised around three questions:
1. What information does the solver use from text? We test whether sentence embeddings contribute instance-specific geological information or mainly act as a class-level prior, and whether the same recoverable signal could be carried by a discrete class label.
2. When does text help? We examine whether the value of language depends on the degree to which the hydraulic-head observation constrains $K$, including cases where the forward map collapses distinct geological structures onto similar responses and cases where observations are sparsified.
3. What does the embedding interface enable beyond a fixed label? We evaluate two interface-level capabilities of free-form text: paraphrase-based sensitivity analysis and open-vocabulary descriptions outside the closed six-class taxonomy.
## 2 Problem Setting and Datasets
### 2.1 Synthetic Darcy dataset (six geological classes)
#### 2.1.1 Field generation and split
We consider steady-state Darcy flow (Equation [1](https://arxiv.org/html/2606.24967#S1.E1)) on $\Omega=[0,1]^2$, discretised on a $64\times 64$ grid. For five of the six classes, hydraulic conductivity $K$ is binary: clay ($K=10^{-8}$ m/s) or sand ($K=10^{-5}$ m/s), spanning three orders of magnitude, a contrast representative of unconsolidated sedimentary environments; the Continuous class instead uses a log-normal continuous $K$ field. Six structural classes are generated from a two-dimensional latent variable $\xi=(\xi_1,\xi_2)$, whose components $\xi_1$ and $\xi_2$ are drawn independently from the uniform distribution on $[0,1]$:
1. Band: a continuous sand body of characteristic width, tilted at an angle.
2. Circle: a connected sandy conduit formed by a zig-zag chain of circular segments.
3. Ellipse: scattered elongated sand lenses with horizontal or vertical orientation and varying position.
4. Layered: 1–4 horizontal sand strata embedded in a clay background, with a vertical offset.
5. Random: a spatially correlated binary random field with clay fraction in $[0.2,0.7]$, generated via a thresholded Gaussian process.
6. Continuous: a log-normal $K$ field whose log-mean and spatial correlation scale vary with $\xi$.
The six classes are chosen to represent recurring structural motifs in sedimentary ground (layers, inclined bands, lenses, and channel-like conduits) together with two standard geostatistical field models (a correlated binary field and a log-normal continuous field); they are morphological abstractions rather than a standard geotechnical taxonomy. Class assignment is fixed by $\tau=\xi_1+\xi_2$ via equal-mass thresholds on $\tau$ (${\sim}16.7\%$ of samples per class). Within each class, the same $(\xi_1,\xi_2)$ determine that class's geometric or field parameters; the Random and Continuous classes additionally draw a Gaussian random field. The latent-to-class partition and the shared vertical-placement coordinate are illustrated in the appendix (app:latent_partition). Figure [1](https://arxiv.org/html/2606.24967#S2.F1) shows representative $K$ and $h$ fields for each class. The dataset comprises 18,000 samples (${\sim}3{,}000$ per class). A leak-free hash-grouped train/validation/test split ($14{,}381/1{,}747/1{,}872$) keeps all samples sharing a $K$ field in the same partition, preventing $K$-field leakage from generators that reuse a small library of distinct fields (appendix, app:split_audit).
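The equal-mass thresholds follow directly from the distribution of $\tau$: the sum of two independent uniform variables on $[0,1]$ is triangular on $[0,2]$, so its quantiles have a closed form. The sketch below is illustrative only; the ordering of classes along $\tau$ and the names `tau_quantile` and `assign_class` are assumptions, not taken from the paper.

```python
import math
import random

# ASSUMPTION: the order in which classes occupy the tau intervals is hypothetical.
CLASSES = ["Band", "Circle", "Ellipse", "Layered", "Random", "Continuous"]


def tau_quantile(q):
    """Inverse CDF of tau = xi1 + xi2 for xi1, xi2 ~ U[0, 1] (triangular on [0, 2])."""
    if q <= 0.5:
        return math.sqrt(2.0 * q)            # CDF is tau^2 / 2 on [0, 1]
    return 2.0 - math.sqrt(2.0 * (1.0 - q))  # CDF is 1 - (2 - tau)^2 / 2 on [1, 2]


# Five interior thresholds split the tau axis into six equal-probability bins.
THRESHOLDS = [tau_quantile(k / 6) for k in range(1, 6)]


def assign_class(xi1, xi2):
    """Map a latent pair to a class by counting how many thresholds tau exceeds."""
    tau = xi1 + xi2
    index = sum(tau > t for t in THRESHOLDS)
    return CLASSES[index]


# Empirical check: each class should receive roughly one sixth of the samples.
rng = random.Random(0)
n_samples = 60000
counts = {name: 0 for name in CLASSES}
for _ in range(n_samples):
    counts[assign_class(rng.random(), rng.random())] += 1
fractions = {name: c / n_samples for name, c in counts.items()}
```

Under these assumptions the middle threshold sits at $\tau=1$ and the others are placed symmetrically around it, which is why the partition is not a uniform slicing of the $\tau$ axis.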
The head field $h$ is solved via second-order centred finite differences with Dirichlet conditions (left $h=1$, right $h=0$; $h$ is dimensionless) and no-flux Neumann conditions on top/bottom. The solver is verified against analytic solutions for uniform and two-layer media; taking the $y$-averaged hydraulic-head profile along $x$, the maximum absolute deviation from the analytic profile is $<10^{-4}$ in both cases.
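A forward solve of this kind can be sketched on a small grid as follows. This is not the authors' solver: the node-based five-point stencil, the harmonic averaging of $K$ at cell faces, the dense linear solve, and the name `solve_darcy` are all assumptions chosen to keep the example short. It reproduces the uniform-medium check described above, where the analytic profile is $h=1-x$.

```python
import numpy as np


def solve_darcy(K):
    """Solve -div(K grad h) = 0 with h=1 on the left, h=0 on the right, no-flux top/bottom.

    ASSUMPTIONS: node-based grid, harmonic-mean face conductivities, dense solve.
    """
    ny, nx = K.shape
    # Index the unknowns: every node except the two Dirichlet columns.
    idx = -np.ones((ny, nx), dtype=int)
    idx[:, 1:-1] = np.arange(ny * (nx - 2)).reshape(ny, nx - 2)

    h = np.zeros((ny, nx))
    h[:, 0] = 1.0  # left Dirichlet value; right column stays at 0

    n = ny * (nx - 2)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(ny):
        for j in range(1, nx - 1):
            p = idx[i, j]
            for di, dj in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ii, jj = i + di, j + dj
                if not 0 <= ii < ny:
                    continue  # no-flux boundary: drop the flux through this face
                k_face = 2.0 * K[i, j] * K[ii, jj] / (K[i, j] + K[ii, jj])
                A[p, p] += k_face
                if idx[ii, jj] >= 0:
                    A[p, idx[ii, jj]] -= k_face
                else:
                    b[p] += k_face * h[ii, jj]  # known Dirichlet neighbour
    h[:, 1:-1] = np.linalg.solve(A, b).reshape(ny, nx - 2)
    return h


# Uniform medium: the y-averaged head profile should match h = 1 - x.
n_grid = 16
x = np.linspace(0.0, 1.0, n_grid)
h_uniform = solve_darcy(np.ones((n_grid, n_grid)))
max_dev = float(np.max(np.abs(h_uniform.mean(axis=0) - (1.0 - x))))
```

A dense solve is only practical for small grids; at $64\times 64$ a sparse solver would be the natural replacement.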
Figure 1: Representative examples of the six synthetic geological classes (columns, left to right: Band, Circle, Ellipse, Layered, Random, Continuous). Top row: hydraulic conductivity $K$ fields, shown as $\log_{10}K$ and min-max scaled for visualisation only. Bottom row: the corresponding hydraulic head $h$ fields obtained by solving the steady-state Darcy equation (Equation [1](https://arxiv.org/html/2606.24967#S1.E1)) on the same domains.
#### 2.1.2 Text generation
For each sample, the known latent parameters $\xi$ and derived geometric quantities (angle, layer count, clay fraction, etc.) are formatted into a structured prompt and submitted to OpenAI's GPT-4o-mini\[[26](https://arxiv.org/html/2606.24967#bib.bib18)\], which generates a single 20–40-word natural-language geological description. Descriptions are generated individually per sample and vary with latent-derived geometric quantities: Band samples with steep angles yield phrasing such as "near-vertical clay barrier", while shallow angles produce "gently dipping aquitard". This ensures that the text embedding space reflects genuine geological variation rather than a fixed set of class-level templates (Figure [2](https://arxiv.org/html/2606.24967#S2.F2)c). Each description is encoded into a 384-dimensional embedding $\mathbf{e}$ by a frozen SBERT model (all-MiniLM-L6-v2)\[[38](https://arxiv.org/html/2606.24967#bib.bib8)\]. Full prompts for all six classes are provided in the appendix (app:synthetic_text). Because every description is generated from the known latent parameters $\xi$, this synthetic text is "oracle" by construction (§[1](https://arxiv.org/html/2606.24967#S1)): it is a controlled instrument for isolating which geological information aids recovery, not an emulation of the noisier, human-authored reports a deployed system would receive.
#### 2.1.3 Forward-map PCA of $K$, $h$, and text embeddings
A three-way principal-component view of the dataset previews why a language prior should help unevenly across classes. The conductivity fields are strongly class-structured (Figure [2](https://arxiv.org/html/2606.24967#S2.F2)a), but the forward Darcy map collapses much of that structure: the head fields overlap across classes onto a single dominant boundary-driven mode (Figure [2](https://arxiv.org/html/2606.24967#S2.F2)b; PC1 ${\approx}54\%$), to a degree that differs by class. The text embeddings, by contrast, retain class separation (Figure [2](https://arxiv.org/html/2606.24967#S2.F2)c). Where the head collapses a class's structure most, the head observation least constrains $K$, motivating the hypothesis that a description has more to add there; we make this per-class and quantitative with a forward-collapse ratio in §[4.5](https://arxiv.org/html/2606.24967#S4.SS5), where, over only six classes, we treat it as an interpretive axis rather than a quantitative predictor.
Figure 2: Class structure of the dataset under the forward map. First two principal components of (a) the conductivity fields $K$, (b) the head fields $h$, and (c) the SBERT text embeddings, over all $18{,}000$ samples, coloured by class.
### 2.2 SPE10 (industry-standard reservoir benchmark)
As a semi-real external anchor we use the SPE10 Model 2 benchmark\[[8](https://arxiv.org/html/2606.24967#bib.bib6)\], a geostatistical reservoir model representative of real North Sea formations: 85 layers of heterogeneous permeability spanning the Tarbert (shallow-marine, smooth gradients) and Upper Ness (fluvial, channelised) formations. Each layer is resized to $64\times 64$, log-scaled, and min-max normalised to $[0,1]$ to play the role of $K$ in the Darcy problem defined above, and the same Darcy solver with a fixed-head line-drive surrogate boundary condition generates head fields. The SPE10 facies vocabulary overlaps only partially with the six synthetic classes, so any text benefit must come from a coarse in-domain geological prior rather than exact class matching; data preparation, the surrogate boundary condition, and the class-vocabulary mapping are detailed in the appendix (app:spe10_data).
## 3 Methods
### 3.1 Model Architecture and Training
The generator is a U-Net encoder–decoder\[[40](https://arxiv.org/html/2606.24967#bib.bib12)\] that takes the observed head field $h$ (1 channel, $64\times 64$) as input and outputs a hydraulic conductivity reconstruction $\hat{K}$ (1 channel, $64\times 64$; Figure [3](https://arxiv.org/html/2606.24967#S3.F3)). The encoder comprises four convolutional blocks with batch normalisation and ReLU activation, reducing spatial resolution from $64\times 64$ to $8\times 8$, followed by a bottleneck block at $4\times 4$ (channel progression $1\to 32\to 64\to 128\to 256\to 512$). The decoder mirrors the encoder with transposed convolutions and skip connections, restoring the output to $64\times 64$.
Text information enters via Feature-wise Linear Modulation (FiLM;\[[29](https://arxiv.org/html/2606.24967#bib.bib7)\]) at the bottleneck ($4\times 4$, 512 channels):

$$\tilde{\mathbf{z}}=\mathbf{z}+\gamma(\mathbf{e})\odot\mathbf{z}+\beta(\mathbf{e}), \tag{2}$$

where $\mathbf{z}$ is the bottleneck feature map, $\mathbf{e}\in\mathbb{R}^{384}$ is the SBERT embedding, and $(\gamma,\beta)=\mathrm{FC}(\mathbf{e})$ are learned affine parameters produced by a single linear layer ($384\to 2\times 512$). This residual form recovers the identity map when $(\gamma,\beta)=(\mathbf{0},\mathbf{0})$, so the unconditioned generator is a natural special case. We adopt FiLM because it provides channel-wise modulation without increasing the spatial dimensionality of the bottleneck. The headline synthetic experiments train the generator with a pure reconstruction objective,

$$\mathcal{L}_{G}=\|K-\hat{K}\|^{2}, \tag{3}$$

so that the parameterisation comparisons of §[3.4](https://arxiv.org/html/2606.24967#S3.SS4) isolate the effect of the conditioning encoding under an identical, minimal objective.
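The residual FiLM modulation of Equation 2 can be sketched in a few lines of NumPy. The weights here are random stand-ins rather than trained parameters, and the layout of the linear layer's output (first 512 entries as $\gamma$, last 512 as $\beta$) and the presence of a bias term are assumptions; the function name `film` is hypothetical.

```python
import numpy as np


def film(z, e, W, b):
    """Residual FiLM: z_tilde = z + gamma(e) * z + beta(e), applied per channel.

    z: bottleneck feature map, shape (C, H, W_spatial)
    e: text embedding, shape (D,)
    W, b: weights and bias of a single linear layer mapping D -> 2*C
    ASSUMPTION: the first C outputs are gamma and the last C are beta.
    """
    n_channels = z.shape[0]
    params = W @ e + b
    gamma, beta = params[:n_channels], params[n_channels:]
    # Broadcast the per-channel scale and shift over the spatial dimensions.
    return z + gamma[:, None, None] * z + beta[:, None, None]


rng = np.random.default_rng(0)
z = rng.normal(size=(512, 4, 4))         # bottleneck: 512 channels at 4x4
e = rng.normal(size=384)                 # stand-in for a 384-d SBERT embedding
W = 0.01 * rng.normal(size=(1024, 384))  # random stand-in for the learned layer
b = np.zeros(1024)

z_tilde = film(z, e, W, b)
# With gamma = beta = 0 the modulation reduces to the identity map.
z_identity = film(z, e, np.zeros_like(W), b)
```

Note that if the linear layer carries a bias, a zero embedding yields $(\gamma,\beta)$ equal to that bias rather than zero, so a zero-embedding input is not automatically the identity case.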
Every synthetic experiment here, including all controls of §[3.4](https://arxiv.org/html/2606.24967#S3.SS4), uses this pure reconstruction objective (Equation [3](https://arxiv.org/html/2606.24967#S3.E3)); the SPE10 transfer instead adds two auxiliary terms, described with the transfer protocol (§[3.6](https://arxiv.org/html/2606.24967#S3.SS6)). Training uses Adam ($\beta_1=0.5$) at learning rate $2\times 10^{-4}$ for 400 epochs (batch size 64) with ReduceOnPlateau scheduling and best-checkpoint selection on the validation K-MSE, the mean squared error between the reconstructed and true hydraulic conductivity fields and our primary reconstruction-accuracy metric.
Figure 3: Text-conditioned generator architecture. The observed head field $h$ passes through a U-Net encoder–decoder with horizontal skip connections (grey arrows) to produce the predicted conductivity field $\hat{K}$. A frozen SBERT embedding $\mathbf{e}$ of the geological description modulates the bottleneck via FiLM (Equation [2](https://arxiv.org/html/2606.24967#S3.E2)). The example descriptions L0–L4 illustrate the specificity spectrum of §[3.3](https://arxiv.org/html/2606.24967#S3.SS3), from a generic statement of the medium to a fully detailed geometric description.
### 3.2 Text as a Soft Prior and the Paraphrase-Ensemble Proxy
To motivate the paraphrase-ensemble uncertainty proxy, we interpret geological text as a soft prior on admissible $K$ fields in a Bayesian inverse-problem framework. The trained generator itself is deterministic, producing a point estimate $\hat{K}=G_{\theta}(h,t)$; the posterior notation below is therefore an interpretive model for the information that $h$ and $t$ supply, not a claim that the generator samples from $p(K\mid h,t)$. More formally, treating $p(h\mid K)$ as the implicit likelihood induced by the Darcy forward operator and assuming that text affects $K$ only through the prior (so that $p(h\mid K,t)=p(h\mid K)$, since the Darcy operator itself is unchanged by $t$), Bayes' rule on $(h,t,K)$ gives

$$p(K\mid h,t)\propto p(h\mid K)\,p(K\mid t). \tag{4}$$

What the text actually contributes to this prior (whether it acts only as a class label or also carries finer within-class structure) is the mechanism question we take up in §[5.1](https://arxiv.org/html/2606.24967#S5.SS1).
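For readers who want the intermediate step behind Equation 4, conditioning Bayes' rule on $t$ and then applying the stated assumption $p(h\mid K,t)=p(h\mid K)$ gives the following; the normaliser $p(h\mid t)$ does not depend on $K$ and is absorbed into the proportionality.

```latex
\begin{aligned}
p(K \mid h, t)
  &= \frac{p(h \mid K, t)\, p(K \mid t)}{p(h \mid t)} \\
  &= \frac{p(h \mid K)\, p(K \mid t)}{p(h \mid t)}
   \;\propto\; p(h \mid K)\, p(K \mid t).
\end{aligned}
```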
Because a single description $t$ is one linguistic surface form of an underlying belief, intended to preserve the geological content, we treat the reference sentence together with $N_p-1$ LLM-generated paraphrases of it as $N_p$ draws $\{t_k\}_{k=1}^{N_p}$ from a paraphrase distribution around that belief ($N_p=16$ in all experiments; per-experiment prompt variants are detailed in §[3.3](https://arxiv.org/html/2606.24967#S3.SS3) and the appendix, app:paraphrase_protocol). We do not assume that these paraphrases are independent or exhaustive; we use them only to probe the linguistic neighbourhood of the description. Passing each text through the (frozen) generator yields an ensemble of hydraulic conductivity reconstructions $\{\hat{K}_k\}_{k=1}^{N_p}$, which we read as a posterior-inspired probe around the conditioning pair $(h,t)$: the within-paraphrase standard deviation $\sigma_K(\mathbf{x})=\mathrm{std}_k\,\hat{K}_k(\mathbf{x})$ serves as a per-pixel dispersion proxy whose grid-mean we denote $\bar{\sigma}_K$. This is not a formal posterior sampler in the sense of Stein-variational or MCMC methods\[[44](https://arxiv.org/html/2606.24967#bib.bib31)\]: the paraphrases are not samples from $p(K\mid h,t)$, but a low-cost sensitivity proxy that measures how much the reconstruction changes under meaning-preserving perturbations of the text input. Whether this stand-in is reliable both as a relative ranking and as a quantitative magnitude is an empirical question deferred to §[4.6](https://arxiv.org/html/2606.24967#S4.SS6).
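The dispersion proxy itself is a plain per-pixel standard deviation over the paraphrase axis. The sketch below uses synthetic arrays in place of real generator outputs; the function name `paraphrase_dispersion` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption, since the normalisation is not specified in the text.

```python
import numpy as np


def paraphrase_dispersion(K_hats):
    """Return the per-pixel std map sigma_K and its grid mean sigma_bar_K.

    K_hats: array of shape (N_p, H, W), one reconstruction per paraphrase.
    ASSUMPTION: population standard deviation (ddof=0).
    """
    K_hats = np.asarray(K_hats, dtype=float)
    sigma_K = K_hats.std(axis=0)
    return sigma_K, float(sigma_K.mean())


# Synthetic stand-in for N_p = 16 reconstructions of one 64x64 sample.
rng = np.random.default_rng(0)
base = rng.random((64, 64))
ensemble = base[None] + 0.05 * rng.normal(size=(16, 64, 64))
sigma_K, sigma_bar = paraphrase_dispersion(ensemble)

# If every paraphrase yields the same field, the proxy collapses to zero.
identical = np.repeat(base[None], 16, axis=0)
_, sigma_bar_identical = paraphrase_dispersion(identical)
```

A value of zero therefore signals insensitivity to rewording, not certainty about $K$, which is consistent with reading the ensemble as a sensitivity proxy rather than a posterior sample.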
### 3.3 Specificity Spectrum and Uncertainty-Reduction Protocol
Table 1: Specificity levels used in the text uncertainty reduction experiment. Angle brackets denote sample- or class-dependent placeholders; examples shown are for the band family, with wording adapted per class. The L0 baseline is intentionally minimal and serves as the reference level for the uncertainty-reduction proxy.

To quantify which aspects of geological text carry the most information, we define five specificity levels (Table [1](https://arxiv.org/html/2606.24967#S3.T1)). For each level, an ensemble of $N_p=16$ variants is generated by GPT-4o-mini. L0 and L1 are pattern-independent generic descriptions drawn from a short seed pool and extended by GPT calls; L2–L4 are paraphrases of the level-specific anchor text (Table [1](https://arxiv.org/html/2606.24967#S3.T1)); for the number-bearing levels (L3–L4), the meaning-preserving paraphrase prompt is relied on to keep the numerical values in the text stable rather than enforcing them through a dedicated constraint. The prompt variants used by the specificity experiment (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3)) and the posterior-reliability experiment (§[4.6](https://arxiv.org/html/2606.24967#S4.SS6)) are documented in the appendix (app:paraphrase_protocol). Each variant is encoded by SBERT and passed through the generator independently, producing an ensemble of $N_p$ hydraulic conductivity predictions $\{\hat{K}_1,\ldots,\hat{K}_{N_p}\}$ for each input sample.
Text uncertainty reduction is estimated as

$$\hat{\Delta}_{\ell}=\widehat{H}(K\mid h,t_{\ell=0})-\widehat{H}(K\mid h,t_{\ell}), \tag{5}$$

where

$$\widehat{H}(K\mid h,t_{\ell})=\frac{1}{N_{\mathrm{pix}}}\sum_{i,j}\tfrac{1}{2}\log\bigl(2\pi e\,\sigma_{ij}^{2}(t_{\ell})\bigr) \tag{6}$$

is the grid-mean pixel-wise Gaussian differential entropy (diagonal approximation) of the generator output ensemble at level $\ell$, with $\sigma_{ij}(t_{\ell})$ the standard deviation of $\hat{K}$ across the level-$\ell$ variants at pixel $(i,j)$ (§[3.2](https://arxiv.org/html/2606.24967#S3.SS2)); the per-pixel variance is clamped at $10^{-12}$ for numerical stability, and $N_{\mathrm{pix}}=64^{2}=4096$. $\hat{\Delta}_{\ell}$ is not a formal conditional mutual information but an ensemble-based proxy: by construction $\hat{\Delta}_{\ell=0}=0$, and $\hat{\Delta}_{\ell}>0$ means the ensemble at level $\ell$ is tighter (more certain) than at L0. The reliability of the linear dispersion summary $\bar{\sigma}_K$ (§[3.2](https://arxiv.org/html/2606.24967#S3.SS2)) is tested separately in §[4.6](https://arxiv.org/html/2606.24967#S4.SS6); the specificity spectrum instead uses the log-dispersion entropy proxy $\widehat{H}$, so the two dispersion summaries are not interpreted as calibrated or interchangeable uncertainty estimates. The specificity spectrum is evaluated on a fixed, reproducible subset of 10 hash-grouped test samples per pattern; the associated paired statistics (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3)) are therefore interpreted as exploratory.
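Equations 5 and 6 translate almost directly into code. The sketch below uses synthetic ensembles rather than generator outputs; the function names are hypothetical, and the population variance (`ddof=0`) and natural logarithm are assumptions. The two test ensembles are built so that the level-3 spread is exactly half the L0 spread, in which case the proxy should equal $\tfrac{1}{2}\log 4=\log 2$.

```python
import math

import numpy as np


def entropy_proxy(K_hats, var_floor=1e-12):
    """Grid-mean pixel-wise Gaussian differential entropy of an ensemble (Eq. 6).

    K_hats: array of shape (N_p, H, W).
    ASSUMPTIONS: population variance (ddof=0), natural logarithm.
    """
    var = np.asarray(K_hats, dtype=float).var(axis=0)
    var = np.maximum(var, var_floor)  # clamp for numerical stability
    return float(np.mean(0.5 * np.log(2.0 * math.pi * math.e * var)))


def uncertainty_reduction(K_hats_l0, K_hats_level):
    """Delta_hat for one level: entropy at L0 minus entropy at that level (Eq. 5)."""
    return entropy_proxy(K_hats_l0) - entropy_proxy(K_hats_level)


# Synthetic ensembles sharing the same noise pattern at two different scales.
rng = np.random.default_rng(0)
noise = rng.normal(size=(16, 64, 64))
ens_l0 = 0.5 + 0.10 * noise   # looser ensemble standing in for L0
ens_l3 = 0.5 + 0.05 * noise   # tighter ensemble standing in for a specific level

delta_l3 = uncertainty_reduction(ens_l0, ens_l3)
delta_l0 = uncertainty_reduction(ens_l0, ens_l0)
```

Because the proxy is logarithmic in the variance, it responds to the ratio of spreads rather than their difference, which is one reason it is not interchangeable with the linear summary $\bar{\sigma}_K$.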
### 3.4 Representation Controls
To test whether sentence embeddings supply more than a class label, we compare the SBERT generator against four representation controls: \(i\) a class\-mean embedding inserted at inference without retraining, \(ii\) a within\-class text swap, \(iii\) from\-scratch retrains with a categorical one\-hot code and a capacity\-matched random\-code variant, and \(iv\) a nearest\-class\-mean proxy on SPE10 as an out\-of\-distribution test\.
##### Class\-mean embedding \(no retrain\)
To isolate class identity from within-class variation, we use the SBERT-trained generator unchanged and replace each test sample’s per-sample SBERT embedding with the training-set class mean $\bar{e}_{c}$ of its true class. This 384-dimensional substitute carries class identity but no within-class variation. Because this control shares the SBERT-trained generator with the oracle and no-text conditions, we quantify it by the capture percentage
$$\mathrm{capture}_{\%} \;=\; \frac{\mathrm{K\text{-}MSE}_{\mathrm{notext}} - \mathrm{K\text{-}MSE}_{\mathrm{class\text{-}mean}}}{\mathrm{K\text{-}MSE}_{\mathrm{notext}} - \mathrm{K\text{-}MSE}_{\mathrm{oracle}}} \times 100, \tag{7}$$
the fraction of the oracle-vs-no-text gain recovered given only perfect class identity, where the oracle condition feeds the SBERT-trained generator the per-sample full-text embedding and no-text feeds the same generator a zero 384-dimensional embedding. Because the generator was trained only with non-zero conditioning, the no-text input should be read as a counterfactual rather than an in-distribution baseline. The resulting capture percentage is therefore only an approximate decomposition and can exceed $100\%$ when the class mean yields lower K-MSE than the per-sample SBERT embedding.
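A small sketch of Equation 7 makes its edge cases concrete. The function name and all K-MSE values below are hypothetical, chosen only to show when the capture percentage equals, exceeds, or falls below the 0 to 100 range.

```python
def capture_percent(kmse_notext, kmse_class_mean, kmse_oracle):
    """Share of the oracle-vs-no-text gain recovered by the class-mean embedding."""
    gain = kmse_notext - kmse_oracle
    if gain == 0:
        raise ValueError("oracle and no-text K-MSE coincide; capture is undefined")
    return 100.0 * (kmse_notext - kmse_class_mean) / gain

# Hypothetical K-MSE values, not taken from the paper's tables.
full = capture_percent(0.10, 0.02, 0.02)       # class mean matches the oracle
over = capture_percent(0.10, 0.018, 0.02)      # class mean beats the oracle
none = capture_percent(0.10, 0.10, 0.02)       # class mean no better than no text
negative = capture_percent(0.10, 0.12, 0.02)   # class mean worse than no text
```

Because the denominator is a difference of two measured errors, the metric is unstable when the oracle gain is small, which is one reason to read it as an approximate decomposition.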
##### Within\-class text swap \(no retrain\)
To test whether the correct within-class instance matters beyond class identity, we use the same SBERT-trained generator and replace each test sample’s embedding with that of a different same-class sample with a distinct $K$ realisation (excluding the target’s own, to avoid pseudoreplication), averaging K-MSE over ten random donor draws. The instance fraction $(\mathrm{within} - \mathrm{oracle})/(\mathrm{no\text{-}text} - \mathrm{oracle})$ isolates the benefit of the correct within-class instance, complementing the class-mean control from the content side.
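The donor selection can be sketched as follows, assuming each test sample carries a class id and an id for its underlying $K$ realisation. `draw_donors`, `instance_fraction`, and the toy ids are illustrative names and values, not the authors' implementation.

```python
import numpy as np

def draw_donors(class_ids, field_ids, rng):
    """For each sample, pick a same-class donor whose K realisation differs from the target's."""
    class_ids = np.asarray(class_ids)
    field_ids = np.asarray(field_ids)
    donors = np.empty(len(class_ids), dtype=int)
    for i in range(len(class_ids)):
        ok = (class_ids == class_ids[i]) & (field_ids != field_ids[i])
        candidates = np.flatnonzero(ok)
        if candidates.size == 0:
            raise ValueError(f"sample {i} has no same-class donor with a distinct K field")
        donors[i] = rng.choice(candidates)
    return donors

def instance_fraction(kmse_notext, kmse_within, kmse_oracle):
    """Share of the oracle benefit that requires the correct within-class instance."""
    return (kmse_within - kmse_oracle) / (kmse_notext - kmse_oracle)

# Toy ids: two classes; samples 0 and 1 share one K realisation.
class_ids = np.array([0, 0, 0, 1, 1, 1])
field_ids = np.array([10, 10, 11, 20, 21, 22])
rng = np.random.default_rng(0)
draws = [draw_donors(class_ids, field_ids, rng) for _ in range(10)]  # ten donor draws
```

Excluding donors that share the target's $K$ field matters when a class is built from repeated realisations: a donor with the same field would carry the correct instance and bias the swap towards the oracle.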
##### One-hot and capacity-matched retrains ($d=6$ and $d=384$)
To test whether sentence embeddings carry recoverable information beyond a discrete class label, a new generator is trained from scratch with the conditioning input replaced by a discrete class code, in two variants. The first is the canonical one-hot encoding $\mathbf{1}_{c} \in \{0,1\}^{6}$, reducing the FiLM projection to $\mathrm{Linear}(6 \to 1024)$, the smallest possible conditioning bottleneck. The second is a capacity-matched control (one-hot384) replacing each class label with a fixed per-class random unit vector in $\mathbb{R}^{384}$, drawn fresh per seed (a dense code, not a literal one-hot), so its FiLM projection $\mathrm{Linear}(384 \to 1024)$ matches the SBERT input width; this tests whether input width alone explains the instability of the 6-dimensional one-hot. The architecture, optimiser, schedule, random seed, and hash-grouped training partition (LABEL:app:split_audit) are matched to a freshly retrained SBERT reference at the same seed, and all three retrains (SBERT, one-hot, one-hot384) are repeated across ten seeds (§[4.4.3](https://arxiv.org/html/2606.24967#S4.SS4.SSS3)). Capture (Equation [7](https://arxiv.org/html/2606.24967#S3.E7)) does not apply to these separately trained generators, so they are compared with SBERT at the seed level (family (ii) of §[3.8](https://arxiv.org/html/2606.24967#S3.SS8)).
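The two discrete codes can be sketched in NumPy. Drawing the dense code as a normalised Gaussian vector is an assumption (the text specifies only a fixed per-class random unit vector drawn fresh per seed), and the function names are illustrative.

```python
import numpy as np

N_CLASSES = 6

def one_hot_codes(n_classes=N_CLASSES):
    """Canonical one-hot code: row c is the indicator vector of class c."""
    return np.eye(n_classes)

def random_unit_codes(seed, n_classes=N_CLASSES, dim=384):
    """Capacity-matched dense code: one fixed random unit vector per class for a given seed."""
    rng = np.random.default_rng(seed)
    codes = rng.normal(size=(n_classes, dim))   # Gaussian direction is an assumption
    return codes / np.linalg.norm(codes, axis=1, keepdims=True)

labels = np.array([0, 3, 5, 3])                  # hypothetical class labels for a batch
cond_onehot = one_hot_codes()[labels]            # shape (4, 6)
cond_dense = random_unit_codes(seed=0)[labels]   # shape (4, 384)
```

Both codes carry exactly the class label and nothing else; they differ only in input width, which is what lets the dense variant separate width from embedding geometry.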
##### SPE10 nearest\-class\-mean \(no retrain\)
On the SPE10 transfer (§[3.6](https://arxiv.org/html/2606.24967#S3.SS6)) the class-mean control has no ground-truth label, so we assign a class automatically from text. For each of the 13 holdout layers we assign the class whose synthetic training-set SBERT centroid $\bar{e}_{c}$ ($c = 1, \dots, 6$) has the highest cosine similarity to the layer’s $K$-informed reference embedding $e_{\ell}$, and forward that nearest centroid through the SPE10-fine-tuned generator in place of $e_{\ell}$. This is the most direct automated class-label proxy reachable from the fixed taxonomy without human intervention: the assignment is purely embedding-based and fully reproducible, though it is not tuned to maximise generator performance; the assigned-class mean cosine is reported alongside K-MSE so that the reader can judge how well SPE10 layers fit the synthetic class manifold.
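The nearest-class-mean assignment reduces to an argmax over cosine similarities. A minimal sketch, with random vectors standing in for the SBERT centroids and the layer embedding (both hypothetical):

```python
import numpy as np

def nearest_class_mean(embedding, centroids):
    """Return (class index, cosine similarity, centroid) of the most similar class centroid."""
    e = embedding / np.linalg.norm(embedding)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cosines = c @ e
    best = int(np.argmax(cosines))
    return best, float(cosines[best]), centroids[best]

rng = np.random.default_rng(0)
centroids = rng.normal(size=(6, 384))                          # stand-ins for the six class centroids
layer_embedding = centroids[2] + 0.1 * rng.normal(size=384)    # stand-in for one layer's reference embedding
assigned, cosine, substitute = nearest_class_mean(layer_embedding, centroids)
```

The returned cosine is the quantity the text proposes reporting alongside K-MSE: a low value would signal that the layer sits far from every synthetic class.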
### 3.5 Mechanism Probes
To localise when and why the text helps, we add a forward\-collapse measure and a decodability probe, together with a sparse\-observation grid retraining variant\.
##### Forward\-collapse ratio
To quantify how much geological variation each class’s head field retains, we compute, over the class’s unique $K$ realisations, the mean pairwise Euclidean distance among the head fields divided by the mean pairwise distance among the conductivity fields that generate them (both fields globally $z$-standardised so the ratio is dimensionless). A ratio near zero means geologically distinct fields map to near-identical heads, so the observation cannot distinguish them and the inverse problem must lean on the prior.
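One way to compute this ratio is sketched below. Reading "globally $z$-standardised" as a single mean and standard deviation taken over all realisations and pixels of each field type is an assumption, and the synthetic fields are purely illustrative.

```python
import numpy as np

def z_standardise(fields):
    """Global z-score: one mean and one std over all realisations and pixels (assumed reading)."""
    std = fields.std()
    if std == 0:
        raise ValueError("fields are constant; cannot standardise")
    return (fields - fields.mean()) / std

def mean_pairwise_distance(fields):
    """Mean Euclidean distance over all unordered pairs of flattened fields."""
    flat = fields.reshape(len(fields), -1)
    i, j = np.triu_indices(len(flat), k=1)
    return float(np.mean(np.linalg.norm(flat[i] - flat[j], axis=1)))

def forward_collapse_ratio(heads, conductivities):
    """Head-field spread divided by conductivity-field spread, both standardised."""
    return (mean_pairwise_distance(z_standardise(heads))
            / mean_pairwise_distance(z_standardise(conductivities)))

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 16, 16))                          # five distinct synthetic conductivity fields
gradient = np.tile(np.linspace(1.0, 0.0, 16), (16, 1))    # shared inlet-to-outlet head drop
heads_collapsed = gradient + 1e-5 * rng.normal(size=(5, 16, 16))
ratio_collapsed = forward_collapse_ratio(heads_collapsed, K)
ratio_identity = forward_collapse_ratio(K, K)
```

In the collapsed case the heads share one gradient and differ only by tiny perturbations, so the ratio is close to zero even though the conductivity fields are fully distinct.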
##### Decodability probe
To test whether the embedding retains within-class instance information independently of whether the solver uses it, we fit a low-capacity per-class ridge readout from the text embedding to the generative latent $\xi$ (§[2](https://arxiv.org/html/2606.24967#S2)) and score it held-out ($R^{2}$), benchmarked against a shuffled-$\xi$ null, for the deployed SBERT encoder and three alternatives (MPNet, e5, BGE) to rule out encoder-specific artefacts. A clearly positive held-out $R^{2}$ above the null indicates the instance signal is linearly present in the embedding even where the solver leaves it unused.
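The probe logic can be illustrated with a closed-form ridge regression on synthetic data. The scalar latent, the embedding width, the penalty `alpha`, and the single train/test split are all assumptions made for the sketch; the paper's probe is fitted per class on real embeddings.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def heldout_r2(X_train, y_train, X_test, y_test, alpha=1.0):
    w, b = ridge_fit(X_train, y_train, alpha)
    return float(r2_score(y_test, X_test @ w + b))

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 8))                                   # stand-in text embeddings
xi = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1200)        # stand-in latent, linearly encoded
train, test = slice(0, 1000), slice(1000, 1200)

r2_probe = heldout_r2(X[train], xi[train], X[test], xi[test])
# Null: break the embedding-latent pairing in the training set before fitting.
r2_null = heldout_r2(X[train], rng.permutation(xi[train]), X[test], xi[test])
```

The comparison of interest is the gap between `r2_probe` and `r2_null`: a readout trained on shuffled targets cannot generalise, so its held-out score sits near zero.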
##### Sparse\-observation grid retraining
To probe whether beyond-class information becomes useful as the head observation is degraded, we retrain a reconstruction-only generator on grid-subsampled heads: each head is sampled on an endpoint-inclusive $k \times k$ grid and bilinearly interpolated back to the full $64 \times 64$ field, with subsampling applied as an online training augmentation (on $70\%$ of batches, $k$ redrawn per sample) so that a single model, retrained from scratch, spans all densities. At evaluation we sweep $N = k^{2} \in \{4096, 1024, 256, 64, 16, 4\}$ (Figure [9](https://arxiv.org/html/2606.24967#S4.F9)a). As a pure class-identity reference we feed the class-mean embedding (§[3.4](https://arxiv.org/html/2606.24967#S3.SS4)) at each density on a single training seed and report its capture of the oracle benefit; the within-class swap is recomputed at each density across ten independent training seeds, and per class the rise of the beyond-class gap $(\mathrm{within} - \mathrm{oracle})$ from full observation to the corners-only grid is tested with an exact Wilcoxon signed-rank test across seeds, Holm-corrected over the five non-Layered classes.
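The degradation step can be sketched with separable linear interpolation, which is equivalent to bilinear interpolation on a rectangular grid. Placing the sample positions at rounded, evenly spaced indices is an assumption; the text states only that the grid is endpoint-inclusive.

```python
import numpy as np

def subsample_and_interpolate(head, k):
    """Sample a square field on an endpoint-inclusive k x k grid, then interpolate back bilinearly."""
    n = head.shape[0]
    idx = np.round(np.linspace(0, n - 1, k)).astype(int)   # assumed sample positions, corners included
    coarse = head[np.ix_(idx, idx)]                        # (k, k) observed values
    full = np.arange(n)
    # Interpolate along the column axis for each sampled row, giving shape (k, n).
    rows = np.stack([np.interp(full, idx, r) for r in coarse])
    # Interpolate along the row axis for each column, giving shape (n, n).
    return np.stack([np.interp(full, idx, rows[:, j]) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
head = rng.normal(size=(64, 64))                                      # synthetic stand-in for a head field
plane = np.add.outer(2.0 * np.arange(64), 0.5 * np.arange(64))        # exactly planar field

recon_full = subsample_and_interpolate(head, k=64)      # N = 4096: nothing is lost
recon_corners = subsample_and_interpolate(plane, k=2)   # N = 4: corners only
recon_k4 = subsample_and_interpolate(head, k=4)         # N = 16
```

A planar field survives even the corners-only grid because bilinear interpolation reproduces planes exactly; any structure beyond that is lost, which is the information the text then has a chance to restore.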
### 3.6 SPE10 Transfer Protocol
As an exploratory test, we fine-tune a separate SPE10-transfer base (synthetically pre-trained with the auxiliary objective below, distinct from the reconstruction-only headline model) on $N = 20$ SPE10 layers (generator and discriminator jointly), evaluating on a fixed 13-layer holdout (fine-tuning schedule and layer split in LABEL:app:spe10_protocol, LABEL:app:spe10_split). The transfer objective adds two auxiliary terms to the reconstruction loss: an adversarial sharpening loss \[[14](https://arxiv.org/html/2606.24967#bib.bib13), [19](https://arxiv.org/html/2606.24967#bib.bib16), [23](https://arxiv.org/html/2606.24967#bib.bib15)\] and a semantic-alignment penalty between a bottleneck projection $\hat{\mathbf{e}}$ and the input embedding $\mathbf{e}$ (cf. \[[53](https://arxiv.org/html/2606.24967#bib.bib3)\]); full form and weights are in LABEL:app:spe10_objective. Against a no-text baseline, three text inputs are compared: $K$-informed reference text (a per-layer GPT-4o-mini description rendered from the $\log_{10}(K)$ image, an upper bound on per-layer text, not a true inverse-problem oracle), a generic-geological control (a single pattern-free sentence applied to every layer), and a non-geological random-text control; the taxonomy-restricted nearest-class-mean proxy (§[3.4](https://arxiv.org/html/2606.24967#S3.SS4)) is evaluated on the same holdout. The reference-text prompt and schema are in LABEL:app:spe10_vision; the control sentences are specified in LABEL:app:spe10_protocol.
### 3.7 Classical Inversion Baselines
To place the text-driven gain in context against priors that carry no geological knowledge, we evaluate unregularised, Tikhonov \[[2](https://arxiv.org/html/2606.24967#bib.bib1)\], and total-variation (TV) \[[41](https://arxiv.org/html/2606.24967#bib.bib11)\] inversions on the grouped synthetic test set. Each minimises a head data-fit term plus a regulariser $\lambda R(K)$ over the normalised log-conductivity field (the generator’s output space), optimised through the same differentiable Darcy discretisation used to generate the data. For each sample we report the K-MSE of the iterate with the lowest head data-fit residual, selected without access to the true $K$, which favours the iterative baselines over final-iterate reporting. Aggregate statistics use the full grouped test set ($n = 1{,}872$), with oracle-text and no-text generator outputs evaluated on the same samples; the classical runs are allowed up to $500$ iterations with early stopping.
### 3.8 Statistical Methods
All confirmatory tests are two-sided at $\alpha = 0.05$. Paired comparisons use the Wilcoxon signed-rank test (per-sample K-MSE is right-skewed and heavy-tailed); monotonic associations use Spearman’s $\rho_{s}$, with exact permutation $p$-values for class-level analyses ($n = 6$).
Multiplicity is handled within three pre-specified families: (i) six per-pattern L0→L4 specificity tests (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3), Table LABEL:tab:paired_spec); (ii) six seed-level one-hot-vs-SBERT tests across the ten-seed retrains (§[4.4.3](https://arxiv.org/html/2606.24967#S4.SS4.SSS3), Table LABEL:tab:multiseed); and (iii) three SPE10 text-condition comparisons (§[4.8](https://arxiv.org/html/2606.24967#S4.SS8), Table LABEL:tab:paired_spe10). Holm–Bonferroni adjustment \[[16](https://arxiv.org/html/2606.24967#bib.bib25)\] is primary for families (i) and (iii). Family (ii) raw $p$-values are reported as seed-level diagnostics; we make no family-wise significance claim from the small-error patterns, where the encodings differ only marginally. The SPE10 nearest-class-mean contrasts (§[3.4](https://arxiv.org/html/2606.24967#S3.SS4)) are external descriptive controls, reported with raw paired Wilcoxon $p$-values.
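For reference, the Holm–Bonferroni step-down adjustment applied within a family can be written in a few lines of plain Python. The raw $p$-values below are hypothetical and only illustrate the mechanics for a three-test family.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest to largest p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])  # multiplier shrinks from m down to 1
        running_max = max(running_max, candidate)         # enforce monotone adjusted p-values
        adjusted[idx] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]   # hypothetical raw p-values for a three-test family
adj = holm_adjust(raw)
rejected = [p <= 0.05 for p in adj]
```

Here only the smallest raw value survives at $\alpha = 0.05$: the second-smallest is doubled to 0.06, and the monotonicity step carries that value forward to the largest.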
Intervals labelled BCa use the bias-corrected and accelerated bootstrap with $10{,}000$ resamples \[[11](https://arxiv.org/html/2606.24967#bib.bib26)\], resampling observations at the sample or layer level: per-pattern L0→L4 specificity differences (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3)), SPE10 condition means and paired condition differences (§[4.8](https://arxiv.org/html/2606.24967#S4.SS8)), and classical-baseline mean K-MSE (Table [3](https://arxiv.org/html/2606.24967#S4.T3)). The headline text-effect interval (LABEL:app:paired_stats_tables) instead uses a unique-$K$-field cluster bootstrap. For the within-class control, Table [4](https://arxiv.org/html/2606.24967#S4.T4) reports the row-weighted instance fraction; for classes built from repeated $K$ fields we additionally compute a realisation-level estimate, averaging the no-text, within-class, and oracle K-MSE within each unique $K$ field before forming the fraction, with the unique field as the unit (Layered: $n = 10$). The one-hot-vs-SBERT comparison is summarised by across-seed mean $\pm$ s.d. (Table LABEL:tab:multiseed); the per-pattern aggregate K-MSE values in Table [2](https://arxiv.org/html/2606.24967#S4.T2) are point estimates.
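The difference between the row-weighted and realisation-level instance fractions can be sketched as follows. This is one plausible reading of the procedure (average within each unique field, then average the field means with equal weight, then form the fraction); the row layout, key names, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

KEYS = ("notext", "within", "oracle")

def fraction(notext, within, oracle):
    return (within - oracle) / (notext - oracle)

def row_weighted_fraction(rows):
    """Every test row counts equally, so repeated K fields weigh more."""
    return fraction(*(mean(r[k] for r in rows) for k in KEYS))

def realisation_level_fraction(rows):
    """Average within each unique K field first, so every field counts once."""
    by_field = defaultdict(list)
    for r in rows:
        by_field[r["field_id"]].append(r)
    field_means = [{k: mean(r[k] for r in group) for k in KEYS} for group in by_field.values()]
    return fraction(*(mean(f[k] for f in field_means) for k in KEYS))

# Hypothetical rows: field "A" appears three times, field "B" once.
rows = [
    {"field_id": "A", "notext": 0.10, "within": 0.06, "oracle": 0.02},
    {"field_id": "A", "notext": 0.10, "within": 0.06, "oracle": 0.02},
    {"field_id": "A", "notext": 0.10, "within": 0.06, "oracle": 0.02},
    {"field_id": "B", "notext": 0.10, "within": 0.02, "oracle": 0.02},
]
row_frac = row_weighted_fraction(rows)
field_frac = realisation_level_fraction(rows)
```

In this toy case the repeated field pulls the row-weighted fraction up to 0.375, while the realisation-level estimate is 0.25, which shows why the two can differ for a class built from few unique fields.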
## 4 Results
### 4.1 Text Conditioning Reduces Reconstruction Error Across All Pattern Classes
We compare three text conditions on the leak-free hash-grouped test set ($n = 1{,}872$, approximately 300 per pattern; LABEL:app:split_audit documents the split protocol):
No text: Zero $384$-dimensional embedding, used as a no-language counterfactual; the generator was not trained with embedding dropout, so this input is out of distribution (see §[5.3](https://arxiv.org/html/2606.24967#S5.SS3)).
Oracle text: The ground-truth parametric description from the generation pipeline.
Random text: A description drawn from a randomly chosen sample of a different pattern class (wrong-class but in-domain geological text; distinct from the non-geological random-text used on SPE10, §[4.8](https://arxiv.org/html/2606.24967#S4.SS8)).
Relative to this no-text counterfactual, oracle text reduces average test K-MSE from 0.0869 to 0.0168, an 81% improvement across the full grouped test set. Figure [4](https://arxiv.org/html/2606.24967#S4.F4) illustrates this on a single Continuous-class sample: oracle text recovers the log-normal $K$ structure almost exactly, while a wrong-class description severely distorts the reconstruction. Table [2](https://arxiv.org/html/2606.24967#S4.T2) reports per-pattern results.
Table 2: Per-pattern K-MSE under three text conditions on the grouped test set. Rows ordered by no-text K-MSE (descending). Improvement is $(K_{\mathrm{notext}} - K_{\mathrm{oracle}})/K_{\mathrm{notext}}$. Random denotes a wrong-class in-domain geological description.
Figure 4: Single-sample reconstruction of a Continuous-class pattern under three text conditions, evaluated on the grouped test partition. From left: observed head field $h$, ground-truth hydraulic conductivity $K$, and reconstructed $K$ with no text, oracle text, and random text (a wrong-class Band description, not a non-geological string).
Figure [5](https://arxiv.org/html/2606.24967#S4.F5)a summarises the pattern-level contrasts. The largest absolute improvements occur on patterns with the highest no-text error: Layered, Random, and Continuous, all characterised by spatially extensive structures whose head fields are similar across a wide range of configurations. A Layered medium with two sand strata at different depths produces a head field nearly indistinguishable from one with three thinner strata at different positions; text resolves this ambiguity by specifying the number and placement of strata. Conversely, Circle and Ellipse, where localised sand bodies produce distinctive local head deflections that constrain $K$ tightly, show the smallest absolute gains (Figure [5](https://arxiv.org/html/2606.24967#S4.F5)b).
These gains are not merely pixel-wise: the text-conditioned reconstructions are more consistent with the observed Darcy flow, not only closer to the true $K$. Solving the Darcy forward operator on the reconstructed $\hat{K}$, oracle text reduces the mean head-residual MSE (the forward-consistency of $\hat{K}$ with the observed head) from $0.0034$ to $0.0002$ and the relative error in total outlet discharge from $0.23$ to $0.14$; over the five binary-facies classes, thresholding into sand/clay facies raises the mean facies intersection-over-union from $0.72$ to $0.92$, with the largest gain on Layered ($0.14 \to 0.71$).
Wrong\-class text increases mean K\-MSE in every pattern and is 73% worse overall than no text \(0\.1501 vs 0\.0869\), supporting sensitivity to text content beyond a generic non\-zero\-input effect\. Because the no\-text condition is itself the zero\-embedding counterfactual, this remains an in\-domain wrong\-text contrast rather than a strict non\-zero ablation\.
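The two headline percentages follow directly from the three reported means; a short check using only the values quoted in the text:

```python
# Mean test K-MSE values quoted in the text.
kmse_notext, kmse_oracle, kmse_wrong = 0.0869, 0.0168, 0.1501

# Relative reduction of oracle text against the no-text counterfactual.
improvement_pct = 100.0 * (kmse_notext - kmse_oracle) / kmse_notext
# Relative increase of wrong-class text against the same counterfactual.
degradation_pct = 100.0 * (kmse_wrong - kmse_notext) / kmse_notext

print(f"oracle vs no text: {improvement_pct:.1f}% lower K-MSE")
print(f"wrong-class vs no text: {degradation_pct:.1f}% higher K-MSE")
```

Both figures are relative to the same zero-embedding counterfactual, and they round to the 81% and 73% quoted above.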
Figure 5: Text conditioning effect on synthetic data. (a) K-MSE by pattern under three text conditions: no text (zero embedding), oracle text (ground-truth description), and random text (wrong-class description). Oracle text improves all patterns; random text degrades all pattern means. (b) Absolute improvement (no-text minus oracle K-MSE) tracks no-text baseline error; text helps most where physics alone is most ambiguous.
### 4.2 Comparison with Classical Inversion Baselines
We compare the trained generator against three classical inversion baselines \(unregularised gradient descent, Tikhonov, and total variation\) on the full grouped test set\. The protocol is described in §[3\.7](https://arxiv.org/html/2606.24967#S3.SS7); the trained generator with oracle text and with no text is evaluated on the same samples for comparison\.
Table 3: Mean K-MSE on the grouped synthetic test set ($n = 1{,}872$). Oracle and no-text rows reproduce the trained generator from Table [2](https://arxiv.org/html/2606.24967#S4.T2). 95% CIs are BCa bootstrap (10,000 resamples).
As reported in Table [3](https://arxiv.org/html/2606.24967#S4.T3), the trained generator with oracle text reaches a mean K-MSE of 0.0168, more than an order of magnitude below the best classical baseline (Tikhonov, 0.1862). Unregularised and TV inversions essentially fail at this scale ($\approx 0.20$): with only the head observation as a constraint, gradient descent on $K$ through the elliptic operator finds many fields that match $h$ to within the residual tolerance but differ markedly from the true $K$. The zeroth-order Tikhonov ($L_{2}$) penalty reduces the overall mean only modestly (${\sim}8\%$). Even no-text generator inference (0.0869), which carries no site knowledge at all, remains below every classical baseline, by more than $2\times$ at this iteration budget, indicating that the architectural prior implicit in the U-Net plus the training distribution already supplies structure that the tested $L_{2}$ magnitude and total-variation penalties alone do not.
### 4.3 Specificity Spectrum of Posterior Contraction
On all six synthetic pattern classes ($n = 10$ stratified samples per pattern), the text uncertainty-reduction proxy $\hat{\Delta}(K; t \mid h)$ is larger at the most specific level L4 than at the generic L0 (Figure [6](https://arxiv.org/html/2606.24967#S4.F6)a; paired Wilcoxon within family (i) of §[3.8](https://arxiv.org/html/2606.24967#S3.SS8); per-pattern statistics in Table LABEL:tab:paired_spec). The trajectory is broadly increasing but not strictly monotonic: beyond pattern naming, more detailed geometric descriptions can introduce paraphrase variability that outweighs the added constraint, so individual L→L+1 steps sometimes decrease.
Decomposing the total reduction (Figure [6](https://arxiv.org/html/2606.24967#S4.F6)b), the L0→L2 component, pattern identification, carries the bulk on four of six patterns (Circle, Ellipse, Random, Continuous; ${\geq}84\%$, exceeding $100\%$ where $\hat{\Delta}$ peaks at L2 and then falls back), with Band more balanced. The exception is Layered, where pattern naming captures only $7\%$ and the dominant jump is at L2→L3 ($81\%$), when the text specifies the number and positions of strata, moving from “A horizontally layered pattern” to “3 horizontal clay strata in sand”. The specificity axis thus corroborates, from the text side, the encoding controls of §[4.4](https://arxiv.org/html/2606.24967#S4.SS4): for five of six classes the class-level signal carries the benefit, and only Layered draws on finer, instance-specific detail.
The absolute L4 reduction varies roughly twofold across patterns, consistent with how strongly the head field alone constrains the inverse solution \(the forward\-map view developed in §[4\.5](https://arxiv.org/html/2606.24967#S4.SS5)and §[5\.1](https://arxiv.org/html/2606.24967#S5.SS1)\)\.
Figure 6: Specificity spectrum on synthetic data ($n = 10$ stratified samples per pattern). (a) Text uncertainty reduction $\hat{\Delta}(K; t \mid h)$ by pattern and specificity level (L1–L4; L0 is the minimal generic-text reference, “A porous medium.”, not the zero-embedding no-text condition). (b) Decomposition of total $\hat{\Delta}_{\mathrm{L4}}$ into the L0→L2 component (pattern identification) and L2→L4 component (instance-specific parameters), with percentages indicating the L0→L2 fraction.
### 4.4 Encoding the Soft Prior: Class-Mean, One-Hot, and Sentence Embedding
The specificity spectrum \(§[4\.3](https://arxiv.org/html/2606.24967#S4.SS3)\) showed that pattern identification accounts for the bulk of the text uncertainty reduction on most patterns\. Three further controls isolate the numerical encoding of that pattern\-level signal\. The first holds class membership fixed while removing within\-class variation in the embedding \(the class\-mean control\)\. The second swaps each text for that of a different same\-class realisation, asking whether the correct within\-class instance matters at all \(the within\-class control\)\. The third replaces the SBERT embedding with a discrete class indicator, retraining the generator from scratch \(the one\-hot baseline\)\.
#### 4.4.1 Class-mean embedding control.
Applying the class-mean control of §[3.4](https://arxiv.org/html/2606.24967#S3.SS4) on the grouped test set (per-sample SBERT input replaced by the training-set class centroid, generator weights untouched), we find that for five of six patterns the class-mean condition captures effectively all of the oracle K-MSE benefit by the zero-baseline capture metric of Equation [7](https://arxiv.org/html/2606.24967#S3.E7) ($99.5$–$107.7\%$). The exception is Layered, where the class mean captures only $19.0\%$. One alternative explanation is that within-class SBERT embeddings cluster tightly around their centroid, so the substitution changes little; the next two controls address it: a within-class text swap that varies the description while holding the class fixed, and a one-hot retrain that drops the embedding for a discrete label carrying no within-class information at all.
#### 4.4.2 Within-class control: the embedding acts as a largely class-level prior.
To separate the embedding’s value as class identity from within-class instance detail, we swap each test sample’s text for that of a different same-class realisation (distinct $K$, ten draws; §[3.4](https://arxiv.org/html/2606.24967#S3.SS4)), so that $(\mathrm{no\text{-}text} - \mathrm{within})$ isolates the class contribution and $(\mathrm{within} - \mathrm{oracle})$ the instance contribution. For five of six classes the within-class condition is indistinguishable from oracle (instance fraction ${\approx}0\%$; Table [4](https://arxiv.org/html/2606.24967#S4.T4)): once the class is known, the specific wording adds nothing beyond what the head already supplies, so the embedding acts as a categorical prior, with the meaningful class-level evidence carried by Band, Random and Continuous (Circle and Ellipse sit at the K-MSE floor, $h$-only $\leq 0.002$, where the instance fraction is uninformative). The sole exception is Layered, where ${\sim}48\%$ of the achievable benefit requires the correct instance ($44\%$ over its ten unique $K$ realisations; §[3.8](https://arxiv.org/html/2606.24967#S3.SS8)). This addresses the alternative left open by the class-mean control (§[4.4.1](https://arxiv.org/html/2606.24967#S4.SS4.SSS1)): varying the wording within a class, not merely collapsing it to a centroid, leaves the reconstruction essentially unchanged, and a wrong-class description degrades it on $82.7\%$ of samples versus $11.3\%$ for a wrong same-class instance, confirming that class fidelity, not instance wording, is what the solver requires.
The unused instance signal is nonetheless present in the embedding: with the deployed encoder, a held-out ridge probe recovers the generative latent $\xi$ from the text embedding with clearly positive $R^{2}$ for the five non-Layered classes, above near-zero shuffled-$\xi$ nulls (overall $R^{2} = 0.50$–$0.66$ across encoders; §[3.5](https://arxiv.org/html/2606.24967#S3.SS5)); for Layered, with only ten unique realisations, the probe is inconclusive (near-zero $R^{2}$), though there the swap result above is itself functional evidence that the embedding carries the instance signal.
Table 4: Within-class control (grouped test set, $n = 1{,}872$; canonical order). Per-class K-MSE under no-text, a within-class swap (same class, different $K$ realisation), and oracle text. The instance fraction $(\mathrm{within} - \mathrm{oracle})/(\mathrm{no\text{-}text} - \mathrm{oracle})$, the share of the oracle benefit requiring the correct same-class instance, is ${\approx}0$ except for Layered.
#### 4.4.3 One-hot class-label baseline: a multi-seed view.
To test whether sentence\-embedding conditioning carries information beyond a discrete class label, we retrain the generator from scratch with the SBERT input replaced by a 6\-dimensional one\-hot class indicator \(the smallest possible conditioning bottleneck\) and, to separate input width from embedding geometry, by a capacity\-matched 384\-dimensional fixed random per\-class vector \(one\-hot384\), each across ten random seeds with the training seed as the unit of analysis \(§[3\.4](https://arxiv.org/html/2606.24967#S3.SS4)\)\.
The central finding is a difference in training stability, not in attainable error. SBERT conditioning is stable across seeds with no seed-level split, its minimum validation K-MSE tightly clustered at $0.0102$–$0.0111$. The one-hot retrain is instead bimodal: classifying a seed as low-error by a $0.01$ validation cutoff (in the wide gap between the one-hot’s two modes and just below SBERT’s tight range, so the count is insensitive to its exact value), one-hot reaches the low-error mode on only $5$ of $10$ seeds, the capacity-matched one-hot384 on $7$ of $10$. Across all ten seeds the one-hot code thus shows no reliable advantage: it is comparable to SBERT in the mean but at nearly twenty times the across-seed s.d. (per-pattern values and the seed-level test in Table LABEL:tab:multiseed; Figure [7](https://arxiv.org/html/2606.24967#S4.F7)).
Figure 7: Reliability of discrete-code versus sentence-embedding conditioning across ten retrain seeds (each at its oracle conditioning; §[4.4.3](https://arxiv.org/html/2606.24967#S4.SS4.SSS3)). (a) Per-pattern oracle K-MSE for SBERT (green), the $6$-d one-hot (orange), and the capacity-matched $384$-d random code (purple); horizontal bars mark across-seed means. (b) Per-epoch validation K-MSE for all ten seeds of each code (log $y$).
Conditional on reaching its low-error mode, however, the one-hot retrain is in fact more accurate than SBERT ($0.0212$ versus $0.0771$ on Layered), so a discrete label can encode the structural constraint at least as tightly, simply unreliably under the matched protocol. That the capacity-matched one-hot384 is bimodal in the same way places the instability in the conditioning representation rather than in input width. The codes also differ in convergence speed (Figure [7](https://arxiv.org/html/2606.24967#S4.F7)b): SBERT settles into its stable plateau within ${\sim}40$ epochs on every seed, whereas a one-hot seed either plateaus early at high error or reaches the low-error mode only after several times as many epochs (median $186$ versus $42$). At this data scale, the main practical gain from sentence embeddings is therefore more stable, faster training rather than strictly more information. The takeaway is not that discrete labels suffice in general, but that this closed-vocabulary, taxonomy-aligned setting lets a clean label compete; the capabilities the interface enables beyond such labels, open-vocabulary coverage and a paraphrase-ensemble uncertainty proxy, are taken up in §[5.2](https://arxiv.org/html/2606.24967#S5.SS2).
### 4.5 When Text Carries Instance Information: Forward-Map Collapse and Sparse Observation
The within-class controls (§[4.4.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)) leave Layered as the only class drawing on within-class text: its correct instance supplies ${\sim}48\%$ of the text benefit (Table [4](https://arxiv.org/html/2606.24967#S4.T4)), against ${\approx}0\%$ for the rest. Since a class label fixes the class but never the realisation, this is the one place text can add what a label cannot, and the forward map $h = f(K)$ explains why (Figure [8](https://arxiv.org/html/2606.24967#S4.F8)). Per class, the forward-collapse ratio measures the head-field spread relative to the conductivity spread that generates it (§[3.5](https://arxiv.org/html/2606.24967#S3.SS5)). It is order-unity for five classes ($0.20$–$0.83$) but $1.3 \times 10^{-5}$ for Layered, whose strata lie parallel to the imposed flow so that distinct layerings produce near-identical heads: the head barely constrains which layering produced it, leaving text the only channel that can. Layered is correspondingly the hardest class (largest no-text error; §[4.1](https://arxiv.org/html/2606.24967#S4.SS1)).
Figure 8:Forward\-collapse of the head field \(§[4\.5](https://arxiv.org/html/2606.24967#S4.SS5)\)\. \(a\) Three unique\-KKrealisations \(top\) and their head anomaliesΔh\\Delta h\(deviation from the class\-mean head; bottom, shared colour scale\) for Layered and the contrast class Continuous\. \(b\) Forward\-collapse ratio \(head spread over conductivity spread; §[3\.5](https://arxiv.org/html/2606.24967#S3.SS5)\) for all six classes\.A forward\-map degeneracy is not the only way the observation can underdetermine the geology: sparsifying the observation itself should act in the same direction, if more weakly\. We therefore degrade the observed head to an endpoint\-inclusivek×kk\\times kgrid \(Figure[9](https://arxiv.org/html/2606.24967#S4.F9)a; §[3\.5](https://arxiv.org/html/2606.24967#S3.SS5)\) and track the beyond\-class gap\(within−oracle\)\(\\mathrm\{within\}\-\\mathrm\{oracle\}\), what the correct instance text adds over a same\-class wrong\-instance text, across ten training seeds \(Figure[9](https://arxiv.org/html/2606.24967#S4.F9)b\)\. At full observation this gap is small for all five classes \(≤4%\{\\leq\}4\\%of no\-text K\-MSE; Figure[9](https://arxiv.org/html/2606.24967#S4.F9)b\)\. Sparsifying broadens it where the missing structure is describable global geometry, such as Band’s dip direction and Ellipse’s body geometry, whereas Continuous rises weakly and non\-significantly, Random stays elevated at every density without a consistent trend, and Circle stays near zero throughout\. 
Layered sits far off this scale and confirms the forward\-map account: its gap is large and flat across the sweep \(≈46%\{\\approx\}46\\%of no\-text error at every density\), and, in the grid\-retrained model used for this sweep, feeding the class\-mean embedding \(pure class identity, the cleanest stand\-in for a class label\) reconstructs Layered worse than no text \(capture≈−27%\{\\approx\}\-27\\%\), whereas the same class mean recovers nearly all of the oracle benefit for the other five classes\. The beyond\-class contribution thus grows as sparser observation leaves describable structure underdetermined, though even at the corners\-only limit it remains a fraction of Layered’s forward\-map degeneracy\. We read this growth as a conservative lower bound on what richer description can contribute under sparse observation: the six\-class taxonomy is deliberately informative \(class identity alone recovers nearly all of the text benefit for five of six classes\), so in settings without so strong a categorical prior, instance\-level description would have correspondingly more room to add value\.
Figure 9: Beyond-class text contribution under sparsifying observation. (a) The six swept observation densities, from full ($N = 4096$) to corners-only ($N = 4$); red points mark the observed head positions. (b) Beyond-class $(\mathrm{within} - \mathrm{oracle})$ gap as a percentage of no-text K-MSE for the five non-Layered classes; mean over ten training seeds, band $\pm 1$ s.e.
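The endpoint-inclusive $k \times k$ subsampling used in this sweep can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a $64 \times 64$ head field, evenly spaced rounded indices that always include both boundary rows and columns, and zero-filling of unobserved cells. The function names `grid_indices` and `subsample_head` are hypothetical, and only the two endpoint densities stated in the caption ($N = 4096$ and $N = 4$) are exercised.

```python
import numpy as np

def grid_indices(n: int, k: int) -> np.ndarray:
    """Return k evenly spaced indices in [0, n-1], always including both endpoints."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    return np.round(np.linspace(0, n - 1, k)).astype(int)

def subsample_head(head: np.ndarray, k: int):
    """Keep head values on an endpoint-inclusive k x k grid and mask the rest."""
    n = head.shape[0]
    idx = grid_indices(n, k)
    mask = np.zeros(head.shape, dtype=bool)
    mask[np.ix_(idx, idx)] = True          # observed positions (red points in Fig. 9a)
    sparse = np.where(mask, head, 0.0)     # zero-fill is an assumption of this sketch
    return sparse, mask

rng = np.random.default_rng(0)
head = rng.random((64, 64))                # stand-in for a normalised head field
for k in (64, 2):                          # full observation and corners-only
    _, mask = subsample_head(head, k)
    print(k, int(mask.sum()))
```

With `k = 64` every cell is observed, and with `k = 2` only the four corners remain, matching the two limits of the sweep.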
### 4.6 The Paraphrase-Ensemble Proxy: Rank-Reliable but Miscalibrated
For the paraphrase ensemble to serve as the posterior-inspired sensitivity proxy of §[3.2](https://arxiv.org/html/2606.24967#S3.SS2), its dispersion should track reconstruction error in both rank and magnitude. We test this on $n = 200$ synthetic test samples stratified across the six pattern classes, each with an ensemble of $N_p = 16$ texts: the oracle text plus 15 LLM-generated paraphrases of it. For each sample we compare $\bar{\sigma}_K$ to the observed error $\sqrt{\text{K-MSE}}$ of the oracle-text reconstruction against the ground-truth $K$ field.
The rank ordering is strong overall: Spearman $\rho = 0.85$ (Figure [10](https://arxiv.org/html/2606.24967#S4.F10)). This overall correlation is, however, largely a between-pattern effect, visible in the figure as colour clusters that separate along the diagonal while spreading vertically within each class: within individual patterns only Random correlates robustly ($\rho = 0.76$; the other five $\rho \leq 0.36$). Thus $\bar{\sigma}_K$ separates easy patterns from hard ones more reliably than it ranks samples within a pattern. In magnitude, however, the proxy is miscalibrated: the log-log fit across the samples (the fitted line in Figure [10](https://arxiv.org/html/2606.24967#S4.F10)) has slope 0.74, below the unit slope expected if $\bar{\sigma}_K$ were a calibrated predictive standard deviation, and the paraphrase spread under-estimates RMSE across the whole dispersion range, by roughly $26\times$ at the low end of the fitted relation, narrowing to $\sim 4\times$ at the top.
The paraphrase ensemble is therefore a useful cross\-pattern uncertainty proxy, separating easy reconstructions from hard ones, but its per\-sample use within a class is limited: the sub\-unit slope and the large multiplicative offset mean the absolute spread needs post\-hoc calibration before it can be read as a predictive standard deviation, while the generally weak within\-pattern correlations, which calibration cannot repair, would require separate validation before any within\-class triage\. This cross\-pattern rank reliability motivates the uncertainty probe of §[5\.2](https://arxiv.org/html/2606.24967#S5.SS2): the paraphrase interface exposes an input\-dependent dispersion signal \(arising from the varying text alone, as the SBERT encoder is frozen and deterministic\) that a discrete one\-hot label, having nothing to paraphrase, does not provide\.
Figure 10: Reliability of the paraphrase-ensemble proxy on $n = 200$ stratified test samples ($N_p = 16$ texts each): per-sample oracle-text $\sqrt{\text{K-MSE}}$ against $\bar{\sigma}_K$, coloured by pattern class; the grey dashed line $y = x$ marks perfect magnitude calibration, the red dashed line the log-log fit across samples.
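The two reliability checks above (rank agreement and magnitude calibration) can be sketched in a few lines. The code below is illustrative only: it assumes $\bar{\sigma}_K$ is the per-cell standard deviation across the ensemble averaged over the grid (the paper's exact definition is given in §3.3 and is not reproduced here), and it runs on synthetic numbers generated with an arbitrary sub-unit exponent, not on the paper's data. All function names are hypothetical.

```python
import numpy as np

def ensemble_dispersion(recons: np.ndarray) -> float:
    """Mean per-cell standard deviation across an ensemble of shape (N_p, H, W)."""
    return float(recons.std(axis=0).mean())

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation; assumes no ties (true for continuous data)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def loglog_slope(sigma: np.ndarray, rmse: np.ndarray):
    """Least-squares fit of log10(rmse) = slope * log10(sigma) + intercept."""
    slope, intercept = np.polyfit(np.log10(sigma), np.log10(rmse), 1)
    return float(slope), float(intercept)

# Synthetic stand-in data (NOT the paper's results): a sub-unit power law
# with multiplicative scatter, mimicking a rank-reliable but miscalibrated proxy.
rng = np.random.default_rng(0)
sigma = 10 ** rng.uniform(-3, -1, size=200)
rmse = 0.5 * sigma ** 0.7 * 10 ** rng.normal(0.0, 0.1, size=200)

rho = spearman_rho(sigma, rmse)
slope, intercept = loglog_slope(sigma, rmse)
print(f"rho={rho:.2f}, slope={slope:.2f}")
```

A slope below one together with `rmse > sigma` everywhere is the signature described in the text: the ordering is informative, but the spread cannot be read as a predictive standard deviation without recalibration.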
### 4.7 Operating Envelope under Observation Noise
All synthetic experiments above train and evaluate on noiseless head fields, so the gain at $\sigma = 0$ should be read as a best-case estimate of the prior’s value. As an evaluation-only stress test of the noiseless-trained solver (no noise-aware retraining), we map how the oracle-vs-no-text gap changes when zero-mean Gaussian noise is added i.i.d. to every cell of the dense input head field, the same realisation corrupting the paired oracle and no-text branches; the swept noise level $\sigma$ is a measurement-precision axis rather than an observation-density or -placement one. This injection (normalised head field range $[0, 1]$, so $\sigma$ reads directly as a fraction of the full head range; $n = 200$ stratified test samples) reduces the oracle-vs-no-text gain from 80.3% at $\sigma = 0$ to 39.0% at $\sigma = 10^{-3}$, 8.5% at $\sigma = 5 \times 10^{-3}$, and $\sim 1\%$ by $\sigma = 2 \times 10^{-2}$ (Figure [11](https://arxiv.org/html/2606.24967#S4.F11)). By $\sigma = 5 \times 10^{-3}$ both reconstructions are already substantially degraded (oracle/no-text K-MSE $0.31/0.34$, versus $0.015/0.076$ at $\sigma = 0$) and the text gain has largely closed. Because the generator never saw noisy heads in training, this collapse conflates two effects we do not separate here (the genuine loss of observation signal-to-noise, and the train/test mismatch of a noiseless-trained solver evaluated on corrupted inputs), so we read it as an empirical operating envelope rather than as an information-theoretic limit. Either way, the direction is consistent with the mechanism characterised above: the embedding acts as auxiliary information that resolves structural ambiguity only while the structure-induced head signal stays above the noise floor.
Within this no\-retraining stress test, the practical advantage of oracle text over no text is confined to low\-noise inputs and closes rapidly as noise grows; quantifying its value under realistic monitoring\-well noise, and whether noise\-aware training can recover any of the lost gain, is left to future work\.
Figure 11: Evaluation-only noise stress test. Reconstruction K-MSE under oracle text and no text versus the standard deviation $\sigma$ of Gaussian noise added to the normalised head field at evaluation time. The solver is not retrained for noisy inputs, so the curves reflect both loss of observation signal-to-noise and train–test mismatch. Means over $n = 200$ stratified test samples, with $\sigma = 0$ placed at $10^{-4}$ for the logarithmic axis.
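The arithmetic behind the reported gains, and the paired-noise protocol, can be checked with a short sketch. It assumes the gain is defined as the relative K-MSE reduction $1 - \text{K-MSE}_{\text{oracle}} / \text{K-MSE}_{\text{no-text}}$, which reproduces the noiseless figure from the quoted K-MSE pair; the function names and the random stand-in head field are hypothetical.

```python
import numpy as np

def add_head_noise(head: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add i.i.d. zero-mean Gaussian noise to a head field normalised to [0, 1]."""
    return head + sigma * rng.standard_normal(head.shape)

def text_gain(mse_oracle: float, mse_no_text: float) -> float:
    """Relative K-MSE reduction of oracle text over the no-text counterfactual."""
    return 1.0 - mse_oracle / mse_no_text

# One noise realisation is drawn per sample and fed to BOTH branches
# (oracle text and no text), so the comparison stays paired.
rng = np.random.default_rng(0)
head = rng.random((64, 64))            # stand-in for a normalised head field
sigma = 5e-3
noisy = add_head_noise(head, sigma, rng)

# K-MSE pairs (oracle, no-text) as quoted in the text, rounded to two digits.
gain_clean = text_gain(0.015, 0.076)   # sigma = 0
gain_noisy = text_gain(0.31, 0.34)     # sigma = 5e-3
print(f"{100 * gain_clean:.1f}%  {100 * gain_noisy:.1f}%")
```

The noiseless pair gives about 80.3%, as reported. The two-digit pair at $\sigma = 5 \times 10^{-3}$ gives about 8.8% rather than the reported 8.5%; the difference is presumably due to rounding of the quoted K-MSE values.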
### 4.8 Exploratory Transfer to the SPE10 Reservoir
As a semi-real external anchor (exploratory, at $n = 13$ holdout layers, rather than a field-deployment validation), we fine-tune the generator on $N = 20$ SPE10 layers and evaluate on the 13-layer holdout (§[3.6](https://arxiv.org/html/2606.24967#S3.SS6)). $K$-informed reference text reduces mean holdout K-MSE from 0.0352 to 0.0190 ($-46\%$). A single pattern-free generic-geological sentence does as well ($-52\%$), echoing the synthetic specificity spectrum (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3)) in which most of the benefit needs no fine-grained detail, whereas a random non-geological control is numerically worse than no text ($+19\%$).
The results suggest that the description needs to name the relevant physical concept, rather than elaborate it. Coarsening the generic sentence one fixed text at a time (Table [5](https://arxiv.org/html/2606.24967#S4.T5)) shows the gain is governed by whether the text names heterogeneous permeability, not by its length or geological detail: an even shorter “A heterogeneous permeability field.” matches the generic sentence and the single word “Permeability.” still recovers most of the benefit, whereas on-topic rock-type descriptions that omit the concept retain at most a modest gain (“Sedimentary rock.”) or fall back to the no-text baseline and below (“Subsurface rock.”; “A porous rock.”, worse than random text). Figure [12](https://arxiv.org/html/2606.24967#S4.F12) shows the ladder on one holdout layer per formation: texts that name the concept suppress the spurious low-$K$ bodies that the no-text and random reconstructions hallucinate, and the $K$-informed reference adds no further visible structure beyond the generic sentence. Consistently, the trained FiLM map sends the in-domain geological texts to nearly the same bottleneck-control direction, while non-geological random text departs from that direction; this compression provides a representation-level explanation for why added per-layer detail does not improve over the generic sentence (see the appendix on the SPE10 mechanism). The conditioning therefore has a minimum content requirement (the text must name the discriminating concept) and, once that requirement is met, diminishing returns: adding formation or pattern detail to the generic sentence fails to lower holdout K-MSE, which instead rises monotonically with payload toward the $K$-informed reference value.
This benefit is not confined to one rock type. Across the two SPE10 formations (full 85-layer evaluation; one layer of each in Figure [12](https://arxiv.org/html/2606.24967#S4.F12)) the trend is the same (text helps and the generic sentence matches the $K$-informed reference), but its magnitude tracks how under-determined the inversion is: it is largest in the smooth-gradient Tarbert layers, whose head fields least identify $K$ (higher no-text error than Upper Ness; the generic sentence beats no text on all 35 layers, $-47\%$), and smaller in the channelised Upper Ness layers, whose channels are partly recoverable from the head (46 of 50 layers, $-36\%$). Detailed text edges the generic sentence more often in the channelised layers, where the geology has describable structure, but the average there remains essentially tied, so finer description still buys no reliable gain.
A taxonomy-restricted nearest-class-mean proxy, the closest automated class-label substitute, is no better than the generic sentence (Table [5](https://arxiv.org/html/2606.24967#S4.T5)) and collapses to a near-degenerate assignment, so the fixed six-class taxonomy does not fit SPE10, motivating the open-vocabulary capability we take up in §[5.2](https://arxiv.org/html/2606.24967#S5.SS2). Within this exploratory 13-layer holdout, the coarse in-domain geological prior remains useful but the finer per-layer detail does not at this fine-tuning scale; the holdout paired statistics and full protocol are in the extended SPE10 appendix (including its paired-statistics table), and the FiLM-compression geometry is in the appendix on the SPE10 mechanism.
Table 5: SPE10 concept floor: mean K-MSE on the 13-layer holdout under each text condition (lower is better). Ladder rows apply one fixed sentence to all layers. The generic sentence is “A porous medium with heterogeneous permeability distribution.”; an example random non-geological sentence is “The weather forecast predicts rain for tomorrow afternoon.”
Figure 12: Concept-floor ladder on one SPE10 holdout layer per formation: Layer 16 (Tarbert, smooth-gradient) and Layer 54 (Upper Ness, patchy); columns as labelled. Panel values are single-layer K-MSE; the random panel uses one control sentence, so its value differs from the five-sentence-per-layer average of Table [5](https://arxiv.org/html/2606.24967#S4.T5). The concept-absent text suppresses the spurious low-$K$ bodies of the no-text reconstruction only in the Upper Ness layer and degrades the Tarbert layer.
## 5 Discussion
Where Zhang et al. \[[53](https://arxiv.org/html/2606.24967#bib.bib3)\] train with a semantic latent penalty so that text regularises the inversion, we deliberately use the conditioning as a measurement instrument (a frozen embedding injected only through FiLM, with the synthetic models trained on pure reconstruction), so the controls vary the conditioning representation alone and can be read as an audit of what the language channel carries.
### 5.1 Contraction Through a Low-Dimensional, Largely Categorical Direction
#### 5.1.1 A largely class-level contraction.
Section [3.2](https://arxiv.org/html/2606.24967#S3.SS2) cast the text as a soft prior, $p(K \mid h, t) \propto p(h \mid K)\, p(K \mid t)$ (Eq. [4](https://arxiv.org/html/2606.24967#S3.E4)); the controls now let us say what that prior encodes. Introducing a latent structural class $c$ (the six classes of §[2](https://arxiv.org/html/2606.24967#S2)), the prior decomposes exactly as $p(K \mid t) = \sum_c p(K \mid c, t)\, p(c \mid t)$, in which $p(K \mid c, t)$ carries any within-class text information beyond the class label. As a baseline we define a class-only null model $p_0$ by putting the text-free prior $p(K \mid c)$ in place of $p(K \mid c, t)$, so that text acts only through the soft class assignment $p(c \mid t)$:
$$p_0(K \mid h, t) \propto p(h \mid K) \sum_c p(K \mid c)\, p(c \mid t). \tag{8}$$
The substitution $p(K \mid c, t) \to p(K \mid c)$ would be exact only if text carried no within-class information beyond the class label; $p_0$ is therefore a null model to read the controls against, not a posterior we estimate. The controls evaluate this null model class by class. Under dense observation the controls are consistent with $p_0$ for five of six classes: language contracts the posterior chiefly through $p(c \mid t)$, a low-dimensional categorical direction. The within-class text-dependence it discards is nonetheless real, used for Layered even under full observation, and increasingly for Band and Ellipse as the observation is sparsified. The specificity spectrum (§[4.3](https://arxiv.org/html/2606.24967#S4.SS3)) and the class-mean embedding control (§[4.4](https://arxiv.org/html/2606.24967#S4.SS4)) both indicate that, outside Layered, pattern identification rather than within-class detail carries most of the gain.
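To make the class-only null model of Eq. (8) concrete, the following toy sketch evaluates it over a finite set of candidate conductivity fields. Everything here is an illustrative assumption: the four candidates, the two classes, the likelihood values, and the function name `null_posterior` are invented for the example and are not taken from the paper, which treats $p_0$ as a conceptual baseline rather than an estimated posterior.

```python
import numpy as np

def null_posterior(lik: np.ndarray, prior_K_given_c: np.ndarray,
                   p_c_given_t: np.ndarray) -> np.ndarray:
    """Evaluate p0(K | h, t) ∝ p(h | K) * sum_c p(K | c) p(c | t) on M candidates.

    lik:              shape (M,)   likelihood p(h | K_m)
    prior_K_given_c:  shape (C, M) text-free class priors, each row sums to 1
    p_c_given_t:      shape (C,)   soft class assignment implied by the text
    """
    prior = p_c_given_t @ prior_K_given_c   # mixture prior over candidates
    post = lik * prior
    return post / post.sum()

# Toy setting: candidates 0 and 1 belong to class A, candidates 2 and 3 to class B.
prior_K_given_c = np.array([[0.5, 0.5, 0.0, 0.0],
                            [0.0, 0.0, 0.5, 0.5]])
# The head cannot tell candidate 0 (class A) from candidate 2 (class B).
lik = np.array([0.45, 0.05, 0.45, 0.05])

post_text = null_posterior(lik, prior_K_given_c, np.array([0.9, 0.1]))
post_flat = null_posterior(lik, prior_K_given_c, np.array([0.5, 0.5]))
print(np.round(post_text, 2), np.round(post_flat, 2))
```

With an uninformative class assignment the posterior stays split between candidates 0 and 2, whereas a text that favours class A resolves the ambiguity purely through $p(c \mid t)$, which is the categorical contraction the null model isolates.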
#### 5.1.2 Information content versus training stability.
Whether a sentence embedding carries information beyond this class-level direction is the question the one-hot retrain addresses, and the multi-seed answer concerns training stability rather than information content. When its optimisation reaches the low-error mode, the one-hot code matches or beats SBERT: most visibly on Layered, the pattern with the largest $h$-only ambiguity of the six ($h$-only K-MSE $= 0.306$; §[4.1](https://arxiv.org/html/2606.24967#S4.SS1)), where one-hot conditioned on reaching that mode attains oracle K-MSE 0.021 against SBERT’s across-seed mean of 0.077. A discrete class label can therefore encode the structural constraint at least as tightly as the embedding; the information is not unique to language. But the one-hot code reaches that mode on only about half of the ten seeds (and the capacity-matched 384-dimensional random code shows the same qualitative bimodality), so on the high-error Layered case and in the overall mean it holds no reliable advantage, whereas SBERT converges to its solution on every seed. We therefore read the comparison as an optimisation phenomenon rather than an information one: at this data scale and FiLM bottleneck the structured embedding makes the class-level constraint reliably learnable, while a sparse or unstructured code realises it only intermittently. That SBERT carries information is not in doubt: oracle text reduces K-MSE by 81% (§[4.1](https://arxiv.org/html/2606.24967#S4.SS1)); what the multi-seed comparison adds is that its practical value here is training stability rather than strictly more information. For deploying a learned inverse solver this is a consequential distinction: reproducibility across retraining, not best-case accuracy on a fortunate seed, is what lets a practitioner trust a model, so the embedding’s training stability is itself an engineering-relevant property.
We caution that the one\-hot instability may be remediable with conditioning\-specific tuning, so this is a statement about the matched, untuned protocol, not an intrinsic limitation of discrete labels; whether SBERT’s stability stems from embedding geometry or semantic content is not resolved by the present class\-mean and random\-code diagnostics and would require further controls \(e\.g\. permuted\-embedding retrains\), which we leave to future work\.
The embedding nonetheless contains more than class identity: the decode probe recovers the generative latent $\xi$ for the same five classes whose instance signal the solver leaves unused (§[4.4.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)). That the within-class signal is present yet unused under full observation is a property of the inverse problem (the head field already determines $K$ once the class is fixed) rather than a bottleneck in the embedding.
#### 5.1.3 When language helps: a per-class map.
The categorical geological structure expressed by text (“four horizontal layers”, “scattered elliptical lenses”, “tilted band”) partitions the solution space into mutually disjoint subsets that the zeroth-order Tikhonov and total-variation penalties tested here do not explicitly encode \[[2](https://arxiv.org/html/2606.24967#bib.bib1),[41](https://arxiv.org/html/2606.24967#bib.bib11)\]; at a fixed iteration budget both trail even no-text generator inference, and are about an order of magnitude worse than oracle-text inference (§[4.2](https://arxiv.org/html/2606.24967#S4.SS2)). How much this categorical contribution matters varies with how tightly the head field already constrains $K$ (Figure [2](https://arxiv.org/html/2606.24967#S2.F2)). On Circle and Ellipse the localised head response pins $K$ down even without text ($h$-only K-MSE $\leq 0.002$), leaving little headroom for any conditioning; Band is intermediate, well-constrained by the head field yet with residual ambiguity that text still resolves. The head under-constrains $K$ most for Layered, Random and Continuous (the largest no-text errors). For Random and Continuous this stems from their high conductivity contrast and variance rather than from a collapsed head, and class naming supplies most of the missing information. Layered is the distinct, extreme case: its horizontal strata lie parallel to the imposed flow and are nearly invisible to it (forward-collapse ratio $1.3 \times 10^{-5}$; §[4.5](https://arxiv.org/html/2606.24967#S4.SS5)), so even the within-class detail (which layering is present) is needed (§[4.4.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)). Correspondingly, the largest structural gain appears here: the Layered sand/clay facies IoU rises from 0.14 to 0.71 (§[4.1](https://arxiv.org/html/2606.24967#S4.SS1)).
Degrading the observation extends the same logic: as the head is grid\-subsampled, the beyond\-class contribution grows where the missing structure is describable global geometry \(Band’s dip direction, Ellipse’s lens geometry\), while the other three classes show no seed\-robust rise \(§[4\.5](https://arxiv.org/html/2606.24967#S4.SS5)\)\. The geological content that helps is thus the describable large\-scale structure \(layering, dip direction, lens geometry\) that the head field leaves unresolved and the embedding demonstrably carries, not the precise placement of individual objects\.
From an engineering-informatics standpoint, this decomposition is a map of when a language-encoded source of engineering information can condition a computational inverse model and what it contributes: chiefly a categorical geological prior, and chiefly where the head field under-constrains $K$. Its value is therefore often intelligible from how far the quantitative observations alone constrain $K$, rather than being uniform across sites, and the interface-level capabilities developed next are positioned to complement existing site-characterisation workflows, not to replace measured data.
### 5.2 What the Sentence-Embedding Interface Provides
Viewed as an interface rather than merely a code, the sentence embedding lets a user inject and perturb a structural prior without retraining the inverse solver. The parameterisation comparison above does not test this flexibility directly, but it remains a practical advantage of free-form text in the present FiLM-conditioned setting. Four affordances distinguish the sentence-embedding interface from a discrete class label, none requiring text to be the more accurate code. First, training stability: across ten retraining seeds the embedding converges to a tightly clustered oracle error, whereas a one-hot code reaches a comparably low-error solution on only about half (§[4.4.3](https://arxiv.org/html/2606.24967#S4.SS4.SSS3)), an engineering-relevant reliability, not a lower attainable error. Second and third, developed below, are the paraphrase-ensemble dispersion signal and open-vocabulary coverage; both persist regardless of how the accuracy comparison resolves at larger scale. Fourth, and narrower, the embedding carries within-class instance detail a fixed label cannot, which the solver demonstrably draws on in one case: Layered, where the head under-constrains $K$ (§[4.5](https://arxiv.org/html/2606.24967#S4.SS5), §[5.1](https://arxiv.org/html/2606.24967#S5.SS1)).
#### 5.2.1 Paraphrase-ensemble uncertainty proxy.
Paraphrase ensembles yield a low-cost, posterior-inspired sensitivity probe on $K$ given $(h, t)$: meaning-preserving paraphrases of one description map to nearby but distinct conditioning vectors, and the resulting reconstruction dispersion, summarised by $\bar{\sigma}_K$, serves as the uncertainty proxy (§[3.2](https://arxiv.org/html/2606.24967#S3.SS2), §[3.3](https://arxiv.org/html/2606.24967#S3.SS3)). Tested in §[4.6](https://arxiv.org/html/2606.24967#S4.SS6), it is best read as a cross-pattern triage signal (rank-reliable across patterns, considerably weaker within them) whose absolute scale would need post-hoc calibration before being treated as a predictive standard deviation. A discrete one-hot label has nothing to paraphrase: any re-wording maps to the same vector, so this uncertainty construction has no analogue within a discrete-label parameterisation. Conceptually, $\bar{\sigma}_K$ is a text-derived estimate of where the head observations least determine $K$, a candidate surrogate, if so far only cross-pattern, for the local degree of ill-posedness. The forward-collapse ratio of §[4.5](https://arxiv.org/html/2606.24967#S4.SS5) measures that same ill-posedness from the physics side and network-independently; $\bar{\sigma}_K$ is its text-conditioned, input-dependent counterpart (supported here only as a cross-pattern signal, §[4.6](https://arxiv.org/html/2606.24967#S4.SS6)), so the two are complementary diagnostics rather than competing ones. This also suggests a future use of the same text-conditioning interface as an inner objective for Bayesian optimal experimental design (BED; e.g. sensor placement) \[[21](https://arxiv.org/html/2606.24967#bib.bib20),[42](https://arxiv.org/html/2606.24967#bib.bib30),[13](https://arxiv.org/html/2606.24967#bib.bib50)\]: with the site narrative held fixed, candidate measurements could be ranked by the paraphrase-ensemble dispersion reduction each is expected to induce, without additional generator training. Whether the paraphrase dispersion is reliable enough within a single site to drive such a design (as opposed to the cross-pattern triage supported here) remains untested. More broadly, this is complementary to established Bayesian site-characterisation frameworks \[[30](https://arxiv.org/html/2606.24967#bib.bib27),[7](https://arxiv.org/html/2606.24967#bib.bib29),[51](https://arxiv.org/html/2606.24967#bib.bib32)\], and, outside the Bayesian family, to error-domain model falsification, which evaluates candidate models under systematic uncertainty by falsification rather than posterior weighting \[[15](https://arxiv.org/html/2606.24967#bib.bib52),[28](https://arxiv.org/html/2606.24967#bib.bib53)\]. What is new is that the prior is supplied by a natural-language description rather than by a parametric covariance, an analogue dataset, or a predefined candidate-model set.
#### 5.2.2 Open-vocabulary coverage beyond the training taxonomy.
The open\-vocabulary capability is that the sentence\-embedding interface accepts any in\-domain text the user is willing to write, including descriptions that have no natural representation in a six\-class one\-hot taxonomy\. On SPE10 the nearest\-class\-mean proxy \(the closest automated class\-label substitute reachable from the fixed taxonomy\) collapses to a near\-degenerate assignment and does not outperform a single generic geological sentence \(§[4\.8](https://arxiv.org/html/2606.24967#S4.SS8)\), precisely in the out\-of\-distribution regime where open\-vocabulary coverage should matter most\. A discrete one\-hot interface, by contrast, has by construction no way to encode that sentence at all\. Beyond coverage, the same interface admits prompt refinement or optimisation\[[52](https://arxiv.org/html/2606.24967#bib.bib5)\]\(more technical phrasing or structured constraints\) as a future capability for language\-assisted site characterisation\[[33](https://arxiv.org/html/2606.24967#bib.bib35)\]; we do not test whether such prompt engineering improves reconstruction accuracy in the present SPE10 setting\.
### 5.3 Limitations
Several caveats temper the scope of our conclusions. The physical setting is idealised: results are for $64 \times 64$ grids with normalised hydraulic conductivity, and SPE10 is evaluated only as 2D slices under a simplified line-drive boundary condition rather than the original five-spot configuration (§[2](https://arxiv.org/html/2606.24967#S2)); 3D field-scale validity under realistic units and boundary conditions remains untested. The head observations are likewise idealised, since training and main evaluation use noiseless simulated heads. In an evaluation-only stress test with no noise-aware retraining (stratified $n = 200$), the text gain falls from 80% to $\sim 1\%$ as added head-noise $\sigma$ rises from $0$ to $2 \times 10^{-2}$ (§[4.7](https://arxiv.org/html/2606.24967#S4.SS7)); this conflates reduced observation precision with the train/test mismatch of a noiseless-trained solver, so the prior’s value under realistic monitoring-well noise remains open, as does whether noise-aware training can recover the lost gain.
The linguistic and embedding scope is also limited\. On the linguistic side, ground\-truth descriptions are generated by GPT\-4o\-mini from structured prompts containing the known latent parameters, rather than authored by human geologists; real site reports are more ambiguous, verbose, and context\-dependent, and whether the class\-level mechanism persists under human\-authored knowledge is untested\. The fidelity of such human\-supplied context is itself an active question in AI\-assisted engineering design\[[47](https://arxiv.org/html/2606.24967#bib.bib51)\]\. On the embedding side, we use a general\-purpose SBERT model rather than one adapted to geological text; the decode probe already shows that encoder choice changes latent recovery \(§[4\.4\.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)\), so the deployed encoder should not be read as a ceiling for domain\-adapted representations\.
The mechanism analysis is also bounded. The when-does-text-help findings are interpretive rather than predictive laws: over only six classes the forward-collapse ratio does not significantly rank text benefit ($p > 0.1$; §[4.5](https://arxiv.org/html/2606.24967#S4.SS5)), and Layered’s instance dependence rests on only ten unique $K$ realisations (§[4.4.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)). The sparse-observation trend is measured on a separate grid-retrained generator, which conflates degraded observation with any change in how the FiLM channel is used, and is modest in magnitude outside Layered. Relatedly, because text enters only through FiLM at the $4 \times 4$ bottleneck, our reading of the within-class detail left unused under full observation as a property of the inverse problem rather than the embedding cannot be cleanly separated from an architectural one: the present bottleneck may be less able to route fine instance-specific geometry than a conditioning path with access to finer-resolution features. The decodability and instance-use results are specific to this solver, encoding, and observation scheme, and the audit protocol itself has likewise been exercised only on this Darcy-flow benchmark, so its transfer to related language-conditioned inverse problems remains to be demonstrated. Finally, the no-text baseline is itself a counterfactual: it uses a zero embedding, which is out of distribution because the generator was trained with full conditioning (no embedding dropout), so reported text-versus-no-text gains compare against a never-trained counterfactual, and an explicitly unconditional or dropout-trained baseline could reduce the apparent gap.
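The last caveat can be illustrated with a minimal FiLM sketch. It assumes the common affine form, in which linear maps of the embedding produce a per-channel scale and shift applied to the bottleneck features; the paper's exact layer is not reproduced here. The class name `FiLM`, the channel count, and the initialisation are hypothetical, and the embedding width of 384 is taken from the capacity-matched random code mentioned in the text.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation: feats -> gamma(emb) * feats + beta(emb)."""

    def __init__(self, emb_dim: int, channels: int, rng: np.random.Generator):
        # Illustrative initialisation (an assumption): identity modulation at emb = 0.
        self.W_gamma = 0.01 * rng.standard_normal((channels, emb_dim))
        self.b_gamma = np.ones(channels)
        self.W_beta = 0.01 * rng.standard_normal((channels, emb_dim))
        self.b_beta = np.zeros(channels)

    def __call__(self, feats: np.ndarray, emb: np.ndarray) -> np.ndarray:
        gamma = self.W_gamma @ emb + self.b_gamma   # per-channel scale
        beta = self.W_beta @ emb + self.b_beta      # per-channel shift
        return gamma[:, None, None] * feats + beta[:, None, None]

rng = np.random.default_rng(0)
film = FiLM(emb_dim=384, channels=8, rng=rng)   # 8 channels is a hypothetical choice
feats = rng.standard_normal((8, 4, 4))          # features at a 4 x 4 bottleneck
emb = rng.standard_normal(384)                  # stand-in for a sentence embedding

with_text = film(feats, emb)
no_text = film(feats, np.zeros(384))            # zero-embedding "no-text" input
print(np.allclose(no_text, feats))
```

With a zero embedding the modulation collapses to the bias terms alone. In this sketch the biases are initialised to the identity, but in a trained generator they take whatever values training left them with, and no training example ever exercised that input, which is why the no-text baseline is described as a never-trained counterfactual.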
## 6 Conclusions
We have examined what natural\-language site knowledge \(short geological descriptions encoded as frozen sentence embeddings\) carries into a neural solver for the Darcy\-flow hydraulic conductivity inverse problem, asking whether this form of engineering information conditions the reconstruction beyond what a discrete class label could\.
Under the leak-free grouped split, reference text reduces mean K-MSE by 81% relative to a zero-embedding no-text counterfactual (§[4.1](https://arxiv.org/html/2606.24967#S4.SS1)). To isolate what fraction of this gain is unique to sentence embeddings, we apply four representation controls on the same grouped partition: a class-mean embedding, a within-class text swap, and from-scratch retrains with a one-hot class code and a capacity-matched random code. Across ten seeds, the 6-dimensional one-hot class indicator can match or beat SBERT when its optimisation reaches the low-error mode. But the one-hot and random codes reach that mode only intermittently across retraining seeds, whereas SBERT converges there on every seed (§[4.4.3](https://arxiv.org/html/2606.24967#S4.SS4.SSS3)); the class-mean and within-class controls likewise leave the reconstruction essentially unchanged for all but one class (negligible instance contribution; §[4.4.2](https://arxiv.org/html/2606.24967#S4.SS4.SSS2)). At the present data scale and FiLM bottleneck, the language-derived prior therefore acts mainly as a class-level constraint on $K$. A clean discrete code can represent much of the same signal. What the sentence-embedding route adds is training stability: a nearly twenty-fold lower across-seed standard deviation of the reference-text K-MSE, rather than additional dense-observation accuracy. Beyond class identity, the embedding does carry within-class instance information (recoverable by a held-out decode of the generative latent), but in this benchmark the solver draws on it where the forward map or the observation under-constrains $K$: consistently for the geology-invariant Layered class, and increasingly under sparse observation for classes with describable global geometry (Band, Ellipse). Text helps mainly when the head data do not already constrain $K$ strongly (§[4.5](https://arxiv.org/html/2606.24967#S4.SS5)).
Reaching this conclusion required treating the conditioning representation as the sole experimental variable \(a frozen embedding injected only through FiLM, with the synthetic models trained on pure reconstruction\) and auditing it through the controls and probes above\. Exercised here only on Darcy flow, this audit design is intended as a reusable template for related language\-conditioned inverse problems\.
Beyond the training\-stability and instance\-detail findings above, the sentence\-embedding interface adds two capabilities that a fixed class code cannot, by construction, support: First, its paraphrase ensemble yields a dispersion that rank\-orders reconstruction error across, though not yet within, pattern classes \(§[4\.6](https://arxiv.org/html/2606.24967#S4.SS6)\): not a calibrated variance, but a candidate inner objective for language\-driven Bayesian optimal experimental design\[[21](https://arxiv.org/html/2606.24967#bib.bib20),[42](https://arxiv.org/html/2606.24967#bib.bib30)\]\. Second, its open vocabulary covers descriptions outside the training taxonomy: in the exploratory SPE10 holdout, a single generic geological sentence performs broadly comparably to per\-layerKK\-informed reference text and numerically better than the nearest\-class\-mean control \(§[4\.8](https://arxiv.org/html/2606.24967#S4.SS8)\)\.
The conditions under which the language prior helps are thus a head field that under\-constrains $K$ and a high\-precision observation regime: in an evaluation\-only stress test, the text gain falls from its noiseless best case toward zero as head\-observation noise grows \(§[4\.7](https://arxiv.org/html/2606.24967#S4.SS7)\)\. Three directions follow naturally from this mechanistic account:
- 1\. Removal of the reference\-text assumption\. Generate text automatically from the observed head field through gradient\-style optimisation in language space\[[52](https://arxiv.org/html/2606.24967#bib.bib5)\], so that the conditioning input no longer presupposes a correct in\-domain description supplied in advance\. The concept\-floor result \(§[4\.8](https://arxiv.org/html/2606.24967#S4.SS8)\) sets the bar such optimisation must clear: the recovered text must not only reduce the head residual but also name the discriminating physical concept\.
- 2\. Paraphrase\-ensemble calibration for deployment\. Calibrate the paraphrase\-ensemble dispersion $\bar{\sigma}_{K}$ as a predictive standard\-deviation estimate so that the language\-driven experimental\-design utility of §[5\.2](https://arxiv.org/html/2606.24967#S5.SS2) can be benchmarked against parametric\-covariance alternatives on increasingly field\-like site\-characterisation problems\.
- 3\. Human\-authored site descriptions\. Replace the LLM\-generated synthetic descriptions with real borehole logs and geologist narratives to test whether the class\-level mechanism persists under the ambiguity, verbosity, and context\-dependence of field descriptions\.
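As one concrete reading of the calibration direction in item 2, a single scale factor can be fitted on held\-out samples so that the rescaled dispersion behaves like a predictive standard deviation. This is a hypothetical sketch of a generic one\-parameter recalibration, not a method proposed in the paper, and the names `fit_scale`, `sigma`, and `abs_err` are illustrative.

```python
import numpy as np

def fit_scale(sigma, abs_err):
    """Fit s so that the standardised errors abs_err / (s * sigma) have unit mean square.

    sigma:   (N,) uncalibrated dispersions on a held-out calibration set (all > 0)
    abs_err: (N,) absolute reconstruction errors on the same samples
    """
    sigma = np.asarray(sigma, dtype=float)
    abs_err = np.asarray(abs_err, dtype=float)
    return float(np.sqrt(np.mean((abs_err / sigma) ** 2)))

def calibrated_sigma(sigma, scale):
    """Apply the fitted scale to new dispersions."""
    return scale * np.asarray(sigma, dtype=float)
```

A single global scale cannot repair a dispersion that fails to rank\-order error within a class, so this kind of recalibration only makes sense once within\-class ranking has been established.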
Together these directions would extend the sentence\-embedding interface beyond the controlled Darcy\-flow setting of this paper: first to realistic subsurface\-characterisation workflows, and ultimately to the wider class of engineering inverse problems in which expert natural\-language knowledge serves, alongside quantitative data, as computable information that both constrains the solution and guides the next measurement\.
## Acknowledgments
This research was supported by JSPS KAKENHI Grant Numbers JP23H00195 and JP25KJ0619\.
## Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.
## References
- \[1\]K\. Areerob, V\. Nguyen, X\. Li, S\. Inadomi, T\. Shimada, H\. Kanasaki, Z\. Wang, M\. Suganuma, K\. Nagatani, P\. Chun, and T\. Okatani\(2025\)Multimodal artificial intelligence approaches using large language models for expert\-level landslide image analysis\.Computer\-Aided Civil and Infrastructure Engineering40\(19\),pp\. 2900–2921\.External Links:[Document](https://dx.doi.org/10.1111/mice.13482)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p5.1)\.
- \[2\]R\. C\. Aster, B\. Borchers, and C\. H\. Thurber\(2018\)Parameter estimation and inverse problems\.3rd edition,Elsevier\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p1.1),[§1](https://arxiv.org/html/2606.24967#S1.p3.1),[§3\.7](https://arxiv.org/html/2606.24967#S3.SS7.p1.4),[§5\.1\.3](https://arxiv.org/html/2606.24967#S5.SS1.SSS3.p1.8)\.
- \[3\]G\. B\. Baecher and J\. T\. Christian\(2003\)Reliability and statistics in geotechnical engineering\.John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1)\.
- \[4\]Y\. Chen, Y\. Xu, L\. Wang, and T\. Li\(2023\)Modeling water flow in unsaturated soils through physics\-informed neural network with principled loss function\.Computers and Geotechnics161,pp\. 105546\.External Links:[Document](https://dx.doi.org/10.1016/j.compgeo.2023.105546)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[5\]Y\. Chen, H\. Zhang, J\. Ma, T\. J\. Cui, P\. del Hougne, and L\. Li\(2024\)Semantic–electromagnetic inversion with pretrained multimodal generative model\.Advanced Science11\(42\),pp\. 2406793\.External Links:[Document](https://dx.doi.org/10.1002/advs.202406793)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p6.1)\.
- \[6\]J\. Ching and K\.\-K\. Phoon\(2014\)Transformations and correlations among some clay parameters—the global database\.Canadian Geotechnical Journal51\(6\),pp\. 663–685\.External Links:[Document](https://dx.doi.org/10.1139/cgj-2013-0262)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1)\.
- \[7\]J\. Ching, S\. Wu, and K\.\-K\. Phoon\(2021\)Constructing quasi\-site\-specific multivariate probability distribution using hierarchical Bayesian model\.Journal of Engineering Mechanics147\(10\),pp\. 04021069\.External Links:[Document](https://dx.doi.org/10.1061/%28ASCE%29EM.1943-7889.0001964)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1),[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[8\]M\. A\. Christie and M\. J\. Blunt\(2001\)Tenth SPE comparative solution project: a comparison of upscaling techniques\.SPE Reservoir Evaluation & Engineering4\(4\),pp\. 308–317\.External Links:[Document](https://dx.doi.org/10.2118/72469-PA)Cited by:[§2\.2](https://arxiv.org/html/2606.24967#S2.SS2.p1.3)\.
- \[9\]X\. Diao, M\. R\. Rownak, S\. Olatubosun, P\. K\. Vaddi, and C\. Smidts\(2025\)A multiple\-criteria sensor selection framework based on qualitative physical models\.Advanced Engineering Informatics65,pp\. 103228\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2025.103228)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[10\]H\. Du, Z\. Zhao, H\. Cheng, J\. Yan, and Q\. Z\. He\(2023\)Modeling density\-driven flow in porous media by physics\-informed neural networks for CO2 sequestration\.Computers and Geotechnics159,pp\. 105433\.External Links:[Document](https://dx.doi.org/10.1016/j.compgeo.2023.105433)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[11\]B\. Efron\(1987\)Better bootstrap confidence intervals\.Journal of the American Statistical Association82\(397\),pp\. 171–185\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1987.10478410)Cited by:[§3\.8](https://arxiv.org/html/2606.24967#S3.SS8.p3.7)\.
- \[12\]T\. Enemark, L\. J\. M\. Peeters, D\. Mallants, and O\. Batelaan\(2019\)Hydrogeological conceptual model building and testing: a review\.Journal of Hydrology569,pp\. 310–329\.External Links:[Document](https://dx.doi.org/10.1016/j.jhydrol.2018.12.007)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[13\]T\. Fu, Q\. Hu, J\. Zhao, G\. Jiang, P\. Wang, L\. Shan, Y\. Yao, and Y\. Rong\(2026\)Optimal placement design of pressure and electric sensors for enhanced underwater source sensing\.Advanced Engineering Informatics72,pp\. 104435\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2026.104435)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[14\]I\. J\. Goodfellow, J\. Pouget\-Abadie, M\. Mirza, B\. Xu, D\. Warde\-Farley, S\. Ozair, A\. Courville, and Y\. Bengio\(2014\)Generative adversarial nets\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.27,pp\. 2672–2680\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1),[§3\.6](https://arxiv.org/html/2606.24967#S3.SS6.p1.5)\.
- \[15\]J\.\-A\. Goulet and I\. F\. C\. Smith\(2013\)Structural identification with systematic errors and unknown uncertainty dependencies\.Computers & Structures128,pp\. 251–258\.External Links:[Document](https://dx.doi.org/10.1016/j.compstruc.2013.07.009)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[16\]S\. Holm\(1979\)A simple sequentially rejective multiple test procedure\.Scandinavian Journal of Statistics6\(2\),pp\. 65–70\.Cited by:[§3\.8](https://arxiv.org/html/2606.24967#S3.SS8.p2.3)\.
- \[17\]M\. Jia, Q\. Cheng, C\. Tao, Y\. Hu, Q\. Hong, W\. Cheng, and Z\. Liu\(2025\)A physics\-informed train on synthetic and test on real method for evaluating large language model\-generated safety\-critical traffic scenarios\.Computer\-Aided Civil and Infrastructure Engineering40\(28\),pp\. 5153–5169\.External Links:[Document](https://dx.doi.org/10.1111/mice.70071)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p5.1)\.
- \[18\]P\. K\. Kitanidis\(1997\)Introduction to geostatistics: applications in hydrogeology\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[19\]E\. Laloy, R\. Hérault, D\. Jacques, and N\. Linde\(2018\)Training\-image based geostatistical inversion using a spatial generative adversarial neural network\.Water Resources Research54\(1\),pp\. 381–406\.External Links:[Document](https://dx.doi.org/10.1002/2017WR022148)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1),[§3\.6](https://arxiv.org/html/2606.24967#S3.SS6.p1.5)\.
- \[20\]N\. Linde, P\. Renard, T\. Mukerji, and J\. Caers\(2015\)Geological realism in hydrogeological and geophysical inverse modeling: a review\.Advances in Water Resources86,pp\. 86–101\.External Links:[Document](https://dx.doi.org/10.1016/j.advwatres.2015.09.019)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[21\]D\. V\. Lindley\(1956\)On a measure of the information provided by an experiment\.Annals of Mathematical Statistics27\(4\),pp\. 986–1005\.External Links:[Document](https://dx.doi.org/10.1214/aoms/1177728069)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6),[§6](https://arxiv.org/html/2606.24967#S6.p3.1)\.
- \[22\]H\. Liu, J\. C\. P\. Cheng, V\. J\. L\. Gan, and S\. Zhou\(2022\)A novel data\-driven framework based on BIM and knowledge graph for automatic model auditing and quantity take\-off\.Advanced Engineering Informatics54,pp\. 101757\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2022.101757)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p4.1)\.
- \[23\]S\. Mo, Y\. Zhu, N\. Zabaras, X\. Shi, and J\. Wu\(2019\)Deep convolutional encoder–decoder networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media\.Water Resources Research55\(1\),pp\. 703–728\.External Links:[Document](https://dx.doi.org/10.1029/2018WR023528)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1),[§3\.6](https://arxiv.org/html/2606.24967#S3.SS6.p1.5)\.
- \[24\]D\. S\. Oliver, A\. C\. Reynolds, and N\. Liu\(2008\)Inverse theory for petroleum reservoir characterization and history matching\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p1.1),[§1](https://arxiv.org/html/2606.24967#S1.p6.2)\.
- \[25\]OpenAI\(2023\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p4.1)\.
- \[26\]OpenAI\(2024\)GPT\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§2\.1\.2](https://arxiv.org/html/2606.24967#S2.SS1.SSS2.p1.3)\.
- \[27\]Y\. Otake and Y\. Honjo\(2022\)Challenges in geotechnical design revealed by reliability assessment: Review and future perspectives\.Soils and Foundations62\(3\),pp\. 101129\.External Links:[Document](https://dx.doi.org/10.1016/j.sandf.2022.101129)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1)\.
- \[28\]R\. Pasquier and I\. F\. C\. Smith\(2015\)Robust system identification and model predictions in the presence of systematic uncertainty\.Advanced Engineering Informatics29\(4\),pp\. 1096–1109\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2015.07.007)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[29\]E\. Perez, F\. Strub, H\. de Vries, V\. Dumoulin, and A\. Courville\(2018\)FiLM: visual reasoning with a general conditioning layer\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.32,pp\. 3942–3951\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v32i1.11671)Cited by:[§3\.1](https://arxiv.org/html/2606.24967#S3.SS1.p2.1)\.
- \[30\]K\.\-K\. Phoon, J\. Ching, and T\. Shuku\(2022\)Challenges in data\-driven site characterization\.Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards16\(1\),pp\. 114–126\.External Links:[Document](https://dx.doi.org/10.1080/17499518.2021.1896005)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1),[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[31\]K\.\-K\. Phoon and F\. H\. Kulhawy\(1999\)Characterization of geotechnical variability\.Canadian Geotechnical Journal36\(4\),pp\. 612–624\.External Links:[Document](https://dx.doi.org/10.1139/t99-038)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1)\.
- \[32\]K\.\-K\. Phoon and F\. H\. Kulhawy\(1999\)Evaluation of geotechnical property variability\.Canadian Geotechnical Journal36\(4\),pp\. 625–639\.External Links:[Document](https://dx.doi.org/10.1139/t99-039)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p2.1)\.
- \[33\]Z\. Qian and C\. Shi\(2025\)Large language model\-empowered paradigm for automated geotechnical site planning and geological characterization\.Automation in Construction173,pp\. 106103\.External Links:[Document](https://dx.doi.org/10.1016/j.autcon.2025.106103)Cited by:[§5\.2\.2](https://arxiv.org/html/2606.24967#S5.SS2.SSS2.p1.1)\.
- \[34\]Z\. Qian and C\. Shi\(2026\)Uncertainty\-aware multi\-agent system for automated and real\-time geotechnical analysis in tunneling\.Computer\-Aided Civil and Infrastructure Engineering49,pp\. 100079\.External Links:[Document](https://dx.doi.org/10.1016/j.cacaie.2026.100079)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p5.1)\.
- \[35\]W\. Qin, Y\. Pan, X\. Ye, D\. Yang, Y\. Tan, H\. Ouyang, and H\. Zhong\(2026\)PTC\-diffusion: a partitioned modeling with diffusion\-based uncertainty evaluation for blasting silt displacement method reconstruction\.Advanced Engineering Informatics70,pp\. 104123\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2025.104123)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[36\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever\(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning \(ICML\),Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p4.1)\.
- \[37\]M\. Raissi, P\. Perdikaris, and G\. E\. Karniadakis\(2019\)Physics\-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\.Journal of Computational Physics378,pp\. 686–707\.External Links:[Document](https://dx.doi.org/10.1016/j.jcp.2018.10.045)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[38\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 3982–3992\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1410)Cited by:[§2\.1\.2](https://arxiv.org/html/2606.24967#S2.SS1.SSS2.p1.3)\.
- \[39\]R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer\(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 10684–10695\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p4.1)\.
- \[40\]O\. Ronneberger, P\. Fischer, and T\. Brox\(2015\)U\-Net: convolutional networks for biomedical image segmentation\.InMedical Image Computing and Computer\-Assisted Intervention \(MICCAI\),Lecture Notes in Computer Science, Vol\.9351,pp\. 234–241\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-24574-4%5F28)Cited by:[§3\.1](https://arxiv.org/html/2606.24967#S3.SS1.p1.9)\.
- \[41\]L\. I\. Rudin, S\. Osher, and E\. Fatemi\(1992\)Nonlinear total variation based noise removal algorithms\.Physica D: Nonlinear Phenomena60\(1–4\),pp\. 259–268\.External Links:[Document](https://dx.doi.org/10.1016/0167-2789%2892%2990242-F)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1),[§3\.7](https://arxiv.org/html/2606.24967#S3.SS7.p1.4),[§5\.1\.3](https://arxiv.org/html/2606.24967#S5.SS1.SSS3.p1.8)\.
- \[42\]E\. G\. Ryan, C\. C\. Drovandi, J\. M\. McGree, and A\. N\. Pettitt\(2016\)A review of modern computational algorithms for Bayesian optimal design\.International Statistical Review84\(1\),pp\. 128–154\.External Links:[Document](https://dx.doi.org/10.1111/insr.12107)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6),[§6](https://arxiv.org/html/2606.24967#S6.p3.1)\.
- \[43\]T\. Shan, F\. Zhang, A\. P\. C\. Chan, S\. Zhu, and K\. Li\(2025\)Large language models\-empowered automatic knowledge graph development based on multi\-modal data for building health resilience\.Advanced Engineering Informatics68,pp\. 103655\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2025.103655)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p4.1)\.
- \[44\]A\. M\. Stuart\(2010\)Inverse problems: a Bayesian perspective\.Acta Numerica19,pp\. 451–559\.External Links:[Document](https://dx.doi.org/10.1017/S0962492910000061)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p1.1),[§1](https://arxiv.org/html/2606.24967#S1.p7.4),[§3\.2](https://arxiv.org/html/2606.24967#S3.SS2.p2.10)\.
- \[45\]A\. Tarantola\(2005\)Inverse problem theory and methods for model parameter estimation\.SIAM\.Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p1.1)\.
- \[46\]A\. M\. Tartakovsky, C\. Ortiz Marrero, P\. Perdikaris, G\. D\. Tartakovsky, and D\. Barajas\-Solano\(2020\)Physics\-informed deep neural networks for learning parameters and constitutive relationships in subsurface flow problems\.Water Resources Research56\(5\),pp\. e2019WR026731\.External Links:[Document](https://dx.doi.org/10.1029/2019WR026731)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[47\]S\. Vyas, S\. Cheruku, and V\. R\. Krishnamurthy\(2026\)How does contextual fidelity impact how we think, talk, and act in AI\-assisted engineering design?\.Advanced Engineering Informatics72,pp\. 104456\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2026.104456)Cited by:[§5\.3](https://arxiv.org/html/2606.24967#S5.SS3.p2.1)\.
- \[48\]Q\. Wu, Z\. Xie, Y\. Zhao, M\. Tian, X\. Zhang, and Q\. Qiu\(2026\)Multi\-type and multi\-scale geological three\-dimensional modeling using entity\-relationship networks\.Advanced Engineering Informatics72,pp\. 104436\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2026.104436)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[49\]S\. Wu, Y\. Otake, Y\. Higo, and I\. Yoshida\(2024\)Pathway to a fully data\-driven geotechnics: lessons from materials informatics\.Soils and Foundations64\(3\),pp\. 101471\.External Links:[Document](https://dx.doi.org/10.1016/j.sandf.2024.101471)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p5.1)\.
- \[50\]S\. Wu, Y\. Otake, D\. Mizutani, C\. Liu, K\. Asano, N\. Sato, H\. Baba, Y\. Fukunaga, Y\. Higo, A\. Kamura, S\. Kodama, M\. Metoki, T\. Nakamura, Y\. Nakazato, T\. Saito, A\. Shioi, M\. Takenobu, K\. Tsukioka, and R\. Yoshikawa\(2025\)Future\-proofing geotechnics workflows: accelerating problem\-solving with large language models\.Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards19\(2\),pp\. 307–324\.External Links:[Document](https://dx.doi.org/10.1080/17499518.2024.2381026)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p5.1)\.
- \[51\]I\. Yoshida, Y\. Tomizawa, and Y\. Otake\(2021\)Estimation of trend and random components of conditional random field using Gaussian process regression\.Computers and Geotechnics136,pp\. 104179\.External Links:[Document](https://dx.doi.org/10.1016/j.compgeo.2021.104179)Cited by:[§5\.2\.1](https://arxiv.org/html/2606.24967#S5.SS2.SSS1.p1.6)\.
- \[52\]M\. Yüksekgönül, F\. Bianchi, J\. Boen, S\. Liu, P\. Lu, Z\. Huang, C\. Guestrin, and J\. Zou\(2025\)Optimizing generative AI by backpropagating language model feedback\.Nature639,pp\. 609–616\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-08661-4)Cited by:[§5\.2\.2](https://arxiv.org/html/2606.24967#S5.SS2.SSS2.p1.1),[item 1](https://arxiv.org/html/2606.24967#S6.I1.i1.p1.1)\.
- \[53\]H\. Zhang, Y\. Chen, Z\. Wang, T\. J\. Cui, P\. del Hougne, and L\. Li\(2024\)Semantic regularization of electromagnetic inverse problems\.Nature Communications15,pp\. 3869\.External Links:[Document](https://dx.doi.org/10.1038/s41467-024-48115-5)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p6.1),[§3\.6](https://arxiv.org/html/2606.24967#S3.SS6.p1.5),[§5](https://arxiv.org/html/2606.24967#S5.p1.1)\.
- \[54\]Y\. Zhang, G\. Ma, T\. Qu, Z\. Wang, K\. Xiong, and W\. Zhou\(2026\)MoE\-D2AN: a mixture\-of\-experts surrogate and dual\-branch data assimilation network for spatiotemporal dam deformation modeling\.Advanced Engineering Informatics69,pp\. 104122\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2025.104122)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[55\]H\. Zhou, J\. J\. Gómez\-Hernández, and L\. Li\(2014\)Inverse methods in hydrogeology: evolution and recent trends\.Advances in Water Resources63,pp\. 22–37\.External Links:[Document](https://dx.doi.org/10.1016/j.advwatres.2013.10.014)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[56\]X\. Zhou, B\. Sheil, S\. Suryasentana, and P\. Shi\(2026\)Graph attention neural network for subsurface stratigraphy on spatial and feature level using multiple\-source sparse exploration data\.Advanced Engineering Informatics70,pp\. 104108\.External Links:[Document](https://dx.doi.org/10.1016/j.aei.2025.104108)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.
- \[57\]Y\. Zhu and N\. Zabaras\(2018\)Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification\.Journal of Computational Physics366,pp\. 415–447\.External Links:[Document](https://dx.doi.org/10.1016/j.jcp.2018.04.018)Cited by:[§1](https://arxiv.org/html/2606.24967#S1.p3.1)\.