Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning with Vision-Language Models

arXiv cs.LG Papers

Summary

This paper investigates whether multimodal large language models (MLLMs) can leverage Miller indices as a latent representation to reason about crystallographic fracture geometry from visual inputs, evaluating their ability to infer physically valid plane hypotheses and determine when such representation is applicable across materials like ceramics, glass, metals, and concrete.

arXiv:2605.20416v1 Announce Type: new Abstract: We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D--3D geometric pairs, and real-world fracture images across multiple material classes -- including ceramics, glass, metals, and concrete -- we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled.

Cached at: 05/21/26, 06:25 AM

# Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning with Vision-Language Models
Source: [https://arxiv.org/html/2605.20416](https://arxiv.org/html/2605.20416)
###### Abstract

We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image.

Through extensive experiments spanning synthetic data, controlled 2D–3D geometric pairs, and real\-world fracture images across multiple material classes—including ceramics, glass, metals, and concrete—we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it\. These results suggest that MLLMs can act as physics\-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled\.

## 1 Introduction

Fracture geometry provides a direct visual manifestation of the underlying physical mechanisms governing material failure. In crystalline solids, fracture often occurs via cleavage along crystallographic planes, which are naturally described using Miller indices $(hkl)$ \[[1](https://arxiv.org/html/2605.20416#bib.bib1)\]. These indices encode the orientation of lattice planes and offer a compact, physically interpretable representation that links microscopic crystallographic structure to macroscopic fracture morphology \[[2](https://arxiv.org/html/2605.20416#bib.bib2),[3](https://arxiv.org/html/2605.20416#bib.bib3)\].

However, this representation is inherently limited\. The use of Miller indices assumes that fracture is governed by a single well\-defined crystallographic plane, an assumption that holds primarily in idealized or highly ordered materials\. In many real\-world scenarios—including polycrystalline ceramics, amorphous glass, and heterogeneous composites such as concrete—fracture is instead driven by complex interactions involving microstructural heterogeneity, stress distributions, and multi\-scale effects\[[2](https://arxiv.org/html/2605.20416#bib.bib2)\]\. As a result, the mapping from observed fracture geometry to a single set of Miller indices becomes ambiguous or fundamentally invalid\.

Modern multimodal large language models (MLLMs) have demonstrated strong capabilities in visual reasoning and cross-modal understanding \[[4](https://arxiv.org/html/2605.20416#bib.bib4),[5](https://arxiv.org/html/2605.20416#bib.bib5),[6](https://arxiv.org/html/2605.20416#bib.bib6),[7](https://arxiv.org/html/2605.20416#bib.bib7),[8](https://arxiv.org/html/2605.20416#bib.bib8)\]. These models can interpret visual inputs and generate structured explanations, suggesting the possibility of guiding their reasoning using physically meaningful latent representations. This raises a key question: can MLLMs leverage structured latent variables derived from physics, such as Miller indices, to interpret fracture geometry, and can they determine when such representations are applicable?

In this work, we investigate this question by treating Miller indices as a guided latent variable and evaluating model behavior across a spectrum of fracture regimes\. Rather than framing the task as a direct classification problem, we adopt a latent\-guided reasoning perspective in which the model must both infer a plausible latent structure and assess its validity\. This formulation allows us to examine not only whether the model can identify crystallographic planes in idealized settings, but also whether it can recognize when such representations break down in more complex or realistic scenarios\.

Our results show that MLLMs can successfully perform latent inference in controlled synthetic settings where fracture is governed by a single planar structure\. However, this capability does not generalize to real\-world fracture, where the underlying assumptions of the latent representation are often violated\. Importantly, the model is able to reject such representations when they are not physically applicable\. These findings suggest that the primary capability of MLLMs in this context is not universal prediction of crystallographic structure, but context\-aware reasoning about the validity of structured latent representations\.

## 2 Methodology

### 2.1 Latent-Guided Reasoning Framework

We define a latent variable

$$z \in \mathcal{Z} = \{(h,k,l)\}$$

representing crystallographic plane indices.

Here, $h$, $k$, and $l$ are integers known as Miller indices, which specify the orientation of a plane in a crystal lattice. Intuitively, they describe how a plane intersects the three coordinate axes of the lattice.

More precisely, consider a plane that intersects the $x$-, $y$-, and $z$-axes at distances $(x_0, y_0, z_0)$. The Miller indices are defined as the reciprocals of these intercepts (expressed in lattice units), i.e.,

$$(h,k,l) = \left(\frac{a}{x_0}, \frac{b}{y_0}, \frac{c}{z_0}\right),$$

where $a$, $b$, and $c$ are the lattice constants. The resulting values are scaled to the smallest set of integers.

This definition leads to a simple geometric interpretation:

- $(100)$: the plane intersects the $x$-axis and is parallel to the $y$- and $z$-axes, producing a flat face.
- $(110)$: the plane intersects both the $x$- and $y$-axes, resulting in a tilted planar surface.
- $(111)$: the plane intersects all three axes equally, forming a diagonal plane across the lattice.

These differences in orientation directly determine the shape of the intersection between the plane and a unit cube, which forms the basis of our synthetic data construction\.
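The reciprocal-intercept definition above can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the function name `miller_indices` is hypothetical, intercepts are assumed to be given already in lattice units (that is, as $x_0/a$, $y_0/b$, $z_0/c$), and `math.inf` is assumed to mark an axis the plane never crosses.

```python
from fractions import Fraction
from math import gcd, inf


def miller_indices(intercepts):
    """Return (h, k, l) for axis intercepts given in lattice units.

    Use math.inf for an axis that the plane is parallel to.
    Hypothetical helper; planes through the origin are not handled.
    """
    # Reciprocal of each intercept; a parallel axis contributes 0.
    recips = [Fraction(0) if x == inf else 1 / Fraction(x) for x in intercepts]
    # Clear denominators with their least common multiple.
    lcm = 1
    for r in recips:
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    ints = [int(r * lcm) for r in recips]
    # Divide by the greatest common divisor to get the smallest integers.
    common = 0
    for v in ints:
        common = gcd(common, abs(v))
    return tuple(v // common for v in ints)
```

For example, `miller_indices((1, inf, inf))` gives `(1, 0, 0)` and `miller_indices((1, 1, 0.5))` gives `(1, 1, 2)`. Exact fractions are used so that the scaling to integers does not depend on floating-point rounding.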

Given an image $x$, we consider three related tasks.

First, latent inference seeks to identify the most likely plane consistent with the observed geometry:

$$\hat{z} = \arg\max_{z} p(z \mid x).$$

Second, latent applicability determines whether a Miller-index-based representation is valid:

$$a = \mathbb{I}\bigl(\exists z \text{ such that } x \sim p(x \mid z)\bigr).$$

Finally, consistency reasoning evaluates whether a fragment observation $x_f$ is geometrically compatible with a plane hypothesis $x_p(z)$:

$$y = \mathbb{I}\bigl(x_f \sim x_p(z)\bigr).$$

This formulation highlights a key aspect of the problem: inference is meaningful only when applicability holds, and applicability itself must be inferred from the visual data.
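The dependence of inference on applicability can be made concrete with a small sketch. It assumes, purely for illustration, that each candidate plane comes with a numeric support value in $[0, 1]$ and that a fixed threshold stands in for the applicability indicator $a$. The paper itself extracts decisions qualitatively from model responses, so `infer_plane`, `scores`, and `min_score` are all hypothetical.

```python
def infer_plane(scores, min_score=0.5):
    """Return the best-supported plane, or None when no candidate applies.

    `scores` maps candidate planes (h, k, l) to a support value in [0, 1].
    How such values would be obtained is outside this sketch.
    """
    if not scores:
        return None
    # Latent inference: the argmax over candidate planes.
    best = max(scores, key=scores.get)
    # Applicability: accept only if some hypothesis is sufficiently supported.
    return best if scores[best] >= min_score else None
```

Returning `None` rather than the weakest argmax keeps rejection as an explicit outcome, which mirrors the point that inference is only meaningful once applicability holds.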

We formulate fracture interpretation as a latent\-guided reasoning problem in which crystallographic plane indices serve as structured latent variables\. This perspective is conceptually related to latent variable modeling frameworks in machine learning, where hidden variables capture underlying generative structure\[[9](https://arxiv.org/html/2605.20416#bib.bib9),[10](https://arxiv.org/html/2605.20416#bib.bib10)\]\.

Instead of mapping an input image directly to a label, we introduce an intermediate representation $z = (h,k,l)$ that encodes the orientation of a candidate fracture plane.

In this formulation, the multimodal large language model (MLLM) is asked not only to infer a plausible latent variable, but also to evaluate whether such a representation is applicable. Given an input image $x$, the model first assesses whether the observed geometry exhibits properties consistent with planar fracture, such as flat surfaces, consistent orientation, and geometric regularity. When these conditions are met, the model attempts to associate the observation with a candidate plane index. Otherwise, it rejects the latent representation.

This perspective emphasizes that Miller indices should be interpreted as conditional latent variables whose validity depends on the underlying physical mechanism\. The role of the model is therefore twofold: to infer latent structure when appropriate, and to avoid applying it when the assumptions are violated\.

### 2.2 Synthetic Data Construction and Geometric Representation

To provide a controlled environment for evaluation, we construct a synthetic dataset based on idealized cube–plane intersections. Each plane is defined by

$$ax + by + cz = d,$$

where $(a,b,c)$ corresponds to the direction of the Miller index $(h,k,l)$.

The intersection of a plane with a unit cube produces a polygonal cross\-section whose shape depends on the orientation of the plane\. Representative examples are shown in Figure[1](https://arxiv.org/html/2605.20416#S2.F1)\.

![Refer to caption](https://arxiv.org/html/2605.20416v1/figure1.png)

Figure 1: Representation of index planes in the cubic unit cell.

Planes in the $\{100\}$ family are aligned with the cube faces and therefore produce square or rectangular cross-sections. Planes in the $\{110\}$ family intersect two axes, resulting in skewed quadrilateral shapes. In contrast, planes in the $\{111\}$ family intersect all three axes equally, producing triangular cross-sections.

More generally, as the Miller indices increase or become more asymmetric, the resulting intersection geometry becomes less regular, leading to increasingly distorted and asymmetric polygonal shapes\. To simulate observable fracture patterns, we extract the corresponding 2D polygonal cross\-sections from these cube–plane intersections\. These fragments serve as the primary input to the model, as illustrated in Figure[2](https://arxiv.org/html/2605.20416#S2.F2)\. This representation isolates geometric cues such as planarity, symmetry, and edge structure while avoiding confounding factors present in real\-world images, thereby providing a clean and interpretable mapping between latent variables and observable geometry\.
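As a rough illustration of how such cross-sections can be computed, the sketch below intersects the plane $ax + by + cz = d$ with the twelve edges of the unit cube and returns the resulting polygon vertices. It is not the authors' data generator: the function name `cube_section_vertices` and the choice of offset $d$ are assumptions, and the vertices are returned sorted by coordinate rather than in polygon order.

```python
from itertools import product


def cube_section_vertices(normal, d):
    """Vertices where the plane normal . p = d meets the unit cube's edges."""
    points = set()
    for p in product((0, 1), repeat=3):
        for axis in range(3):
            if p[axis] == 1:
                continue  # visit each edge once, starting from its low end
            q = tuple(1 if i == axis else p[i] for i in range(3))
            # Signed offsets of both edge endpoints from the plane.
            fp = sum(n * c for n, c in zip(normal, p)) - d
            fq = sum(n * c for n, c in zip(normal, q)) - d
            if fp == 0:
                points.add(p)  # plane passes through this corner
            if fq == 0:
                points.add(q)
            if fp * fq < 0:  # plane crosses strictly inside the edge
                t = fp / (fp - fq)
                points.add(tuple(round(a + t * (b - a), 9) for a, b in zip(p, q)))
    return sorted(points)
```

With $d = 0.5$, the normals $(1,0,0)$, $(1,1,0)$, and $(1,1,1)$ yield 4, 4, and 3 vertices respectively. The offset matters as well as the orientation: for the normal $(1,1,1)$ with $d = 1.5$ the same function returns six vertices, so a synthetic generator has to fix or sample $d$ deliberately.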

![Refer to caption](https://arxiv.org/html/2605.20416v1/figure2.png)

Figure 2: Miller index planes.

To further evaluate whether the model can relate observations to latent hypotheses, we construct paired samples consisting of a 2D fragment $x_f$ and a corresponding 3D cube with a highlighted plane $x_p(z)$, with different variations as shown in Figure [3](https://arxiv.org/html/2605.20416#S2.F3). Both consistent and inconsistent pairings are included. In consistent cases, the fragment is generated from the given plane, while in inconsistent cases the fragment and plane are mismatched. This setup enables evaluation of geometric compatibility between observation and latent hypothesis.

![Refer to caption](https://arxiv.org/html/2605.20416v1/Picture3.png)

Figure 3: Latent variations of index planes within the cubic unit cell.

### 2.3 Task Formulation and Inference Protocol

Using the constructed dataset, we define three evaluation tasks\. Latent inference requires the model to identify a plausible plane family given a fragment observation\. Latent applicability requires determining whether a plane\-based representation is meaningful for the given input\. Consistency reasoning evaluates whether a fragment and a plane hypothesis are geometrically compatible\.

These tasks collectively capture both predictive and interpretive aspects of latent\-guided reasoning\. In particular, they explicitly separate the problem of identifying a latent structure from the problem of determining whether such a structure is valid in the first place\.

We use a multimodal large language model as a black-box reasoning system. The model is used in a few-shot setting: it is prompted with images and structured instructions that explicitly reference the latent variable. Prompts are designed to encourage the model to describe geometric properties, assess planarity, relate observations to candidate plane orientations, and determine whether the latent representation is applicable. Responses are analyzed qualitatively to extract decisions for each task.

Prompt example: "You are given a 3D cube with a planar slice. Describe the orientation of the plane using crystallographic plane indices (hkl), or identify the most likely plane family (e.g., {100}, {110}, {111}). Explain your reasoning based on the geometry."

In this way, the model relates $(hkl)$ to the 3D geometry alongside the 2D observation.

Inference is carried out on two kinds of data:

- Synthetic data: (1.1) augmented 2D fragments; (1.2) 3D cubes with augmented 2D planes.
- Real data: fracture images of different materials.
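The paper analyzes responses qualitatively, but a first pass over free-text answers could be automated along the following lines. This is a sketch under stated assumptions: `PROMPT` repeats the example prompt quoted above, while `PLANE_PATTERN` and `extract_planes` are hypothetical helpers that only recognize three single-digit, non-negative indices written as `(hkl)` or `{hkl}`. No model API is called here.

```python
import re

PROMPT = (
    "You are given a 3D cube with a planar slice. Describe the orientation of "
    "the plane using crystallographic plane indices (hkl), or identify the most "
    "likely plane family (e.g., {100}, {110}, {111}). Explain your reasoning "
    "based on the geometry."
)

# Three single-digit indices inside () or {}, such as (111) or {100}.
PLANE_PATTERN = re.compile(r"[({](\d)(\d)(\d)[)}]")


def extract_planes(response):
    """Return every (h, k, l) triple mentioned in a response, in order."""
    return [
        tuple(int(g) for g in match.groups())
        for match in PLANE_PATTERN.finditer(response)
    ]
```

An empty result is informative in its own right, since a response that names no plane is a candidate rejection and can be routed to manual review.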

### 2.4 Scope and Limitations

The synthetic dataset provides a well\-defined mapping between latent variables and geometry but does not capture real\-world fracture mechanisms such as heterogeneity, defects, or plastic deformation\. As a result, it represents an idealized setting in which the latent representation is guaranteed to be valid\.

The purpose of this methodology is therefore not to replicate real\-world fracture directly, but to establish a controlled baseline for evaluating latent inference and consistency reasoning\. The extent to which this latent representation generalizes to real\-world scenarios is addressed separately in the results section\.

## 3 Results and Analysis

We evaluate the proposed latent-guided reasoning framework across a spectrum of fracture scenarios, ranging from idealized synthetic data to complex real-world images. The goal is to assess not only whether the multimodal model can infer the latent crystallographic plane variable $z = (h,k,l)$, but also whether it can determine when such a representation is physically meaningful.

### 3.1 Latent Inference in Idealized Synthetic Geometry

We begin with controlled synthetic examples where fracture is explicitly governed by a single planar cut through a cube. These canonical configurations are illustrated in Figure 1, which shows representative plane families including $\{100\}$, $\{110\}$, and $\{111\}$, along with selected higher-index planes.

The corresponding 2D fragment geometries are shown in Figure 2, where distinct shapes emerge from different plane orientations. Face-aligned planes in the $\{100\}$ family produce square or rectangular fragments, edge-aligned planes in the $\{110\}$ family produce skew quadrilateral shapes, and diagonal planes in the $\{111\}$ family generate triangular fragments. These mappings provide a clear and physically grounded relationship between the latent variable and observable geometry.

When presented with these synthetic inputs, the model consistently identifies the correct latent plane family. For example, square fragments are associated with $\{100\}$-type planes, while triangular fragments are associated with $\{111\}$-type planes. This demonstrates successful latent inference in a regime where the underlying physics supports a single-plane interpretation.

Further validation is provided in Figure 3, which presents paired 2D–3D examples. Each pair consists of a fragment image and a cube visualization with a highlighted plane. In these experiments, the model correctly determines whether the fragment geometry is consistent with the proposed plane hypothesis. For instance, triangular fragments are judged consistent with $(111)$-type planes and inconsistent with $(100)$-type planes, while square fragments exhibit the opposite behavior. This indicates that the model is not merely recognizing shapes, but is performing cross-representation reasoning, aligning 2D observations with 3D latent structure.

### 3.2 Higher-Index Planes and Fine-Grained Latent Structure

We extend the synthetic dataset to include higher-index planes, as shown in Figure [4](https://arxiv.org/html/2605.20416#S3.F4), where planes such as $(112)$ and $(102)$ produce asymmetric fragment geometries. These cases introduce finer distinctions in the latent space, as the plane intersects axes with unequal ratios.

![Refer to caption](https://arxiv.org/html/2605.20416v1/Picture4.png)

Figure 4: Two fracture planes and one with a higher index.

In these examples, the model often correctly identifies qualitative properties of the plane, such as whether it intersects one, two, or three axes. However, it frequently fails to distinguish precise index values, particularly when the differences correspond to subtle variations in intercept ratios. For example, fragments corresponding to $(112)$ and $(102)$ are often described as “non-symmetric” or “skewed,” but the exact index values are not reliably recovered.

This behavior suggests that the model captures a coarse approximation of the latent space, distinguishing between major plane families, but has limited resolution for fine\-grained crystallographic distinctions\.

### 3.3 Consistency Reasoning and Negative Examples

To evaluate whether the model uses the latent variable as a structured hypothesis rather than a classification label, we construct explicit consistency and inconsistency cases. These include both positive pairings, where fragment geometry matches the plane orientation, and negative pairings, where the two are incompatible. In consistent cases, such as a square fragment paired with a $(100)$ plane, the model affirms compatibility and provides explanations based on planarity and symmetry. In inconsistent cases, such as a triangular fragment paired with a face-aligned plane, the model correctly identifies the mismatch, often referencing the number of edges and the implied orientation of the fracture surface.

The ability to correctly reject inconsistent pairings is particularly important, as it demonstrates that the model is evaluating the compatibility between observation and latent hypothesis, rather than assigning labels independently\.

### 3.4 Multi-Plane Fracture in Polycrystalline Materials

We next consider fracture patterns that exhibit multiple planar facets, as shown in Figure 5(b), which includes ceramic-like fracture images. These images contain fragments with flat faces, but the orientations vary significantly across the image.

In this regime, the model does not assign a single Miller index. Instead, it describes the fracture as involving multiple planar surfaces or multiple cleavage directions. This corresponds to a generative model of the form

$$x \sim \sum_{i} p(x \mid z_i), \quad i > 1,$$

where each fragment is associated with a different latent plane.

Importantly, the model’s responses reflect an understanding that Miller indices may be applicable at a local level, but not globally across the entire image\. This aligns with the physics of polycrystalline fracture, where different grains may fracture along different crystallographic planes\.

### 3.5 Amorphous Fracture: Absence of Latent Structure

We then analyze fracture patterns in amorphous materials, such as glass, shown in Figure[5](https://arxiv.org/html/2605.20416#S3.F5)\. These images exhibit smooth, curved fracture surfaces characteristic of conchoidal fracture\.

![Refer to caption](https://arxiv.org/html/2605.20416v1/fracture_glass_ceramic.png)

Figure 5: Fractures of (a) glass and (b) ceramic.

In these cases, the model consistently rejects the applicability of Miller indices. It correctly identifies that the surfaces are not planar and that the material lacks a crystal lattice, making a crystallographic interpretation invalid. This corresponds to the regime

$$x \notin \bigcup_{z} p(x \mid z),$$

where no valid latent representation exists.

The model’s ability to reject the latent hypothesis in this setting is crucial, as it prevents incorrect overextension of the representation\.

### 3.6 Heterogeneous Composite Fracture: Concrete

We further evaluate the model on fracture images of concrete, shown in Figure[6](https://arxiv.org/html/2605.20416#S3.F6)\. These images display highly irregular fragments, rough surfaces, and visible aggregates, reflecting the heterogeneous nature of the material\.

![Refer to caption](https://arxiv.org/html/2605.20416v1/fracture_concrete.png)

Figure 6: Fracture of concrete objects at various length scales.

In these cases, the model attributes the fracture pattern to material heterogeneity and the presence of multiple interacting mechanisms. It explicitly notes the absence of planar facets and the lack of consistent orientation, correctly concluding that Miller indices are not applicable.

This behavior is consistent with the physical properties of concrete, which is a composite material rather than a crystalline solid\.

### 3.7 Ductile Fracture and Plastic Deformation

Finally, we consider ductile fracture examples, shown in Figure[7](https://arxiv.org/html/2605.20416#S3.F7), where metal specimens exhibit necking and irregular fracture surfaces\. These images are characterized by fibrous morphology and significant plastic deformation\.

![Refer to caption](https://arxiv.org/html/2605.20416v1/metal_ductile_fracture.png)

Figure 7: Metal ductile fracture.

The model consistently identifies these features and rejects any crystallographic interpretation. The absence of planar cleavage surfaces is correctly recognized as a key indicator that Miller indices are not applicable.

### 3.8 Unified Interpretation Across Regimes

The experimental results reveal a consistent pattern in model behavior across different fracture scenarios, which can be understood in terms of three distinct regimes\. These regimes correspond to whether the underlying fracture geometry can be explained by a single plane, multiple planes, or no planar structure at all\.

In idealized synthetic cases, where fracture is governed by a single planar intersection, the model is able to infer a consistent latent variable $z = (h,k,l)$. This corresponds to a regime in which the observation is well described by a single latent hypothesis, and the model performs accurate latent inference.

In polycrystalline scenarios, where fracture surfaces consist of multiple planar facets with different orientations, the model does not assign a single global plane\. Instead, it identifies the presence of multiple local planar structures, reflecting a mixture of latent variables\. In this case, the observed geometry can be understood as arising from a superposition of multiple plane hypotheses\.

In contrast, for amorphous and heterogeneous materials, such as glass and concrete, the model consistently rejects the use of a crystallographic plane representation\. These fracture patterns lack planar structure and are governed by mechanisms that do not correspond to any Miller\-index\-based description\.

These behaviors can be summarized formally as:

$$\text{Mode}(x) = \begin{cases} \text{Inference}, & x \sim p(x \mid z) \\ \text{Partial}, & x \sim \sum_{i} p(x \mid z_i) \\ \text{Rejection}, & x \notin \bigcup_{z} p(x \mid z) \end{cases}$$

which correspond respectively to single-plane fracture, multi-plane fracture, and non-planar fracture.
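Read operationally, the three cases depend only on how many distinct plane hypotheses an image supports. The sketch below encodes that reading. The function name `reasoning_mode` is hypothetical, and the list of supported planes is assumed to be available already, for instance from annotated model responses.

```python
def reasoning_mode(planes):
    """Map the plane hypotheses supported by an image to a regime label."""
    distinct = len(set(planes))
    if distinct == 0:
        return "Rejection"  # no plane explains the observation
    if distinct == 1:
        return "Inference"  # a single latent plane z
    return "Partial"        # several local planes z_i
```

For example, `reasoning_mode([])` returns `"Rejection"` and `reasoning_mode([(1, 0, 0), (1, 1, 1)])` returns `"Partial"`.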

This progression is illustrated in Figure[8](https://arxiv.org/html/2605.20416#S3.F8), which presents representative examples from each regime\. A key observation is that model behavior is not determined by visual complexity, but by the validity of the latent representation\. When the underlying physics supports a plane\-based interpretation, the model successfully applies the latent structure\. When it does not, the correct behavior is to reject the latent hypothesis\.

![Refer to caption](https://arxiv.org/html/2605.20416v1/Picture10.png)

Figure 8: Representative examples of the different regimes.

This suggests that the primary capability of the model is not universal prediction of Miller indices, but rather context-aware application of structured latent representations. In this sense, correct rejection is as important as correct inference, and both should be considered in evaluating multimodal reasoning systems.

## 4 Discussion

The experimental results reveal a clear and consistent pattern: the effectiveness of Miller indices as a latent representation is strongly dependent on the underlying physical regime. In idealized synthetic settings, where fracture is explicitly constructed as a single planar intersection, the mapping between the latent variable $z = (h,k,l)$ and observed geometry is well-defined. In this regime, the multimodal model is able to infer latent plane families and perform consistency reasoning with high reliability.

However, this behavior does not generalize to real-world fracture scenarios. In polycrystalline materials, fracture surfaces arise from multiple local planes with varying orientations, making any single global Miller index insufficient to describe the observed geometry. In amorphous materials such as glass, as well as heterogeneous composites such as concrete, fracture is governed by mechanisms that are not related to crystallographic planes at all. In these cases, the assumption that fracture can be represented by a single $(h,k,l)$ plane is fundamentally invalid.

As a result, the apparent “failure” of the model in real\-world examples is not due to a limitation of the model itself, but rather reflects the breakdown of the latent representation\. The Miller\-index framework does not align with the dominant physics of macroscopic fracture in most practical materials\. The model’s ability to reject such interpretations should therefore be viewed as correct behavior rather than an error\.

This leads to an important reframing: Miller indices should not be treated as universally applicable latent variables, but rather as conditional representations that are valid only within a narrow regime of plane\-dominated fracture\. The primary capability of the model is not to universally predict latent structure, but to evaluate whether such a structure is appropriate for a given input\.

## 5 Conclusion and Future Work

We investigated the use of Miller indices as a structured latent representation for multimodal reasoning about fracture geometry\. Our results show that this representation is effective in idealized synthetic settings where fracture is governed by a single planar surface\. In these cases, multimodal models can successfully map visual observations to latent plane hypotheses and perform consistency reasoning across 2D and 3D representations\.

However, this framework does not extend to real\-world macroscopic fracture\. In polycrystalline, amorphous, and heterogeneous materials, fracture geometry is not governed by a single crystallographic plane, and therefore cannot be meaningfully described using Miller indices\. The mismatch between the latent representation and the underlying physics fundamentally limits the applicability of this approach\.

Consequently, the primary contribution of this work is not a general method for predicting crystallographic planes from arbitrary fracture images, but rather a characterization of the boundary of validity of such a representation\. Multimodal models can both infer latent structure in idealized regimes and correctly reject it when it is not physically applicable\.

This highlights a broader principle: structured latent representations in multimodal reasoning must be evaluated not only by their predictive capability, but also by their alignment with the underlying physical mechanisms\.

Future work should focus on improving the fidelity of synthetic data and the rigor of evaluation\. In particular, exact geometric simulation using analytical plane slicing could provide a more precise mapping between plane indices and observable geometry, enabling finer distinctions within the latent space\. Quantitative evaluation frameworks should also be developed to measure both latent inference and latent rejection\.
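One simple form such a quantitative framework might take is a pair of metrics: accuracy on images where a plane is defined, and rejection rate on images where none is. The sketch below is only an illustration of that idea and is not an evaluation used in this paper. The function `evaluate` and its record format are hypothetical.

```python
def evaluate(records):
    """Compute inference accuracy and rejection rate.

    Each record is (true_plane, predicted_plane), where None stands for
    "no Miller-index description applies".
    """
    valid = [(t, p) for t, p in records if t is not None]
    invalid = [(t, p) for t, p in records if t is None]
    # Accuracy over images that do have a ground-truth plane.
    accuracy = sum(t == p for t, p in valid) / len(valid) if valid else None
    # Fraction of non-applicable images that the model correctly rejects.
    rejection = sum(p is None for _, p in invalid) / len(invalid) if invalid else None
    return {"inference_accuracy": accuracy, "rejection_rate": rejection}
```

Reporting the two numbers separately avoids rewarding a system that always predicts a plane or one that always rejects.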

More importantly, future research should explore alternative latent representations that better align with real\-world fracture physics\. Instead of relying on crystallographic plane indices, which are inherently limited to idealized scenarios, it may be more appropriate to consider representations based on crack propagation dynamics, stress fields, or statistical fracture patterns\.

Finally, this work suggests that multimodal reasoning systems should incorporate latent representation selection, rather than assuming a fixed latent structure\. Enabling models to determine which representation is appropriate—or whether none is applicable—may lead to more robust and physically grounded reasoning across a wide range of domains\.

## References

- \[1\] B. D. Cullity and S. R. Stock (2001). Elements of X-Ray Diffraction (3rd ed.). Prentice Hall.
- \[2\] T. L. Anderson (2017). Fracture Mechanics: Fundamentals and Applications (4th ed.). CRC Press.
- \[3\] W. D. Callister and D. G. Rethwisch (2018). Materials Science and Engineering: An Introduction (10th ed.). Wiley.
- \[4\] A. Radford, J. W. Kim, C. Hallacy, et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML).
- \[5\] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023). Visual Instruction Tuning. Advances in Neural Information Processing Systems (NeurIPS).
- \[6\] J. Yang, L. Gao, K. Li, et al. (2023). MM-ReAct: Prompting ChatGPT for Multimodal Reasoning and Action. International Conference on Machine Learning (ICML).
- \[7\] OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
- \[8\] Google DeepMind (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805.
- \[9\] D. P. Kingma and M. Welling (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).
- \[10\] I. Higgins, L. Matthey, A. Pal, et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations (ICLR).
