A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

arXiv cs.LG Papers

Summary

This paper presents a validated VLM-judge protocol for evaluating single-image-to-3D mesh quality, showing that cheap proxies like render-CLIP and geometry statistics fail to reliably track perceived quality.

arXiv:2606.18451v1 Announce Type: new Abstract: Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.


# A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)
Source: [https://arxiv.org/html/2606.18451](https://arxiv.org/html/2606.18451)
Tony Salomone (Transformer Lab), Deep Gandhi (Transformer Lab)

Corresponding author: deep@lab.cloud

###### Abstract

Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible *VLM-judge evaluation protocol*: a fixed 24-view headless render rig, *two independent* vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree *substantially* with each other (Cohen’s $\kappa = 0.66$), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do *not* substitute for it. Geometry validity is only a *weak* signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is *at chance*. A learned Bradley–Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also *bimodal*: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator *under the conditions tested* (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.

## 1 Introduction

Single-image-to-3D generators now produce plausible textured meshes from one photo, but evaluating them remains ad hoc. Cherry-picked demos look strong; quantitative comparison defaults to render-space CLIP similarity or to mesh geometry-validity statistics (watertightness, manifoldness, normal consistency), because both are cheap and need no human labels. Whether these proxies actually track *perceived* mesh quality, the thing a downstream user, ranker, or training signal cares about, has not been carefully measured.

We take the position that the right reference for “is this mesh good?” is a *vision-language-model judge* applied to a fixed multi-view render of the asset, and that the central engineering questions are how to make such a judge *reliable* and how to test whether cheaper proxies can stand in for it. We build a protocol around two independent open VLM judge families (an oracle and a separate validation judge) on a fixed 24-view headless render rig, with a position-bias correction that turns out to be essential. We then use the validated judge as the reference against which the cheap proxies (geometry validity, render-CLIP, their composite, and a learned combination) are scored on strictly held-out objects.

Our finding is that the protocol is reliable (substantial cross-model agreement) while the proxies are weak, and that the proxies fail in a specifically misleading way: they are significantly above chance only on contrasts where the geometric defect is *visible* in the render (telling two generators apart, or a clearly-holed mesh from an intact one) and fall to chance on the more ambiguous quality calls that matter for ranking or optimizing a single model.

#### Contributions\.

1. A validated VLM-judge evaluation protocol for single-image-to-3D mesh quality (fixed render rig, two independent judge families, and a swap-and-keep-consistent position-bias correction), whose judge families agree at 0.83 (Cohen’s $\kappa = 0.66$, substantial; chance floor 0.51), well above a trivial-agreement baseline (Table [1](https://arxiv.org/html/2606.18451#S5.T1)).
2. Evidence that position bias is large and must be corrected: about 26% of raw verdicts flip with presentation order (and this fraction is itself sample-dependent); as one illustrative baseline, correcting it moved an agreement estimate from 0.333 to 0.714 (§[3](https://arxiv.org/html/2606.18451#S3)).
3. A demonstration that cheap proxies do not substitute for the judge: geometry validity is significantly above chance but *weak* (0.62, \[0.55, 0.69\]) and below our target; render-CLIP is at chance (0.48); and a learned head collapses onto a single manifoldness statistic, so learning the weights buys nothing (Table [1](https://arxiv.org/html/2606.18451#S5.T1)).
4. A subgroup/error analysis showing the proxy is bimodal: it is significantly above chance on visible-defect contrasts (cross-generator 0.91, within-TripoSR 0.80) but at chance on ambiguous ones (cross-generator-mixed 0.53; $z = 5.06$ for the gap), which we interpret as geometry validity tracking the judge only when the defect is visually salient (§[5](https://arxiv.org/html/2606.18451#S5), §[6](https://arxiv.org/html/2606.18451#S6)).

## 2 Related Work

#### Single\-image and multi\-view 3D generation\.

Fast feed-forward single-image-to-3D pipelines such as Unique3D \[[1](https://arxiv.org/html/2606.18451#bib.bib1)\] and asset pipelines like Meta 3D Gen \[[2](https://arxiv.org/html/2606.18451#bib.bib2)\] have made per-object generation cheap enough to evaluate at scale, and studies of 3D representations \[[3](https://arxiv.org/html/2606.18451#bib.bib3)\] and photorealistic generation \[[4](https://arxiv.org/html/2606.18451#bib.bib4)\] highlight how differently generators behave. We treat such generators as black boxes and focus on *how to judge* their outputs.

#### Evaluating 3D generation\.

Benchmarks for multi-view generation \[[5](https://arxiv.org/html/2606.18451#bib.bib5)\] and rethinking of point-cloud generation metrics \[[6](https://arxiv.org/html/2606.18451#bib.bib6)\] both note that standard automatic metrics correlate weakly with perceived quality; curated-quality datasets \[[7](https://arxiv.org/html/2606.18451#bib.bib7)\] and large object corpora \[[8](https://arxiv.org/html/2606.18451#bib.bib8)\] provide inputs but not a quality oracle. Empirical studies of memorization in 3D shape generation \[[9](https://arxiv.org/html/2606.18451#bib.bib9)\] and robust shape generation \[[10](https://arxiv.org/html/2606.18451#bib.bib10)\] further motivate evaluation that looks at the rendered asset rather than at a single scalar. Our protocol contributes a reusable, human-free reference and a direct measurement of how far the cheap proxies fall short of it.

#### VLM\-as\-judge\.

Using vision-language models as judges is now common for image generation \[[11](https://arxiv.org/html/2606.18451#bib.bib11)\], but their reliability is contested: Kumar et al. \[[12](https://arxiv.org/html/2606.18451#bib.bib12)\] report that even frontier VLMs are imperfect judges on grounded multimodal tasks, and presentation-order (position) bias is a known failure mode of LLM/VLM judges \[[13](https://arxiv.org/html/2606.18451#bib.bib13),[14](https://arxiv.org/html/2606.18451#bib.bib14)\]. Our contribution is not these techniques in the abstract but a *validated* instantiation for the 3D-render setting: cross-model agreement between *two different* judge families together with swap-consistency position-bias correction, quantified by Cohen’s $\kappa$ against a chance-agreement baseline, and a measurement of how far cheap proxies fall short of it.

#### Aligning 3D generators with rewards\.

A growing line uses reward or preference signals to align 3D generators: simulation feedback for physical soundness \[[15](https://arxiv.org/html/2606.18451#bib.bib15)\], 2D-reward diffusion alignment \[[16](https://arxiv.org/html/2606.18451#bib.bib16)\], and direct preference optimization from human preferences \[[17](https://arxiv.org/html/2606.18451#bib.bib17)\], with related preference-optimization \[[18](https://arxiv.org/html/2606.18451#bib.bib18)\] and parameter-efficient adaptation \[[19](https://arxiv.org/html/2606.18451#bib.bib19)\] machinery. Our results bear directly on this line: we find that the cheap automatic proxies one might use as such a reward are a weak quality signal (geometry significantly above chance but well below target; render-CLIP at chance), so reward-based specialization should be driven by the (de-biased) VLM-judge preferences themselves rather than by a proxy. We treat generator specialization as out of scope and future work.

## 3 Method

#### The evaluation protocol\.

Given a generated mesh, we normalize it to a unit bounding box and render a fixed 24-view turntable rig with a headless offscreen rasterizer. Quality is a *pairwise* judgment between two meshes generated from the same input image. We use *two different* open VLM families: an oracle judge $X$ (Qwen2.5-VL-7B-Instruct \[[20](https://arxiv.org/html/2606.18451#bib.bib20)\]) and an independent validation judge $Y$ (InternVL3-8B \[[21](https://arxiv.org/html/2606.18451#bib.bib21)\]). Keeping $X \neq Y$ lets us report cross-model agreement as a reliability check rather than trusting a single model.
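
As an illustration of the geometric part of this setup, the following is a minimal sketch of bounding-box normalization and a 24-view turntable camera layout. It is not the rig used in the paper: reading "unit bounding box" as "longest side equals 1", and the names `normalize_to_unit_bbox`, `turntable_cameras`, `radius`, and `elevation_deg` (with their default values), are assumptions made for the example; the rasterizer itself is omitted.

```python
import math

def normalize_to_unit_bbox(vertices):
    """Center the vertices and scale them so the longest bounding-box side is 1.

    Assumption: "unit bounding box" is read as longest side == 1.
    """
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    extent = max(hi - lo for lo, hi in zip(mins, maxs))
    scale = 1.0 / extent if extent > 0 else 1.0
    return [tuple((v[i] - center[i]) * scale for i in range(3)) for v in vertices]

def turntable_cameras(n_views=24, radius=2.0, elevation_deg=20.0):
    """Camera positions evenly spaced in azimuth around the origin (y is up).

    radius and elevation_deg are hypothetical values, not taken from the paper.
    """
    elev = math.radians(elevation_deg)
    cameras = []
    for k in range(n_views):
        azim = 2.0 * math.pi * k / n_views
        cameras.append((
            radius * math.cos(elev) * math.cos(azim),
            radius * math.sin(elev),
            radius * math.cos(elev) * math.sin(azim),
        ))
    return cameras
```

Because both functions are deterministic, two meshes generated from the same input image are always seen from the same viewpoints, which is what makes the pairwise comparison fair.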

#### Position\-bias correction\.

VLM judges exhibit a presentation-order bias. For each pair we query the judge in *both* orders ($A,B$ and $B,A$) and keep the verdict only if it is consistent across the swap; order-dependent verdicts are discarded as position-biased. This is not optional: about 26% of raw verdicts are inconsistent (a fraction that is itself sample-dependent, 0.58–0.63 consistent at smaller $N$); as one illustrative baseline, the uncorrected agreement reads 0.333 versus 0.714 after correction.
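
The swap-and-keep-consistent rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `judge` callable, its `"left"`/`"right"` return convention, and the function names are hypothetical stand-ins for a real VLM query.

```python
def debiased_verdict(judge, a, b):
    """Query the judge in both presentation orders.

    judge(left, right) is a hypothetical callable returning "left" or "right".
    Returns the winning candidate, or None if the verdict flips with order.
    """
    winner_ab = a if judge(a, b) == "left" else b   # order (A, B)
    winner_ba = b if judge(b, a) == "left" else a   # order (B, A)
    return winner_ab if winner_ab == winner_ba else None

def keep_consistent(judge, pairs):
    """Return (kept verdicts, fraction of pairs that survived the swap check)."""
    verdicts = [debiased_verdict(judge, a, b) for a, b in pairs]
    kept = [v for v in verdicts if v is not None]
    return kept, len(kept) / len(pairs)
```

A judge that always answers "left" regardless of content is discarded on every pair under this rule, whereas a judge that tracks the content returns the same winner in both orders.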

#### Proxy rewards under test\.

We score the same renders/meshes with the cheap proxies a practitioner might use. For a mesh we extract five features: watertightness, manifoldness, non-self-intersection, normal consistency, and render-CLIP similarity (CLIP \[[22](https://arxiv.org/html/2606.18451#bib.bib22)\], open_clip ViT-B-32 \[[23](https://arxiv.org/html/2606.18451#bib.bib23)\]). From these we form: (i) a geometry-only score, (ii) a render-CLIP-only score, (iii) a fixed-weight composite, and (iv) a *learned* pairwise Bradley–Terry head \[[24](https://arxiv.org/html/2606.18451#bib.bib24)\] fit on judge-$X$ labels:

$$P(a \succ b) = \sigma\big(w^{\top}(\phi(a) - \phi(b))\big), \tag{1}$$

where $\phi(\cdot)$ is the five-feature vector and $w$ is learned by logistic regression \[[25](https://arxiv.org/html/2606.18451#bib.bib25)\]. All proxies are evaluated against the independent judge $Y$.
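
To make Eq. (1) concrete, here is a minimal sketch of fitting $w$ by logistic regression on feature differences, using plain gradient ascent. It is an illustration under assumptions, not the paper's training code: the optimizer, the hyperparameters `lr`, `epochs`, and `l2`, and the function names are hypothetical. Note that the model has no intercept, which keeps it antisymmetric: $P(a \succ b) = 1 - P(b \succ a)$.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_bradley_terry(pairs, labels, lr=0.5, epochs=500, l2=1e-3):
    """Fit w in P(a > b) = sigmoid(w . (phi(a) - phi(b))).

    pairs:  list of (phi_a, phi_b) feature vectors.
    labels: 1 if the judge preferred a, else 0.
    lr, epochs, l2 are hypothetical hyperparameters.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    n = len(pairs)
    for _ in range(epochs):
        grad = [0.0] * dim
        for (phi_a, phi_b), y in zip(pairs, labels):
            diff = [fa - fb for fa, fb in zip(phi_a, phi_b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for i in range(dim):
                grad[i] += (y - p) * diff[i]   # log-likelihood gradient
        w = [wi + lr * (gi / n - l2 * wi) for wi, gi in zip(w, grad)]
    return w

def prefer_prob(w, phi_a, phi_b):
    """Model probability that a is preferred over b."""
    return sigmoid(sum(wi * (fa - fb) for wi, fa, fb in zip(w, phi_a, phi_b)))
```

Inspecting the fitted `w` is how one would see a head concentrate its weight on a single feature.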

## 4 Experimental Setup

#### Data and generators\.

Inputs are single-view photos from Google Scanned Objects \[[26](https://arxiv.org/html/2606.18451#bib.bib26)\] (public, CC-BY 4.0; 3,069 object directories available). Per object we form four candidates spanning a quality spectrum: meshes from two single-image-to-3D generators run as black boxes (Stable Fast 3D \[[27](https://arxiv.org/html/2606.18451#bib.bib27)\] and TripoSR \[[28](https://arxiv.org/html/2606.18451#bib.bib28)\]), plus a face-dropped degraded variant of each. We sample $N = 60$ objects for the main evaluation.
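
The interaction between a face-drop degradation and edge-based geometry statistics can be illustrated with a small sketch. Everything here is an assumption for the example: the paper does not specify its drop fraction or its exact validity statistics, `is_watertight` checks only the edge-sharing condition (every edge used by exactly two faces), and `closed_edge_fraction` is a hypothetical stand-in for a manifoldness-style score.

```python
import random
from collections import Counter

def edge_face_counts(faces):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for i, j, k in faces:
        for u, v in ((i, j), (j, k), (k, i)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def is_watertight(faces):
    """Simplified check: every edge is shared by exactly two faces."""
    counts = edge_face_counts(faces)
    return bool(counts) and all(c == 2 for c in counts.values())

def closed_edge_fraction(faces):
    """Fraction of edges shared by exactly two faces (hypothetical score)."""
    counts = edge_face_counts(faces)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 2) / len(counts)

def drop_faces(faces, drop_frac, seed=0):
    """Face-drop degradation: keep a random subset of faces (seeded)."""
    rng = random.Random(seed)
    keep = max(1, round(len(faces) * (1.0 - drop_frac)))
    return rng.sample(faces, keep)

# A closed tetrahedron and a degraded copy with one face removed.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
degraded = drop_faces(tetra, drop_frac=0.25)
```

Removing a face from a closed surface always creates boundary edges, so a statistic of this kind penalizes the degraded mesh by construction, whether or not the hole is visible in a render.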

#### Splits and determinism\.

Splits are strictly by object so no object straddles train and test, and the oracle judge $X$ is never the validation judge $Y$. The render rig and candidate construction are deterministic given a fixed seed, and both judges use greedy (argmax) decoding. The full corpus is 262 position-consistent pairs at $N = 60$. The rule-based signals (geometry, render-CLIP, the fixed composite) are *not* fit to any data, so we evaluate them descriptively over the full corpus; the strict by-object train/test split matters only for the *learned* head, which is trained on train-object pairs and compared on the held-out test-object pairs (98 pairs) against judge $Y$.
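
A by-object split of pairwise data can be sketched as below. The `object_id` field, the `test_frac` value, and the function name are hypothetical; the paper does not state its split ratio.

```python
import random

def split_pairs_by_object(pairs, test_frac=0.4, seed=0):
    """Split pairs so that no object contributes to both train and test.

    pairs: list of dicts, each with an "object_id" key (hypothetical schema).
    test_frac: hypothetical fraction of objects held out.
    """
    object_ids = sorted({p["object_id"] for p in pairs})
    rng = random.Random(seed)          # fixed seed keeps the split deterministic
    rng.shuffle(object_ids)
    n_test = max(1, round(len(object_ids) * test_frac))
    test_ids = set(object_ids[:n_test])
    train = [p for p in pairs if p["object_id"] not in test_ids]
    test = [p for p in pairs if p["object_id"] in test_ids]
    return train, test
```

Splitting on objects rather than on pairs matters because the six pairs from one object share the same input image and candidates; a pair-level split would leak those into both sides.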

#### Statistical analysis\.

We report proportions with Wilson 95% confidence intervals \[[29](https://arxiv.org/html/2606.18451#bib.bib29)\] and test agreement against the chance rate 0.5 with a two-sided binomial test. Because the unit of observation (a pair) is clustered within objects (four candidates, hence six pairs, per object), the effective sample is the 60 objects rather than the 98–262 pairs; our headline intervals therefore use a cluster bootstrap \[[30](https://arxiv.org/html/2606.18451#bib.bib30)\] that resamples *objects*. Judge–judge agreement is additionally summarized by Cohen’s $\kappa$ \[[31](https://arxiv.org/html/2606.18451#bib.bib31)\] relative to the marginal-agreement baseline. The cluster bootstrap is used for the headline geometry and agreement intervals; the per-subgroup and CLIP CIs in Tables [1](https://arxiv.org/html/2606.18451#S5.T1)–[2](https://arxiv.org/html/2606.18451#S5.T2) are pair-level Wilson intervals, which are anti-conservative under clustering and so should be read together with the exploratory caveat below. The four subgroup cells ($n = 43/50/119/50$) are *exploratory*; we make a large family of accuracy-vs-chance comparisons and do not apply a formal multiple-comparison correction, so single-cell directional readings should be treated as hypothesis-generating.
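
The two interval types can be sketched as follows. The Wilson formula is standard; the cluster bootstrap shown is a plain percentile bootstrap over objects, and `n_boot`, the percentile method, and the `by_object` data layout are assumptions, since the paper does not give these details.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def cluster_bootstrap_ci(by_object, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap that resamples whole objects, not pairs.

    by_object: dict mapping object id -> non-empty list of 0/1 outcomes
    (hypothetical layout). n_boot is a hypothetical setting.
    """
    rng = random.Random(seed)
    ids = list(by_object)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ids) for _ in ids]   # objects, with replacement
        hits = sum(sum(by_object[i]) for i in sample)
        total = sum(len(by_object[i]) for i in sample)
        stats.append(hits / total)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

When outcomes within an object are correlated, resampling objects typically gives a wider interval than the pair-level Wilson interval, which is the sense in which the latter is anti-conservative.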

#### Compute\.

The study used $\approx 3.4$ H100-hours across 14 completed jobs (one further run discarded) on a single H100-class GPU (predominantly H100-SXM5).

## 5 Results

#### The judge protocol is reliable\.

The two independent judge families agree with each other on 0.83 of 120 dual-labeled pairs (Wilson 95% CI \[0.76, 0.89\]). Because a forced-choice verdict has a 0.5 chance rate (and the marginal agreement floor here is 0.51), we summarize agreement by Cohen’s $\kappa = 0.66$, “substantial” on the usual scale \[[32](https://arxiv.org/html/2606.18451#bib.bib32)\], confirming the agreement is not a trivial artifact of skewed marginals. Agreement is above chance in every subgroup, and higher on contrasts with a clear quality difference than on ambiguous ones (0.72 on cross-generator-mixed vs. 0.95–0.97 on the clearest cells; the 0.72 cell is above chance but not, on its own $n$, distinguishable from the others) (Table [2](https://arxiv.org/html/2606.18451#S5.T2), Fig. [1](https://arxiv.org/html/2606.18451#S5.F1)). We did not run the optional human spot-check, so reliability here means cross-model consistency, not validated agreement with human raters.
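
For readers who want to see how the raw agreement and the chance floor combine, Cohen's $\kappa = (p_o - p_e)/(1 - p_e)$ can be sketched as below. The function names are hypothetical, and the numeric check uses only the rounded values quoted above, so it recovers the reported figure only up to rounding.

```python
from collections import Counter

def kappa_from_rates(p_observed, p_expected):
    """Cohen's kappa from observed agreement and chance (marginal) agreement."""
    if p_expected >= 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)

def cohen_kappa(labels_x, labels_y):
    """Cohen's kappa from two raters' label lists of equal length."""
    n = len(labels_x)
    p_o = sum(x == y for x, y in zip(labels_x, labels_y)) / n
    cx, cy = Counter(labels_x), Counter(labels_y)
    # Chance agreement: product of the two raters' marginal rates per class.
    p_e = sum((cx[c] / n) * (cy[c] / n) for c in set(cx) | set(cy))
    return kappa_from_rates(p_o, p_e)

# Rounded values from the text: agreement 0.83, chance floor 0.51.
approx_kappa = kappa_from_rates(0.83, 0.51)   # about 0.65 from rounded inputs
```

With the rounded inputs this gives roughly 0.65, in line with the reported $\kappa = 0.66$ once rounding of the agreement rate is taken into account.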

#### The proxies do not substitute for the judge\.

Geometry validity is a *weak but real* signal: across the full corpus it agrees with the judge on 0.62 of pairs (Wilson \[0.56, 0.68\]; cluster-bootstrap over objects \[0.55, 0.69\]; two-sided binomial $p < 0.001$ vs. 0.5), so it is significantly above chance yet far below our pre-registered target of roughly $\geq 0.75$. Render-CLIP, by contrast, is *at chance* (0.48, \[0.42, 0.54\], $p = 0.50$); we therefore cannot claim it is anti-correlated, only that it carries no usable quality signal. The learned Bradley–Terry head does not help: it places almost all weight on manifoldness (2.16) and a *negative* weight on render-CLIP ($-0.11$), and the resulting ranking matches geometry-only exactly (zero lift at both $N = 30$ and $N = 60$): given freedom to weight the features, it collapses onto a single geometry statistic. We read this as evidence that the limit lies in this feature set rather than in the model, though we cannot rule out that richer features would do better. On the strict held-out test subset (98 pairs) the same three rewards drop to 0.52 and render-CLIP to 0.40; this subset is smaller and noisier than the full corpus, and the downward shift from larger in-sample estimates (0.66 at $N = 24$) is consistent with small-sample optimism rather than a demonstrated trend.
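
The agreement rate and its test against chance can be sketched as follows. The tuple layout, the tie-handling rule (a tied proxy score counts as a miss), and the function names are assumptions for illustration; the two-sided p-value sums the probabilities of all outcomes no more likely than the observed one, which is one common convention for the exact binomial test.

```python
from math import comb

def proxy_agreement(pairs):
    """Fraction of pairs where the proxy and the judge prefer the same mesh.

    pairs: list of (score_a, score_b, judge_prefers_a) tuples (hypothetical
    layout). Assumption: a tied proxy score counts as a disagreement.
    """
    hits = 0
    for score_a, score_b, judge_prefers_a in pairs:
        if score_a == score_b:
            continue                       # tie: counted as a miss
        if (score_a > score_b) == judge_prefers_a:
            hits += 1
    return hits, hits / len(pairs)

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes out of n."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)        # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))
```

A rate near 0.5 yields a large p-value, which supports only "no detectable signal", not anti-correlation; this is the reading applied to render-CLIP above.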

Table 1: Agreement with the independent validation judge $Y$, with Wilson 95% CIs. Geometry validity is significantly above chance but weak; render-CLIP is at chance. The geometry-only, learned, and composite rewards produce effectively the *same* ranking (the learned head collapses onto manifoldness), so they are not three independent corroborations. “Full corpus” = all 262 position-consistent pairs (the rule-based signals are untrained); “held-out” = the 98 test pairs used for the learned head. Target is our pre-registered directional goal, not a pass/fail gate.

#### Why this is dangerous: a bimodal subgroup pattern\.

Table [2](https://arxiv.org/html/2606.18451#S5.T2) and Fig. [1](https://arxiv.org/html/2606.18451#S5.F1) break accuracy down by pair type. Geometry validity is significantly above chance on the two contrasts with *visible* geometric defects: cross-generator (Stable Fast 3D vs. TripoSR, 0.91, \[0.78, 0.96\], $p < 0.001$) and within-TripoSR degradation (0.80, \[0.67, 0.89\], $p < 0.001$). It is only at chance on cross-generator-mixed pairs (0.53, \[0.44, 0.62\], $p = 0.58$). The cross-generator-vs-ambiguous gap is large and significant (two-proportion $z = 5.06$, $p < 0.0001$). The within-Stable-Fast-3D cell (0.40, \[0.28, 0.54\], $p = 0.20$) is *not* statistically below chance and is confounded (see Limitations), so we do not lean on it. Render-CLIP is weak throughout (0.37–0.70). The danger is that the easy, visible-defect regime (exactly the setting in which proxies are usually reported to “work”) hides the chance-level behavior on the more ambiguous calls that matter for ranking or optimizing a single model.

Table 2: Per-subgroup geometry-proxy agreement with judge $Y$ (Wilson 95% CIs). Geometry is significantly above chance only on the visible-defect contrasts (cross-generator-full, within-TripoSR); on cross-generator-mixed it is at chance, and the within-SF3D cell is not significantly below chance and is confounded. Cells are exploratory ($n$ shown). Render-CLIP and judge agreement given for reference. † confounded: the judge’s own intact-vs-degraded call is itself at chance here (0.40, \[0.28, 0.54\]), so this cell measures reference noise as much as proxy failure.

![Refer to caption](https://arxiv.org/html/2606.18451v1/figures/subgroup.png)

Figure 1: Cheap proxies collapse on the ambiguous subgroups while the VLM judge stays reliable. Geometry-proxy accuracy (blue) is significantly above chance on the visible-defect contrasts (cross-generator-full, within-TripoSR) but at chance on cross-generator-mixed (within-SF3D is confounded; see text); render-CLIP (red) is weak throughout; cross-model judge agreement (green) stays 0.72–0.97. Dashed line is chance (0.5); error bars / small $n$ mean single-cell readings are exploratory.

#### Error mechanism \(interpretation\)\.

The subgroup pattern is *consistent with* a visual-salience account. On within-Stable-Fast-3D pairs the judge prefers the intact mesh only 0.40 of the time (\[0.28, 0.54\], i.e. itself at chance), versus 0.80 for within-TripoSR, and prefers Stable Fast 3D over TripoSR only 0.09 of the time. Stable Fast 3D meshes are open shells, so dropping faces is plausibly not clearly visible in a rendered view, whereas TripoSR meshes are watertight and degradation shows up as holes. Geometry validity always penalizes the degraded mesh, but the judge appears to agree mainly when the geometric defect is *visually salient*. We did not run the controlled manipulation (e.g. re-rendering at higher resolution or measuring silhouette difference) that would turn this from an observational correlation into a tested mechanism, so we offer it as our interpretation rather than a demonstrated cause; either way, the within-SF3D cell is confounded by stimulus visibility and is excluded from our proxy-failure claim.

#### Calibration\.

There is a hint that the geometry proxy is more often right on large-reward-gap pairs (0.67) than small-gap pairs (0.57) (Fig. [2](https://arxiv.org/html/2606.18451#S5.F2)), but with $n \approx 49$ per bin this difference is not statistically significant (overlapping CIs); we report it as directionally suggestive only.

![Refer to caption](https://arxiv.org/html/2606.18451v1/figures/calibration.png)

Figure 2: The geometry proxy is weakly but correctly calibrated: accuracy is higher when the reward gap between candidates is large. Dashed line is chance.

## 6 Discussion and Limitations

The positive result is a usable, reproducible evaluator: a fixed render rig, two independent VLM judge families, and a position-bias correction together give cross-model agreement of 0.83 ($\kappa = 0.66$, substantial), with no human labels. The cautionary result is that the cheap proxies one might reach for instead are a weak signal at best (0.62 for geometry, render-CLIP at chance) and, more subtly, look deceptively good on the visible-defect contrasts that are most often reported while dropping to chance on the ambiguous ones. A practical recommendation follows for the reward-alignment literature \[[15](https://arxiv.org/html/2606.18451#bib.bib15),[16](https://arxiv.org/html/2606.18451#bib.bib16),[17](https://arxiv.org/html/2606.18451#bib.bib17)\], *within the regime we tested* (two feed-forward generators on Google Scanned Objects with a face-drop degradation): drive specialization with the de-biased VLM-judge preferences directly rather than with a geometry/CLIP proxy reward, which a generator could satisfy (e.g. by maximizing manifoldness) without improving perceived quality.

### 6\.1 Limitations

#### The judge is the oracle\.

We treat the VLM judge as ground truth, so “quality” is defined relative to these judges. We mitigate by requiring agreement between two independent families, correcting position bias, and reporting agreement rather than assuming a gold label, but VLM judges remain imperfect \[[12](https://arxiv.org/html/2606.18451#bib.bib12)\].

#### One subgroup is confounded\.

On within-SF3D pairs the judge’s own intact-vs-degraded call is itself at chance (face-dropping an open shell is barely visible at 256 px), so we exclude that cell and rest the bimodality claim on the significant cross-generator and within-TripoSR cells.

#### Narrow regime\.

The face-drop degradation injects exactly the missing-geometry defect that geometry statistics detect and CLIP misses, so it may flatter geometry; and we test only two feed-forward generators, one object source (Google Scanned Objects), and $N = 60$ objects. Naturalistic failures (thin structures, transparency, multi-object inputs) and structurally different generators are future work; our intervals reflect object-level clustering but not generator or dataset transfer.

## 7 Availability

The evaluation protocol and per\-pair data are available from the authors on request\. No model checkpoints are released; the generators and judges used are existing public models\.

## 8 Conclusion and Future Work

For single-image-to-3D, a cross-model, position-bias-corrected VLM-judge protocol is a reliable, reproducible human-free evaluator ($\kappa = 0.66$) under the conditions we tested, while cheap geometry/CLIP proxies are weak (geometry 0.62, below target; render-CLIP at chance) and misleading on the visible-defect contrasts where they are usually reported. Future work: richer learned visual representations over the render rig (rather than CLIP) as a candidate automatic reward, broader generator/object coverage and naturalistic failure modes, a human spot-check to validate the judge against people, and (in a separate effort) generator specialization driven directly by the de-biased judge preferences.

## References

- Wu et al. \[2024\] Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3D: High-quality and efficient 3d mesh generation from a single image. 2024. URL [https://arxiv.org/abs/2405.20343](https://arxiv.org/abs/2405.20343).
- Bensadoun et al. \[2024\] Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Animesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3D Gen. 2024. URL [https://arxiv.org/abs/2407.02599](https://arxiv.org/abs/2407.02599).
- Wiedemann et al. \[2025\] Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, and Kai Yuan. Unifi3D: A study on 3d representations for generation and reconstruction in a common framework. 2025. URL [https://arxiv.org/abs/2509.02474](https://arxiv.org/abs/2509.02474).
- Sobol et al. \[2026\] Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, and Or Litany. Realiz3D: 3d generation made photorealistic via domain-aware learning. 2026. URL [https://arxiv.org/abs/2605.13852](https://arxiv.org/abs/2605.13852).
- Xie et al. \[2025\] Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons-Moll. MVGBench: Comprehensive benchmark for multi-view generation models. 2025. URL [https://arxiv.org/abs/2507.00006](https://arxiv.org/abs/2507.00006).
- Bastico et al. \[2025\] Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, and Etienne Decencière. Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation. 2025. URL [https://arxiv.org/abs/2511.05308](https://arxiv.org/abs/2511.05308).
- Lin et al. \[2025\] Chendi Lin, Heshan Liu, Qunshu Lin, Zachary Bright, Shitao Tang, Yihui He, Minghao Liu, Ling Zhu, and Cindy Le. Objaverse++: Curated 3d object dataset with quality annotations. 2025. URL [https://arxiv.org/abs/2504.07334](https://arxiv.org/abs/2504.07334).
- Wang et al. \[2026\] Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. ManiTwin: Scaling data-generation-ready digital object dataset to 100k. 2026. URL [https://arxiv.org/abs/2603.16866](https://arxiv.org/abs/2603.16866).
- Pu et al. \[2025\] Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, and Zhuang Liu. Memorization in 3D Shape Generation: An Empirical Study. 2025. URL [https://arxiv.org/abs/2512.23628](https://arxiv.org/abs/2512.23628).
- Siddiqui et al. \[2026\] Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, and Jakob Engel. ShapeR: Robust conditional 3d shape generation from casual captures. 2026. URL [https://arxiv.org/abs/2601.11514](https://arxiv.org/abs/2601.11514).
- Sani et al. \[2026\] Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, and Wenhu Chen. ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. 2026. URL [https://arxiv.org/abs/2603.27862](https://arxiv.org/abs/2603.27862).
- Kumar et al. \[2026\] Bhavesh Kumar, Dylan Feng, and Leonard Tang. MJ1: Multimodal Judgment via Grounded Verification. 2026. URL [https://arxiv.org/abs/2603.07990](https://arxiv.org/abs/2603.07990).
- Zheng et al. \[2023\] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. URL [https://arxiv.org/abs/2306.05685](https://arxiv.org/abs/2306.05685).
- Wang et al. \[2023\] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL [https://arxiv.org/abs/2305.17926](https://arxiv.org/abs/2305.17926).
- Li et al. \[2025\] Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3d generators with simulation feedback for physical soundness. 2025. URL [https://arxiv.org/abs/2503.22677](https://arxiv.org/abs/2503.22677).
- Liu et al. \[2025\] Qingming Liu, Zhen Liu, Dinghuai Zhang, and Kui Jia. Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards. 2025. URL [https://arxiv.org/abs/2506.15684](https://arxiv.org/abs/2506.15684).
- Zhou et al. \[2025\] Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization. 2025. URL [https://arxiv.org/abs/2502.04370](https://arxiv.org/abs/2502.04370).
- Chen et al. \[2025\] Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. MapReduce LoRA: Advancing the pareto front in multi-preference optimization for generative models. 2025. URL [https://arxiv.org/abs/2511.20629](https://arxiv.org/abs/2511.20629).
- Truong et al. \[2025\] Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, and Justin Solomon. Low-Rank Adaptation of Neural Fields. 2025. URL [https://arxiv.org/abs/2504.15933](https://arxiv.org/abs/2504.15933).
- Bai et al. \[2025\] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. arXiv:2502.13923 \[cs.CV\], 2025. URL [https://arxiv.org/abs/2502.13923](https://arxiv.org/abs/2502.13923).
- Zhu et al. \[2025\] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv:2504.10479 \[cs.CV\], 2025. URL [https://arxiv.org/abs/2504.10479](https://arxiv.org/abs/2504.10479).
- Radford et al. \[2021\] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 \[cs.CV\], 2021. URL [https://arxiv.org/abs/2103.00020](https://arxiv.org/abs/2103.00020).
- Ilharco et al\. \[2021\]Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt\.OpenCLIP\.Zenodo, software \(version 0\.1\), 2021\.URL[https://doi\.org/10\.5281/zenodo\.5143773](https://doi.org/10.5281/zenodo.5143773)\.
- Bradley and Terry \[1952\]Ralph Allan Bradley and Milton E\. Terry\.Rank Analysis of Incomplete Block Designs: I\. The Method of Paired Comparisons\.*Biometrika*, 39\(3/4\):324–345, 1952\.
- Pedregosa et al\. \[2011\]Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay\.Scikit\-learn: Machine Learning in Python\.*Journal of Machine Learning Research*, 12:2825–2830, 2011\.
- Downs et al\. \[2022\]Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B\. McHugh, and Vincent Vanhoucke\.Google Scanned Objects: A High\-Quality Dataset of 3D Scanned Household Items\.arXiv:2204\.11918 \[cs\.RO\], 2022\.URL[https://arxiv\.org/abs/2204\.11918](https://arxiv.org/abs/2204.11918)\.
- Boss et al\. \[2024\]Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani\.SF3D: Stable Fast 3D Mesh Reconstruction with UV\-unwrapping and Illumination Disentanglement\.arXiv:2408\.00653 \[cs\.CV\], 2024\.URL[https://arxiv\.org/abs/2408\.00653](https://arxiv.org/abs/2408.00653)\.
- Tochilkin et al\. \[2024\]Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan\-Pei Cao\.TripoSR: Fast 3D Object Reconstruction from a Single Image\.arXiv:2403\.02151 \[cs\.CV\], 2024\.URL[https://arxiv\.org/abs/2403\.02151](https://arxiv.org/abs/2403.02151)\.
- Wilson \[1927\]Edwin B\. Wilson\.Probable Inference, the Law of Succession, and Statistical Inference\.*Journal of the American Statistical Association*, 22\(158\):209–212, 1927\.
- Efron \[1979\]Bradley Efron\.Bootstrap Methods: Another Look at the Jackknife\.*The Annals of Statistics*, 7\(1\):1–26, 1979\.
- Cohen \[1960\]Jacob Cohen\.A Coefficient of Agreement for Nominal Scales\.*Educational and Psychological Measurement*, 20\(1\):37–46, 1960\.
- Landis and Koch \[1977\]J\. Richard Landis and Gary G\. Koch\.The Measurement of Observer Agreement for Categorical Data\.*Biometrics*, 33\(1\):159–174, 1977\.
