Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Summary
This paper introduces the PiSAR benchmark for screen-conditioned action prediction and compares supervised fine-tuned models against frontier zero-shot baselines. Key findings show a fine-tuned Qwen3-VL-8B achieves 0.783 semantic similarity, significantly outperforming Claude Opus 4.7 and GPT-5.5 (0.459 and 0.482), but the same fine-tuning recipe on a larger reasoning-tuned Gemma model yields only 0.441, indicating a model-recipe mismatch.
# Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
Source: [https://arxiv.org/html/2605.29400](https://arxiv.org/html/2605.29400)
Rahul Bissa \(AprioriLabs, alpha@apriori\.work\), Abhishek Vyas \(AprioriLabs, abhishekvyasiitdelhi@gmail\.com\), Yash Jain \(AprioriLabs, yash\.jain@gmail\.com\)
###### Abstract
We benchmark three supervised fine\-tuned models against frontier zero\-shot baselines on a 661\-row held\-out slice of *PiSAR* \(Persona, intent, Screen, Action, Rationale\), a 12,929\-tuple corpus of screen\-anchored behavioural rationales curated from public app\-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces\. Every model, frontier or fine\-tuned, is evaluated on the same 661\-row slice with the same scoring pipeline\. Two findings\. First, frontier zero\-shot baselines \(Claude Opus 4\.7 and GPT\-5\.5\) reach sem\_sim 0\.459 and 0\.482 respectively; a fine\-tuned Qwen3\-VL\-8B\-Instruct reaches 0\.783 and clears sem\_sim ≥ 0\.7 on 79% of rows, against 1–2% for either frontier baseline, a gap of 0\.30 absolute on the same test set\. Second, the same training data and recipe on Gemma\-4\-26B\-A4B\-IT scores only 0\.441, in the same band as the frontier zero\-shot baselines rather than the fine\-tuned Qwen\. We read this as a recipe\-vs\-model mismatch: the reasoning\-tuned high\-parameter model resists displacement and would likely need either more data or a stronger fine\-tuning method\.
## 1 Introduction
A frontier vision\-language model, given a screenshot of a product moment and a real persona, should plausibly describe what the user is thinking\. The pretraining covers behavioural data\. The multimodal grounding is solved\. The model has been told to act as an instruction\-following assistant\. By this account, frontier zero\-shot should be strong\.
We measured this on a 661\-row held\-out slice of PiSAR, with every model, frontier or fine\-tuned, scored on the same input rows with the same metrics\. Claude Opus 4\.7 zero\-shot reaches sem\_sim 0\.459; GPT\-5\.5 zero\-shot reaches 0\.482\. A Qwen3\-VL\-8B\-Instruct LoRA fine\-tune on 13,796 traces of our PiSAR corpus reaches 0\.783\. The same fine\-tune recipe on Gemma\-4\-26B\-A4B\-IT reaches 0\.441, in the band where the frontier zero\-shot baselines already sit\.
Two findings, in priority order\. First, frontier zero\-shot underperforms a small task\-specific fine\-tune by 0\.30 absolute sem\_sim on the same test set, and the 8B Qwen\-VL fine\-tune serves at sub\-second per\-call latency\. Second, the SFT signal does not transfer uniformly across bases: the same training data on a higher\-parameter reasoning\-tuned base produces only modest movement from its zero\-shot prior, accompanied by a chain\-of\-thought template bleed in 4% of outputs\. We hypothesise, but do not test, that the reasoning\-tuned base needs either more data or a stronger fine\-tuning method than the 13,796\-row LoRA\-rank\-16 recipe used here\.
We do not claim that fine\-tuning is generally better than prompting\. We do not claim Gemma is a bad base model\. We claim that on this evaluation, the SFT\-vs\-frontier gap is large for one base and absent for the other, and that the difference is informative about how SFT at a fixed example budget interacts with the base model’s post\-training prior\.
## 2 Related Work
Three threads intersect at this benchmark: LLM\-based simulation of human behaviour, foundation\-model approaches to cognitive modelling, and the recent online\-shopping simulation literature that our training corpus and evaluation slice draw from most directly\. We sketch the relevant prior work in each thread\.
### 2\.1 LLM\-based simulation of persona behaviour
Park et al\. \([2023](https://arxiv.org/html/2605.29400#bib.bib3)\) introduced persona\-conditioned simulated\-society dynamics from prompt\-engineered LLM behaviour, with no fine\-tuning\. The follow\-up of Park et al\. \([2024](https://arxiv.org/html/2605.29400#bib.bib4)\) scaled the simulation to a 1,000\-person interview\-based prediction setting, again with prompting rather than parameter updates\. The simulation\-by\-prompting approach is the prior most directly visible in deployed behavioural\-simulation systems\.
The training\-not\-prompting alternative is the Binz/Schulz line\. Binz and Schulz \([2023](https://arxiv.org/html/2605.29400#bib.bib5)\) characterised LLM behaviour against classical cognitive\-psychology benchmarks\. Centaur \(Binz et al\., [2025](https://arxiv.org/html/2605.29400#bib.bib6)\) is a foundation model trained on psychology trial data to reach human\-comparable performance on those benchmarks; the lift came from SFT, not from prompting\. Namazova et al\. \([2025](https://arxiv.org/html/2605.29400#bib.bib8)\) argued that Centaur’s success on aggregate task performance does not yet establish it as a faithful synthetic participant at the individual level, a critique we read directly into our own [Section 5\.1](https://arxiv.org/html/2605.29400#S5.SS1)\.
A separate observation from the same lab \(Binz and others \([2026](https://arxiv.org/html/2605.29400#bib.bib7)\)\) reports that RLHF and instruction tuning on large language models degrade their distributional fidelity to human source data\. Our Gemma result is consistent with that direction in reverse: the model’s reasoning\-tuned post\-training prior resisted the SFT signal we applied\.
### 2\.2 Online shopping and usability\-testing behaviour simulation
Lu et al\. \([2025b](https://arxiv.org/html/2605.29400#bib.bib9)\) introduced UXAgent, an LLM\-agent framework for usability testing of web designs without recruiting human participants\. Lu et al\. \([2025a](https://arxiv.org/html/2605.29400#bib.bib10)\) measured agent\-simulation methodologies against real\-world online\-customer behaviour data, with the headline finding, reflected in their title, that prompting alone does not match real human distributions on multi\-turn shopping tasks\. Wang et al\. \([2025a](https://arxiv.org/html/2605.29400#bib.bib11)\) released the OPeRA dataset of 52 Amazon shoppers with concurrent verbal rationales, which we use as a training source and evaluation anchor\.
The most recent line in this thread switches from SFT or prompting to reinforcement learning\. Shop\-R1 \(Zhang et al\., [2025](https://arxiv.org/html/2605.29400#bib.bib12)\) rewards LLMs for matching real shopper behaviour over multiple steps; Customer\-R1 \(Wang et al\., [2025b](https://arxiv.org/html/2605.29400#bib.bib13)\) adds per\-individual personalization on top\. Zhang and others \([2025](https://arxiv.org/html/2605.29400#bib.bib14)\) extend the same setting to vision\-language agents that condition on the screen, which is the closest precedent for the screen\-conditioned axis we evaluate here\.
### 2\.3 Adapter fine\-tuning and screen\-conditioned VLM agents
LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.29400#bib.bib1)\) established low\-rank adapter fine\-tuning as a practical alternative to full\-parameter SFT; QLoRA \(Dettmers et al\., [2023](https://arxiv.org/html/2605.29400#bib.bib2)\) added 4\-bit quantization to scale the recipe to consumer GPUs\. The Fireworks managed\-SFT platform used in this paper offers a hosted LoRA\-style fine\-tuning surface for vision\-language bases\.
On the GUI agent side, CogAgent \(Hong et al\., [2024](https://arxiv.org/html/2605.29400#bib.bib15)\) is the canonical screen\-conditioned vision\-language model for click\-coordinate action prediction\. Mind2Web \(Deng et al\., [2023](https://arxiv.org/html/2605.29400#bib.bib16)\) provides a multi\-website action\-trace corpus\. Both papers share a discrete\-output target \(a coordinate or a button label\) rather than the free\-form first\-person rationale our evaluation scores against\. We refer to them for setup framing rather than as direct comparators\.
## 3 Setup
Figure 1: Pipeline\. Public sources are fused into the 12,929\-tuple *PiSAR* corpus with base64\-JPEG screens inline; the exporter emits chat\-completions JSONL for Fireworks managed SFT; each deployment is evaluated at T=0 against the held\-out slice; per\-row scores feed three metrics\.
### 3\.1 Datasets
We introduce *PiSAR*, a 12,929\-tuple proprietary corpus of screen\-anchored behavioural rationales built and maintained at AprioriLabs\. Each tuple is a \(Persona, intent, Screen, Action, Rationale\) record \(the acronym from which the corpus takes its name\) drawn from a small set of public sources and joined under a “real on 2 of 3 slots” rule: at least two of \{screen, persona, reasoning\} must be observed directly from a real human, not synthesized\. The sources are the OPeRA shopper traces \(Wang et al\., [2025a](https://arxiv.org/html/2605.29400#bib.bib11)\), public app\-store reviews paired with their app’s marketing screenshots, and Pew American Trends Panel demographic microdata for persona context\-matching\. Listing 1 shows the schema of a redacted record\. PiSAR itself is not released alongside this paper; the methodology described here is intended to be reproducible end\-to\-end on equivalent corpora that a reader builds from the same public sources\.
Each record carries a base64\-JPEG screen \(512 × ~280 px, mean 31 KB\), a structured persona, an intent string, a planned next action, and the gold rationale\. SFT training uses two variants of the same corpus: OPeRA\-only \(4,014 rows\) and combined \(13,796 rows, formed by concatenating OPeRA\-only with the 9,782\-row Filter A training split so OPeRA is upsampled 2×\)\.
Listing 1: One redacted training record\. The screen is inlined as a data\-URI; the persona is a real Pew ATP donor\.

```json
{
  "messages": [
    {"role": "system", "content": "You are simulating a real person mid-task..."},
    {"role": "user", "content": [
      {"type": "text", "text": "WHO YOU ARE: 30-49/A man/White non-Hispanic/Some college/$50,000-$74,999/Metropolitan/Married\nINTENT: leaving a 1-2 star review of the Netflix app (US)\nABOUT TO DO: leave a 1-2 star review of this app"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABA..."}}
    ]},
    {"role": "assistant", "content": "This app doesn’t allow me to adjust to full screen on my tv. Every other app does."}
  ]
}
```
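The record shape in Listing 1 can be assembled programmatically\. The following is a minimal sketch, not the exporter used for PiSAR: the function names `build_record` and `to_jsonl_line`, the argument names, and the placeholder bytes are all hypothetical, and only the message layout is taken from the listing\.

```python
import base64
import json


def build_record(system_prompt, persona, intent, action, jpeg_bytes, rationale):
    """Assemble one chat-completions training record in the Listing 1 shape."""
    user_text = f"WHO YOU ARE: {persona}\nINTENT: {intent}\nABOUT TO DO: {action}"
    data_uri = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ]},
            {"role": "assistant", "content": rationale},
        ]
    }


def to_jsonl_line(record):
    """Serialize one record as a single JSONL line (newlines stay escaped)."""
    return json.dumps(record, ensure_ascii=False)


# Placeholder bytes stand in for a real screenshot; they are not a valid JPEG.
example = build_record(
    system_prompt="You are simulating a real person mid-task...",
    persona="30-49/A man/Some college",
    intent="leaving a 1-2 star review of the Netflix app (US)",
    action="leave a 1-2 star review of this app",
    jpeg_bytes=b"\xff\xd8\xff\xe0placeholder",
    rationale="This app doesn't allow me to adjust to full screen on my tv.",
)
line = to_jsonl_line(example)
```

Writing one such line per record yields the chat\-completions JSONL that Figure 1 describes as the exporter output\.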
### 3\.2 Training
Three SFT runs, all on Fireworks managed SFT, summarized in Table [1](https://arxiv.org/html/2605.29400#S3.T1)\. We did not override the Fireworks UI defaults beyond the LoRA rank and the schedule shown\. The combined\-train run upsamples OPeRA 2× by concatenating OPeRA\-only with Filter A, rather than reweighting the loss; the train budgets in Table [1](https://arxiv.org/html/2605.29400#S3.T1) reflect the concatenation\. The row arithmetic, made explicit so the count differential is not opaque: the PiSAR corpus is 12,929 SFT\-grade tuples before the screen\-availability filter \(5,648 OPeRA \+ 7,281 app\-store\)\. After dropping the 1,910 tuples that lack a fetched base64 screen, 11,019 remain \(4,709 OPeRA \+ 6,310 app\-store\), partitioned by the canonical split\-by\-screen scheme into a 9,782\-row training set \(4,014 OPeRA \+ 5,768 app\-store\) and a 1,237\-row test set\. The OPeRA\-only training set is the 4,014 OPeRA rows from that split\. The combined training set is the concatenation of the OPeRA\-only set with the full Filter A training split: 4,014 \+ 9,782 = 13,796 rows, with OPeRA appearing twice and app\-store appearing once\. Whether the upsample is causal in the OPeRA\-slice gain or whether sample count alone explains it is unverified; combined has roughly 3\.4× the rows of OPeRA\-only\. We flag this as a confound in Section [4\.3](https://arxiv.org/html/2605.29400#S4.SS3)\.
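The row arithmetic above can be checked mechanically\. This short sketch uses only the counts quoted in this section; the variable names are ours\.

```python
# Row counts quoted in Section 3.2.
opera_raw, appstore_raw = 5_648, 7_281
dropped_no_screen = 1_910
opera_kept, appstore_kept = 4_709, 6_310
opera_train, appstore_train = 4_014, 5_768
test_rows = 1_237

corpus = opera_raw + appstore_raw                  # 12,929 tuples
filtered = corpus - dropped_no_screen              # 11,019 with a screen
filter_a_train = opera_train + appstore_train      # 9,782 training rows
combined_train = opera_train + filter_a_train      # 13,796 rows, OPeRA twice

assert corpus == 12_929
assert filtered == opera_kept + appstore_kept == 11_019
assert filter_a_train + test_rows == filtered
assert combined_train == 13_796

# OPeRA share of each training mix, and the example-count ratio between mixes.
opera_share_filter_a = opera_train / filter_a_train
opera_share_combined = 2 * opera_train / combined_train
count_ratio = combined_train / opera_train
```

Under these counts the concatenation raises OPeRA's share of training rows from about 41% to about 58%, while the combined set carries about 3\.4 times as many rows as OPeRA\-only, which is the confound flagged above\.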
Table 1: Per\-run training configuration as set in the Fireworks managed\-SFT UI\. The two Qwen runs share the same recipe; the Gemma run uses a larger batch and longer max context, consistent with the Fireworks defaults for that base\. Fields not exposed by the UI are not reported\.
### 3\.3 Evaluation metrics
Three per\-row metrics, aggregated by mean over the test slice and by threshold pass\-rate over per\-row values\.
#### token\_jaccard
Lowercased word\-token Jaccard between predicted and gold rationales\. Captures lexical surface overlap; values in \[0, 1\]; higher is better\.
#### length\_ratio
tokens\(pred\) / tokens\(gold\)\. A diagnostic, not a quality score\. Ratios near 1\.0 indicate matched terseness; ratios above 1\.5 indicate the model over\-explains relative to gold\.
#### semantic\_similarity \(sem\_sim\)
Cosine similarity between OpenAI text\-embedding\-3\-small embeddings \(1,536\-dim\) of predicted and gold rationales\. For natural\-language pairs sem\_sim falls in roughly \[0\.1, 0\.95\]\.
Threshold pass rates \(sem ≥ 0\.3 / 0\.5 / 0\.7\) report the fraction of test rows clearing each cut\. The thresholds are arbitrary cuts on a continuous metric; we read them as a semantic check, not a benchmark gate\. The mean is the headline\.
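The three metrics and the threshold pass rates can be sketched in a few lines\. This is an illustration under stated assumptions, not the scoring pipeline itself: the regex tokenizer and the set\-based Jaccard are our guesses at "lowercased word\-token", and `cosine` takes embedding vectors that are assumed to have been fetched already \(no embedding API call is shown\)\.

```python
import math
import re


def tokens(text):
    """Lowercased word tokens; the regex is an assumed tokenizer."""
    return re.findall(r"\w+", text.lower())


def token_jaccard(pred, gold):
    """Jaccard overlap between the two token sets, in [0, 1]."""
    a, b = set(tokens(pred)), set(tokens(gold))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def length_ratio(pred, gold):
    """tokens(pred) / tokens(gold); NaN when the gold has no tokens."""
    n_gold = len(tokens(gold))
    return len(tokens(pred)) / n_gold if n_gold else float("nan")


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pass_rates(scores, cuts=(0.3, 0.5, 0.7)):
    """Fraction of rows whose score clears each threshold."""
    return {c: sum(s >= c for s in scores) / len(scores) for c in cuts}
```

For example, `token_jaccard("check the cart", "check cart")` shares two of three distinct tokens, and the same pair has a `length_ratio` of 1\.5, the value the text treats as the over\-explaining boundary\.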
### 3\.4 Baselines
Two frontier zero\-shot baselines were run directly against PiSAR, on the same 661 rows the fine\-tuned models were evaluated on: Claude Opus 4\.7 via Anthropic’s native Messages API, and GPT\-5\.5 via OpenAI’s native Chat Completions API\. Both processed all 661 rows with zero errors\. Decoding follows each provider’s defaults: Opus 4\.7 deprecates the temperature parameter, and the GPT\-5 reasoning family requires max\_completion\_tokens with T=1 rather than the T=0 the SFT runs use\. Image input is the same base64\-JPEG payload the SFT runs receive, converted to each provider’s native image\-block schema before the call\. We did not explore prompt engineering, few\-shot exemplars, or longer\-context system prompts; the goal is to measure frontier behaviour under the simplest possible apples\-to\-apples lift, not to optimise it\.
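The image conversion step can be sketched as plain dictionary manipulation\. The helper names are hypothetical, and the block shapes follow our understanding of the two providers' public documentation rather than the conversion code used for the paper, which is not shown\.

```python
def split_data_uri(data_uri):
    """Split 'data:<media_type>;base64,<payload>' into (media_type, payload)."""
    header, payload = data_uri.split(",", 1)
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, payload


def to_anthropic_image_block(data_uri):
    """Image content block in the shape the Anthropic Messages API accepts."""
    media_type, payload = split_data_uri(data_uri)
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": payload},
    }


def to_openai_image_part(data_uri):
    """Image part for OpenAI Chat Completions; the data URI passes through."""
    return {"type": "image_url", "image_url": {"url": data_uri}}
```

Both helpers carry the same base64 payload, so the bytes each frontier model sees match the bytes in the SFT record\.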
## 4 Results
The combined\-trained Qwen3\-VL\-8B\-Instruct \(b5my94dm\) reaches sem\_sim 0\.783 on PiSAR\. The two frontier zero\-shot baselines \(Opus 4\.7 and GPT\-5\.5\), evaluated on the same 661 rows, reach 0\.459 and 0\.482 respectively\. The OPeRA\-only\-trained Qwen \(ycfo6bpw\) reaches 0\.519\. The combined\-trained Gemma \(gz7vqm46\) reaches 0\.441\. Two of those numbers carry the paper: the gap from 0\.78 to ~0\.47 \(fine\-tune vs frontier on the same test set\), and the gap from 0\.78 to 0\.44 \(same training data on a different base model\)\.
Table [2](https://arxiv.org/html/2605.29400#S4.T2) is the unified leaderboard\. Each row is one \(model, slice\) tuple\. Our SFT runs appear three times each \(overall \+ per source\); frontier baselines appear once per slice they were measured on\. Figure [2](https://arxiv.org/html/2605.29400#S4.F2) plots the per\-model sem\_sim mean with 95% bootstrap CIs\.
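A bootstrap interval over per\-row scores can be computed as follows\. This is a sketch assuming a percentile bootstrap with 2,000 resamples; the paper does not state which bootstrap variant or resample count it used, and the scores below are synthetic, not PiSAR data\.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-row scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-row scores for illustration only; not PiSAR data.
demo_rng = random.Random(1)
demo_scores = [min(1.0, max(0.0, demo_rng.gauss(0.6, 0.15))) for _ in range(661)]
demo_mean = statistics.fmean(demo_scores)
ci_low, ci_high = bootstrap_ci(demo_scores)
```

Resampling rows with replacement keeps the interval tied to the 661\-row slice itself, so two models scored on the same rows can be compared by whether their intervals overlap\.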
Figure 2: sem\_sim per model on PiSAR \(n=661\)\. SFT runs are highlighted \(teal/rust band\); frontier zero\-shot baselines render in grayscale\. Error bars are 95% bootstrap CIs over per\-row sem\_sim\. The gap between the top SFT bar \(combined Qwen\-VL at 0\.783\) and the top frontier bar \(GPT\-5\.5 at 0\.482\) is 0\.30 absolute\.
Table 2: Unified leaderboard\. Every model is evaluated on the same held\-out slice \(PiSAR, 661 rows: 119 OPeRA \+ 542 app\-store\) with the same scoring pipeline\. “Combined” training mix is OPeRA\-only concatenated with Filter A so OPeRA is upsampled 2× \(see [Section 3\.2](https://arxiv.org/html/2605.29400#S3.SS2) for the row arithmetic\)\.

| Model | Training | Slice | n | jacc | len\_r | sem | sem ≥ 0\.3 | sem ≥ 0\.5 | sem ≥ 0\.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Our SFT runs \(Qwen3\-VL\-8B\-Instruct / Gemma\-4\-26B\-A4B\-IT\)* | | | | | | | | | |
| Qwen\-VL\-8B \(b5my94dm\) | combined | overall | 661 | 0\.417 | 1\.01 | 0\.783 | 99% | 96% | 79% |
| ↪ | combined | OPeRA\-clean | 119 | 0\.635 | 1\.28 | 0\.800 | 98% | 91% | 74% |
| ↪ | combined | app\-store\-clean | 542 | 0\.369 | 0\.95 | 0\.779 | 100% | 98% | 81% |
| Qwen\-VL\-8B \(ycfo6bpw\) | OPeRA\-only | overall | 661 | 0\.233 | 0\.83 | 0\.519 | 77% | 54% | 24% |
| ↪ | OPeRA\-only | OPeRA\-clean | 119 | 0\.551 | 1\.29 | 0\.735 | 95% | 83% | 66% |
| ↪ | OPeRA\-only | app\-store\-clean | 542 | 0\.164 | 0\.73 | 0\.471 | 73% | 47% | 15% |
| Gemma\-4\-26B\-A4B \(gz7vqm46\) | combined | overall | 661 | 0\.095 | 4\.43† | 0\.441 | 86% | 32% | 2% |
| ↪ | combined | OPeRA\-clean | 119 | 0\.085 | 16\.64† | 0\.380 | 70% | 20% | 3% |
| ↪ | combined | app\-store\-clean | 542 | 0\.097 | 1\.74 | 0\.455 | 90% | 35% | 2% |
| *Frontier zero\-shot baselines* | | | | | | | | | |
| Opus 4\.7 | zero\-shot | overall | 661 | 0\.097 | 1\.04 | 0\.459 | 89% | 38% | 1% |
| ↪ | zero\-shot | OPeRA\-clean | 119 | 0\.066 | 1\.42 | 0\.343 | 59% | 13% | 0% |
| ↪ | zero\-shot | app\-store\-clean | 542 | 0\.104 | 0\.96 | 0\.485 | 96% | 44% | 1% |
| GPT\-5\.5 | zero\-shot | overall | 661 | 0\.108 | 0\.96 | 0\.482 | 92% | 45% | 2% |
| ↪ | zero\-shot | OPeRA\-clean | 119 | 0\.112 | 1\.62 | 0\.405 | 72% | 28% | 2% |
| ↪ | zero\-shot | app\-store\-clean | 542 | 0\.107 | 0\.81 | 0\.499 | 96% | 49% | 2% |

† Gemma length\_ratio means are distorted by reasoning\-trace outliers \(one row at 457×\)\. Medians are 0\.71 \(overall\), 1\.00 \(OPeRA\-clean\), 0\.66 \(app\-store\-clean\) and are the honest numbers; see [Section 4\.2](https://arxiv.org/html/2605.29400#S4.SS2)\.
### 4\.1 SFT vs frontier
*Convention\.* “The gap” without further qualification denotes the gap between the fine\-tuned Qwen and *the stronger of the two frontier baselines*, GPT\-5\.5: 0\.783 − 0\.482 = 0\.301\. The corresponding gap to Opus 4\.7 is 0\.324, and the within\-SFT gap to Gemma is 0\.342\. Where a section quotes a different number we name the baseline explicitly\.
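The three gaps, and the pass\-rate ratios quoted elsewhere in the paper, follow directly from the overall rows of Table 2\. The sketch below recomputes them; the dictionary keys are our own labels, and the ratios are approximate because the table reports rounded percentages\.

```python
# Overall sem_sim means and sem >= 0.7 pass rates copied from Table 2.
sem = {
    "qwen_combined": 0.783,
    "gpt_5_5": 0.482,
    "opus_4_7": 0.459,
    "gemma_combined": 0.441,
}
pass_07 = {"qwen_combined": 0.79, "gpt_5_5": 0.02, "opus_4_7": 0.01}

# Gap of the combined-trained Qwen over each other system.
gaps = {
    name: round(sem["qwen_combined"] - score, 3)
    for name, score in sem.items()
    if name != "qwen_combined"
}

# Ratio of strict-paraphrase pass rates, Qwen over each frontier baseline.
ratios = {
    name: pass_07["qwen_combined"] / rate
    for name, rate in pass_07.items()
    if name != "qwen_combined"
}
```

The resulting ratios of about 39\.5 \(against GPT\-5\.5\) and about 79 \(against Opus 4\.7\) are the two ends of the "40\-to\-80×" range\.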
The combined\-trained Qwen3\-VL\-8B beats the stronger frontier zero\-shot baseline \(GPT\-5\.5 at sem\_sim 0\.482\) by 0\.30 absolute on the same 661\-row test set\. On the strict\-paraphrase threshold the contrast sharpens further: the combined Qwen clears sem\_sim ≥ 0\.7 on 79% of rows where Opus 4\.7 clears it on 1% and GPT\-5\.5 on 2% \([Fig\. 3](https://arxiv.org/html/2605.29400#S4.F3)\)\. On token\_jaccard the gap is comparable, 0\.417 vs 0\.097 \(Opus\) and 0\.108 \(GPT\-5\.5\), so the result is not a length\-mismatch artifact\.
Figure 3: Threshold pass rates on PiSAR \(n=661\)\. The shading shows three cuts of the same continuous metric: sem\_sim ≥ 0\.3 \(*on\-topic*, lightest\), ≥ 0\.5 \(*right idea*\), ≥ 0\.7 \(*paraphrase quality*, darkest\)\. On the strict\-paraphrase cut the gap between the combined\-trained Qwen \(79%\) and the stronger frontier zero\-shot \(GPT\-5\.5 at 2%\) is roughly 40×\.
Both frontier zero\-shot baselines on PiSAR \(Opus 4\.7 and GPT\-5\.5, n=661 each\) confirm the SFT\-vs\-frontier gap on apples\-to\-apples ground: sem\_sim 0\.459 \(Opus 4\.7\) and 0\.482 \(GPT\-5\.5\) against 0\.783 for the combined\-trained Qwen\-VL on the same test rows, a gap of 0\.30 absolute\. Threshold pass rates at sem\_sim ≥ 0\.7 are 1–2% for both frontier runs vs 79% for the fine\-tuned Qwen\. token\_jaccard tells the same story: 0\.097 \(Opus\) and 0\.108 \(GPT\-5\.5\) against 0\.417 for the fine\-tune\. The frontier zero\-shot baselines stay terse on average \(length\_ratio 1\.04 for Opus, 0\.96 for GPT\-5\.5\)\. This is not a length\-mismatch artifact; the gap is in content, not in verbosity\.
### 4\.2 Gemma stays in the frontier zero\-shot band
Identical training data, identical Fireworks managed\-SFT recipe, and the same LoRA rank and schedule; the Gemma run differs only in the larger batch and longer max context that Fireworks defaults to for that base \([Table 1](https://arxiv.org/html/2605.29400#S3.T1)\)\. The combined\-trained Qwen3\-VL\-8B reaches sem\_sim 0\.783; the combined\-trained Gemma reaches 0\.441\. The Gemma score sits in the same band as the frontier zero\-shot baselines on the same slice \(Opus 4\.7 at 0\.459, GPT\-5\.5 at 0\.482\)\. The SFT moved Gemma a small amount above its zero\-shot prior but not out of the frontier\-zero\-shot regime; the same recipe moved Qwen3\-VL\-8B far above it\.
A diagnostic: on 26 of 661 test rows \(4%\), Gemma’s output at T=0 begins with literal markdown markers \(\* Persona:, \* Intent:, \* Action:, \* Draft 1:, \* Draft 2:\) and never produces a closing answer inside the 1,500\-token budget\. The worst single row has length\_ratio 457 \([Fig\. 4](https://arxiv.org/html/2605.29400#S4.F4)\): a 10\-character gold \("check cart"\) paired with a 5,425\-character internal reasoning trace that lists persona attributes, drafts candidate rationales, and exhausts the budget mid\-thinking\. For these rows message\.content is null; we salvage message\.reasoning\_content so the metrics stay defined\. Median length\_ratio is 0\.71 for the overall slice: most rows behave; 26 outliers pull the mean to 4\.43\. Latency is also higher \(median 4\.79 s vs 0\.79 s for the Qwen runs\), consistent with the model spending most of its budget on internal drafts\.
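The salvage step and the bleed check described above might look like the following\. This is a sketch under assumptions: the function names are hypothetical, the message is treated as a plain dictionary with the two field names the text mentions, and the bleed test flags any of the five listed markers at the start of the output\.

```python
BLEED_MARKERS = ("* Persona:", "* Intent:", "* Action:", "* Draft 1:", "* Draft 2:")


def extract_prediction(message):
    """Return (text, salvaged) for one chat-completion message dict.

    Falls back to `reasoning_content` when `content` is null or empty,
    so downstream metrics stay defined.
    """
    content = message.get("content")
    if content:
        return content, False
    reasoning = message.get("reasoning_content") or ""
    return reasoning, True


def is_template_bleed(text, markers=BLEED_MARKERS):
    """Flag outputs that open with one of the markdown markers named above."""
    return text.lstrip().startswith(markers)
```

Counting rows where `salvaged` is true, or where `is_template_bleed` fires, gives a per\-run tally comparable to the 26\-row figure reported here\.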
Figure 4: length\_ratio per SFT model: median \(lighter\) vs mean \(darker\)\. Gemma’s mean is pulled to 4\.43 by 26 reasoning\-trace\-bleed outliers; the single worst row reaches 457×\. Reference line at 1\.0 \(matched terseness\)\.
Our reading: a recipe\-vs\-base mismatch, not a model\-quality regression\. Gemma\-4\-26B\-A4B\-IT was post\-trained on a draft\-first reasoning template that the 13,796\-row LoRA\-rank\-16 recipe did not displace\. Qwen3\-VL\-8B\-Instruct was post\-trained on a more direct\-output format, and the same recipe did displace it\. We hypothesize, but do not test, that a higher\-parameter reasoning\-tuned base needs either substantially more training data or a stronger fine\-tuning method \(full\-parameter FT, higher\-rank LoRA, longer schedule\) to move materially out of its prior\. The chain\-of\-thought template bleed we observe is consistent with that reading: the model still wants to draft because the SFT did not provide enough signal to override the post\-training behaviour\. We do not have data on the recipe’s behaviour at higher rank or more epochs on Gemma, the same recipe on a non\-reasoning\-tuned Gemma checkpoint, or a third large base\. The hypothesis is informally supported by the Gemma\-vs\-Qwen\-vs\-frontier comparison; we report the failure mode and the diagnostic, and we do not generalize past this one base\.
### 4\.3 Training\-data composition
Same base, different training corpus \([Fig\. 5](https://arxiv.org/html/2605.29400#S4.F5)\)\. The combined\-trained Qwen3\-VL\-8B reaches sem\_sim 0\.783; the OPeRA\-only\-trained Qwen reaches 0\.519 on the same slice\. The biggest delta is on app\-store rows: 0\.471 → 0\.779 \(\+0\.308, using the Table 2 values\)\. On OPeRA rows the gain is smaller \(0\.735 → 0\.800, \+0\.065\) but consistent\. Adding app\-store training did not hurt OPeRA evaluation\.
Figure 5: sem\_sim by source on the held\-out slice\. OPeRA\-clean \(lighter bars, n=119\) and app\-store\-clean \(darker bars, n=542\)\. The combined\-trained Qwen lifts both slices; OPeRA\-only generalizes only on OPeRA; Gemma underperforms across the board\.
The confound we cannot resolve: combined train has 3\.4× the example count of OPeRA\-only, in addition to differing source mix\. The clean experiment is “OPeRA\-only at 13,796 examples by re\-sampling” vs “combined at 13,796 examples”; we did not run it\. The observed gain is consistent with the “app\-store text teaches reviewer voice that transfers” story and with the “more examples is just better” story; we cannot distinguish on what we have\.
The distributional view \([Fig\. 6](https://arxiv.org/html/2605.29400#S4.F6)\) shows that the combined run shifts the entire per\-row sem\_sim ECDF rightward over the OPeRA\-only run, not just the mean\. The Gemma curve sits below both\.
Figure 6: Per\-row sem\_sim ECDF on PiSAR \(n=661\) for the three SFT runs\. Vertical guides at the three threshold cuts\. The combined Qwen\-VL run stochastically dominates the OPeRA\-only run; the Gemma run sits below both across the full support\.
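The stochastic\-dominance reading of the ECDF plot can be checked numerically\. The sketch below assumes first\-order dominance is the intended notion and uses toy score lists, not PiSAR rows; `ecdf` and `dominates` are our own helper names\.

```python
from bisect import bisect_right


def ecdf(scores):
    """Return F where F(x) is the fraction of scores <= x."""
    ordered = sorted(scores)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def dominates(a, b):
    """True if sample a first-order stochastically dominates sample b.

    Checks F_a(x) <= F_b(x) at every observed value, strictly somewhere.
    Both ECDFs are step functions, so the observed values are sufficient.
    """
    f_a, f_b = ecdf(a), ecdf(b)
    grid = sorted(set(a) | set(b))
    diffs = [f_b(x) - f_a(x) for x in grid]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


# Toy scores for illustration only.
higher = [0.6, 0.7, 0.8, 0.9]
lower = [0.2, 0.4, 0.6, 0.8]
```

A curve that dominates in this sense lies at or to the right of the other at every score level, which is a stronger statement than a higher mean\.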
### 4\.4 A worked example
Figure [7](https://arxiv.org/html/2605.29400#S4.F7) shows one of the rows where the combined\-trained Qwen pulls clearly ahead of both alternatives\. It is one row of 661, not a benchmark in itself; the headline numbers come from Table [2](https://arxiv.org/html/2605.29400#S4.T2)\.
Figure 7: Worked example, row 122 \(OPeRA\)\. The gold rationale is a verbatim shopper utterance during an Amazon search session, captured by the OPeRA protocol: *change searching keywords to find wedge pillow for sleep apnea*\. The combined\-trained Qwen produces a near\-paraphrase \(sem\_sim 0\.973\); the OPeRA\-only\-trained Qwen captures the medical intent but invents an “auto completer” detail not on the screen \(sem\_sim 0\.509\); the Gemma run drifts into a generic positive reaction unrelated to the persona’s task \(sem\_sim 0\.159\)\.
## 5 Discussion
### 5\.1 Scope of the claim
The claim is bounded but load\-bearing\. On the same 661\-row held\-out slice, scored by the same pipeline, a Qwen3\-VL\-8B\-Instruct fine\-tuned on 13,796 PiSAR records produces behavioural rationales that match the recorded human voice on 79% of rows at the strict\-paraphrase cut \(sem\_sim ≥ 0\.7\); Claude Opus 4\.7 zero\-shot matches it on 1% and GPT\-5\.5 zero\-shot matches it on 2%\. The gap is 0\.30\+ absolute on the mean, and roughly 40\-to\-80× on the strict\-paraphrase pass rate\. This is one task and one corpus shape, but it is a direct measurement, not an extrapolation\.
sem\_sim against a recorded rationale is a proxy for behavioural fidelity, not behavioural fidelity itself: it does not certify downstream action correctness, robustness to prompt perturbation, or per\-individual fidelity in the sense Namazova et al\. \([2025](https://arxiv.org/html/2605.29400#bib.bib8)\) demand of Centaur\. We chose this metric because it is the same metric the frontier baselines were scored against, on the same rows, so the comparison is apples\-to\-apples\.
One successful base, one unsuccessful base, one recipe\. The fine\-tune lift held on Qwen3\-VL\-8B\-Instruct and not on Gemma\-4\-26B\-A4B\-IT at the same SFT budget \([Section 5\.2](https://arxiv.org/html/2605.29400#S5.SS2)\)\. The honest read is that the recipe must fit the base, not that fine\-tuning is brittle: the same SFT budget that failed to displace Gemma’s reasoning\-template prior produced a 0\.30\-absolute lift on the Qwen base\. We report both directions and treat them as informative about the recipe\-base interaction, not as a hedge on the headline number\.
### 5\.2 Architecture matters more than parameter count
The Gemma result is itself a finding, not a footnote\. Same corpus, same LoRA rank and schedule, same managed\-SFT pipeline\. Qwen3\-VL\-8B\-Instruct rises from the frontier\-zero\-shot band to sem\_sim 0\.783, a 0\.34\-absolute lift on the same data that left Gemma at 0\.441\. A model with roughly 3\.4× the total parameters and 13× the announced “reasoning” headline did not win this task on this data\. The smaller base did\. Two readings are compatible with the evidence and both predict the same prescription\.
*\(i\)* Capacity is mis\-allocated\. LoRA at rank 16 is a small intervention against a 26B\-parameter MoE reasoning\-tuned base\. The same rank was sufficient to displace the 8B Qwen’s post\-training prior\. The MoE routing layer under low\-rank updates may under\-train experts the persona\-style output needs, and the reasoning\-tuned post\-training prior is harder to overwrite than an instruct\-tuned one\.
*\(ii\)* Data is mis\-shaped for the prior\. 13,796 PiSAR examples are well\-suited to a base post\-trained for direct output\. They may be too small a signal to override a Gemma\-class draft\-first chain\-of\-thought template\. The 26\-row reasoning\-trace bleed in [Section 4\.2](https://arxiv.org/html/2605.29400#S4.SS2) is the model literally producing the markdown structure it was post\-trained to emit; the SFT loss did not displace it\.
The mechanistic distinction matters less than the practical one\. At the SFT recipe most teams actually deploy, managed LoRA on a 10–20k\-record domain corpus, base architecture choice dominated the outcome\. A practitioner’s first move on a Gemma\-class reasoning base is not to declare the task hard but to upgrade the recipe: rank ≥ 32, longer schedule, full\-parameter FT, or a non\-reasoning checkpoint of the same family\. We did not run those; the result we report is that the default\-shaped recipe lands far above frontier zero\-shot on the right base, and inside the frontier\-zero\-shot band on the wrong one\. The right base is the load\-bearing choice\.
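To give a sense of scale for the "small intervention" reading, the sketch below counts the trainable parameters that LoRA adds to a single weight matrix at rank 16 and rank 32\. The width of 4096 is a hypothetical, illustrative value, not a documented dimension of either base model, and which matrices the managed recipe adapts is not reported\.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with A of shape (rank, d_in) and B of shape
    (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical square projection; 4096 is illustrative only.
d = 4096
full = d * d
r16 = lora_params(d, d, 16)
r32 = lora_params(d, d, 32)
fraction_r16 = r16 / full
```

Under this assumption a rank\-16 adapter trains under 1% as many parameters as the matrix it modifies, and doubling the rank doubles that count, which is why a higher rank is among the first recipe upgrades listed above\.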
### 5\.3 Practical implications
For a team building screen\-conditioned persona\-rationale or action\-prediction into a product, the implication is direct\. A roughly\-15k\-record domain corpus of the PiSAR shape, built from real screens, real personas, and observed reasoning under a “2\-of\-3 real” rule, fine\-tuned via managed LoRA SFT on an 8B vision\-language base, produces behavioural rationales that the strongest reasoning\-class frontier model on the market does not produce, on the same test rows, at a fraction of the per\-call inference cost\.
The numbers are concrete\. On PiSAR the fine\-tuned Qwen3\-VL\-8B\-Instruct reaches sem\_sim 0\.783; Opus 4\.7 reaches 0\.459, GPT\-5\.5 reaches 0\.482, gaps of 0\.32 and 0\.30 absolute\. On the strict\-paraphrase threshold \(sem\_sim ≥ 0\.7\) the fine\-tune clears 79% of test rows where Opus clears 1% and GPT\-5\.5 clears 2%; the ratio on the cut closest to “actually right” is 40\-to\-80×\. token\_jaccard corroborates, 0\.417 vs 0\.097 and 0\.108\. Median per\-call latency is 0\.79 s for the fine\-tune against 2\.85 s for Opus and 3\.53 s for GPT\-5\.5, fast enough for real\-time persona simulation while the frontier alternatives sit closer to batch regimes; per\-call inference cost is correspondingly lower by an order of magnitude\.
The practical bet for behavioural\-fidelity work at the screen\-conditioned rationale\-and\-action axis is therefore not “wait for the next frontier model” but “invest in a PiSAR\-shaped corpus and fine\-tune a small vision\-language base\.” At this evaluation, against the latest reasoning\-class frontier models that exist as of writing, the second strategy is decisively ahead\.
A second practical point\. When the same recipe fails on a different base, as Gemma demonstrates, the failure surfaces in the length\_ratio and sem ≥ 0\.7 numbers after a single eval run\. One held\-out\-slice evaluation is enough to spot a Gemma\-style template fight before committing to longer or more expensive follow\-up work\.
To shorten the path for teams attempting this on their own corpora, AprioriLabs publishes the exact configuration that produced the Qwen result, namely the Fireworks SFT recipe in Table [1](https://arxiv.org/html/2605.29400#S3.T1), the chat\-completions message schema in Listing 1, and the evaluation pipeline in Section [3\.3](https://arxiv.org/html/2605.29400#S3.SS3)\. PiSAR and the trained weights remain proprietary; everything needed to reproduce the recipe on an equivalent “2\-of\-3 real” corpus is described in this paper\.
## 6 Limitations
The honest scope\. None of these change the headline result: a 0\.30\-absolute sem\_sim gap with non\-overlapping CIs and a 40\-to\-80× sem ≥ 0\.7 pass\-rate ratio on the same test rows is durable to the kinds of methodological refinements listed below\. Each is nonetheless a real piece of work that would tighten the claim\.
- Frontier panel of two\. Claude Opus 4\.7 and GPT\-5\.5 are the two strongest reasoning\-class frontier models on the market at submission\. Sonnet 4\.6, GPT\-4o, and a future Gemini on the same slice would broaden vendor coverage; we did not observe any signal in earlier shadow\-benchmark exploration suggesting one of them sits materially higher than the two we report\.
- Single fine\-tuning recipe\. Managed LoRA at rank 16, AdamW, cosine schedule\. The Qwen result establishes a floor on what this recipe can achieve, not a ceiling, since higher rank, a longer schedule, or full\-parameter FT plausibly improve it further\. The Gemma result establishes that this recipe is the wrong one for a reasoning\-tuned MoE; a higher\-capacity recipe is the obvious follow\-up\.
- Confounded combined\-vs\-OPeRA\-only comparison\. The combined training set has 3\.4× the example count of OPeRA\-only on top of a differing source mix; the matched\-row\-count ablation is unrun\. The 0\.30 gap to frontier holds for either training mix, so this confound bounds the size of the within\-SFT comparison but not the headline\.
- No held\-out app\-category test\. The app\-store half of the evaluation is in\-distribution within the same app population the training data was drawn from\. A held\-out app\-category split is the right next OOD probe\.
- sem\_sim is a proxy\. Embedding cosine over text\-embedding\-3\-small is the same metric the frontier baselines were scored against, so the comparison is apples\-to\-apples; the metric does not by itself certify downstream action correctness\. We expect alternative scorers \(BERTScore, LLM\-judge against the same gold rationales\) to move the numbers but not to close the gap\.
- Default decoding for frontier baselines\. Provider\-default decoding via native APIs\. Aggressive prompt engineering, few\-shot exemplars, or longer\-context system prompts on the frontier side were not explored; given the size of the gap \(0\.30\+ absolute, 40\-to\-80× on the strict\-paraphrase cut\), such a study would need to close most of the gap, not a fraction of it, to change the headline\.
- PiSAR and the fine\-tuned weights are proprietary\. PiSAR is built and maintained at AprioriLabs and is part of the product moat; the same is true of the fine\-tuned model artefacts\. The methodology, including public sources \(OPeRA traces, public app\-store reviews, Pew ATP demographics\), fusion rule, training recipe, and evaluation pipeline, is described in full so an interested reader can build an equivalent corpus and an equivalent SFT run from the same starting points\.
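The “non\-overlapping CIs” check mentioned at the top of this section can be approximated with a percentile bootstrap over per\-row scores\. The sketch below is one common way to do this and is an assumption on our part: the section does not state how the intervals were computed, the function names are hypothetical, and the two score lists are synthetic stand\-ins rather than PiSAR data\.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-row scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample rows with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True when two closed intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic per-row scores, NOT real results: two clearly separated runs.
fine_tuned = [0.78 + 0.01 * (i % 5) for i in range(50)]
zero_shot = [0.45 + 0.01 * (i % 5) for i in range(50)]

ci_ft = bootstrap_ci(fine_tuned)
ci_zs = bootstrap_ci(zero_shot)
separated = not intervals_overlap(ci_ft, ci_zs)
```

Because all runs are scored on the same 661 rows, a paired bootstrap over per\-row differences would be a natural tighter variant\.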
## 7 Conclusion
Behavioural\-fidelity work on screen\-conditioned rationale and action prediction is more reliably advanced by domain\-relevant fine\-tuning of a small vision\-language base than by prompting a frontier model\. The evidence is direct\. On a 661\-row held\-out slice of PiSAR scored by the same pipeline against the same recorded human rationales, a Qwen3\-VL\-8B\-Instruct fine\-tuned via managed LoRA SFT reaches sem\_sim 0\.783; Claude Opus 4\.7 zero\-shot reaches 0\.459 and GPT\-5\.5 zero\-shot reaches 0\.482\. On the strict\-paraphrase cut the fine\-tune clears 79% of test rows where either frontier model clears 1–2%, roughly 40\-to\-80×\. The same training data on a Gemma\-4\-26B\-A4B\-IT base sits inside the frontier\-zero\-shot band at 0\.441, evidence that the SFT recipe must fit the base architecture; on the right base, the lift is unambiguous\.
The shape of the bet that produced this result is what makes the result reproducible: *\(i\)* a corpus of ∼10–20K screen\-anchored behavioural records under a “real on 2 of 3 slots” fusion rule, *\(ii\)* an 8B vision\-language base with a direct\-output post\-training prior, and *\(iii\)* managed LoRA SFT at default hyperparameters\. PiSAR is the specific instance of this shape that AprioriLabs uses in production; the same shape, built from the public starting points described here, is what a reader can construct\.
PiSAR and the fine\-tuned model artefacts are part of AprioriLabs’ product moat and are not released alongside this paper\. The methodology is\.
## References
- M\. Binz, E\. Akata, M\. Bethge, M\. Brand, E\. Fedorenko, J\. Fränken, M\. Glickman, K\. Haggag, C\. Hoffmann, and E\. Schulz \(2025\) Centaur: a foundation model of human cognition\. Nature\. Note: Preprint released October 2024 as arXiv:2410\.20268\. External Links: 2410\.20268\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p2.1)\.
- Post\-training makes large language models less human\-like\. arXiv preprint\. External Links: 2605\.07632, [Document](https://dx.doi.org/10.48550/arXiv.2605.07632)\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p3.1)\.
- M\. Binz and E\. Schulz \(2023\) Using cognitive psychology to understand GPT\-3\. Proceedings of the National Academy of Sciences \(PNAS\) 120\(6\)\. External Links: [Document](https://dx.doi.org/10.1073/pnas.2218523120)\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p2.1)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2Web: towards a generalist agent for the web\. In Advances in Neural Information Processing Systems\. External Links: 2306\.06070, [Document](https://dx.doi.org/10.48550/arXiv.2306.06070)\. Cited by: [§2\.3](https://arxiv.org/html/2605.29400#S2.SS3.p2.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. In Advances in Neural Information Processing Systems\. External Links: 2305\.14314, [Document](https://dx.doi.org/10.48550/arXiv.2305.14314)\. Cited by: [§2\.3](https://arxiv.org/html/2605.29400#S2.SS3.p1.1)\.
- Fireworks AI \(2024\) Fireworks AI: managed inference and supervised fine\-tuning platform\. Note: [https://fireworks\.ai/](https://fireworks.ai/)\. Cited by: [Appendix C](https://arxiv.org/html/2605.29400#A3.p1.1)\.
- W\. Hong, W\. Wang, Q\. Lv, J\. Xu, W\. Yu, J\. Ji, Y\. Wang, Z\. Wang, Y\. Zhang, J\. Li, B\. Xu, Y\. Dong, M\. Ding, and J\. Tang \(2024\) CogAgent: a visual language model for GUI agents\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)\. External Links: 2312\.08914, [Document](https://dx.doi.org/10.48550/arXiv.2312.08914)\. Cited by: [§2\.3](https://arxiv.org/html/2605.29400#S2.SS3.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. External Links: 2106\.09685, [Document](https://dx.doi.org/10.48550/arXiv.2106.09685)\. Cited by: [§2\.3](https://arxiv.org/html/2605.29400#S2.SS3.p1.1)\.
- Y\. Lu, Y\. Huang, Z\. Han, Y\. Yao, et al\. \(2025a\) Prompting is not all you need\! Evaluating LLM agent simulation methodologies with real\-world online customer behavior data\. arXiv preprint\. External Links: 2503\.20749, [Document](https://dx.doi.org/10.48550/arXiv.2503.20749)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p1.1)\.
- Y\. Lu, Y\. Yao, X\. Gu, Y\. Huang, et al\. \(2025b\) UXAgent: an LLM\-Agent\-Based usability testing framework for web design\. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems \(CHI EA ’25\)\. External Links: 2502\.12561, [Document](https://dx.doi.org/10.1145/3706599.3719729)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p1.1)\.
- A\. Namazova, L\. Brondetta, Y\. Strittmatter, M\. R\. Nassar, and S\. Musslick \(2025\) Not yet AlphaFold for the mind: evaluating Centaur as a synthetic participant\. arXiv preprint\. External Links: 2508\.07887, [Document](https://dx.doi.org/10.48550/arXiv.2508.07887)\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p2.1), [§5\.1](https://arxiv.org/html/2605.29400#S5.SS1.p2.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST ’23\)\. External Links: 2304\.03442, [Document](https://dx.doi.org/10.1145/3586183.3606763)\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p1.1)\.
- J\. S\. Park, C\. Q\. Zou, A\. Shaw, B\. M\. Hill, C\. J\. Cai, M\. R\. Morris, R\. Willer, P\. Liang, and M\. S\. Bernstein \(2024\) Generative agent simulations of 1,000 people\. arXiv preprint\. External Links: 2411\.10109, [Document](https://dx.doi.org/10.48550/arXiv.2411.10109)\. Cited by: [§2\.1](https://arxiv.org/html/2605.29400#S2.SS1.p1.1)\.
- X\. Wang, Y\. Lu, Y\. Li, A\. Amini, et al\. \(2025a\) OPeRA: a dataset of observation, persona, rationale, and action for evaluating LLMs on human online shopping behavior simulation\. arXiv preprint\. External Links: 2506\.05606, [Document](https://dx.doi.org/10.48550/arXiv.2506.05606)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.29400#S3.SS1.p1.1)\.
- X\. Wang, Y\. Lu, Y\. Zhang, Y\. Huang, and J\. Wang \(2025b\) Customer\-R1: personalized simulation of human behaviors via RL\-based LLM agent in online shopping\. arXiv preprint\. External Links: 2510\.07230, [Document](https://dx.doi.org/10.48550/arXiv.2510.07230)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p2.1)\.
- Y\. Zhang et al\. \(2025\) See, think, act: online shopper behavior simulation with VLM agents\. arXiv preprint\. External Links: 2510\.19245, [Document](https://dx.doi.org/10.48550/arXiv.2510.19245)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p2.1)\.
- Y\. Zhang, X\. Wang, R\. Gesi, et al\. \(2025\) Shop\-R1: rewarding LLMs to simulate human behavior in online shopping via reinforcement learning\. arXiv preprint\. External Links: 2507\.17842, [Document](https://dx.doi.org/10.48550/arXiv.2507.17842)\. Cited by: [§2\.2](https://arxiv.org/html/2605.29400#S2.SS2.p2.1)\.
## Appendix A Hyperparameters per run
The per\-run hyperparameter table appears in the main text as [Table 1](https://arxiv.org/html/2605.29400#S3.T1)\. We did not override Fireworks\-managed\-SFT defaults beyond what the UI exposes\.
Eval\-time inference settings: T=0\.0, top\_p=1\.0, top\_k=40, max\_tokens=200 for the two Qwen runs and 1,500 for the Gemma run \(raised after we found the chain\-of\-thought template was exhausting the budget mid\-thinking; see [Section 4\.2](https://arxiv.org/html/2605.29400#S4.SS2)\)\. Concurrency 6 against Fireworks’ OpenAI\-compatible chat\-completions endpoint\.
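For readers wiring up a similar evaluation, the settings above translate into a request body and a concurrency cap roughly as follows\. This is a sketch under stated assumptions: the model identifiers and message content are placeholders, fake\_send is a stand\-in for the real HTTP call so the example runs offline, and top\_k is passed as a top\-level field on the assumption that the endpoint accepts it as an extension to the OpenAI schema\.

```python
import asyncio


def build_request(model: str, messages: list[dict], max_tokens: int = 200) -> dict:
    """Chat-completions request body using the eval-time settings listed above."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.0,
        "top_p": 1.0,
        "top_k": 40,  # assumed to be accepted as a provider extension
        "max_tokens": max_tokens,
    }


async def run_eval(requests, send, concurrency: int = 6):
    """Send every request with at most `concurrency` in flight; keeps input order."""
    gate = asyncio.Semaphore(concurrency)

    async def one(body):
        async with gate:
            return await send(body)

    return await asyncio.gather(*(one(r) for r in requests))


async def fake_send(body: dict) -> str:
    """Hypothetical stand-in for the HTTP call to the endpoint."""
    await asyncio.sleep(0)
    return f"ok:{body['max_tokens']}"


# Placeholder model ids and messages, for illustration only.
msgs = [{"role": "user", "content": "placeholder prompt"}]
batch = [
    build_request("qwen-sft-placeholder", msgs),                    # Qwen budget: 200
    build_request("gemma-sft-placeholder", msgs, max_tokens=1500),  # Gemma budget: 1,500
]
results = asyncio.run(run_eval(batch, fake_send))
```

Swapping fake\_send for a real client call is the only change needed to run this against a live endpoint; the semaphore keeps the concurrency at 6 regardless of batch size\.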
## Appendix B Worked examples
Three records pulled from paper/data/per\_row\_scores\.jsonl: two where the combined\-trained Qwen produces a clearly better rationale than either alternative, and one where one or both alternatives fail in distinct ways\.
#### B\.1\. OPeRA, search\-keyword change \(row 122\)\.
> Gold: change searching keywords to find wedge pillow for sleep apnea\.
>
> b5my94dm \(sem 0\.973\): change search keywords to find a wedge pillow for sleep apnea\.
>
> ycfo6bpw \(sem 0\.509\): I want to choose an auto completer for sleep apnea\.
>
> gz7vqm46 \(sem 0\.159\): That’s exactly what I need, saves me the typing\.
The combined Qwen \(b5my94dm\) produces a near\-paraphrase\. The OPeRA\-only Qwen \(ycfo6bpw\) partially captures the medical intent but invents an “auto completer” detail that does not exist on the screen\. Gemma \(gz7vqm46\) generates a generic positive reaction unrelated to the persona’s task\.
#### B\.2\. OPeRA, window\-shopping \(row 99\)\. Gemma underperforms both Qwen runs\.
> Gold: look for more details and reviews to check if this is a nice deal and want to add the product\.
>
> b5my94dm \(sem 0\.761\): want to add a product and look for more details and reviews\.
>
> ycfo6bpw \(sem 0\.761\): want to add a product and look for more details and reviews\.
>
> gz7vqm46 \(sem 0\.423\): There it is, let’s check the details\.
The two Qwen runs produce byte\-identical predictions; this is plausibly an interpolation from the in\-train phrase bank\. The Gemma prediction is on\-topic but terse to the point of losing the comparison\-shopping angle\.
#### B\.3\. OPeRA, “check cart” \(row 121\)\. Gemma length pathology\.
> Gold \(10 characters\): check cart\.
>
> b5my94dm \(sem 0\.776, length\_ratio 3\.00\): go to cart to check cart\.
>
> ycfo6bpw \(sem 0\.769, length\_ratio 3\.00\): go to cart to check cart
>
> gz7vqm46 \(sem 0\.320, length\_ratio 457\.00\): \* Persona: 18\-24 female, PhD student \(Informatics/HCI\), Irvine, low income \($25k\-$50k\), Amazon Prime member, shops 1\-2x/week, ∼$50\-150/mo spend\. \* Intent: Looking at a Robot Vacuum and Mop Combo… \(truncated at 350 of 5,425 characters\)\.

The two Qwen runs over\-explain a one\-action moment but stay terse and on\-topic\. The Gemma model produces a 5,425\-character internal\-reasoning trace that lists persona attributes, drafts candidate rationales, and never produces a final answer before exhausting the 1,500\-token budget\. This is the worst length\_ratio outlier in the dataset and is the canonical reasoning\-trace\-bleed case from [Section 4\.2](https://arxiv.org/html/2605.29400#S4.SS2)\.
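The length\_ratio values in this example are consistent with a word\-count ratio of prediction to gold: six words against two gives the 3\.00 reported for both Qwen runs\. The sketch below assumes that definition, which is inferred from this example rather than stated in the text, and uses hypothetical helper names and a toy row list\.

```python
def length_ratio(pred: str, gold: str) -> float:
    """Whitespace word count of the prediction divided by that of the gold (assumed definition)."""
    gold_words = len(gold.split())
    if gold_words == 0:
        raise ValueError("gold rationale is empty")
    return len(pred.split()) / gold_words


def worst_rows(rows, k=1):
    """Return the k rows with the largest length_ratio, longest first."""
    return sorted(
        rows,
        key=lambda r: length_ratio(r["pred"], r["gold"]),
        reverse=True,
    )[:k]


# The Qwen prediction and gold rationale quoted in B.3.
qwen_ratio = length_ratio("go to cart to check cart.", "check cart.")

# Toy rows for illustration; the second mimics a rambling trace.
rows = [
    {"id": "terse", "pred": "go to cart to check cart.", "gold": "check cart."},
    {"id": "rambling", "pred": "persona intent draft " * 20, "gold": "check cart."},
]
top = worst_rows(rows, k=1)
```

Under the same assumed definition, a ratio of 457\.00 against a two\-word gold would correspond to a prediction of roughly 914 words\.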
## Appendix C Compute and infrastructure
All three SFT runs and the three SFT evaluations were performed on Fireworks AI \[Fireworks AI, [2024](https://arxiv.org/html/2605.29400#bib.bib17)\], a managed\-inference platform that exposes a chat\-completions API compatible with the OpenAI schema and a managed\-SFT UI for LoRA fine\-tunes of supported base models\. Training jobs were configured through the UI with the values in [Table 1](https://arxiv.org/html/2605.29400#S3.T1)\. Eval inference used the same Fireworks endpoint at T=0 with concurrency 6\.
The frontier zero\-shot baseline runs called the providers’ native APIs directly: Anthropic Messages API for Opus 4\.7 and OpenAI Chat Completions for GPT\-5\.5\. Decoding follows each provider’s defaults \(Opus 4\.7 deprecates temperature; the GPT\-5 reasoning family requires max\_completion\_tokens with T=1\)\. Semantic\-similarity scoring used OpenAI text\-embedding\-3\-small on every \(predicted, gold\) pair across all runs, fine\-tuned and frontier alike\.
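The sem\_sim scoring step reduces to a cosine between two embedding vectors, one per side of each \(predicted, gold\) pair\. A minimal sketch follows, in which the embed callable is a hypothetical stand\-in for the embeddings API call and toy\_embed is a letter\-count embedding used only so the example runs offline; neither is the pipeline's actual code\.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_u * norm_v)


def sem_sim_scores(pairs, embed):
    """Score each (predicted, gold) pair with the same embedding function."""
    return [cosine(embed(pred), embed(gold)) for pred, gold in pairs]


def toy_embed(text: str):
    """Toy 26-dimensional letter-count vector; a placeholder, not a real embedding."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


scores = sem_sim_scores(
    [("check cart", "check cart"), ("go to cart to check cart", "check cart")],
    toy_embed,
)
```

Passing the same embed function for every run, fine\-tuned and frontier alike, is what keeps the comparison on a single scale\.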