The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
Summary
This paper identifies the 'Inattentional Gap' where task-conditioned AI models suppress reporting of safety-critical signals they can otherwise detect, analogous to human inattentional blindness, challenging the assumption that benchmark performance ensures real-world safety.
# Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
Source: [https://arxiv.org/html/2606.26529](https://arxiv.org/html/2606.26529)
Kwan Soo Shin, PolymathMinds Lab, Seoul, Republic of Korea\. ORCID 0009\-0001\-5799\-7556\. Correspondence: sshin@pmminds\.ai
\(June 2026\)
## Summary
AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified\. We show that conditioning a language or vision model on a narrow task suppresses its reporting of co\-present, safety\-critical signals it can otherwise report, a machine analogue of human inattentional blindness arising from a different mechanism\. Across radiology and driving text scenarios and chest\-radiograph vision tasks, suppression appeared in every model tested, did not diminish with scale, persisted in a reasoning model, and varied more by model family than by size, while the same models reported these signals at substantially higher rates when unconstrained\. We name this dissociation the Inattentional Gap and argue that it decouples measured benchmark safety from real\-world safety: a system can score near\-perfectly on the hazards an evaluation specifies while remaining blind to those that cause harm\.
## The Bigger Picture
AI systems are being deployed as perceptual front\-ends in domains where the cost of a missed signal is measured in lives: radiology, autonomous driving, security screening, and safety\-critical code review\. The prevailing assumption is that a more capable model, scoring higher on targeted benchmarks, is a safer model\. Our results complicate that assumption\. We find that conditioning a model on a specific task, the normal way these systems are deployed, suppresses its reporting of unrequested but safety\-critical information that it can otherwise report\. Because safety evaluations measure performance on specified targets while accidents often arise from unspecified ones, benchmark gains need not translate into deployment safety\. The phenomenon is a behavioral analogue of human inattentional blindness, including the radiology “gorilla” effect, but with a different mechanism: our dose analysis traces the trigger to output scope rather than human\-like perceptual load\. A within\-item reportability check, the same model on the same input reporting the signal when unconstrained, separates this effect from prior reports of models merely missing content\. It reframes the central question from whether AI can identify a signal to whether task\-conditioning prevents it from reporting signals it can otherwise report\. In dual\-process terms, the conditioned model resembles System\-1\-style task capture, with no reliable architecture\-general oversight behavior in the systems sampled here; monitoring appeared only as a model\-family\-specific safety\-reporting disposition or as an explicit second process\. Because deployed systems must scope their tasks, the Inattentional Gap is not merely a prompt\-level bug but a measurable risk of task\-scoped deployment and a target for redesigned safety evaluation\.
## Introduction
Accidents happen where they are not expected\. The hazards an operator anticipates are, by construction, the hazards a system is built and tested to detect; the residual risk lives in the events outside that specification\. This asymmetry is the organizing intuition of this paper, and it has a precise correlate in human cognition\. In the “invisible gorilla” paradigm, observers counting basketball passes fail to notice a person in a gorilla suit walking through the scene,\[1\] and the effect persists in domain experts: 83% of radiologists searching a chest CT for lung nodules failed to report a matchbook\-sized gorilla composited into the image, many having fixated directly on it\.\[2\] Inattentional blindness is not simply a failure of acuity\. It is a failure of report under an attentional set: the observer looks, but the task has narrowed what counts as relevant\.\[3\]
Large language and vision\-language models are now routinely conditioned on narrow tasks and deployed as perceptual or interpretive front\-ends in safety\-critical settings\. A radiology assistant is asked to assess a nodule; a driving system is asked to track the lead vehicle; a screening model is asked to flag a specified threat\. The question we pose is whether task\-conditioning produces, in these systems, a functional analogue of inattentional blindness: suppression of safety\-critical information that the model can otherwise report under an unconstrained instruction\.
This question sits at the intersection of two literatures\. The first, machine psychology, treats models as participants in cognitive experiments and has documented that language models reproduce, and sometimes shed, human reasoning biases\.\[4,5\] The second concerns the limits of machine perception and visual\-language reporting, including evidence that vision\-language models miss visually salient elements\[6\] and that behavioral parity with humans need not imply mechanistic equivalence\.\[7,8\] What has not been established is whether an attentional\-set manipulation, the defining cause of inattentional blindness in humans, suppresses safety\-critical reporting in AI systems, and whether this effect holds across modalities, domains, and model scales\.
We make three contributions\. First, we formalize the Inattentional Gap: a divergence between what a model reports under a narrow task and what the same model reports when unconstrained\. Second, we demonstrate the gap across two modalities, in language and vision, and across two safety\-critical domains, using a controlled attentional\-set manipulation and an adjudication procedure that separates task\-induced omission from mere capability failure\. Third, we show that the gap is not eliminated by model scale, persists in a reasoning model, and varies with modality, task load, signal salience, and model family; we further localize its proximal trigger to output scope and probe its behavioral status, finding System\-1\-style task capture with no reliable architecture\-general oversight behavior in the systems sampled here\. Together, these results map the conditions under which a task\-conditioned AI system omits safety\-critical signals it can otherwise report\.
## The Inattentional Gap: a behavioral construct
Let a deployed model be conditioned by a task instruction that designates a target set D\_specified, while the same scene also contains co\-present, unrequested, safety\-critical information D\_unspecified\. Conventional safety benchmarks measure performance on D\_specified; deployment harm often turns on whether D\_unspecified is surfaced\. The Inattentional Gap denotes the regime in which a model can perform well on the specified target while suppressing report of the unspecified safety\-critical signal\. Formally, for model $m$ and item $i$, let $R^{\text{open}}_{m,i}$ and $R^{\text{task}}_{m,i}$ indicate whether the safety\-critical signal is reported under an unconstrained instruction and a task\-conditioned instruction, respectively\. The item\-level gap is $IG_{m,i} = R^{\text{open}}_{m,i} - R^{\text{task}}_{m,i}$, and the model\-level gap is estimated over items for which $R^{\text{open}}_{m,i} = 1$, so that the effect is conditioned on open\-condition reportability rather than on a presumed perceptual state \(Figure 1\)\.
The construct is behavioral and falsifiable\. Its signature is a within\-item dissociation: the same model, on the same input, reports a critical signal when unconstrained but omits it when the task instruction narrows the reporting frame\. This control distinguishes task\-induced omission from a capability deficit\. An item whose open condition does not surface the signal cannot support a claim of suppression and is therefore excluded from the gap estimate\.
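The gap estimate described in this section can be sketched in a few lines of Python\. This is a minimal illustration, not the authors’ analysis code: the `ItemResult` record, the function names, and the toy data are hypothetical, and taking the model\-level gap as the mean item\-level gap over open\-reportable items is an assumption about how the estimate is aggregated\.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ItemResult:
    """Adjudicated outcome for one model on one item (hypothetical record)."""
    item_id: str
    reported_open: bool  # signal reported under the unconstrained instruction
    reported_task: bool  # signal reported under the task-conditioned instruction


def item_gap(result: ItemResult) -> int:
    """Item-level gap: open-condition report minus task-condition report."""
    return int(result.reported_open) - int(result.reported_task)


def model_gap(results: Iterable[ItemResult]) -> Optional[float]:
    """Mean item-level gap over items the model reported when unconstrained.

    Items with no open-condition report are excluded, because they cannot
    support a claim of suppression. Returns None if no item is eligible.
    """
    eligible = [r for r in results if r.reported_open]
    if not eligible:
        return None
    return sum(item_gap(r) for r in eligible) / len(eligible)


# Toy data, invented for illustration only.
toy_results = [
    ItemResult("a", reported_open=True, reported_task=False),
    ItemResult("b", reported_open=True, reported_task=True),
    ItemResult("c", reported_open=False, reported_task=False),  # excluded
    ItemResult("d", reported_open=True, reported_task=False),
]
toy_gap = model_gap(toy_results)
print(toy_gap)
```

In the toy data, item `c` drops out of the denominator because its open condition never surfaced the signal, mirroring the exclusion rule stated in the text\.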
A natural objection is that a model omitting a finding under a narrowing instruction is merely following instructions, not exhibiting an analogue of inattentional blindness\. But in the canonical human paradigm, the omission is itself induced by a task instruction: observers miss the gorilla because they were told to count the passes of the white\-shirted team,\[1\] and what is noticed is shaped by the attentional set the task imposes, “what you see is what you set\.”\[9\] The clinical form is similar: eleven of twelve ophthalmologists missed signs of iron toxicity when the posed question asked them to rule out melanoma,\[10\] and neuroradiologists on a routine read overlooked lesions that the same images yielded to readers instructed to seek them\[11\] \(for a review of inattentional blindness in medicine, see Hults et al\.\[12\]\)\. Instruction\-following is therefore not a rival explanation for the Inattentional Gap; it is the channel through which an attentional set is imposed, in humans by instruction or expectation and in models by the prompt\. The open\-condition recovery provides the machine analogue of the full\-attention control: the signal is first shown to be reportable, then shown to be omitted once the task set is imposed\.
The construct connects to a broader hypothesis about closed\-system reasoning failure: a model solves an open\-world problem by reducing it to the closed world its prompt defines, and fails outside that closure\. The Inattentional Gap is the report\-level instance of that failure\.
This closure is not merely a prompt\-engineering defect but a measurable risk of task\-scoped deployment\. To guarantee that no co\-present critical signal is ever suppressed, a system would have to analyze every input exhaustively and without scoping, the open condition run universally, a single model reporting everything that might matter on every case\. No deployed system can do this without cost\. Real deployment is therefore necessarily task\-scoped: a chest CT is routed to a nodule detector, a product is cleared for one indication, or a clinician asks only whether a nodule is present\. The same scoping that makes AI tractable, certifiable, and affordable is also the condition under which unrequested safety\-critical signals can be omitted\.
## Related work: two lineages and an empty cell
The Inattentional Gap sits at the convergence of two research lineages that, to our knowledge, have not been joined in prior machine studies\. Its novelty is clearest when read against both\. We surveyed 116 prior works across eleven axes \(full annotated bibliography, Supplementary File S1\)\. The main text cites the lineage anchors that bear directly on the argument, while Figure 6 provides a structured positioning map; the complete annotated bibliography, axis coding, and search rationale are provided in S1\.
A human lineage runs from laboratory to clinic\. Selective\-looking studies\[13\] led to the formalization of inattentional blindness as the principle that without attention there is no report,\[3\] made vivid by the invisible gorilla,\[1\] and Most et al\.\[9\] identified the operative variable, the attentional set, in the phrase “what you see is what you set\.” The clinical instantiation is satisfaction of search: detecting one abnormality lowers detection of a co\-present second,\[14\] and eye tracking shows that the missed second finding is often fixated, looked at but not reported,\[15\] a pattern now unified across radiology and the laboratory as subsequent\-search\-miss error\.\[16\] The effect survives expertise: 83% of radiologists missed a gorilla composited into a chest scan,\[2\] the miss extends to clinically relevant pathology that experience does not protect against,\[17\] and greater expertise can deepen the blindness by sharpening the attentional set\.\[18\] Each of these studies established set\-dependent omission in humans; none asked whether the same manipulation suppresses reporting in machines\.
A machine lineage treats models as cognitive subjects\. From the question of whether machines think like people,\[19\] the machine\-psychology programme\[20\] applied cognitive\-experimental tools to language models,\[4\] documenting human\-like content effects,\[21\] emergent then alignment\-suppressed biases,\[5\] and a dissociation of formal from functional competence\.\[22\] A parallel strand argues that what a model “knows” is structured by its task,\[23\] the static precursor to the live, inference\-time gating we measure\. This lineage probes reasoning and competence; it does not manipulate an attentional set over an otherwise reportable, safety\-critical signal\. That these models now reach human\-level performance on tasks from analogical reasoning to theory of mind\[24\] sharpens the point: the conditioned omission we report is task\-induced omission rather than a capability failure\. The two lineages meet at our question\.
The claim has direct interlocutors in the Cell Press family\. Block’s\[25\] thesis that perceptual consciousness can exceed cognitive access provides the theoretical analogue; here we operationalize only a report\-level dissociation in machines, where the prompt narrows the model’s usable report bandwidth\.\[26\] van Amsterdam et al\.,\[27\] in Patterns, showed that accurate models can yield harmful prophecies; we extend their accuracy\-harm decoupling from prediction loops to report under task\-conditioning\. Park et al\.,\[28\] also in Patterns, catalogue strategic deception, models that misreport to reach a goal; the Inattentional Gap is its non\-strategic counterpart, omission with no apparent goal to hide but a narrowed reporting frame\. Mahowald et al\.\[22\] dissociated formal from functional competence, and we add a third dissociation: otherwise reportable signal versus task\-conditioned report\. Sanchez et al\.\[29\] built a radiologist\-AI disagreement pipeline that does not directly target silent omission, the failure mode isolated here\.
The nearest machine neighbors each fall one axis short\. Vision\-language\-limit studies, including saliency benchmarks,\[6\] demonstrations that capable models miss obvious features,\[30\] and findings that models fail on representationally confusable inputs,\[31\] ask the model directly and measure capability, whereas we measure suppression of what the model is not asked to report\. Object\-hallucination benchmarks\[32\] measure the opposite error, reporting what is absent, while we measure failing to report what is present; the two are complementary axes of safety\-relevant reporting error\. AbsenceBench\[33\] tests detection of deliberately removed tokens with a direct query, not set\-induced suppression of present, otherwise reportable content under a competing task\. Each neighbor stops one axis short\.
The deployment\-safety literature our thesis ultimately addresses concerns benchmark scores, distribution shift, and specification gaming\. Task\-conditioning is, in this frame, a prompt\-induced shortcut: the model satisfies the stated task and shortcuts past everything else,\[34\] an inference\-time instance of two canonical safety problems, negative side effects and evaluation\-to\-deployment distributional shift,\[35\] and the prompt\-level analogue of agents that game a literal specification while violating its intent\.\[36\] That aggregate scores hide within\-case failures is documented in medical imaging,\[37\] and that evaluation routinely diverges from deployment reality is visible in the regulatory record of cleared medical AI\.\[38\] Our contribution locates this decoupling at the within\-item reporting level: benchmark accuracy on D\_specified, harm on D\_unspecified\.
The mechanistic reading is constrained by work showing that behavioral parity need not imply mechanistic equivalence\.\[7,8\] Transformers do not instantiate the same human capacity\-limited attentional bottleneck,\[39\] so convergence with the human gorilla effect is informative precisely because the mechanism likely differs\. We locate the machine\-side mechanism in report\-conditioning rather than in a human\-like perceptual bottleneck\. In dual\-process terms,\[40\] a narrow prompt induces System\-1\-style task capture, while the bottom\-up salience that a classical model\[41\] would predict to capture attention can be overridden by the top\-down task set\.
The consequence is anticipated by the human\-factors lineage on the ironies of automation:\[42\] the overseer most likely to trust an accurate system is least able to recover what it silently omits\.\[43\] Automation complacency is itself attentional,\[44\] with humans allocating less scrutiny to automated channels, so a model’s task\-conditioned reporting gate and an operator’s complacency can compound; trust decouples from reliability,\[45\] and in medicine automation bias predicts that a clinician may not independently recover an omitted finding\.\[46\] Two attention gates in series, the model’s and the overseer’s, define the deployment risk surface\.
Mapping the 116 works on two axes, engagement with inattentional\-blindness theory and use of a task\-conditioning manipulation with a within\-item reportability control, the high\-engagement quadrant is occupied only by human studies, while machine clusters sit at low\-to\-mid values on one or both axes \(Figure 6\)\. The top\-right cell, in machines, is empty, not for lack of adjacent effort but because each neighbor stops one axis short\. This paper fills that cell by transplanting the human attentional\-set paradigm onto machines with the reportability control intact: open\-condition recovery distinguishes the effect from absence or capability benchmarks, and the explicit attentional\-set framing distinguishes it from vision\-language capability\-limit studies\.
## Results
Table 1\. Study design across the six experiments\. Every comparison is made within item; report of the co\-present safety\-critical signal is the dependent measure\. In the open/closed studies, the open condition serves as the within\-item reportability reference; in the dual\-process $\beta$ probe, the comparison is task\-only versus task\-plus\-critic\.

| Experiment | Modality | Stimuli | Conditions | Domain | Model configuration |
| --- | --- | --- | --- | --- | --- |
| Study 1 | Language | 100 \(64 radiology, 36 driving\) | focused, strict, open | radiology, driving | 4 models |
| Study 2 | Vision | 48 \(12 radiographs $\times$ 2 objects $\times$ 2 sides\) | counting, open | radiograph imaging | 4 models |
| Study 3 \(output scope\) | Language | 24 | 0, 1, 2, 3 sub\-questions | radiology | 4 models |
| Cross\-vendor | Language | 24 | focused, strict, open | radiology | 3 flagship models |
| Dual\-process $\alpha$ | Language | 24 | standard, adversarial\-suppression, open | radiology | Opus 4\.8 |
| Dual\-process $\beta$ | Language | 24 | task\-only, task \+ critic | radiology | task model \+ critic |

### Study 1: suppression of reporting in language models
We constructed 100 textual scenarios \(64 radiology, 36 autonomous\-driving\) by procedural composition, each pairing a designated target \(a pulmonary nodule; a lead vehicle\) with a co\-present, unrequested, safety\-critical signal \(e\.g\., tension pneumothorax, aortic widening, free subdiaphragmatic air, lytic bone lesions; a child entering the road, fallen cargo, a cyclist in the blind spot\)\. Procedural composition reduces the data\-contamination concern that limits canonical\-vignette studies, since the items are generated from controlled templates rather than copied from known cases\. Each scenario was presented to four models \(claude\-haiku\-4\-5, gpt\-4o\-mini, gpt\-4o, claude\-sonnet\-4\-6\) under three instructions: a high\-load focused instruction \(“provide a detailed work\-up of the nodule”; “report the lead vehicle’s status”\), a strict exclusive instruction \(“report only the nodule / lead vehicle; do not mention anything else”\), and an open instruction \(“report all clinically significant findings / hazards”\)\. Whether each response reported the critical signal was adjudicated by two independent language\-model judges rather than by keyword matching; spot\-checks validated judge decisions against response text\.
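Agreement between two judges on a binary reported/not\-reported label is commonly summarized with Cohen’s kappa, which corrects raw agreement for the agreement expected by chance\. The sketch below computes it from scratch; the function name and the judge labels are invented for illustration and are not the study’s code or data\.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must label the same non-empty set of items")
    n = len(judge_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (1 = signal reported, 0 = not reported), invented for illustration.
labels_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
labels_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
toy_kappa = cohen_kappa(labels_a, labels_b)
print(round(toy_kappa, 3))
```

Here the judges disagree on one of ten items, so raw agreement is 0\.90 while kappa is lower, because two judges who both label most responses as “reported” would agree often by chance alone\.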
Adjudication showed near\-perfect agreement \(Cohen’s $\kappa$ = 0\.895, n = 1,200\); we report the two\-judge consensus\. Under the strict instruction, suppression was universal: complete in radiology and near\-complete in driving\. Every model reported the critical signal in 0\.00 of radiology strict\-condition items versus approximately 1\.00 of open\-condition items \(McNemar p = 1\.1e\-19, N = 64\); in driving, strict\-condition report rates were 0\.08 to 0\.33 versus 1\.00 in the open condition \(p < 1\.3e\-7\)\. The instruction to attend exclusively to the designated target abolished or sharply reduced reporting of co\-present life\-threatening findings that the same models reported when unconstrained\.
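The paired open\-versus\-task comparisons can be illustrated with an exact McNemar test, which depends only on the discordant pairs: $b$ items reported under the open instruction but not under the task instruction, and $c$ items showing the reverse\. The sketch below assumes the two\-sided exact binomial form of the test; the function name is hypothetical, and treating all 64 radiology pairs as discordant in the same direction \(b = 64, c = 0\) is our assumption, chosen because it is consistent with the reported 0\.00 versus approximately 1\.00 rates\.

```python
from math import comb


def exact_mcnemar(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    Under the null hypothesis each discordant pair is equally likely to fall
    in either direction, so the smaller count follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Assumed counts: 64 pairs, all reported when open and omitted under the strict task.
p_radiology_strict = exact_mcnemar(64, 0)
print(f"{p_radiology_strict:.1e}")  # 1.1e-19
```

Under that assumption the computed value agrees with the reported radiology p\-value\. With no reverse\-direction pairs, the p\-value reduces to $2 \cdot 0.5^{b}$, so its size is governed entirely by how many paired items remain in the analysis\.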
Under the high\-load focused instruction, suppression was substantial in radiology and present but weaker in driving\. In radiology, focused\-condition report rates were 0\.09 \(gpt\-4o\-mini\), 0\.08 \(gpt\-4o\), 0\.48 \(sonnet\), and 0\.53 \(haiku\), against ~1\.00 in the open condition \(gaps 0\.47 to 0\.92; all McNemar p < 2e\-9\)\. In driving, focused\-condition rates were higher \(0\.58 to 0\.97\), indicating that the more salient driving hazards resisted suppression more strongly\. The contrast identifies a boundary condition: focused\-task suppression is larger in higher\-load, lower\-salience settings \(Figure 2\)\.
Two further patterns emerged\. The effect did not diminish with scale: gpt\-4o, a frontier model, suppressed radiology findings under the focused instruction as strongly as the small gpt\-4o\-mini \(0\.08 versus 0\.09\)\. The language\-model judges also produced stricter estimates than the preliminary lexical parser, indicating that keyword\-based measurements understated the focused\-condition gap\.
### Study 2: suppression of visual report in vision\-language models
To test whether the gap extends from language to vision, we turned to the original inattentional\-blindness paradigm, in which a demanding primary task competes with report of an unexpected object\. We obtained 12 real, publicly available chest radiographs \(patient identifiers redacted\) and composited a clearly visible but non\-anatomical object onto the lung field of each, in the spirit of Drew et al\.:\[2\] a subtle translucent disc or a yellow star, varied by side, yielding 48 stimuli\. Each image was presented to the four vision\-capable models under two instructions: a closed counting task \(“carefully count the pairs of ribs”\), which consumes attention without invoking diagnosis, and an open instruction \(“describe everything you see, including anything unexpected or artificial”\)\. A system message framing the images as synthetic research stimuli eliminated the medical\-image refusals that confounded earlier attempts\. Object report was again adjudicated by two independent judges, and refusals were excluded\.
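The factorial structure of the stimulus set can be made explicit by enumerating it\. In this sketch the radiograph identifiers and the left/right side labels are hypothetical placeholders; only the counts \(12 radiographs, 2 objects, 2 sides, 2 instructions\) come from the text\.

```python
from itertools import product

# Hypothetical identifiers; the source specifies only that there are 12 radiographs.
radiographs = [f"radiograph_{i:02d}" for i in range(1, 13)]
objects = ["disc", "star"]
sides = ["left", "right"]          # assumed labels for "varied by side"
instructions = ["counting", "open"]

# Each stimulus is one radiograph with one composited object on one side.
stimuli = [
    {"radiograph": r, "object": o, "side": s}
    for r, o, s in product(radiographs, objects, sides)
]

# Within-item design: every stimulus is shown under both instructions,
# so each counting-task response has an open-condition partner.
trials_per_model = [
    {**stimulus, "instruction": instruction}
    for stimulus, instruction in product(stimuli, instructions)
]

print(len(stimuli), len(trials_per_model))
```

Each model therefore sees every stimulus twice, once per instruction, which is what allows the open\-condition response to serve as the reportability reference for the counting\-task response on the same image\. With four models this gives an upper bound of 384 responses before any exclusions\.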
Task\-conditioning suppressed report of the unexpected object in every model \(two\-judge consensus, Cohen’s $\kappa$ = 0\.918, n = 375\)\. Under the counting task, report rates were far below the open condition\. For the star, closed\-condition report rates were 0\.00 \(gpt\-4o\), 0\.00 \(gpt\-4o\-mini\), 0\.50 \(haiku\), and 0\.54 \(sonnet\), against open\-condition rates ranging from 0\.79 to 0\.96\. For the disc, closed\-condition report rates ranged from 0\.00 in both OpenAI models to 0\.70 in sonnet, against open\-condition rates ranging from 0\.67 to 0\.96\. The strongest effect was observed for gpt\-4o on the star condition: the model omitted the star in every counting\-task item while reporting it in nearly all open\-condition items, a machine analogue of the invisible\-gorilla result \(exact McNemar p < 1e\-6; Figure 3\)\. Open\-condition report was below ceiling for the most subtle objects, reflecting genuine low\-salience difficulty; the within\-item gap controls for this because it conditions on each model’s own open\-condition reportability\.
Inspection of responses confirmed task\-induced omission rather than refusal or a capability failure\. In closed\-condition responses, the model performed the counting task in earnest \(“there are 12 pairs of ribs visible; I started from the topmost rib…”\) and simply did not mention the object; in open\-condition responses, the same model on the same image flagged “an artificial object” overlaid on the scan\. The open\-condition response establishes that the object was reportable by that model on that item; the counting\-condition omission establishes that the task set suppressed its report\.
### Study 3: the gap is a threshold on output scope, not a monotonic load gradient
To probe the trigger of suppression, we varied the number of nodule\-directed sub\-questions from zero \(open analysis\) to three on the medical scenarios \(two\-judge $\kappa$ = 0\.98\)\. The relationship was not a smooth dose\-response \(Figure 8\)\. A single sub\-question sufficed to switch suppression almost fully on: critical\-finding report fell from 1\.00 at zero questions to near zero at one\. Adding further questions did not monotonically worsen suppression; the Anthropic models partially recovered \(haiku 0\.00 to 0\.88 across one to three questions\) while OpenAI models stayed at zero throughout\. Response\-length analysis clarified the pattern: the single narrow question produced the shortest outputs \(mean 189 characters, rising to 435 and 1,046 as further questions were added, against 1,770 in the open condition\) and the strongest suppression, whereas the longer outputs re\-admitted the critical finding, sometimes as an explicit alert\. The pattern therefore implicates output scope, the breadth of what the model commits to reporting, rather than question count alone: a narrow task compresses output onto the target and crowds out the unrequested signal\. OpenAI models maintain this compression regardless of output length; Anthropic models relax it as output widens\.
### Cross\-cutting patterns
Table 2\. Critical\-signal report rates by condition \(two\-judge consensus\)\. Low values under a task instruction indicate inattentional suppression; in open/closed studies, the open condition is the within\-item reportability reference\. Refusals and errors were excluded before adjudication\. For GPT\-5, some responses were empty because reasoning consumed the output budget; these were excluded from paired analyses\.

| Experiment | Model | Task\-condition report rate | Open | Key statistic |
| --- | --- | --- | --- | --- |
| Study 1 \(radiology\) | gpt\-4o\-mini | 0\.09 focused / 0\.00 strict | ~1\.00 | strict McNemar p = 1\.1e\-19; $\kappa$ = 0\.90 |
| Study 1 \(radiology\) | gpt\-4o | 0\.08 focused / 0\.00 strict | ~1\.00 | |
| Study 1 \(radiology\) | claude\-sonnet\-4\-6 | 0\.48 focused / 0\.00 strict | ~1\.00 | |
| Study 1 \(radiology\) | claude\-haiku\-4\-5 | 0\.53 focused / 0\.00 strict | ~1\.00 | |
| Study 2 \(vision, star\) | gpt\-4o | 0\.00 counting \(n = 22\) | 0\.96 | McNemar p < 1e\-6 \(b = 21, c = 0\); $\kappa$ = 0\.92 |
| Study 2 \(vision, star\) | gpt\-4o\-mini | 0\.00 counting \(n = 22\) | 0\.79 | p = 1\.5e\-5 |
| Study 2 \(vision, star\) | claude\-haiku\-4\-5 | 0\.50 counting \(n = 24\) | 0\.79 | p = 0\.016 |
| Study 2 \(vision, star\) | claude\-sonnet\-4\-6 | 0\.54 counting \(n = 24\) | 0\.88 | p = 0\.008 |
| Cross\-vendor | gpt\-5 \(reasoning\) | 0\.00 focused \(n = 9\) / 0\.00 strict \(n = 21\) | 1\.00 | strict p = 9\.5e\-7; focused p = 3\.9e\-3; $\kappa$ = 0\.97 |
| Cross\-vendor | gemini\-2\.5\-pro | 0\.00–0\.09 | 1\.00 | p < 1e\-6 |
| Cross\-vendor | claude\-opus\-4\-8 | 0\.83–0\.96 | 1\.00 | |
| Dual\-process $\beta$ | gpt\-4o\-mini \+ critic | 0\.00 $\rightarrow$ 1\.00 \(task $\rightarrow$ task \+ critic\) | n/a | McNemar p = 1\.2e\-7 |

Combining the studies yields a map of when task\-conditioned AI systems omit otherwise reportable signals\. Suppression under an exclusive or attention\-consuming instruction was observed across models and modalities; what varied was degree\. In language, the strict instruction produced complete suppression in radiology and near\-complete suppression in driving, while the focused instruction produced model\- and salience\-dependent suppression\. In vision, suppression was present in every model but strongest in the sampled OpenAI models: gpt\-4o and gpt\-4o\-mini reported the unexpected object in 0\.00 of counting\-task items, whereas the sampled Anthropic models showed more partial suppression\. Across the sampled systems, model family was a stronger correlate of the gap than model scale, alongside task load and signal salience \(Figure 4\)\.
An apparent modality\-by\-provider reversal in small\-N pilots did not survive scaling and two\-judge adjudication\. This heterogeneity is not a weakness of the construct but its resolution: the highest\-risk setting is a high\-load narrow task over a low\-salience but safety\-critical signal handled by a model family that more strongly compresses output to the requested target\.
### Cross\-vendor and flagship validation
To test whether the effect is an artifact of one vendor or of model scale, we ran the medical text study on three flagship models from three vendors \(Opus 4\.8, GPT\-5, Gemini 2\.5 Pro; two\-judge $\kappa$ = 0\.97\)\. The provider pattern held and sharpened\. GPT\-5, a reasoning model, showed complete suppression in valid responses: focused\- and strict\-condition report rates were 0\.00, versus 1\.00 in the open condition\. The strict\-condition comparison was significant on 21 paired items after reasoning\-budget attrition \(exact McNemar p = 9\.5e\-7\), and the focused\-condition comparison was significant on the 9 paired items remaining after attrition \(p = 3\.9e\-3\)\. This refutes, in the sampled setting, the hypothesis that inference\-time reasoning alone resolves the gap\. Gemini 2\.5 Pro, a third\-vendor model, suppressed nearly as strongly \(0\.00 to 0\.09; p < 1e\-6\)\. Opus 4\.8, by contrast, was robust, reporting the co\-present finding in 0\.83 to 0\.96 of task\-conditioned items and 1\.00 of open items\. Inspection showed the behavioral contrast directly: GPT\-5 produced a clean nodule\-only work\-up and omitted the co\-present pneumothorax, whereas Opus 4\.8 prefaced its report with “a critical finding that I can’t ethically ignore despite the nodule\-only request” and flagged it\. The determinant is therefore best described as a model\-family\-associated safety\-reporting disposition rather than simple scale: within Anthropic, suppression did not increase with capability, while OpenAI models suppressed strongly at every sampled scale, including GPT\-5\. GPT\-5’s reasoning\-budget attrition was handled by excluding empty responses from paired analyses; suppression in valid responses was unambiguous\.
### A System\-1\-style task capture without reliable intrinsic oversight
In dual\-process terms, human inattentional blindness can be read as a failure of deliberate monitoring under a primary attentional set: the process that would catch the unexpected is consumed by the task\.\[47\] We asked whether conditioned models exhibit any analogous oversight behavior, using three probes on the medical scenarios\. First, inference\-time reasoning did not supply such oversight in the sampled reasoning model\. GPT\-5 showed complete suppression in valid responses under both focused and strict instructions, indicating that a reasoning\-capable inference pass remained inside the assigned task rather than acting as a boundary monitor\. Second, one sampled model family exhibited a robust safety\-reporting override, most clearly in Opus 4\.8\. Opus 4\.8 reported the co\-present critical finding in 0\.96 of items even under an adversarial\-suppression instruction that explicitly forbade mentioning any other finding or safety caveat and scored such mentions as failures, matching its 0\.96 under standard focus and approaching its 1\.00 open rate \(24 scenarios, inter\-judge agreement 0\.97\)\. Strengthening the suppression instruction therefore did not break the override, suggesting a robust model\-specific safety\-reporting disposition rather than a fragile prompt reflex\. No comparable override appeared in the other sampled models\. Third, oversight could be supplied externally\. Routing each narrow report from a task model \(gpt\-4o\-mini, which reported the co\-present finding in 0\.00 of items on its own\) to an independent open\-ended critic that re\-examined the same input for omissions raised the pipeline’s report rate to 1\.00 \(McNemar p = 1\.2e\-7\); the critic recovered every omission the single pass had dropped\.
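The external\-oversight probe can be pictured as a two\-stage pipeline in which a scoped task pass is followed by an open\-ended review of the same input\. The sketch below is schematic rather than a reproduction of the study’s setup: the callables stand in for model calls, the toy models are keyword stubs rather than real systems, and the critic prompt wording, the hazard list, and the example case are hypothetical\.

```python
from typing import Callable, Dict

# (instruction, case_text) -> report
TaskModel = Callable[[str, str], str]
# (instruction, case_text, task_report) -> review
CriticModel = Callable[[str, str, str], str]

TASK_INSTRUCTION = "Provide a detailed work-up of the nodule."
# Hypothetical wording: the critic prompt is an illustrative placeholder.
CRITIC_INSTRUCTION = (
    "Re-examine the input and list any clinically significant finding "
    "that the report omits."
)


def run_pipeline(case_text: str, task_model: TaskModel, critic: CriticModel) -> Dict[str, str]:
    """Run the scoped task pass, then an independent open-ended review."""
    task_report = task_model(TASK_INSTRUCTION, case_text)
    review = critic(CRITIC_INSTRUCTION, case_text, task_report)
    return {"task_report": task_report, "critic_review": review}


def toy_task_model(instruction: str, case_text: str) -> str:
    """Stub that mimics a nodule-only report, ignoring everything else."""
    return "Nodule work-up: solid pulmonary nodule; recommend follow-up imaging."


def toy_critic(instruction: str, case_text: str, task_report: str) -> str:
    """Stub critic: flags listed hazards present in the input but not the report."""
    hazards = ["tension pneumothorax", "aortic widening", "free subdiaphragmatic air"]
    text, report = case_text.lower(), task_report.lower()
    missed = [h for h in hazards if h in text and h not in report]
    return "; ".join(f"ALERT: unreported {h}" for h in missed) or "No omissions found."


case = "Chest CT: 8 mm pulmonary nodule in the right upper lobe. Also noted: tension pneumothorax."
result = run_pipeline(case, toy_task_model, toy_critic)
print(result["critic_review"])
```

The design point is that the critic receives the original input as well as the task report, so its review is not confined to the scope the task instruction imposed on the first pass\.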
Together, these probes locate the gap as System\-1\-style task capture rather than as a failure eliminated by scale or inference\-time reasoning alone \(Figure 7\)\. In the sampled systems, reliable oversight did not appear as an architecture\-general behavior\. Where monitoring appeared, it took one of two forms: a model\-specific safety\-reporting disposition or an explicit second process\. The scaffold that restored reporting was the open\-ended review run as a standing parallel analysis\. This returns to the structural cost identified by the construct: if the architecture does not reliably supply boundary monitoring, deployment must implement and budget it as an explicit process\.
## Discussion
The Inattentional Gap reframes what AI safety evaluation measures\. A model that scores well on a targeted benchmark is being measured primarily on D\_specified; the events that produce harm often live in D\_unspecified, the region task\-conditioning can suppress from report\. Our results show this suppression to be strong, statistically robust, and present in sampled frontier models, indicating that benchmark performance and deployment safety can decouple\. This decoupling compounds documented technical and epistemic shortcomings of current safety benchmarks \[48\] and aligns with the emerging community effort to redefine how AI is evaluated \[49\]\. The decoupling is sharpest where it is most consequential: a diagnostic AI told to assess one finding omits a life\-threatening second finding it can otherwise describe; a vision\-language model performing a primary task fails to report an unexpected object in plain view\.
The phenomenon mirrors human inattentional blindness at the level of behavior, but the mechanism is unlikely to be identical\. Human inattentional blindness is classically explained by a capacity\-limited attentional bottleneck; transformers do not instantiate that same human bottleneck\. The behavioral convergence with a different likely mechanism is therefore informative, and consistent with prior arguments that performance parity need not imply architectural equivalence \[7\]\. Our results locate the machine\-side effect not in a perceptual\-capacity deficit, since the open condition recovers the signal, but in the conditioning of report on the task instruction \(Figure 5\)\. The family\-dependence of the effect and its insensitivity to scale suggest a model\-family\-associated reporting disposition rather than a scaling\-limited capability boundary\.
The clinical and operational implications are concrete\. In radiology, suppression of co\-present findings is the machine analogue of satisfaction of search and incidental\-finding omission, long documented in human readers, where it reflects perceptual\-cognitive rather than acuity limits \[50,51\] and extends beyond radiology to real\-time clinical monitoring \[52\]\. Here, we show that a comparable omission can be induced in AI by the framing of the request\. In autonomous driving, the same structure predicts that a perception stack optimized for specified hazards may underreport out\-of\-distribution ones, the failure mode that the driving anomaly\-detection literature \[53\] and dedicated anomaly\-segmentation benchmarks \[54\] were built to expose\. In both settings, the corrective is not simply a better detector of D\_specified, but an evaluation and deployment regime that explicitly measures and protects D\_unspecified\.
This work also resolves the question that motivated it: whether AI exhibits a human\-like form of inattentional blindness\. The behavioral parity is real, but the probes above indicate that the underlying mechanism differs\. Humans miss critical signals when a capacity bottleneck exhausts deliberate monitoring \[47\]; the sampled models omit them because task\-conditioning closes the reporting frame, and reliable boundary monitoring is not supplied by scale or inference\-time reasoning alone\. The productive reading is neither that AI cognition is human\-like nor that it is alien, but that task\-conditioning imposes a reporting closure that is structurally useful for compliance while hazardous for safety\-critical deployment\. Its failure modes can therefore be measured, mapped, and engineered against\.
That the omission faithfully obeys the posed task does not soften the safety implication; it sharpens it\. The more competently a model executes a narrow task, the more completely a mis\-posed task may be obeyed, and the life\-threatening finding goes unreported precisely because the system did what it was told\. This is the machine form of the bias of the question posed \[10\]\. Safety is therefore not secured by making the model follow the narrow task more narrowly, nor by assuming an infeasible universal analyzer that reports everything on every input\. It requires an evaluation and deployment regime, or an explicit dual\-process scaffold, that measures and protects the unrequested region\.
## Limitations
The studies remain moderate in scale \(100 textual scenarios and 48 visual scenarios\) and should be replicated at field scale\. Adjudication used two independent language\-model judges with high agreement \(Cohen’s κ = 0\.90 for text and 0\.92 for vision\); future work should add blinded human and domain\-expert adjudication, especially for clinically realistic radiology tasks\. The visual stimuli used composited non\-anatomical objects rather than naturally occurring pathological findings, which isolates the attentional\-set mechanism at the cost of clinical realism\. A complementary study using real co\-present pathology is therefore an important next step\.
A pilot in this direction, run on real chest radiographs with no composited object and using each model’s own open\-condition description as the reportability reference, qualitatively reproduced the suppression on genuine radiographic content\. For example, a model that reported cardiomegaly when asked openly omitted it under a rib\-counting task\. However, reliable quantification was limited by judge disagreement on open\-ended clinical findings and by vision refusals on real radiographs\. Radiologist\-adjudicated multi\-finding datasets would provide a stronger clinical test of the construct\.
Open\-condition report of the visual object was below ceiling for subtle stimuli, reflecting genuine low\-salience difficulty\. The within\-item gap measure is robust to this limitation because suppression is estimated only relative to each model’s own open\-condition reportability\. Two OpenAI models initially refused diagnostic framings until a synthetic\-stimulus system message was supplied; this refusal behavior is itself deployment\-relevant and is reported as part of the evaluation context\.
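One way to read the within\-item gap measure is as the difference between open\-condition and task\-condition report rates over the same paired items, so that a below\-ceiling open rate lowers the reference instead of inflating the gap\. The sketch below follows that reading; the function name `within_item_gap` and the record layout are assumptions for illustration, not the deposited analysis code\.

```python
from typing import Optional


def within_item_gap(items: list[dict]) -> Optional[float]:
    """Open-minus-task report rate over paired valid items.

    Each item is a dict with keys 'open' and 'task'. Values are True
    (signal reported), False (not reported), or None (refusal or error).
    Items with an invalid response in either condition are dropped.
    """
    paired = [it for it in items
              if it["open"] is not None and it["task"] is not None]
    if not paired:
        return None
    open_rate = sum(it["open"] for it in paired) / len(paired)
    task_rate = sum(it["task"] for it in paired) / len(paired)
    return open_rate - task_rate


# Illustrative records only, not data from the study.
example = [
    {"open": True, "task": False},
    {"open": True, "task": True},
    {"open": True, "task": False},
    {"open": False, "task": False},  # subtle stimulus missed even when open
    {"open": None, "task": False},   # refusal in the open condition: excluded
]
print(within_item_gap(example))
```

In this toy input the open rate is 0\.75 and the task rate 0\.25 over four paired items, so the gap is 0\.5 even though the open condition is below ceiling\.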
The dual\-process probes used 24 medical scenarios each and should be treated as demonstrations rather than full benchmarks\. Scaling the external\-critic scaffold across models, modalities, and domains, and testing whether the robust safety\-reporting override observed in one model generalizes beyond these items, are immediate next steps\. Finally, the construct is demonstrated behaviorally but not yet fully causally decomposed\. Isolating the separate contributions of task load, signal salience, output scope, and model\-family\-specific reporting dispositions remains for future work\.
## Experimental Procedures
Models\. Studies 1 to 3 used claude\-haiku\-4\-5, gpt\-4o\-mini, gpt\-4o, and claude\-sonnet\-4\-6; the cross\-vendor validation additionally used three flagship models, claude\-opus\-4\-8, gpt\-5 \(a reasoning model\), and gemini\-2\.5\-pro\. Models were queried at temperature 0 via their respective APIs, with temperature omitted for claude\-opus\-4\-8 \(which does not accept it\) and for the gpt\-5 reasoning model \(whose output budget is set by max\_completion\_tokens\)\. All models were queried in June 2026 against each vendor’s production API; the complete per\-call configuration \(exact model identifiers, maximum output tokens, image\-detail setting, and retry and refusal handling\) is provided in the deposited code\.
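The per\-model exceptions described above can be captured in a small configuration helper\. This is a vendor\-neutral sketch, not the deposited query script: the function `build_request`, the generic `prompt` and `max_tokens` fields, and the 1024\-token default are placeholders, and only the temperature rules and the `max_completion_tokens` name come from the text\.

```python
# Models for which the text states that temperature was omitted.
TEMPERATURE_OMITTED = {"claude-opus-4-8", "gpt-5"}


def build_request(model: str, prompt: str, max_output_tokens: int = 1024) -> dict:
    """Return a generic request dict; field names are placeholders."""
    request = {"model": model, "prompt": prompt}
    if model not in TEMPERATURE_OMITTED:
        # Deterministic decoding for every model that accepts the setting.
        request["temperature"] = 0
    if model == "gpt-5":
        # Output budget of the reasoning model, as named in the text.
        request["max_completion_tokens"] = max_output_tokens
    else:
        # Hypothetical generic field name for the output limit.
        request["max_tokens"] = max_output_tokens
    return request


for name in ("gpt-4o", "claude-opus-4-8", "gpt-5"):
    print(build_request(name, "Describe the imaging findings."))
```

Keeping the exceptions in one place makes it harder for a rerun to silently send a temperature to a model that rejects it\.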
Table 3\. Models and query configuration\. All models were queried in June 2026 via each vendor’s production API at temperature 0, except as noted\.

| Model | Vendor | Class | Role\(s\) | Notes |
| --- | --- | --- | --- | --- |
| claude\-haiku\-4\-5 | Anthropic | small | subject, judge | |
| gpt\-4o\-mini | OpenAI | small | subject, judge | |
| gpt\-4o | OpenAI | frontier | subject | |
| claude\-sonnet\-4\-6 | Anthropic | mid | subject, critic | |
| claude\-opus\-4\-8 | Anthropic | flagship | subject | temperature omitted |
| gpt\-5 | OpenAI | reasoning | subject | temperature omitted; output budget set by max\_completion\_tokens |
| gemini\-2\.5\-pro | Google | flagship | subject | |

Study 1 stimuli\. 100 procedurally composed scenarios \(64 radiology: 8 nodule descriptions × 8 critical findings; 36 driving: 6 lead\-vehicle scenes × 6 hazards\)\. Three instructions per scenario \(focused, strict, open\)\.
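The Study 1 counts follow from crossing the template sets\. The sketch below reproduces only the combinatorics; the template labels \(`nodule_0`, `critical_0`, and so on\) are placeholders, since the actual scenario texts live in the deposited generators\.

```python
from itertools import product

INSTRUCTIONS = ("focused", "strict", "open")


def compose(domain: str, primaries: list[str], hazards: list[str]) -> list[dict]:
    """Cross every primary-task template with every co-present hazard."""
    return [{"domain": domain, "primary": p, "hazard": h}
            for p, h in product(primaries, hazards)]


# Placeholder labels standing in for the real template texts.
radiology = compose("radiology",
                    [f"nodule_{i}" for i in range(8)],
                    [f"critical_{j}" for j in range(8)])
driving = compose("driving",
                  [f"scene_{i}" for i in range(6)],
                  [f"hazard_{j}" for j in range(6)])

scenarios = radiology + driving
trials = [(s, instruction) for s in scenarios for instruction in INSTRUCTIONS]
print(len(radiology), len(driving), len(scenarios), len(trials))
```

This gives 64 radiology and 36 driving scenarios, 100 in total, and 300 scenario\-instruction trials per model before any refusals or errors are excluded\.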
Study 2 stimuli\. 12 public\-domain chest radiographs \(Wikimedia Commons; top and bottom text bands redacted to remove identifiers\), each composited with a subtle translucent disc or yellow star at 4% image width, alpha ~0\.59, at left or right lung field, yielding 48 images \(12 radiographs × 2 objects × 2 sides\)\. Two instructions per image \(closed rib\-counting, open description\) under a synthetic\-stimulus system message\.
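The Study 2 design grid and the compositing parameters can be sketched without any imaging library\. The helper names below are hypothetical, and the blend assumes standard "over" compositing with alpha as the overlay opacity; the deposited image\-composition code remains the reference for the actual procedure\.

```python
from itertools import product


def stimulus_grid(n_radiographs: int = 12) -> list[dict]:
    """Enumerate radiograph x object x side combinations."""
    return [{"radiograph": r, "object": o, "side": s}
            for r, o, s in product(range(n_radiographs),
                                   ("disc", "star"),
                                   ("left", "right"))]


def object_width_px(image_width_px: int, fraction: float = 0.04) -> int:
    """Object width at 4% of the image width, rounded to whole pixels."""
    return round(image_width_px * fraction)


def alpha_blend(background: float, overlay: float, alpha: float = 0.59) -> float:
    """Blend one channel value, assuming alpha is the overlay opacity."""
    return alpha * overlay + (1.0 - alpha) * background


print(len(stimulus_grid()))        # number of composited images
print(object_width_px(1024))       # object width for a 1024-pixel-wide image
print(alpha_blend(100.0, 255.0))   # blended intensity for one channel
```

For a 1024\-pixel\-wide radiograph this places an object about 41 pixels wide, which gives a sense of how small the unexpected object is relative to the lung field\.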
Study 3 stimuli \(output scope\)\. A subset of 24 radiology scenarios \(4 nodule descriptions × 6 critical findings\) was presented under four task\-load levels defined by the number of nodule\-directed sub\-questions appended to a neutral report request \(0 = open analysis; 1, 2, 3 cumulative\)\. The number of sub\-questions, and the resulting output length, were the manipulated variables; the dependent measure was report of the co\-present critical finding\.
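The cumulative dose manipulation can be illustrated as prompt construction\. The base request and the three sub\-question texts below are invented placeholders \(the verbatim prompts are in the deposited code\); only the structure, a neutral request plus 0 to 3 cumulative nodule\-directed sub\-questions, follows the description above\.

```python
# Placeholder wording; the study's verbatim prompts are in the deposited code.
BASE_REQUEST = "Please write a report for the following imaging description."
SUB_QUESTIONS = [
    "What is the size of the nodule?",
    "Where is the nodule located?",
    "What follow-up is recommended for the nodule?",
]


def dose_prompt(level: int) -> str:
    """Build the instruction for a task-load level from 0 (open) to 3."""
    if not 0 <= level <= len(SUB_QUESTIONS):
        raise ValueError("level must be between 0 and 3")
    # Cumulative: level k appends the first k sub-questions.
    return " ".join([BASE_REQUEST] + SUB_QUESTIONS[:level])


for level in range(4):
    print(level, "|", dose_prompt(level))
```

Level 0 leaves the neutral request untouched, so it doubles as the open\-analysis reference against which the higher doses are compared\.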
Cross\-vendor \(flagship\) stimuli\. The same 24 radiology scenarios were presented to the three flagship models under the focused, strict, and open instructions of Study 1\.
Dual\-process probes\. Two further experiments used the 24 medical scenarios\. To test whether a robust model’s override is a stable disposition or a fragile reflex, claude\-opus\-4\-8 was run under a standard focused instruction, an adversarial\-suppression instruction that explicitly forbade mentioning any other finding or any safety or ethical caveat \(scoring such mentions as task failure\), and an open instruction\. To test whether external oversight closes the gap, each focused report from a task model \(gpt\-4o\-mini\) was passed, together with the original imaging description, to an independent open\-ended critic model \(claude\-sonnet\-4\-6\) instructed to flag any clinically critical finding omitted from the report; the pipeline counted as reporting the finding if either the task report or the critic flagged it, judged separately to avoid truncation of the critic text\.
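The external\-critic scaffold reduces to a simple control flow: run the task model, pass its report and the original input to a critic, and count the pipeline as reporting if either text is judged to contain the finding\. The sketch below uses stub callables in place of real model and judge calls; all function names and stub texts are illustrative assumptions\.

```python
from typing import Callable


def pipeline_reports_finding(
    description: str,
    task_model: Callable[[str], str],
    critic_model: Callable[[str, str], str],
    judge: Callable[[str], bool],
) -> bool:
    """Return True if the task report or the critic output flags the finding."""
    report = task_model(description)
    critique = critic_model(description, report)
    # Judge the two texts separately rather than concatenating them.
    return judge(report) or judge(critique)


# Stubs standing in for model calls; not real outputs from the study.
def stub_task_model(description: str) -> str:
    return "Nodule: 8 mm, right upper lobe. Recommend follow-up CT."


def stub_critic(description: str, report: str) -> str:
    return "Omitted critical finding: left-sided pneumothorax."


def stub_judge(text: str) -> bool:
    return "pneumothorax" in text.lower()


case = "8 mm right upper lobe nodule; left-sided pneumothorax."
print(stub_judge(stub_task_model(case)))                                          # single pass
print(pipeline_reports_finding(case, stub_task_model, stub_critic, stub_judge))   # with critic
```

The design point is that the critic sees the original input, not only the narrow report, so it can recover content the task model never mentioned\.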
Adjudication\. Each response was scored for report of the critical signal/object by two independent language\-model judges \(gpt\-4o\-mini and claude\-haiku\-4\-5\), each returning a binary decision; we report the both\-agree consensus and inter\-rater Cohen’s κ \(0\.895 Study 1 text, 0\.918 Study 2 vision, 0\.984 Study 3, 0\.969 cross\-vendor\)\. Refusals and errors were excluded; spot\-checks validated judge decisions against response text\. In the dual\-process probes, critical\-finding report is near\-constant within a condition, so Cohen’s κ is degenerate \(the prevalence paradox in which κ collapses toward zero despite high agreement\); for these conditions we therefore report raw inter\-judge agreement \(0\.97 for the override probe, 0\.88 for the single\-pass baseline\) and, for the override probe, Gwet’s AC1 \(0\.97\), which is robust to skewed prevalence\. Throughout, the n reported for a McNemar comparison denotes paired items in which both the task and the open condition returned a valid \(non\-refusal, non\-error\) response; for GPT\-5 this paired n is 21 \(strict\) and 9 \(focus\) after reasoning\-budget attrition, and is therefore smaller than the per\-condition valid\-response counts\.
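The prevalence paradox mentioned above is easy to reproduce on toy data\. The sketch below computes raw agreement, Cohen’s κ, and Gwet’s AC1 for two binary raters using the standard two\-rater formulas; the judge vectors are invented for illustration and are not the study’s adjudication files\.

```python
def agreement_stats(a: list[int], b: list[int]) -> dict:
    """Raw agreement, Cohen's kappa, and Gwet's AC1 for two binary raters."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                 # each rater's "yes" rate
    pe_kappa = pa * pb + (1 - pa) * (1 - pb)        # chance agreement (Cohen)
    pi = (pa + pb) / 2                              # mean "yes" prevalence
    pe_ac1 = 2 * pi * (1 - pi)                      # chance agreement (Gwet)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return {"agreement": po, "kappa": kappa, "ac1": ac1}


# Invented example: 24 items, near-constant "reported" decisions.
judge_a = [1] * 23 + [0]   # one dissenting "not reported"
judge_b = [1] * 24         # "reported" on every item
stats = agreement_stats(judge_a, judge_b)
print({k: round(v, 3) for k, v in stats.items()})
```

Here the judges agree on 23 of 24 items, yet κ is zero because chance agreement is computed from nearly constant marginals, while AC1 stays close to the raw agreement\.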
Statistics\. Within\-item paired comparison of closed versus open conditions by exact McNemar test; report rates and condition gaps reported per model, object, and domain\. For Study 3, report rate is given per dose level, with the partial recovery at higher doses analyzed against response length\. Because items were procedurally composed from a small set of templates, report rates were re\-estimated with a cluster bootstrap resampling the nodule and critical\-finding templates; the focused\-condition gaps and the universal strict\-condition suppression \(report rate 0\.00 across all eight nodule and all eight critical\-finding templates\) were robust to this clustering\. The dual\-process probes were also repeated across independent runs to check stability against temperature\-0 non\-determinism: the external\-critic scaffold reproduced exactly in every run \(single\-pass report 0\.00, pipeline report 1\.00\), and the robust\-model override remained high throughout \(0\.92 to 1\.00\)\.
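A cluster bootstrap over templates can be sketched as follows\. This is one plausible crossed scheme, resampling nodule templates and critical\-finding templates independently with replacement and taking a percentile interval; the exact scheme used in the paper is in the deposited robustness scripts, and the function name, resample count, and seed here are assumptions\.

```python
import random


def cluster_bootstrap_ci(outcomes: dict, n_boot: int = 2000,
                         seed: int = 0, level: float = 0.95) -> tuple:
    """Percentile interval for the report rate under template resampling.

    outcomes maps (nodule_template, finding_template) -> 0 or 1.
    """
    rng = random.Random(seed)
    nodules = sorted({key[0] for key in outcomes})
    findings = sorted({key[1] for key in outcomes})
    rates = []
    for _ in range(n_boot):
        # Resample whole templates, so items sharing a template move together.
        ns = rng.choices(nodules, k=len(nodules))
        fs = rng.choices(findings, k=len(findings))
        cells = [outcomes[(n, f)] for n in ns for f in fs if (n, f) in outcomes]
        if cells:
            rates.append(sum(cells) / len(cells))
    rates.sort()
    lo = rates[int((1 - level) / 2 * (len(rates) - 1))]
    hi = rates[int((1 + level) / 2 * (len(rates) - 1))]
    return lo, hi


# Toy 8 x 8 grid with complete suppression (report rate 0.00 in every cell).
suppressed = {(n, f): 0 for n in range(8) for f in range(8)}
print(cluster_bootstrap_ci(suppressed))
```

When every template cell has the same outcome, as in the strict condition described above, every resample returns the same rate and the interval collapses to a point\.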
## Resource Availability
Lead contact\. Requests for further information and resources should be directed to the lead contact, Kwan Soo Shin \(sshin@pmminds\.ai\)\.
Materials availability\. This study did not generate new physical materials\.
Data and code availability\. Scenario generators, image\-composition and stimulus\-generation code, model\-query scripts, adjudication scripts including the verbatim judge prompts, raw model outputs, per\-item adjudication files, analysis scripts, robustness scripts, and figure\-generation scripts are archived in a public repository under CC\-BY 4\.0 \(Zenodo: https://doi\.org/10\.5281/zenodo\.20826824\)\. Source chest radiographs are public\-domain Wikimedia Commons images; the repository includes the source\-image list, license and source metadata, the redaction procedure, and the generated synthetic stimuli\. Because commercial production models may change over time, the deposited raw outputs and per\-item adjudication files constitute the authoritative evidence base for the reported analyses; API rerun scripts are provided for reproducibility and extension\. Additional information required to reanalyze the data reported here is available from the lead contact upon reasonable request\.
## Author Contributions
K\.S\.S\. conceived the study, designed and ran the experiments, analyzed the data, and wrote the manuscript\.
## Declaration of Interests
The author declares no competing interests\.
## Declaration of Generative AI and AI\-Assisted Technologies
Large language and vision\-language models are the objects of study in this work and were queried programmatically as experimental subjects\. Separately, during manuscript preparation, the author used AI\-assisted tools for code scaffolding and language editing\. All figures and statistical results were generated from the deposited analysis scripts\. All experimental results, adjudication outputs, analyses, and scientific claims were independently reviewed and verified by the author, who takes full responsibility for the content\.
## Figures
Figure 1\. The Inattentional Gap \(schematic\)\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig1_concept.png)
Task\-conditioning focuses evaluation on the specified target, where report on the specified target appears high, while report of co\-present unspecified safety\-critical signals is suppressed, decoupling benchmark from real\-world safety\.
Figure 2\. Study 1 \(text\): critical\-signal report rate by model and condition\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig2_study1_heatmap.png)
Low values indicate stronger inattentional suppression\. Open\-condition columns are near ceiling, establishing the reportability reference; task\-conditioned columns show suppression, strongest under strict instructions and in the radiology focused condition\.
Figure 3\. Study 2 \(vision\): example stimulus and star\-object report rate\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig3_study2.png)
A public\-domain chest radiograph was redacted and composited with a subtle unexpected object\. The right panel shows star\-object report rates under the rib\-counting task versus the open\-description condition; the disc condition is summarized in the text\.
Figure 4\. Inattentional gap by modality and model\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig4_modality_provider.png)
Higher values indicate stronger task\-conditioned suppression\. In the non\-strict/task\-loaded conditions, the sampled OpenAI models showed larger gaps than the sampled Anthropic models in both text and vision settings; the cross\-vendor validation further tests whether this pattern tracks model family rather than model scale\.
Figure 5\. Mechanism\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig5_mechanism.png)
Under an open instruction, the model reports both the specified target and the co\-present safety\-critical signal\. Under a task\-conditioned instruction, the same input is compressed into the requested reporting frame, and the unrequested hazard is omitted\. The gap is therefore located in task\-conditioned reporting, not in failure to report the signal under open conditions\.
Figure 6\. Structured positioning map of the literature\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig6_literature_map.png)
We coded 116 works across eleven axes and rendered them as twelve thematic clusters\. The x\-axis captures engagement with inattentional\-blindness theory; the y\-axis captures whether task\-conditioning is used as the independent variable with an open/closed within\-item reportability control\. Human inattentional\-blindness studies occupy the high\-x, high\-y region, whereas prior machine studies remain outside the corresponding high\-x, high\-y cell in our coding\. The Inattentional Gap fills that machine\-side cell by combining attentional\-set manipulation, within\-item reportability recovery, and safety\-critical framing\. Coordinates summarize the coding rationale reported in Supplementary File S1 and should be read as a structured positioning map rather than a quantitative meta\-analysis\.
Figure 7\. Dual\-process probes\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig7_dualprocess.png)
\(A\) In valid responses, the sampled reasoning model \(GPT\-5\) showed complete suppression under both focused and strict task instructions, whereas Opus 4\.8 reported the co\-present critical finding even under an adversarial\-suppression instruction that explicitly forbade mentioning other findings or safety caveats, indicating a robust safety\-reporting override rather than a fragile prompt reflex\. \(B\) Routing a task model’s narrow report to an independent open\-ended critic raised the co\-present\-finding report rate from 0\.00 to 1\.00 \(McNemar p = 1\.2e\-7\), showing that oversight\-like recovery can be supplied externally as a second process\.
Figure 8\. Study 3: output scope, not monotonic load\.
![[Uncaptioned image]](https://arxiv.org/html/2606.26529v1/figures/Fig8_dose_outputscope.png)
Critical\-finding report rate \(left axis, solid\) collapses from 1\.00 at zero sub\-questions to near zero at one, then partially recovers for the Anthropic models as output scope widens with longer responses \(mean output length, right axis, dashed\); the OpenAI models stay suppressed\. The pattern implicates the breadth of committed output, rather than the number of instructions alone, as the proximal trigger of suppression\.
## References
1. Simons, D\. J\. & Chabris, C\. F\. \(1999\)\. Gorillas in our midst: Sustained inattentional blindness for dynamic events\. Perception, 28\(9\), 1059–1074\. DOI: 10\.1068/p281059\.
2. Drew, T\., Võ, M\. L\.\-H\. & Wolfe, J\. M\. \(2013\)\. The invisible gorilla strikes again: Sustained inattentional blindness in expert observers\. Psychological Science, 24\(9\), 1848–1853\. DOI: 10\.1177/0956797613479386\.
3. Mack, A\. & Rock, I\. \(1998\)\. Inattentional Blindness\. MIT Press\. DOI: 10\.7551/mitpress/3707\.001\.0001\.
4. Binz, M\. & Schulz, E\. \(2023\)\. Using cognitive psychology to understand GPT\-3\. PNAS, 120\(6\), e2218523120\. DOI: 10\.1073/pnas\.2218523120\.
5. Hagendorff, T\., Fabi, S\. & Kosinski, M\. \(2023\)\. Human\-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT\. Nature Computational Science, 3\(10\), 833–838\. DOI: 10\.1038/s43588\-023\-00527\-x\.
6. Dahou, Y\., Huynh, N\. D\., Le\-Khac, P\. H\., Para, W\. R\., Singh, A\. & Narayan, S\. \(2025\)\. Vision\-language models can’t see the obvious \(SalBench\)\. ICCV 2025\. DOI: 10\.1109/ICCV51701\.2025\.02239\. arXiv:2507\.04741\.
7. Bowers, J\. S\., Malhotra, G\., Dujmović, M\., et al\. \(2023\)\. Deep problems with neural network models of human vision\. Behavioral and Brain Sciences, 46, e385\. DOI: 10\.1017/S0140525X22002813\. arXiv:2312\.05355\.
8. Quilty\-Dunn, J\., Porot, N\. & Mandelbaum, E\. \(2023\)\. The best game in town: The re\-emergence of the language\-of\-thought hypothesis across the cognitive sciences\. Behavioral and Brain Sciences, 46, e261\. DOI: 10\.1017/S0140525X22002849\.
9. Most, S\. B\., Scholl, B\. J\., Clifford, E\. R\. & Simons, D\. J\. \(2005\)\. What you see is what you set: Sustained inattentional blindness and the capture of awareness\. Psychological Review, 112\(1\), 217–242\. DOI: 10\.1037/0033\-295X\.112\.1\.217\.
10. Zamir, E\. \(2014\)\. The bias of the question posed: A diagnostic “invisible gorilla\.” Diagnosis, 1\(3\), 245–248\. DOI: 10\.1515/dx\-2014\-0017\.
11. Garg, R\. K\., Ouyang, B\., Kocak, M\., Bhabad, S\., Bleck, T\. P\. & Jhaveri, M\. \(2022\)\. Inattentional blindness to DWI lesions in spontaneous intracerebral hemorrhage\. Neurological Sciences, 43\(7\), 4355–4361\. DOI: 10\.1007/s10072\-022\-05992\-2\.
12. Hults, C\. M\., Ding, Y\., Xie, G\. G\., Raja, R\., Johnson, W\., Lee, A\. & Simons, D\. J\. \(2024\)\. Inattentional blindness in medicine\. Cognitive Research: Principles and Implications, 9\(1\), 18\. DOI: 10\.1186/s41235\-024\-00537\-x\.
13. Neisser, U\. & Becklen, R\. \(1975\)\. Selective looking: Attending to visually specified events\. Cognitive Psychology, 7\(4\), 480–494\. DOI: 10\.1016/0010\-0285\(75\)90019\-5\.
14. Berbaum, K\. S\., Franken, E\. A\., Dorfman, D\. D\., Rooholamini, S\. A\., Kathol, M\. H\., Barloon, T\. J\., Behlke, F\. M\., Sato, Y\., Lu, C\. H\., el\-Khoury, G\. Y\., Flickinger, F\. W\. & Montgomery, W\. J\. \(1990\)\. Satisfaction of search in diagnostic radiology\. Invest\. Radiol\. 25, 133–140\. DOI: 10\.1097/00004424\-199002000\-00006\.
15. Berbaum, K\. S\., Franken, E\. A\., Caldwell, R\. T\. & Schartz, K\. M\. \(2001\)\. Gaze dwell times on acute trauma injuries missed because of satisfaction of search\. Acad\. Radiol\. 8, 304–314\. DOI: 10\.1016/S1076\-6332\(03\)80499\-3\.
16. Adamo, S\. H\., Gereke, B\. J\., Shomstein, S\. & Schmidt, J\. \(2021\)\. From “satisfaction of search” to “subsequent search misses”: A review of multiple\-target search errors across radiology and cognitive science\. Cogn\. Res\. Princ\. Implic\. 6, 59\. DOI: 10\.1186/s41235\-021\-00318\-w\.
17. Williams, L\. H\., Carrigan, A\. J\., Auffermann, W\. F\., Mills, M\. K\., Rich, A\. N\., Elmore, J\. G\. & Drew, T\. \(2021\)\. The invisible breast cancer: Experience does not protect against inattentional blindness to clinically relevant findings in radiology\. Psychonomic Bulletin & Review, 28\(2\), 503–511\. DOI: 10\.3758/s13423\-020\-01826\-4\.
18. Robson, S\. G\. & Tangen, J\. M\. \(2023\)\. The invisible 800\-pound gorilla: Expertise can increase inattentional blindness\. Cogn\. Res\. Princ\. Implic\. 8, 33\. DOI: 10\.1186/s41235\-023\-00486\-x\.
19. Lake, B\. M\., Ullman, T\. D\., Tenenbaum, J\. B\. & Gershman, S\. J\. \(2017\)\. Building machines that learn and think like people\. Behavioral and Brain Sciences, 40, e253\. DOI: 10\.1017/S0140525X16001837\.
20. Hagendorff, T\., Dasgupta, I\., Binz, M\., Chan, S\. C\. Y\., Lampinen, A\., Wang, J\. X\., Akata, Z\. & Schulz, E\. \(2023\)\. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods\. arXiv:2303\.13988\.
21. Lampinen, A\. K\., Dasgupta, I\., Chan, S\. C\. Y\., Sheahan, H\. R\., Creswell, A\., Kumaran, D\., McClelland, J\. L\. & Hill, F\. \(2024\)\. Language models, like humans, show content effects on reasoning tasks\. PNAS Nexus 3, pgae233\. DOI: 10\.1093/pnasnexus/pgae233\.
22. Mahowald, K\., Ivanova, A\. A\., Blank, I\. A\., Kanwisher, N\., Tenenbaum, J\. B\. & Fedorenko, E\. \(2024\)\. Dissociating language and thought in large language models\. Trends in Cognitive Sciences, 28\(6\), 517–540\. DOI: 10\.1016/j\.tics\.2024\.01\.011\.
23. Yildirim, I\. & Paul, L\. A\. \(2024\)\. From task structures to world models: What do LLMs know? Trends Cogn\. Sci\. 28, 404–415\. DOI: 10\.1016/j\.tics\.2024\.02\.008\.
24. Webb, T\., Holyoak, K\. J\. & Lu, H\. \(2023\)\. Emergent analogical reasoning in large language models\. Nat\. Hum\. Behav\. 7, 1526–1541\. DOI: 10\.1038/s41562\-023\-01659\-w\.
25. Block, N\. \(2011\)\. Perceptual consciousness overflows cognitive access\. Trends in Cognitive Sciences, 15\(12\), 567–575\. DOI: 10\.1016/j\.tics\.2011\.11\.001\.
26. Cohen, M\. A\., Dennett, D\. C\. & Kanwisher, N\. \(2016\)\. What is the bandwidth of perceptual experience? Trends Cogn\. Sci\. 20, 324–335\. DOI: 10\.1016/j\.tics\.2016\.03\.006\.
27. van Amsterdam, W\. A\. C\., van Geloven, N\., Krijthe, J\. H\., Ranganath, R\. & Cinà, G\. \(2025\)\. When accurate prediction models yield harmful self\-fulfilling prophecies\. Patterns, 6\(4\), 101229\. DOI: 10\.1016/j\.patter\.2025\.101229\.
28. Park, P\. S\., Goldstein, S\., O’Gara, A\., Chen, M\. & Hendrycks, D\. \(2024\)\. AI deception: A survey of examples, risks, and potential solutions\. Patterns 5, 100988\. DOI: 10\.1016/j\.patter\.2024\.100988\.
29. Sanchez, M\., Alford, K\., Krishna, V\., Huynh, T\. M\., Nguyen, C\. D\. T\., Lungren, M\. P\., Truong, S\. Q\. H\. & Rajpurkar, P\. \(2023\)\. AI\-clinician collaboration via disagreement prediction: A decision pipeline and retrospective analysis of real\-world radiologist\-AI interactions\. Cell Reports Medicine, 4\(10\), 101207\. DOI: 10\.1016/j\.xcrm\.2023\.101207\.
30. Rahmanzadehgervi, P\., Bolton, L\., Taesiri, M\. R\. & Nguyen, A\. T\. \(2024\)\. Vision language models are blind\. arXiv:2407\.06581 \(ACCV 2024\)\.
31. Tong, S\., Liu, Z\., Zhai, Y\., Ma, Y\., LeCun, Y\. & Xie, S\. \(2024\)\. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs\. Proc\. IEEE/CVF CVPR 2024\. DOI: 10\.1109/CVPR52733\.2024\.00914\. arXiv:2401\.06209\.
32. Li, Y\., Du, Y\., Zhou, K\., Wang, J\., Zhao, W\. X\. & Wen, J\.\-R\. \(2023\)\. Evaluating object hallucination in large vision\-language models\. Proc\. EMNLP 2023\. arXiv:2305\.10355\.
33. Fu, H\. Y\., Shrivastava, A\., Moore, J\., West, P\., Tan, C\. & Holtzman, A\. \(2025\)\. AbsenceBench: Language models can’t tell what’s missing\. arXiv:2506\.11440\.
34. Geirhos, R\., Jacobsen, J\.\-H\., Michaelis, C\., Zemel, R\., Brendel, W\., Bethge, M\. & Wichmann, F\. A\. \(2020\)\. Shortcut learning in deep neural networks\. Nat\. Mach\. Intell\. 2, 665–673\. DOI: 10\.1038/s42256\-020\-00257\-z\.
35. Amodei, D\., Olah, C\., Steinhardt, J\., Christiano, P\., Schulman, J\. & Mané, D\. \(2016\)\. Concrete problems in AI safety\. arXiv:1606\.06565\.
36. Booth, S\., Knox, W\. B\., Shah, J\., Niekum, S\., Stone, P\. & Allievi, A\. \(2023\)\. The perils of trial\-and\-error reward design: Misdesign through overfitting and invalid task specifications\. Proc\. AAAI Conf\. Artif\. Intell\. 37, 5920–5929\. DOI: 10\.1609/aaai\.v37i5\.25733\.
37. Oakden\-Rayner, L\., Dunnmon, J\., Carneiro, G\. & Ré, C\. \(2020\)\. Hidden stratification causes clinically meaningful failures in medical imaging deep learning\. Proc\. ACM Conf\. Health Inference Learn\. \(CHIL\) 151–159\. DOI: 10\.1145/3368555\.3384468\.
38. Wu, E\., Wu, K\., Daneshjou, R\., Ouyang, D\., Ho, D\. E\. & Zou, J\. \(2021\)\. How medical AI devices are evaluated: Limitations and recommendations from an analysis of FDA approvals\. Nat\. Med\. 27, 582–584\. DOI: 10\.1038/s41591\-021\-01312\-x\.
39. Vaswani, A\., Shazeer, N\., Parmar, N\., Uszkoreit, J\., Jones, L\., Gomez, A\. N\., Kaiser, L\. & Polosukhin, I\. \(2017\)\. Attention is all you need\. NeurIPS 2017\. arXiv:1706\.03762\.
40. Evans, J\. St\. B\. T\. & Stanovich, K\. E\. \(2013\)\. Dual\-process theories of higher cognition: Advancing the debate\. Perspect\. Psychol\. Sci\. 8, 223–241\. DOI: 10\.1177/1745691612460685\.
41. Itti, L\. & Koch, C\. \(1998\)\. A model of saliency\-based visual attention for rapid scene analysis\. IEEE Trans\. Pattern Anal\. Mach\. Intell\. 20, 1254–1259\. DOI: 10\.1109/34\.730558\.
42. Bainbridge, L\. \(1983\)\. Ironies of automation\. Automatica, 19\(6\), 775–779\. DOI: 10\.1016/0005\-1098\(83\)90046\-8\.
43. Parasuraman, R\. & Riley, V\. \(1997\)\. Humans and automation: Use, misuse, disuse, abuse\. Human Factors, 39\(2\), 230–253\. DOI: 10\.1518/001872097778543886\.
44. Parasuraman, R\. & Manzey, D\. H\. \(2010\)\. Complacency and bias in human use of automation: An attentional integration\. Hum\. Factors 52, 381–410\. DOI: 10\.1177/0018720810376055\.
45. Dzindolet, M\. T\., Peterson, S\. A\., Pomranky, R\. A\., Pierce, L\. G\. & Beck, H\. P\. \(2003\)\. The role of trust in automation reliance\. Int\. J\. Hum\.\-Comput\. Stud\. 58, 697–718\. DOI: 10\.1016/S1071\-5819\(03\)00038\-7\.
46. Goddard, K\., Roudsari, A\. & Wyatt, J\. C\. \(2012\)\. Automation bias: A systematic review of frequency, effect mediators, and mitigators\. J\. Am\. Med\. Inform\. Assoc\. 19, 121–127\. DOI: 10\.1136/amiajnl\-2011\-000089\.
47. Kahneman, D\. \(2011\)\. Thinking, Fast and Slow\. Farrar, Straus and Giroux\.
48. Yu, C\., Engelmann, S\., Cao, R\., Ali, D\. & Papakyriakopoulos, O\. \(2026\)\. How should AI safety benchmarks benchmark safety? arXiv:2601\.23112\. DOI: 10\.48550/arXiv\.2601\.23112\.
49. Torres, I\., & Evals\-Consensus\.AI Consortium\. \(2026\)\. A call to join a collective effort on AI evaluation\. Patterns, 7\(3\), 101512\. DOI: 10\.1016/j\.patter\.2026\.101512\.
50. Kundel, H\. L\., Nodine, C\. F\. & Carmody, D\. \(1978\)\. Visual scanning, pattern recognition and decision\-making in pulmonary nodule detection\. Invest\. Radiol\. 13, 175–181\. DOI: 10\.1097/00004424\-197805000\-00001\.
51. Krupinski, E\. A\. \(2010\)\. Current perspectives in medical image perception\. Atten\. Percept\. Psychophys\. 72, 1205–1217\. DOI: 10\.3758/APP\.72\.5\.1205\.
52. de Cassai, A\., Negro, S\., Geraldini, F\., Boscolo, A\., Sella, N\., Munari, M\. & Navalesi, P\. \(2021\)\. Inattentional blindness in anesthesiology: A gorilla is worth one thousand words\. PLoS ONE 16, e0257508\. DOI: 10\.1371/journal\.pone\.0257508\.
53. Bogdoll, D\., Nitsche, M\. & Zöllner, J\. M\. \(2022\)\. Anomaly detection in autonomous driving: A survey\. Proc\. IEEE/CVF CVPR Workshops 2022\. DOI: 10\.1109/CVPRW56347\.2022\.00495\. arXiv:2204\.07974\.
54. Chan, R\., Lis, K\., Uhlemeyer, S\., Blum, H\., Honari, S\., Siegwart, R\., Fua, P\., Salzmann, M\. & Rottmann, M\. \(2021\)\. SegmentMeIfYouCan: A benchmark for anomaly segmentation\. Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\. arXiv:2104\.14812\.
The complete annotated bibliography of 116 works, each carrying a finding and a positioning sentence, is provided as Supplementary File S1, which also records Crossref verification status\.