Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

arXiv cs.CL Papers

Summary

This paper investigates seemingly contradictory findings on whether large vision-language models (LVLMs) can coordinate efficient referring expressions. The authors show that models can achieve efficiency when explicitly prompted, but fail to infer the need for efficiency from implicit prompts, revealing key differences between human and AI communication.

arXiv:2606.17372v1 Announce Type: new Abstract: Two recent studies (Jones et al. (2026); Zeng et al. (2026)) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.
Original Article
View Cached Full Text

Cached at: 06/17/26, 05:40 AM

# Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication
Source: [https://arxiv.org/html/2606.17372](https://arxiv.org/html/2606.17372)
Peter Zeng1,4Amie Paige2Weiling Li2 Susan E\. Brennan2Owen Rambow3,4Cameron R\. Jones2 1Department of Computer Science2Department of Psychology 3Department of Linguistics4Institute for Advanced Computational Science Stony Brook University Correspondence:[pezeng@cs\.stonybrook\.edu](https://arxiv.org/html/2606.17372v1/[email protected])

###### Abstract

Two recent studies \(Joneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\); Zenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)\) reach apparently contradictory conclusions about whether large vision\-language models \(LVLMs\) can coordinate on efficient referring expressions\. We control for task differences between the studies while directly comparing their prompting styles\. We replicate the finding that models can coordinate efficient referring expressions whenexplicitlyprompted to do so, suggesting that other task differences are not responsible for divergent results\. However, we also find that the same models fail to infer the need for communicative efficiency from a moreimplicitprompt, highlighting critical differences between how humans and AI systems communicate\.

Implicit vs\. Explicit Prompting Strategies for LVLMs in Referential Communication

Peter Zeng1,4Amie Paige2Weiling Li2Susan E\. Brennan2Owen Rambow3,4Cameron R\. Jones21Department of Computer Science2Department of Psychology3Department of Linguistics4Institute for Advanced Computational ScienceStony Brook UniversityCorrespondence:[pezeng@cs\.stonybrook\.edu](https://arxiv.org/html/2606.17372v1/[email protected])

## 1Introduction

For AI agents to collaborate successfully with human partners, they need to be able to refer to objects in ways that their partners will understand, and in turn, resolve their partners’ referring expressions\. Toward this end, a spate of recent studies\(Hua and Artzi,[2024](https://arxiv.org/html/2606.17372#bib.bib35); Huaet al\.,[2025](https://arxiv.org/html/2606.17372#bib.bib42); Tanet al\.,[2025](https://arxiv.org/html/2606.17372#bib.bib56)\)has examined how large vision\-language models \(LVLMs\) interact with humans during referential communication tasks in which a director refers to target objects that a matcher must identify from a larger set\.

Human partners do this flexibly, by proposing and ratifying or amending expressions as they develop a shared perspective on a given referent\. Once two human partners come to believe that they have the same referent in mind, they tend to entrain on the same expression when referring to that object again \(typically, in a concise form\), signaling that they’ve reached aconceptual pact, or flexible, temporary perspective on the objectBrennan and Clark \([1996](https://arxiv.org/html/2606.17372#bib.bib8)\)\. In this way, the common ground that accrues during dialogue allows partners toentrainon referring expressions, making communication not only accurate, but efficient\(Clark and Wilkes\-Gibbs,[1986](https://arxiv.org/html/2606.17372#bib.bib15)\)\.

Recent work is divided on whether AI agents can perform similarly to humans, and critically, to what extent entrainment emerges from prompting techniques or characteristics of the task\. Specifically, two recent studies come to opposite conclusions about whether AI agents show human\-like behavior in referential communication\.Zenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)employed a matching task in which two partners \(either two humans, two AI agents, or a mixed pair\) worked together to recreate a set order of baskets across 4 rounds\. Unlike in human\-human pairs, AI directors persisted in needlessly lengthy descriptions and the accuracy of each round actually decreased over time, suggesting they were not using common ground to increase communicative efficiency\. Conversely, in a similar taskJoneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\)found that AI\-AI pairs shortened their referring expressions over time\. Moreover, AI\-AI pairs’ accuracy increased across rounds: outperforming human\-human pairs and suggesting that models could adapt referring expressions to reduce their length while improving or maintaining accuracy\.

In this short paper, we investigate why these studies produce apparently contradictory results\. First, we compare prompting styles\.Joneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\)directly instructed the model about specific, surface\-level properties of turns, such as to try to use only 1\-2 words in later rounds \(which we call an “explicit" prompt\)\.Zenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)’s prompt was less direct \(an “implicit” prompt\) and included the pragmatic principlebe concise but informative; this would allow any linguistic adaptation to emerge dynamically \(see Appendix[A](https://arxiv.org/html/2606.17372#A1)for the full prompts\)\. Second, we test whether newer model versions account for differences in performance\. Third, we control for other differences\. InJoneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\), partners switched director/matcher roles after each of 5 rounds to match one of 10 tangrams \(abstract geometric objects\), while inZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\), each partner maintained the same role throughout 4 rounds of matching pictures of 12 out of 18 baskets \(a more difficult task, and less amenable to figurative descriptions\)\.

#### Contributions

This paper reconciles recent conflicting findings on whether referring by LVLMs is truly human\-like\. We show that the divergence between prior studies stems from prompting style rather than model version or task: LVLMs shorten repeated referring expressions when prompted forcefully to use few words, but not when prompted more implicitly to be concise but informative \(see examples in Figure[1](https://arxiv.org/html/2606.17372#S1.F1)\)\.

![Refer to caption](https://arxiv.org/html/2606.17372v1/figures/Spirit_diagram_1.png)Figure 1:Example trajectories of referring expressions across repeated rounds in human\-human dialogue, left \(adapted fromZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)\) and right, AI\-AI dialogue from our experiments underimplicitandexplicitprompting conditions \(similar toZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)andJoneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\), respectively\) with GPT\-5\.2 and GPT\-5\.5\.

## 2Related Work

Extensive prior work in psycholinguistics has shown that humans become increasingly accurate and efficient in multi\-turn referential communication by developing common ground and reusing partner\-specific referring expressions over repeated interactionsClark and Wilkes\-Gibbs \([1986](https://arxiv.org/html/2606.17372#bib.bib15)\); Brennan and Clark \([1996](https://arxiv.org/html/2606.17372#bib.bib8)\); Hawkinset al\.\([2020](https://arxiv.org/html/2606.17372#bib.bib41)\)\.

Recent work has investigated whether LVLMs exhibit similar collaborative behavior in referential communication tasks\.Hua and Artzi \([2024](https://arxiv.org/html/2606.17372#bib.bib35)\)evaluated five state\-of\-the\-art LVLMs as directors \(speaker\) or matchers \(listeners\) in a six\-round referential communication task and found that LVLMs did not spontaneously shorten their referring expressions or adapt to partner behavior across rounds\. Subsequent work showed that post\-training methods can induce more human\-like compression and consistency, leading to shorter descriptions and improved accuracy across rounds\(Huaet al\.,[2025](https://arxiv.org/html/2606.17372#bib.bib42)\)\. However, these behaviors were induced through optimization procedures rather than emerging naturally through interaction with a partner\.

Other studies have examined whether LVLMs can maintain common ground and update conversational state in human\-AI interaction\.Wanget al\.\([2025](https://arxiv.org/html/2606.17372#bib.bib55)\)found substantial performance gaps between LVLM and human overhearers in a multi\-round referential communication task\. Even with access to the full dialogue history and unlimited memory, LVLMs did not consistently improve across rounds or show reliable benefits from entering the interaction earlier\.Tanet al\.\([2025](https://arxiv.org/html/2606.17372#bib.bib56)\)tested open\-weight LVLMs on a tangram\-matching corpus; errors produced by models correlated poorly with human error patterns trial\-by\-trial\. This suggests that models may rely on different underlying mechanisms for solving the task\.Poelitzet al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib54)\)deployed GPT\-4\.1 in a human\-AI puzzle task where the model served either as director \(helper\) or matcher \(worker\)\. They found limited grounding behavior, including failures to provide clarifications or repairs after requests to do so, and failures to update assumptions after corrections\. They also found weak and decreasing lexical entrainment over rounds\. Human partners were more likely to adopt AI\-proposed referring expressions than vice versa \(an asymmetry in collaboration\)\.

Together, these findings suggest that although LVLMs can sometimes achieve high task accuracy, they still fail to consistently exhibit the collaborative and adaptive behavior observed in human conversational partners\.

## 3Experiment Methodology

The present experiment used the open\-source pipeline ofZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)to implement a multi\-round, multi\-turn collaborative object\-matching task\. We chose baskets as the non\-lexicalized objects to match as inZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\), with 5 rounds of matching as inJoneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\)\. We chose GPT\-5\.2 and GPT\-5\.5 as the models\(OpenAI,[2026](https://arxiv.org/html/2606.17372#bib.bib57)\); the former, used byZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\), served as a way to test our implementation and corrections to their codebase, and the latter was the latest model at the time of this writing\. To focus on discrepancies from the two studies in question, our experiment used only AI–AI pairs, with director/matcher roles played by the same model\.

### 3\.1CorrectingZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)’s Pipeline

In their original prompting framework, the multimodal visual context \(the composite image of the current round’s target arrangement\) was injected as a static image at the very beginning of the context window, directly following the system instructions\.

We refactored the prompt construction pipeline to enforce strict chronological alignment and improve visual state tracking, with the prompts shown in Appendix[B](https://arxiv.org/html/2606.17372#A2)\. Past round chat histories and their corresponding visual feedback images are kept paired chronologically at the beginning of the context\. The image grid for the current round is then dynamically injected using explicit round boundary markers, ensuring it serves as the freshest visual frame directly preceding the LVLM’s next conversational action\. In addition, we unified the matcher’s visual input by rendering their active 12\-slot sequence state and candidate pool within a single composite image, supplemented by structured text reminders highlighting vacant positions\. This ensures that all spatial and state tracking operations are grounded in a single, temporally aligned visual representation\.

### 3\.2Prompt Implementation

#### Recreation of Implicit Prompt

The implicit prompting style replicates the pragmatically informed prompt design fromZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)\. It instructs the LVLM using conversational principles based on Gricean maxims and collaborative grounding norms \(e\.g\., verifying correctness, rephrasing on confusion, and requesting re\-descriptions of empty slots\)\. Crucially, this prompt does not instruct the model to compress phrasing or reuse words across rounds\. Any lexical entrainment or shortened referring expressions \(RE\) would arise spontaneously, emerging from the pair minimizing their collaborative effort\.

#### Recreation of Explicit Prompt

The explicit prompting style is adapted fromJoneset al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib49)\)’s “humanlike” prompting strategy to fit the current task\. In their original tangram\-matching task, they explicitly instructed the LVLM to reduce description lengths over rounds to simulate human\-style entrainment\. Some examples of this heavy\-handedness include instructions such as “Your descriptions should be AS SHORT AS POSSIBLE\.", and “SERIOUSLY–in later rounds just 1\-2 words"\. We adapt this strategy to the current, more complex basket\-matching task, in which the director is explicitly instructed to track the history of its descriptions and systematically shorten referring expressions over rounds\. This encourages the model to drop redundant descriptors by explicit instruction rather than to do so spontaneously\.

## 4Results

![Refer to caption](https://arxiv.org/html/2606.17372v1/figures/metrics.jpg)Figure 2:Trends over five rounds foraccuracy\(%\),numbers of words,number of turns,number of words referring expressions, andproportion of lexical overlap with prior roundsby prompt–model condition\. Dotted lines show implicit and solid lines show explicit prompting conditions; GPT\-5\.2 is in blue and GPT\-5\.5, orange\.We evaluated 40 complete AI–AI runs, crossing two prompt strategies \(Implicit vs\. Explicit\) with two model versions \(GPT\-5\.2 vs\. GPT\-5\.5\)\. Each game consisted of five repeated rounds over the same set of baskets, yielding 200 round\-level observations\. We report task accuracy, total dialogue length, number of dialogue turns, referring\-expression \(RE\) length, and lexical overlap with previous rounds, shown in Figure[2](https://arxiv.org/html/2606.17372#S4.F2)\.

Across all experiments, all Prompt x Model conditions achieved high task accuracy, but differed sharply in communicative efficiency and lexical convergence\. Under the implicit prompt, which encouraged cooperative communication without explicit instruction to shorten expressions, both models remained verbose: GPT\-5\.2 averaged 1250\.7 words per round and GPT\-5\.5 averaged 710\.4, with only modest reductions across rounds\. In contrast, the explicit prompt strongly compressed wording, by 62\.8% for GPT\-5\.2 and 75\.6% for GPT\-5\.5 relative to their implicit counterparts\. The clearest pattern to suggest reaching conceptual pacts appeared for explicit GPT\-5\.5: referring\-expression length fell from 58\.8 words in Round 1 to 32\.7 in Round 5, lexical overlap reached 1\.00, and accuracy remained high at 97\.5%\. GPT\-5\.2 also shortened under the explicit prompt, but its Round 5 accuracy dropped to 92\.5%, suggesting an accuracy–brevity tradeoff\. These findings are in line with\(Joneset al\.,[2026](https://arxiv.org/html/2606.17372#bib.bib49)\): LVLMs can produce short, stable referring expressions when explicitly told to do so, suggesting that prompt design \(rather than other task differences\) accounted for the difference in results between these studies\. However, they replicateZenget al\.\([2026](https://arxiv.org/html/2606.17372#bib.bib53)\)’s finding that without such explicit instruction, LVLMs fail to infer these cooperative communicative practices from interactional needs alone \(see Figure[1](https://arxiv.org/html/2606.17372#S1.F1)\)\.

### 4\.1Analysis of Transcripts

Transcript inspection shows that explicit prompting changes the form of the interaction, not just turn length\. Under the explicit prompt, GPT\-5\.5 introduces compact but discriminative labels in Round 1 and then prunes them into stable 2\- or 3\-word descriptions, often telegraphic or abbreviated \(e\.g\., "rect picnic", "tall cylinder", "bunny red eye"\)\. The matcher echoes these descriptions, with the guise of conceptual pact formation\. Under the implicit prompt, GPT\-5\.5 also reuses lexical material across rounds, but retains full descriptive captions and confirmation routines\. The few lower\-accuracy explicit sessions further show that compression is successful only when the retained label remains contrastively sufficient; stable labels such as "dark round basket" may still be too underspecified in a visually crowded set\.

## 5Conclusion

This study helps explain why recent work has reached different conclusions about whether LVLMs form human\-like conceptual pacts in referential communication\. With an implicit, pragmatically informed prompt, models are accurate but needlessly verbose: they reuse visual descriptions across rounds, yet do not spontaneously treat the accumulating history as a license to say less\. With an explicit prompt to shorten and reuse expressions, GPT\-5\.5 produced the human\-like surface pattern of lexical entrainment: descriptions became shorter, more stable, and increasingly pact\-like while accuracy remained high\. However, this behavior does not emerge from anything like the common ground established by humans\.

This matters because entrainment is not simply compression\. In human dialogue, a shortened referring expression is evidence that partners have coordinated on a perspective they can now rely on as shared\. Current LVLMs can be guided to reproduce this outward form, but the prompt\-dependence of the effect warrants caution about attributing the effect to the same underlying process\.

## Limitations

This study was conducted only in English, with only one type of object \(not conventionally lexicalized\), and with only two LVLMs for the full factorial design \(GPT\-5\.2/5\.5\)\. Future work should compare non\-proprietary models and downstream work such as fine\-tuning to improve accuracy in this task and efficiency of referring expressions\.

This study is also limited because it compares the collaboration between pairs of the same models \(e\.g\., GPT\-5\.5 with GPT 5\.5\)\. In doing so, two instances of the same model may align more readily to a proposed referring expression as compared to two different models or two different humans\. Psycholinguistic research proposes that referring is a collaborative process, wherein partners may begin with different perspectives, but must work together to converge \(or not\) on a referring expression that works well enough in the momentClark and Wilkes\-Gibbs \([1986](https://arxiv.org/html/2606.17372#bib.bib15)\)\. In fact, early shortened expressions may be harmful for collaborators who do not see a particular object with the same perspective\. This is to say that simply producing the behavior with LVLMs is not sufficient for the behavior to be helpful\. Additional work should evaluate these prompting strategies in situations where initial perspectives are perhaps not so readily aligned and where models must instead produce the behavior pragmatically, using evidence of understanding from their partner\.

The current study evaluated the conditions under which human\-like entrainment is produced with LVLMs in relation to accuracy in a referential communication task\.

## References

- Conceptual pacts and lexical choice in conversation\.\.Journal of experimental psychology: Learning, memory, and cognition22\(6\),pp\. 1482\.Cited by:[§1](https://arxiv.org/html/2606.17372#S1.p2.1),[§2](https://arxiv.org/html/2606.17372#S2.p1.1)\.
- H\. H\. Clark and D\. Wilkes\-Gibbs \(1986\)Referring as a collaborative process\.Cognition22\(1\),pp\. 1–39\.Cited by:[§1](https://arxiv.org/html/2606.17372#S1.p2.1),[§2](https://arxiv.org/html/2606.17372#S2.p1.1),[Limitations](https://arxiv.org/html/2606.17372#Sx1.p2.1)\.
- R\. D\. Hawkins, M\. C\. Frank, and N\. D\. Goodman \(2020\)Characterizing the dynamics of learning in repeated reference games\.Cognitive science44\(6\),pp\. e12845\.Cited by:[§2](https://arxiv.org/html/2606.17372#S2.p1.1)\.
- Y\. Hua and Y\. Artzi \(2024\)Talk less, interact better: evaluating in\-context conversational adaptation in multimodal LLMs\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=lVOw78nYXS)Cited by:[§1](https://arxiv.org/html/2606.17372#S1.p1.1),[§2](https://arxiv.org/html/2606.17372#S2.p2.1)\.
- Y\. Hua, E\. Wang, and Y\. Artzi \(2025\)Post\-training for efficient communication via convention formation\.arXiv preprint arXiv:2508\.06482\.Cited by:[§1](https://arxiv.org/html/2606.17372#S1.p1.1),[§2](https://arxiv.org/html/2606.17372#S2.p2.1)\.
- C\. R\. Jones, A\. Lombardi, K\. Mahowald, and B\. K\. Bergen \(2026\)LLMs and people both learn to form conventions–just not with each other\.arXiv preprint arXiv:2602\.08208\.Cited by:[Figure 1](https://arxiv.org/html/2606.17372#S1.F1),[§1](https://arxiv.org/html/2606.17372#S1.p3.1),[§1](https://arxiv.org/html/2606.17372#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.17372#S3.SS2.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17372#S3.p1.1),[§4](https://arxiv.org/html/2606.17372#S4.p2.1)\.
- OpenAI \(2026\)GPT\-5\.5 System Card\.External Links:[Link](https://openai.com/index/gpt-5-5-system-card/)Cited by:[§3](https://arxiv.org/html/2606.17372#S3.p1.1)\.
- C\. Poelitz, F\. Doshi\-Velez, and S\. Lindley \(2026\)A benchmark to assess common ground in human\-ai collaboration\.arXiv preprint arXiv:2602\.21337\.Cited by:[§2](https://arxiv.org/html/2606.17372#S2.p3.1)\.
- A\. W\. M\. Tan, B\. Prystawski, V\. Boyce, and M\. C\. Frank \(2025\)Context informs pragmatic interpretation in vision\-language models\.arXiv preprint arXiv:2511\.03908\.Cited by:[§1](https://arxiv.org/html/2606.17372#S1.p1.1),[§2](https://arxiv.org/html/2606.17372#S2.p3.1)\.
- Z\. Wang, W\. Li, P\. Kaliosis, O\. Rambow, and S\. E\. Brennan \(2025\)LVLMs are bad at overhearing human referential communication\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 16769–16793\.Cited by:[§2](https://arxiv.org/html/2606.17372#S2.p3.1)\.
- P\. Zeng, W\. Li, A\. Paige, Z\. Wang, P\. Kaliosis, D\. Samaras, G\. Zelinsky, S\. Brennan, and O\. Rambow \(2026\)LVLMs and humans ground differently in referential communication\.arXiv preprint arXiv:2601\.19792\.Cited by:[Figure 1](https://arxiv.org/html/2606.17372#S1.F1),[§1](https://arxiv.org/html/2606.17372#S1.p3.1),[§1](https://arxiv.org/html/2606.17372#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.17372#S3.SS1),[§3\.2](https://arxiv.org/html/2606.17372#S3.SS2.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.17372#S3.p1.1),[§4](https://arxiv.org/html/2606.17372#S4.p2.1)\.

## Appendix ASystem Prompts

This section contains the full system prompts used for both the Director and Matcher roles in the Explicit and Implicit prompting conditions\. Throughout this appendix, beige prompt boxes correspond to Director prompts, and blue prompt boxes correspond to Matcher prompts\.

### A\.1Explicit Prompts

The Explicit Director system prompt and output constraints are shown in Figure[3](https://arxiv.org/html/2606.17372#A1.F3)and Figure[4](https://arxiv.org/html/2606.17372#A1.F4), respectively\. The Explicit Matcher system prompt and output constraints are shown in Figure[5](https://arxiv.org/html/2606.17372#A1.F5)and Figure[6](https://arxiv.org/html/2606.17372#A1.F6), respectively\.

``Figure 3:Explicit prompting condition: Director system prompt\. The highlighted sections are instances of heavy\-handed prompting specifically instructing the model to shorten expressions\.``Figure 4:Explicit prompting condition: Director output constraints\.``Figure 5:Explicit prompting condition: Matcher system prompt\.``Figure 6:Explicit prompting condition: Matcher output constraints\.
### A\.2Implicit Prompts

The Implicit Director system prompt and output constraints are shown in Figure[7](https://arxiv.org/html/2606.17372#A1.F7)and Figure[8](https://arxiv.org/html/2606.17372#A1.F8), respectively\. The Implicit Matcher system prompt and output constraints are shown in Figure[9](https://arxiv.org/html/2606.17372#A1.F9)and Figure[10](https://arxiv.org/html/2606.17372#A1.F10), respectively\.

``Figure 7:Implicit prompting condition: Director system prompt\.``Figure 8:Implicit prompting condition: Director output constraints\.``Figure 9:Implicit prompting condition: Matcher system prompt\.``Figure 10:Implicit prompting condition: Matcher output constraints\.

## Appendix BVisual Context Correction

This section contains the visual context injected into the prompts used for both the Director and Matcher roles in the Implicit and Explicit prompting conditions\. As above, beige prompt boxes correspond to Director prompts, and blue prompt boxes correspond to Matcher prompts\.

### B\.1Current Round Active Grid Prompts

The current\-round active\-grid visual context prompt injection for the Director is shown in Figure[11](https://arxiv.org/html/2606.17372#A2.F11), and the corresponding current\-round visual context prompt injection for the Matcher is shown in Figure[12](https://arxiv.org/html/2606.17372#A2.F12)\.

### B\.2Historical Round Feedback Prompts

The historical\-round feedback visual context prompt injection for the Director is shown in Figure[13](https://arxiv.org/html/2606.17372#A2.F13), and the corresponding historical\-round feedback prompt injection for the Matcher is shown in Figure[14](https://arxiv.org/html/2606.17372#A2.F14)\.

``Figure 11:Current\-round active\-grid visual context prompt injection for the Director\.``Figure 12:Current\-round active\-grid visual context prompt injection for the Matcher\.``Figure 13:Historical\-round feedback visual context prompt injection for the Director\.``Figure 14:Historical\-round feedback visual context prompt injection for the Matcher\.

Similar Articles

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

arXiv cs.CL

This paper investigates prompt-induced hallucinations in vision-language models through mechanistic analysis, identifying specific attention heads responsible for the models' tendency to favor textual prompts over visual evidence. The authors demonstrate that ablating these PIH-heads reduces hallucinations by at least 40% without additional training, revealing model-specific mechanisms underlying this failure mode.

LLMs Can Better Capture Human Judgments--With the Right Prompts

arXiv cs.CL

This paper presents simple prompting strategies that help large language models better capture the full distribution of human judgments, improving alignment on moral scenarios and beliefs. The authors show that asking models to report standard deviations and response proportions, along with ensuring scenario clarity, yields better agreement with human responses.

Are you speaking my languages? On spoken language adherence in multimodal LLMs

arXiv cs.CL

This paper addresses the problem of spoken language adherence in multimodal LLMs for ASR, proposing a soft prompting approach and novel metric to quantify language violations. It evaluates three mitigation strategies—zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning—across multiple languages to improve transcription fidelity.