PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

arXiv cs.AI 06/12/26, 04:00 AM Papers
listwise-ranking multimodal parse-collapse parameterized-representation hypernetwork lora generative-ranking
Summary
PRISMR proposes a framework using hypernetworks and LoRA to internalize list structure, overcoming parse collapse in multimodal listwise ranking. It introduces a large-scale benchmark and shows reduced parse collapse and improved ranking performance across domains and backbones.
arXiv:2606.12942v1 Announce Type: new
Cached at: 06/12/26, 08:54 AM
# PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization
Source: [https://arxiv.org/html/2606.12942](https://arxiv.org/html/2606.12942)
Hao Jiang¹, Xin Li¹, Annan Wang¹, Zhi Yang², Haoxiang Zhang², Yichi Zhang³, Weisi Lin¹

¹Nanyang Technological University, ²Peking University, ³Independent Researcher

{jianghao907, lixin.1997.lixin, haoxiang.z.f, yichizhang0926}@gmail.com, annan001@e.ntu.edu.sg, zhiyang25@stu.pku.edu.cn, WSLin@ntu.edu.sg

###### Abstract

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, *parse collapse*, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.

## 1 Introduction

Large Multimodal Models (LMMs) \[[1](https://arxiv.org/html/2606.12942#bib.bib1),[30](https://arxiv.org/html/2606.12942#bib.bib30),[33](https://arxiv.org/html/2606.12942#bib.bib33)\] have demonstrated remarkable capabilities in reasoning across intertwined text and visual contexts. As these models are increasingly deployed in Learning-to-Rank (LTR) scenarios, existing ranking paradigms face a persistent trade-off in long-context settings \[[12](https://arxiv.org/html/2606.12942#bib.bib12)\]. Pointwise scoring is computationally efficient but fails to account for list-level interactions, leading to miscalibrated top-$k$ rankings \[[15](https://arxiv.org/html/2606.12942#bib.bib15),[7](https://arxiv.org/html/2606.12942#bib.bib7)\]. Pairwise ranking captures relative preferences but entails a prohibitive $\mathcal{O}(N^{2})$ computational complexity during inference. While generative listwise ranking \[[8](https://arxiv.org/html/2606.12942#bib.bib8),[16](https://arxiv.org/html/2606.12942#bib.bib16),[3](https://arxiv.org/html/2606.12942#bib.bib3),[32](https://arxiv.org/html/2606.12942#bib.bib32),[24](https://arxiv.org/html/2606.12942#bib.bib24),[25](https://arxiv.org/html/2606.12942#bib.bib25),[17](https://arxiv.org/html/2606.12942#bib.bib17)\] theoretically resolves these inefficiencies by evaluating the entire candidate list to leverage global context, it becomes highly unstable and computationally expensive as candidate lists grow.

This instability manifests as a severe generative fragility when deploying LMMs for multimodal listwise ranking. As seen in Fig. [1](https://arxiv.org/html/2606.12942#S1.F1), when processing long multimodal contexts, where extensive text interleaves with dense visual features, LMMs suffer from exacerbated attention dilution and the “lost in the middle” phenomenon \[[14](https://arxiv.org/html/2606.12942#bib.bib14)\]. Consequently, LMMs frequently experience a catastrophic *parse collapse*: the autoregressive generation process fails to adhere to strict structural constraints, resulting in severe hallucinations, missing candidates, and unparseable conversational outputs rather than the requested ranked lists. Existing adaptation approaches fall short of addressing this bottleneck. Standard Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) attempt to enforce output formats by permanently altering global weights, but they struggle to adapt dynamically to highly variable, instance-level multimodal contexts. Conversely, traditional Context Distillation (CD) methods require computationally prohibitive per-prompt gradient updates, rendering them unsuitable for real-time, large-scale ranking systems.

![Refer to caption](https://arxiv.org/html/2606.12942v1/x1.png)

Figure 1: Conventional generative listwise ranking feeds all multimodal candidates into the decoder context, causing attention dilution and parse collapse as the list grows.

To overcome the inherent fragility of listwise generation and the inefficiency of current adaptation methods, we propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking). Shifting the paradigm from traditional in-context prompt processing to parametric structural conditioning, PRISMR leverages a lightweight hypernetwork to encode both the complex system instructions and the rich multimodal candidate contexts (review content, item titles, and images) into a Low-Rank Adaptation (LoRA) module via a single feed-forward pass. By projecting the cross-modal interactions and ranking-task constraints into a low-rank weight delta prior to decoding, PRISMR substantially reduces the in-context burden that drives parse collapse. The generated LoRA acts as a strong instance-specific structural prior during decoding, empirically pushing per-slot parse rate above 99.9% in our experiments while retaining the base model’s generalized world knowledge and zero-shot capabilities.

Extensive experiments demonstrate that PRISMR establishes a new state\-of\-the\-art for multimodal listwise ranking\. Our main contributions are summarized as follows:

- A length-adaptive PRISMR framework for multimodal listwise ranking. We propose PRISMR, a parametric internalization framework that maps a long multimodal listwise instance into a feed-forward LoRA update, thereby moving candidate information from fragile prompt tokens into structured parameter space. A single trained hypernetwork supports two zero-cost test-time synthesis modes: $\alpha$-mode ($\sum_{i}B_{i}A_{i}$), which provides higher in-distribution capacity, and $\beta$-mode ($\tfrac{1}{N}\sum_{i}B_{i}A_{i}$), which improves robustness under length extrapolation.
- A multimodal listwise review-ranking benchmark. We construct a domain-specific multimodal review-ranking benchmark from Amazon Reviews 2023 \[[9](https://arxiv.org/html/2606.12942#bib.bib9)\], where each example contains multiple reviews of the same product with titles, textual content, and user-uploaded images. The benchmark provides listwise supervision over review quality and will be publicly released to support research on long-context multimodal ranking.
- A systematic analysis of parse collapse and ranking quality. We identify parse collapse as a dominant failure mode of generative LMM listwise ranking, where models silently omit candidates or fail to produce valid rankings. We characterize its dependence on list length and image density, and show that PRISMR improves format reliability.

## 2 Related Work

### 2.1 Multimodal Listwise Ranking and Generative Fragility

Generative Large Language Models (LLMs) have reshaped Information Retrieval (IR) and Learning-to-Rank (LTR). Pointwise methods scale linearly but ignore list-level dynamics, whereas generative listwise approaches like RankGPT \[[29](https://arxiv.org/html/2606.12942#bib.bib29)\] evaluate entire candidate lists to leverage global context. To directly optimize listwise metrics, recent works propose differentiable surrogate losses (e.g., diffNDCG \[[22](https://arxiv.org/html/2606.12942#bib.bib22)\]) and Permutative Preference Alignment (PPA) \[[34](https://arxiv.org/html/2606.12942#bib.bib34)\]. However, extending listwise paradigms to Large Multimodal Models (LMMs) introduces severe generative fragility. Processing long sequences of interleaved text and dense visual features exacerbates attention dilution and the “lost in the middle” phenomenon \[[14](https://arxiv.org/html/2606.12942#bib.bib14)\], triggering a “parse collapse” where models fail to respect output formats. While standard Supervised Fine-Tuning (SFT), Direct Preference Optimization \[[23](https://arxiv.org/html/2606.12942#bib.bib23)\], and reinforcement learning (e.g., GRPO \[[26](https://arxiv.org/html/2606.12942#bib.bib26)\]) mitigate this on static distributions, they struggle to adapt dynamically to highly variable, instance-level multimodal contexts without catastrophic forgetting or reasoning overhead. Logits-only listwise rerankers like FIRST \[[24](https://arxiv.org/html/2606.12942#bib.bib24)\] and RankZephyr \[[21](https://arxiv.org/html/2606.12942#bib.bib21)\] sidestep generation failures but inherit the same long-context attention pressure as $N$ scales.

### 2.2 Prompt Internalisation and Context Distillation

To alleviate the computational bottleneck and attention dilution of long prompts, prior work explores Context Distillation (CD) and prompt compression. Compression techniques like LLMLingua \[[13](https://arxiv.org/html/2606.12942#bib.bib13)\] and LLMLingua-2 \[[19](https://arxiv.org/html/2606.12942#bib.bib19)\] drop tokens via information entropy, but inherently lose information and do not bypass context-window limits; we empirically confirm in Section [4.2](https://arxiv.org/html/2606.12942#S4.SS2) that token-budget compression does *not* alleviate parse collapse. Gisting \[[18](https://arxiv.org/html/2606.12942#bib.bib18)\] learns special “gist” tokens that compress an instruction prompt into a small constant number of soft tokens, while Snell-CD \[[28](https://arxiv.org/html/2606.12942#bib.bib28)\] amortises a long context into model parameters by token-level distillation from a context-augmented teacher. Generative Prompt Internalisation (GenPI) \[[27](https://arxiv.org/html/2606.12942#bib.bib27)\] jointly distils both the teacher’s outputs and the prompt content. All three target a *single* long context, whereas multimodal listwise ranking presents $N$ short multimodal items and requires the parametric encoding to be aware of relative ordering across the $N$ items.

### 2.3 Hypernetwork-based PEFT

Our work builds on hypernetwork-based PEFT \[[10](https://arxiv.org/html/2606.12942#bib.bib10)\]: a small network produces task-specific weights conditioned on auxiliary information. HyperTuning \[[20](https://arxiv.org/html/2606.12942#bib.bib20)\] predicts soft prompts for a frozen language model conditioned on a few-shot description; HINT \[[11](https://arxiv.org/html/2606.12942#bib.bib11)\] predicts adapter parameters from natural-language instructions for zero/few-shot task generalisation; Text-to-LoRA \[[4](https://arxiv.org/html/2606.12942#bib.bib4)\] predicts LoRA weights from a task description for zero-shot adaptation; Doc-to-LoRA \[[5](https://arxiv.org/html/2606.12942#bib.bib5)\] uses a Perceiver-based architecture to internalise a single long document into a LoRA adapter in one feed-forward pass. PRISMR is the listwise multimodal counterpart of this line of work: a single global hypernetwork encodes each of $N$ multimodal candidates into a per-item LoRA, and we explicitly study how to combine the $N$ adapters into one composite increment $\Delta W$. We derive closed-form rank-concatenation and mean-pooling operators, revealing a capacity and length-robustness trade-off unique to listwise hypernetwork PEFT.

## 3 Method

We formalize multimodal listwise ranking and present PRISMR, a framework that replaces long in-context list conditioning with instance-specific parametric conditioning. Instead of feeding all multimodal candidates into the decoder prompt, PRISMR uses a shared hypernetwork to encode each candidate into a low-rank adapter. The resulting adapters are then synthesized into a single weight increment mounted on a frozen LMM for one-shot listwise decoding. Figure [2](https://arxiv.org/html/2606.12942#S3.F2) illustrates the overall architecture.

![Refer to caption](https://arxiv.org/html/2606.12942v1/x2.png)

Figure 2: Overview of PRISMR. A shared hypernetwork $\mathcal{H}_{\phi}$ maps each multimodal candidate $d_{i}$ to an item-specific LoRA adapter $(A_{i},B_{i})$. The $N$ adapters are combined by one of two zero-cost synthesis modes of the same checkpoint, rank-dimension concatenation ($\alpha$-mode) or mean pooling ($\beta$-mode), yielding a composite increment $\Delta W$ mounted on the frozen LMM for listwise decoding. PRISMR uses $\alpha$-mode for $N\leq 50$ and $\beta$-mode for $N>50$ by default.

### 3.1 Multimodal listwise ranking and parse collapse

Let $\mathcal{D}=\{d_{1},\ldots,d_{N}\}$ denote a list of multimodal candidates, where each candidate $d_{i}=(t_{i},v_{i})$ consists of text $t_{i}$ and visual inputs $v_{i}$. Given an instruction $I$, a frozen LMM $f_{\theta}$ generates a target sequence $y=(y_{1},\ldots,y_{T})$ autoregressively:

$$P_{\theta}(y\mid c)=\prod_{t=1}^{T}P_{\theta}(y_{t}\mid y_{<t},c), \tag{1}$$

where $c$ is the conditioning context.

In generative listwise scoring, the target output is a structured sequence $y^{\star}=(\langle 1:s_{1}\rangle,\ldots,\langle N:s_{N}\rangle)$, where $s_{i}\in[1,10]$ is the teacher score for candidate $i$. The generated scores induce a ranking $\pi$ via a deterministic monotone mapping, and evaluation is performed using NDCG@K against the teacher ordering $\pi^{\star}=\operatorname{argsort}_{i}(-s_{i})$.
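The mapping from scores to a ranking and the NDCG@K evaluation can be illustrated with a minimal sketch. The gain $2^{s}-1$ and the $\log_2$ position discount are assumptions borrowed from the loss definition in Section 3.4; the text does not state which gain the evaluation metric uses, and the function names are hypothetical.

```python
import math

def ranking_from_scores(scores):
    """Indices sorted by descending score; ties keep input order (stable sort)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def dcg_at_k(gains_in_rank_order, k):
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), ...
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains_in_rank_order[:k]))

def ndcg_at_k(pred_scores, teacher_scores, k):
    # Assumed gain: 2^s - 1 computed from the teacher scores.
    gains = [2 ** s - 1 for s in teacher_scores]
    pred_order = ranking_from_scores(pred_scores)
    ideal_order = ranking_from_scores(teacher_scores)
    dcg = dcg_at_k([gains[i] for i in pred_order], k)
    idcg = dcg_at_k([gains[i] for i in ideal_order], k)
    return dcg / idcg if idcg > 0 else 0.0
```

A prediction that reproduces the teacher ordering scores 1.0 under this sketch, and any mis-ordering among the top K positions lowers the value.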

A standard in\-context ranker conditions on the full multimodal list:

$$c_{\mathrm{ICL}}=[I;t_{1},v_{1};\ldots;t_{N},v_{N}],\qquad |c_{\mathrm{ICL}}|=L_{I}+NL_{d}, \tag{2}$$

where $L_{I}$ is the length of the instruction and $L_{d}$ denotes the average candidate length. As $N$ grows, the decoder must attend over increasingly long multimodal contexts while also maintaining a strict output schema. Empirically, this leads to *parse collapse*: the model produces fluent but incomplete outputs, silently omitting candidates or terminating early. PRISMR addresses this failure mode by moving candidate information from the decoder context into instance-specific parameters.

### 3.2 Parametric internalization with a global hypernetwork

PRISMR keeps the base LMM frozen and introduces a shared hypernetwork $\mathcal{H}_{\phi}$. For each candidate $d_{i}$, the hypernetwork predicts low-rank LoRA factors for selected projections in the LMM:

$$\mathcal{H}_{\phi}(d_{i})=\left\{(A_{i}^{(\ell)},B_{i}^{(\ell)})\right\}_{\ell=1}^{L},\qquad A_{i}^{(\ell)}\in\mathbb{R}^{r\times d_{\mathrm{in}}},\quad B_{i}^{(\ell)}\in\mathbb{R}^{d_{\mathrm{out}}\times r}. \tag{3}$$

The induced item-specific increment at layer $\ell$ is

$$\Delta W_{i}^{(\ell)}=B_{i}^{(\ell)}A_{i}^{(\ell)}. \tag{4}$$

The same hypernetwork is shared across all candidates and all examples. Since candidates are encoded independently, their adapters can be generated in parallel.

### 3.3 Adapter synthesis

The $N$ item-specific adapters are combined into a single increment $\Delta W$ before decoding. PRISMR exposes two zero-cost test-time synthesis modes of the *same* trained hypernetwork. The first mode, denoted $\alpha$-mode, concatenates the LoRA factors along the rank dimension, with $A_{\alpha}=[A_{1};\ldots;A_{N}]$ and $B_{\alpha}=[B_{1},\ldots,B_{N}]$, yielding

$$\Delta W_{\alpha}=B_{\alpha}A_{\alpha}=\sum_{i=1}^{N}B_{i}A_{i}. \tag{5}$$

This preserves item-specific subspaces and allows the effective rank to scale with the list size, i.e., $\mathrm{rank}(\Delta W_{\alpha})\leq Nr$.
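The identity in Eq. 5 is a block-matrix product and can be checked numerically. The sketch below uses random matrices as stand-ins for hypernetwork outputs; the toy dimensions are assumptions chosen for illustration and are unrelated to the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, d_in, d_out = 5, 2, 16, 12  # toy sizes (assumed), not the paper's

# Stand-ins for per-item LoRA factors: A_i in R^{r x d_in}, B_i in R^{d_out x r}.
A = [rng.standard_normal((r, d_in)) for _ in range(N)]
B = [rng.standard_normal((d_out, r)) for _ in range(N)]

# alpha-mode: stack the A_i vertically and the B_i horizontally (rank dimension).
A_alpha = np.concatenate(A, axis=0)  # shape (N*r, d_in)
B_alpha = np.concatenate(B, axis=1)  # shape (d_out, N*r)

delta_w_concat = B_alpha @ A_alpha
delta_w_sum = sum(b @ a for a, b in zip(A, B))

# Concatenation along the rank dimension equals the sum of per-item products.
print(np.allclose(delta_w_concat, delta_w_sum))
# The rank of the combined update is bounded by N*r.
print(np.linalg.matrix_rank(delta_w_concat) <= N * r)
```

In practice the concatenated factors can be mounted as a single rank-$Nr$ LoRA, so the full $d_{\mathrm{out}}\times d_{\mathrm{in}}$ matrix need not be materialized.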

The second mode, denoted $\beta$-mode, averages item-level updates:

$$\Delta W_{\beta}=\frac{1}{N}\sum_{i=1}^{N}B_{i}A_{i}. \tag{6}$$

In contrast to $\alpha$-mode, mean pooling keeps the update scale and effective rank bounded as $N$ grows, with $\mathrm{rank}(\Delta W_{\beta})\leq r$.

The two modes therefore induce a capacity–robustness trade-off within a single PRISMR checkpoint. $\alpha$-mode offers higher in-regime capacity because its effective rank grows with $N$, but for much longer lists its update norm can also grow with the number of candidates:

$$\|\Delta W_{\alpha}\|_{2}\leq\sum_{i=1}^{N}\|B_{i}A_{i}\|_{2}. \tag{7}$$

This may shift the adapted model outside the range observed during training. $\beta$-mode instead normalizes the aggregate update, making it more stable under length extrapolation. Crucially, both modes are obtained from a *single* trained hypernetwork at zero additional training cost; PRISMR therefore adopts a simple length-adaptive rule by default: use $\alpha$-mode for $N\leq 50$ (in-regime) and switch to $\beta$-mode for $N>50$ (extrapolation). Unless stated otherwise, “PRISMR” in the experiments below refers to this single checkpoint with the $N=50$ mode-switch threshold.
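The length-adaptive rule and the norm behaviour in Eq. 7 can be sketched as follows. The helper names, toy dimensions, and random factors are assumptions for illustration. The sketch checks only the scale of the update: taken literally, Eq. 6 is Eq. 5 divided by $N$, so the rank statement for $\beta$-mode is not something this code tests.

```python
import numpy as np

def synthesize(A_list, B_list, threshold=50):
    """Combine per-item LoRA factors with the length-adaptive rule (hypothetical helper)."""
    n = len(A_list)
    total = sum(b @ a for a, b in zip(A_list, B_list))  # alpha-mode, Eq. 5
    if n <= threshold:
        return total, "alpha"
    return total / n, "beta"  # beta-mode, Eq. 6

def spectral_norm(m):
    # For a 2-D array, ord=2 gives the largest singular value.
    return np.linalg.norm(m, 2)

def make_adapters(n, r=2, d_in=16, d_out=12, seed=0):
    """Random stand-ins for hypernetwork outputs (toy sizes, assumed)."""
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((r, d_in)) for _ in range(n)]
    B = [rng.standard_normal((d_out, r)) for _ in range(n)]
    return A, B

A, B = make_adapters(60)  # N = 60 > 50, so the rule selects beta-mode
per_item = [spectral_norm(b @ a) for a, b in zip(A, B)]
alpha_update = sum(b @ a for a, b in zip(A, B))
beta_update, mode = synthesize(A, B)

print(mode)
print(spectral_norm(alpha_update) <= sum(per_item))       # Eq. 7
print(spectral_norm(beta_update) <= sum(per_item) / 60)   # same bound scaled by 1/N
```

The bound on the $\alpha$-mode norm grows with the number of terms, while the $\beta$-mode bound is the mean of the per-item norms and therefore does not grow with $N$.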

### 3.4 PRISMR Loss

Training examples $(\mathcal{D},y^{\star})$ are obtained from a frontier teacher model, which emits one structured score line per candidate. We discard trajectories with missing candidates, invalid indices, or malformed score lines. During training, the base LMM remains frozen, and only the hypernetwork parameters $\phi$ are updated.

We first optimize token\-level distillation with the negative log\-likelihood of the teacher sequence:

$$\mathcal{L}_{\mathrm{NLL}}(\phi)=-\sum_{t=1}^{T}\log P_{\theta+\Delta W(\phi)}\left(y_{t}^{\star}\mid y_{<t}^{\star},c_{\mathrm{trig}}\right). \tag{8}$$

This objective encourages the adapted model to reproduce the teacher’s structured scoring output from the short trigger prompt $c_{\mathrm{trig}}$.

To add a ranking-aware signal, we derive differentiable item scores from the same forward pass. Let $\mathcal{T}_{i}$ denote the token positions of the $i$-th score line in $y^{\star}$. We define the model confidence for item $i$ as

$$\hat{s}_{i}=\frac{1}{|\mathcal{T}_{i}|}\sum_{t\in\mathcal{T}_{i}}\log P_{\theta+\Delta W(\phi)}\left(y_{t}^{\star}\mid y_{<t}^{\star},c_{\mathrm{trig}}\right). \tag{9}$$

This score remains differentiable with respect to $\phi$ and serves as a proxy for how well the model supports the teacher-provided score line.
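Given per-token log-probabilities of the teacher sequence, Eqs. 8 and 9 reduce to a sum and a set of per-line means. The sketch below uses plain NumPy on precomputed log-probabilities, so it shows only the arithmetic; an actual training loop would compute these inside an autodiff framework so that gradients reach $\phi$. The function names and the example values are hypothetical.

```python
import numpy as np

def nll(token_logprobs):
    """Eq. 8: negative sum of teacher-token log-probabilities."""
    return -float(np.sum(token_logprobs))

def item_confidences(token_logprobs, line_positions):
    """Eq. 9: mean log-probability over the token positions of each score line.

    line_positions[i] lists the token indices T_i of the i-th score line.
    """
    lp = np.asarray(token_logprobs, dtype=float)
    return np.array([lp[list(pos)].mean() for pos in line_positions])

# Toy example: six teacher tokens, three score lines of two tokens each.
logprobs = [-0.1, -0.2, -0.3, -0.4, -1.0, -2.0]
positions = [[0, 1], [2, 3], [4, 5]]
print(nll(logprobs))
print(item_confidences(logprobs, positions))
```

Averaging over $|\mathcal{T}_{i}|$ keeps the confidence comparable across score lines that tokenize to different lengths.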

We then apply a LambdaRank-style NDCG loss over item pairs. Let $g_{i}=2^{s_{i}}-1$ be the gain, $r_{i}$ the teacher rank, $\delta_{i}=1/\log_{2}(r_{i}+2)$ the discount, and $\mathrm{IDCG}=\sum_{i}g_{i}\delta_{i}$. The NDCG change from swapping items $i$ and $j$ is

$$|\Delta\mathrm{NDCG}_{ij}|=\frac{|g_{i}-g_{j}|\,|\delta_{i}-\delta_{j}|}{\mathrm{IDCG}}. \tag{10}$$

The ranking loss is

$$\mathcal{L}_{\mathrm{rank}}(\phi)=\frac{1}{Z}\sum_{\begin{subarray}{c}i,j:\\ s_{i}>s_{j}\end{subarray}}|\Delta\mathrm{NDCG}_{ij}|\,\mathrm{softplus}\left(-(\hat{s}_{i}-\hat{s}_{j})\right), \tag{11}$$

where $Z=|\{(i,j):s_{i}>s_{j}\}|$. This term emphasizes pairs whose mis-ordering would have a larger impact on NDCG.
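Eqs. 10 and 11 can be written out directly as a double loop over item pairs. The sketch below is a NumPy illustration of the arithmetic only, not the authors' implementation. It assumes that $r_{i}$ is the 0-based position of item $i$ in the descending teacher order (consistent with the $r_{i}+2$ inside the logarithm) and that ties are broken by input order; both are assumptions.

```python
import numpy as np

def lambda_ndcg_loss(teacher_scores, s_hat):
    """LambdaRank-style pairwise loss weighted by |delta NDCG| (Eqs. 10-11)."""
    s = np.asarray(teacher_scores, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    gains = 2.0 ** s - 1.0

    # Teacher rank r_i: 0-based position in descending teacher order (assumed).
    order = np.argsort(-s, kind="stable")
    ranks = np.empty(len(s), dtype=int)
    ranks[order] = np.arange(len(s))
    disc = 1.0 / np.log2(ranks + 2.0)
    idcg = np.sum(gains * disc)

    total, n_pairs = 0.0, 0
    for i in range(len(s)):
        for j in range(len(s)):
            if s[i] > s[j]:
                weight = abs(gains[i] - gains[j]) * abs(disc[i] - disc[j]) / idcg
                # softplus(x) = log(1 + exp(x)), computed stably via logaddexp.
                total += weight * np.logaddexp(0.0, -(s_hat[i] - s_hat[j]))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

# Confidences that agree with the teacher order give a lower loss than reversed ones.
print(lambda_ndcg_loss([9, 5, 2], [-0.1, -0.5, -2.0]))
print(lambda_ndcg_loss([9, 5, 2], [-2.0, -0.5, -0.1]))
```

Pairs with equal teacher scores contribute nothing, so a list in which all items share one score yields zero ranking loss under this sketch.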

The final training objective is

$$\mathcal{L}_{\mathrm{total}}(\phi)=\mathcal{L}_{\mathrm{NLL}}(\phi)+\lambda\mathcal{L}_{\mathrm{rank}}(\phi), \tag{12}$$

where $\lambda$ balances token-level imitation and listwise ordering. Since $\theta$ is frozen, gradients flow only through the synthesized adapter $\Delta W(\phi)$ into the hypernetwork $\mathcal{H}_{\phi}$.

## 4 Experiments

To comprehensively evaluate the effectiveness of PRISMR, we conduct extensive experiments on a large-scale multimodal ranking benchmark. We aim to answer four primary research questions:

- RQ1: Can PRISMR eradicate generative parse collapse in listwise ranking and improve ranking quality, and how does it compare with prompt-engineering and format-enforcement baselines such as RankGPT, LLMLingua-2, and constrained decoding?
- RQ2: How well does PRISMR generalize beyond the training-time list length, and what are its practical inference-efficiency benefits?
- RQ3: Does prompt internalization remain effective in the pointwise setting, including when applied on top of a fine-tuned backbone?
- RQ4: Does PRISMR transfer across domains without additional training, and what does this reveal about the source of parse collapse?

### 4.1 Experimental Setup

Dataset. We construct a domain-specific multimodal review-ranking benchmark from Amazon Reviews 2023 \[[9](https://arxiv.org/html/2606.12942#bib.bib9)\]. As summarized in Table [1](https://arxiv.org/html/2606.12942#S4.T1), each example contains 10–100 user reviews associated with the same product, where each review may include a title, textual content, and user-uploaded images. The task is to produce a listwise ranking of the reviews according to multiple quality dimensions, including multimodal content quality, semantic relevance, image quality, visual appeal, and expected helpfulness to users. To validate that the induced benchmark rankings are aligned with human preferences, we conduct a pairwise preference evaluation. Because full listwise review ranking is inherently subjective and can exhibit low inter-annotator agreement, we sample 3,434 review pairs from the benchmark and collect pairwise judgments from both human annotators and an LLM judge. Specifically, we use Gemini-2.5-Pro as the LLM judge, which achieves 99.3% agreement with the benchmark pairwise ordering. In parallel, two human annotators are asked to choose the preferred review in each pair according to the same quality criteria. The benchmark ordering achieves 90.6% agreement with the aggregated human preferences. Notably, the agreement between the two annotators is 88.7%, a comparable level that reflects the inherent subjectivity of review ranking. These results suggest that our benchmark is well aligned with human preferences.

Table 1: Statistics of the constructed multimodal review benchmark.

Baselines and Metrics. We use Qwen3-VL-8B \[[2](https://arxiv.org/html/2606.12942#bib.bib2)\] as the shared backbone. We compare PRISMR (default: $\alpha$-mode for $N\leq 50$, $\beta$-mode for $N>50$) with Base (zero-shot score-by-score prompting), RankGPT \[[29](https://arxiv.org/html/2606.12942#bib.bib29)\] (direct permutation generation), LLMLingua \[[13](https://arxiv.org/html/2606.12942#bib.bib13),[19](https://arxiv.org/html/2606.12942#bib.bib19)\], Constrained-decoding Base (iterative decoding with one enforced line per candidate), and SFT Base. To analyse the contribution of each synthesis mode in isolation, we additionally report PRISMR ($\beta$-mode, mean pooling) at all $N$ as an ablation of the same checkpoint. We report NDCG@K for $K\in\{1,3,5,10\}$.

Training hyperparameters. We use Qwen3-VL-8B as the frozen base model and train only the global hypernetwork $\mathcal{H}_{\phi}$, which has $5.8\times 10^{8}$ parameters. $\mathcal{H}_{\phi}$ is a 6-layer Perceiver-style cross-attention encoder with 128 latent tokens of width 1024, 8 heads, GeLU MLPs with expansion ratio 4, RMSNorm, and two zero-initialized heads that output LoRA factors for each targeted projection. We use rank $r=2$ adapters on the $\{q,k,v,o\}$ projections of all 36 transformer blocks. We train for 3 epochs using AdamW with learning rate $2\times 10^{-5}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, no weight decay, and $\lambda=0.5$ in Eq. [12](https://arxiv.org/html/2606.12942#S3.E12). The batch size is 1 per GPU with gradient accumulation 8. We use bfloat16 autocast, flash-attention 2, and gradient checkpointing on the frozen base. The data split is obtained by hashing product\_id with validation ratio 0.1 and seed 42. Multi-seed results for seeds $\{7,17\}$ are given in Appendix [A](https://arxiv.org/html/2606.12942#A1). Training takes about 5 hours on one NVIDIA B200 or 2.5 hours on two B200 GPUs with DDP.
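A hash-based split keyed on the product identifier keeps all reviews of one product on the same side of the split and is reproducible without storing an index. The sketch below shows one way this could be done; the choice of SHA-256, the way the seed is mixed into the key, and the function name are all assumptions, since the text specifies only the hashed field, the validation ratio, and the seed.

```python
import hashlib

def split_for(product_id, val_ratio=0.1, seed=42):
    """Assign a product to 'train' or 'val' deterministically (hypothetical scheme)."""
    # Mix the seed into the key so that a different seed gives a different split.
    digest = hashlib.sha256(f"{seed}:{product_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the digest to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_ratio else "train"

# The assignment depends only on (seed, product_id), so it is stable across runs.
ids = [f"id{i}" for i in range(10000)]
val_fraction = sum(split_for(pid) == "val" for pid in ids) / len(ids)
print(val_fraction)
```

With a well-mixed hash, the validation fraction over many products should land close to the requested ratio.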

#### Parse rate\.

For a list of size $N$, slot $i$ is counted as parsed if a fixed regex extracts a valid score for index $i\in[1,N]$ from the model output. We ignore duplicate indices after the first match and clip scores to $[1,10]$. We then define

$$\mathrm{ParseRate}=\frac{1}{PN}\sum_{p=1}^{P}\sum_{i=1}^{N}\mathbb{1}\left[(p,i)\ \text{is parsed}\right],$$

where $P$ denotes the number of evaluated products. For NDCG computation, unparsed slots are assigned a default score of 5.0, which favors failed methods by providing a neutral fallback rather than the worst rank. A detailed failure-mode breakdown is given in Appendix [B](https://arxiv.org/html/2606.12942#A2).
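The parsing protocol above can be sketched in a few lines. The regular expression below assumes score lines of the form `index: score`; the paper's exact output format and regex are not given, so both the pattern and the function names are hypothetical.

```python
import re

# Assumed line format "<index>: <score>", one candidate per line.
SCORE_LINE = re.compile(r"^[ \t]*(\d+)[ \t]*:[ \t]*(\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)

def parse_scores(output, n, default=5.0):
    """Return (scores for slots 1..n, number of parsed slots)."""
    parsed = {}
    for m in SCORE_LINE.finditer(output):
        idx = int(m.group(1))
        # Keep only in-range indices and ignore duplicates after the first match.
        if 1 <= idx <= n and idx not in parsed:
            parsed[idx] = min(10.0, max(1.0, float(m.group(2))))  # clip to [1, 10]
    # Unparsed slots fall back to the neutral default score.
    scores = [parsed.get(i, default) for i in range(1, n + 1)]
    return scores, len(parsed)

def parse_rate(outputs, n):
    """Fraction of (product, slot) pairs that were parsed, as in the ParseRate formula."""
    return sum(parse_scores(o, n)[1] for o in outputs) / (len(outputs) * n)

collapsed = "1: 8\n2: 12\n2: 3\n7: 4\nSure, here you go"
complete = "1: 5\n2: 5\n3: 5\n4: 5"
print(parse_scores(collapsed, 4))
print(parse_rate([collapsed, complete], 4))
```

In the collapsed example, the out-of-range index and the duplicate are dropped, the score 12 is clipped to 10, and slots 3 and 4 receive the default of 5.0.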

### 4.2 Eradicating Parse Collapse in Listwise Ranking (RQ1)

Table [2](https://arxiv.org/html/2606.12942#S4.T2) evaluates listwise ranking on multimodal Baby\_Products with one image per review and $N\in\{10,20,50,100\}$. The Base Qwen3-VL model exhibits severe parse collapse: parse rate drops from 11.3% at $N=10$ to 2.0% at $N=50$ and 1.0% at $N=100$, with failures dominated by silent omissions rather than malformed outputs (Appendix [B](https://arxiv.org/html/2606.12942#A2)). Prompt engineering only partially addresses this problem. RankGPT maintains a high parse rate, but its NDCG@10 still degrades with list length, while LLMLingua provides little benefit. Constrained decoding guarantees a 100% parse rate, yet remains substantially below PRISMR in NDCG@10, especially at larger $N$, showing that output-format enforcement alone fixes parsing but not long-context ranking quality.

PRISMR, run in its default configuration ($\alpha$-mode for $N\leq 50$, $\beta$-mode for $N>50$), achieves near-perfect parse rate and the best NDCG@10 at $N=10,20,50,100$. Its margin over the strongest prompt-based baseline widens as $N$ grows, indicating that the gain is not merely better format compliance but stronger listwise ranking under long multimodal context. To isolate the effect of the synthesis mode itself, we also report each mode in isolation across all $N$: $\beta$-mode alone already recovers most of the gain and maintains perfect parse rate, suggesting that the main benefit comes from hypernetwork-based parametric internalization, while $\alpha$-mode adds a smaller but consistent in-regime improvement.

Table 2: Listwise ranking on Qwen3-VL-8B (Baby\_Products, 1 image/review, 30 validation products per cell). PRISMR outperforms prompt-based baselines at every $N$. The bottom block reports the two synthesis modes of the same trained PRISMR checkpoint.

- ∗ Fine-tuning baselines use the same teacher labels and listwise prompts as PRISMR.
- † $\beta$-mode: mean-pool synthesis of the same checkpoint (same hypernetwork as $\alpha$-mode).
- ♭ Default PRISMR rule: use $\alpha$-mode for $N\leq 50$ (shaded; in-regime), and $\beta$-mode for $N>50$ (extrapolation). The two rows therefore correspond to one method with a length-adaptive synthesis switch.
- ‡ $N=100$ is evaluated only for extrapolation.

### 4.3 Length Generalization and Inference Efficiency (RQ2)

We examine how PRISMR’s two synthesis modes behave as the candidate list grows beyond the training-time regime, and use the resulting curves to justify the $N=50$ mode-switch threshold in PRISMR’s default rule. Figure [3](https://arxiv.org/html/2606.12942#S4.F3) (left) plots NDCG@10 as a function of list length $N$; the underlying hypernetwork is trained only up to $N=50$. In the in-range regime ($N\leq 50$), $\alpha$-mode consistently achieves the strongest ranking quality and remains clearly above the trained non-hypernetwork baselines, including Standard SFT and RankGPT-SFT. Beyond the training length, the two modes begin to diverge: $\alpha$-mode degrades sharply once the effective rank-dependent adapter width grows far beyond what was seen during training, whereas $\beta$-mode remains substantially more stable and overtakes $\alpha$-mode at larger $N$. The crossover sits near the training-time list length and motivates PRISMR’s default length-adaptive synthesis rule ($\alpha$-mode for $N\leq 50$; $\beta$-mode for $N>50$), so that a single trained PRISMR checkpoint stays above the trained baselines across all evaluated list lengths.

Figure [3](https://arxiv.org/html/2606.12942#S4.F3) (right) shows that this improvement in ranking quality is accompanied by favorable inference cost. Measured by end-to-end wall-clock latency per product on a single NVIDIA B200 with batch size 1, PRISMR is consistently faster than constrained-decoding Base, and the gap widens as $N$ increases. The reason is structural. Constrained decoding still performs generation over the full long context and therefore repeatedly incurs the cost of long-context attention, whereas PRISMR amortizes review conditioning into a one-time hypernetwork pass and then ranks using a compact parameterized representation. As the candidate list becomes longer, avoiding repeated long-context decoding becomes increasingly beneficial.

### 4.4 Pointwise Internalization (RQ3)

Beyond listwise ranking, we evaluate PRISMR in the pointwise setting, where the same long prompt must be re-encoded for each candidate, creating substantial redundancy. As shown in Table [3](https://arxiv.org/html/2606.12942#S4.T3), prompt internalization improves ranking quality for both text-only and multimodal inputs. On the un-fine-tuned backbone, PRISMR raises NDCG@1 from 0.7878 to 0.9589 in the text-only setting and from 0.8302 to 0.9585 in the multimodal setting. The gains persist after fine-tuning, with SFT + PRISMR reaching 0.9848 NDCG@1 versus 0.9566 for SFT Base. These results show that parametric internalization also transfers to pointwise ranking and remains effective for multimodal inputs.

![Refer to caption](https://arxiv.org/html/2606.12942v1/x3.png)

Figure 3: Left: NDCG@10 vs. list length $N$. Both PRISMR modes are produced by the same checkpoint trained at $N = 50$ (dotted line). $\alpha$-mode performs best in range ($N \leq 50$), while $\beta$-mode is more robust for extrapolation ($N > 50$), motivating the default switch at $N = 50$. The resulting PRISMR envelope outperforms the trained baselines across all $N$. Right: per-product latency on a single B200 (batch size 1, mean over 5 products). PRISMR is consistently faster than constrained-decoding Base, with speed-up increasing from $1.1\times$ at $N = 10$ to $1.7\times$ at $N = 50$.

Table 3: Pointwise ranking performance under different prompt-internalization strategies and modalities. Here, *sys* denotes internalizing only the system prompt into the LoRA, while *all* denotes internalizing the full prompt.

### 4.5 Cross-Domain Transfer (RQ4)

We next test whether PRISMR’s gains are domain-specific or transfer beyond the training category. To this end, we evaluate the checkpoint trained on Baby_Products directly on the disjoint Amazon_Fashion category, without any additional training. As shown in Table [4](https://arxiv.org/html/2606.12942#S4.T4), PRISMR ($\alpha$-mode) transfers cleanly across domains: it retains an essentially perfect parse rate at all list lengths and achieves strong NDCG@10 on Fashion (0.982, 0.976, and 0.962 at $N = 10, 20, 50$), remaining close to the in-domain results on Baby_Products. By contrast, the Base score-by-score model exhibits the same parse-collapse pattern on the new domain, with parse rate dropping to 14.0%, 5.0%, and 2.0% as $N$ increases. This shows that parse collapse is primarily a property of the backbone-and-decoding protocol rather than of any particular domain. Although Amazon_Fashion has a somewhat different score distribution and appears slightly easier for both methods, the absolute gap between PRISMR and the Base model remains large, exceeding 0.4 NDCG@10 at every $N$.

Table 4: Zero-shot cross-domain transfer from Baby_Products to Amazon_Fashion. The model is trained on Baby_Products and evaluated on Amazon_Fashion without additional training.

## 5 Limitations

PRISMR has several limitations. First, its format compliance is empirical rather than formally guaranteed. Although it achieves a $\geq 99.9\%$ parse rate in the trained regime, the $\alpha$-mode synthesis can degrade far beyond the training list length; PRISMR therefore switches to its $\beta$-mode at $N > 50$ for out-of-regime deployment, which is part of the default rule rather than a separate method. Second, review ranking is subjective, and no single ordering can perfectly reflect all user preferences, especially when candidate reviews are close in quality.

## 6 Conclusion

We identify parse collapse as a key failure mode of generative listwise ranking: for long multimodal inputs, the base vision-language model often terminates early and leaves much of the list unparsed. Prompt compression and alternative prompting formats do not resolve this issue, suggesting that the bottleneck lies in the fragile long-context generative output protocol. We propose PRISMR, a hypernetwork-based conditioning method that encodes each candidate review as a per-item LoRA and composes them into an instance-specific ranking module. PRISMR nearly eliminates parse collapse, improves over prompting and constrained-decoding baselines, remains effective for pointwise ranking, and transfers across domains. We further observe a synthesis-mode trade-off: $\alpha$-mode is stronger in range, whereas $\beta$-mode is more robust under length extrapolation; thus PRISMR uses $\alpha$ for $N \leq 50$ and $\beta$ for $N > 50$. Overall, PRISMR shows that moving per-candidate information from transient prompt tokens into structured parameter space offers a practical alternative to increasingly complex prompting or decoding heuristics.

## References

- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. *arXiv preprint arXiv:2303.08774*, 2023.
- Bai et al. [2025] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. *arXiv preprint arXiv:2511.21631*, 2025.
- Cai et al. [2025] Shihao Cai, Chongming Gao, Yang Zhang, Wentao Shi, Jizhi Zhang, Keqin Bao, Qifan Wang, and Fuli Feng. K-order ranking preference optimization for large language models. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 4844–4859, 2025.
- Charakorn et al. [2025] Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, and Robert Tjarko Lange. Text-to-LoRA: Instant transformer adaption. *arXiv preprint arXiv:2506.06105*, 2025.
- Charakorn et al. [2026] Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, and Robert Tjarko Lange. Doc-to-LoRA: Learning to instantly internalize contexts. *arXiv preprint arXiv:2602.15902*, 2026.
- Dong et al. [2025] Kuicai Dong, Yujing Chang, Derrick Goh Xin Deik, Dexun Li, Ruiming Tang, and Yong Liu. MMDocIR: Benchmarking multimodal retrieval for long documents. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 30959–30993, 2025.
- Gera et al. [2025] Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 682–712, 2025.
- Gupta et al. [2025] Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, and Felix Yu. Scalable in-context ranking with generative models. *arXiv preprint arXiv:2510.05396*, 2025.
- Hou et al. [2024] Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation. *arXiv preprint arXiv:2403.03952*, 2024.
- Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. *ICLR*, 1(2):3, 2022.
- Ivison et al. [2023] Hamish Ivison, Akshita Bhagia, Yizhong Wang, Hannaneh Hajishirzi, and Matthew E Peters. HINT: Hypernetwork instruction tuning for efficient zero- and few-shot generalisation. *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)*, 2023.
- Jiang et al. [2026] Hao Jiang, Zhi Yang, Annan Wang, Yichi Zhang, and Weisi Lin. RLPO: Residual listwise preference optimization for long-context review ranking. *arXiv preprint arXiv:2601.07449*, 2026.
- Jiang et al. [2023] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 13358–13376, 2023.
- Liu et al. [2024] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. *Transactions of the Association for Computational Linguistics*, 12:157–173, 2024.
- Liu et al. [2025a] Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, and Jiaxin Mao. LLM4Ranking: An easy-to-use framework of utilizing large language models for document reranking. *arXiv preprint arXiv:2504.07439*, 2025a.
- Liu et al. [2025b] Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, et al. LiPO: Listwise preference optimization through learning-to-rank. In *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, pages 2404–2420, 2025b.
- Liu et al. [2025c] Wenhan Liu, Xinyu Ma, Yutao Zhu, Lixin Su, Shuaiqiang Wang, Dawei Yin, and Zhicheng Dou. CoRanking: Collaborative ranking with small and large ranking agents. *arXiv preprint arXiv:2503.23427*, 2025c.
- Mu et al. [2023] Jesse Mu, Xiang Lisa Li, and Noah D Goodman. Learning to compress prompts with gist tokens. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2023.
- Pan et al. [2024] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 963–981, 2024.
- Phang et al. [2023] Jason Phang, Yi Mao, Pengcheng He, and Weizhu Chen. HyperTuning: Toward adapting large language models without back-propagation. *Proceedings of the 40th International Conference on Machine Learning (ICML)*, 2023.
- Pradeep et al. [2023] Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankZephyr: Effective and robust zero-shot listwise reranking is a breeze! *arXiv preprint arXiv:2312.02724*, 2023.
- Qiu et al. [2022] Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of NDCG surrogates for deep learning with provable convergence. *arXiv preprint arXiv:2202.12183*, 2022.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. *Advances in Neural Information Processing Systems*, 36:53728–53741, 2023.
- Reddy et al. [2024] Revanth Gangi Reddy, JaeHyeok Doo, Yifei Xu, Md Arafat Sultan, Deevya Swain, Avirup Sil, and Heng Ji. FIRST: Faster improved listwise reranking with single token decoding. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 8642–8652, 2024.
- Ren et al. [2025] Ruiyang Ren, Yuhao Wang, Kun Zhou, Wayne Xin Zhao, Wenjie Wang, Jing Liu, Ji-Rong Wen, and Tat-Seng Chua. Self-calibrated listwise reranking with large language models. In *Proceedings of the ACM on Web Conference 2025*, pages 3692–3701, 2025.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*, 2024.
- Shin et al. [2025] Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, and Minjoon Seo. Generative prompt internalization. In *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, pages 7338–7363, 2025.
- Snell et al. [2022] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. *arXiv preprint arXiv:2209.15189*, 2022.
- Sun et al. [2023] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 14918–14937, 2023.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. *arXiv preprint arXiv:2312.11805*, 2023.
- Vendrow et al. [2024] Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. INQUIRE: A natural world text-to-image retrieval benchmark. *Advances in Neural Information Processing Systems*, 37:126500–126514, 2024.
- Wu et al. [2025] Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A Rossi, Prithviraj Ammanabrolu, and Julian McAuley. In-context ranking preference optimization. *arXiv preprint arXiv:2504.15477*, 2025.
- Yang et al. [2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. *arXiv preprint arXiv:2505.09388*, 2025.
- Zhao et al. [2025] Yang Zhao, Yixin Wang, and Mingzhang Yin. Permutative preference alignment from listwise ranking of human judgments. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 310–334, 2025.

## Appendix A Training Details

Hardware. Listwise PRISMR is trained on a single node with $2\times$ NVIDIA B200 GPUs, each with 180 GiB of memory. All evaluation-only experiments fit on a single B200 GPU.

Optimizer and precision. We optimize the hypernetwork parameters with AdamW as implemented in `torch.optim.AdamW`, using learning rate $2 \times 10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and no weight decay. Training uses mixed precision with `torch.autocast` in `bfloat16`. The base model is frozen throughout training. We enable gradient checkpointing on the frozen base model to reduce activation memory and keep both the base-model and hypernetwork activations resident during backpropagation.

Schedule and batching. Unless otherwise stated, we train for three epochs. The per-GPU batch size is 1, with gradient accumulation over 8 steps, yielding an effective batch size of 16 across the two GPUs. For the Baby_Products domain at $N = 50$, this corresponds to approximately 1,633 training products per epoch and three full passes over the training split.
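The batching arithmetic above can be checked with a few lines. This is only a sketch: the step counts assume that a final partial accumulation window still triggers an optimizer step, which the text does not state.

```python
import math

# Values stated in the text.
per_gpu_batch = 1
grad_accum_steps = 8
num_gpus = 2
train_products = 1633
epochs = 3

# Effective batch size = per-GPU batch x accumulation steps x GPUs.
effective_batch = per_gpu_batch * grad_accum_steps * num_gpus

# Assumption: the last, partially filled batch of an epoch still counts as a step.
steps_per_epoch = math.ceil(train_products / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```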

Hypernetwork architecture. The global hypernetwork $\mathcal{H}_{\phi}$ is a Perceiver-style cross-attention encoder followed by layer-specific linear LoRA heads. It produces candidate-conditioned LoRA increments for every targeted projection in all 36 transformer layers of Qwen3-VL-8B. For each candidate, the input context consists of the system prompt, product header, review text, and image patches. This context is tokenized by the base model tokenizer and vision encoder, and is consumed by a 6-layer cross-attention stack with 128 learnable latent tokens of width $d_{\mathrm{lat}} = 1024$. Each layer uses 8 attention heads, GELU MLPs with expansion factor 4, RMSNorm, and residual connections.

For each targeted projection $\ell$, two linear heads map the latent representation to LoRA factors

$$A_i^{(\ell)} \in \mathbb{R}^{r \times d_{\mathrm{in}}}, \qquad B_i^{(\ell)} \in \mathbb{R}^{d_{\mathrm{out}} \times r}.$$

The $B$ heads are zero-initialized, so that each candidate-specific update

$$\Delta W_i^{(\ell)} = B_i^{(\ell)} A_i^{(\ell)}$$

is initialized to zero. We target the $\{q, k, v, o\}$ projections in every transformer block, resulting in $4 \times 36 = 144$ targeted projections. The per-candidate LoRA rank is $r = 2$. The hypernetwork contains approximately $5.8 \times 10^{8}$ trainable parameters, while the frozen base model contains approximately $1.69 \times 10^{10}$ parameters. Under the $\alpha$-mode synthesis rule, the composite increment

$$\Delta W_{\alpha} = \sum_{i} B_i A_i$$

has effective rank at most $Nr$, as stated in Equation [5](https://arxiv.org/html/2606.12942#S3.E5).
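The rank bound follows from rewriting the sum of per-candidate updates as a single product of stacked factors. The short derivation below spells this out; the stacked matrices $A_{\mathrm{cat}}$ and $B_{\mathrm{cat}}$ are notation introduced here for illustration and are not symbols defined in this section.

```latex
\Delta W_{\alpha}
  = \sum_{i=1}^{N} B_i A_i
  = \underbrace{\begin{bmatrix} B_1 & B_2 & \cdots & B_N \end{bmatrix}}_{B_{\mathrm{cat}} \,\in\, \mathbb{R}^{d_{\mathrm{out}} \times Nr}}
    \underbrace{\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{bmatrix}}_{A_{\mathrm{cat}} \,\in\, \mathbb{R}^{Nr \times d_{\mathrm{in}}}},
\qquad
\operatorname{rank}(\Delta W_{\alpha})
  \le \min\!\left(Nr,\; d_{\mathrm{in}},\; d_{\mathrm{out}}\right)
  \le Nr .
```

With the stated $r = 2$, a list of $N = 50$ candidates gives a bound of $Nr = 100$ on the rank of the composite increment.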

Vision settings. We use at most one image per review, with `max_images_per_review = 1`. Images are resized subject to `image_max_pixels` $= 672 \times 672 = 451{,}584$. FlashAttention-2 is enabled for the base model.

Loss and supervision. Training uses the token-level negative log-likelihood on teacher-curated score lines, defined in Equation [8](https://arxiv.org/html/2606.12942#S3.E8), together with the LambdaRank-NDCG auxiliary loss in Equation [11](https://arxiv.org/html/2606.12942#S3.E11). The auxiliary weight is $\lambda = 0.5$. The differentiable per-item score readout $\hat{s}_i$ in Equation [9](https://arxiv.org/html/2606.12942#S3.E9) partitions answer-span tokens by a cumulative sum over newline token ids. This allows the implementation to aggregate token-level quantities with a single `scatter_add` operation. The pairwise LambdaRank term uses the closed-form $|\Delta\mathrm{NDCG}_{ij}|$ computed from teacher-induced ranks and discounts.
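The newline-based partitioning can be illustrated in plain Python. This is a minimal sketch of the general idea only: the function name `per_item_sums` is hypothetical, and which token-level quantity is aggregated is defined by Equation 9, which is not reproduced in this section. In PyTorch the same two steps would roughly correspond to `torch.cumsum` followed by `Tensor.scatter_add_`.

```python
from itertools import accumulate

def per_item_sums(token_ids, token_values, newline_id, num_items):
    """Sum token_values within each newline-terminated score line.

    Hypothetical helper: segment k collects all tokens of the k-th line,
    including the newline token that ends it.
    """
    is_newline = [1 if t == newline_id else 0 for t in token_ids]
    # Inclusive cumulative sum counts newlines up to and including each token.
    inclusive = list(accumulate(is_newline))
    # Subtract the token's own flag so a newline stays in the line it terminates.
    segment_ids = [c - n for c, n in zip(inclusive, is_newline)]
    sums = [0.0] * num_items
    for seg, val in zip(segment_ids, token_values):
        if seg < num_items:   # ignore tokens after the last expected line
            sums[seg] += val  # plays the role of scatter_add along the item axis
    return sums
```

For example, with `newline_id = 0`, the token ids `[5, 6, 0, 7, 0, 8, 9, 0]` fall into three segments of lengths 3, 2, and 3.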

Teacher model and filtering. The teacher model is GPT-5.4-2026-04-15, using a pinned snapshot. We filter teacher traces before training. A trace is discarded if the response: (i) omits any of the $N$ requested score lines; (ii) emits an out-of-range or repeated candidate index; or (iii) violates the required score-line regular expression. On Baby_Products, this filtering procedure retains $79.4\%$ of teacher trajectories at $N = 50$ and $94.1\%$ at $N = 10$. The retained trajectories constitute $\mathcal{D}_{\mathrm{train}}$.
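The three filtering criteria can be sketched as a single predicate. The score-line pattern below is an assumption based on the "i: score" format mentioned elsewhere in the appendix; the exact regular expression used for filtering is not given here, and `keep_trace` is a hypothetical name.

```python
import re

# Assumed score-line pattern "i: <score>"; the paper's exact regex is not shown.
SCORE_LINE = re.compile(r"^(\d+):\s*(\d+(?:\.\d+)?)$")

def keep_trace(response: str, n: int) -> bool:
    """Return True if the response has exactly one valid score line per candidate."""
    seen = set()
    for line in response.strip().splitlines():
        match = SCORE_LINE.match(line.strip())
        if match is None:
            return False      # (iii) line violates the score-line pattern
        index = int(match.group(1))
        if not 1 <= index <= n:
            return False      # (ii) out-of-range candidate index
        if index in seen:
            return False      # (ii) repeated candidate index
        seen.add(index)
    return len(seen) == n     # (i) every requested score line is present
```

A trace is kept only when all three checks pass, so the retained set contains complete, well-formed score lists by construction.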

Code, models, and licences. We will release the training pipeline, evaluation scripts, configuration files, and checkpoint hashes upon publication. The Amazon Reviews 2023 corpus is released under the Apache-2.0 licence; INQUIRE-Rerank under CC BY-NC 4.0; MMDocIR under MIT; and Qwen3-VL-8B under Apache-2.0.

Data splits. We use a single-domain split with `val_ratio = 0.1` and `data_seed = 42`, yielding 1,633 training products and 182 validation products. Each cell in Table [2](https://arxiv.org/html/2606.12942#S4.T2) is computed over 30 validation products sampled under the same seed. The same validation products are used across methods, making comparisons paired.

Multi-seed validation. To estimate seed-induced variance for the headline PRISMR results, we additionally train PRISMR in both $\alpha$-mode and $\beta$-mode at $N = 50$ using two extra seeds, $\{7, 17\}$, with otherwise identical hyperparameters. For each seed, we evaluate on a paired 30-product validation draw generated under the same seed. Table [5](https://arxiv.org/html/2606.12942#A1.T5) reports the mean and standard deviation of NDCG@10 across these runs. The parse rate is $100\%$ for every evaluated cell. Across both PRISMR modes and all evaluated list lengths, the standard deviation does not exceed $0.0064$ NDCG@10, which is substantially smaller than the absolute gap between PRISMR and the strongest in-context baseline. For example, in Table [2](https://arxiv.org/html/2606.12942#S4.T2), PRISMR $\alpha$-mode improves over RankGPT by $0.18$, $0.29$, and $0.37$ NDCG@10 at $N = 10$, $N = 20$, and $N = 50$, respectively.

Table 5: Multi-seed NDCG@10 of PRISMR on Qwen3-VL-8B + Baby_Products (mean $\pm$ std across two extra seeds $\{7, 17\}$, identical 30-val-product paired draw, parse rate $100\%$ throughout).

## Appendix B Failure-mode decomposition

We classify every *unparsed slot* produced by the score-by-score base into one of five mutually overlapping categories: `missing_count` (the model never emitted a line for that index), `hallucinated_id` (an out-of-range index appeared in the output), `format_error` (the output contained no parseable “i: <score>” substring at all), `length_overflow` (the decoder exhausted `max_new_tokens` without finishing the list), and `repeated_id` (an in-range index appeared more than once).
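A rough sketch of how such overlapping labels could be assigned to one model output is given below. The pattern for "i: score" substrings, the function name `failure_categories`, and the `hit_token_limit` flag (which the caller would derive from the decoder's stop reason) are all assumptions made for illustration, not the classifier used in the paper.

```python
import re

# Assumed pattern for an "i: <score>" substring anywhere in the output.
SCORE_ITEM = re.compile(r"(\d+):\s*(\d+(?:\.\d+)?)")

def failure_categories(output: str, n: int, hit_token_limit: bool) -> set:
    """Return the set of failure labels triggered by one score-by-score output."""
    indices = [int(m.group(1)) for m in SCORE_ITEM.finditer(output)]
    in_range = [i for i in indices if 1 <= i <= n]
    missing = set(range(1, n + 1)) - set(in_range)

    labels = set()
    if not indices:
        labels.add("format_error")      # nothing parseable at all
    if len(in_range) < len(indices):
        labels.add("hallucinated_id")   # some index fell outside 1..n
    if len(in_range) != len(set(in_range)):
        labels.add("repeated_id")       # an in-range index appeared twice
    if missing:
        labels.add("missing_count")     # at least one slot has no line
    if hit_token_limit and missing:
        labels.add("length_overflow")   # budget exhausted before the list ended
    return labels
```

Returning a set rather than a single label reflects that the categories overlap: a truncated output, for instance, is both `missing_count` and `length_overflow`.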

Empirically, the score-by-score Base model exhibits severe parse collapse at $N = 20$ across all image-density settings. As shown in Figure [4](https://arxiv.org/html/2606.12942#A2.F4), the parse rate remains low even when no images are included: $14.4\%$ for text-only inputs, corresponding to $151/1050$ successfully parsed examples. Adding images further reduces the parse rate to $13.6\%$ with one image per review ($143/1050$) and $12.1\%$ with two images per review ($127/1050$).

![Refer to caption](https://arxiv.org/html/2606.12942v1/x4.png)

Figure 4: Parse rate of the score-by-score Base model on Qwen3-VL-8B + Baby_Products at $N = 20$ under different image-density settings. Parse rate is low in all cases and decreases as more images are included, indicating that multimodal context exacerbates listwise parse collapse.

## Appendix C Generalisation to Standard Multimodal IR Benchmarks

To test whether PRISMR’s structural fix transfers beyond the Amazon Reviews family, we run the same Qwen3-VL-8B + PRISMR checkpoint (trained on Baby_Products review-quality scoring) zero-shot on two off-the-shelf multimodal IR benchmarks: INQUIRE-Rerank [[31](https://arxiv.org/html/2606.12942#bib.bib31)], an image-retrieval re-ranking benchmark constructed from iNaturalist 2024 with natural-language ecology queries (e.g. “a mongoose standing upright alert”), and MMDocIR [[6](https://arxiv.org/html/2606.12942#bib.bib6)], a multimodal document re-ranking benchmark over PDF page screenshots paired with VLM-extracted page text. For each of 30 randomly drawn queries we form a listwise prompt with $N = 10$ candidate items: for INQUIRE, 5 positives + 5 negatives sampled from the CLIP top-100 (binary relevance); for MMDocIR, the question’s gold pages plus random non-gold pages from the same document.

Table 6: Generalisation to two off-the-shelf multimodal IR benchmarks at $N = 10$ (30 queries each). The top block is zero-shot (Baby $\to$ benchmark); the bottom block fine-tunes the same PRISMR/PRISMR ($\beta$-mode) checkpoint for two epochs on the benchmark’s training split. Parse rate transfers cleanly in both regimes. In-list ranking quality does *not* transfer zero-shot, but recovers substantially after a brief in-domain fine-tune; even so, fine-tuned PRISMR is statistically tied with the (training-free) constrained-decoding baseline on these short ($N = 10$) image-only / document-page rerank tasks.

The result is consistent across both benchmarks and splits into two halves.

*Parse-rate fix transfers.* PRISMR keeps a $100\%$ parse rate on tasks whose candidates (single nature photos for INQUIRE, document-page screenshots for MMDocIR) look nothing like its training distribution; the score-by-score Base collapses here just as it does on Baby.

*In-list ranking quality does not transfer zero-shot, but recovers under fine-tuning.* Zero-shot PRISMR’s NDCG@10 of $0.705$ on INQUIRE matches the random baseline ($0.707$); inspecting outputs, we observe position-dependent scores ($2.8, 2.9, 3.1, 3.0, \ldots$) rather than content-dependent ones, indicating that the hypernetwork, which was trained to extract text-quality features from review bodies, has no analogous signal to extract from a brief species caption or a document page screenshot. After two epochs of fine-tuning on the benchmark’s own train split, PRISMR and PRISMR ($\beta$-mode) both recover substantial ranking signal ($0.804$/$0.831$ on INQUIRE, $0.844$/$0.834$ on MMDocIR), but neither significantly surpasses constrained decoding ($0.841$/$0.857$).

We interpret this as a clean separation of the two effects PRISMR bundles: *the structural-format fix is task-general* and transfers without retraining; *the relevance-quality fix requires in-domain teacher signal and recovers under brief fine-tuning, but its margin over training-free constrained decoding is small at the short list lengths these benchmarks naturally provide*. We expect PRISMR’s quality margin to widen at larger $N$, where parse collapse and the cost of $N$ sequential constrained-decoding forwards both grow super-linearly; we report $N = 5, 20$ results in Appendix [D](https://arxiv.org/html/2606.12942#A4).

## Appendix D IR-benchmark results at $N = 5$ and $N = 20$

To probe how PRISMR’s quality margin over constrained decoding scales with list length on the IR benchmarks, we evaluate the same fine-tuned PRISMR and PRISMR ($\beta$-mode) checkpoints (from Appendix [C](https://arxiv.org/html/2606.12942#A3)) at $N \in \{5, 20\}$. Numbers below are NDCG@10 on the held-out test splits with 30 random queries; parse rate is $100\%$ for all PRISMR/PRISMR ($\beta$-mode)/Constrained rows.

Table 7: IR-benchmark NDCG@10 across $N \in \{5, 10, 20\}$ (parse rate $100\%$ throughout).

## Appendix E Standard SFT-with-teacher-data baseline

To isolate PRISMR’s architectural contribution from the contribution of its teacher-distilled training data, we train a vanilla rank-16 LoRA on Qwen3-VL-8B with the *same* teacher labels (`mllm_score`) and the *same* listwise score-by-score prompts used in PRISMR training. The base model is frozen; the LoRA is attached to the language-decoder `q_proj`, `k_proj`, `v_proj`, and `o_proj` matrices (15.3 M trainable parameters out of 8.78 B). Training uses AdamW at $\text{lr} = 1 \times 10^{-4}$, the standard NLL on the completion (i.e. on the “1: $s_1$`\n`2: $s_2$`\n`…” tokens) with the prompt tokens masked out, batch size 1, gradient accumulation 8, for one epoch on a 300-product training subset on $4\times$ NVIDIA B200 GPUs (matching the Vanilla-D2L compute budget). The resulting LoRA-only state dict is 61 MB.
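The quoted parameter count and file size can be reproduced with a short calculation. The projection shapes below are an assumption (a Qwen3-8B-style decoder with hidden size 4096, 32 query heads, 8 key/value heads, and head dimension 128); they are not stated in the text, and the 4-bytes-per-parameter storage is likewise assumed. The layer count of 36 and the rank of 16 come from the paper.

```python
# Assumed decoder dimensions (not stated in the text).
HIDDEN = 4096
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
NUM_LAYERS = 36   # stated in Appendix A
RANK = 16         # stated for the SFT baseline

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A is rank x d_in and B is d_out x rank, so the pair has rank * (d_in + d_out) entries."""
    return rank * (d_in + d_out)

# (d_in, d_out) per targeted projection under the assumed shapes.
shapes = {
    "q_proj": (HIDDEN, NUM_Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_Q_HEADS * HEAD_DIM, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total_params = per_layer * NUM_LAYERS
size_mb = total_params * 4 / 1e6   # assumed float32 storage

print(total_params, round(size_mb, 1))
```

Under these assumptions the total comes to about 15.3 M parameters and roughly 61 MB, in line with the figures reported above.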

The SFT row of Table [2](https://arxiv.org/html/2606.12942#S4.T2) reports the evaluation at $N \in \{10, 20, 50\}$ on the same 30-product held-out validation set used elsewhere. The model achieves $67\%/39\%/38\%$ parse rate and $0.847/0.651/0.587$ NDCG@10. Inspection of failure cases shows the same silent-omission pattern as the un-fine-tuned base, just with a smaller magnitude. The interpretation in Section [4.2](https://arxiv.org/html/2606.12942#S4.SS2) is that PRISMR’s gain over standard SFT ($+0.13/+0.31/+0.36$ NDCG@10, growing with $N$) is attributable to the hypernetwork-internalisation mechanism itself, not to access to teacher data: when the base model has to consume the entire candidate list in-context and emit $N$ scores autoregressively, parse collapse persists even after fine-tuning on the target distribution.

## Appendix F Ablation: $\beta$-mode synthesis

The Vanilla-D2L row in Table [2](https://arxiv.org/html/2606.12942#S4.T2) replaces the rank-dim concatenation step of PRISMR with chunk-wise mean pooling. Concretely, the per-chunk LoRA matrices $\{(A_i, B_i)\}_{i=1}^{N}$ produced by the same hypernetwork are aggregated by

$$A_{\text{vanilla}} = \frac{1}{N}\sum_{i=1}^{N} A_i, \qquad B_{\text{vanilla}} = \frac{1}{N}\sum_{i=1}^{N} B_i,$$

so the resulting adapter has a fixed effective rank of $r$ instead of PRISMR’s $(N+1)\,r$. The implementation is a $\sim$20-LOC change in `lora_merger.combine_lora` that we expose as `--combine_mode mean`; everything else (hypernetwork architecture, base model, image stack, training data, optimizer, and loss) is identical to PRISMR. To keep the wall-clock budget tractable for this ablation we trained PRISMR ($\beta$-mode) for one epoch on a 300-product subsample of the training set (versus three epochs on all 1,633 products for PRISMR). PRISMR ($\beta$-mode) therefore consumes roughly $1/12$ of PRISMR’s per-step compute and $1/4$ of its training time. The interpretation in Section [4.2](https://arxiv.org/html/2606.12942#S4.SS2) is that even with this reduced budget PRISMR ($\beta$-mode) recovers most of PRISMR’s ranking improvement, indicating that the rank-dim concatenation contributes a real but secondary effect on top of the core hypernetwork-internalisation step.
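The difference between the two aggregation rules can be made concrete with a small pure-Python sketch over nested lists. The function `combine_factors` and its mode names are hypothetical and are not the authors' `lora_merger.combine_lora`; the sketch also handles only the $N$ per-candidate factors, so its concatenated rank dimension is $Nr$.

```python
def _mean_matrix(mats):
    """Element-wise mean of equally shaped nested-list matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)] for r in range(rows)]

def combine_factors(factors, mode):
    """Combine per-candidate LoRA factors.

    factors: list of (A_i, B_i) with A_i of shape r x d_in and B_i of shape d_out x r.
    mode "concat": stack along the rank dimension (A by rows, B by columns).
    mode "mean":   average the A_i and the B_i separately, keeping rank r.
    """
    a_list = [a for a, _ in factors]
    b_list = [b for _, b in factors]
    if mode == "concat":
        a_cat = [row for a in a_list for row in a]
        b_cat = [[v for b in b_list for v in b[row]] for row in range(len(b_list[0]))]
        return a_cat, b_cat
    if mode == "mean":
        return _mean_matrix(a_list), _mean_matrix(b_list)
    raise ValueError(f"unknown mode: {mode}")
```

In `concat` mode the product of the combined factors equals the sum of the per-candidate updates, whereas in `mean` mode the adapter width stays fixed regardless of $N$, which is what makes it less sensitive to list lengths unseen in training.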

## Appendix G NDCG@$K$ breakdown

Table [8](https://arxiv.org/html/2606.12942#A7.T8) expands Table [2](https://arxiv.org/html/2606.12942#S4.T2) into the full NDCG@$K$ family at $K \in \{1, 3, 5, 10\}$. Two patterns are worth noting. First, the gap between PRISMR and Constrained is largest at small $K$: $+0.181$ at NDCG@1 vs. $+0.064$ at NDCG@10 for $N = 10$. Top-1 ranking is where ranking quality matters most, and that is where PRISMR’s structural advantage is most pronounced. Second, the score-by-score Base’s NDCG@1 is dramatically lower than its NDCG@10; this is an artifact of default-score imputation: when 88% of slots are unparsed and assigned the median score, large-$K$ NDCG values are pulled toward a random-permutation baseline ($\approx 0.77$ at $N = 10$) while top-1 reflects the genuine zero-signal regime.
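For reference, a minimal NDCG@$K$ computation with default-score imputation might look as follows. This is a sketch under stated assumptions: linear gains with a $\log_2$ position discount (the paper's exact gain function is not given in this appendix), ties broken by original list order, and a caller-supplied default score standing in for the median.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gains assumed)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(true_scores, predicted_scores, k):
    """NDCG@k of the ranking induced by predicted_scores (ties keep list order)."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked_gains = [true_scores[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_scores, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def impute_missing(predicted, default):
    """Replace unparsed slots (None) with a default score."""
    return [default if p is None else p for p in predicted]
```

When every slot is imputed with the same value, the induced ranking carries no signal, and the top-1 score depends entirely on which item happens to come first.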

Table 8: NDCG@$K$ breakdown across listwise methods (Baby_Products, img=1, 30 val products / cell). PRISMR’s advantage over the format-fixed baselines is largest at $K = 1$.

## Appendix H Baseline Implementation Details

RankGPT. We implement RankGPT using the same Qwen3-VL-8B backbone as PRISMR. The model is given a system prompt that asks it to output a single permutation line in the format

$$\pi_1 > \pi_2 > \cdots > \pi_N.$$

Decoding is greedy, with `max_new_tokens` $= 6N + 32$. We parse the generated text by extracting integers in first-appearance order, removing duplicates, and clipping indices to the valid range $[1, N]$. If any candidates are missing after parsing, we append the missing indices in ascending numeric order, so that every run produces a complete permutation of length $N$. This parser is intentionally permissive and therefore gives the baseline credit whenever a recoverable ranking can be inferred from the output.
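The permissive parsing procedure described above can be sketched as follows. The function names are hypothetical, and one detail is an interpretation: out-of-range integers are dropped here, which is one reading of "clipping indices to the valid range".

```python
import re

def parse_permutation(text: str, n: int) -> list:
    """Recover a full permutation of 1..n from free-form model output."""
    ranking, seen = [], set()
    for token in re.findall(r"\d+", text):
        idx = int(token)
        # Assumption: integers outside 1..n are discarded rather than clamped.
        if 1 <= idx <= n and idx not in seen:
            seen.add(idx)
            ranking.append(idx)
    # Append any candidates the model omitted, in ascending order.
    ranking.extend(i for i in range(1, n + 1) if i not in seen)
    return ranking

def rankgpt_max_new_tokens(n: int) -> int:
    """Decoding budget stated in the text: 6N + 32."""
    return 6 * n + 32
```

Because missing indices are always appended, the parser never fails; an empty or unusable output simply falls back to the identity ordering.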

LLMLingua-2. The compressed review text replaces the original review text in the same score-by-score prompting template used by the base model. Images are passed to the model unchanged. We verified by manual inspection that the compressor actively rewrites the review text in almost all cases; fallback to the identity input is rare.

PRISMR. For PRISMR, we evaluate the trained checkpoint described in Appendix [A](https://arxiv.org/html/2606.12942#A1). Unless otherwise stated, decoding is greedy and uses the same evaluation split as the baselines. No baseline-specific post-processing is applied beyond the score-line parser used consistently for score-based methods.

Reproducibility. We will release the evaluation scripts, launchers, configuration files, and exact command lines used for RankGPT, LLMLingua-2, and PRISMR upon publication.