ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

arXiv cs.CL Papers

Summary

ConvMemory v2 is a recall-preserving reranker that reorders the top-10 candidates from ConvMemory v1 using a fine-tuned cross-encoder, improving MRR on the LoCoMo benchmark while preserving recall.

arXiv:2606.10842v1. Announce type: new. This report extends the v1 technical report (arXiv:2605.28062).

# A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval
Source: [https://arxiv.org/html/2606.10842](https://arxiv.org/html/2606.10842)
Taiheng Pan, School of Computing and Information Systems, University of Melbourne. github.com/pth2002

###### Abstract

We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1’s protected top-10 candidate set. v2 is a fine-tuned `ms-marco-MiniLM-L-6-v2` cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, $n = 4955$ test rows), v2 raises FULL MRR from v1’s 0.5824 to 0.6560 (paired bootstrap $+0.0734$, 95% CI $[+0.0645, +0.0827]$) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai\_top500, but on two raw-dense-hard slices (where v1’s protected top-10 has higher recall than mxbai’s own top-10) v2 exceeds mxbai\_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report \[[1](https://arxiv.org/html/2606.10842#bib.bib1)\].

###### Contents

1. [1 Introduction](https://arxiv.org/html/2606.10842#S1)
2. [2 Relationship to the v1 Paper](https://arxiv.org/html/2606.10842#S2)
3. [3 Related Work](https://arxiv.org/html/2606.10842#S3)
4. [4 ConvMemory v2 Architecture](https://arxiv.org/html/2606.10842#S4)
    1. [4.1 Recall-preserving cascade](https://arxiv.org/html/2606.10842#S4.SS1)
    2. [4.2 Scoring model and parameter count](https://arxiv.org/html/2606.10842#S4.SS2)
    3. [4.3 Inference prompt format](https://arxiv.org/html/2606.10842#S4.SS3)
    4. [4.4 Inference procedure](https://arxiv.org/html/2606.10842#S4.SS4)
5. [5 Training](https://arxiv.org/html/2606.10842#S5)
    1. [5.1 Objective](https://arxiv.org/html/2606.10842#S5.SS1)
    2. [5.2 Hyperparameters and split](https://arxiv.org/html/2606.10842#S5.SS2)
    3. [5.3 Training source and released-checkpoint provenance](https://arxiv.org/html/2606.10842#S5.SS3)
6. [6 Experimental Protocol](https://arxiv.org/html/2606.10842#S6)
7. [7 Inference Isolation and Shortcut Controls](https://arxiv.org/html/2606.10842#S7)
8. [8 Main Results](https://arxiv.org/html/2606.10842#S8)
    1. [8.1 Headline (FULL) and slice results](https://arxiv.org/html/2606.10842#S8.SS1)
    2. [8.2 Hard slices: where v2 exceeds the full-pool reference](https://arxiv.org/html/2606.10842#S8.SS2)
    3. [8.3 The top-10 reordering ceiling](https://arxiv.org/html/2606.10842#S8.SS3)
9. [9 Load-Bearing Ablation](https://arxiv.org/html/2606.10842#S9)
    1. [9.1 The two v2 numbers (0.6560 vs 0.6677)](https://arxiv.org/html/2606.10842#S9.SS1)
10. [10 Cost](https://arxiv.org/html/2606.10842#S10)
11. [11 Limitations](https://arxiv.org/html/2606.10842#S11)
12. [12 Reproducibility](https://arxiv.org/html/2606.10842#S12)
13. [13 Discussion and Future Work](https://arxiv.org/html/2606.10842#S13)
14. [A Full Latency Protocol](https://arxiv.org/html/2606.10842#A1)
15. [B Released Checkpoint Provenance](https://arxiv.org/html/2606.10842#A2)
16. [C Source-of-Truth Checklist](https://arxiv.org/html/2606.10842#A3)
17. [References](https://arxiv.org/html/2606.10842#bib)

## 1 Introduction

The v1 ConvMemory technical report \[[1](https://arxiv.org/html/2606.10842#bib.bib1)\] studied how cheaply a small learned reranker could approximate cross-encoder quality on conversational long-term memory retrieval. Its default reranker deliberately avoids running a per-query, per-candidate transformer forward over the candidate pool (v1 §3.4); it organizes a high-recall top-500 pool cheaply. A natural follow-up question is: once v1 has already narrowed the pool to a small, high-recall prefix, can a small, bounded amount of token-level cross-encoder computation, spent only on that prefix, improve ordering without sacrificing recall and without becoming as expensive as a full-pool cross-encoder?

This report answers that question with ConvMemory v2, a recall-preserving top-10 evidence reranker. v2 takes the exact top-10 set that v1 returns, scores those ten (query, memory) text pairs with a fine-tuned `ms-marco-MiniLM-L-6-v2` cross-encoder, reorders the ten candidates by that score, and appends v1’s unchanged tail. Because the top-10 *set* is preserved, Recall@10 and Hit@10 are unchanged from v1 by construction.

We make three contributions\.

1. ConvMemory v2: an opt-in, recall-preserving top-10 evidence reranker distributed as a released checkpoint on the Hugging Face Hub. Its value is a new point on the cost–quality frontier: on LoCoMo it recovers a large fraction of the MRR and H@1 headroom above v1 at roughly $1.7\times$ v1’s latency, while remaining about $68\times$ cheaper than a full-pool cross-encoder in our measurement. On FULL MRR, v2 remains below mxbai-rerank-large-v1; its advantage over the full-pool reference is slice-specific (§[8](https://arxiv.org/html/2606.10842#S8)).
2. A mechanism-level ablation: a four-arm, 5-seed token-evidence ablation showing that candidate-specific memory text is what drives v2’s gain. Removing the memory text, shuffling it within a question, or replacing it with text from other questions does not merely erase the gain; it drives MRR below raw dense retrieval, the cleanest available fingerprint that token-on-memory-text alignment is load-bearing.
3. An explicit anti-shortcut inference contract: the released inference API rejects a fixed set of gold-defining and teacher-derived fields, and train/test conversations are disjoint. This is backed by machine-checkable tests, not just prose.

v2 builds on a recall\-preserving cascade, an established information\-retrieval pattern; its contribution lies in the LoCoMo\-specific fine\-tuning, the recall\-preserving design, the load\-bearing analysis, and the anti\-shortcut inference contract\. All results reported here are on LoCoMo at the retrieval stage\.

## 2 Relationship to the v1 Paper

Because v2 composes with v1 rather than replacing it, and because the v1 report made a specific structural cost claim and a specific negative attribution result, we make the relationship explicit before describing v2. This section expands the “Relationship to the v1 paper” note in the public v2 documentation ([https://github.com/pth2002/ConvMemory/blob/main/docs/EVIDENCE_RERANKER.md](https://github.com/pth2002/ConvMemory/blob/main/docs/EVIDENCE_RERANKER.md)).

#### v1 §3\.4 \(the “no per\-pair transformer forward” structural claim\)\.

v1’s core cost argument is that its default path does not run a transformer forward over each query–candidate pair in the pool. v1 alone still honours this: `retrieve(query, memories)` uses the pure v1 path unless v2 is explicitly requested. v2 relaxes this claim only on the protected top-10: it scores ten query–memory pairs per query (one per protected candidate), not 500. v1 and v1+v2 are therefore two complementary points on the same cost–quality frontier rather than substitutes.

#### v1 §3\.3 \(teacher choice discipline\)\.

v1 deliberately did not use the strongest available cross-encoder (mxbai-rerank-large-v1) as its distillation teacher, to avoid conflating distillation gains with teacher choice. v2’s headline arm preserves this discipline: it is trained with a gold-only listwise objective (teacher weight $0.0$; §[5](https://arxiv.org/html/2606.10842#S5)). A cross-encoder-teacher variant exists but is not the headline and is not load-bearing (§[8](https://arxiv.org/html/2606.10842#S8)).

#### v1 §5 \(temporal window not load\-bearing\)\.

v1 published a negative result: its learned temporal window is statistically significant on aggregate but not temporally specific\. v2’s supported mechanism is distinct from the temporal window and is tested separately \(§[9](https://arxiv.org/html/2606.10842#S9)\): candidate\-specific memory text is load\-bearing\. The v1 negative result stands unchanged\.

#### v1 §7 \(from\-scratch stream rerankers fail without a teacher signal\)\.

v1 §7 showed that small from\-scratch stream rerankers trained with retrieval\-only supervision fail on real LoCoMo\. v2 occupies the complementary region v1 §7 left open: it uses a pretrained cross\-encoder backbone and supervised \(gold listwise\) fine\-tuning rather than a from\-scratch architecture\. v2 is evidence that the missing ingredient in v1 §7 was the supervised / pretrained signal source, not the cascade idea\.

#### v1 §10 \(CCGE\-LA on top did not match mxbai\)\.

v1 reported that even with the CCGE-LA editor, absolute LoCoMo MRR remained well below mxbai. v2 narrows this: v1 alone still loses to mxbai\_top500 (MRR 0.5824 vs 0.6688); v2 closes about 85% of the FULL-MRR gap between v1 and mxbai\_top500 (0.6560 vs 0.6688), while mxbai\_top500 stays ahead on FULL. On two raw-dense-hard slices, however, v2 *exceeds* mxbai\_top500 (§[8](https://arxiv.org/html/2606.10842#S8)). v2 and CCGE-LA are two independent opt-in extensions; v2 neither contains nor replaces CCGE-LA.
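As a quick check of the "about 85%" figure, the fraction of the gap closed can be recomputed from the three seed-averaged FULL MRR values quoted in this paragraph (no other inputs are assumed):

$$\frac{\mathrm{MRR}_{\text{v2}} - \mathrm{MRR}_{\text{v1}}}{\mathrm{MRR}_{\text{mxbai\_top500}} - \mathrm{MRR}_{\text{v1}}} = \frac{0.6560 - 0.5824}{0.6688 - 0.5824} = \frac{0.0736}{0.0864} \approx 0.852, \qquad 0.6688 - 0.6560 = 0.0128 \approx 0.013 .$$

The difference of means used here ($0.0736$) differs in the fourth decimal from the paired-bootstrap estimate reported in §8 ($+0.0734$); the ratio rounds to about 85% with either value.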

#### v1 §11 \(three future\-work directions\)\.

v1 listed multi\-backbone checkpoint distribution, an end\-to\-end agent benchmark, and broader CCGE\-LA training as future work\. v2 addresses a separate direction; all three v1 future\-work items remain open\.

Table [1](https://arxiv.org/html/2606.10842#S2.T1) summarizes these relationships.

Table 1: How ConvMemory v2 relates to specific results in the v1 report. v2 composes with v1 rather than replacing it, and does not revive or revisit v1’s negative attribution result.

## 3 Related Work

We keep related work focused on the components that directly determine v2’s design, evaluation setup, and claim boundaries\.

ConvMemory v2 is built on the `cross-encoder/ms-marco-MiniLM-L-6-v2` cross-encoder ([https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)) as its scoring backbone, the same family v1 used as its distillation teacher. As a high-cost full-pool reference point we use mxbai-rerank-large-v1 ([https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1)) applied over the dense top-500. The base ConvMemory v1 reranker uses an MPNet sentence encoder \[[3](https://arxiv.org/html/2606.10842#bib.bib3)\] and is described in the v1 report \[[1](https://arxiv.org/html/2606.10842#bib.bib1)\]. All evaluation is on the LoCoMo conversational memory benchmark \[[2](https://arxiv.org/html/2606.10842#bib.bib2)\]. Cross-encoder reranking over a candidate prefix, and recall-preserving cascades that reorder a protected set without changing recall, are standard IR patterns that v2 composes for conversational memory retrieval.

## 4 ConvMemory v2 Architecture

### 4.1 Recall-preserving cascade

v2 is the last stage of a cascade whose earlier stages are unchanged from v1:

```
query + memories
    --> dense MPNet candidate generation
    --> ConvMemory v1 rerank over the top-500 pool
    --> preserve the EXACT v1 top-10 set            <-- recall frozen here
    --> v2 cross-encoder reorders only those 10
    --> append v1’s unchanged tail (ranks 11..N)
```

Figure [1](https://arxiv.org/html/2606.10842#S4.F1) shows the same cascade in full detail, including the per-pair scoring of the protected top-10 and how the reordered set is recombined with the unchanged tail.

![Refer to caption](https://arxiv.org/html/2606.10842v1/x1.png)

Figure 1: The recall-preserving cascade in full. ConvMemory v1 reranks the dense top-500; v2 reorders only the protected v1 top-10 (one pair scoring per protected candidate) and appends v1’s unchanged tail. Because the top-10 set is preserved, Recall@10 and Hit@10 equal v1’s by construction.

The deployment rule is: take v1’s top-10 set, score the ten (query, memory) text pairs with the v2 cross-encoder, sort those ten by v2 score, and leave everything at rank 11 and beyond exactly as v1 ordered it. Because the top-10 *set* is never changed, Recall@10 and Hit@10 are exactly equal to v1’s; this is a constructive (by-design) property, not a statistical near-equality. The released checkpoint and its provenance are described in §[5](https://arxiv.org/html/2606.10842#S5); the isolation contract that constrains what v2 may read at inference is described in §[7](https://arxiv.org/html/2606.10842#S7).

### 4.2 Scoring model and parameter count

The v2 module is a fine-tuned `ms-marco-MiniLM-L-6-v2` cross-encoder. Its parameter count, measured from the released checkpoint, is exactly 22,713,601 parameters (we write “approximately 22.7M” where a round figure is convenient, but the measured value is 22,713,601). For comparison, the lightweight v1 reranker is roughly 3.6M parameters and performs zero cross-encoder pair scorings per query; v2 adds a second, heavier stage that runs ten cross-encoder pair scorings per query on the protected prefix. Every cost statement in this report distinguishes these two regimes (§[10](https://arxiv.org/html/2606.10842#S10)).

### 4.3 Inference prompt format

Each candidate is scored as a (query-side, memory-side) text pair using a time-annotated format. With candidate positions $\{p_i\}$ and $\texttt{max\_pos} = \max_i p_i$, the query side is rendered as

> `QUERY_TIME: {max_pos:.0f}. {query}`

and each candidate side as

> `MEMORY_TIME: {pos:.0f}. {text}`

The position metadata is optional; when present it is the only temporal signal v2 sees, and the load-bearing ablation (§[9](https://arxiv.org/html/2606.10842#S9)) shows it is not what drives the gain: candidate-specific memory text is.
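The two templates above map directly onto Python format strings. The following is a minimal sketch, not the released implementation: the helper name `format_pair`, the dictionary keys `text` and `pos`, and the example memories are all hypothetical, and the sketch assumes position metadata is present for every candidate.

```python
def format_pair(query, text, pos, max_pos):
    """Render one (query-side, memory-side) pair in the time-annotated format."""
    query_side = f"QUERY_TIME: {max_pos:.0f}. {query}"
    memory_side = f"MEMORY_TIME: {pos:.0f}. {text}"
    return query_side, memory_side


# Hypothetical candidates; "pos" stands in for the position/time metadata.
candidates = [
    {"text": "Alice adopted a cat.", "pos": 3.0},
    {"text": "Alice moved to Lisbon.", "pos": 17.0},
]

# max_pos is the maximum position over the candidates being scored.
max_pos = max(c["pos"] for c in candidates)
pairs = [
    format_pair("Where does Alice live?", c["text"], c["pos"], max_pos)
    for c in candidates
]
```

Every pair for a given query shares the same query-side string, since `max_pos` is computed once per query.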

### 4.4 Inference procedure

Algorithm [1](https://arxiv.org/html/2606.10842#alg1) states the full v2 inference path. The key invariant is on the last line: the returned top-10 *set* equals v1’s, so Recall@10 and Hit@10 are preserved by construction.

Algorithm 1: ConvMemory v2 inference (recall-preserving top-10 reorder)

Input: query $q$; memory store $M$; base ConvMemory v1; attached evidence reranker $R$

Output: ranked memory list

1. $C \leftarrow$ dense vector search over $M$, take dense top-500
2. $L \leftarrow$ ConvMemory v1 rerank of $C$ ▷ the v1 path; no per-pair cross-encoder scoring
3. $P \leftarrow \mathrm{first}_{10}(L)$ ▷ protected top-10 set (inclusive)
4. $T \leftarrow$ remainder of $L$ after the first 10 items ▷ unchanged tail
5. $p_i \leftarrow \mathrm{position}(m_i)$ for each $m_i \in P$
6. $\mathit{max\_pos} \leftarrow \max_{m_i \in P} p_i$
7. for each $m_i \in P$ do
8. $q_i \leftarrow$ `"QUERY_TIME: {max_pos}. "` $+\, q$
9. $d_i \leftarrow$ `"MEMORY_TIME: {p_i}. "` $+\, \mathrm{text}(m_i)$
10. $s_i \leftarrow R.\mathrm{score}(q_i, d_i)$ ▷ one pair scoring
11. end for
12. $P' \leftarrow$ sort $P$ by $s_i$ descending
13. return $P' \,\|\, T$

Invariant: $\mathrm{set}(P') = \mathrm{set}(P)$, so Recall@10 and Hit@10 equal v1’s by construction. The reranker performs exactly $|P| = 10$ pair scorings per query, never $|C| = 500$.
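The reorder-and-append step of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released API: `rerank_top10`, the `score_pair` callable, and the `id`/`text`/`pos` keys are hypothetical names, the scorer is a stand-in for the fine-tuned cross-encoder, and breaking score ties in v1 order is a choice made here rather than something the report specifies.

```python
from typing import Callable, List, Sequence


def rerank_top10(
    query: str,
    v1_ranked: Sequence[dict],
    score_pair: Callable[[str, str], float],
    top_k: int = 10,
) -> List[dict]:
    """Reorder only the protected prefix of a v1 ranking; keep the tail as-is."""
    protected = list(v1_ranked[:top_k])  # P: the protected top-k set
    tail = list(v1_ranked[top_k:])       # T: unchanged tail
    if not protected:
        return tail
    max_pos = max(m["pos"] for m in protected)
    q_side = f"QUERY_TIME: {max_pos:.0f}. {query}"
    scores = [
        score_pair(q_side, f"MEMORY_TIME: {m['pos']:.0f}. {m['text']}")
        for m in protected
    ]
    # sorted() is stable, so equal scores keep their v1 order (an assumption).
    order = sorted(range(len(protected)), key=lambda i: scores[i], reverse=True)
    return [protected[i] for i in order] + tail


# Toy usage with a stand-in scorer (NOT the v2 cross-encoder).
memories = [{"id": i, "text": f"note {i}", "pos": float(i)} for i in range(12)]
memories[7]["text"] = "Alice moved to Lisbon."
toy_scorer = lambda q, d: 1.0 if "lisbon" in d.lower() else 0.0
ranked = rerank_top10("Where does Alice live?", memories, toy_scorer)
```

Because the function only permutes `protected` and concatenates `tail` unchanged, the set of the first ten ids is the same before and after, which is the invariant the algorithm states.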

## 5 Training

### 5.1 Objective

v2 is trained with a listwise gold-only objective over the protected top-10. For a question with candidate logits $z_i$ and a gold indicator $g_i$ (normalized to sum to one over the gold memories), the loss is the listwise cross-entropy

$$\mathcal{L}_{\text{gold}} = -\sum_i g_i \,\log\big(\mathrm{softmax}(z)_i\big),$$

with the softmax taken over the ten protected candidates. An optional cross-encoder teacher term (a KL-style soft-label loss against an mxbai soft teacher) is supported, giving the general objective $\mathcal{L} = w_{\text{gold}}\,\mathcal{L}_{\text{gold}} + w_{\text{teacher}}\,\mathcal{L}_{\text{teacher}}$. The headline arm is gold-only: $w_{\text{gold}} = 1.0$, $w_{\text{teacher}} = 0.0$. A separate cross-encoder-teacher arm ($w_{\text{teacher}} = 0.25$) was run for comparison and reaches FULL MRR 0.6546, within noise of the gold-only 0.6560 (§[8](https://arxiv.org/html/2606.10842#S8)); the teacher term is therefore not load-bearing and is not part of the headline.

#### Questions whose gold is outside the protected top\-10\.

v1’s top-10 recall is 0.7798, so for roughly 22% of training questions no gold memory is present in the protected top-10. These questions still pass through the forward pass, but their gold target is replaced by a uniform placeholder and a `has_gold` mask zeroes their contribution to the listwise loss; since the headline arm also uses teacher weight $0.0$, such questions contribute nothing to the training signal. This is consistent with the recall-preserving contract: v2 cannot recover a gold item that v1 failed to place in the protected set, so these questions carry no usable ordering signal. Evaluation, in contrast, is over *all* test questions, including those v2 cannot recover, so the headline MRR already prices in the recall ceiling.
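The gold-only listwise loss and the `has_gold` masking can be illustrated for a single question in plain Python. This is a sketch of the formula for $\mathcal{L}_{\text{gold}}$ only: the function name `listwise_gold_loss` is hypothetical, the no-gold case simply returns a zero loss (which has the same effect as masking a uniform placeholder target), and how per-question losses are reduced over a batch is not specified in the report and is left out.

```python
import math


def listwise_gold_loss(logits, gold):
    """Listwise cross-entropy over one question's protected candidates.

    logits: raw scores z_i for the protected candidates.
    gold:   0/1 indicators; normalized here to sum to one over gold items.
    Returns (loss, has_gold); questions with no gold contribute zero loss.
    """
    total = sum(gold)
    if total == 0:
        return 0.0, False  # masked out: no gold in the protected set
    g = [x / total for x in gold]
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    loss = -sum(gi * (zi - log_z) for gi, zi in zip(g, logits) if gi > 0)
    return loss, True


# With uniform logits over ten candidates the loss is log(10).
uniform_loss, _ = listwise_gold_loss([0.0] * 10, [0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# A higher logit on the gold candidate lowers the loss.
loss_good, _ = listwise_gold_loss([5.0] + [0.0] * 9, [1] + [0] * 9)
loss_bad, _ = listwise_gold_loss([0.0] * 9 + [5.0], [1] + [0] * 9)
```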

### 5.2 Hyperparameters and split

Training uses AdamW with learning rate $2 \times 10^{-5}$, weight decay $0.01$, batch size $8$, one epoch, and a linear warmup capped at 100 steps. The training data is the LoCoMo dev split with `dev_ratio` $= 0.5$, where the split is taken by conversation id (`question_id.split("::", 1)[0]`) so that dev and test conversations are disjoint (§[7](https://arxiv.org/html/2606.10842#S7)). All headline numbers are 5-seed (seeds 7, 11, 23, 31, 47). The cross-encoder backbone is `cross-encoder/ms-marco-MiniLM-L-6-v2` with `max_length` $= 256$ and `top_k` $= 10$.

### 5.3 Training source and released-checkpoint provenance

The exact internal reproduction of the v363 5-seed method-level headline numbers uses the internal experiment script `experiments/v361_top10_evidence_reranker.py`; this is the script to cite for the headline numbers. A separate public, general-purpose training entry point is provided as `examples/train_evidence_reranker.py`; it is the user-facing recipe for training an evidence reranker on user records and is not the exact LoCoMo-locked harness.

The released checkpoint requires a careful provenance statement. The original v361 5-seed run did not save its per-seed checkpoints. The checkpoint published on the Hugging Face Hub (`Purdy0228/ConvMemory-v2-Evidence-Reranker`) is a seed-7 representative checkpoint, exported from the same v361 gold-only recipe after v0.5.0 packaging (provenance: `results/v365_v05_evidence_reranker_checkpoint/seed_7/MANIFEST.json`, which records a teacher weight of $0.0$, a training target of gold-only listwise retrieval cross-entropy, and a source experiment of `v361_top10_evidence_reranker.py`). Consequently the headline FULL MRR of 0.6560 is a method-level 5-seed estimate, not a separately measured score of this exact single checkpoint. We keep this distinction explicit throughout (§[8](https://arxiv.org/html/2606.10842#S8), §[11](https://arxiv.org/html/2606.10842#S11)).

## 6 Experimental Protocol

#### Dataset and task\.

All experiments are on the LoCoMo conversational memory retrieval setting\[[2](https://arxiv.org/html/2606.10842#bib.bib2)\]: given a query against an accumulated multi\-session conversation, retrieve the gold memory \(or memories\) supporting the answer\. We evaluate at the retrieval stage only, without a downstream answer generator\.

#### Candidate pool\.

The candidate pool is the dense MPNet top\-500, the same pool used for the v1, v2, and mxbai comparisons where applicable\. v2 only ever sees, and only ever reorders, the protected v1 top\-10 drawn from this pool\.

#### v1 anchor\.

The v1 baseline is the paper-compatible ConvMemory candidate-local path reproduced in v359, matching the v1 report’s v0.40 Table-3 style. Its 5-seed FULL metrics are R@10 0.7798, MRR 0.5824, H@1 0.4440, which match the v1 report and anchor every v2 − v1 delta.

#### Seeds and split\.

All headline numbers are 5-seed (seeds 7, 11, 23, 31, 47). The train/test split is produced by `choose_split` on the conversation (sample) id, where `sample_id = question_id.split("::", 1)[0]` and `dev_ratio` $= 0.5$. The split is taken over distinct sample ids and then materialized to examples, so dev (training) and test conversations are disjoint within each seed. Each seed thus induces its own conversation-level split: the seed is passed into `choose_split`, which shuffles the conversation ids with `random.Random(seed)` before partitioning, so seeds vary the dev/test partition as well as model initialization. The FULL row count $n = 4955$ is the *total* number of test questions pooled across the five seed-specific test sets (per-seed sizes 938, 1135, 937, 990, 955). Metrics are computed per (method, seed) and then averaged over seeds; paired bootstrap intervals resample paired per-question differences over the pooled rows, with the resampling unit keyed by `seed::split::question_id`. Because test rows are pooled across seed-specific splits, the bootstrap interval should be read as a paired method-level uncertainty estimate over evaluated seed–question instances, not as a confidence interval over a single fixed held-out test set.
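A conversation-level split of this kind can be sketched as follows. The real `choose_split` is not reproduced in this report, so the function below is a hypothetical stand-in (`choose_split_sketch`): the sorting of ids before shuffling and the rounding of the dev count are assumptions, while the `sample_id` rule, the `random.Random(seed)` shuffle, and `dev_ratio = 0.5` follow the description above.

```python
import random


def sample_id(question_id: str) -> str:
    """Conversation (sample) id: the part of question_id before the first '::'."""
    return question_id.split("::", 1)[0]


def choose_split_sketch(question_ids, seed, dev_ratio=0.5):
    """Split by conversation id so that dev and test conversations are disjoint."""
    conv_ids = sorted({sample_id(q) for q in question_ids})  # distinct sample ids
    random.Random(seed).shuffle(conv_ids)                    # seed-specific order
    n_dev = int(len(conv_ids) * dev_ratio)                   # rounding is assumed
    dev_convs = set(conv_ids[:n_dev])
    # Materialize the conversation-level partition back to questions.
    dev = [q for q in question_ids if sample_id(q) in dev_convs]
    test = [q for q in question_ids if sample_id(q) not in dev_convs]
    return dev, test
```

Because the partition is taken over conversation ids rather than questions, the per-seed test sizes can differ when conversations contain different numbers of questions, which is consistent with the unequal per-seed sizes listed above.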

#### Metrics\.

We report Recall@10, MRR, and H@1. Because v2 preserves the v1 top-10 set, its Recall@10 (and Hit@10) equal v1’s exactly; this is a by-design equality, not an empirical near-tie, so v2 − v1 deltas on R@10 are an exact zero. The metric that moves is ordering quality inside the protected set (MRR, H@1).
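The by-design equality is easy to see with per-question metric functions. The definitions below are the common textbook forms and are an assumption here; the report does not spell out details such as whether MRR is truncated at a cutoff. Function names and the example ids are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k=10):
    """Fraction of gold memories that appear in the top-k."""
    gold = set(gold_ids)
    return len(gold & set(ranked_ids[:k])) / len(gold)


def hit_at_k(ranked_ids, gold_ids, k=1):
    """1.0 if any gold memory appears in the top-k, else 0.0."""
    gold = set(gold_ids)
    return float(any(i in gold for i in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold memory (0.0 if none is ranked)."""
    gold = set(gold_ids)
    for rank, i in enumerate(ranked_ids, start=1):
        if i in gold:
            return 1.0 / rank
    return 0.0


# A v1 ranking, and a reordering of its first ten items with the tail untouched.
v1_order = [f"m{i}" for i in range(12)]
v2_order = ["m4", "m0", "m1", "m2", "m3", "m5", "m6", "m7", "m8", "m9", "m10", "m11"]
gold = ["m4"]
```

Here `recall_at_k` and `hit_at_k(..., k=10)` are identical for the two orders because the top-10 sets coincide, while the reciprocal rank moves from 1/5 to 1.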

#### Significance\.

Confidence intervals are from a question-level paired bootstrap over the pooled per-seed test rows ($n = 4955$ on FULL, pooled across five seeds), resampling questions with replacement and recomputing the paired metric difference. We report 95% intervals; a delta is called significant when its interval excludes zero.
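A paired bootstrap of this kind can be sketched as below. This is a generic percentile bootstrap, which is an assumption: the report does not state the interval construction or the number of resamples, and the function name, `n_boot`, and the flat list of per-row differences (one entry per `seed::split::question_id` row) are illustrative choices.

```python
import random


def paired_bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-question differences.

    diffs: one value per evaluated row, e.g. RR_v2 - RR_v1 for that question.
    Returns (mean_delta, lower, upper).
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample rows with replacement and recompute the mean paired delta.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper


def is_significant(lower, upper):
    """A delta is called significant when its interval excludes zero."""
    return lower > 0 or upper < 0
```

Resampling the differences, rather than the two methods' scores independently, keeps each question paired with itself, which is what makes the interval a statement about the per-question delta.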

#### Hard slices\.

Beyond FULL we report three diagnostic slices:

- T\_SUP\_auto: supersession-style questions where a later memory supersedes an earlier one. This slice is assigned *automatically* by heuristic; its labels are not human-audited, and we treat it as an automatic slice throughout.
- RAW\_TOP1\_WRONG\_GOLD\_IN\_POOL: questions where the raw dense top-1 is wrong but the gold memory is nonetheless somewhere in the pool, i.e. reranking has something to recover.
- RAW\_RESCUABLE\_STALE\_TOP1: questions where the raw dense top-1 is a stale memory but a rescuable correct memory is present.

On both raw-dense-hard slices, v1’s protected top-10 has high recall (R@10 $\approx 0.93$), which is what gives the recall-preserving cascade room to beat a full-pool cross-encoder on those slices (§[8](https://arxiv.org/html/2606.10842#S8)).

## 7 Inference Isolation and Shortcut Controls

v2’s inference contract is deliberately narrow; we refer to it as the anti\-shortcut contract, following the v1 report\. The only inputs the scorer accepts are: query text; candidate memory id and memory text; optional candidate position/time metadata; and the protected v1 top\-10 candidate set\. Gold labels and teacher \(mxbai / cross\-encoder\) scores are used only as training or evaluation targets and never as inference features\.

The public API enforces this with a fixed `FORBIDDEN_FIELDS` set. If any candidate passed at inference contains any of the following keys, the API raises `ValueError`:

> `gold`, `gold_ids`, `is_current`, `is_latest`, `is_stale`, `stale`, `answer`, `answer_text`, `ce_score`, `mxbai_score`, `teacher_score`, `gpt_label`, `entity_id`, `slot_id`.

Two properties are checked mechanically rather than asserted in prose. The test `test_evidence_reranker_rejects_forbidden_fields` iterates over every forbidden field and confirms the API rejects it. The test `test_default_behavior_unchanged` confirms that the default (v1) path is byte-identical whether or not the v2 module is present, so enabling v2 is a true opt-in. A third test confirms recall preservation: the returned top-10 set is identical with and without v2.
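The rejection rule can be sketched as a small validation function. The field names are the ones listed above; everything else (the function name `validate_candidates`, the dict-shaped candidates, and the error message) is a hypothetical illustration rather than the released API.

```python
FORBIDDEN_FIELDS = frozenset({
    "gold", "gold_ids", "is_current", "is_latest", "is_stale", "stale",
    "answer", "answer_text", "ce_score", "mxbai_score", "teacher_score",
    "gpt_label", "entity_id", "slot_id",
})


def validate_candidates(candidates):
    """Raise ValueError if any candidate carries a gold- or teacher-derived key."""
    for idx, cand in enumerate(candidates):
        leaked = FORBIDDEN_FIELDS & set(cand)
        if leaked:
            raise ValueError(
                f"candidate {idx} contains forbidden field(s): {sorted(leaked)}"
            )


# A candidate with only id, text, and optional position metadata passes.
validate_candidates([{"id": "m1", "text": "Alice moved to Lisbon.", "pos": 17}])
```

A test in the spirit of `test_evidence_reranker_rejects_forbidden_fields` would loop over every member of the set and assert that a candidate containing that key is rejected.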

Train/test isolation is by conversation: the split function partitions on `sample_id(question_id) = question_id.split("::", 1)[0]`, so that no conversation appears in both the dev (training) and test partitions for a given seed. The final v361 headline is trained without GPT-labeled data; earlier GPT experiments fall outside the v2 claim.

Table [2](https://arxiv.org/html/2606.10842#S7.T2) collects the isolation contract as an audit checklist.

Table 2: Anti-shortcut / isolation audit checklist. Each property is backed by a named mechanism or test rather than by prose assertion.

## 8 Main Results

### 8.1 Headline (FULL) and slice results

Table [3](https://arxiv.org/html/2606.10842#S8.T3) reports the v363 verifier-packet headline numbers (5-seed method-level; $n = 4955$ pooled test rows on FULL). v2 is the `v361_top10_gold_listwise` arm. R@10 is constant across the v1-derived rows by construction (the protected top-10 set is preserved).

Table 3: LoCoMo FULL results, 5-seed method-level ($n = 4955$). v2 preserves v1’s Recall@10 by construction and improves MRR and H@1. mxbai\_ce\_top500 is a high-cost full-pool cross-encoder reference (not an upper bound); mxbai\_top10\_guard is mxbai restricted to the same protected top-10 (set-matched control, i.e. same ten candidates and same ten pair scorings, but not the same wall-clock cost); oracle\_top10\_guard is the true ceiling for any reordering of v1’s top-10.

Paired bootstrap of v2 minus v1 on the full set ($n = 4955$):

- MRR delta $+0.0734$, 95% CI $[+0.0645, +0.0827]$.
- H@1 delta $+0.1033$, 95% CI $[+0.0906, +0.1171]$.
- R@10 delta $= 0$ (constructive zero; the top-10 set is unchanged).

#### Naming discipline for the mxbai reference\.

We treat mxbai\_ce\_top500 as a strong but expensive full-pool cross-encoder reference rather than an upper bound or ceiling. The true ceiling for any method that may only reorder v1’s protected top-10 is `oracle_top10_guard` (FULL MRR 0.8415), which leaves substantial headroom above v2.

### 8.2 Hard slices: where v2 exceeds the full-pool reference

Table [4](https://arxiv.org/html/2606.10842#S8.T4) reports two raw-dense-hard slices alongside the supersession slice (5-seed means). On both hard slices, v2 *exceeds* mxbai\_top500.

Table 4: Slice MRR (5-seed mean). v2 − v1 deltas are paired-bootstrap with 95% CI. On RAW\_TOP1\_WRONG and RAW\_RESCUABLE\_STALE, v2 exceeds mxbai\_top500; on T\_SUP\_auto and on FULL it does not. Slice names are abbreviated: RAW\_TOP1\_WRONG is RAW\_TOP1\_WRONG\_GOLD\_IN\_POOL (raw dense top-1 wrong but gold in pool); RAW\_RESCUABLE\_STALE is RAW\_RESCUABLE\_STALE\_TOP1 (raw dense top-1 stale but a rescuable correct memory present).

#### Why v2 beats mxbai\_top500 on these slices.

The mechanism is a recall asymmetry in the protected pool; it reflects pool composition rather than v2’s scorer being stronger than mxbai in general. On these two raw-dense-difficult slices, v1’s protected top-10 has R@10 $\approx 0.93$, higher than mxbai’s own top-10 recall when mxbai reranks the full top-500 (R@10 $\approx 0.85$ on these slices); Table [5](https://arxiv.org/html/2606.10842#S8.T5) reports the exact per-slice values. So “v1’s high-recall protected pool + v2’s precise reordering” can outscore “mxbai reranking a top-500 whose own top-10 already lost more gold.” This is a genuine selling point of the recall-preserving design, but it is slice-specific: on FULL MRR and on T\_SUP\_auto, mxbai\_top500 is still ahead.

Table 5: The recall asymmetry behind Table [4](https://arxiv.org/html/2606.10842#S8.T4). R@10(v1 protected) is the recall of v1’s protected top-10 on the slice; R@10(mxbai top-10) is the recall of mxbai\_top500’s own top-10 after reranking the full pool. On the two raw-dense-hard slices v1’s protected set retains more gold; on T\_SUP\_auto the asymmetry is reversed, consistent with v2 trailing mxbai\_top500 there.

#### Scope of these results.

These results are LoCoMo\-only and retrieval\-stage\. They are evidence for slice\-specific gains under the recall\-preserving cascade, not for v2 surpassing mxbai in general \(it does not on FULL or T\_SUP\), for generalization beyond conversational memory, or for downstream answer\-quality gains in an end\-to\-end agent/QA pipeline \(not evaluated here\)\. v2’s gains are upper\-bounded by v1’s top\-10 recall: any gold memory v1 fails to place in the protected top\-10 is unrecoverable by v2\.

### 8\.3The top\-10 reordering ceiling

To put v2’s FULL MRR in context, Table [6](https://arxiv.org/html/2606.10842#S8.T6) places it between the two set\-matched references that operate on the same protected top\-10\.

Table 6: FULL MRR on the protected top\-10\. mxbai\_top10\_guard is mxbai restricted to the same ten candidates \(set\-matched: same candidates and same number of pair scorings, not the same cost\); oracle\_top10\_guard is the true ceiling for any reordering of v1’s top\-10 \(it places the gold first whenever the gold is in the set\)\. v2 recovers a large fraction of the v1 → mxbai\_top10\_guard headroom but leaves substantial room below the oracle\.

Two points follow\. First, on the set\-matched comparison \(same ten candidates\), v2 \(0\.6560\) is close to mxbai\_top10\_guard \(0\.6623\), and v2’s MiniLM\-L\-6 backbone is 6\.7× cheaper to run on that set \(§[10](https://arxiv.org/html/2606.10842#S10)\)\. Second, the true ceiling for reordering v1’s top\-10 is oracle\_top10\_guard at 0\.8415, far above all learned arms; the headroom that remains is bounded by ordering quality, while the headroom that is permanently lost is bounded by v1’s top\-10 recall\.
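As an illustration of how such an oracle ceiling can be computed, here is a small sketch. It assumes one gold rank per query and that gold memories ranked outside the protected prefix still contribute 1/rank to MRR; the report does not spell out either detail, and the function names and toy ranks are hypothetical.

```python
from typing import List, Optional


def reciprocal_rank(gold_rank: Optional[int]) -> float:
    """1/rank of the gold memory (1-indexed); 0.0 if it was never retrieved."""
    return 0.0 if gold_rank is None else 1.0 / gold_rank


def oracle_top_k_rank(gold_rank: Optional[int], k: int = 10) -> Optional[int]:
    """Best rank any reordering of the top-k can give the gold memory."""
    if gold_rank is not None and gold_rank <= k:
        return 1  # gold is inside the protected set, so the oracle puts it first
    return gold_rank  # gold is outside the set: no reordering can move it


def mrr(gold_ranks: List[Optional[int]]) -> float:
    return sum(reciprocal_rank(r) for r in gold_ranks) / len(gold_ranks)


# Toy gold ranks under the upstream ordering (None = gold not in the pool).
toy_ranks = [1, 4, 2, 12, None]
observed = mrr(toy_ranks)
ceiling = mrr([oracle_top_k_rank(r) for r in toy_ranks])
```

The gap between `observed` and `ceiling` is the ordering headroom a better reranker could still recover; the last two toy queries contribute the same amount to both, which is the part fixed by upstream recall.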

## 9Load\-Bearing Ablation

To identify what actually drives v2’s MRR gain, the v364 audit retrains the v2\-style full\-text arm in the same harness as four perturbation arms, all 5\-seed, all preserving the exact v1 top\-10 set \(so all have R@10 = 0\.7798 on FULL\)\. Table [7](https://arxiv.org/html/2606.10842#S9.T7) reports FULL MRR\.

The four perturbation arms are designed to remove or corrupt exactly one source of signal each, while keeping the protected top\-10 set \(and hence Recall@10\) fixed:

- no\_memory\_text: the candidate memory text is removed entirely, leaving only the query side and any positional prefix\.
- random\_other\_query\_text \(the name is historical; it is the candidate *memory* text, not the query, that is replaced\): each candidate’s text is replaced with text drawn from *other* questions\. This is the cross\-question stress test, which breaks query–candidate alignment while keeping superficially plausible text\.
- shuffled\_memory\_text: the candidate texts are permuted *within the same question*\. This is the same\-topic\-but\-wrong\-candidate stress test, which keeps the texts on\-topic but attaches them to the wrong candidate\.
- scalar\_only: only rank/score/time metadata is kept; the memory text is withheld, isolating whatever a non\-textual shortcut could achieve\.
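The four perturbations can be pictured as simple transformations of a question's candidate list. The sketch below is illustrative only: the candidate schema (`id`, `text`, `rank`, `score`) and the function bodies are assumptions, not the audit script's actual input format, and the real arms may differ in detail (for example, in how the positional prefix or time metadata is encoded).

```python
import random
from typing import Dict, List

Candidate = Dict[str, object]  # hypothetical schema: id, text, rank, score


def no_memory_text(cands: List[Candidate]) -> List[Candidate]:
    """Blank out every candidate's memory text."""
    return [{**c, "text": ""} for c in cands]


def shuffled_memory_text(
    cands: List[Candidate], rng: random.Random
) -> List[Candidate]:
    """Permute the texts among the candidates of the same question."""
    texts = [c["text"] for c in cands]
    rng.shuffle(texts)  # a plain shuffle; some texts may stay in place by chance
    return [{**c, "text": t} for c, t in zip(cands, texts)]


def random_other_query_text(
    cands: List[Candidate], other_texts: List[str], rng: random.Random
) -> List[Candidate]:
    """Replace each text with one drawn from other questions' candidates."""
    return [{**c, "text": rng.choice(other_texts)} for c in cands]


def scalar_only(cands: List[Candidate]) -> List[Candidate]:
    """Keep rank/score metadata and drop the text field altogether."""
    return [{k: v for k, v in c.items() if k != "text"} for c in cands]
```

Every function returns the same candidates in the same order, so the protected set (and hence Recall@10) is untouched; only the evidence the scorer gets to read changes.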

Table 7: Token\-evidence load\-bearing ablation, FULL MRR, 5\-seed\. All arms preserve v1’s top\-10 set\. The within\-harness full\-text baseline is 0\.6677 \(see §[9\.1](https://arxiv.org/html/2606.10842#S9.SS1) on why this differs slightly from the 0\.6560 headline\)\. The three text perturbations all fall below raw dense \(0\.3254\)\.

Figure [2](https://arxiv.org/html/2606.10842#S9.F2) visualizes the same ablation; the three text\-perturbation arms all fall below the raw\-dense baseline, while scalar\_only stays near v1 level\.

![Refer to caption](https://arxiv.org/html/2606.10842v1/figures/figure3.png)

Figure 2: Token\-evidence load\-bearing ablation \(FULL MRR, 5\-seed\)\. All arms preserve v1’s top\-10 set\. The three text\-perturbation arms \(no\_memory\_text, shuffled\_memory\_text, random\_other\_query\_text\) all fall below the raw\_dense baseline \(0\.3254, dashed line\): corrupting candidate\-specific memory text does not merely erase the gain, it scores below the unreranked baseline\. scalar\_only stays near v1 level\.

Paired bootstrap, full text minus each ablation on FULL MRR \(all 95% CIs exclude zero\):

- full − no\_memory\_text: \+0\.3712 \[\+0\.3599, \+0\.3829\]\.
- full − random\_other\_query\_text: \+0\.4173 \[\+0\.4067, \+0\.4284\]\.
- full − shuffled\_memory\_text: \+0\.3948 \[\+0\.3834, \+0\.4060\]\.
- full − scalar\_only: \+0\.0881 \[\+0\.0801, \+0\.0969\]\.
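For readers unfamiliar with the procedure, a paired bootstrap over per-query reciprocal ranks can be sketched as follows. This is a generic percentile bootstrap, not the audit script: the resampling unit (one test query), the number of resamples, and the way the five seeds are pooled are all assumptions here.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    rr_a: List[float],
    rr_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a percentile bootstrap CI."""
    assert len(rr_a) == len(rr_b) and rr_a
    # Pairing: each query contributes one difference between the two arms.
    diffs = [a - b for a, b in zip(rr_a, rr_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement; both arms share the same draw.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi
```

A call such as `paired_bootstrap_ci(rr_full_text, rr_ablation)` returns the mean difference and the interval bounds; an interval that excludes zero corresponds to the "all 95% CIs exclude zero" statement above.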

#### Key finding\.

The three text perturbations \(0\.2506, 0\.2731, 0\.2966\) all fall *below* raw dense retrieval \(0\.3254\)\. This is the cleanest fingerprint that token\-on\-memory\-text alignment is load\-bearing: when the candidate\-specific text is removed, shuffled within the question, or swapped for text from other questions, the model does not gracefully fall back to a sensible default; it is confidently wrong, scoring below the unreranked dense order\. By contrast, the scalar\-only arm \(rank/time/score features without memory text\) stays at roughly v1 level \(0\.5792 vs v1’s 0\.5824\), showing that scalar shortcuts alone cannot reproduce the gain\. The mechanism is candidate\-specific text interaction, not rank/time/score shortcuts\.

The two text stress tests probe complementary failure modes and both collapse\. random\_other\_query\_text \(0\.2506\) confirms the model is not exploiting generic, query\-independent text statistics: when candidate texts come from unrelated questions, scoring is worse than dense order\. shuffled\_memory\_text \(0\.2731\) is the harder, same\-question control: the texts are exactly the candidate texts for that question, merely reattached to the wrong candidates, and the model still collapses\. So it is the *alignment* between a specific candidate’s text and the query, not mere topical presence of plausible text, that carries the signal\.

That all three text\-damaged arms land below raw dense \(0\.3254\) rather than merely below full text is the strong form of the result: a model that had only learned a mild text prior would degrade toward the dense baseline, not below it\. Scoring below dense means the corrupted text is actively misleading the reranker, which strongly supports the interpretation that the uncorrupted candidate\-specific text was load\-bearing\.

### 9\.1The two v2 numbers \(0\.6560 vs 0\.6677\)

There are two method\-level v2 MRR estimates in this report, and we keep them distinct\. The canonical headline is the v361 5\-seed method\-level run: FULL MRR 0\.5824 → 0\.6560\. The released HF checkpoint \(Purdy0228/ConvMemory\-v2\-Evidence\-Reranker\) is a seed\-7 representative checkpoint exported from the same v361 gold\-only recipe after v0\.5\.0 packaging; the 0\.6560 figure is the method\-level 5\-seed estimate, *not* a separately measured score of this exact single checkpoint\. The v364 ablation\-harness baseline is 0\.6677 \(full\_text retrained inside the v364 ablation script as a fresh 5\-seed run, for apples\-to\-apples comparison with the four perturbation arms\)\. The \+0\.012 gap between 0\.6560 and 0\.6677 is about 1\.3σ of fresh\-training seed variance \(MRR std ≈ 0\.009\); both method\-level estimates are significantly above v1, with bootstrap CIs that do not cross zero\.

## 10Cost

Figure [3](https://arxiv.org/html/2606.10842#S10.F3) plots the cost–quality trade\-off across the four reranking paths; the precise latency figures follow in Table [8](https://arxiv.org/html/2606.10842#S10.T8)\.

![Refer to caption](https://arxiv.org/html/2606.10842v1/figures/figure2.png)

Figure 3: MRR–latency trade\-off on LoCoMo \(x\-axis: log scale\)\. v1\+v2 recovers most of the MRR headroom above v1 at ∼1\.7× v1’s latency, and stays within 0\.013 FULL MRR of the high\-cost full\-pool reference mxbai\_top500 while running about 68× cheaper\. mxbai\_top500 remains the most accurate but by far the slowest, and remains ahead of v2 on FULL MRR\.

Table [8](https://arxiv.org/html/2606.10842#S10.T8) reports the v362 latency probe \(200 timed queries after warmup, RTX 4080 SUPER\)\. The full protocol is in Appendix [A](https://arxiv.org/html/2606.10842#A1)\.

Table 8: Latency \(RTX 4080 SUPER, 200 timed queries after warmup\)\. “pair scorings” is the number of cross\-encoder query–memory pair evaluations per query\. v1 runs none; the v2 evidence stage runs ten \(one per protected candidate\); a full\-pool cross\-encoder runs 500\.

#### What the latency figures include and exclude\.

Each measured path includes the model scoring it names: ConvMemory v1 scoring over the top\-500 for the v1 rows; the v2 cross\-encoder scoring over the protected top\-10 for the v2 rows; and mxbai cross\-encoder scoring over the stated candidate count for the mxbai rows\. The figures exclude vector\-database I/O, network latency, downstream LLM answer generation, and prompt construction; they reflect reranking\-stage compute on the stated candidates\. Embedding precomputation is excluded except where it is intrinsically part of a measured path\. These are single\-configuration indicative measurements, not benchmark\-grade latency claims; the full protocol and caveats are in Appendix [A](https://arxiv.org/html/2606.10842#A1)\.

#### Cost framing\.

v1 alone preserves the v1 §3\.4 property \(zero per\-pair cross\-encoder scoring over the pool\)\. v1\+v2 adds a bounded precision stage: ten cross\-encoder pair scorings per query on the protected top\-10, where the v2 module itself is 22,713,601 parameters\. This adds a new point on the cost–quality frontier at roughly 1\.7× v1’s latency and about 68× cheaper than a full\-pool mxbai cross\-encoder in this measurement; v1 remains the cheaper default path alongside it\. Comparing the two cross\-encoders on the identical protected top\-10 \(set\-matched\), v2 \(11\.867 ms/q\) is 6\.72× cheaper than mxbai \(79\.770 ms/q\), because v2’s MiniLM\-L\-6 backbone is far smaller than mxbai\-large\.

## 11Limitations

#### Recall is capped by v1\.

v2 only reorders v1’s protected top\-10, so it inherits v1’s top\-10 recall as a hard ceiling\. Any gold memory that v1 fails to place in the top\-10 is permanently unrecoverable by v2\. The relevant ceiling is oracle\_top10\_guard \(FULL MRR 0\.8415\), which is the best any reorderer of v1’s top\-10 could achieve; v2’s 0\.6560 leaves real headroom, and that headroom is split between better ordering \(recoverable\) and missing recall \(not\)\.

#### LoCoMo\-specific fine\-tuning, no cross\-domain evidence\.

The headline is a LoCoMo conversational\-memory result\. Transfer to other domains, other conversation distributions, or document retrieval is untested here and outside the scope of this report’s claims\. A cross\-domain user should retrain or at least re\-validate their own evidence reranker rather than assume the LoCoMo checkpoint transfers\.

#### Single\-hardware latency\.

All cost numbers come from one consumer GPU \(RTX 4080 SUPER\) in a single run of 200 timed queries\. Data\-center GPUs, smaller GPUs, CPU inference, and kernel\-fused servers would shift both the absolute numbers and the ratios; practitioners should measure on their target hardware\.

#### Retrieval\-stage only\.

Every metric here is retrieval\-stage MRR / H@1 / Recall@10\. The evaluation stops at retrieval; whether v2’s ordering gains translate into downstream answer\-quality gains in an end\-to\-end agent or QA pipeline is left to future work\.
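For reference, the three retrieval-stage metrics can be written down in a few lines. These are the standard textbook definitions and are assumed, not confirmed, to match the evaluation harness (for instance in how questions with several gold memories are handled); the ids below are toy values.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold id appears in the first k positions, else 0.0."""
    return float(any(m in gold for m in ranked[:k]))


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold ids that appear in the first k positions."""
    return len(gold.intersection(ranked[:k])) / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1/rank of the first gold id (1-indexed); 0.0 if none is ranked."""
    for position, memory_id in enumerate(ranked, start=1):
        if memory_id in gold:
            return 1.0 / position
    return 0.0


before = ["m3", "m7", "m1"]  # toy upstream order
after = ["m7", "m1", "m3"]   # same three ids, reordered
gold = {"m7"}
```

On this toy pair, reordering raises the reciprocal rank and H@1 while `recall_at_k(..., k=3)` is identical for both orders, which mirrors why MRR and H@1 are the metrics a reorder-only stage can move.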

#### Below mxbai on FULL\.

On FULL MRR, v2 sits 0\.013 below mxbai\_top500 \(0\.6560 vs 0\.6688\), and on the T\_SUP\_auto slice it also trails \(0\.6469 vs 0\.6572\)\. v2’s advantage over mxbai\_top500 holds only on the two raw\-dense\-hard slices of Table [4](https://arxiv.org/html/2606.10842#S8.T4) and is a slice\-specific recall\-asymmetry effect rather than general dominance\.

#### Released checkpoint vs\. headline number\.

The published checkpoint is a single seed\-7 representative of the gold\-only recipe; the headline figures are 5\-seed method\-level estimates \(§[9\.1](https://arxiv.org/html/2606.10842#S9.SS1), Appendix [B](https://arxiv.org/html/2606.10842#A2)\)\. The two are not interchangeable, and downloading the single checkpoint should not be expected to reproduce the 5\-seed aggregate exactly\. A cross\-encoder\-teacher variant exists \($w_{\text{teacher}} = 0.25$, FULL MRR 0\.6546\) but is not the headline; the headline is gold\-only listwise\.

#### Not a replacement for other extensions\.

v2 is an opt\-in stage, not a replacement for v1 \(the cheaper default\) nor for the separate CCGE\-LA conflict editor or the Memory\-MLA recall expander; these are independent components, each composing with v2 as a separate opt\-in stage\.

## 12Reproducibility

The package is installable via pip install convmemory==0\.5\.0\. The released v2 checkpoint is on the Hugging Face Hub at Purdy0228/ConvMemory\-v2\-Evidence\-Reranker \(a seed\-7 representative checkpoint\)\. The source is on GitHub at github\.com/pth2002/ConvMemory at tag v0\.5\.0 \(commit 48b80b4\)\. A minimal usage example:

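```
from convmemory import ConvMemory

# q is the query and ms the candidate memories; both are supplied by the caller.
m = ConvMemory.from_pretrained("Purdy0228/ConvMemory-LoCoMo-MPNet")
# Opt-in step: attach the released v2 evidence reranker checkpoint.
m.load_evidence_reranker("Purdy0228/ConvMemory-v2-Evidence-Reranker")
# v2 reorders the protected top-10; it does not change which ten are returned.
ranked = m.retrieve(query=q, memories=ms, evidence_reranker="v2", top_k=10)
```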

The exact internal reproduction of the v363 5\-seed method\-level headline numbers uses experiments/v361\_top10\_evidence\_reranker\.py\. The public, general\-purpose training recipe \(a user\-facing entry point, not the LoCoMo\-locked harness\) is examples/train\_evidence\_reranker\.py\.

## 13Discussion and Future Work

The three future\-work directions named in the v1 report — multi\-backbone checkpoint distribution, an end\-to\-end agent benchmark integrating ConvMemory into a full pipeline, and broader \(multi\-task, multi\-seed\) CCGE\-LA training — all remain open\. v2 addresses a separate, unanticipated direction\. Natural next steps specific to v2 include: validating the recall\-preserving cascade on non\-LoCoMo conversational\-memory data; measuring whether v2’s retrieval\-stage MRR/H@1 gains translate into downstream answer quality in an end\-to\-end agent; and studying how far the protected\-prefix width \(here, ten\) can be widened before the cost advantage over a full\-pool cross\-encoder erodes\.

## Appendix AFull Latency Protocol

This appendix documents the latency measurement behind Table [8](https://arxiv.org/html/2606.10842#S10.T8) in §[10](https://arxiv.org/html/2606.10842#S10)\. As in the v1 report, the intent is not benchmark\-grade latency publication but a reproducible cost\-frontier comparison and an explicit statement of what each figure includes and excludes\.

#### Setup\.

Measurements were taken on a single NVIDIA GeForce RTX 4080 SUPER\. The probe times 200 queries after a warmup pass, on cuda\. The v1 candidate pool is the dense MPNet top\-500; v2 reranks only the protected v1 top\-10\.
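A probe of this shape can be sketched in a few lines. This is a generic timing harness, not the v362 script: `time_per_query_ms` and `run_query` are hypothetical names, and the warmup length of 20 is an assumption, since the report states only that a warmup pass precedes the 200 timed queries.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def time_per_query_ms(
    run_query: Callable[[str], object],
    queries: Sequence[str],
    n_warmup: int = 20,   # assumed value; the report does not give the count
    n_timed: int = 200,
) -> float:
    """Mean wall-clock milliseconds per query over n_timed calls after warmup."""
    for q in queries[:n_warmup]:
        run_query(q)  # warmup: results discarded, not timed
    timings = []
    for i in range(n_timed):
        q = queries[i % len(queries)]
        start = time.perf_counter()
        run_query(q)
        # On a GPU, run_query must block until its kernels finish (for example
        # by calling torch.cuda.synchronize()), or only launch time is measured.
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)
```

Passing a callable per reranking path (v1 only, v1 followed by v2, a full-pool cross-encoder) yields per-path figures that can be compared as ratios on the same hardware.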

#### What each row measures\.

- v1 top500 \(16\.779 ms/q\): the v1 reranking path over the top\-500 pool, producing the top\-10 set v2 will reorder\.
- v2 \(top10 only\) \(11\.867 ms/q\): ten cross\-encoder pair scorings over the protected top\-10, excluding the upstream v1 cost\.
- v1 \+ v2 \(28\.646 ms/q\): the deployed opt\-in path; roughly the sum of the previous two, i\.e\. 1\.71× v1 alone\.
- mxbai top10 \(79\.770 ms/q\): mxbai\-large restricted to the same protected ten candidates, the set\-matched cross\-encoder control\.
- mxbai top500 \(1960\.248 ms/q\): mxbai\-large over the full top\-500 pool, the high\-cost full\-pool reference\.

#### Ratios\.

mxbai\_top500 / \(v1\+v2\) = 68\.43×; mxbai\_top500 / v1 = 116\.83×; mxbai\_top10 / v2\_top10 = 6\.72×; \(v1\+v2\) / v1 = 1\.71×\.
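These ratios follow directly from the per-row latencies listed above, and the arithmetic can be rechecked with a short script. The dictionary keys are shorthand labels chosen for this sketch; the millisecond values are the ones reported in this appendix.

```python
# Per-query latencies in milliseconds, as listed in "What each row measures".
latency_ms = {
    "v1_top500": 16.779,
    "v2_top10": 11.867,
    "v1_plus_v2": 28.646,
    "mxbai_top10": 79.770,
    "mxbai_top500": 1960.248,
}

ratios = {
    "mxbai_top500 / (v1+v2)": latency_ms["mxbai_top500"] / latency_ms["v1_plus_v2"],
    "mxbai_top500 / v1": latency_ms["mxbai_top500"] / latency_ms["v1_top500"],
    "mxbai_top10 / v2_top10": latency_ms["mxbai_top10"] / latency_ms["v2_top10"],
    "(v1+v2) / v1": latency_ms["v1_plus_v2"] / latency_ms["v1_top500"],
}

# The combined path equals the sum of its two stages in this measurement.
stage_sum = latency_ms["v1_top500"] + latency_ms["v2_top10"]

for name, value in ratios.items():
    print(f"{name} = {value:.2f}x")
```

Because every ratio shares the same single-GPU measurement, the ratios inherit the same caveats as the absolute numbers.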

#### Caveats\.

All figures are from a single run on a single consumer GPU \(RTX 4080 SUPER\) with 200 timed queries after warmup; data\-center GPUs, smaller GPUs, CPU inference, or kernel\-fused inference servers would shift these numbers and should be measured on the target hardware\. The cross\-encoder baselines are off\-the\-shelf checkpoints at default sequence length\. These are indicative cost\-frontier measurements, not benchmark\-grade latency claims\.

## Appendix BReleased Checkpoint Provenance

The relationship between the published checkpoint and the headline numbers warrants a precise statement, because the two are easy to conflate\.

- The original v361 5\-seed run did not save its per\-seed checkpoints\. There is therefore no archived single checkpoint that “is” the 5\-seed headline\.
- The checkpoint published on the Hugging Face Hub \(Purdy0228/ConvMemory\-v2\-Evidence\-Reranker\) is a *seed\-7 representative* checkpoint, exported after v0\.5\.0 packaging using the same v361 gold\-only recipe\.
- Its manifest records the recipe: teacher weight 0\.0 \(gold\-only\), top\_k = 10, max\_length = 256, and cross\-encoder backbone cross\-encoder/ms\-marco\-MiniLM\-L\-6\-v2, with source experiment v361\_top10\_evidence\_reranker\.py and training target “gold\-only listwise retrieval cross\-entropy\.”
- Consequently, the released checkpoint *implements the method*; the headline metrics \(FULL MRR 0\.6560, etc\.\) are *method\-level 5\-seed estimates*, not the measured score of this one checkpoint\.

The 0\.6560 figure is therefore a method\-level estimate, not the individual score of the released checkpoint\. A user evaluating the single downloaded checkpoint should expect a single\-seed result consistent with, but not identical to, the 5\-seed method\-level estimate\.
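The training target named in the manifest, a listwise retrieval cross-entropy over the protected candidates, can be illustrated with a generic softmax cross-entropy. This is a sketch under the assumption of exactly one gold candidate per list; how the actual recipe handles several gold memories, or lists with none, is not stated in this report, and `listwise_cross_entropy` is a hypothetical name.

```python
import math
from typing import Sequence


def listwise_cross_entropy(scores: Sequence[float], gold_index: int) -> float:
    """-log softmax(scores)[gold_index], using the log-sum-exp trick."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[gold_index]
```

With ten equal scores the loss is log(10), the value of an uninformed scorer; it shrinks toward zero as the gold candidate's score separates from the other nine.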

## Appendix CSource\-of\-Truth Checklist

Every quantitative claim in this report traces to one of the files in Table [9](https://arxiv.org/html/2606.10842#A3.T9)\. Where exact bibliographic metadata for external model cards could not be verified, those models are cited as footnoted URLs rather than as fabricated references\.

Table 9: Mapping from claim type to its source\-of\-truth file\. The public package and test files \(convmemory/evidence\_reranker\.py, tests/test\_evidence\_reranker\.py\) are in the GitHub repository at tag v0\.5\.0; the results/ verifier packets and the internal experiment script are local source\-of\-truth artifacts retained by the author and available on request, not part of the public tag\.

## References

- \[1\] Taiheng Pan\. ConvMemory: A lightweight learned memory reranker, a negative attribution result, and a research\-preview conflict editor\. arXiv preprint arXiv:2605\.28062, 2026\. [https://arxiv\.org/abs/2605\.28062](https://arxiv.org/abs/2605.28062)\.
- \[2\] Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\. Evaluating very long\-term conversational memory of LLM agents\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\. [https://arxiv\.org/abs/2402\.17753](https://arxiv.org/abs/2402.17753)\.
- \[3\] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie\-Yan Liu\. MPNet: Masked and permuted pre\-training for language understanding\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\. [https://arxiv\.org/abs/2004\.09297](https://arxiv.org/abs/2004.09297)\.
