Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

arXiv cs.CL Papers

Summary

This paper introduces a behavioral evaluation framework for calibrating claims about deployment-time memory in LLM test-time training, proposing an evidence ladder and explicit baselines to bridge proxy metrics and behavioral evidence.

arXiv:2607.00368v1 Announce Type: new Abstract: Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, future-token loss, long-context performance, or reward. These metrics are well matched to claims about stream adaptation, domain adaptation, context compression, and reward-backed test-time improvement. They are weaker evidence, however, for a capability that TTT results are increasingly used to motivate: deployed assistant memory, personalization, or sparse post-deployment learning, which instead requires behavioral evidence such as later recall, paraphrase robustness, retention, locality, conflict handling, and use in downstream actions after the original support context is removed. We introduce a behavioral evaluation framework that calibrates TTT memory claims to the evidence that supports them. It has two components: a claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning; and an evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories. We validate the framework by auditing recent TTT and memory-adjacent work and by instantiating it as a controlled diagnostic in which, in a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 model scales while generated free-form recall stays at zero, exposing a measurable gap between proxy improvement and deployment behavior. The framework gives authors and evaluators a concrete standard for aligning TTT memory claims with the evidence actually reported.

Cached at: 07/02/26, 05:36 AM

# Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training
Source: [https://arxiv.org/html/2607.00368](https://arxiv.org/html/2607.00368)
Xiangchen Song¹, Zhenhao Chen², Lingjing Kong¹, Shaoan Xie¹, Xinshuai Dong¹, Guangyi Chen¹,², Kun Zhang¹,²

¹Carnegie Mellon University, ²MBZUAI

xiangchs@cs\.cmu\.edu, kunz1@cmu\.edu


## 1 Introduction

Test\-time training \(TTT\) challenges the conventional “train, then deploy” boundary by allowing model states or parameters to change during inference\. In large language models, recent work has made this idea technically concrete: models may update from retrieved neighbors \[[11](https://arxiv.org/html/2607.00368#bib.bib7)\], learn online hidden\-state updates through fast weights \[[25](https://arxiv.org/html/2607.00368#bib.bib19)\], perform large\-chunk updates for throughput and state capacity \[[35](https://arxiv.org/html/2607.00368#bib.bib25)\], or align online updates with next\-token prediction \[[9](https://arxiv.org/html/2607.00368#bib.bib5)\]\. Related work studies targeted context\-specific updates \[[4](https://arxiv.org/html/2607.00368#bib.bib28)\], meta\-learned long\-context learning \[[28](https://arxiv.org/html/2607.00368#bib.bib20)\], parameter\-efficient context memories \[[7](https://arxiv.org/html/2607.00368#bib.bib4)\], locally supported parametric memories \[[17](https://arxiv.org/html/2607.00368#bib.bib8)\], input\-perplexity minimization \[[13](https://arxiv.org/html/2607.00368#bib.bib10)\], label\-free uncertainty signals \[[32](https://arxiv.org/html/2607.00368#bib.bib22)\], unlabeled reinforcement learning \[[40](https://arxiv.org/html/2607.00368#bib.bib38)\], and self\-directed update data \[[41](https://arxiv.org/html/2607.00368#bib.bib39), [1](https://arxiv.org/html/2607.00368#bib.bib3)\]\.

Across these settings, the evaluation recipe is relatively stable\. A model is updated at test time on recently observed tokens, retrieved examples, task attempts, or generated data, and performance is then reported through lower perplexity, better future\-token prediction, improved long\-context accuracy, or higher reward\. These results show real progress on in\-sequence adaptation and compact use of recent context, often addressing practical limits of transformer\-based systems such as finite context windows or static parameters\. They also explain why TTT is attractive for LLM systems: online updates may allow a model to adapt to new evidence rather than relying only on fixed parameters or the current prompt\.

The central observation of this paper is that this evidence is not always calibrated to the claims it is used to support\. In real\-world assistant settings, the relevant question is often not whether an update improves prediction on a nearby continuation\. It is whether a deployed model can hear a sparse user preference, acquire a project\-specific fact, revise a stale belief, or learn a procedure, and then use that information later after the original support context is removed\.

We formalize this as a behavioral evaluation framework for LLM test\-time training: claims about memory, personalization, or sparse post\-deployment learning are calibrated against behavioral evidence beyond perplexity—later recall, paraphrase robustness, retention, locality, and conflict handling after the original support context is removed—rather than against local likelihood alone\.

The fact that perplexity is an imperfect proxy for downstream behavior is not new; similar lessons appear in model editing, retrieval, and memory evaluation\. The distinctive issue for LLM TTT is that inference\-time parameter or state updates make local likelihood gains especially easy to reinterpret as evidence of learning, memory, or personalization\. We call this failure mode *evidence migration*: evidence that directly supports one evaluation regime is carried into a stronger deployment narrative without the behavioral tests required for that narrative\.

The key distinction is between *in\-sequence adaptation* and *deployment\-time behavioral learning*\. In in\-sequence adaptation, a model observes a token stream, updates its state or parameters on a support chunk, and is evaluated on future tokens from the same stream\. In deployment\-time behavioral learning, a deployed model receives sparse but high\-value interactions \(facts, preferences, corrections, or procedures\) and must later use them in a personalized and stable way\. Local loss reductions can strongly support the first setting while leaving the second only partially tested\. Figure [1](https://arxiv.org/html/2607.00368#S1.F1) illustrates this split with a sparse user\-stated preference that should influence a later response\.

![Refer to caption](https://arxiv.org/html/2607.00368v1/x1.png)

Figure 1: Two evaluation paradigms\. Top: deployment\-time behavioral learning from a sparse user interaction, illustrated with a user\-stated dietary preference and evaluated by whether that preference changes later user\-facing behavior\. Bottom: perplexity\-based evidence, where the model updates on a past support token chunk and achieves lower loss on a future chunk\.

This distinction matters because downstream systems are beginning to draw on recent TTT work for memory\-like capabilities\. Many long\-context, label\-free adaptation, and reward\-backed TTT methods primarily report perplexity, likelihood, uncertainty, throughput, long\-context accuracy, or task reward \[[25](https://arxiv.org/html/2607.00368#bib.bib19), [35](https://arxiv.org/html/2607.00368#bib.bib25), [9](https://arxiv.org/html/2607.00368#bib.bib5), [4](https://arxiv.org/html/2607.00368#bib.bib28), [28](https://arxiv.org/html/2607.00368#bib.bib20), [13](https://arxiv.org/html/2607.00368#bib.bib10), [32](https://arxiv.org/html/2607.00368#bib.bib22), [40](https://arxiv.org/html/2607.00368#bib.bib38)\]\. A closer frontier studies parameter\-efficient context memories, self\-adaptation, and context internalization \[[7](https://arxiv.org/html/2607.00368#bib.bib4), [30](https://arxiv.org/html/2607.00368#bib.bib36), [17](https://arxiv.org/html/2607.00368#bib.bib8), [37](https://arxiv.org/html/2607.00368#bib.bib41)\]\. These are important steps toward memory\-like systems, but deployment\-time memory adds a further requirement: sparse user information should remain usable after delay and should be evaluated against retrieval or long\-context baselines\.

Many existing TTT papers are properly grounded in proxy evaluation and do not themselves claim to solve deployment\-time memory\. The problem arises when such results are later cited, summarized, or motivated as evidence for stronger claims about memory, personalization, or self\-updating assistants\. Table [1](https://arxiv.org/html/2607.00368#S1.T1) summarizes common evidence migration patterns and the behavioral tests needed to support the stronger deployment\-time narratives\.

Table 1: Evidence migration patterns\. Evidence that is well grounded for one claim can be used to motivate stronger deployment\-time narratives that require additional behavioral tests\.

We use TTT as an umbrella term for inference\-time methods that update model state, especially parameters or fast weights, and treat LLM *test\-time learning* \(TTL\) as a neighboring adaptation formulation\. We use *memory*, *continual learning*, and *deployment\-time behavioral learning* for the stronger claim that new information encountered after deployment can be acquired and later used\. When we discuss *future\-token gain*, we mean improvement on nearby held\-out continuations after a test\-time update\.

Concretely, this paper makes four contributions:

1. We identify and name *evidence migration*: the failure mode in which TTT results grounded in perplexity, reward, or same\-stream adaptation are used to support stronger claims about memory, personalization, or post\-deployment learning\.
2. We introduce a claim\-calibrated evidence ladder, ranging from proxy evidence for stream adaptation, to evidence for context internalization, to behavioral evidence for deployment\-time memory, and pair it with an evaluation protocol and matched explicit\-memory baselines\.
3. We audit recent TTT and memory\-adjacent work through this ladder, distinguishing what current evaluations directly support from what additional tests would be needed for stronger deployment\-time claims\.
4. We instantiate and stress\-test the framework with a controlled Qwen3/LoRA diagnostic, showing that proxy and answer\-likelihood gains can coexist with zero generated recall, and provide claim\-specific behavioral templates for factual, preference, correction, procedure, and agent\-memory claims\.

## 2 Perplexity is an Incomplete Proxy

Perplexity remains valuable\. It is often the cleanest early signal that a test\-time update has changed the model in a nontrivial way\. The problem is not that perplexity is uninformative, but that it answers a narrower question than many deployment\-time claims require\. Lower loss can arise from mechanisms that are useful for stream adaptation but insufficient for post\-deployment learning\.

Continuous text is a dense\-evidence regime\. When support and evaluation chunks come from the same passage, they share topic, entities, lexical choices, and local discourse structure\. Lower future\-token loss may therefore reflect continuation fitting, topic priming, or memorization of nearby regularities, which is precisely what dynamic evaluation \[[15](https://arxiv.org/html/2607.00368#bib.bib11), [22](https://arxiv.org/html/2607.00368#bib.bib34)\] was designed to exploit\. This is a legitimate stream\-adaptation benefit\. However, it does not require the model to form a stable, queryable representation of the underlying fact\. Dense same\-stream evidence is therefore weaker evidence for sparse deployment\-time learning\.

Associative retrieval differs from semantic acquisition\. Long\-context TTT can also succeed by building efficient associative state for the current input\. Some long\-context TTT results are consistent with the view that TTT\-style state acts as a compressed associative memory over the current context\. For example, TTT\-E2E \[[28](https://arxiv.org/html/2607.00368#bib.bib20)\] connects TTT\-style updates to key\-value binding mechanisms and reports that full attention remains substantially stronger on needle\-in\-a\-haystack recall, partly because compressed TTT\-style state can discard details needed for exact retrieval\. This supports context compression\. It is weaker evidence for acquiring a reusable fact, rule, preference, or procedure that should survive paraphrase and delay\.

Teacher forcing is easier than behavioral extraction\. A model can assign higher probability to the gold continuation of a support fact without later producing the correct answer to a free\-form query about that fact\. This gap is especially important under open\-ended decoding: a small increase in the probability of the right answer tokens may still be insufficient for greedy generation, beam search, or robust answer selection under paraphrase\. Lower teacher\-forced loss is therefore evidence that the update moved probability mass in a useful direction, but it does not guarantee that the learned information is accessible during open\-ended generation\. A system can improve language\-model loss while still failing to retrieve the new knowledge behaviorally\.
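The gap between teacher-forced likelihood and greedy generation can be seen in a toy next-token distribution. The sketch below is only an illustration: the four-token vocabulary, the logit values, and the helper names are hypothetical and are not taken from the paper's experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Teacher-forced negative log-likelihood of one target token."""
    return -math.log(softmax(logits)[target])

def greedy(logits):
    """Index of the token that greedy decoding would emit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits: index 2 is the gold answer token,
# index 0 is a generic competitor the base model prefers.
GOLD = 2
before = [4.0, 1.0, 1.0, 0.5]
after = [4.0, 1.0, 3.0, 0.5]  # the update raises only the gold logit

delta_nll = nll(after, GOLD) - nll(before, GOLD)
print(f"delta NLL on gold token: {delta_nll:.3f}")
print("greedy token before/after:", greedy(before), greedy(after))
```

In this toy case the NLL of the gold token drops, yet the greedy token is unchanged because the competitor's logit is still larger. Probability mass moved in the right direction without changing the generated output.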

Cumulative stream adaptation can be mistaken for sparse\-interaction learning\. In standard continuous\-text TTT protocols, updates often accumulate across a long input stream\. This is appropriate when the goal is online adaptation to that stream\. It becomes harder to interpret when the same result is used as evidence that a single sparse user interaction can be internalized and reused later\. A stream protocol can conflate immediate adaptation to the current support item with cumulative adaptation from all preceding chunks\. Deployment\-time claims therefore need reset\-vs\-stream comparisons, or equivalent controls, when the protocol permits them\.
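A reset-vs-stream control can be sketched with a deliberately simple stand-in for the model. Here the "model" is an add-one-smoothed unigram counter and the "update" just adds token counts; both are assumptions chosen so the example runs without any ML library, and the vocabulary and chunks are made up.

```python
import math
from collections import Counter

VOCAB = ["the", "code", "is", "zulmar", "weather", "sunny"]

def nll(counts, tokens):
    """Average NLL of tokens under an add-one-smoothed unigram model."""
    total = sum(counts.values()) + len(VOCAB)
    return -sum(math.log((counts[t] + 1) / total) for t in tokens) / len(tokens)

def update(counts, support):
    """Toy test-time update: add support-token counts to a copy of the state."""
    new = Counter(counts)
    new.update(support)
    return new

def run(chunks, probe, reset):
    """Probe NLL after each update; reset=True restarts from the base state."""
    base = Counter()
    state = base
    scores = []
    for support in chunks:
        start = base if reset else state
        state = update(start, support)
        scores.append(nll(state, probe))
    return scores

chunks = [["the", "code", "is", "zulmar"]] * 3  # same item seen three times
probe = ["zulmar"]
stream_scores = run(chunks, probe, reset=False)
reset_scores = run(chunks, probe, reset=True)
print("stream:", [round(s, 3) for s in stream_scores])
print("reset: ", [round(s, 3) for s in reset_scores])
```

The reset scores are identical at every step because each update starts from the base state, while the stream score at the final step is lower. The difference between the two final scores is the part of the gain that comes from accumulated earlier chunks rather than from a single write-in.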

Local LM metrics do not test generalization, locality, or conflict\. Deployment\-time learning requires more than improved prediction of nearby text\. The update should persist under paraphrase, delay, mild distribution shift, and conflicting later evidence, while leaving unrelated behavior intact\. The model editing literature provides a useful cautionary parallel: ROME \[[19](https://arxiv.org/html/2607.00368#bib.bib13)\] and MEMIT \[[20](https://arxiv.org/html/2607.00368#bib.bib14)\] made efficacy, generalization, and specificity explicit, while later evaluations found that short\-form edit tests can miss specificity failures \[[12](https://arxiv.org/html/2607.00368#bib.bib32)\], multi\-hop failures \[[39](https://arxiv.org/html/2607.00368#bib.bib37)\], long\-form failures \[[23](https://arxiv.org/html/2607.00368#bib.bib15)\], gradual and catastrophic forgetting under scale \[[10](https://arxiv.org/html/2607.00368#bib.bib6)\], broader evaluation mismatches for edited models \[[16](https://arxiv.org/html/2607.00368#bib.bib12)\], and deployment\-style failures in the wild \[[33](https://arxiv.org/html/2607.00368#bib.bib24)\]\. The same lesson carries over to TTT memory claims: a locally valid success metric can overstate the practical capability of interest when the claim concerns delayed, user\-facing behavior\.

Takeaway: Support reconstruction, future\-token loss, answer NLL, and, when used, candidate ranking should remain in TTT reports because they are informative mechanism\-level signals\. For deployment\-time TTT, however, they should be labeled as *proxy metrics* and kept separate from headline behavioral evidence\.

## 3 Deployment\-Time Memory as a Behavioral Evaluation Target

This limitation does not invalidate existing TTT evaluations for stream adaptation, long\-context compression, domain adaptation, or reward\-backed test\-time improvement\. The issue is claim calibration: these evaluations do not by themselves establish the stronger deployment\-time capabilities implied by memory, personalization, or self\-updating assistant narratives\.

The regime split matters because memory\-style deployment is no longer an abstract desideratum\. It is already the target of direct behavioral evaluation\. LoCoMo \[[18](https://arxiv.org/html/2607.00368#bib.bib33)\] evaluates very long\-term conversational memory; LongMemEval \[[31](https://arxiv.org/html/2607.00368#bib.bib21)\] benchmarks long\-term interactive memory in chat assistants; MemoryAgentBench \[[14](https://arxiv.org/html/2607.00368#bib.bib9)\] evaluates memory through incremental multi\-turn interactions; MemoryBench \[[2](https://arxiv.org/html/2607.00368#bib.bib1)\] targets memory and continual learning in LLM systems; MemoryCD \[[36](https://arxiv.org/html/2607.00368#bib.bib26)\] studies lifelong cross\-domain personalization; and Mem2ActBench \[[24](https://arxiv.org/html/2607.00368#bib.bib17)\] evaluates long\-term memory utilization in task\-oriented agents\. These benchmarks ask whether information from one or a few interactions can be reused after delay, paraphrase, abstention pressure, or conflict\. Improving nearby continuation loss after a dense support chunk answers a different question\.

Deployment\-time learning differs from pretraining and continuous\-text adaptation in three ways\. It is *low redundancy*: a fact, preference, correction, or procedure may be stated only once\. It is *weak and heterogeneous*: user evidence often arrives as short, informal utterances rather than clean supervised examples\. It is also *delayed and behavioral*: the model must later answer, revise, abstain, route, or act according to the learned information without damaging unrelated behavior\. These properties make behavioral probes necessary: the central question is whether the update changes later behavior in the way required by the claimed deployment use case, not merely whether it improves local likelihood\.

The distinction is especially clear in settings with stronger supervision\. TTT\-Discover \[[34](https://arxiv.org/html/2607.00368#bib.bib40)\] shows that test\-time training can improve search when the environment supplies verifiable rewards and many attempts can be scored\. Personalized assistants face a harder frontier: semantic internalization under weak evidence, as explored by SEAL \[[41](https://arxiv.org/html/2607.00368#bib.bib39)\] and Absorber LLM \[[37](https://arxiv.org/html/2607.00368#bib.bib41)\], and stressed by direct memory benchmarks such as MemoryBench \[[2](https://arxiv.org/html/2607.00368#bib.bib1)\]\. Reward\-rich discovery and dense stream adaptation are promising, but sparse deployment\-time memory imposes a separate behavioral burden\.

Matched explicit\-memory baselines are therefore central\. If the deployed task is to remember a user fact, preference, correction, or procedure, then retrieval, long\-context prompting, and memory systems that store and reuse the same evidence are natural comparators\. MemoryBank \[[38](https://arxiv.org/html/2607.00368#bib.bib27)\] stores long\-term user memory, LongMem \[[29](https://arxiv.org/html/2607.00368#bib.bib23)\] augments language models with long\-term memory, MemGPT \[[21](https://arxiv.org/html/2607.00368#bib.bib16)\] manages memory through an operating\-system\-like architecture, Mem0 \[[8](https://arxiv.org/html/2607.00368#bib.bib31)\] targets scalable production\-ready long\-term memory for agents, Dynamic Cheatsheet \[[27](https://arxiv.org/html/2607.00368#bib.bib35)\] maintains an adaptive test\-time memory, and MEMORYLLM \[[30](https://arxiv.org/html/2607.00368#bib.bib36)\] studies self\-updatable language models\. These systems retain deployment\-time information outside ordinary base\-model weights and retrieve, page, curate, or pool it when needed\. They need not be framed as competitors that TTT must always beat\. Rather, they define the behavioral and systems\-level tradeoff that a parametric update must justify\.

Takeaway: Deployment\-time memory claims require behavioral tests against matched explicit\-memory baselines\. These comparisons reveal whether parametric TTT adds value under privacy, latency, compression, or context\-pressure constraints\.

## 4 Audit

Claim levels\. We audit recent TTT and memory\-adjacent work using three claim levels\. *S\-level* evidence denotes stream or domain adaptation: the model is updated on a recent stream, target\-domain data, retrieved examples, or task attempts, and evaluated by nearby loss, same\-stream performance, domain accuracy, long\-context performance, or reward\. *B\-level* evidence denotes bridge mechanisms such as internalization, parametric memory, context absorption, or self\-adaptation: the evaluation suggests that information can be stored, compressed, or transformed inside model state, but does not yet establish sparse delayed user\-facing behavior\. *D\-level* evidence denotes deployment\-time behavioral learning: sparse post\-deployment information changes later behavior under recall, paraphrase, delay, locality, conflict, or action\-use tests after the original support context is unavailable\. Figure [2](https://arxiv.org/html/2607.00368#S4.F2) summarizes the ladder\.

Operationally, we assign S when the primary evidence is same\-stream, domain, or task\-adaptation performance; B when the evaluation tests memory\-like internalization, parametric storage, context absorption, or self\-adaptation without fully testing sparse delayed user behavior; and D only when the evaluation directly probes later behavior from sparse post\-deployment information after context removal\. B\-level is intentionally heterogeneous: it covers mechanisms that bridge proxy adaptation and deployment memory, but these mechanisms do not by themselves establish D\-level memory\.
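The operational rule above can be written as a small decision function. This is a minimal sketch: the `Evidence` fields and the `claim_level` name are hypothetical labels for the criteria in the text, and the function encodes only the assignment rule stated in this section, not the additional reporting requirements the paper attaches to D-level wording.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    # Hypothetical flags describing what a paper's evaluation actually tests.
    tests_internalization: bool = False   # parametric storage, context absorption, self-adaptation
    sparse_post_deployment: bool = False  # support is a sparse post-deployment item
    context_removed: bool = False         # probe runs after the support context is gone
    probes_later_behavior: bool = False   # recall, paraphrase, delay, locality, conflict, action use

def claim_level(ev: Evidence) -> str:
    """Return the strongest claim level (S, B, or D) the evidence supports."""
    if ev.sparse_post_deployment and ev.context_removed and ev.probes_later_behavior:
        return "D"
    if ev.tests_internalization:
        return "B"
    return "S"  # same-stream, domain, or task-adaptation evidence

print(claim_level(Evidence()))
print(claim_level(Evidence(tests_internalization=True)))
print(claim_level(Evidence(True, True, True, True)))
```

The D branch is checked first and requires all three conditions, so an internalization study that probes behavior while the support context is still available stays at B.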

![Refer to caption](https://arxiv.org/html/2607.00368v1/x2.png)

Figure 2: Three claim levels used to calibrate TTT claims\. The level records the strongest evidence directly supported by the reported evaluation, not the broadest motivation or downstream narrative\.

Audit protocol\. We screened more than 40 targeted\-search candidates available through April 2026 using bibliography seeding and targeted queries around LLM TTT/TTL, long\-context TTT, context memory, parametric memory, personalization, continual learning, and assistant memory\. A record was included if its title, abstract, introduction, or motivation explicitly invoked test\-time learning or training, memory, persistence, context absorption, self\-improvement, personalization, continual learning, or assistant memory\. We coded 24 papers and treated the remaining records as background or exclusions, including retrieval\-only or external\-memory baselines, model\-editing and safety background, non\-LLM TTT, and architecture\-adjacent context memory\.

We therefore code claims by calibrating them to evidence rather than by inferring intent, assigning each paper the strongest claim level directly supported by its reported evaluations instead of the level suggested by its broadest motivating statement\. Assistant\-memory benchmarks are labeled as D targets rather than D\-level TTT evidence because they define the behavioral task family but do not themselves show that parametric TTT achieves it\. This audit is not a prevalence estimate or systematic review\. Appendix [C](https://arxiv.org/html/2607.00368#A3) gives the search fields, candidate accounting, background and exclusion summary, and boundary\-case sensitivity worksheet\.

The audit asks what each paper family’s evidence directly supports, and what additional tests would be needed before using it in a deployment\-memory claim\. Table [2](https://arxiv.org/html/2607.00368#S4.T2) gives representative examples\.

Table 2: Claim\-calibration exemplars\. The level column records the strongest claim level directly supported by the reported evaluation, not the broadest motivation or downstream narrative\. Full paper\-level coding and boundary cases are in Appendix [C](https://arxiv.org/html/2607.00368#A3)\.

Audit pattern\. The main pattern is not that TTT lacks progress; it is that most reported evidence remains below D\-level deployment behavior\. Dynamic evaluation, TTT\-NN, TTT layers, LaCT, In\-Place TTT, TTT\-E2E, TLM, SyTTA, and TTRL primarily provide S\-level evidence: adaptation to recent text, target\-domain data, or verifiable tasks\. PERK, Locas, SEAL, TT\-SI, TTT\-Discover, MEMORYLLM, and Absorber LLM move toward B\-level evidence by studying parametric memories, self\-adaptation, context internalization, or reward\-backed discovery\. Reasonable reclassifications of S/B boundary cases may change family counts, but they do not change the decision rule\. A D\-level deployment claim requires behavioral tests showing that a sparse user fact, preference, correction, or procedure changes later behavior after paraphrase, delay, conflict, and removal of the original context\.

Takeaway: S\- and B\-level evidence can motivate deployment\-memory hypotheses, but D\-level language requires behavioral tests for sparse information after context removal, paraphrase, delay, locality, and conflict\.

## 5 A Controlled Diagnostic Experiment

We instantiate the framework with a controlled diagnostic experiment that isolates a common inference behind evidence migration: whether improved support loss or answer likelihood is sufficient evidence for deployment recall\. A test\-time update is applied to a sparse factual support sentence, the support context is then removed, and the model is later queried for the injected fact\. Such a cleanly controlled case, in which proxy metrics improve while later behavior does not, demonstrates why proxy evidence and deployment recall should be reported separately\.

Diagnostic setup\. We use Qwen3 models at three scales, Qwen/Qwen3\-1\.7B, Qwen/Qwen3\-4B, and Qwen/Qwen3\-8B, with LoRA adaptation and a minimal online update on each support text\. The factual setting introduces nonce access\-code facts and then tests direct, paraphrased, and delayed recall after the support context is removed\. We define $\Delta\mathrm{NLL}=\mathrm{NLL}_{\mathrm{after}}-\mathrm{NLL}_{\mathrm{before}}$, so negative values indicate improved teacher\-forced loss after the update\. Full hyperparameters, prompt counts, scoring rules, explicit\-memory baselines, conflict\-overwrite tests, stress updates, and robustness checks are in Appendices [B](https://arxiv.org/html/2607.00368#A2) and [D](https://arxiv.org/html/2607.00368#A4)\.

Evidence levels\. The diagnostic keeps proxy, bridge, and target behavioral evidence separate\. *Proxy* evidence consists of support reconstruction and local $\Delta\mathrm{NLL}$\. *Bridge* evidence consists of answer $\Delta\mathrm{NLL}$\. This metric tests whether probability mass moves toward the target answer, but not whether the model will produce that answer in open\-ended use\. *Target behavior* in the main factual probe is generated success under direct, paraphrased, and delayed prompting after the support context is removed\. Appendix [D](https://arxiv.org/html/2607.00368#A4) adds locality, conflict\-overwrite, retrieval, and replacement\-memory checks as robustness and comparison evidence\. The separation treats lower loss as informative while requiring that delayed free\-form recall be measured behaviorally\.

Table 3: Loss improves, but generated free\-form recall does not appear under the fixed greedy decoding protocol\. Values are percentages except $\Delta\mathrm{NLL}$\. Lower \(more negative\) $\Delta\mathrm{NLL}$ indicates improved teacher\-forced loss after the update\. Generated success is measured over 48 factual probes per model using normalized first\-answer\-line containment of the target answer\.

Results\. Across all three Qwen3 sizes, one\-step LoRA updates improve support and answer loss, while generated free\-form recall remains at 0\.0% for direct, paraphrased, and delayed queries under the fixed greedy decoding and normalized\-containment scoring protocol\. Table [3](https://arxiv.org/html/2607.00368#S5.T3) shows this proxy/behavior split directly: support $\Delta\mathrm{NLL}$ improves substantially, and answer $\Delta\mathrm{NLL}$ also improves, yet these gains do not translate into successful greedy recovery of the injected facts\.
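The scoring rule named above, normalized first-answer-line containment, can be sketched as follows. The paper's exact normalization is specified in its appendix and is not reproduced here; the lowercasing, punctuation stripping, and whitespace collapsing below are assumptions, as are the function names and the example access code.

```python
import re

def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop non-alphanumerics, collapse spaces."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())

def first_answer_line(generation: str) -> str:
    """First non-empty line of the generated text, or '' if there is none."""
    for line in generation.splitlines():
        if line.strip():
            return line
    return ""

def generated_success(generation: str, target: str) -> bool:
    """True if the normalized target appears in the normalized first answer line."""
    t = normalize(target)
    return bool(t) and t in normalize(first_answer_line(generation))

print(generated_success("The access code is ZULMAR-42.\nAnything else?", "zulmar-42"))
print(generated_success("I don't know.\nMaybe ZULMAR-42?", "zulmar-42"))
```

Restricting the check to the first answer line keeps a model from scoring by listing many guesses. Plain substring containment is lenient, though: a target that is a prefix of a longer string would also match, so a stricter token-boundary check may be preferable in practice.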

The zero generated\-recall result is not the only signal we report\. The same table shows that teacher\-forced answer likelihood improves, yet greedy generation still omits the target fact\. Appendix [D](https://arxiv.org/html/2607.00368#A4) gives representative cases where the likelihood signal does not translate into generated recall\.

Explicit\-memory controls\. Matched explicit\-memory controls serve as evidence\-usability and deployment\-comparison checks\. The appendix specifies exact\-context, BM25\-style retrieval, and replacement\-memory protocols, but the factual aggregate control reported here is the easy BM25 condition\. In the 1\.7B seed\-replication checks, BM25 hit and memory answering remain at $48/48$\. Harder retrieval checks in Appendix [D](https://arxiv.org/html/2607.00368#A4) have Hit@1 of $0/48$ under paraphrased\-support decoys and $0/24$ under stale/current conflicts\. Thus, the point is not that retrieval trivially solves deployment memory, but that parametric TTT should be compared to explicit memory under matched evidence, query, scoring, and budget constraints\.
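A BM25-style Hit@1 check over a small explicit memory can be sketched in plain Python. This is a generic Okapi BM25 scorer rather than the paper's retrieval code: the whitespace tokenizer, the `k1` and `b` defaults, the smoothed IDF variant, and the toy memory entries are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF variant that stays non-negative for frequent terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        doc = self.docs[i]
        tf = Counter(doc)
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top1(self, query):
        return max(range(len(self.docs)), key=lambda i: self.score(query, i))

def hit_at_1(index, queries):
    """queries: list of (query_text, gold_doc_index); returns (hits, total)."""
    hits = sum(index.top1(q) == gold for q, gold in queries)
    return hits, len(queries)

memory = [
    "the access code for vault alpha is zulmar",
    "the access code for vault beta is quintor",
    "the weather today is sunny",
]
queries = [
    ("what is the access code for vault alpha", 0),
    ("what is the access code for vault beta", 1),
]
index = BM25(memory)
print(hit_at_1(index, queries))
```

This toy memory corresponds to an easy retrieval condition, since each query shares a rare term with exactly one entry. Lexical matching of this kind is the sort of thing paraphrased-support decoys are designed to stress, because they remove or duplicate that overlap.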

Stress update and conflict checks\. A stronger update can produce generated recall, but it exposes the missing locality dimension\. In the Qwen3\-1\.7B factual setting, the appendix stress check raises direct and delayed recall to 72\.9% \($35/48$\) and paraphrased recall to 54\.2% \($26/48$\), while locality falls from 97\.9% \($141/144$\) to 9\.7% \($14/144$\) \(Table [A16](https://arxiv.org/html/2607.00368#A4.T16)\)\. Conflict\-overwrite checks show a different failure mode: the one\-step LoRA update returns neither stale nor corrected code in all $72/72$ conflicts, while replacement memory is corrected\-only in $68/72$ cases and both\-corrected\-and\-stale in $4/72$ \(Table [A15](https://arxiv.org/html/2607.00368#A4.T15)\)\. These checks show why D\-level evidence should jointly report recall, paraphrase robustness, delay, locality or interference, conflict behavior, and matched baselines\.
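As a quick consistency check, the percentages quoted for the stress update follow from the stated counts when rounded to one decimal place. The helper name `pct` and the dictionary labels are illustrative only; the counts and percentages are the ones reported in the paragraph above.

```python
def pct(hits: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)

# (hits, total, percentage as quoted in the text)
reported = {
    "direct/delayed recall after stress update": (35, 48, 72.9),
    "paraphrased recall after stress update": (26, 48, 54.2),
    "locality before stress update": (141, 144, 97.9),
    "locality after stress update": (14, 144, 9.7),
}

for name, (hits, total, quoted) in reported.items():
    assert pct(hits, total) == quoted, name

# Replacement-memory conflict outcomes quoted in the text cover all 72 conflicts.
corrected_only, both = 68, 4
assert corrected_only + both == 72
print("all quoted figures are consistent with their counts")
```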

Robustness scope\. Appendix [D](https://arxiv.org/html/2607.00368#A4) reports additional checks: the 1\.7B seed replications keep direct/paraphrased/delayed recall at 0/48; preference/correction and procedure mini\-tasks also improve proxy loss while leaving greedy behavior at 0/24; and the continuous\-text check is reported only as proxy\-regime context\. We therefore use these checks to qualify the diagnostic, not to claim that the tested one\-step LoRA update achieves deployment memory\.

Takeaway: The diagnostic separates proxy improvement from deployment recall\. Proxy and bridge gains can coexist with zero behavioral recall, and stronger recall can expose severe locality cost\. D\-level claims therefore require generated recall under paraphrase and delay, locality or conflict reporting, and matched baselines\.

## 6 A Claim\-Calibrated Evaluation Protocol

For TTT results that invoke post\-deployment learning, memory, personalization, or self\-updating assistants, the framework maps claim language to evidence level\. Table [4](https://arxiv.org/html/2607.00368#S6.T4) gives the decision rule for assigning supported wording from the reported evidence\. Stream\-adaptation claims can headline stream loss, future\-token loss, long\-context accuracy, throughput, or reward\. Deployment\-memory claims should headline later behavior under a deployment\-like update episode\.

Table 4: Claim\-calibrated decision protocol\. Evidence can be positive and useful while still supporting only S\- or B\-level wording\. D\-level language requires behavioral improvement under the claimed use case, disclosed scoring, locality or failure reporting, and matched baseline comparison\.

Match probes to claims\. D\-level evaluation should be claim\-specific rather than maximalist\. Factual memory requires no\-context recall under paraphrase and delay; preference personalization requires later choices or response changes under plausible alternatives; correction claims require mutually exclusive stale/current scoring; procedural claims require later action or task transfer; and agent\-memory claims require learned information to affect routing, tool calls, or arguments\. Appendix [B](https://arxiv.org/html/2607.00368#A2) gives operational templates and evidence tiers\. Such evaluations can be small: the support episode should introduce a sparse item absent from the evaluation prompt, and the later probe should test the failure modes relevant to the claim\. When possible, reset\-vs\-stream controls should separate one\-shot write\-in from accumulated stream adaptation\.
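The claim\-to\-probe mapping in this paragraph can be written down as a small lookup, which makes it easy to check which probes a report is still missing\. The sketch below is illustrative only: the key and probe names are hypothetical labels for the requirements listed in the text, not identifiers from the paper\.

```python
# Hypothetical labels for the probe requirements listed in the text.
REQUIRED_PROBES = {
    "factual_memory": {"no_context_recall", "paraphrase", "delay"},
    "preference_personalization": {"later_choice_under_alternatives"},
    "correction": {"mutually_exclusive_stale_current_scoring"},
    "procedural": {"later_action_or_task_transfer"},
    "agent_memory": {"affects_routing_tool_calls_or_arguments"},
}


def missing_probes(claim_type, reported):
    """Return the probes a claim type requires that were not reported."""
    if claim_type not in REQUIRED_PROBES:
        raise KeyError(f"unknown claim type: {claim_type}")
    return sorted(REQUIRED_PROBES[claim_type] - set(reported))


# A factual-memory claim that only reports direct no-context recall.
gaps = missing_probes("factual_memory", ["no_context_recall"])
print(gaps)
```

Keeping the requirement sets per claim type, rather than one global checklist, reflects the point that evaluation should be claim\-specific rather than maximalist\.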

Keep proxy metrics separate\. Support reconstruction, future\-token loss, answer NLL, and, when reported, candidate ranking should remain in the report because they explain what the update is doing\. But they should be presented as mechanism\-level or intermediate signals, not as substitutes for the behavioral target\. In particular, answer likelihood and generated answer success should be reported separately: the diagnostic above shows that a model can improve the likelihood of the target answer while still producing the same wrong free\-form response\.
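A toy calculation shows how answer likelihood and generated success can diverge\. The sketch below simplifies the answer to a single token and uses made\-up probabilities; it is not drawn from the paper's measurements\.

```python
import math
from typing import Dict


def answer_nll(probs: Dict[str, float], target: str) -> float:
    """Negative log-likelihood of the target under a next-token distribution."""
    return -math.log(probs[target])


def greedy_token(probs: Dict[str, float]) -> str:
    """Greedy decoding picks the highest-probability token."""
    return max(probs, key=probs.get)


target = "ZX-41"  # hypothetical nonce answer

# Made-up distributions before and after a one-step update.
before = {"unknown": 0.90, "ZX-41": 0.01, "other": 0.09}
after = {"unknown": 0.70, "ZX-41": 0.20, "other": 0.10}

print("NLL before:", round(answer_nll(before, target), 3))
print("NLL after: ", round(answer_nll(after, target), 3))
print("greedy before:", greedy_token(before))
print("greedy after: ", greedy_token(after))
```

In this toy case the target's NLL falls because its probability rises from 0\.01 to 0\.20, yet the greedy choice is unchanged because another token still holds the largest share of probability mass\. Reporting only the NLL would hide that the generated answer is still wrong\.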

Use matched baselines\. A no\-update model tests whether the update helps at all\. Exact\-context or long\-context prompting tests whether the support evidence is sufficient when explicit\. Retrieval or memory baselines test whether the same evidence can be stored outside the weights and reused under the same query and scoring rule\. A *matched* baseline should use the same support information, query set, scoring rule, and disclosed budgets, including update tokens, trainable parameters, optimization steps, latency, memory footprint, context length, and retrieval index size when applicable\.
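One way to make the matching requirement checkable is to record each run's evidence, queries, scoring rule, and budgets, then compare the records\. The following is a minimal sketch under assumed names: `RunSpec`, its fields, and the example budget values are hypothetical and not part of the paper's protocol\.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class RunSpec:
    """Hypothetical record of what one evaluated system was given."""
    name: str
    support_ids: FrozenSet[str]
    query_ids: FrozenSet[str]
    scoring_rule: str
    # Budgets apply per method, e.g. update tokens or retrieval index size.
    budgets: Dict[str, float] = field(default_factory=dict)


def mismatches(a: RunSpec, b: RunSpec) -> List[str]:
    """List the ways in which two runs are not matched."""
    problems = []
    if a.support_ids != b.support_ids:
        problems.append("support information differs")
    if a.query_ids != b.query_ids:
        problems.append("query set differs")
    if a.scoring_rule != b.scoring_rule:
        problems.append("scoring rule differs")
    for run in (a, b):
        if not run.budgets:
            problems.append(f"{run.name}: no budgets disclosed")
    return problems


ttt = RunSpec(
    name="one-step LoRA",
    support_ids=frozenset({"s1", "s2"}),
    query_ids=frozenset({"q1", "q2"}),
    scoring_rule="exact_match",
    budgets={"update_tokens": 64, "optimization_steps": 1},  # made-up values
)
memory = RunSpec(
    name="retrieval memory",
    support_ids=frozenset({"s1", "s2"}),
    query_ids=frozenset({"q1"}),
    scoring_rule="exact_match",
)
issues = mismatches(ttt, memory)
print(issues)
```

The check deliberately does not require the two runs to have identical budgets, only disclosed ones, since a parametric update and an external memory spend their budgets on different resources\.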

Parametric TTT need not beat every explicit\-memory baseline on every axis to be useful\. Its value may appear under privacy, compression, latency, amortization, offline\-operation, or context\-pressure constraints\. But those constraints should be stated and evaluated directly; otherwise, a memory or personalization claim risks conflating mechanism\-level adaptation with deployment\-time usefulness\.

Takeaway: Claim language should track evidence level\. Proxy and bridge metrics explain mechanisms, but memory, personalization, and post\-deployment learning claims require behavioral evidence under the claimed use case, with failure categories and matched baselines visible\.

## 7 Discussion

The framework is deliberately scoped, and several natural objections and boundary cases arise about how far it should be applied\. We address them here as design considerations that clarify what the framework does and does not require\.

Scope: evaluation against stated objectives\. Many TTT papers study stream adaptation, long\-context scaling, or task reward under the objective they optimize, without claiming to implement deployment\-time memory\. Those papers should be evaluated on their stated objectives\. The interpretive issue arises when such results are later used to motivate memory, personalization, or self\-updating assistants\. At that point, S\- and B\-level evidence needs additional behavioral probes before it supports D\-level language\.

Perplexity as behavior in dense streams\. One response is that perplexity is already behavioral for language models, because next\-token prediction is the model’s core behavior\. This is reasonable in dense\-stream settings, where better prediction may be the desired outcome\. Personalized assistants require a different kind of behavior: recalling sparse facts, applying corrections, resolving conflicts, or abstaining when information is missing\. In that regime, lower loss is useful evidence about the update mechanism, but deployed memory still has to be shown in later responses\.

Relation to external\-memory substrates\. Assistant memory may often be better implemented explicitly rather than parametrically\. Systems such as MemoryBank and MemGPT store, retrieve, page, or curate information outside the base\-model weights, and they are aimed directly at long\-term interaction and multi\-session use \[[38](https://arxiv.org/html/2607.00368#bib.bib27),[21](https://arxiv.org/html/2607.00368#bib.bib16)\]\. The framework is compatible with this view\. If explicit memory is the practical baseline, then parametric TTT should be evaluated against it under matched evidence, query, scoring, and budget constraints\.

Context compression as a legitimate target\. Methods need not target sparse user facts to be memory\-relevant\. PERK, Locas, MEMORYLLM, and related systems study adapters, parametric memories, or persistent memory pools that encode context for later use \[[7](https://arxiv.org/html/2607.00368#bib.bib4),[17](https://arxiv.org/html/2607.00368#bib.bib8),[30](https://arxiv.org/html/2607.00368#bib.bib36)\]\. We treat this as a legitimate B\-level objective\. The calibration point is narrower: compression or internalization results should be described as such unless they also show deployment behavior under paraphrase, delay, conflict, and locality\.

Procedural rather than episodic memory\. Test\-time learning may be most useful for reusable strategies, code snippets, search heuristics, or reward\-backed reasoning rather than user\-fact recall\. Dynamic Cheatsheet and TTRL are examples of this direction: they use accumulated solutions or reward\-style signals to improve future problem solving \[[27](https://arxiv.org/html/2607.00368#bib.bib35),[40](https://arxiv.org/html/2607.00368#bib.bib38)\]\. The same evidence standard applies, but the behavioral target changes\. A procedural\-memory claim should be tested through later task performance, transfer, and failure modes for the learned procedure, rather than through factual recall alone\.

Stronger update mechanisms\. Stronger TTT systems may pass behavioral tests that simple update mechanisms fail\. Better objectives, routing, replay, constrained updates, verification, or hybrid parametric–external memory could produce useful deployment\-time memory\. The framework accommodates this possibility, and specifies what evidence would be needed to establish it: behavioral success under the claimed use case, together with update cost, locality, conflict, and matched baselines\.

## 8 Conclusion

TTT has made real progress on long\-context modeling, stream adaptation, domain adaptation, and reward\-backed test\-time improvement\. Our framework provides a concrete way to keep claims and evidence aligned: under the proposed protocol, deployment\-memory wording is supported only when the claimed deployment behavior is directly tested\. Lower perplexity, future\-token loss, answer likelihood, candidate ranking when explicitly reported, and reward under a stated task are valuable evidence for the objectives they measure, but they should not be treated as sufficient evidence for deployment\-time memory\.

Showing that a model has acquired a user fact, preference, correction, or procedure for later use requires tests that resemble the claimed use case: direct, paraphrased, and delayed behavior; locality and conflict reporting; disclosed update budgets; mutually exclusive failure categories; and matched retrieval or long\-context baselines\. The aim is to keep claim and evidence at the same level\. If future TTT systems can provide memory, the decisive evidence will appear where memory matters: in later behavior, under realistic constraints, with the interference cost visible\.

## References

- \[1\] \(2025\) Self\-improving llm agents at test\-time\. External Links: 2510\.07841, [Link](https://arxiv.org/abs/2510.07841)\.
- \[2\] Q\. Ai, Y\. Tang, C\. Wang, J\. Long, W\. Su, and Y\. LIU \(2026\) MemoryBench: a benchmark for memory and continual learning in LLM systems\. External Links: [Link](https://openreview.net/forum?id=wU4Tjlzg3h)\.
- \[3\] S\. Antonelli, M\. S\. Akhondzadeh, and A\. Bojchevski \(2026\) Test\-time training undermines existing safety guardrails\. In ICLR 2026 Workshop on Trustworthy AI, External Links: [Link](https://openreview.net/forum?id=0fttB7cUsu)\.
- \[4\] R\. Bansal, A\. Zhang, R\. Tiwari, L\. Madaan, S\. S\. Duvvuri, F\. Devvrit, D\. Brandfonbrener, D\. Alvarez\-Melis, P\. Bhargava, M\. Kale, and S\. Jelassi \(2026\) Let’s \(not\) just put things in context: test\-time training for long\-context LLMs\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=H0bcEdPCoc)\.
- \[5\] A\. Behrouz, Z\. Li, P\. Kacham, M\. Daliri, Y\. Deng, P\. Zhong, M\. Razaviyayn, and V\. Mirrokni \(2025\) ATLAS: learning to optimally memorize the context at test time\. arXiv preprint arXiv:2505\.23735\. External Links: [Link](https://arxiv.org/abs/2505.23735)\.
- \[6\] A\. Behrouz, P\. Zhong, and V\. Mirrokni \(2025\) Titans: learning to memorize at test time\. arXiv preprint arXiv:2501\.00663\. External Links: [Link](https://arxiv.org/abs/2501.00663)\.
- \[7\] Z\. Chen, A\. Romanou, G\. Weiss, and A\. Bosselut \(2026\) PERK: long\-context reasoning as parameter\-efficient test\-time learning\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=qxDTe8fIyA)\.
- \[8\] P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready ai agents with scalable long\-term memory\. arXiv preprint arXiv:2504\.19413\.
- \[9\] G\. Feng, S\. Luo, K\. Hua, G\. Zhang, W\. Huang, D\. He, and T\. Cai \(2026\) In\-place test\-time training\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=dTWfCLSoyl)\.
- \[10\] A\. Gupta, A\. Rao, and G\. Anumanchipalli \(2024\) Model editing at scale leads to gradual and catastrophic forgetting\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 15202–15232\.
- \[11\] M\. Hardt and Y\. Sun \(2024\) Test\-time training on nearest neighbors for large language models\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=CNL2bku4ra)\.
- \[12\] J\. Hoelscher\-Obermaier, J\. Persson, E\. Kran, I\. Konstas, and F\. Barez \(2023\-07\) Detecting edit failures in large language models: an improved specificity benchmark\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 11548–11559\. External Links: [Link](https://aclanthology.org/2023.findings-acl.733/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.733)\.
- \[13\] J\. Hu, Z\. Zhang, G\. Chen, X\. Wen, C\. Shuai, W\. Luo, B\. Xiao, Y\. Li, and M\. Tan \(2025\) Test\-time learning for large language models\. In Forty\-second International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=iCYbIaGKSR)\.
- \[14\] Y\. Hu, Y\. Wang, and J\. McAuley \(2026\) Evaluating memory in LLM agents via incremental multi\-turn interactions\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=DT7JyQC3MR)\.
- \[15\] B\. Krause, E\. Kahembwe, I\. Murray, and S\. Renals \(2018\) Dynamic evaluation of neural sequence models\. In International Conference on Machine Learning, pp\. 2766–2775\.
- \[16\] Q\. Li, X\. Liu, Z\. Tang, P\. Dong, Z\. Li, X\. Pan, and X\. Chu \(2024\) Should we really edit language models? on the evaluation of edited language models\. Advances in Neural Information Processing Systems 37, pp\. 30850–30885\.
- \[17\] S\. Lu, Z\. Liang, D\. Ma, Y\. Wang, H\. Mi, and D\. Yu \(2026\) Locas: your models are principled initializers of locally\-supported parametric memories\. arXiv preprint arXiv:2602\.05085\.
- \[18\] A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of llm agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 13851–13870\.
- \[19\] K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in gpt\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. External Links: ISBN 9781713871088\.
- \[20\] K\. Meng, A\. S\. Sharma, A\. J\. Andonian, Y\. Belinkov, and D\. Bau \(2023\) Mass\-editing memory in a transformer\. In The Eleventh International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=MkbcAHIYgyS)\.
- \[21\] C\. Packer, V\. Fang, S\. Patil, K\. Lin, S\. Wooders, and J\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\.
- \[22\] A\. Rannen\-Triki, J\. Bornschein, R\. Pascanu, M\. Hutter, A\. György, A\. Galashov, Y\. W\. Teh, and M\. K\. Titsias \(2024\) Revisiting dynamic evaluation: online adaptation for large language models\. arXiv preprint arXiv:2403\.01518\. External Links: [Link](https://arxiv.org/abs/2403.01518)\.
- \[23\] D\. Rosati, R\. Gonzales, J\. Chen, X\. Yu, Y\. Kayani, F\. Rudzicz, and H\. Sajjad \(2024\) Long\-form evaluation of model editing\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 3749–3780\.
- \[24\] Y\. Shen, K\. Li, W\. Zhou, and S\. Hu \(2026\) Mem2ActBench: a benchmark for evaluating long\-term memory utilization in task\-oriented autonomous agents\. External Links: 2601\.19935, [Link](https://arxiv.org/abs/2601.19935)\.
- \[25\] Y\. Sun, X\. Li, K\. Dalal, J\. Xu, A\. Vikram, G\. Zhang, Y\. Dubois, X\. Chen, X\. Wang, S\. Koyejo, T\. Hashimoto, and C\. Guestrin \(2025\) Learning to \(learn at test time\): RNNs with expressive hidden states\. In Forty\-second International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=wXfuOj9C7L)\.
- \[26\] Y\. Sun, X\. Wang, Z\. Liu, J\. Miller, A\. Efros, and M\. Hardt \(2020\) Test\-time training with self\-supervision for generalization under distribution shifts\. In International conference on machine learning, pp\. 9229–9248\.
- \[27\] M\. Suzgun, M\. Yuksekgonul, F\. Bianchi, D\. Jurafsky, and J\. Zou \(2026\-03\) Dynamic cheatsheet: test\-time learning with adaptive memory\. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\), Rabat, Morocco, pp\. 7080–7106\. External Links: [Link](https://aclanthology.org/2026.eacl-long.333/), [Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.333), ISBN 979\-8\-89176\-380\-7\.
- \[28\] A\. Tandon, K\. Dalal, X\. Li, D\. Koceja, M\. Rød, S\. Buchanan, X\. Wang, J\. Leskovec, S\. Koyejo, T\. Hashimoto, C\. Guestrin, J\. McCaleb, Y\. Choi, and Y\. Sun \(2025\) End\-to\-end test\-time training for long context\. External Links: 2512\.23675, [Link](https://arxiv.org/abs/2512.23675)\.
- \[29\] W\. Wang, L\. Dong, H\. Cheng, X\. Liu, X\. Yan, J\. Gao, and F\. Wei \(2023\) Augmenting language models with long\-term memory\. In Thirty\-seventh Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=BryMFPQ4L6)\.
- \[30\] Y\. Wang, Y\. Gao, X\. Chen, H\. Jiang, S\. Li, J\. Yang, Q\. Yin, Z\. Li, X\. Li, B\. Yin, J\. Shang, and J\. McAuley \(2024\) MEMORYLLM: towards self\-updatable large language models\. In Forty\-first International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=p0lKWzdikQ)\.
- \[31\] D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=pZiyCaVuti)\.
- \[32\] Y\. Xu, H\. Yao, Z\. Guo, W\. Guo, P\. Li, A\. Liu, X\. Hu, and H\. Xiong \(2026\) You only need 4 extra tokens: synergistic test\-time adaptation for LLMs\. External Links: [Link](https://openreview.net/forum?id=FZYtfAlndh)\.
- \[33\] W\. Yang, F\. Sun, J\. Tan, X\. Ma, Q\. Cao, D\. Yin, H\. Shen, and X\. Cheng \(2025\-07\) The mirage of model editing: revisiting evaluation in the wild\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Vienna, Austria, pp\. 15336–15354\. External Links: [Link](https://aclanthology.org/2025.acl-long.745/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.745), ISBN 979\-8\-89176\-251\-0\.
- \[34\] M\. Yuksekgonul, D\. Koceja, X\. Li, F\. Bianchi, J\. McCaleb, X\. Wang, J\. Kautz, Y\. Choi, J\. Zou, C\. Guestrin, and Y\. Sun \(2026\) Learning to discover at test time\. External Links: 2601\.16175, [Link](https://arxiv.org/abs/2601.16175)\.
- \[35\] T\. Zhang, S\. Bi, Y\. Hong, K\. Zhang, F\. Luan, S\. Yang, K\. Sunkavalli, W\. T\. Freeman, and H\. Tan \(2026\) Test\-time training done right\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=Tb9qAxT3xv)\.
- \[36\] W\. Zhang, X\. Wei, W\. Huang, Z\. Hui, C\. Wang, M\. Gong, and P\. S\. Yu \(2026\) MemoryCD: benchmarking long\-context user memory of llm agents for lifelong cross\-domain personalization\. arXiv preprint arXiv:2603\.25973\.
- \[37\] Z\. Zhang, S\. Zhang, C\. Wu, Z\. Wei, and M\. Sun \(2026\) Absorber llm: harnessing causal synchronization for test\-time training\. arXiv preprint arXiv:2604\.20915\.
- \[38\] W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\) Memorybank: enhancing large language models with long\-term memory\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 19724–19731\.
- \[39\] Z\. Zhong, Z\. Wu, C\. D\. Manning, C\. Potts, and D\. Chen \(2023\) MQuAKE: assessing knowledge editing in language models via multi\-hop questions\. In The 2023 Conference on Empirical Methods in Natural Language Processing, External Links: [Link](https://openreview.net/forum?id=0hTPJBnncc)\.
- \[40\] Y\. Zuo, K\. Zhang, L\. Sheng, S\. Qu, G\. Cui, X\. Zhu, H\. Li, Y\. Zhang, X\. Long, E\. Hua, B\. Qi, Y\. Sun, Z\. Ma, L\. Yuan, N\. Ding, and B\. Zhou \(2025\) TTRL: test\-time reinforcement learning\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=VuVhgEiu20)\.
- \[41\] A\. Zweiger, J\. Pari, H\. Guo, Y\. Kim, and P\. Agrawal \(2025\) Self\-adapting language models\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=JsNUE84Hxi)\.

## Appendix A Evidence migration patterns

This appendix makes the evidence\-migration concern concrete without treating the cited papers as overclaiming\. The migration risk is interpretive: a result that is well scoped as S\- or B\-level evidence can become misleading when reused as support for a D\-level deployment\-memory narrative\.

Table A1: Concrete evidence\-migration patterns\.

## Appendix B Detailed evaluation agenda and limitations

### Purpose of this appendix\.

This appendix expands the evaluation protocol in Table [4](https://arxiv.org/html/2607.00368#S6.T4)\. The goal is not to require every TTT paper to run every possible memory test\. Rather, the evaluation should match the claim\. A paper claiming stream adaptation can headline stream loss, future\-token loss, long\-context accuracy, throughput, or reward\. A paper claiming deployment\-time memory, personalization, correction, or agent memory should additionally report later behavior under a deployment\-like update episode after the original support context is unavailable\.

### Diagnostic role of the pilot\.

The diagnostic in Section [5](https://arxiv.org/html/2607.00368#S5) partially follows this agenda\. It specifies exact\-context controls, reports BM25\-style retrieval and replacement\-memory checks, separates answer\-likelihood and generated\-answer reporting, discloses update\-time and token\-budget details, and adds a same\-family Qwen3 scale axis, a 1\.7B three\-seed replication, a stress update, mutually exclusive conflict scoring, and small preference/correction and procedure/action probes\. Full benchmark construction and representative method comparisons are left for future work\.

Diagnostic role: The LoRA/Qwen pilot isolates one evidential question: whether support\-loss and answer\-likelihood gains are sufficient for generated deployment behavior\. The result shows that these proxy and bridge metrics can improve without free\-form recall, motivating separate reporting of mechanism\-level and behavioral evidence\.

### Minimal D\-level evidence by claim type\.

D\-level evidence is claim\-specific\. The required behavioral bundle changes with the claim, but in every case the learned information should affect later behavior after the original support context is removed\. Table [A2](https://arxiv.org/html/2607.00368#A2.T2) gives minimum evidence shapes for common deployment\-memory claims\.

Table A2: Minimal D\-level evidence by deployment claim type\. The required bundle changes with the claim, but in every case the later behavior must match the claimed use case after the original support context is removed\.

### Operational behavioral templates\.

The same claim\-specific idea can be made more operational by specifying the support episode, the no\-context query, paraphrase or delay condition, locality or conflict check, and matched baseline\. Table [A3](https://arxiv.org/html/2607.00368#A2.T3) gives templates for common deployment\-memory settings\. These are intended as minimal behavioral shapes, not as a universal benchmark\.

Table A3: Claim\-specific behavioral templates\.

### Direct and paraphrased retrieval\.

A deployment\-time update should be tested on direct questions and paraphrases that require extracting the newly introduced fact, preference, correction, or procedure after the support context is removed\. Teacher\-forced answer likelihood and free generation should be reported separately, because the gap between them is itself informative\. A system can move probability mass toward the correct answer under teacher forcing while still failing to produce the answer under open\-ended generation\.

### Retention, locality, and conflict\.

Updated knowledge should be queried after unrelated intervening turns or tasks\. Evaluations should also include locality probes and conflict\-resolution tests, because a useful update should not indiscriminately distort unrelated behavior or preserve stale information after a correction\. If the claim concerns procedures or agents, the query should require the updated information to select a rule, route, tool, or action rather than merely repeat a string\.
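A locality probe can be scored by comparing answers to unrelated questions before and after the update\. The sketch below assumes locality is the fraction of unrelated probes whose answer is unchanged; that definition, the function name, and the example probes are assumptions for illustration, and the paper's exact locality metric may differ\.

```python
from typing import Dict


def locality_rate(before: Dict[str, str], after: Dict[str, str]) -> float:
    """Fraction of unrelated probes whose answer the update left unchanged.

    Assumption: both dicts map the same probe questions to generated answers.
    """
    if before.keys() != after.keys():
        raise ValueError("probe sets must match")
    if not before:
        raise ValueError("need at least one probe")
    unchanged = sum(before[q] == after[q] for q in before)
    return unchanged / len(before)


# Hypothetical unrelated probes; the update leaks a nonce code into one answer.
before = {"capital of France?": "Paris", "2 + 2?": "4", "color of grass?": "green"}
after = {"capital of France?": "Paris", "2 + 2?": "ZX-41", "color of grass?": "green"}

rate = locality_rate(before, after)
print(f"locality: {rate:.3f}")
```

Reporting this rate next to recall makes the trade\-off visible: an update that raises recall while this fraction collapses has distorted unrelated behavior\.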

### Matched explicit\-memory baselines\.

If the intended use case is a personalized assistant or agent, prompt accumulation, long\-context inference, retrieval\-augmented memory, and other non\-parametric memory systems are natural baselines\. These baselines should use matched evidence and report context cost, latency, update budget, and failure categories, because a parametric update is most compelling when it improves behavior under constraints where explicit memory is costly or unavailable\. Stream adaptation should also be separated from one\-shot write\-in through reset\-vs\-stream controls, since cumulative gains across a stream provide different evidence from learning from a single support item\.

Table A4: Matched explicit\-memory baseline protocol\.

In this paper, the easy BM25\-style factual condition is a usability control: it verifies that the sparse support facts are answerable when explicit memory retrieves the intended sentence\. The harder paraphrase and stale/current retrieval checks are failure\-mode probes, showing why stronger explicit\-memory baselines should include recency handling, semantic retrieval, reranking, or conflict resolution rather than naive lexical matching alone\.
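The reset-vs-stream control mentioned in this subsection can be sketched as a single loop with one switch. This is only one reading of the control: "reset" is taken to mean restoring the base state before every support item, and "stream" to mean carrying the updated state forward. The names `run_control`, `toy_update`, and `toy_evaluate` are hypothetical, and the token-set "model" is a toy stand-in for a real parametric update.

```python
import copy
from typing import Callable, List, Set, Tuple, TypeVar

State = TypeVar("State")
Item = Tuple[str, str]  # (support text, later query)


def run_control(
    items: List[Item],
    base_state: State,
    update: Callable[[State, str], State],
    evaluate: Callable[[State, str], float],
    reset_each_item: bool,
) -> List[float]:
    """Score each query after updating on its support item.

    reset_each_item=True  -> one-shot write-in: start from base_state every time.
    reset_each_item=False -> stream: earlier updates remain in the state.
    """
    state = copy.deepcopy(base_state)
    scores = []
    for support, query in items:
        if reset_each_item:
            state = copy.deepcopy(base_state)
        state = update(state, support)
        scores.append(evaluate(state, query))
    return scores


def toy_update(state: Set[str], support: str) -> Set[str]:
    """Toy 'update': remember every token in the support text."""
    return state | set(support.split())


def toy_evaluate(state: Set[str], query: str) -> float:
    """Toy 'score': fraction of query tokens the state has seen."""
    tokens = query.split()
    return sum(t in state for t in tokens) / len(tokens)


items = [("alpha beta", "alpha gamma"), ("gamma delta", "alpha delta")]
reset_scores = run_control(items, set(), toy_update, toy_evaluate, reset_each_item=True)
stream_scores = run_control(items, set(), toy_update, toy_evaluate, reset_each_item=False)
```

In the toy run, the second item scores higher under the stream setting only because the first update is still present. That is the cumulative effect that the reset condition is meant to isolate from single-item learning.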

### Evidence tiers for deployment\-memory language\.

The strength of the claim should track the strength of the behavioral evidence\. Table [A5](https://arxiv.org/html/2607.00368#A2.T5) gives three rough tiers\. The tiers are not meant to define a benchmark leaderboard; they are a wording guide for authors and evaluators\.

Table A5: Evidence tiers for deployment\-memory language\.

### Wording examples\.

The decision protocol is meant to be operational\. Table [A6](https://arxiv.org/html/2607.00368#A2.T6) gives examples of wording that stays within the evidence level and wording that calls for D\-level behavioral evidence\.

Table A6: Claim wording examples\.

### Claim discipline\.

Papers should title and frame their contributions at the level their evidence supports\. Continuous\-stream adaptation, target\-domain TTL, formal test\-time discovery, context compression, and deployment\-time learning are all legitimate targets\. Evidence from one regime should not inherit the behavioral expectations of another\. In particular, proxy and bridge metrics should remain in the report, but they should be described as mechanism\-level or intermediate evidence unless the later behavior required by the claim is directly tested\.

### Additional limitations\.

This diagnostic pass adds a same\-family Qwen3 size axis, hard\-negative retrieval controls, a 1\.7B three\-seed replication, conflict scoring, and small preference/correction and procedure/action probes\. It leaves several directions open: cross\-family model comparisons, a second parametric update mechanism, representative TTT/TTL systems, richer procedure\-following tasks, and multi\-seed runs for every model size\. The confidence intervals are across examples unless explicitly marked as seed replication\. Reward\-rich settings such as math, code, kernels, and scientific optimization remain an important frontier\. These limitations reinforce the central recommendation: specify the evaluation target, disclose the scoring rule, report failure categories, and avoid compressing all evidence into one loss number\.

## Appendix C Paper\-level audit sheet

This appendix expands the claim\-calibrated audit summarized in the main text\. The level column records the strongest claim directly supported by the reported evidence as summarized in the paper, while the last column records the shortest behavioral test that would be needed before using the result as D\-level deployment\-time learning evidence\.

### Audit protocol\.

We screened papers available through April 2026 from the paper’s bibliography plus targeted searches using terms such as *LLM test\-time training*, *test\-time learning*, *long\-context TTT*, *context memory*, *self\-adapting language models*, *parametric memory*, *personalization*, *continual learning*, and *assistant memory*\. A paper enters the audit when its title, abstract, introduction, or motivation explicitly invokes test\-time learning/training, memory, persistence, context absorption, self\-improvement, personalization, continual learning, or assistant memory\. We screened 41 candidate records: 24 are coded below, 5 retrieval\-only or external\-memory systems are retained as background baselines, 9 model\-editing/safety/background papers are cited for context but excluded from claim\-level coding, and 3 non\-LLM or architecture\-adjacent papers are excluded from the audit table\. This is single\-author coding with sensitivity recoding, intended as a claim\-calibration audit rather than a systematic review, prevalence estimate, or inter\-coder reliability study\. To make disagreements inspectable, we provide paper\-level source phrases, boundary\-case coding rationales, and sensitivity effects rather than asking readers to trust aggregate counts\.

S\-level coding means the reported evidence directly supports stream/domain/task adaptation; B\-level coding means it supports a bridge mechanism such as internalization, compression, parametric memory, or self\-adaptation; D\-level coding means it directly tests sparse deployment information after paraphrase, delay, conflict, locality, or action use\. Borderline cases are coded to the strongest level directly supported by the reported evaluation, not by the broadest motivation sentence\. Eight papers are treated as boundary cases for sensitivity: TTT layers, LaCT, Not Just Context, TTT\-E2E, PERK, Locas, MEMORYLLM, and Absorber LLM\. Reclassifying these S/B cases changes the counts without changing the paper’s decision rule, because none supplies the full D\-level behavioral bundle for sparse user deployment claims\.

Table A7: Audit field summary\.

Table A8: Candidate accounting\.

### Boundary\-case coding worksheet\.

We provide the following worksheet for inspecting the eight boundary cases used in the audit sensitivity discussion\.

Table A9: Boundary\-case coding worksheet\.

### Background and exclusion records\.

The following records were screened because they are useful comparators, cautionary evaluation background, or boundary cases, but they are not coded as direct LLM TTT evidence in Tables [A11](https://arxiv.org/html/2607.00368#A3.T11)–[A13](https://arxiv.org/html/2607.00368#A3.T13)\.

Table A10: Background and exclusion records\.

### Worked examples\.

*TTT\-NN:* S pass, D no\. It updates on nearest\-neighbor text and reports local language\-model improvement, so it supports adaptation to nearby retrieved evidence; deployment\-memory use would require delayed no\-context recall, locality, and matched memory baselines\.

*PERK/Locas:* B pass, D incomplete\. They explicitly move toward parameter\-efficient or locally supported memories, so their evidence is closer to internalization; D\-level assistant\-memory claims still need sparse user facts, preferences, and corrections under paraphrase, delay, conflict, and matched external\-memory baselines\.

*LongMemEval/MemoryBench:* D target behavior yes\. They evaluate multi\-session or service\-time memory behavior, so they instantiate the target capability; a TTT paper borrowing that motivation should report comparable behavior under a stated update budget\.

*This pilot:* proxy and bridge evidence yes, D behavior no for the tested one\-step LoRA setup; matched retrieval and replacement\-memory checks are reported, so the result supports the proxy\-insufficiency point: proxy and bridge improvements should not be treated as deployment\-memory evidence without matching behavioral success\.

Table A11: Paper\-level audit for stream and long\-context TTT evidence\.

Table A12: Paper\-level audit for bridge evidence toward memory\-like claims\.

Table A13: Paper\-level audit for D\-level assistant\-memory targets\.

## Appendix D Diagnostic robustness checks

### Pilot experiment details\.

Unless otherwise noted, diagnostic runs use H100 GPUs, seed 13, LoRA rank 8, alpha 16, dropout 0\.0, learning rate $5\times 10^{-4}$, and a 24\-token generation budget\. The factual probes use 48 prompts per prompt type, the overwrite probes use 24 conflicts, and the locality probes use 144 unrelated prompts\. We score generated answers as correct when the normalized first answer line contains the normalized target answer\. We define $\Delta\mathrm{NLL}=\mathrm{NLL}_{\mathrm{after}}-\mathrm{NLL}_{\mathrm{before}}$, so negative deltas indicate improved teacher\-forced loss after the update\.
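The scoring rule and the sign convention for the NLL delta can be written down compactly. This is a minimal sketch, not the paper's scorer: the source does not specify the normalization, so lowercasing, removing ASCII punctuation, and collapsing whitespace are assumptions, as is treating the first non-empty line as the "first answer line". All function names are hypothetical.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def first_answer_line(generation: str) -> str:
    """Assumed reading of 'first answer line': the first non-empty line."""
    for line in generation.splitlines():
        if line.strip():
            return line
    return ""


def is_correct(generation: str, target: str) -> bool:
    """Correct when the normalized first line contains the normalized target."""
    norm_target = normalize(target)
    return bool(norm_target) and norm_target in normalize(first_answer_line(generation))


def delta_nll(nll_before: float, nll_after: float) -> float:
    """NLL_after - NLL_before; negative means teacher-forced loss improved."""
    return nll_after - nll_before


# Toy strings, invented for illustration only.
hit = is_correct("The code is ZX-41.\nAnything else?", "zx-41")
late = is_correct("I am not sure.\nZX-41", "ZX-41")
improved = delta_nll(nll_before=3.0, nll_after=2.5)
```

Under this rule an answer that appears only on a later line is scored as incorrect, which is why the scoring rule needs to be disclosed alongside the results.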

The matched external\-memory baselines receive the same support evidence as the LoRA update\. The *exact\-context* baseline prepends the support sentence directly to the evaluation prompt\. The *BM25\-style retrieval* baseline builds a fixed memory bank whose retrieval unit is the support sentence\. For each true fact, we add three hard negatives: the same object with a different prefix, the same prefix with a different object, and an answer\-format distractor\. We report both top\-1 retrieval hit, counted as correct when the retrieved fact id matches the target id, and end\-to\-end answer success after prompting with `Retrieved memory: <support>` followed by the original query\. For conflicts, the *replacement\-memory* baseline discards the stale support, keeps only the correction sentence, and then answers the current\-code query\.
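To make the retrieval baseline concrete, here is a minimal pure-Python sketch of BM25-style top-1 retrieval over a memory bank with one true fact and three hard negatives. It is an illustration under stated assumptions, not the paper's implementation: the tokenizer, the `k1` and `b` values, the smoothed IDF variant, the example sentences, the fact ids, and the newline separator in the prompt are all choices made here.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Memory:
    """Fixed memory bank whose retrieval unit is one support sentence."""

    def __init__(self, bank: List[Tuple[str, str]], k1: float = 1.5, b: float = 0.75):
        self.bank = bank
        self.docs = [Counter(tokenize(sentence)) for _, sentence in bank]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        self.k1, self.b = k1, b
        n = len(self.docs)
        doc_freq = Counter(term for doc in self.docs for term in doc)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in doc_freq.items()}

    def score(self, query: str, i: int) -> float:
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf:
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top1(self, query: str) -> Tuple[str, str]:
        """Return (fact_id, sentence) of the highest-scoring memory entry."""
        best = max(range(len(self.docs)), key=lambda i: self.score(query, i))
        return self.bank[best]


def top1_hit_rate(memory: BM25Memory, queries: List[Tuple[str, str]]) -> float:
    """Fraction of (query, target_id) pairs whose top-1 fact id matches the target id."""
    return sum(memory.top1(q)[0] == target for q, target in queries) / len(queries)


def build_prompt(support: str, query: str) -> str:
    """Prompt shape from the text; the newline separator is an assumption."""
    return f"Retrieved memory: {support}\n{query}"


bank = [
    ("velar-access", "The access code for Project Velar is QX-7731."),  # true fact
    ("neg-prefix", "The backup code for Project Velar is LM-2048."),    # same object, different prefix
    ("neg-object", "The access code for Project Dorin is TR-5590."),    # same prefix, different object
    ("neg-format", "A sample code looks like AB-0000."),                # answer-format distractor
]
memory = BM25Memory(bank)
query = "What is the access code for Project Velar?"
fact_id, support = memory.top1(query)
hit_rate = top1_hit_rate(memory, [(query, "velar-access")])
prompt = build_prompt(support, query)
```

Because the query here reuses the wording of the support sentence, lexical matching separates the true fact from both near-duplicate negatives. A paraphrased query would not share those terms, which is the situation the harder retrieval checks probe.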

The supporting checks are deliberately narrow\. On 1\.7B only, we run a continuous\-text proxy\-regime check by splitting 32 passages into *support*, *future\_1*, and *future\_2*, updating on the support chunk, and comparing one\-step, stream, and reset evaluation\. We rerun the 1\.7B factual/conflict core probes with two additional seeds\. We also add small 24\-item 1\.7B preference/correction and procedure/action mini\-tasks to test whether the proxy/behavior pattern is limited to nonce access\-code strings\.

Table A14: Robustness and stress checks for the 1\.7B diagnostic setting\.

### Conflict\-overwrite scoring\.

Correction claims require mutually exclusive categories\. A response counts as successful only when it contains the corrected answer and excludes the stale answer\. Responses containing both corrected and stale information are failures, even if a non\-exclusive scorer would mark the corrected answer as present\.

Table A15: Mutually exclusive conflict\-overwrite categories\. Counts are out of 24 conflicts per model\. Corrected\-only is the strict D\-level success criterion; both corrected and stale is counted as failure\.

The one\-step LoRA update produces neither code in all conflict cases, indicating overwrite non\-recall rather than successful correction\. This is a failure of behavioral access, not merely a failure of conflict arbitration\. Replacement memory is usually corrected\-only, although the 1\.7B model sometimes includes both corrected and stale codes in the same response\.
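The four mutually exclusive categories can be expressed as a small classifier. This is a sketch under assumptions: the category labels, the normalization (lowercase, keep only letters, digits, and whitespace), and matching against the whole response instead of a single line are choices made here, and the example responses are invented.

```python
from collections import Counter
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, keep only alphanumerics and whitespace."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


def categorize(response: str, corrected: str, stale: str) -> str:
    """Assign exactly one category to a conflict-overwrite response."""
    resp = normalize(response)
    has_corrected = normalize(corrected) in resp
    has_stale = normalize(stale) in resp
    if has_corrected and has_stale:
        return "both"  # counted as failure under the strict criterion
    if has_corrected:
        return "corrected_only"  # the only strict success
    if has_stale:
        return "stale_only"
    return "neither"


def tally(cases: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count categories over (response, corrected, stale) triples."""
    return Counter(categorize(r, c, s) for r, c, s in cases)


cases = [
    ("The current code is QX-7731.", "QX-7731", "LM-2048"),
    ("It is LM-2048.", "QX-7731", "LM-2048"),
    ("It was LM-2048 but is now QX-7731.", "QX-7731", "LM-2048"),
    ("I do not know.", "QX-7731", "LM-2048"),
]
counts = tally(cases)
strict_success_rate = counts["corrected_only"] / len(cases)
```

A non-exclusive scorer that only checks for the corrected code would mark two of these four toy responses as successes, while the strict criterion marks one.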

### Stress update tradeoff\.

A stronger update can produce generated recall, but recall alone is not sufficient for deployment\-memory claims\. Locality probes ask unrelated questions whose answers should remain unchanged after the update\.

Table A16: Stress update tradeoff in the Qwen3\-1\.7B factual setting\.

The 16\-step stress point raises direct and delayed recall to 72\.9% and paraphrased recall to 54\.2%, but locality falls from 97\.9% to 9\.7%\. We treat this as evidence that stronger updates can move behavior, not as evidence that the update has achieved deployment memory\. A D\-level claim would need to report both the behavioral gain and the interference cost\.
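One way to keep the interference cost next to the recall gain is to compute locality explicitly and report the drop in percentage points. The sketch below assumes that "unchanged" means a case- and whitespace-insensitive exact match between the answers before and after the update. That matching rule, the function names, and the toy answers are hypothetical; only the 97.9% and 9.7% locality figures come from the text above.

```python
from typing import List


def _norm(text: str) -> str:
    """Assumed comparison rule: case-insensitive, whitespace-collapsed exact match."""
    return " ".join(text.lower().split())


def locality_rate(answers_before: List[str], answers_after: List[str]) -> float:
    """Fraction of unrelated prompts whose answer is unchanged after the update."""
    if len(answers_before) != len(answers_after):
        raise ValueError("before/after answer lists must be aligned")
    unchanged = sum(_norm(a) == _norm(b) for a, b in zip(answers_before, answers_after))
    return unchanged / len(answers_before)


def interference_cost(locality_before_pct: float, locality_after_pct: float) -> float:
    """Locality drop in percentage points; larger means more interference."""
    return locality_before_pct - locality_after_pct


# Toy unrelated-prompt answers, invented for illustration only.
before = ["Paris", "4", "blue", "H2O"]
after = ["paris", "4", "QX-7731", "QX-7731"]
toy_locality = locality_rate(before, after)

# Locality figures reported above for the 16-step stress point.
reported_cost = round(interference_cost(97.9, 9.7), 1)
```

Reporting this drop together with the recall numbers keeps a stronger update from looking like a pure improvement.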

### Retrieval hardness check\.

The original BM25\-style factual condition is useful because it shows that the sparse evidence is answerable when explicit memory retrieves the intended sentence\. To avoid making lexical retrieval look easier than it is, we add a no\-generation lexical\-retrieval stress check: one condition paraphrases the true memory snippets and adds same\-subject backup\-token distractors; the other puts stale and corrected snippets for the same subject in memory and asks for the current code\. Table [A17](https://arxiv.org/html/2607.00368#A4.T17) reports retrieval hit rates and oracle top\-1 answerability, with model generation left out of this check\.

Table A17: Harder BM25 retrieval checks over the frozen fact set\.

### Qualitative diagnostic examples\.

The following examples make the failure mode inspectable\. They are illustrative rather than a prevalence table, and show cases where answer likelihood improves while free\-form behavior still does not change; the rank column is included only as case\-level diagnostic context\.

![Refer to caption](https://arxiv.org/html/2607.00368v1/figures/ttt_failure_cases.png)

Figure A1: Qualitative diagnostic failure cases\. Left: a one\-step update lowers target\-answer NLL without changing free\-form generation\. Right: a stronger stress update can elicit the target answer, but with severe locality interference on unrelated prompts\.

Table A18: Additional qualitative examples from the 1\.7B factual diagnostic\.

## Appendix E Scope and limitations

This paper presents a behavioral evaluation framework with a calibration audit and a controlled diagnostic experiment\. The diagnostic shows that proxy improvement can be insufficient for generated deployment behavior; it does not map the full boundary of parametric deployment\-time learning\. The audit is structured and reproducible, but its purpose is claim calibration rather than prevalence estimation\. These limitations reinforce the recommendation: specify the evaluation target, disclose the scoring rule, and report failure categories instead of compressing all evidence into one loss number\.

## Appendix F Deployment and Societal Risks

Deployment\-time memory is not only a capability claim but also a governance claim\. A system that stores user facts, preferences, corrections, or procedures through parametric updates may make information harder to inspect, delete, scope to a user, or audit than an explicit memory store\. Prematurely validating such systems with proxy metrics could encourage deployment before consent, retention, deletion, cross\-session isolation, and cross\-user leakage are tested\. Our behavioral standard therefore has an ethical dimension: D\-level evidence should include not only recall and personalization success, but also locality, conflict handling, deletion or forgetting when relevant, and clear disclosure of where information is stored\. Explicit\-memory baselines are important partly because they often provide clearer audit and deletion surfaces\. Parametric TTT methods that claim deployment memory should state what privacy, latency, compression, or offline\-operation constraints justify storing information in model state\.

Table A19: Governance\-oriented behavioral tests for deployment\-memory claims over user data\.
