One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

arXiv cs.LG 06/10/26, 04:00 AM Papers
Summary
This paper introduces WorldModelLens, an open-source substrate for interpretability of world models, using a capability-typed adapter interface that generalizes across diverse architectures like PlaNet, Dreamer, IRIS, and I-JEPA. The framework provides a unified hook-and-cache layer for activation analysis and adds only ~12% overhead when inactive.
arXiv:2606.09936v1 Announce Type: new
Cached at: 06/10/26, 06:19 AM
# One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability
Source: [https://arxiv.org/html/2606.09936](https://arxiv.org/html/2606.09936)
Bhavith Chandra Challagundla¹, Sanskar Pandey², Param Thakkar³, Rishikesh Mallagundla⁴, Yugandhar Reddy Gogireddy⁵, Wenhao Lu⁶, Hindol Roy Choudhury², Shravani Challagundla⁷, Mohamed Deraz Nasr⁸, and Spursh Deshpande²

¹New York University; ²Independent Researcher; ³Veermata Jijabai Technological Institute; ⁴Mercity; ⁵University of Southern California; ⁶Independent Researcher, MIT; ⁷Independent Researcher, GITAM; ⁸Georgia Institute of Technology

###### Abstract

World models are now built on substantially different computational substrates\. Latent recurrent state\-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token\-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint\-embedding predictive architectures such as I\-JEPA predict in a learned latent space with no pixel decoder\. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re\-implemented from scratch for each architecture because existing hook\-and\-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts\. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface\. We present WorldModelLens, an open\-source interpretability substrate organized around a capability\-typed adapter: every model implements four required methods \(encode, transition, initial state, sample\) and declares a set of optional heads \(decode, reward, continue, actor, critic\) through an explicit capability descriptor, so that reinforcement\-learning and self\-supervised world models are first\-class without either imitating the other\. A single hook and cache layer exposes time\-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once\. 
We demonstrate the full analysis suite end\-to\-end on I\-JEPA, provide adapters spanning latent\-recurrent, transformer\-token, and joint\-embedding families as evidence that the interface generalizes by construction, and show that the hook layer adds approximately 12% per\-step overhead when inactive, which makes always\-on interpretability telemetry practical inside training and control loops\.

## 1 Introduction

World models have become a central tool for learning predictive representations of environments\[[1](https://arxiv.org/html/2606.09936#bib.bib1),[2](https://arxiv.org/html/2606.09936#bib.bib2)\], and the architectures used to build them have diverged sharply\. The Dreamer family and PlaNet learn a recurrent state\-space model whose latent combines a deterministic recurrent component with a continuous or discrete\-categorical stochastic component\[[3](https://arxiv.org/html/2606.09936#bib.bib3),[4](https://arxiv.org/html/2606.09936#bib.bib4),[5](https://arxiv.org/html/2606.09936#bib.bib5)\]\. IRIS and Decision Transformer quantize observations into a learned codebook and model dynamics autoregressively with a transformer\[[6](https://arxiv.org/html/2606.09936#bib.bib6),[7](https://arxiv.org/html/2606.09936#bib.bib7)\]\. Joint\-embedding predictive architectures such as I\-JEPA and TD\-MPC2 predict directly in a learned embedding space and may have no decoder at all\[[8](https://arxiv.org/html/2606.09936#bib.bib8),[9](https://arxiv.org/html/2606.09936#bib.bib9)\]\. These models differ not only in their internal computation but in their interface to the environment: some consume actions and emit rewards and values, while others are purely self\-supervised and have neither\.

Interpretability methods, by contrast, are uniform in spirit. Linear and nonlinear probes read an internal representation and predict a property of interest \[[19](https://arxiv.org/html/2606.09936#bib.bib19),[20](https://arxiv.org/html/2606.09936#bib.bib20)\]; activation patching overwrites an internal activation and measures the downstream effect \[[15](https://arxiv.org/html/2606.09936#bib.bib15),[14](https://arxiv.org/html/2606.09936#bib.bib14)\]; sparse autoencoders decompose activations into interpretable features \[[17](https://arxiv.org/html/2606.09936#bib.bib17),[18](https://arxiv.org/html/2606.09936#bib.bib18)\]. Each reduces to the same three primitives: read an activation, optionally overwrite it, and observe an effect. This regularity is exactly what mechanistic-interpretability tooling for language models exploits. TransformerLens \[[12](https://arxiv.org/html/2606.09936#bib.bib12)\] popularized a hook-and-cache abstraction in which named hook points expose every internal activation for reading or editing, and a body of work has built circuit-level analyses on top of it \[[13](https://arxiv.org/html/2606.09936#bib.bib13),[14](https://arxiv.org/html/2606.09936#bib.bib14)\].

That tooling, however, is shaped around the transformer language model. It assumes a stack of attention blocks operating on a sequence of tokens, with no representation for an action input, no environment step, no multi-step rollout, no imagined future, and no mixed continuous and discrete latent. World models violate every one of these assumptions. The practical effect is that interpretability code is rewritten for each new world model, results are reported in model-specific terms that do not transfer, and questions that depend on comparing internal structure across architectures cannot be expressed in shared code. General-purpose interpretability libraries such as Captum \[[21](https://arxiv.org/html/2606.09936#bib.bib21)\], NNsight \[[22](https://arxiv.org/html/2606.09936#bib.bib22)\], and Penzai \[[23](https://arxiv.org/html/2606.09936#bib.bib23)\] provide attribution and access primitives but do not model world-model semantics such as trajectories, dynamics rollouts, or intervention replay.

We argue that this fragmentation is a property of the tooling rather than of the models. A world model, irrespective of family, can be described by an encoder that maps an observation to a latent, a transition that advances latent state and optionally consumes an action, and a set of optional read-out heads that a given model may or may not possess. We make this structure explicit through a *capability-typed adapter*, in which the required core must be implemented and optional capabilities are declared by a descriptor, so that a reinforcement-learning agent and a self-supervised predictor are both first-class without either having to fabricate the other's heads. A single hook and cache layer mounted over this interface then exposes time-indexed activations, imagination rollouts, and intervention replay, so that every analysis is written once against the interface rather than against any single architecture.

We present WorldModelLens, an open\-source realization of this design\. Our contributions are:

- A capability-typed world-model interface (Section [3](https://arxiv.org/html/2606.09936#S3)) that expresses reinforcement-learning and self-supervised world models uniformly through four required methods and five optional, descriptor-gated heads, and that we instantiate across the Dreamer, PlaNet, TD-MPC2, IRIS, Decision Transformer, and I-JEPA families.
- A backend-agnostic hook, cache, rollout, and intervention-replay layer over that interface, with time-indexed activation caching and device offloading for long rollouts, on top of which probing, activation patching, sparse autoencoders, and surprise analysis are each implemented exactly once (Section [4](https://arxiv.org/html/2606.09936#S4)).
- An end-to-end demonstration on I-JEPA (Section [5](https://arxiv.org/html/2606.09936#S5)) in which the unmodified analysis suite recovers layer-resolved structure in the predictor, together with a measurement of the overhead the substrate imposes.
- An open implementation whose hook layer adds approximately 12% per-step overhead when inactive, making always-on interpretability telemetry practical inside training and control loops.

Scaling the empirical analysis to additional families, including Dreamer at full scale, V\-JEPA\[[10](https://arxiv.org/html/2606.09936#bib.bib10)\], and Cosmos\[[11](https://arxiv.org/html/2606.09936#bib.bib11)\], is in progress and forms our roadmap \(Section[7](https://arxiv.org/html/2606.09936#S7)\)\. The present paper establishes the interface, the substrate, and a deep single\-family demonstration\.

## 2 Background

We formalize a world model as a tuple

$$\mathcal{M}=\big(\mathcal{O},\,\mathcal{S},\,\mathcal{Z},\,\mathcal{A},\ \iota,\,\mathcal{E},\,\tau,\ \mathcal{H}\big), \tag{1}$$

where $\mathcal{O}$ is the observation space, $\mathcal{S}$ the deterministic latent state space, $\mathcal{Z}$ the stochastic latent space, and $\mathcal{A}$ an optional action space. The required core is three maps and a sampler. An initial-state map $\iota:\{\ast\}\to\mathcal{S}$ returns $s_{0}$. A probabilistic encoder

$$\mathcal{E}:\mathcal{O}\times\mathcal{S}\to\Delta(\mathcal{Z}),\qquad z_{t}\sim q_{\theta}(z\mid o_{t},s_{t-1}) \tag{2}$$

maps an observation and the previous state to a distribution over latent codes in the variational tradition \[[58](https://arxiv.org/html/2606.09936#bib.bib58)\], from which `sample_z` draws $z_{t}$ using a Gumbel-softmax relaxation \[[31](https://arxiv.org/html/2606.09936#bib.bib31),[32](https://arxiv.org/html/2606.09936#bib.bib32)\] when $\mathcal{Z}$ is categorical and the identity when it is continuous. A transition

$$\tau:\mathcal{S}\times\mathcal{Z}\times(\mathcal{A}\cup\{\varnothing\})\to\mathcal{S},\qquad s_{t+1}=\tau(s_{t},z_{t},a_{t}), \tag{3}$$

advances the deterministic state, with $a_{t}=\varnothing$ for action-free models. The unit visible to downstream analysis is the joint latent $h_{t}=(s_{t},z_{t})$. The optional heads form a set $\mathcal{H}\subseteq\{g_{\mathrm{dec}},g_{\mathrm{rew}},g_{\mathrm{cont}},\pi,V\}$ with signatures

$$\begin{aligned}
g_{\mathrm{dec}} &: \mathcal{S}\times\mathcal{Z}\to\mathcal{O}, & g_{\mathrm{rew}} &: \mathcal{S}\times\mathcal{Z}\to\mathbb{R}, & g_{\mathrm{cont}} &: \mathcal{S}\times\mathcal{Z}\to[0,1],\\
\pi &: \mathcal{S}\times\mathcal{Z}\to\Delta(\mathcal{A}), & V &: \mathcal{S}\times\mathcal{Z}\to\mathbb{R},
\end{aligned} \tag{4}$$

for decoding, reward, continuation, policy, and value. Reinforcement-learning world models \[[59](https://arxiv.org/html/2606.09936#bib.bib59)\] instantiate most of $\mathcal{H}$; self-supervised video and joint-embedding models instantiate few or none. A rollout over $o_{1:T}$ is the recursion $z_{t}\sim\mathcal{E}(o_{t},s_{t-1})$, $s_{t+1}=\tau(s_{t},z_{t},a_{t})$, and *imagination* is the same recursion with $\mathcal{E}$ replaced by the learned prior $p_{\theta}(z\mid s_{t})$. The per-step *surprise* is the divergence between the posterior and prior latents,

$$\mathrm{surprise}_{t}\;=\;D_{\mathrm{KL}}\big(q_{\theta}(z\mid o_{t},s_{t-1})\,\big\|\,p_{\theta}(z\mid s_{t})\big). \tag{5}$$

The families we target differ chiefly in the form of the latent and the transition, not in this signature. Dreamer and PlaNet use a recurrent state-space model whose latent concatenates a deterministic recurrent part with a stochastic part that is continuous in PlaNet and DreamerV1 and discrete-categorical from DreamerV2 onward \[[3](https://arxiv.org/html/2606.09936#bib.bib3),[4](https://arxiv.org/html/2606.09936#bib.bib4)\]. IRIS and Decision Transformer use a transformer \[[56](https://arxiv.org/html/2606.09936#bib.bib56)\] over a discrete codebook, so the latent is a sequence of tokens \[[6](https://arxiv.org/html/2606.09936#bib.bib6),[7](https://arxiv.org/html/2606.09936#bib.bib7)\]. I-JEPA and TD-MPC2 predict in a continuous embedding space and, in the case of I-JEPA, have no decoder \[[8](https://arxiv.org/html/2606.09936#bib.bib8),[9](https://arxiv.org/html/2606.09936#bib.bib9)\]. A transformer-only hook library has no place for the action input, no representation for an imagination rollout, and no abstraction that covers a recurrent latent, a token sequence, and a joint embedding at once. The interface above does, and it is this interface that WorldModelLens makes concrete.
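
As an illustration of the per-step surprise in Eq. (5), the following is a minimal sketch for categorical latents, written with plain Python lists rather than the `torch.distributions` objects the library is described as caching. The function names `kl_categorical` and `surprise_trace` are hypothetical and are not part of WorldModelLens.

```python
import math

def kl_categorical(q, p, eps=1e-12):
    """KL(q || p) in nats for two categorical distributions given as probability lists."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same number of categories")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:  # categories with q_i = 0 contribute nothing to the sum
            total += qi * (math.log(qi) - math.log(max(pi, eps)))
    return total

def surprise_trace(posteriors, priors):
    """One surprise value per timestep: KL(posterior_t || prior_t)."""
    return [kl_categorical(q, p) for q, p in zip(posteriors, priors)]

# Toy rollout with two steps and a two-category latent (illustrative numbers only).
posteriors = [[0.5, 0.5], [0.9, 0.1]]
priors = [[0.5, 0.5], [0.5, 0.5]]
trace = surprise_trace(posteriors, priors)
```

In this toy rollout the first step has zero surprise because posterior and prior coincide, while the second is positive because the observation sharpened the posterior away from the uniform prior.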

## 3 The WorldModelLens Abstraction

WorldModelLens is organized as three layers, shown in Figure [1](https://arxiv.org/html/2606.09936#S3.F1), that separate what a model *is* from how it is *instrumented* from what is *measured*. At the bottom, a backend adapter exposes a single world model through the required-and-optional interface of Section [2](https://arxiv.org/html/2606.09936#S2), translating that model's particular internals into the common signature. In the middle, the `HookedWorldModel` wrapper mounts named hook points over the adapter's outputs and routes every activation it produces through a single cache manager, so that each activation can be read, recorded, or overwritten by name. At the top, a library of analyses operates entirely through the wrapper, reading and editing activations by name and never referring to a particular architecture.

The value of this separation is that responsibility is partitioned cleanly across the boundaries. A forward pass enters at the adapter, which calls `encode` on each observation and `transition` to advance the latent state, emitting a named activation at every hook point. The wrapper records or modifies those activations and assembles them into a typed trajectory. The analysis layer then consumes the trajectory, or installs interventions that the wrapper replays through the adapter. Because the adapter is the only layer that knows how a given model computes, replacing DreamerV3 with I-JEPA changes only the bottom layer, while every analysis above it continues to run without modification. This is the concrete mechanism behind the portability claim of Section [1](https://arxiv.org/html/2606.09936#S1): an analysis is written against the interface once, and inherits every present and future backend for free. The remainder of this section describes each layer in turn.

Figure 1: The three-layer design. Top, the analysis layer (written once): probing, patching, SAE / surprise. Middle, the hook and cache layer, `HookedWorldModel`: `run_with_cache`, `run_with_hooks`, imagine / replay. Bottom, the adapter layer, `BaseModelAdapter` + `Capabilities`: Dreamer / PlaNet, IRIS / DT, I-JEPA / TD-MPC2. The upper boundary is labeled "activations read and edited by name" and the lower boundary "required + optional model interface". Adapters expose any world model through one typed interface; the hooked-model wrapper mounts named hook points and a single cache; analyses read and edit activations without referring to any architecture.

### 3.1 Capability-Typed Adapters

A backend is registered by subclassing `BaseModelAdapter`, a `torch.nn.Module` whose required surface is four methods: `encode(obs, h_prev)`, which returns a posterior latent and its prior; `transition(h, z, action)`, which advances the deterministic state and accepts an optional action; `initial_state(batch_size, device)`; and `sample_z(logits, temperature)`, which draws a latent from the encoder output and applies a Gumbel-softmax relaxation \[[31](https://arxiv.org/html/2606.09936#bib.bib31),[32](https://arxiv.org/html/2606.09936#bib.bib32)\] for categorical latents and the identity for continuous ones. Five further methods are optional and correspond to read-out heads: `decode`, `predict_reward`, `predict_continue`, `actor_forward`, and `critic_forward`. Each optional method raises `NotImplementedError` in the base class, so a head is available only when an adapter overrides it. Formally, the declared capabilities form a vector $c\in\{0,1\}^{7}$ over $(\textsc{dec},\textsc{rew},\textsc{cont},\textsc{act},\textsc{crit},\textsc{usesA},\textsc{rl})$, normalized on construction to satisfy the implications

$$c_{\textsc{act}}\Rightarrow c_{\textsc{usesA}},\qquad c_{\textsc{rew}}\Rightarrow c_{\textsc{rl}}, \tag{6}$$

and the instantiated head set is $\mathcal{H}(c)=\{g_{j}:c_{j}=1\}$. An analysis dispatches on $\mathcal{H}(c)$ rather than on the concrete model class, which is what lets a single implementation serve reinforcement-learning and self-supervised models alike.

Which optional heads exist is declared rather than discovered. A `WorldModelCapabilities` descriptor carries seven boolean fields, `has_decoder`, `has_reward_head`, `has_continue_head`, `has_actor`, `has_critic`, `uses_actions`, and `is_rl_trained`, and normalizes them on construction: declaring an actor implies the model is action-conditioned, and declaring a reward head implies it was trained against a reinforcement-learning objective. Two predicates, `requires_actions` and `is_rl_model`, summarize the descriptor for callers. Analyses consult these fields and skip what a model does not provide instead of failing on a missing method. Figure [2](https://arxiv.org/html/2606.09936#S3.F2) shows three families populating the same interface differently: an I-JEPA encoder declares no reward, value, or action capability, a planning model such as PlaNet declares a decoder and a reward head but no actor, and a DreamerV3 agent declares the full set. Adapters are resolved through a central `BackendRegistry` indexed by a `WorldModelFamily` enumeration and by capability, so a backend is selected by family or by the features an analysis requires rather than by a hard-coded class reference.
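
The descriptor and its normalization rule (Eq. 6) can be sketched as a small dataclass. This is an illustrative re-creation under stated assumptions, not the library's source: the seven field names and the two predicate names come from the text above, while the class name `CapabilitiesSketch`, the choice of properties for the predicates, and the `heads()` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CapabilitiesSketch:
    """Hypothetical stand-in for the WorldModelCapabilities descriptor."""
    has_decoder: bool = False
    has_reward_head: bool = False
    has_continue_head: bool = False
    has_actor: bool = False
    has_critic: bool = False
    uses_actions: bool = False
    is_rl_trained: bool = False

    def __post_init__(self):
        # Normalization on construction (Eq. 6):
        # an actor implies action conditioning; a reward head implies RL training.
        if self.has_actor:
            self.uses_actions = True
        if self.has_reward_head:
            self.is_rl_trained = True

    @property
    def requires_actions(self):
        return self.uses_actions

    @property
    def is_rl_model(self):
        return self.is_rl_trained

    def heads(self):
        """Names of the declared optional heads (hypothetical helper)."""
        flags = {
            "decode": self.has_decoder,
            "reward": self.has_reward_head,
            "continue": self.has_continue_head,
            "actor": self.has_actor,
            "critic": self.has_critic,
        }
        return [name for name, present in flags.items() if present]

# The three examples from the text, populated differently.
ijepa = CapabilitiesSketch()
planet = CapabilitiesSketch(has_decoder=True, has_reward_head=True, uses_actions=True)
dreamer = CapabilitiesSketch(has_decoder=True, has_reward_head=True,
                             has_continue_head=True, has_actor=True, has_critic=True)

# An analysis dispatches on declared heads rather than on the model class.
can_probe_reward = [c.has_reward_head for c in (ijepa, planet, dreamer)]
```

An analysis that needs a reward head would consult `has_reward_head` and skip I-JEPA here, rather than calling a method and catching `NotImplementedError`.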

Figure 2: The same interface, populated differently. Columns: DreamerV3, IRIS, I-JEPA. Every column shares the required core (`encode`, `transition`, `init`, `sample`); the optional-head boxes are decode (tokens, for IRIS), reward, continue, actor, and critic. Solid teal boxes are heads a family exposes; dashed gray boxes are capabilities it declares absent. The required core (blue) is identical across all families, so analyses target it directly and consult the descriptor for everything optional.

### 3.2 Hooks, Caching, and Time

The methods of an adapter are wrapped by `HookedWorldModel`, which mounts named hook points and routes every activation through a single `HookCacheManager`. A hook point is identified by a component name and a time index, and a hook function receives the tensor at that point together with a `HookContext` that carries the current timestep, the component name, and the trajectory observed so far. Hooks run at a `pre` or `post` stage relative to the component and can be restricted to a single timestep or a half-open range of timesteps, which gives temporal control without altering the model. A higher-level hook grammar parses string specifications such as `z[0:10]` for a range of latent dimensions, `t=5.z` for a single step, and `transition.pre` for a stage, and compiles them down to the same primitives. For models exposed as ordinary `nn.Module` graphs, a `HookedRootModule` registers forward hooks on leaf modules and assigns standardized names such as `encoder.layer_i.hook_output` and `attn.hook_{query,key,value,pattern}`, which makes the attention internals of transformer-token and joint-embedding models addressable by the same machinery.
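
To make the hook grammar concrete, here is a minimal sketch of a parser that accepts the three example specifications quoted above. The real grammar is not specified in the text, so everything here is an assumption: the `HookSpec` record, the `parse_hook_spec` function, the regular expression, and the default stage of `post` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class HookSpec:
    """Hypothetical parsed form of a hook specification string."""
    component: str
    stage: str = "post"                     # assumed default when no stage is given
    timestep: Optional[int] = None          # None means every timestep
    dims: Optional[Tuple[int, int]] = None  # half-open range of latent dimensions

# Optional "t=<int>." prefix, a component name, an optional "[lo:hi]" slice,
# and an optional ".pre" / ".post" stage suffix.
_SPEC = re.compile(
    r"^(?:t=(?P<t>\d+)\.)?"
    r"(?P<name>[A-Za-z_]\w*)"
    r"(?:\[(?P<lo>\d+):(?P<hi>\d+)\])?"
    r"(?:\.(?P<stage>pre|post))?$"
)

def parse_hook_spec(spec):
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized hook specification: {spec!r}")
    timestep = int(match.group("t")) if match.group("t") is not None else None
    dims = None
    if match.group("lo") is not None:
        dims = (int(match.group("lo")), int(match.group("hi")))
    return HookSpec(
        component=match.group("name"),
        stage=match.group("stage") or "post",
        timestep=timestep,
        dims=dims,
    )

specs = [parse_hook_spec(s) for s in ("z[0:10]", "t=5.z", "transition.pre")]
```

Each parsed record reduces to the same primitive triple of component, stage, and time restriction, which is the sense in which a grammar of this kind could compile down to the underlying hook points.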

Two entry points cover the analysis surface. `run_with_cache` performs a full forward pass and writes every requested activation into a time-indexed `ActivationCache` addressable by the pair (name, $t$); the cache stores tensors, lazily evaluated callables, and `torch.distributions` objects, the last of which lets it compute per-step surprise as the Kullback-Leibler divergence between the posterior and prior latent distributions \[[4](https://arxiv.org/html/2606.09936#bib.bib4)\]. `run_with_hooks` installs temporary functions at named points for the duration of one pass, which is the primitive underlying activation patching \[[15](https://arxiv.org/html/2606.09936#bib.bib15),[14](https://arxiv.org/html/2606.09936#bib.bib14)\], ablation, and intervention. Writing $\mathcal{C}$ for the cache as a partial map $(n,t)\mapsto\mathcal{C}[n,t]$ from a hook name and timestep to a tensor, `run_with_hooks` replaces the activation at a chosen site by a function $\phi_{n,t}$ before it propagates, $h_{n,t}\leftarrow\phi_{n,t}(h_{n,t})$, which subsumes ablation ($\phi\equiv 0$ on a coordinate subset), additive noising ($\phi(h)=h+\epsilon$, $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$), and restoration ($\phi\equiv\mathcal{C}^{\mathrm{clean}}[n,t]$ from a cached clean run). Long rollouts can exceed device memory, so the cache exposes an offloading policy (retain on device, move to host, spill to disk, or switch automatically past a timestep threshold) with optional half-precision quantization, while a lazy trajectory store backs episodes with shared tensors persisted in Zarr or HDF5. Figure [3](https://arxiv.org/html/2606.09936#S3.F3) contrasts the two entry points.
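
The three interventions named above are all instances of one rewrite, $h_{n,t}\leftarrow\phi_{n,t}(h_{n,t})$. The sketch below shows them as small Python closures over plain lists, which stand in for tensors; `ablate`, `add_noise`, `restore`, and `apply_hooks` are hypothetical names chosen for illustration and are not the library's API.

```python
import random

def ablate(indices):
    """phi = 0 on a coordinate subset, identity elsewhere."""
    def phi(h):
        return [0.0 if i in indices else v for i, v in enumerate(h)]
    return phi

def add_noise(sigma, rng):
    """phi(h) = h + eps with eps drawn i.i.d. from N(0, sigma^2)."""
    def phi(h):
        return [v + rng.gauss(0.0, sigma) for v in h]
    return phi

def restore(clean_cache, name, t):
    """phi returns the cached clean activation at (name, t), ignoring its input."""
    def phi(h):
        return list(clean_cache[(name, t)])
    return phi

def apply_hooks(name, t, h, hooks):
    """Rewrite h at site (name, t) if a hook is installed there, else pass it through."""
    phi = hooks.get((name, t))
    return phi(h) if phi is not None else h

# A cached clean run and a corrupted activation at the same site (toy values).
clean_cache = {("z", 2): [1.0, 2.0, 3.0]}
corrupted = [9.0, 9.0, 9.0]

ablated = apply_hooks("z", 2, corrupted, {("z", 2): ablate({0, 2})})
restored = apply_hooks("z", 2, corrupted, {("z", 2): restore(clean_cache, "z", 2)})
untouched = apply_hooks("z", 1, corrupted, {("z", 2): ablate({0})})
noised = apply_hooks("z", 2, corrupted, {("z", 2): add_noise(0.1, random.Random(0))})
```

Keying the hook table by the pair (name, $t$) mirrors the cache's own addressing, so a restoration hook can read directly from a cached clean run.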

Figure 3: Two access patterns over a rollout through states $s_{0},\dots,s_{3}$ with cache keys $(n,0),\dots,(n,3)$. `run_with_cache` (teal) records every activation under a (name, $t$) key for later analysis. `run_with_hooks` (amber) installs a function $f$ that overwrites a chosen activation in place (here at $t=2$), which is the basis of patching, ablation, and intervention replay.

### 3.3 Trajectories, Imagination, and Intervention Replay

Above the cache sits a typed data model. A `WorldState` records one step as a required latent state with optional action, reward, value, and termination fields, and a `WorldTrajectory` is a sequence of such states tagged as real, imagined, or planned. A parallel `LatentState` and `LatentTrajectory` expose the recurrent and stochastic components of latent-dynamics models directly, including the per-step Kullback-Leibler term used as a surprise signal. Two operations distinguish world-model interpretability from its language-model counterpart. Imagination, exposed as `imagine`, rolls the transition forward from a chosen state without further observations and samples actions from the actor when one is present, which produces counterfactual futures that are themselves cached and analyzed. Intervention replay re-executes a recorded trajectory while installing hooks that overwrite chosen activations, so the causal effect of an internal edit on the model's own predicted future is measured directly rather than inferred. A three-level hierarchy ties these together at the granularity of a dimension, a state, or a trajectory, so an analysis can locate a behavior over a span of steps, examine the active dimensions of a single state, and trace a result down to one latent coordinate. Together these convert the static read-and-patch primitives of language-model interpretability \[[12](https://arxiv.org/html/2606.09936#bib.bib12)\] into the rollout-and-replay primitives that world-model analysis requires. These three layers, the adapter that fixes the interface, the hooked wrapper that fixes how it is observed, and the typed trajectories that fix what is recorded, are the substrate against which every analysis in the next section is written.
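
The imagination recursion can be sketched with scalar states and a toy model. This is an assumption-laden illustration rather than the library's `imagine`: the `Step` record, the function signature, and the toy `transition` and `prior` are hypothetical, and the prior here returns a deterministic value where a real model would sample from $p_{\theta}(z\mid s_{t})$.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    """Hypothetical one-step record: latent state plus an optional action and a tag."""
    state: float
    latent: float
    action: Optional[float]
    kind: str  # "real", "imagined", or "planned"

def imagine(transition, prior, start_state, horizon, actor=None):
    """Roll the transition forward from start_state with no further observations."""
    steps = []
    s = start_state
    for _ in range(horizon):
        z = prior(s)                                  # prior replaces the encoder
        a = actor(s, z) if actor is not None else None  # act only if an actor exists
        steps.append(Step(state=s, latent=z, action=a, kind="imagined"))
        s = transition(s, z, a)
    return steps

# Toy dynamics (illustrative only): decay the state, add the latent and any action.
def transition(s, z, a):
    return 0.5 * s + z + (a if a is not None else 0.0)

def prior(s):
    return 0.1 * s

action_free = imagine(transition, prior, start_state=1.0, horizon=3)
with_actor = imagine(transition, prior, start_state=1.0, horizon=3,
                     actor=lambda s, z: -0.1)
```

The action-free call corresponds to a self-supervised model with $a_{t}=\varnothing$, and the second call shows how the same loop picks up actions when an actor head is declared.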

## 4 Analyses, Expressed Once

Because every adapter exposes the same interface, an analysis written against that interface runs on every backend without modification. Algorithm [1](https://arxiv.org/html/2606.09936#alg1) states activation patching in terms of `run_with_cache` and `run_with_hooks` alone. A clean run is cached, a corrupted run is patched at one site with the cached value, and the reported effect is the recovery rate $(m_{\text{patched}}-m_{\text{corrupted}})/(m_{\text{clean}}-m_{\text{corrupted}})$, clamped to the unit interval. The same two primitives support causal tracing that adds Gaussian noise to all components and restores one at a time \[[15](https://arxiv.org/html/2606.09936#bib.bib15),[64](https://arxiv.org/html/2606.09936#bib.bib64)\], and a greedy circuit search that ablates attention heads in order of attributed effect and keeps the minimal set that preserves a behavior \[[45](https://arxiv.org/html/2606.09936#bib.bib45),[16](https://arxiv.org/html/2606.09936#bib.bib16)\].

The remaining analyses follow the same read-or-edit pattern over named activations and never reference an architecture. Probing fits linear, ridge, and logistic estimators with stratified cross-validation and reports permutation-test significance \[[19](https://arxiv.org/html/2606.09936#bib.bib19),[20](https://arxiv.org/html/2606.09936#bib.bib20),[54](https://arxiv.org/html/2606.09936#bib.bib54)\]. Semantic probes project latents onto frozen DINO \[[40](https://arxiv.org/html/2606.09936#bib.bib40),[60](https://arxiv.org/html/2606.09936#bib.bib60)\] and CLIP \[[41](https://arxiv.org/html/2606.09936#bib.bib41)\] features, and a learned linear map supports natural-language concept queries against the latent space \[[63](https://arxiv.org/html/2606.09936#bib.bib63)\]. Sparse autoencoders are provided in ReLU, top-$k$ \[[43](https://arxiv.org/html/2606.09936#bib.bib43)\], and gated \[[44](https://arxiv.org/html/2606.09936#bib.bib44)\] forms, trained with a reconstruction-plus-sparsity objective in the tradition of dictionary learning \[[42](https://arxiv.org/html/2606.09936#bib.bib42),[17](https://arxiv.org/html/2606.09936#bib.bib17),[18](https://arxiv.org/html/2606.09936#bib.bib18),[65](https://arxiv.org/html/2606.09936#bib.bib65)\], and the evaluator reports the expected $L_{0}$, reconstruction fidelity, and dead-feature fraction. Disentanglement is measured with the mutual-information gap, the DCI scores, and SAP \[[27](https://arxiv.org/html/2606.09936#bib.bib27),[28](https://arxiv.org/html/2606.09936#bib.bib28),[29](https://arxiv.org/html/2606.09936#bib.bib29)\]; representational similarity uses centered kernel alignment \[[26](https://arxiv.org/html/2606.09936#bib.bib26)\]; faithfulness is scored by the area over the perturbation curve \[[52](https://arxiv.org/html/2606.09936#bib.bib52)\]; attribution combines integrated gradients \[[24](https://arxiv.org/html/2606.09936#bib.bib24)\] with SmoothGrad \[[49](https://arxiv.org/html/2606.09936#bib.bib49)\]; and uncertainty is decomposed into epistemic and aleatoric parts, with a Mahalanobis score for out-of-distribution latents \[[55](https://arxiv.org/html/2606.09936#bib.bib55)\]. Each is implemented once and inherits every backend.

For precision, the principal estimands are the sparse\-autoencoder objective, integrated\-gradients attribution, and centered kernel alignment,

$$\mathcal{L}_{\mathrm{SAE}}(x)=\lVert x-W_{d}f\rVert_{2}^{2}+\lambda\lVert f\rVert_{1},\quad f=\sigma_{k}(W_{e}x+b), \tag{7}$$

$$\mathrm{IG}_{i}(x)=(x_{i}-x_{i}^{\prime})\int_{0}^{1}\frac{\partial F\big(x^{\prime}+\alpha(x-x^{\prime})\big)}{\partial x_{i}}\,d\alpha, \tag{8}$$

$$\mathrm{CKA}(X,Y)=\frac{\mathrm{HSIC}(K,L)}{\sqrt{\mathrm{HSIC}(K,K)\,\mathrm{HSIC}(L,L)}}, \tag{9}$$

where $\sigma_{k}$ is the top-$k$ activation, $W_{e},W_{d}$ the encoder and decoder weights, $x^{\prime}$ a baseline embedding, and $K,L$ the Gram matrices of the two representations; the integral in ([8](https://arxiv.org/html/2606.09936#S4.E8)) is approximated by a 50-step Riemann sum. Faithfulness is the area over the perturbation curve, $\mathrm{AOPC}=\tfrac{1}{K+1}\sum_{k=0}^{K}\big(m_{0}-m_{k}\big)$, where $m_{k}$ is the prediction after ablating the $k$ most important coordinates, and an out-of-distribution latent is scored by the Mahalanobis distance $d_{M}(z)=\sqrt{(z-\mu)^{\top}\Sigma^{-1}(z-\mu)}$ to the fitted latent Gaussian $\mathcal{N}(\mu,\Sigma)$.
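
Two of these estimands are simple enough to sketch in plain Python. The integrated-gradients function below approximates the integral in Eq. (8) with a 50-step Riemann sum; the text does not say which rule is used, so the midpoint rule is an assumption, as are the function names and the requirement that the caller supply the gradient function. The AOPC function follows the formula above directly.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate Eq. (8) with a midpoint Riemann sum (the rule is an assumption).

    grad_f(point) must return the gradient of F at point as a list.
    """
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            grad_sum[i] += g[i]
    # (x_i - x'_i) times the average gradient along the straight path
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]

def aopc(predictions):
    """AOPC = (1 / (K + 1)) * sum_k (m_0 - m_k), with predictions[k] = m_k."""
    m0 = predictions[0]
    return sum(m0 - mk for mk in predictions) / len(predictions)

# Toy check on F(x) = sum_i x_i^2, whose gradient is 2x.
def grad_quadratic(point):
    return [2.0 * v for v in point]

x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_quadratic, x, baseline)

# Toy perturbation curve: the prediction drops as more coordinates are ablated.
faithfulness = aopc([1.0, 0.7, 0.4, 0.1])
```

For the quadratic the attributions sum to $F(x)-F(x^{\prime})$, which is the completeness property of integrated gradients and a convenient sanity check on any numerical approximation of the path integral.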

Algorithm 1: Activation patching, written once against the interface

1. Input: hooked model $M$, clean obs $o$, corrupted obs $o'$, hook name $n$, step $t$
2. $\_,\ \mathcal{C} \leftarrow M.\text{run\_with\_cache}(o)$ $\triangleright$ record clean activations
3. $v \leftarrow \mathcal{C}[n, t]$ $\triangleright$ clean activation at $(n, t)$
4. $f \leftarrow \lambda x:\ v$ $\triangleright$ patch hook returns the clean value
5. $y' \leftarrow M.\text{run\_with\_hooks}(o',\ \text{hooks} = \{(n, t): f\})$
6. return effect of restoring $(n, t)$ on the model output $y'$
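
The steps of Algorithm 1 can be sketched in a few lines of Python. `ToyHookedModel` below is a hypothetical stand-in with a one-dimensional recurrent latent, not the WorldModelLens API; only the method names `run_with_cache` and `run_with_hooks` come from the algorithm, and their signatures here are assumed. Measuring the effect as the patched output minus the unpatched corrupted output is likewise one possible choice.

```python
from typing import Callable, Dict, List, Optional, Tuple

HookKey = Tuple[str, int]  # (hook name n, step t)
Hook = Callable[[float], float]


class ToyHookedModel:
    """Hypothetical toy: latent h_t = 0.5 * h_{t-1} + o_t at hook point "state"."""

    def _run(
        self,
        obs: List[float],
        hooks: Optional[Dict[HookKey, Hook]] = None,
        cache: Optional[Dict[HookKey, float]] = None,
    ) -> float:
        hooks = hooks or {}
        h = 0.0
        for t, o in enumerate(obs):
            h = 0.5 * h + o
            if ("state", t) in hooks:
                h = hooks[("state", t)](h)  # a hook may overwrite the activation
            if cache is not None:
                cache[("state", t)] = h  # record the (possibly patched) activation
        return h  # model output: the final latent

    def run_with_cache(self, obs: List[float]) -> Tuple[float, Dict[HookKey, float]]:
        cache: Dict[HookKey, float] = {}
        return self._run(obs, cache=cache), cache

    def run_with_hooks(self, obs: List[float], hooks: Dict[HookKey, Hook]) -> float:
        return self._run(obs, hooks=hooks)


def activation_patch(model, clean_obs, corrupted_obs, name: str, step: int) -> float:
    """Restore the clean activation at (name, step) inside the corrupted run."""
    _, cache = model.run_with_cache(clean_obs)  # record clean activations
    v = cache[(name, step)]  # clean activation at (n, t)
    patched = model.run_with_hooks(corrupted_obs, {(name, step): lambda _x: v})
    corrupted = model.run_with_hooks(corrupted_obs, {})  # unpatched reference run
    return patched - corrupted  # effect of restoring (n, t) on the output


effect = activation_patch(
    ToyHookedModel(), [1.0, 0.0, 2.0], [0.0, 0.0, 2.0], "state", 0
)
```

Because `activation_patch` touches the model only through the two run methods and the cache, the same function would apply to any object that offers them, which is the portability argument the algorithm is making.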

## 5 Evaluation

We evaluate three claims: that the interface covers world-model families through capability typing (Section [5.1](https://arxiv.org/html/2606.09936#S5.SS1)); that analyses written once recover meaningful structure on a real model (Section [5.2](https://arxiv.org/html/2606.09936#S5.SS2)); and that the hook layer is cheap enough for always-on use (Section [5.3](https://arxiv.org/html/2606.09936#S5.SS3)).

### 5.1 Coverage Through Capability Typing

Table [1](https://arxiv.org/html/2606.09936#S5.T1) lists the families currently implemented and the optional heads each declares through its capability descriptor. The required core is identical across every family, and the families differ only in which optional heads they expose, which is exactly the variation the capability typing is designed to absorb. A reinforcement-learning model such as DreamerV3 populates the full set of heads, a planning model such as PlaNet exposes a decoder and a reward head but no actor, and a self-supervised model such as I-JEPA exposes none. Integrating a new backend therefore reduces to implementing the four required methods and declaring capabilities, after which the entire analysis library, including probing, activation patching, sparse autoencoders, and surprise analysis, applies without modification. No analysis in the library contains architecture-specific code.

Table 1: Implemented backends and the capabilities each declares. The required core (encode, transition, initial state, sample) is identical across every family; the families differ only in which optional heads they expose. The full analysis suite applies to all of them without modification.
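
To make the idea of a capability descriptor concrete, the following is a minimal sketch of how such an interface could look in Python. The class names `Capabilities`, `WorldModelAdapter`, `ToySelfSupervisedAdapter`, and `ToyRLAdapter`, the method signatures, and the helper `reward_trace` are all hypothetical; only the four required method names and the five optional head names are taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Protocol


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical descriptor: which optional heads an adapter exposes."""
    decode: bool = False
    reward: bool = False
    cont: bool = False  # the "continue" head; `continue` is a Python keyword
    actor: bool = False
    critic: bool = False


class WorldModelAdapter(Protocol):
    """Required core named in the paper; the signatures are assumptions."""
    capabilities: Capabilities

    def encode(self, obs: Any) -> Any: ...
    def transition(self, state: Any, action: Any) -> Any: ...
    def initial_state(self) -> Any: ...
    def sample(self, state: Any) -> Any: ...


class ToySelfSupervisedAdapter:
    """Toy adapter with the required core only and no optional heads."""
    capabilities = Capabilities()

    def encode(self, obs): return float(obs)
    def transition(self, state, action): return 0.9 * state
    def initial_state(self): return 0.0
    def sample(self, state): return state


class ToyRLAdapter(ToySelfSupervisedAdapter):
    """Toy adapter that additionally declares and implements a reward head."""
    capabilities = Capabilities(reward=True)

    def reward(self, state): return 2.0 * state


def reward_trace(adapter: WorldModelAdapter, observations) -> Optional[List[float]]:
    """An analysis needing the reward head consults the descriptor, not the class."""
    if not adapter.capabilities.reward:
        return None  # skipped for adapters that do not declare the head
    return [adapter.reward(adapter.encode(o)) for o in observations]


ssl_result = reward_trace(ToySelfSupervisedAdapter(), [1, 2])
rl_result = reward_trace(ToyRLAdapter(), [1, 2])
```

In this sketch the analysis branches on a declared capability rather than on the model type, so adding a backend does not require editing the analysis.
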
### 5.2 Case Study: Layer-Resolved Structure in I-JEPA

To show that the substrate recovers nontrivial structure, we apply the unmodified analysis suite to the predictor of an I-JEPA model, itself a vision transformer \[[57](https://arxiv.org/html/2606.09936#bib.bib57)\], and ask, at each layer, whether the context patches a head attends to are the patches that causally determine its prediction. Context-patch importance is computed two ways through the same hook interface. Integrated gradients \[[24](https://arxiv.org/html/2606.09936#bib.bib24)\] integrate the prediction gradient along a path from a mean-embedding baseline to the true patch embeddings over fifty steps, scoring each context patch by its contribution to the predictor reconstruction loss measured against the exponential-moving-average target encoder. Attention importance reads the predictor self-attention weights directly from the cached `attn.hook_pattern` activation, taking the row of the masked target query over the context keys, per head and averaged over heads. An evaluator compares the two rankings with the top-$k$ Jaccard overlap $J_k = \lvert \mathcal{T}_k^{\mathrm{attn}} \cap \mathcal{T}_k^{\mathrm{IG}} \rvert \big/ \lvert \mathcal{T}_k^{\mathrm{attn}} \cup \mathcal{T}_k^{\mathrm{IG}} \rvert$ over the top-$k$ patch sets and the Spearman rank correlation $\rho$ between the two importance vectors, reporting $95\%$ confidence intervals over the evaluation set. Figure [4](https://arxiv.org/html/2606.09936#S5.F4) reports the Spearman correlation across predictor layers. Agreement is weak throughout and is statistically indistinguishable from zero in three of the four layers, with only one layer showing modest positive alignment.

The most-attended context patches are therefore frequently not the most causally relevant ones, which is consistent with the broader caution that attention weight is an unreliable explanation \[[25](https://arxiv.org/html/2606.09936#bib.bib25)\]. The contribution here is methodological. The comparison runs entirely through the cache and hook interface with no model-specific code, and the same procedure transfers to any adapter that exposes attention. We develop the mechanistic account of the effect in companion work.
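
The two agreement measures used in this comparison are simple to state in code. The sketch below computes the top-$k$ Jaccard overlap and the Spearman rank correlation for a pair of importance vectors in plain Python; the function names and the example vectors `attn` and `ig` are invented for illustration and are not results from the paper.

```python
from typing import List, Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring patches."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard_at_k(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """|T_k(a) ∩ T_k(b)| / |T_k(a) ∪ T_k(b)| over the two top-k index sets."""
    ta, tb = top_k(a, k), top_k(b, k)
    return len(ta & tb) / len(ta | tb)


def ranks(scores: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            result[order[m]] = average
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Invented importance scores for four context patches.
attn = [0.40, 0.30, 0.20, 0.10]
ig = [0.10, 0.50, 0.05, 0.35]
overlap = jaccard_at_k(attn, ig, k=2)
rho = spearman(attn, ig)
```

In this invented example the two rankings share one of their top-2 patches, so the overlap is 1/3, while the rank correlation is zero.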

[Figure 4 plot: Spearman ($\rho$) on the y-axis (ticks from $-0.2$ to $0.2$) against predictor layer (0 to 3) on the x-axis.]

Figure 4: Attribution and attention agree only weakly inside the I-JEPA predictor. Spearman rank correlation between integrated-gradients patch attribution and attention weight, per layer (mean over the evaluation set, error bars are one standard error). Layer 2 shows modest positive alignment, while layers 0, 1, and 3 are indistinguishable from zero or negative, indicating that high-attention patches are not reliably the causally important ones.

### 5.3 Overhead of the Hook Layer

For interpretability telemetry to run inside a training or control loop, the hook layer must be cheap when inactive. We measure per-step latency of the I-JEPA adapter in three conditions: the bare adapter with no instrumentation; the adapter wrapped by `HookedWorldModel` with hook points mounted but no hooks installed; and a full `run_with_cache` that records every activation. Table [2](https://arxiv.org/html/2606.09936#S5.T2) reports the results. Mounting the hook layer adds approximately 12% to per-step latency, which is acceptable for always-on use, whereas exhaustive caching is roughly seven times slower and is intended for offline analysis rather than continuous telemetry. *Reported numbers are to be finalized with the evaluation hardware and the number of steps averaged, with mean and standard deviation.*

Table 2: Per-step latency of the hook layer on the I-JEPA adapter. Mounting hook points is cheap; exhaustive caching is reserved for offline analysis. Hardware and averaging details to be finalized.
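
A per-step latency comparison of this kind can be sketched with a small timing harness. The code below is an assumed setup, not the authors' benchmark: `bare_step` and `mounted_step` are toy stand-ins for two of the measured conditions, and the warm-up count, step count, and helper names are arbitrary choices.

```python
import time
from statistics import mean, stdev
from typing import Callable, Dict, List


def per_step_latency(
    step: Callable[[], object], n_steps: int = 200, warmup: int = 20
) -> Dict[str, float]:
    """Mean and standard deviation of wall-clock seconds per call to `step`."""
    for _ in range(warmup):  # discard early calls affected by cold caches
        step()
    samples: List[float] = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    return {"mean": mean(samples), "std": stdev(samples)}


def relative_overhead(baseline_mean: float, instrumented_mean: float) -> float:
    """Fractional slowdown of the instrumented condition over the baseline."""
    return instrumented_mean / baseline_mean - 1.0


# Toy stand-ins for two conditions; these are not real adapters.
def bare_step():
    return sum(i * i for i in range(500))


def mounted_step(hooks=()):
    total = sum(i * i for i in range(500))
    for hook in hooks:  # hook points mounted, none installed: loop is empty
        total = hook(total)
    return total


results = {
    "bare": per_step_latency(bare_step),
    "mounted": per_step_latency(mounted_step),
}
overhead = relative_overhead(results["bare"]["mean"], results["mounted"]["mean"])
```

For a GPU-backed adapter the device would also need to be synchronized before each clock read, since kernel launches are asynchronous and the timer would otherwise measure only launch cost.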

## 6 Related Work

Mechanistic interpretability tooling. TransformerLens \[[12](https://arxiv.org/html/2606.09936#bib.bib12)\] established the hook-and-cache abstraction for language models and underpins a large body of circuit-level analysis \[[13](https://arxiv.org/html/2606.09936#bib.bib13), [14](https://arxiv.org/html/2606.09936#bib.bib14), [45](https://arxiv.org/html/2606.09936#bib.bib45)\]. The methods that run on such tooling, including activation and path patching \[[15](https://arxiv.org/html/2606.09936#bib.bib15), [16](https://arxiv.org/html/2606.09936#bib.bib16)\], attribution patching \[[46](https://arxiv.org/html/2606.09936#bib.bib46)\], causal scrubbing \[[47](https://arxiv.org/html/2606.09936#bib.bib47)\], and the logit lens \[[48](https://arxiv.org/html/2606.09936#bib.bib48)\], share the read-and-edit interface we adopt. NNsight \[[22](https://arxiv.org/html/2606.09936#bib.bib22)\] exposes model internals for remote and local intervention, Captum \[[21](https://arxiv.org/html/2606.09936#bib.bib21)\] provides attribution primitives, and Penzai \[[23](https://arxiv.org/html/2606.09936#bib.bib23)\] offers structural visualization and editing. These target generic networks or transformer language models and do not represent world-model semantics. WorldModelLens centers the abstraction on the world-model interface itself, including actions, dynamics rollouts, imagination, and intervention replay, and treats reinforcement-learning and self-supervised models uniformly through capability typing.

World models. The families we support span recurrent latent dynamics \[[1](https://arxiv.org/html/2606.09936#bib.bib1), [2](https://arxiv.org/html/2606.09936#bib.bib2), [3](https://arxiv.org/html/2606.09936#bib.bib3), [4](https://arxiv.org/html/2606.09936#bib.bib4), [5](https://arxiv.org/html/2606.09936#bib.bib5)\], search-based latent models \[[33](https://arxiv.org/html/2606.09936#bib.bib33), [9](https://arxiv.org/html/2606.09936#bib.bib9)\], transformer-token and sequence models \[[6](https://arxiv.org/html/2606.09936#bib.bib6), [7](https://arxiv.org/html/2606.09936#bib.bib7), [34](https://arxiv.org/html/2606.09936#bib.bib34), [35](https://arxiv.org/html/2606.09936#bib.bib35)\], and joint-embedding predictive architectures grounded in self-supervised representation learning \[[61](https://arxiv.org/html/2606.09936#bib.bib61), [8](https://arxiv.org/html/2606.09936#bib.bib8), [10](https://arxiv.org/html/2606.09936#bib.bib10), [36](https://arxiv.org/html/2606.09936#bib.bib36), [37](https://arxiv.org/html/2606.09936#bib.bib37), [38](https://arxiv.org/html/2606.09936#bib.bib38), [39](https://arxiv.org/html/2606.09936#bib.bib39)\]. World foundation models extend the same ideas to video at scale \[[11](https://arxiv.org/html/2606.09936#bib.bib11)\]. These are the targets of our analysis rather than competing tools, and no shared interpretability substrate previously spanned them.

Interpretability methods. The analyses we make portable are drawn from work on probing \[[19](https://arxiv.org/html/2606.09936#bib.bib19), [20](https://arxiv.org/html/2606.09936#bib.bib20), [54](https://arxiv.org/html/2606.09936#bib.bib54), [62](https://arxiv.org/html/2606.09936#bib.bib62)\], activation patching and causal localization \[[15](https://arxiv.org/html/2606.09936#bib.bib15), [14](https://arxiv.org/html/2606.09936#bib.bib14), [16](https://arxiv.org/html/2606.09936#bib.bib16), [66](https://arxiv.org/html/2606.09936#bib.bib66)\], sparse autoencoders and dictionary learning \[[42](https://arxiv.org/html/2606.09936#bib.bib42), [17](https://arxiv.org/html/2606.09936#bib.bib17), [18](https://arxiv.org/html/2606.09936#bib.bib18), [43](https://arxiv.org/html/2606.09936#bib.bib43), [44](https://arxiv.org/html/2606.09936#bib.bib44)\], gradient and perturbation attribution \[[24](https://arxiv.org/html/2606.09936#bib.bib24), [49](https://arxiv.org/html/2606.09936#bib.bib49), [50](https://arxiv.org/html/2606.09936#bib.bib50), [51](https://arxiv.org/html/2606.09936#bib.bib51), [52](https://arxiv.org/html/2606.09936#bib.bib52)\], representational similarity \[[26](https://arxiv.org/html/2606.09936#bib.bib26), [53](https://arxiv.org/html/2606.09936#bib.bib53)\], disentanglement \[[27](https://arxiv.org/html/2606.09936#bib.bib27), [28](https://arxiv.org/html/2606.09936#bib.bib28), [29](https://arxiv.org/html/2606.09936#bib.bib29), [30](https://arxiv.org/html/2606.09936#bib.bib30)\], and out-of-distribution detection \[[55](https://arxiv.org/html/2606.09936#bib.bib55)\]. Our contribution is not these methods but a substrate that lets a single implementation of each apply across world-model architectures.

## 7 Limitations and Roadmap

WorldModelLens is alpha software, and several limitations bound the present claims. Our deep empirical demonstration is on a single family, I-JEPA; the case for cross-architecture portability rests on the shared interface and the implemented adapter coverage rather than on completed cross-family studies, which are in progress for Dreamer at full scale, V-JEPA \[[10](https://arxiv.org/html/2606.09936#bib.bib10)\], and Cosmos \[[11](https://arxiv.org/html/2606.09936#bib.bib11)\]. Some adapters provide faithful re-implementations of an architecture rather than loading the original published checkpoints, and we mark in the documentation which adapters load released weights. The overhead measurements in Section [5.3](https://arxiv.org/html/2606.09936#S5.SS3) are to be finalized with full hardware details and variance across runs. We see closing these gaps, in particular reporting the same analysis across multiple loaded checkpoints, as the primary path from this introduction to a full cross-architecture study.

## 8 Conclusion

We argued that the fragmentation of world\-model interpretability is a property of the tooling rather than of the models, and that a small capability\-typed interface captures the structure shared across latent\-recurrent, transformer\-token, and joint\-embedding world models\. WorldModelLens realizes this interface and mounts a single hook, cache, rollout, and intervention\-replay layer over it, so that probing, patching, sparse autoencoders, and surprise analysis are written once and apply to every backend\. We demonstrated the substrate end\-to\-end on I\-JEPA, reported adapter coverage across six families, and showed that the hook layer is cheap enough for always\-on telemetry\. We hope the substrate lowers the cost of interpretability research on world models and makes findings comparable across the architectures that the field is rapidly producing\.

## References

- \[1\] D. Ha and J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In *NeurIPS*, 2018.
- \[2\] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning Latent Dynamics for Planning from Pixels. In *ICML*, 2019.
- \[3\] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. In *ICLR*, 2020.
- \[4\] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with Discrete World Models. In *ICLR*, 2021.
- \[5\] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Domains through World Models. *arXiv:2301.04104*, 2023.
- \[6\] V. Micheli, E. Alonso, and F. Fleuret. Transformers are Sample-Efficient World Models. In *ICLR*, 2023.
- \[7\] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. In *NeurIPS*, 2021.
- \[8\] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. In *CVPR*, 2023.
- \[9\] N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In *ICLR*, 2024.
- \[10\] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas. V-JEPA: Latent Video Prediction for Visual Representation Learning. *arXiv:2404.08471*, 2024.
- \[11\] NVIDIA. Cosmos World Foundation Model Platform for Physical AI. *arXiv:2501.03575*, 2025.
- \[12\] N. Nanda and J. Bloom. TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models. [https://github.com/TransformerLensOrg/TransformerLens](https://github.com/TransformerLensOrg/TransformerLens), 2022.
- \[13\] N. Elhage, N. Nanda, C. Olsson, T. Henighan, et al. A Mathematical Framework for Transformer Circuits. *Transformer Circuits Thread*, 2021.
- \[14\] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In *ICLR*, 2023.
- \[15\] K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and Editing Factual Associations in GPT. In *NeurIPS*, 2022.
- \[16\] N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora. Localizing Model Behavior with Path Patching. *arXiv:2304.05969*, 2023.
- \[17\] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In *ICLR*, 2024.
- \[18\] T. Bricken, A. Templeton, J. Batson, et al. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. *Transformer Circuits Thread*, 2023.
- \[19\] G. Alain and Y. Bengio. Understanding Intermediate Layers Using Linear Classifier Probes. *arXiv:1610.01644*, 2016.
- \[20\] Y. Belinkov. Probing Classifiers: Promises, Shortcomings, and Advances. *Computational Linguistics*, 48(1), 2022.
- \[21\] N. Kokhlikyan, V. Miglani, M. Martin, et al. Captum: A Unified and Generic Model Interpretability Library for PyTorch. *arXiv:2009.07896*, 2020.
- \[22\] J. Fiotto-Kaufman, A. R. Loftus, E. Todd, et al. NNsight and NDIF: Democratizing Access to Foundation Model Internals. *arXiv:2407.14561*, 2024.
- \[23\] D. D. Johnson. Penzai and Treescope: Tools for Visualizing and Manipulating Neural Networks. [https://github.com/google-deepmind/penzai](https://github.com/google-deepmind/penzai), 2024.
- \[24\] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic Attribution for Deep Networks. In *ICML*, 2017.
- \[25\] S. Jain and B. C. Wallace. Attention is not Explanation. In *NAACL*, 2019.
- \[26\] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of Neural Network Representations Revisited. In *ICML*, 2019.
- \[27\] R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud. Isolating Sources of Disentanglement in Variational Autoencoders. In *NeurIPS*, 2018.
- \[28\] C. Eastwood and C. K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. In *ICLR*, 2018.
- \[29\] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In *ICLR*, 2018.
- \[30\] I. Higgins, L. Matthey, A. Pal, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In *ICLR*, 2017.
- \[31\] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In *ICLR*, 2017.
- \[32\] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In *ICLR*, 2017.
- \[33\] J. Schrittwieser, I. Antonoglou, T. Hubert, et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. *Nature*, 588, 2020.
- \[34\] M. Janner, Q. Li, and S. Levine. Offline Reinforcement Learning as One Big Sequence Modeling Problem. In *NeurIPS*, 2021.
- \[35\] J. Bruce, M. Dennis, A. Edwards, et al. Genie: Generative Interactive Environments. In *ICML*, 2024.
- \[36\] Y. LeCun. A Path Towards Autonomous Machine Intelligence. *OpenReview*, 2022.
- \[37\] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked Autoencoders Are Scalable Vision Learners. In *CVPR*, 2022.
- \[38\] J.-B. Grill, F. Strub, F. Altché, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In *NeurIPS*, 2020.
- \[39\] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In *ICML*, 2020.
- \[40\] M. Caron, H. Touvron, I. Misra, et al. Emerging Properties in Self-Supervised Vision Transformers. In *ICCV*, 2021.
- \[41\] A. Radford, J. W. Kim, C. Hallacy, et al. Learning Transferable Visual Models from Natural Language Supervision. In *ICML*, 2021.
- \[42\] B. A. Olshausen and D. J. Field. Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. *Nature*, 381, 1996.
- \[43\] L. Gao, T. Dupré la Tour, H. Tillman, et al. Scaling and Evaluating Sparse Autoencoders. *arXiv:2406.04093*, 2024.
- \[44\] S. Rajamanoharan, A. Conmy, L. Smith, et al. Improving Dictionary Learning with Gated Sparse Autoencoders. *arXiv:2404.16014*, 2024.
- \[45\] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards Automated Circuit Discovery for Mechanistic Interpretability. In *NeurIPS*, 2023.
- \[46\] N. Nanda. Attribution Patching: Activation Patching at Industrial Scale. [https://neelnanda.io/attribution-patching](https://neelnanda.io/attribution-patching), 2023.
- \[47\] L. Chan, A. Garriga-Alonso, N. Goldowsky-Dill, et al. Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses. *Alignment Forum*, 2022.
- \[48\] nostalgebraist. Interpreting GPT: The Logit Lens. *LessWrong*, 2020.
- \[49\] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. SmoothGrad: Removing Noise by Adding Noise. *arXiv:1706.03825*, 2017.
- \[50\] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In *ICCV*, 2017.
- \[51\] S. M. Lundberg and S.-I. Lee. A Unified Approach to Interpreting Model Predictions. In *NeurIPS*, 2017.
- \[52\] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. Evaluating the Visualization of What a Deep Neural Network Has Learned. *IEEE TNNLS*, 28(11), 2017.
- \[53\] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In *NeurIPS*, 2017.
- \[54\] J. Hewitt and P. Liang. Designing and Interpreting Probes with Control Tasks. In *EMNLP*, 2019.
- \[55\] K. Lee, K. Lee, H. Lee, and J. Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In *NeurIPS*, 2018.
- \[56\] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention Is All You Need. In *NeurIPS*, 2017.
- \[57\] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In *ICLR*, 2021.
- \[58\] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In *ICLR*, 2014.
- \[59\] R. S. Sutton and A. G. Barto. *Reinforcement Learning: An Introduction*. MIT Press, 2nd edition, 2018.
- \[60\] M. Oquab, T. Darcet, T. Moutakanni, et al. DINOv2: Learning Robust Visual Features without Supervision. *Transactions on Machine Learning Research*, 2024.
- \[61\] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(8), 2013.
- \[62\] K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In *ICLR*, 2023.
- \[63\] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viégas, and R. Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In *ICML*, 2018.
- \[64\] A. Geiger, H. Lu, T. Icard, and C. Potts. Causal Abstractions of Neural Networks. In *NeurIPS*, 2021.
- \[65\] A. Templeton, T. Conerly, J. Marcus, et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. *Transformer Circuits Thread*, 2024.
- \[66\] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In *NeurIPS*, 2020.