SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
Summary
SimPersona learns discrete buyer personas from raw clickstreams with a behavior-aware VQ-VAE and exposes them to LLM-based web agents as persona tokens, reaching 78% conversion-rate alignment with real buyers across 42 held-out live storefronts.
# Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
Source: [https://arxiv.org/html/2605.14205](https://arxiv.org/html/2605.14205)
Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

Shopify, Bellevue, Washington, USA
###### Abstract
Large language model (LLM)-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware vector-quantized variational autoencoder (VQ-VAE) induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant’s empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on 8.37M buyers across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8× more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
## 1 Introduction
Figure 1: SimPersona framework overview. Top-left: behavioral features and product embeddings are extracted from raw clickstreams. Top-right: a behavior-aware VQ-VAE maps each buyer to one of $K$ persona tokens. Bottom-right: two-stage SFT grounds the tokens in the LLM, first with token warm-up (backbone frozen), then with full fine-tuning. Bottom-left: evaluation on unseen storefronts across behavioral alignment, conversion alignment, and instruction following.

Simulating realistic human shopping behavior has direct commercial impact: even marginal improvements in buyer modeling translate into measurable gains in recommendation Ni and others [[15](https://arxiv.org/html/2605.14205#bib.bib7)], Shi et al. [[21](https://arxiv.org/html/2605.14205#bib.bib8)], storefront evaluation Lu et al. [[13](https://arxiv.org/html/2605.14205#bib.bib34)], or synthetic A/B testing Wang et al. [[24](https://arxiv.org/html/2605.14205#bib.bib33)]. LLM-based web agents have recently made such simulations feasible at scale, navigating live storefronts and executing complete shopping flows from search to checkout Deng et al. [[5](https://arxiv.org/html/2605.14205#bib.bib15)], Zhou et al. [[34](https://arxiv.org/html/2605.14205#bib.bib17)], Gur et al. [[8](https://arxiv.org/html/2605.14205#bib.bib16)]. Yet, while these agents master the *mechanics* of web interaction, they learn a single population-level policy that produces an “average buyer”: they know *how* to shop, but not *who* they are shopping as. That is, steering an agent to reflect a specific buyer segment, the *persona problem*, remains the core challenge, and reconstructing the *distribution* of buyer types that collectively defines a store’s traffic remains open Wang et al. [[26](https://arxiv.org/html/2605.14205#bib.bib18)], Lu et al. [[12](https://arxiv.org/html/2605.14205#bib.bib19)], Gebreegziabher et al. [[7](https://arxiv.org/html/2605.14205#bib.bib22)].
The standard approach to the persona problem is to tell the agent who to be via prompts, using memory modules Park et al. [[16](https://arxiv.org/html/2605.14205#bib.bib1)], Park and others [[17](https://arxiv.org/html/2605.14205#bib.bib2)], purchase-history summaries Shi et al. [[20](https://arxiv.org/html/2605.14205#bib.bib23), [21](https://arxiv.org/html/2605.14205#bib.bib8)], or role-playing profiles Shao and others [[19](https://arxiv.org/html/2605.14205#bib.bib24)], Chuang et al. [[3](https://arxiv.org/html/2605.14205#bib.bib25)] for steering. However, prompt-based conditioning is brittle: behavior shifts with wording Lutz and others [[14](https://arxiv.org/html/2605.14205#bib.bib26)], demographic role-playing fails to match real responses Chuang et al. [[3](https://arxiv.org/html/2605.14205#bib.bib25)], and the best prompt-based shopping agents reproduce only 11.9% of real buyer actions Lu et al. [[12](https://arxiv.org/html/2605.14205#bib.bib19)]. Text-based personas are also expensive (requiring auxiliary LLM calls and consuming context at every step) and limited in expressiveness, since hand-crafted templates cannot capture the full diversity of real behavior. Crucially, they do not solve the *distribution* problem: because personas are crafted individually rather than learned from the full population, they carry no notion of how many buyers of each type exist, yet realistic simulation demands not just individual personas but the correct *mix* of buyer types that defines a store’s traffic. Recent RL-based agents Wang et al. [[27](https://arxiv.org/html/2605.14205#bib.bib20)], Zhang et al. [[32](https://arxiv.org/html/2605.14205#bib.bib21)] face the same limitation, as their personas also derive from LLM-generated text profiles.
The user-modeling literature takes a data-driven approach, learning structured buyer representations directly from raw clickstreams Yang and others [[30](https://arxiv.org/html/2605.14205#bib.bib11)], Ni and others [[15](https://arxiv.org/html/2605.14205#bib.bib7)], Zheng and others [[33](https://arxiv.org/html/2605.14205#bib.bib9)], Hatt and Feuerriegel [[9](https://arxiv.org/html/2605.14205#bib.bib10)], Wang et al. [[25](https://arxiv.org/html/2605.14205#bib.bib12)]. These methods reveal that shopping behavior has rich latent structure that textual profiles cannot express, and in principle can capture population-level variation. Although highly expressive, they are designed for offline user modeling; their representations are used for tasks such as prediction, recommendation, and clustering, not for driving agent behavior in closed-loop environments. User modeling solves persona *discovery* but not persona *grounding*: it captures who a buyer is without teaching an agent how to act on that information.
We propose SimPersona ([Figure 1](https://arxiv.org/html/2605.14205#S1.F1)), a framework that jointly addresses persona discovery, persona grounding, and population distribution learning by representing buyer behavior as *discrete persona tokens*: compact enough to occupy a single context position, yet expressive enough to capture real behavioral structure. A behavior-aware vector-quantized variational autoencoder (VQ-VAE) van den Oord et al. [[23](https://arxiv.org/html/2605.14205#bib.bib35)] maps each buyer’s de-identified historical clickstream to one of $K$ persona tokens, each encoding a coarse but distinct behavioral profile. Because these tokens are learned end-to-end from real traffic, the distribution over token assignments naturally mirrors the true population mix of buyer types, enabling faithful reconstruction of store-level traffic patterns, something text-based methods fundamentally cannot provide. The learned tokens are added directly to the LLM vocabulary, making them compatible with token-based agents and assignable to new buyers with a single encoder forward pass. Because newly introduced persona tokens must be aligned with the pretrained model’s existing semantic and action spaces, we introduce a two-stage persona-grounding procedure that decouples learning *what* each token means from learning *how* to act on it. This prevents shortcut learning from surface cues in the prompt and produces persona embeddings that transfer across unseen stores. Training uses only a small corpus disjoint from evaluation to encourage the model to learn reusable behavioral patterns rather than memorizing sessions. At inference, persona assignment scales to millions of buyers in seconds without retraining or per-store calibration. In summary, our contributions are:
1. SimPersona, a framework that learns discrete persona tokens from raw clickstreams via a behavior-aware VQ-VAE, enabling scalable persona assignment, faithful reconstruction of buyer population distributions, and closed-loop agent simulation.
2. A two-stage persona-grounding framework that decouples *what* each token means from *how* to act on it, producing embeddings that generalize across storefronts without adaptation.
3. An open-source data pipeline for converting raw clickstreams into buyer representations and agent-training traces, providing infrastructure for buyer behavior simulation.
4. Evaluation on 8.37M unseen buyers showing 78% conversion alignment, statistically significant behavioral separation, and stronger instruction following than an 8× larger baseline.
The remainder of the paper is organized as follows. [Section 2](https://arxiv.org/html/2605.14205#S2) details the proposed framework, [Section 3](https://arxiv.org/html/2605.14205#S3) evaluates it on 8.37M buyers, and [Section 4](https://arxiv.org/html/2605.14205#S4) discusses limitations and future directions.
## 2 Proposed method
As detailed below, we build persona-conditioned shopping agents from raw clickstream data by developing a data pipeline for buyer representations and agent traces ([Section 2.1](https://arxiv.org/html/2605.14205#S2.SS1)), a behavior-aware VQ-VAE that learns discrete personas and maps them to trainable tokens ([Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2)), and a multi-stage fine-tuning procedure that grounds these tokens in agent behavior ([Section 2.3](https://arxiv.org/html/2605.14205#S2.SS3)).
### 2.1 Data pipeline
Existing buyer-simulation methods typically begin from curated inputs, including benchmark tasks, explicit shopping preferences, or simulated user profiles, rather than the raw clickstream logs available in production e-commerce systems Lu et al. [[13](https://arxiv.org/html/2605.14205#bib.bib34)], Wang et al. [[26](https://arxiv.org/html/2605.14205#bib.bib18), [24](https://arxiv.org/html/2605.14205#bib.bib33)]. In real platforms, however, buyer behavior is observed through fragmented low-level events (page views, searches, cart mutations, and checkout actions), and these logs are optimized for analytics rather than for buyer modeling or agent training. As a result, they contain rich behavioral signal, but not in a form usable by representation-learning methods or LLM agents. Converting raw clickstreams into structured buyer representations and grounded multi-turn action traces is therefore a necessary but largely underexplored step.
We address this gap with a modular pipeline that performs a single enrichment pass over raw clickstreams and produces two complementary data types ([Figure 2](https://arxiv.org/html/2605.14205#A1.F2)). It first produces compact aggregated buyer-level representations that combine behavioral statistics with semantic product context, enabling downstream tasks such as segmentation, clustering, and persona discovery. Then, it generates executable agent traces based on live storefront interactions. The traces are formatted as multi-turn sequences of DOM observations, structured actions, memory, and step-level reasoning for supervised fine-tuning (SFT). These components are detailed below in the context of persona learning (the pipeline itself is task-agnostic and open-source; the URL is anonymized for review, but source code is provided with the submission).
#### 2.1.1 Buyer behavioral representation
We begin with raw platform logs containing page views, cart actions, search activities, and other fine-grained interaction records. These records capture interactions but provide limited context about *what* the buyer was interested in or how their actions relate across a session. Hence, we enrich the logs with the product catalog, collection directory, and search-query records, producing aggregated buyer–session data that combine behavioral signals with semantic context (e.g., product titles, collection names, and search queries); see [Figure 3](https://arxiv.org/html/2605.14205#A1.F3).
For persona discovery, we compress each buyer–session history into a single vector that captures both *how* the buyer shops and *what* they shop for. To represent shopping style, we extract 16 scalar features organized into five groups: *exposure & volume* (total sessions, active days, session counts across product views, add-to-cart, checkout, search, and collection browsing), *engagement* (total session duration and total product views), *funnel* (add-to-cart, checkout, and browse-only rates), *intent strength*, and *dollar values* (average cart and order value); details are provided in Appendix [A](https://arxiv.org/html/2605.14205#A1). Product preference is captured by averaging the 768-dimensional product embeddings over each buyer’s viewed, carted, and purchased items, then reducing each to 128 dimensions via PCA (retaining ~85% of the variance) so that the three semantic channels do not overwhelm the behavioral scalars. Three binary masks indicate which embedding channels are observed. The resulting 403-dimensional vector (16 scalars, 3 × 128 embedding dimensions, and 3 masks) provides a compact buyer representation for each buyer–shop pair (see [Figure 4](https://arxiv.org/html/2605.14205#A1.F4)).
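The layout of the 403-dimensional vector can be sketched as follows. This is a minimal illustration, not the released pipeline: the function name, the channel ordering, and the zero-fill for unobserved channels are assumptions, and the inputs are assumed to be per-channel vectors that have already been averaged and PCA-reduced to 128 dimensions.

```python
from typing import Dict, List, Optional, Sequence

N_SCALARS = 16   # behavioral scalars (from the text)
EMB_DIM = 128    # per-channel size after PCA (from the text)
CHANNELS = ("viewed", "carted", "purchased")  # ordering is an assumption


def build_buyer_vector(
    scalars: Sequence[float],
    channel_vectors: Dict[str, Optional[Sequence[float]]],
) -> List[float]:
    """Concatenate 16 scalars, three 128-d channel vectors, and three masks.

    A channel mapped to None (or missing) is treated as unobserved:
    it is zero-filled (an assumption) and its mask is set to 0.
    """
    if len(scalars) != N_SCALARS:
        raise ValueError(f"expected {N_SCALARS} scalars, got {len(scalars)}")
    vec = list(scalars)
    masks = []
    for channel in CHANNELS:
        emb = channel_vectors.get(channel)
        if emb is None:
            vec.extend([0.0] * EMB_DIM)
            masks.append(0.0)
        else:
            if len(emb) != EMB_DIM:
                raise ValueError(f"{channel}: expected {EMB_DIM} dims")
            vec.extend(emb)
            masks.append(1.0)
    vec.extend(masks)  # 16 + 3 * 128 + 3 = 403
    return vec
```

A buyer who only viewed products would pass `{"viewed": [...], "carted": None, "purchased": None}` and receive masks `[1.0, 0.0, 0.0]` in the last three positions.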
Many downstream components require a shared notion of where a buyer lies in the conversion funnel. Instead of using task-specific heuristics, we define a single *funnel stratification* from aggregated event counts. Each buyer is assigned, via a priority cascade extracted from their raw behavioral signals, to one of five strata: A (purchasers), B (checkout abandoners), C (cart builders), D (window shoppers), or E (bouncers). We reuse this label throughout the pipeline: it guides intent derivation for simulation, supports stratified sampling during training, and provides diagnostics for representation quality ([Section 3.1](https://arxiv.org/html/2605.14205#S3.SS1)).
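A priority cascade of this kind might look like the sketch below. The specific counters and the "greater than zero" thresholds are assumptions for illustration; the text only states that the assignment is a cascade over aggregated event counts.

```python
def funnel_stratum(
    purchases: int,
    checkout_starts: int,
    cart_adds: int,
    product_views: int,
) -> str:
    """Assign a funnel stratum by checking the deepest funnel event first.

    The argument names and the > 0 thresholds are hypothetical.
    """
    if purchases > 0:
        return "A"  # purchasers
    if checkout_starts > 0:
        return "B"  # checkout abandoners
    if cart_adds > 0:
        return "C"  # cart builders
    if product_views > 0:
        return "D"  # window shoppers
    return "E"      # bouncers
```

Because the checks run from deepest to shallowest, a buyer who both carted and purchased is labeled A rather than C.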
#### 2.1.2 From clickstreams to agent traces
Buyer representations capture behavioral variation but do not tell an agent how those behaviors unfold sequentially in a storefront. To address this issue, we replay real buyer sessions on live storefronts. From the enriched session tables generated in the previous step, we reconstruct each buyer’s timestamped event sequence and use an LLM (GPT-5) to rewrite it as a natural-language navigation goal that preserves the structure of the original session. A separate agent (Gemini 3 Flash) then executes this goal on the live stores, interacting with real pages and recording its trajectory in an SFT-ready format. Each generated training example consists of a shared *system prompt* that defines the agent’s role, output schema, and interaction rules, paired with a *user prompt* containing the session’s inferred intent, the buyer profile field (populated with the learned persona token), a cumulative progress log that serves as the agent’s memory for reasoning and error recovery, and the current page’s DOM snapshot. The corresponding *assistant turn* contains a structured JSON action and a reasoning trace ([Figure 5](https://arxiv.org/html/2605.14205#A1.F5)). Each example thus teaches the model to interpret page state, track progress, recover from errors, and most importantly, navigate according to the learned buyer persona. An independent LLM judge (GPT-5) filters the corpus to retain only successful trajectories.
### 2.2 Persona discovery via behavior-aware VQ-VAE
Our goal is to use raw clickstreams to learn latent buyer behavior that is usable by a language model. To this end, we build a VQ-VAE where the encoder compresses high-dimensional behavioral features and the quantization layer maps them to a finite codebook whose entries map one-to-one to new trainable tokens in an LLM’s vocabulary. Unlike distance-based clusterings such as $k$-means, our model is trained end-to-end with a behavior-aware contrastive objective and semantic auxiliary heads, encouraging codebook entries to reflect meaningful behavioral similarity rather than mere geometric proximity. This discretization also enables *distribution learning*: because the codebook is learned jointly over the full buyer population, the frequency with which buyers are assigned to each token provides a natural estimate of the latent distribution of buyer types. In Appendix [B](https://arxiv.org/html/2605.14205#A2), we show that this learned distribution recovers aggregate population-level buyer behavior.
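As a rough illustration of distribution learning, the sketch below estimates a merchant's buyer-type distribution from code assignments and samples persona tokens from it. The function names are hypothetical, the token format follows the `<|persona_k|>` naming used in Section 2.3, and the sampling scheme (independent draws with replacement) is an assumption.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def empirical_distribution(assignments: Sequence[int]) -> Dict[int, float]:
    """Relative frequency of each codebook index among a merchant's buyers."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {code: n / total for code, n in sorted(counts.items())}


def sample_personas(dist: Dict[int, float], n: int, seed: int = 0) -> List[str]:
    """Draw n persona tokens in proportion to the empirical distribution."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    codes = list(dist)
    weights = [dist[c] for c in codes]
    drawn = rng.choices(codes, weights=weights, k=n)
    return [f"<|persona_{c}|>" for c in drawn]
```

For example, a merchant whose buyers were assigned codes `[3, 3, 7, 3]` yields the distribution `{3: 0.75, 7: 0.25}`, and a simulated population is then drawn from those two tokens in roughly that ratio.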
The encoder maps the input $\mathbf{x}\in\mathbb{R}^{d_{\text{in}}}$ through Linear → ReLU → Dropout blocks to a latent vector $\mathbf{z}_e\in\mathbb{R}^{D}$, which is quantized to its nearest codebook entry $\mathbf{e}_{k^{*}}$ from a learned codebook $\{\mathbf{e}_1,\dots,\mathbf{e}_K\}\subset\mathbb{R}^{D}$:

$$k^{*}=\operatorname*{arg\,min}_{k\in\{1,\dots,K\}}\|\mathbf{z}_e-\mathbf{e}_k\|_2^2,\qquad\mathbf{z}_q=\mathbf{e}_{k^{*}},\tag{1}$$

with gradients propagated via the straight-through estimator van den Oord et al. [[23](https://arxiv.org/html/2605.14205#bib.bib35)]. The decoder mirrors the encoder and reconstructs the input from $\mathbf{z}_q$. Codebook entries are initialized with $k$-means++ Arthur and Vassilvitskii [[1](https://arxiv.org/html/2605.14205#bib.bib38)] over encoder outputs and updated via exponential moving averages (EMA). To prevent codebook collapse Razavi et al. [[18](https://arxiv.org/html/2605.14205#bib.bib39)], we track an EMA of code usage and mark entries as dead when their count falls below a fraction $\alpha$ of the mean. After an initial warmup, dead entries are reinitialized from randomly sampled active encoder outputs. Details regarding hyperparameters are provided in Appendix [C](https://arxiv.org/html/2605.14205#A3) (code is available in the data pipeline repository).
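The nearest-code lookup of Eq. (1) and the dead-code test can be sketched in plain Python. This is illustrative only: the helper names and the decay value are assumptions, and the EMA codebook update, straight-through gradient, and reinitialization step are omitted.

```python
from typing import List, Sequence, Tuple


def quantize(z_e: Sequence[float], codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (k_star, z_q): the index and vector of the nearest code (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k_star = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return k_star, list(codebook[k_star])


def update_usage(usage_ema: Sequence[float], k_star: int, decay: float = 0.99) -> List[float]:
    """EMA of one-hot code usage; the decay value is an assumed hyperparameter."""
    return [
        decay * u + (1.0 - decay) * (1.0 if k == k_star else 0.0)
        for k, u in enumerate(usage_ema)
    ]


def dead_codes(usage_ema: Sequence[float], alpha: float) -> List[int]:
    """Indices whose usage falls below a fraction alpha of the mean usage."""
    mean_usage = sum(usage_ema) / len(usage_ema)
    return [k for k, u in enumerate(usage_ema) if u < alpha * mean_usage]
```

With the codebook `[[0, 0], [1, 0], [0, 1]]`, the latent `[0.9, 0.1]` is assigned to index 1, since its squared distance (0.02) is the smallest of the three.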
Reconstruction loss does not guarantee semantically meaningful personas since similar features may correspond to buyers with different behaviors\. We therefore augment the standard VQ\-VAE objective with contrastive and auxiliary supervision to enforce semantic clustering:
$$\mathcal{L}=\lambda_r\,\mathcal{L}_{\text{recon}}+\beta\,\mathcal{L}_{\text{commit}}+\lambda_c\,\mathcal{L}_{\text{contrastive}}+\lambda_a\bigl(\mathcal{L}_{\text{engage}}+\mathcal{L}_{\text{explore}}+\mathcal{L}_{\text{purchase}}\bigr).\tag{2}$$

Our reconstruction term ($\mathcal{L}_{\text{recon}}$) is group-aware. Scalar behavioral features are reconstructed with MSE, while product-preference embeddings are reconstructed with cosine distance to preserve semantic direction rather than magnitude. Also, to prevent the high-dimensional embedding channels from dominating the loss, we compute reconstruction at the level of semantic groups rather than raw dimensions, and we mask embedding terms when the corresponding interaction type is absent (a buyer who never added to cart incurs no loss on the carted-product embedding). The commitment loss $\mathcal{L}_{\text{commit}}=\|\mathbf{z}_e-\mathrm{sg}(\mathbf{z}_q)\|_2^2$, where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator, keeps encoder outputs close to their assigned codes.
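One way to read the group-aware reconstruction term is sketched below. It is an assumption-laden simplification: the scalars are treated as a single group (the paper may use one term per scalar group), group terms are averaged with equal weight, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """1 - cosine similarity; eps guards against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def group_recon_loss(
    scalars: Sequence[float],
    scalars_hat: Sequence[float],
    embs: Sequence[Sequence[float]],
    embs_hat: Sequence[Sequence[float]],
    masks: Sequence[int],
) -> float:
    """Average one MSE term for the scalars and one cosine term per observed channel.

    Each group contributes a single term regardless of its dimensionality,
    so 128-d embedding channels cannot dominate the 16 scalars.
    """
    terms: List[float] = [mse(scalars, scalars_hat)]
    for emb, emb_hat, mask in zip(embs, embs_hat, masks):
        if mask:  # unobserved channels contribute no loss
            terms.append(cosine_distance(emb, emb_hat))
    return sum(terms) / len(terms)
```

Setting a channel's mask to 0 removes its term entirely, which mirrors the example in the text of a buyer who never added to cart.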
Reconstruction loss shapes what the codebook preserves, but provides no direct signal about *which buyers should share a code*. We propose a three-stage contrastive objective via InfoNCE van den Oord et al. [[22](https://arxiv.org/html/2605.14205#bib.bib40)] that constructs positive pairs through progressively finer gates. *Stage 1 (funnel gate)*: each buyer receives an ordinal signature from their highest interaction level (purchase > cart > view > none), and only same-signature buyers can form pairs. *Stage 2 (product filter)*: within the same-signature pool, the top-$M$ peers are retained by cosine similarity in product-embedding space, ensuring paired buyers engage with similar products. *Stage 3 (behavioral filter)*: from these candidate peers, the top-$F$ neighbors ($M>F$) are selected using Euclidean distance over exploration and engagement features (product views, search sessions, collection views, session duration), enforcing similar browsing style on top of similar funnel depth and product interest. All non-self samples remain in the InfoNCE denominator, so the model is simultaneously encouraged to pull behaviorally aligned buyers together and push apart incompatible ones. See Appendix [D](https://arxiv.org/html/2605.14205#A4) for the formulation.
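The three gates can be sketched as a positive-selection routine for a single anchor buyer. The record layout (`signature`, `product`, `behavior` keys) and function names are assumptions made for illustration; the InfoNCE loss itself is not shown.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def select_positives(anchor: int, buyers: List[Dict], M: int, F: int) -> List[int]:
    """Return indices of up to F positives for `anchor` (requires M > F)."""
    me = buyers[anchor]
    # Stage 1 (funnel gate): same ordinal signature, excluding the anchor itself.
    pool = [
        i for i, b in enumerate(buyers)
        if i != anchor and b["signature"] == me["signature"]
    ]
    # Stage 2 (product filter): keep the top-M peers by cosine similarity.
    pool.sort(key=lambda i: cosine(me["product"], buyers[i]["product"]), reverse=True)
    pool = pool[:M]
    # Stage 3 (behavioral filter): keep the top-F by Euclidean distance.
    pool.sort(key=lambda i: math.dist(me["behavior"], buyers[i]["behavior"]))
    return pool[:F]
```

Note that a peer with the most similar products can still be dropped in Stage 3 if its browsing style is far from the anchor's, which is the intended effect of stacking the filters.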
To further encourage the codebook to capture interpretable shopping behavior, we attach three auxiliary classification heads to $\mathbf{z}_q$. Each head is trained with a cross-entropy loss to predict a coarse three-level label (low, medium, or high) along a behavioral axis: *engagement depth*, *exploration breadth*, and *purchase intensity*. These auxiliary tasks provide supervised pressure for the codebook to preserve behavioral distinctions. The labels are derived from buyer-level aggregate scores, with binning tailored to each target’s distribution. Engagement is measured by total session duration and exploration by the number of search, collection-view, and product-view sessions; both are approximately log-normal, so we log-transform them and split into percentile-based terciles. Purchase intensity is defined as $8\times\texttt{checkout\_sessions}+3\times\texttt{atc\_sessions}$ (see Appendix [E](https://arxiv.org/html/2605.14205#A5)), which produces a discrete, heavily zero-inflated distribution with natural gaps between non-purchasers, light buyers, and heavy buyers. We place bin boundaries at these gaps to obtain three groups that align with distinct purchasing behaviors. To prevent the model from ignoring rare but important groups such as heavy purchasers, we apply inverse-frequency class weights $w_i=N/(B\cdot n_i)$, where $N$ is the training-set size, $B=3$, and $n_i$ is the count in bin $i$. After training, these heads also provide interpretable labels for each codebook entry, making it straightforward to inspect what each persona code represents (see Appendix [D](https://arxiv.org/html/2605.14205#A4)).
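The purchase-intensity score and the inverse-frequency weights translate directly into code. The function names below are hypothetical; the formulas are the ones stated in the text.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purchase_intensity(checkout_sessions: int, atc_sessions: int) -> int:
    """Score defined in the text: 8 * checkout_sessions + 3 * atc_sessions."""
    return 8 * checkout_sessions + 3 * atc_sessions


def inverse_frequency_weights(labels: Sequence[Hashable], num_bins: int = 3) -> Dict[Hashable, float]:
    """Class weights w_i = N / (B * n_i) for each bin present in `labels`."""
    counts = Counter(labels)
    n_total = len(labels)
    return {label: n_total / (num_bins * count) for label, count in counts.items()}
```

With 8 low, 1 medium, and 1 high label ($N=10$, $B=3$), the high bin receives weight $10/3\approx 3.33$ while the low bin receives $10/24\approx 0.42$, so errors on the rare class count roughly eight times more.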
### 2.3 Multi-stage supervised fine-tuning
The VQ-VAE produces discrete persona indices, but these indices have no inherent semantics for the language model. We therefore extend the Qwen3-14B-Base Yang et al. [[29](https://arxiv.org/html/2605.14205#bib.bib46)] tokenizer with $K$ special tokens (`<|persona_0|>`, …, `<|persona_{K-1}|>`), one for each codebook entry. This creates a central learning problem, as the model must learn to *condition its actions on the persona token* rather than on surface cues in the prompt. End-to-end fine-tuning fails at this stage: randomly initialized embeddings inject noise early in training, while the model quickly discovers that simple lexical cues in the inferred intent, such as “buy” versus “browse”, provide an easier shortcut, allowing it to ignore the persona token. To address this, we propose a two-stage training framework that explicitly separates learning *what each persona token means* from learning *how to act on it*. We ablate this choice in Appendix [F](https://arxiv.org/html/2605.14205#A6) and find that two-stage training is essential for robustness.
In Stage 1 (persona grounding), we freeze the pretrained backbone and train only the $K$ new embeddings. Each training example is a browsing trace from the data pipeline ([Section 2.1.2](https://arxiv.org/html/2605.14205#S2.SS1.SSS2) and [Figure 9(a)](https://arxiv.org/html/2605.14205#A7.F9.sf1)) in which the buyer profile field contains the persona token and the goal states an intent for the session. Crucially, all goals in this stage are *intent-neutral*: short product-interest statements such as “You are interested in skirts,” with the category inferred from the buyer’s interaction history. Because the goal never reveals whether the buyer will browse, cart, or purchase, the only signal distinguishing a high-conversion token from a low-conversion token is the statistical pattern of actions across training traces. This signal forces the embedding to absorb the behavioral meaning of its persona. At the end of Stage 1, each persona embedding has converged to a distinct, stable region of the representation space without perturbing the pretrained weights.
In Stage 2 (action-oriented fine-tuning), we unfreeze the backbone and continue training on richer, goal-directed traces. Goals now carry explicit intent derived from the buyer’s funnel stratum: sessions from strata A–C receive prompts such as “You are here to buy skirts,” while stratum-D sessions receive “You are here to browse skirts.” Since the intent is now present in both the goal *and* the persona token, the model must learn to fuse two complementary signals: the token encodes the buyer’s general behavioral profile, while the goal specifies the intent of *this* session (see [Figure 9(b)](https://arxiv.org/html/2605.14205#A7.F9.sf2)). Neither signal is redundant: for example, a high-purchase token paired with a browse goal should still produce exploratory behavior, so the model cannot collapse to either cue alone. Both stages draw from a small pool of shops fully disjoint from evaluation. Because Stage 1 anchors the embeddings via aggregate behavioral co-occurrence against a frozen backbone, the learned tokens transfer to unseen storefronts without adaptation; we confirm this in [Section 3.3](https://arxiv.org/html/2605.14205#S3.SS3) (architectural details in Appendix [G](https://arxiv.org/html/2605.14205#A7)).
## 3 Discussion and results
In this section, we evaluate our method along four axes: persona clustering quality ([Section 3.1](https://arxiv.org/html/2605.14205#S3.SS1)), conversion alignment with real buyers ([Section 3.2](https://arxiv.org/html/2605.14205#S3.SS2)), fine-grained behavioral fidelity ([Section 3.3](https://arxiv.org/html/2605.14205#S3.SS3)), and instruction-following performance against GPT-OSS-120B ([Table 4](https://arxiv.org/html/2605.14205#S3.T4)).
### 3.1 Does our clustering learn behaviorally meaningful personas?
We train the VQ-VAE on a balanced subset of 39 shops with sufficient support across strata A–E ([Section 2.1](https://arxiv.org/html/2605.14205#S2.SS1)). For each shop, we sample up to 1,500 buyer-shop pairs, capped at 300 per stratum, to avoid over-representing either large shops or dominant buyer types. After removing stratum E (bouncers), which contains limited behavioral signal, the resulting dataset contains 44,559 buyer-shop pairs, split 85/15 into training and validation sets. The model takes as input 16 behavioral scalars together with three product embeddings, as explained in Appendix [A](https://arxiv.org/html/2605.14205#A1) and [Section 2.1](https://arxiv.org/html/2605.14205#S2.SS1).
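The per-shop sampling rule can be sketched as follows. The function name, the uniform sampling within each stratum, and the fixed seed are assumptions; note that with five strata capped at 300 each, the 1,500-per-shop limit is satisfied automatically.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def sample_shop(
    pairs: Sequence[Tuple[str, str]],
    per_stratum_cap: int = 300,
    drop: Tuple[str, ...] = ("E",),
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample at most `per_stratum_cap` buyers per stratum for one shop.

    `pairs` holds (buyer_id, stratum) tuples; strata listed in `drop`
    (bouncers by default) are excluded from the result.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for buyer_id, stratum in pairs:
        if stratum not in drop:
            by_stratum[stratum].append(buyer_id)
    sampled: List[Tuple[str, str]] = []
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        chosen = rng.sample(ids, min(per_stratum_cap, len(ids)))
        sampled.extend((buyer_id, stratum) for buyer_id in chosen)
    return sampled
```

A shop with 1,000 purchasers, 10 checkout abandoners, and 500 bouncers would contribute 310 pairs under this rule: 300 from stratum A, all 10 from stratum B, and none from stratum E.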
We evaluate the learned VQ-VAE codebook by comparing it against MiniBatch $k$-means trained on the same buyer representations with $K=256$ clusters. We assess four complementary aspects of cluster quality. *Stratum purity* measures whether each cluster respects the funnel strata from [Section 2.1](https://arxiv.org/html/2605.14205#S2.SS1); we flag clusters with *incompatible mixing*, where high-funnel buyers (strata A–C) and window shoppers (stratum D) co-occur in substantial proportions. *Auxiliary-head coherence* measures behavioral consistency along the three supervised axes from [Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2) by reporting the percentage of buyers assigned to codes that span non-adjacent bins (e.g., low and high). *Pairwise cosine similarity* quantifies how alike buyers within the same cluster are in the original feature space. Finally, the *Calinski–Harabasz index* Caliński and Harabasz [[2](https://arxiv.org/html/2605.14205#bib.bib45)] summarizes the ratio of between-cluster to within-cluster dispersion; higher values indicate tighter and better separated clusters (see [Appendix H](https://arxiv.org/html/2605.14205#A8)).
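Two of these diagnostics can be sketched in a few lines. The exact definitions are assumptions: purity is taken here as the majority-stratum share per cluster, averaged over clusters without weighting, and a code counts as incoherent as soon as it contains at least one low and one high buyer (the paper may apply a proportion threshold instead).

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def mean_purity(codes: Sequence[Hashable], strata: Sequence[str]) -> float:
    """Unweighted mean over clusters of the majority-stratum share."""
    groups = defaultdict(list)
    for code, stratum in zip(codes, strata):
        groups[code].append(stratum)
    purities = [
        Counter(members).most_common(1)[0][1] / len(members)
        for members in groups.values()
    ]
    return sum(purities) / len(purities)


def incoherent_fraction(codes: Sequence[Hashable], bins: Sequence[str]) -> float:
    """Share of buyers in codes that contain both 'low' and 'high' bins."""
    seen = defaultdict(set)
    for code, label in zip(codes, bins):
        seen[code].add(label)
    bad = {code for code, labels in seen.items() if {"low", "high"} <= labels}
    return sum(code in bad for code in codes) / len(codes)
```

With three levels, low and high are the only non-adjacent pair, so a code holding only low and medium buyers (or only medium and high) is treated as coherent.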
Table 1: VQ-VAE vs. MiniBatch $k$-means ($K=256$). Column groups: stratum purity (mean purity, incompatible mixing), coherence (% of buyers assigned to clusters spanning non-adjacent bins, ↓), and separation (pairwise cosine, Calinski–Harabasz).

| Method | Mean purity | Incompat. mixing | Engage (% incoh. ↓) | Explore (% incoh. ↓) | Purchase (% incoh. ↓) | PW cosine | Calinski–Harabasz |
|---|---|---|---|---|---|---|---|
| $k$-means | 78.9% | 6 | 66.7% | 4.6% | 5.4% | 0.722 | 173.8 |
| VQ-VAE | 84.5% | 0 | 0.5% | 0.0% | 0.0% | 0.774 | 206.5 |

Based on the results summarized in [Table 1](https://arxiv.org/html/2605.14205#S3.T1), the most striking difference is in behavioral semantics. Because $k$-means has no notion of what the features *mean*, it simply minimizes Euclidean distance and freely merges buyers at opposite ends of a behavioral axis whenever doing so reduces geometric cost. This is visible in the auxiliary-head coherence: 66.7% of buyers fall in engagement-incoherent codes under $k$-means, meaning low-engagement and high-engagement buyers are routinely assigned to the same cluster. VQ-VAE, by contrast, keeps incoherence at or below 0.5% on all three heads. The same pattern appears in stratum purity (84.5% vs. 78.9%) and incompatible mixing (0 vs. 6 severely mixed codes): without semantic supervision, $k$-means conflates fundamentally different buyer types. On the separation metrics (pairwise cosine similarity and CH index), VQ-VAE also leads, though by a modest margin, because $k$-means achieves strong tightness on the product-embedding dimensions, which dominate the concatenation by sheer count. But this geometric tightness is hollow, as it comes at the cost of the behavioral coherence that matters for downstream persona conditioning.
### 3.2 Do persona-conditioned agents match humans’ conversion rates?
Fine-tuning uses the same 39 training shops but only a small subset of approximately 3,600 sessions. We deliberately downsample this set to ensure that every codebook entry is represented while keeping the corpus small enough to discourage memorization. At inference time, we evaluate on the full buyer population, spanning 8.37M unique buyers across 42 shops, none of which overlap with the training shops. This setup provides a direct test of whether the learned persona tokens generalize to unseen storefronts. Each agent is deployed on the *same live* storefront visited by its real counterpart and is conditioned on two inputs: a persona token specifying the buyer profile, and a shopping intent (same as Stage 2 in [Section 2.3](https://arxiv.org/html/2605.14205#S2.SS3)). The agent then navigates the storefront autonomously, and we ask whether its conversion behavior matches that of the real buyers represented by the same persona.
Answering this question requires a metric beyond standard task\-completion benchmarksZhouet al\.\[[34](https://arxiv.org/html/2605.14205#bib.bib17)\], Denget al\.\[[5](https://arxiv.org/html/2605.14205#bib.bib15)\], Yaoet al\.\[[31](https://arxiv.org/html/2605.14205#bib.bib29)\], where success is binary\. For buyer simulation, what matters is not whether the agent*can*convert, but whether it converts at the*same rate*as the real buyers it represents\. We therefore define:
$$
\begin{aligned}
\text{ATC alignment} &= 1 - \left|\text{ATC}_{\text{real}} - \text{ATC}_{\text{agent}}\right|,\\
\text{Purchase alignment} &= 1 - \left|\text{PUR}_{\text{real}} - \text{PUR}_{\text{agent}}\right|,\\
\text{Action Rate Alignment (ARA)} &= 0.5 \times \text{ATC alignment} + 0.5 \times \text{Purchase alignment},
\end{aligned}
$$

where ATC and PUR are the add-to-cart and checkout fractions per $(\text{shop}, \text{token})$ pair, stratified by funnel stratum to prevent dominant non-converting populations from washing out rare but important buyer types. To verify that alignment is driven by the persona token itself and not by shop-level tendencies, we compare *correct* pairings, in which an agent conditioned on token $a$ is evaluated against buyers assigned to token $a$, with two mismatch baselines: *all-mismatch*, where the same agent is compared against buyers of every other token $b \neq a$ on that shop and averaged, and *random-mismatch*, where the agent is compared against buyers of one randomly chosen incorrect token. If persona tokens carry meaningful signal, correct pairings should consistently outperform both baselines (more ablations and analysis are provided in Appendix [I](https://arxiv.org/html/2605.14205#A9)).
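The alignment definitions above can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and all rates below are hypothetical and chosen only for illustration; the stratification by funnel stratum is omitted.

```python
def alignment(real_rate, agent_rate):
    """Alignment between two rates in [0, 1]: 1 - |real - agent|."""
    return 1.0 - abs(real_rate - agent_rate)


def ara(atc_real, atc_agent, pur_real, pur_agent):
    """Action Rate Alignment: equal-weight mean of ATC and purchase alignment."""
    return 0.5 * alignment(atc_real, atc_agent) + 0.5 * alignment(pur_real, pur_agent)


def all_mismatch_ara(agent, real_by_token, token):
    """Average ARA of one agent against the real buyers of every *other* token."""
    others = [r for t, r in real_by_token.items() if t != token]
    return sum(ara(r["atc"], agent["atc"], r["pur"], agent["pur"]) for r in others) / len(others)


# Hypothetical rates for one shop; none of these numbers come from the paper.
real_by_token = {
    "a": {"atc": 0.30, "pur": 0.10},
    "b": {"atc": 0.02, "pur": 0.00},
    "c": {"atc": 0.70, "pur": 0.40},
}
agent_a = {"atc": 0.25, "pur": 0.14}  # simulated agent conditioned on token "a"

correct = ara(real_by_token["a"]["atc"], agent_a["atc"],
              real_by_token["a"]["pur"], agent_a["pur"])
mismatch = all_mismatch_ara(agent_a, real_by_token, "a")
print(round(correct, 3), round(mismatch, 3))
```

In this toy setting the correct pairing scores higher than the all-mismatch average, which is the pattern the comparison is designed to detect.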
Table 2: Conversion alignment by funnel stratum. Correct: matched token. All-Mis: averaged over all wrong tokens. 1-R: one random wrong token.

| Stratum | ATC Correct | ATC All-Mis | ATC 1-R | Purchase Correct | Purchase All-Mis | Purchase 1-R | ARA Correct | ARA All-Mis | ARA 1-R |
|---|---|---|---|---|---|---|---|---|---|
| A | .706 | .496 | .522 | .742 | .516 | .537 | .724 | .506 | .530 |
| B | .736 | .465 | .429 | .760 | .598 | .544 | .748 | .532 | .487 |
| C | .767 | .438 | .442 | .674 | .755 | .720 | .721 | .597 | .581 |
| D | .942 | .653 | .644 | .965 | .792 | .769 | .954 | .723 | .707 |
| Stratified | .788 | .513 | .509 | .785 | .665 | .643 | .787 | .589 | .576 |

[Table 2](https://arxiv.org/html/2605.14205#S3.T2) breaks down alignment by funnel stratum. Stratum D is the easiest to match, since these buyers rarely add to cart or proceed to checkout; the target behavior is largely to browse without converting, and the agent reaches 95.4% ARA. The task becomes harder deeper in the funnel. Stratum C requires the agent to add to cart at the right rate *without* proceeding to checkout, while strata B and A demand even finer control over when to cart, when to continue through checkout, and when to complete a purchase. These deeper-funnel groups are also more behaviorally diverse, yet the agent still achieves strong alignment. Critically, correct pairings outperform the all-mismatch baseline on ARA in every stratum: by 21–24 pp in strata A, B, and D, and by about 12 pp in stratum C, the one stratum where the mismatch baselines score higher on purchase alignment. The widest gap is on add-to-cart alignment, a discretionary, persona-specific action with high variance across buyer types. The stratified ARA of 78.7%, versus 58.9% under all-mismatch and 57.6% under random-mismatch, confirms that a single discrete token meaningfully steers conversion behavior to match the real buyer population, not just at the easy end of the funnel but across all levels of behavioral complexity.
### 3.3 Do persona-conditioned agents preserve behavioral dimensions?
In this section, we ask a finer-grained question: do the behavioral dimensions learned by the auxiliary heads also manifest in the agent’s simulated behavior? As explained in [Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2), each persona token encodes a Low/Medium/High bin on three behavioral axes. If these latent distinctions are preserved end-to-end, tokens in higher bins should produce systematically different simulated behavior than those in lower bins. We test this by grouping tokens according to their auxiliary-head bin and comparing simulated behavioral scores across groups using four complementary statistics: Cohen’s $d$ for the standardized Low-vs-High effect size (Cohen \[[4](https://arxiv.org/html/2605.14205#bib.bib41)\]), Welch’s $t$-test for pairwise mean differences (Welch \[[28](https://arxiv.org/html/2605.14205#bib.bib42)\]), the Kruskal–Wallis test for any difference across the three groups (Kruskal and Wallis \[[10](https://arxiv.org/html/2605.14205#bib.bib43)\]), and a permutation test with 10,000 random label reassignments (Fisher \[[6](https://arxiv.org/html/2605.14205#bib.bib44)\]).
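Two of these statistics are simple enough to sketch with the Python standard library. The snippet below is illustrative only: it assumes the pooled-standard-deviation form of Cohen’s $d$ and a two-sided permutation test on the Low-vs-High difference in means, which may differ from the exact test statistic used in the paper, and the scores are invented.

```python
import random
from statistics import mean, variance


def cohens_d(low, high):
    """Standardized mean difference (high - low) using the pooled sample SD."""
    n1, n2 = len(low), len(high)
    pooled_var = ((n1 - 1) * variance(low) + (n2 - 1) * variance(high)) / (n1 + n2 - 2)
    return (mean(high) - mean(low)) / pooled_var ** 0.5


def permutation_p_value(low, high, n_perm=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(high) - mean(low))
    pooled = list(low) + list(high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of bin labels
        diff = abs(mean(pooled[len(low):]) - mean(pooled[:len(low)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-simulation event counts for Low-bin and High-bin tokens.
low_scores = [0, 0, 1, 0, 0, 1]
high_scores = [6, 8, 7, 9, 5, 7]
print(cohens_d(low_scores, high_scores))
print(permutation_p_value(low_scores, high_scores))
```

With 10,000 reassignments the smallest p-value this estimator can report is about $10^{-4}$, which is why permutation results are typically stated as a bound such as $p < 0.0001$.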
Purchase intensity is the clearest example of why persona conditioning matters beyond coarse task intent. In our simulation setup, the intent specifies whether a session is broadly *buy*- or *browse*-oriented, but this single bit groups fundamentally different buyers across strata A, B, and C. Without persona conditioning, the agent has no way to distinguish these behaviors; it can only produce a single “average buyer” response to the buy prompt. The persona token resolves this ambiguity. As shown in the funnel progression section of [Table 3](https://arxiv.org/html/2605.14205#S3.T3), Low-purchase tokens almost never trigger conversion actions, Medium tokens convert at moderate rates, and High tokens convert routinely, all under the *same* buy-oriented intent. This separation is not marginal: the mean purchase intensity score for High tokens is over $100\times$ that of Low tokens ([Table 3](https://arxiv.org/html/2605.14205#S3.T3)), with a very large effect size (Cohen’s $d = 2.80$) and highly significant results on all three hypothesis tests ($t$-test $p = 5.5\times10^{-22}$, Kruskal–Wallis $p = 4.5\times10^{-14}$, permutation $p < 0.0001$).
Table 3: Behavioral separation across persona-token bins. Mean score is the average event count per simulation. Statistical significance is assessed at $\alpha = 0.05$.

| Head | Mean score: Low | Mean score: Med. | Mean score: High | Cohen’s $d$ | $t$-test $p$ | KW $p$ | Perm. $p$ |
|---|---|---|---|---|---|---|---|
| Purchase | 0.06 | 2.94 | 7.25 | 2.80 | $5.5\times10^{-22}$ | $4.5\times10^{-14}$ | $<0.0001$ |
| Engagement | 1.55 | 2.00 | 2.41 | 0.76 | $4.1\times10^{-3}$ | $1.2\times10^{-2}$ | 0.003 |
| Exploration | 1.34 | 1.29 | 1.39 | 0.10 | 0.81 | 0.78 | 0.48 |

*Purchase-bin funnel progression*

| Action | Low (rate) | Medium (rate) | High (rate) |
|---|---|---|---|
| Add-to-Cart | 1.0% | 29.4% | 72.0% |
| Checkout | 0.3% | 13.7% | 43.3% |

Engagement provides a second, distinct form of behavioral transfer. The VQ-VAE engagement head is defined from total session duration over a buyer’s history, but simulated sessions lack wall-clock time. We therefore evaluate engagement using a behavioral proxy: the total number of actions the agent takes per session. As shown in [Table 3](https://arxiv.org/html/2605.14205#S3.T3), the mean engagement score rises monotonically from Low to High tokens, indicating that the persona token shapes not only *whether* the agent converts, but also *how actively* it participates in the session: High-engagement tokens produce agents that browse more pages and persist longer, while Low-engagement tokens produce agents that act quickly and leave. All three hypothesis tests are significant at $\alpha = 0.05$ ([Table 3](https://arxiv.org/html/2605.14205#S3.T3)), and the effect size is sizable (Cohen’s $d = 0.76$). The permutation test ($p = 0.003$) provides the strongest evidence: only about 3 in 1,000 random label shuffles produced a gradient as strong as the observed one.
Exploration is the only auxiliary head that does not transfer reliably to agent behavior ([Table 3](https://arxiv.org/html/2605.14205#S3.T3)). Mean scores are nearly flat across bins (${\sim}1.3$ events), with negligible effect size (Cohen’s $d = 0.10$) and no significant differences. Several factors may contribute to this weak separation. First, the underlying human signal is itself subtle: even among real buyers, Medium and High exploration differ by only 0.12 events per session, making the distinction difficult to recover from a single session. Second, the label distribution is heavily skewed: because our exploration metric aggregates product views, searches, and collection browsing over buyers who actively engaged (excluding bouncers), the VQ-VAE assigns most tokens to the High bin (72% High vs. 5% Low), potentially leaving insufficient contrast to learn the distinction. Third, our simulation setup introduces a floor effect: every agent receives a goal naming a product category, so even Low-exploration personas must perform at least one search or product view to pursue their objective, eliminating near-zero sessions and pushing all bins toward a similar baseline of ${\sim}1.3$ events.
### 3.4 Does fine-tuning improve instruction following?
Table 4: Instruction-following performance on 70 deterministic tasks (10 repetitions each).

| Metric | SimPersona | GPT-OSS | $\Delta$ |
|---|---|---|---|
| Avg steps per run | 14.9 | 11.4 | +3.5 |
| Self-reported goal reached | 77.7% | 93.9% | −16.2 pp |
| *Funnel progression (required action reached)* | | | |
| Cart tasks → added to cart | 91.4% | 81.7% | +9.7 pp |
| Checkout tasks → reached checkout | 76.9% | 53.1% | +23.8 pp |
| Purchase tasks → reached checkout | 59.3% | 52.2% | +7.1 pp |

Beyond behavioral alignment, we ask whether SimPersona also improves the agent’s ability to execute shopping tasks in live environments. We compare SimPersona against GPT-OSS-120B, a strong general-purpose baseline, on deterministic goal-oriented tasks executed on live storefronts. Rather than using a fixed benchmark, we derive each task from a sampled buyer’s historical session so that task difficulty reflects the underlying behavioral profile: low-engagement buyers receive short lookup-style tasks, while more engaged buyers receive longer flows. We uniformly sample 70 such tasks to cover persona-bin combinations, conversion outcomes, and complexity levels ranging from 3 navigation actions to more than 20 sequential steps ([Table 10](https://arxiv.org/html/2605.14205#A10.T10)). Each task is repeated 10 times per model, and an independent LLM judge (GPT-5) determines whether the objective was completed.
[Table 4](https://arxiv.org/html/2605.14205#S3.T4) shows that SimPersona outperforms GPT-OSS on every action metric despite having nearly $8\times$ fewer parameters, with the largest advantage on checkout tasks. On simple cart tasks both models perform well, but the gap widens sharply on checkout tasks (+23.8 pp), which require chaining product selection, cart management, and multi-page form navigation in a single session. Even on purchase tasks, the hardest category, which demand end-to-end transactions within 30 steps, SimPersona progresses further through the funnel (+7.1 pp). It also takes more steps per simulation (14.9 vs. 11.4), reflecting persistence through complex navigation rather than early termination. The most revealing contrast is in *how each model fails*: GPT-OSS declares `goal_reached` in 93.9% of simulations yet frequently stops before completing the required action, whereas SimPersona self-reports less often (77.7%) but is far better calibrated; when it claims success, it has actually progressed further through the funnel.
## 4 Conclusion
We presented SimPersona, a framework that bridges user modeling and agent simulation by learning discrete persona tokens from raw clickstreams via a behavior-aware VQ-VAE and grounding them in an LLM through two-stage fine-tuning. On live storefronts unseen during training, SimPersona reproduces real conversion patterns, separates behavioral dimensions across personas, and outperforms an $8\times$ larger baseline, all with a single token per buyer and no per-store calibration.
Several directions remain open for future work. Exploration behavior is less cleanly separated, suggesting that richer signals are needed. DOM-based page representations omit visual cues that shape real decisions, motivating multimodal perception. The current pipeline also does not fully utilize the learned VQ-VAE embeddings after assignment; using them to initialize LLM token embeddings could reduce or eliminate the first grounding stage. Finally, newly opened stores that have not yet accumulated clickstream history remain unsupported, as the framework requires buyer history.
## References
- \[1\] D. Arthur and S. Vassilvitskii (2007). K-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Cited by: [Appendix C](https://arxiv.org/html/2605.14205#A3.p1.1), [§2.2](https://arxiv.org/html/2605.14205#S2.SS2.p2.9).
- \[2\] T. Caliński and J. Harabasz (1974). A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods 3(1), pp. 1–27. Cited by: [Appendix H](https://arxiv.org/html/2605.14205#A8.p1.6), [§3.1](https://arxiv.org/html/2605.14205#S3.SS1.p2.2).
- \[3\] Y. Chuang, K. Nirunwiroj, Z. Studdiford, A. Goyal, V. V. Frigo, S. Yang, D. V. Shah, J. Hu, and T. T. Rogers (2024). Beyond demographics: aligning role-playing llm-based agents using human belief networks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 14010–14026. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[4\] J. Cohen (1988). Statistical power analysis for the behavioral sciences. 2nd edition, Lawrence Erlbaum Associates. Cited by: [§3.3](https://arxiv.org/html/2605.14205#S3.SS3.p1.3).
- \[5\] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023). Mind2Web: towards a generalist agent for the web. In Advances in Neural Information Processing Systems, Vol. 36. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§3.2](https://arxiv.org/html/2605.14205#S3.SS2.p2.7).
- \[6\] R. A. Fisher (1935). The design of experiments. Oliver and Boyd. Cited by: [§3.3](https://arxiv.org/html/2605.14205#S3.SS3.p1.3).
- \[7\] S. A. Gebreegziabher, Y. Yang, C. Chiang, H. Yoo, C. Chen, H. J. Do, Z. Ashktorab, W. Geyer, D. Gómez-Zará, and T. J. Li (2026). The behavioral fabric of llm-powered gui agents: human values and interaction outcomes. In Proceedings of the 31st International Conference on Intelligent User Interfaces, pp. 909–927. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1).
- \[8\] I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust (2023). A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1).
- \[9\] T. Hatt and S. Feuerriegel (2022). Detecting user exits from online behavior: a duration-dependent latent state model. arXiv preprint arXiv:2208.03937. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p3.1).
- \[10\] W. H. Kruskal and W. A. Wallis (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47(260), pp. 583–621. Cited by: [§3.3](https://arxiv.org/html/2605.14205#S3.SS3.p1.3).
- \[11\] J. Lin (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), pp. 145–151. Cited by: [Appendix B](https://arxiv.org/html/2605.14205#A2.p4.7).
- \[12\] Y. Lu, J. Huang, Y. Han, B. Yao, S. Bei, J. Gesi, Y. Xie, Z. Wang, Q. He, and D. Wang (2025). Can llm agents simulate multi-turn human behavior? evidence from real online customer behavior data. arXiv preprint arXiv:2503.20749. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[13\] Y. Lu, B. Yao, H. Gu, J. Huang, Z. J. Wang, Y. Li, J. Gesi, Q. He, T. J. Li, and D. Wang (2025). Uxagent: an llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pp. 1–12. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14205#S2.SS1.p1.1).
- \[14\] S. S. Y. Lutz et al. (2025). The prompt makes the person(a): a systematic evaluation of sociodemographic persona prompting for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[15\] J. Ni et al. (2018). Perceive your users in depth: learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§1](https://arxiv.org/html/2605.14205#S1.p3.1).
- \[16\] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023). Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[17\] J. S. Park et al. (2024). Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[18\] A. Razavi, A. van den Oord, and O. Vinyals (2019). Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems. Cited by: [Appendix C](https://arxiv.org/html/2605.14205#A3.p1.5), [§2.2](https://arxiv.org/html/2605.14205#S2.SS2.p2.9).
- \[19\] Y. Shao et al. (2023). Character-llm: a trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[20\] Y. Shi, Y. Fei, S. Zhang, H. Wang, and X. Xiao (2025). You are what you bought: generating customer personas for e-commerce applications. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1810–1819. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[21\] Y. Shi, W. Xu, Z. Zhang, X. Zi, Q. Wu, and M. Xu (2025). PersonaX: a recommendation agent-oriented user modeling framework for long behavior sequence. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 5764–5787. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.300). Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[22\] A. van den Oord, Y. Li, and O. Vinyals (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: [Appendix D](https://arxiv.org/html/2605.14205#A4.p1.2), [§2.2](https://arxiv.org/html/2605.14205#S2.SS2.p4.6).
- \[23\] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017). Neural discrete representation learning. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p4.1), [§2.2](https://arxiv.org/html/2605.14205#S2.SS2.p2.9).
- \[24\] D. Wang, T. Hsu, Y. Lu, L. Cui, Y. Xie, W. Headean, B. Yao, A. Veeragouni, J. Liu, S. Nag, and J. Wang (2025). AgentA/b: automated and scalable web a/b testing with interactive llm agents. arXiv preprint arXiv:2504.09723. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14205#S2.SS1.p1.1).
- \[25\] G. Wang, X. Zhang, S. Tang, H. Zheng, and B. Y. Zhao (2016). Unsupervised clickstream clustering for user behavior analysis. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 225–236. External Links: [Document](https://dx.doi.org/10.1145/2858036.2858107). Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p3.1).
- \[26\] Z. Wang, Y. Lu, W. Li, A. Amini, B. Sun, Y. Bart, W. Lyu, J. Gesi, T. Wang, J. Huang, Y. Su, U. Ehsan, M. Alikhani, T. J. Li, L. Chilton, and D. Wang (2025). OPeRA: a dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation. arXiv preprint arXiv:2506.05606. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.05606). Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§2.1](https://arxiv.org/html/2605.14205#S2.SS1.p1.1).
- \[27\] Z. Wang, Y. Lu, Y. Zhang, J. Huang, and D. Wang (2025). Customer-r1: personalized simulation of human behaviors via rl-based llm agent in online shopping. arXiv preprint arXiv:2510.07230. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[28\] B. L. Welch (1947). The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34(1/2), pp. 28–35. Cited by: [§3.3](https://arxiv.org/html/2605.14205#S3.SS3.p1.3).
- \[29\] A. Yang, B. Yang, B. Zhang, B. Wang, B. Li, B. Liu, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§2.3](https://arxiv.org/html/2605.14205#S2.SS3.p1.1).
- \[30\] D. Yang et al. (2023). TRACE: transformer-based user representations from attributed clickstream event sequences. In Proceedings of the ACM Web Conference. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p3.1).
- \[31\] S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022). WebShop: towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, Vol. 35. Cited by: [§3.2](https://arxiv.org/html/2605.14205#S3.SS2.p2.7).
- \[32\] Y. Zhang, T. Wang, J. Gesi, Z. Wang, Y. Lu, J. Lin, S. Zhan, V. Gao, R. Jiao, J. Liu, et al. (2025). Shop-r1: rewarding llms to simulate human behavior in online shopping via reinforcement learning. arXiv preprint arXiv:2507.17842. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p2.1).
- \[33\] W. Zheng et al. (2020). A deep Markov model for clickstream analytics in online shopping. In Proceedings of The Web Conference 2020. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p3.1).
- \[34\] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, et al. (2024). WebArena: a realistic web environment for building autonomous agents. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.14205#S1.p1.1), [§3.2](https://arxiv.org/html/2605.14205#S3.SS2.p2.7).
## Appendix A Data Pipeline
[Figure 2](https://arxiv.org/html/2605.14205#A1.F2) illustrates our end-to-end data pipeline described in [Section 2.1](https://arxiv.org/html/2605.14205#S2.SS1). As mentioned in [Section 2.1.1](https://arxiv.org/html/2605.14205#S2.SS1.SSS1), we join the de-identified raw inputs, including the event-level information, with the product catalog, collection directory, and search-query logs. The result, shown in [Figure 3](https://arxiv.org/html/2605.14205#A1.F3), is a set of enriched session-level records that combine numerical behavioral signals (e.g., session counts, conversion rates, session durations) with semantic information (e.g., product titles, search queries, collection names).
Figure 2: Data pipeline overview. A single enrichment pass over raw clickstream logs produces two outputs: (top) numeric and semantic buyer-level features for VQ-VAE persona discovery, and (bottom) executable multi-turn agent traces for SFT, generated by replaying real sessions on live storefronts.

Figure 3: Data enrichment. Raw event-level tables are joined with the product catalog, collection directory, and product embeddings to produce an enriched buyer–session table containing (A) numeric behavioral features used for VQ-VAE persona discovery, and (B) semantic columns used for SFT trace generation.

From these enriched records, we extract the 16 scalar behavioral features described in [Section 2.1.1](https://arxiv.org/html/2605.14205#S2.SS1.SSS1). Among them, *intent strength* is a composite score calculated by weighting actions according to their commercial commitment:
$$
s_{\text{intent}} = \frac{w_{\text{atc}} \cdot n_{\text{atc}} + w_{\text{co}} \cdot n_{\text{co\_start}} + w_{\text{pur}} \cdot n_{\text{purchase}}}{n_{\text{sessions}} + w_{\text{atc}} \cdot n_{\text{atc}} + w_{\text{co}} \cdot n_{\text{co\_start}} + w_{\text{pur}} \cdot n_{\text{purchase}}} \tag{3}
$$
where $n_{\text{atc}}$, $n_{\text{co\_start}}$, and $n_{\text{purchase}}$ count sessions containing an add-to-cart, checkout initiation, and completed purchase, respectively, and $n_{\text{sessions}}$ is the buyer’s number of sessions. The weights $w_{\text{atc}}{=}3$, $w_{\text{co}}{=}5$, $w_{\text{pur}}{=}8$ are monotonically increasing with funnel depth, following the same coprime design principle used for the auxiliary purchase head (Appendix [E](https://arxiv.org/html/2605.14205#A5)): $w_{\text{atc}}$ and $w_{\text{pur}}$ match the purchase composite weights, while $w_{\text{co}}{=}5$ lies near their midpoint, capturing checkout initiation as an intermediate commitment signal between add-to-cart and completed purchase. The denominator ensures $s_{\text{intent}} \in [0, 1)$: a pure window shopper scores 0, while a buyer who converts in every session scores $16/17 \approx 0.94$.
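As a quick check of the two boundary cases just mentioned, the intent-strength score can be written as a short Python function. This is a sketch: the function name and the zero-denominator guard are assumptions added for illustration, while the weights are those stated in the text.

```python
W_ATC, W_CO, W_PUR = 3, 5, 8  # weights stated in the text


def intent_strength(n_sessions, n_atc, n_co_start, n_purchase):
    """Composite intent score of Eq. (3); each n_* counts sessions with that event."""
    weighted = W_ATC * n_atc + W_CO * n_co_start + W_PUR * n_purchase
    denom = n_sessions + weighted
    # Guard for a buyer with no sessions at all (an assumption, not from the paper).
    return weighted / denom if denom > 0 else 0.0


window_shopper = intent_strength(n_sessions=10, n_atc=0, n_co_start=0, n_purchase=0)
always_converts = intent_strength(n_sessions=10, n_atc=10, n_co_start=10, n_purchase=10)
print(window_shopper)    # 0.0
print(always_converts)   # 16/17, roughly 0.94
```

When every session contains all three events, the numerator is $16\,n_{\text{sessions}}$ and the denominator is $17\,n_{\text{sessions}}$, so the score is $16/17$ regardless of how many sessions the buyer has.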
Table 5: Feature normalization pipeline for the 403-dimensional VQ-VAE inputs.

| Feature Group | Dims | Transform | Rationale |
|---|---|---|---|
| Exposure & Volume | 8 | $\log(1{+}x) \to \text{rob-}z$ | Right-skewed counts |
| Engagement Depth | 2 | $\log(1{+}x) \to \text{rob-}z$ | Cumulative totals with heavy tails |
| Funnel Rates | 3 | $\frac{n+\alpha}{N+\alpha+\beta} \to \mathrm{logit} \to \text{rob-}z$ | Smoothing stabilizes small $N$ |
| Intent Strength | 1 | $\mathrm{logit}_{\epsilon} \to \text{rob-}z$ | Bounded composite score |
| Dollar Values | 2 | $\log(1{+}x) \to \text{rob-}z$ | \$0 for non-buyers, heavy tail |
| Product Embeddings | $3{\times}128$ | dim-$z$ $\to$ PCA ($768{\to}128$) | Heterogeneous dims; retains 85% var |
| Embedding Masks | 3 | binary (as-is) | Flags observed vs. zero-filled channels |
| Total | 403 | | |

Figure 4: VQ-VAE input construction for a single buyer–shop pair. Top: product embeddings are computed by averaging the 768-dimensional catalog vectors of all products a buyer viewed, carted, or purchased, then compressed to 128 dimensions via PCA. Bottom: the final 403-dimensional input concatenates 16 $z$-scored behavioral scalars, three 128-d product embeddings, and a 3-bit evidence mask indicating which product channels are present.

Because the 16 features span different scales and distributions, each group is normalized with a tailored transform ([Table 5](https://arxiv.org/html/2605.14205#A1.T5)). Heavy-tailed counts and dollar values are log-transformed and then robust $z$-scored, computed as $(x - \tilde{x})/\mathrm{IQR}$, where $\tilde{x}$ is the median, with a $\mathrm{MAD} \times 1.4826$ fallback when $\mathrm{IQR} = 0$. Bounded rates are first Bayesian-smoothed, replacing the raw rate $n/N$ with $\tilde{p} = (n + \alpha)/(N + \alpha + \beta)$ under a uniform Beta prior ($\alpha = \beta = 1$, i.e., Laplace smoothing) to prevent extreme logit values from buyers with few sessions, then logit-transformed and robust $z$-scored. These 16 scalars are concatenated with three PCA-compressed product embeddings ($3 \times 128$ dimensions) and three binary evidence masks to form the 403-dimensional VQ-VAE input illustrated in [Figure 4](https://arxiv.org/html/2605.14205#A1.F4).
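The two scalar transforms described above can be sketched in plain Python. The helper names are hypothetical, the quartile convention (`method="inclusive"`) is an assumption since the text does not specify one, and the handling of a feature whose IQR and MAD are both zero is likewise an illustrative choice.

```python
import math
from statistics import median, quantiles


def robust_z(values):
    """(x - median) / IQR, falling back to 1.4826 * MAD when the IQR is zero."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    scale = q3 - q1
    if scale == 0:
        scale = 1.4826 * median(abs(v - med) for v in values)
    if scale == 0:
        # Degenerate feature (both scales are zero): map everything to 0 (assumption).
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def smoothed_logit(n, N, alpha=1.0, beta=1.0):
    """Logit of the smoothed rate (n + alpha) / (N + alpha + beta)."""
    p = (n + alpha) / (N + alpha + beta)
    return math.log(p / (1.0 - p))


# Hypothetical heavy-tailed counts: log(1 + x) first, then robust z-scoring.
counts = [0, 1, 2, 3, 50]
z_counts = robust_z([math.log1p(c) for c in counts])

# A buyer with 0 conversions in 3 sessions still gets a finite logit.
print(z_counts)
print(smoothed_logit(0, 3))
```

Without smoothing, a raw rate of 0 or 1 would map to an infinite logit, which is the failure mode the prior is meant to avoid for buyers with few sessions.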
[Figure 5](https://arxiv.org/html/2605.14205#A1.F5) illustrates the trace generation pipeline described in [Section 2.1.2](https://arxiv.org/html/2605.14205#S2.SS1.SSS2): a buyer’s raw clickstream is first synthesized into a natural-language navigation goal, then an agent replays that goal on the live storefront, producing the multi-turn interaction traces used for SFT.
Figure 5: SFT trace generation from enriched clickstreams. Row 1: the enriched session record contains the buyer’s event timeline with semantic labels (product names, collection titles, search queries) alongside metadata (shop ID, funnel stratum, persona token). Row 2: an LLM synthesizes the event sequence into a natural-language navigation goal, which is replayed on the live storefront by an agent and verified by an LLM judge. Row 3: the verified replay is converted into a multi-turn SFT example; the system prompt defines the task format, the user turn provides the navigation goal and DOM snapshot, and the assistant turn serves as the training target.
## Appendix B Population Distribution Recovery
A distinctive property of our proposed method is that it recovers not only *which* behavioral types exist in the buyer population, but also *how they are distributed* within each store. This capability is structurally absent from prompt-based persona methods, where persona types are defined by manual specification and no mechanism exists to estimate the mixing proportions $p_s(k)$ from observed traffic. In our method, by contrast, each buyer $b$ is mapped to a discrete token via a single encoder forward pass followed by nearest-entry quantization $q(\cdot)$, so the empirical population distribution over persona tokens for store $s$ is given by:
$$
\hat{p}_s(k) = \frac{1}{|\mathcal{B}_s|} \sum_{b \in \mathcal{B}_s} \mathbf{1}\left[\,q(z_b) = k\,\right], \qquad k = 1, \dots, K, \tag{4}
$$

where $\mathcal{B}_s$ is the set of buyers observed at that store. This distribution is the sampling measure used to construct simulated buyer populations: agents are assigned persona tokens in proportion to $\hat{p}_s(k)$. As a result, store-level simulation statistics such as add-to-cart rate, checkout rate, and navigation depth depend not only on how each token behaves, but also on whether the learned token mixture matches the real buyer population. If high-intent tokens are overrepresented, simulated conversion will be inflated; if low-intent tokens dominate incorrectly, it will be suppressed. Recovering $\hat{p}_s$ is therefore a necessary condition for faithful population-level simulation.
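The empirical distribution and the proportional sampling it drives can be sketched as follows. The function names and the toy token assignments are hypothetical, and tokens are indexed from 0 here rather than from 1 as in the equation.

```python
import random
from collections import Counter


def token_distribution(token_ids, K):
    """Empirical distribution: fraction of a store's buyers assigned to each token."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return [counts.get(k, 0) / total for k in range(K)]


def sample_personas(p_hat, n_agents, seed=0):
    """Draw persona-token ids for simulated agents in proportion to p_hat."""
    rng = random.Random(seed)
    return rng.choices(range(len(p_hat)), weights=p_hat, k=n_agents)


# Hypothetical quantized token ids q(z_b) for the 8 buyers of one store (K = 4).
store_tokens = [0, 0, 1, 3, 3, 3, 3, 2]
p_hat = token_distribution(store_tokens, K=4)
print(p_hat)  # [0.25, 0.125, 0.125, 0.5]

agents = sample_personas(p_hat, n_agents=1000)
print(Counter(agents))
```

Because sampling is proportional to the store’s own token shares, a store dominated by one token yields a simulated population dominated by the same token.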
We test this property on the same 42 held-out storefronts covering 8.37M buyers used in [Section 3](https://arxiv.org/html/2605.14205#S3), using two complementary analyses. The first asks whether the learned token mixture can recover each store’s funnel-stage composition. The second asks whether the same token mixture can reconstruct continuous store-level behavioral features. Together, these tests evaluate whether the codebook captures real population structure rather than merely clustering individual sessions.
Figure 6:Stratum distribution recovery across all4242storefronts\. Solid bars show the real funnel\-stage composition and hatched bars show the distribution predicted from each store’s VQ\-VAE token distribution\. Bar heights are shown aslog\(1\+100x\)\\log\(1\+100x\), wherexxis the stratum fraction, to make low\-frequency strata visible alongside the dominant browser segment\. The predicted and real distributions remain closely matched under this visualization \(mean JS divergence=0\.054=0\.054\), confirming that the learned codebook faithfully recovers the buyer population mix\.We first evaluate recovery of funnel strata\. As explained in[Section˜2\.1\.1](https://arxiv.org/html/2605.14205#S2.SS1.SSS1), each buyer is independently assigned to one of four rule\-based strata from its historical behavior: Purchaser \(A\), Checkout Abandoner \(B\), Cart Builder \(C\), or Window Shopper \(D\)\. These labels are not used to train the VQ\-VAE, which only sees continuous behavioral features, product embeddings, and evidence masks\. This makes stratum recovery a stringent test: if the token distribution can reconstruct the stratum mix of a store, then the unsupervised codebook has recovered behaviorally meaningful population structure\.
For each token $k$, we estimate its global stratum profile $P(\text{stratum}\mid k)$ from all buyers assigned to that token. We then predict the stratum distribution of store $s$ by mixing these token profiles according to the store’s token distribution:

$$\hat{P}_s(\text{stratum}) = \sum_{k=1}^{K}\hat{p}_s(k)\,P(\text{stratum}\mid k). \tag{5}$$

Intuitively, this asks: if we know which persona tokens a store’s buyers fall into, and we know what funnel stage each token typically represents, can we recover the store’s real funnel-stage mix? We measure the agreement using the Jensen–Shannon (JS) divergence Lin [[11](https://arxiv.org/html/2605.14205#bib.bib50)], a symmetric, bounded measure of the difference between two probability distributions. When computed with base-2 logarithms, JS divergence ranges from 0 (identical distributions) to 1 (completely disjoint support); values below 0.05 indicate near-perfect agreement, while values above 0.15 suggest meaningful distributional differences.
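The mixing step of Equation 5 and the JS comparison can be sketched as follows. The token profiles, the store's token distribution, and the "real" stratum mix are made-up numbers for illustration, and base-2 logarithms are assumed so that the divergence lies in the 0 to 1 range described above.

```python
import math

def predict_strata(p_hat, profiles):
    """Equation 5: mix per-token stratum profiles by the store's token distribution."""
    n_strata = len(profiles[0])
    return [sum(p_hat[k] * profiles[k][j] for k in range(len(p_hat)))
            for j in range(n_strata)]

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical K = 3 tokens; strata ordered (A, B, C, D).
profiles = [
    [0.70, 0.20, 0.10, 0.00],
    [0.05, 0.15, 0.60, 0.20],
    [0.00, 0.00, 0.05, 0.95],
]
p_hat = [0.1, 0.2, 0.7]
predicted = predict_strata(p_hat, profiles)
real = [0.08, 0.06, 0.17, 0.69]  # made-up observed stratum mix
print(predicted, js_divergence(real, predicted))
```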
[Figure 6](https://arxiv.org/html/2605.14205#A2.F6) compares the predicted and real stratum distributions across all evaluation storefronts, from the largest store with 3.1M buyers to the smallest with under 1,000. The mean Jensen–Shannon divergence is 0.054 (median 0.045), with 36 of 42 stores below 0.10. The recovery is equally accurate on stores with over one million buyers, where even small distributional errors would affect hundreds of thousands of sampled agents, and on stores with fewer than 10,000 buyers, where the token distribution is estimated from a smaller sample. Visually, the solid (real) and hatched (predicted) bars are nearly indistinguishable across stores, despite wide variation in buyer composition, from browser-dominated stores (>90% stratum D) to stores with 30%+ purchasers. This level of agreement between a hard rule-based labeling and an unsupervised discrete codebook trained on continuous features confirms that the VQ-VAE codebook has learned behaviorally meaningful clusters whose population-level statistics faithfully reflect the real buyer distribution at scale.
We next test whether the same token distribution preserves continuous behavioral information beyond categorical strata. For each token $k$, we compute its behavioral centroid $\bar{\mathbf{x}}_k\in\mathbb{R}^d$: the mean raw feature vector of all buyers assigned to that token across all stores. For a given store $s$, we reconstruct its aggregate buyer profile by weighting these centroids by the store’s token distribution:

$$\hat{\mathbf{x}}_s = \sum_{k=1}^{K}\hat{p}_s(k)\,\bar{\mathbf{x}}_k. \tag{6}$$

We compare $\hat{\mathbf{x}}_s$ to the true store-level mean $\mathbf{x}_s$ and report per-feature $R^2$ across all 42 stores ([Figure 7](https://arxiv.org/html/2605.14205#A2.F7)). $R^2$ measures the fraction of store-to-store variance in each feature that is explained by the token-weighted prediction: a value of 0.96 means that 96% of the cross-store variation in that feature can be recovered from the token distribution alone, without access to any individual buyer data.
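A small sketch of Equation 6 and the per-feature $R^2$ check is given below. The centroids, token distributions, and "true" store means are hypothetical, and $R^2$ is assumed here to be the usual coefficient of determination $1-\mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$, which the text does not spell out.

```python
def reconstruct_profile(p_hat, centroids):
    """Equation 6: token-weighted average of per-token feature centroids."""
    d = len(centroids[0])
    return [sum(p_hat[k] * centroids[k][j] for k in range(len(p_hat)))
            for j in range(d)]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical K = 2 tokens with features (add-to-cart rate, checkout rate).
centroids = [[0.6, 0.3], [0.1, 0.0]]
store_token_dists = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
predicted = [reconstruct_profile(p, centroids) for p in store_token_dists]

# Made-up true store-level add-to-cart rates for the same three stores.
true_atc = [0.33, 0.22, 0.56]
pred_atc = [row[0] for row in predicted]
r2_atc = r_squared(true_atc, pred_atc)
print(pred_atc, r2_atc)
```

The score is computed across stores for one feature at a time, mirroring the per-feature reporting described above.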
Figure 7: Store-level behavioral reconstruction from persona token distributions. The codebook achieves $R^2\geq 0.87$ on the four features most critical to shopping simulation, confirming that it preserves behaviorally relevant population structure while remaining invariant to volume metrics.

The reconstruction is strongest on the behavioral dimensions most critical to simulation fidelity: intent strength ($R^2=0.96$), add-to-cart rate (0.95), checkout rate (0.90), and browse-only rate (0.87). These are precisely the features that determine whether a simulated agent browses passively, expresses purchase interest, or advances toward checkout; they are the core axes along which persona tokens must differentiate buyer behavior. Their consistently high $R^2$ ($\geq 0.87$) confirms that by knowing a store’s token mix we can predict its real aggregate behavioral profile: if two stores differ substantially in how purchase-oriented their visitors are, the VQ-VAE token distribution captures that difference.
## Appendix C VQ-VAE architecture and training details.
As explained in [Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2), codebook entries are initialized with $k$-means++ Arthur and Vassilvitskii [[1](https://arxiv.org/html/2605.14205#bib.bib38)] over encoder outputs from a full pass through the training set. During training, entries are updated via exponential moving averages rather than gradient descent:

$$\mathbf{e}_k \leftarrow \gamma\,\mathbf{e}_k + (1-\gamma)\,\bar{\mathbf{z}}_k, \tag{7}$$

where $\bar{\mathbf{z}}_k$ is the mean of encoder outputs assigned to entry $k$ in the current batch and $\gamma\in[0,1)$ controls the memory of past assignments. To prevent codebook collapse Razavi et al. [[18](https://arxiv.org/html/2605.14205#bib.bib39)], we maintain an EMA of per-entry assignment counts $\hat{n}_k$ and declare an entry dead when

$$\hat{n}_k \;<\; \alpha\cdot\frac{1}{K}\sum_{j=1}^{K}\hat{n}_j, \tag{8}$$

i.e., when its usage falls below a fraction $\alpha$ of the mean. Dead entries are reinitialized from randomly sampled encoder outputs plus small Gaussian noise. This reset activates only after a warmup period so that usage statistics reflect a stable latent space.
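The two update rules can be illustrated with a few lines of plain Python. This is only a sketch of Equations 7 and 8 on toy vectors; the function names and the usage counts are hypothetical, and the reinitialization and warmup logic is omitted.

```python
def ema_update(entry, batch_mean, gamma=0.9):
    """Equation 7: e_k <- gamma * e_k + (1 - gamma) * z_bar_k."""
    return [gamma * e + (1.0 - gamma) * z for e, z in zip(entry, batch_mean)]

def dead_entries(usage_ema, alpha=0.1):
    """Equation 8: indices whose EMA usage is below alpha times the mean usage."""
    mean_usage = sum(usage_ema) / len(usage_ema)
    return [k for k, n in enumerate(usage_ema) if n < alpha * mean_usage]

# One EMA step pulls the entry 10% of the way toward the batch mean.
entry = ema_update([1.0, 0.0], [0.0, 1.0])

# Hypothetical EMA assignment counts for K = 4 entries:
# mean usage is 25.0, so the dead threshold is 2.5.
usage = [40.0, 35.0, 0.5, 24.5]
print(entry, dead_entries(usage))
```

With $\gamma=0.9$ a single batch moves an entry only slightly, which is why usage statistics need several steps to stabilize before dead-code resets become meaningful.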
The encoder has layer sizes $[403\to 256\to 128\to 96]$, the codebook has $K=256$ entries in $\mathbb{R}^{96}$, and the decoder mirrors the encoder. Reconstruction is down-weighted ($\lambda_r=0.3$) to prioritize behavioral partitioning over per-feature fidelity; remaining hyperparameters are $\beta=0.75$, $\lambda_c=0.15$, $\lambda_a=0.5$, EMA decay $\gamma=0.9$ ([Equation 7](https://arxiv.org/html/2605.14205#A3.E7)), and dead-code replacement with threshold $\alpha=0.1$ every 50 steps after a 100-step warmup ([Equation 8](https://arxiv.org/html/2605.14205#A3.E8)). All values were selected through hyperparameter search optimizing for behavioral purity of the resulting codebook entries. The training data is constructed by aggregating user behavior over the last 4 months.
## Appendix D Contrastive and auxiliary loss
The contrastive term $\mathcal{L}_{\text{contrast}}$ is an InfoNCE loss van den Oord et al. [[22](https://arxiv.org/html/2605.14205#bib.bib40)] computed over the encoder outputs $\mathbf{z}_e$:

$$\mathcal{L}_{\text{contrast}} = -\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\log\frac{\exp\bigl(\text{sim}(\mathbf{z}_e^{(i)},\mathbf{z}_e^{(i+)})/\tau\bigr)}{\sum_{j\in\mathcal{B}}\exp\bigl(\text{sim}(\mathbf{z}_e^{(i)},\mathbf{z}_e^{(j)})/\tau\bigr)}, \tag{9}$$

where $\text{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is a temperature parameter, and $i+$ is the positive pair for anchor $i$. As explained in [Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2), we define positive pairs through 3 steps, and all non-self samples in the batch remain in the InfoNCE denominator ([Equation 9](https://arxiv.org/html/2605.14205#A4.E9)), so the model simultaneously pulls behaviorally aligned buyers together and pushes apart incompatible ones.
We set $\tau=0.1$, $M=10$, and $F=3$, chosen via a hyperparameter search that optimizes codebook purity. The constraint $M>F$ is by design: the product filter retrieves a broad pool of peers who engage with similar merchandise, and the behavioral filter then narrows this pool to the $F$ whose browsing *style* most closely matches the anchor buyer. If $F\geq M$, the behavioral stage would pass through every product-filtered candidate, collapsing the two-stage selection into a single filter and losing the ability to separate *what* buyers shop for from *how* they shop.
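For intuition, the InfoNCE term can be written in a few lines of plain Python. This is a sketch under stated assumptions rather than the authors' implementation: each anchor has exactly one positive, given by a hypothetical `positive` index list, and the denominator runs over all non-self samples as described in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(z, positive, tau=0.1):
    """Mean InfoNCE loss; positive[i] is the index of anchor i's positive sample.

    Assumption: the denominator sums over all non-self samples j != i.
    """
    total = 0.0
    for i, zi in enumerate(z):
        logits = {j: cosine(zi, zj) / tau for j, zj in enumerate(z) if j != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total -= logits[positive[i]] - log_denominator
    return total / len(z)

# Toy encoder outputs: samples 0/1 point one way, samples 2/3 another.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = info_nce(z, positive=[1, 0, 3, 2])     # positives are true neighbors
misaligned = info_nce(z, positive=[2, 3, 0, 1])  # positives are dissimilar samples
print(aligned, misaligned)
```

The loss is small when each anchor's positive is also its nearest neighbor in cosine similarity and large otherwise, which is the pull-together and push-apart behavior the paragraph above describes.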
The auxiliary loss $\mathcal{L}_{\text{aux}}$ adds supervised pressure for the codebook to preserve behaviorally meaningful distinctions. Three single-layer linear heads operate on the quantized representation $\mathbf{z}_q$, each predicting a coarse three-way label (Low / Medium / High) for one behavioral axis:

$$\mathcal{L}_{\text{aux}} = \sum_{h\in\{\text{engage},\,\text{explore},\,\text{purchase}\}}\mathcal{L}_h, \tag{10}$$

where each $\mathcal{L}_h$ is a weighted cross-entropy loss:

$$\mathcal{L}_h = -\sum_{c=1}^{3} w_c^{(h)}\,y_c^{(h)}\log\hat{p}_c^{(h)}(\mathbf{z}_q),\qquad \hat{p}^{(h)}(\mathbf{z}_q)=\mathrm{softmax}\bigl(\mathrm{MLP}_h(\mathbf{z}_q)\bigr). \tag{11}$$

Here $y_c^{(h)}\in\{0,1\}$ is the one-hot ground-truth bin for head $h$, and $\hat{p}_c^{(h)}$ is the predicted probability for bin $c$. Because some bins are much smaller than others (e.g., heavy purchasers are rare), we apply inverse-frequency class weights $w_c^{(h)}=N/(3\cdot n_c^{(h)})$, where $N$ is the training-set size and $n_c^{(h)}$ is the count in bin $c$. This ensures the loss does not ignore rare but behaviorally important groups.
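The class weighting can be made concrete with the purchase-head bin sizes reported in Appendix E (17,453 / 16,472 / 10,634). The sketch below computes the inverse-frequency weights and evaluates the weighted cross-entropy of Equation 11 for one buyer; the logits are hypothetical, and the function names are illustrative only.

```python
import math

def inverse_frequency_weights(bin_counts):
    """w_c = N / (C * n_c), with C the number of bins (3 in the text)."""
    total = sum(bin_counts)
    return [total / (len(bin_counts) * n) for n in bin_counts]

def weighted_cross_entropy(logits, true_bin, weights):
    """Equation 11 for one buyer and one head: -w_c * log softmax(logits)[c]."""
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -weights[true_bin] * (logits[true_bin] - log_norm)

counts = [17453, 16472, 10634]  # purchase-head bin sizes from Appendix E
weights = inverse_frequency_weights(counts)
loss = weighted_cross_entropy([0.2, 0.1, 1.5], true_bin=2, weights=weights)
print([round(w, 3) for w in weights], loss)
```

The smallest bin receives the largest weight (about 1.40 versus about 0.85 for the largest bin), so errors on rare purchase-committed buyers cost more per example.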
## Appendix E Auxiliary head weight selection
Table 6: Purchase-intensity bins induced by $s_{\mathrm{pur}}=8\,n_{\texttt{checkout}}+3\,n_{\texttt{atc}}$. The same partition is obtained for all 105 tested monotone weight pairs satisfying $w_{\texttt{co}}>w_{\texttt{atc}}>0$.

| Bin | Semantics | $N$ | Checkout Rate | ATC Rate | Purchase Intensity |
|---|---|---|---|---|---|
| 0 | Browse only | 17,453 | .000 | .000 | .000 |
| 1 | Interest only | 16,472 | .000 | .892 | .706 |
| 2 | Purchase-committed | 10,634 | .533 | .536 | .776 |

The auxiliary purchase head ([Section 2.2](https://arxiv.org/html/2605.14205#S2.SS2) and [Appendix D](https://arxiv.org/html/2605.14205#A4)) supervises the codebook to preserve coarse purchase-intensity structure. Each buyer receives a three-level label (Low / Medium / High) derived from the composite score $s_{\mathrm{pur}}=w_{\texttt{co}}\,n_{\texttt{checkout}}+w_{\texttt{atc}}\,n_{\texttt{atc}}$, where $n_{\texttt{checkout}}$ and $n_{\texttt{atc}}$ count sessions containing a completed checkout and an add-to-cart, respectively. The only structural assumption is the funnel ordering $w_{\texttt{co}}>w_{\texttt{atc}}>0$: checkout reflects stronger commitment than add-to-cart. The resulting integer scores are discretized into three bins by a count-based procedure that greedily merges adjacent values until three groups remain.
A key question is whether the auxiliary labels depend sensitively on the particular numerical choice of $(w_{\texttt{co}},w_{\texttt{atc}})$. To test this, we swept all monotone integer pairs with $w_{\texttt{co}}\in[2,15]$ and $w_{\texttt{atc}}\in[1,w_{\texttt{co}}-1]$, for a total of 105 candidate configurations. On the 44,559 VQ-VAE training buyers, every one of these configurations produced exactly the same three-way partition (17,453 / 16,472 / 10,634 buyers). In other words, the induced labels are not driven by any special tuning of the weights. Instead, the buyers already separate naturally into three coarse behavioral groups: those with no commercial activity, those with add-to-cart activity but no checkout behavior, and those with checkout-bearing sessions. As long as the monotone ordering $w_{\texttt{co}}>w_{\texttt{atc}}$ is respected, the exact ratio does not change this partition.
Since the partition is invariant, $(w_{\texttt{co}},w_{\texttt{atc}})=(8,3)$ is chosen for interpretability rather than as a tuned parameter. The values are coprime ($\gcd(8,3)=1$), which maximizes the number of composite scores that uniquely decompose into their constituent events: a score of 11 can only mean one checkout plus one add-to-cart ($8+3$), and 6 can only mean two add-to-carts ($3\times 2$). A non-coprime pair like $(8,2)$ would make scores like 16 ambiguous: it could mean two checkouts ($8\times 2$) or eight add-to-carts ($2\times 8$), conflating fundamentally different funnel behaviors. Beyond coprimality, the gap between 8 and 3 is large enough that the smallest checkout-containing score (8) strictly exceeds two add-to-carts (6), cleanly separating repeated interest from genuine funnel progression in the low-count regime that dominates real buyer data. The resulting half-integer bin edges $[1.5,4.5]$ further guarantee that no buyer’s integer-valued score falls on a boundary, making every bin assignment unambiguous.
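The sweep size and the decomposition argument are easy to check by enumeration. The helper below is a hypothetical illustration (it is not part of the paper's pipeline): it lists every non-negative pair of checkout and add-to-cart counts that produces a given composite score.

```python
def decompositions(score, w_co, w_atc):
    """All (n_checkout, n_atc) pairs with w_co * n_checkout + w_atc * n_atc == score."""
    return [(a, (score - w_co * a) // w_atc)
            for a in range(score // w_co + 1)
            if (score - w_co * a) % w_atc == 0]

# The sweep grid described above: w_co in [2, 15], w_atc in [1, w_co - 1].
pairs = [(w_co, w_atc) for w_co in range(2, 16) for w_atc in range(1, w_co)]
print(len(pairs))                # 105

print(decompositions(11, 8, 3))  # [(1, 1)]
print(decompositions(6, 8, 3))   # [(0, 2)]
print(decompositions(16, 8, 3))  # [(2, 0)]
print(decompositions(16, 8, 2))  # [(0, 8), (1, 4), (2, 0)]
```

Under the weights $(8,3)$ the first score with more than one decomposition is 24 (three checkouts or eight add-to-carts), so uniqueness holds throughout the low-count regime the text refers to.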
## Appendix F Two-stage vs. single-stage training comparison
Table 7: Two-stage vs. single-stage SFT across 42 storefronts. Goal reached = successful task completion. Steps-limit = step budget exhausted. Other = malformed or unparseable agent output.

| Outcome | Two-Stage Count | Two-Stage % | Single-Stage Count | Single-Stage % |
|---|---|---|---|---|
| Goal reached | 42,083 | 83.5 | 41,720 | 82.8 |
| Steps-limit errors | 6,625 | 13.1 | 6,333 | 12.6 |
| Time-limit errors | 598 | 1.2 | 7 | 0.0 |
| Other errors | 1,094 | 2.2 | 2,340 | 4.6 |

A central design choice in SimPersona is the two-stage SFT framework described in [Section 2.3](https://arxiv.org/html/2605.14205#S2.SS3). In Stage 1, we freeze the LLM backbone and train only the 256 persona-token embeddings, and in Stage 2, we unfreeze the full model and jointly fine-tune all parameters. To test whether this decomposition is necessary, we compare against a single-stage baseline that trains all parameters jointly from initialization. Both variants use the same architecture, data, and optimization settings; the only difference is whether persona grounding is decoupled from action learning.
We evaluate both models on the same simulation setup used in [Section 3.2](https://arxiv.org/html/2605.14205#S3.SS2). Each simulation run has one of four terminal outcomes: *goal reached*, *steps-limit*, *time-limit*, or *other*. A steps-limit outcome means that the agent continued to produce valid actions but did not complete the assigned task within the 30-step budget. A time-limit outcome means that the run exceeded the wall-clock timeout. The *other* category captures malformed or non-executable outputs, including invalid JSON and structurally broken action strings. [Table 7](https://arxiv.org/html/2605.14205#A6.T7) reports aggregate outcome counts, while [Table 8](https://arxiv.org/html/2605.14205#A9.T8) summarizes the distribution of per-shop error rates.
Figure 8: Per-shop error-rate comparison between two-stage and single-stage SFT (sorted by two-stage total error). The two-stage model keeps error bounded below 29% on every shop, while single-stage training produces catastrophic outliers exceeding 70%. The non-zero time-limit rate under two-stage training reflects deeper storefront engagement rather than inefficiency.

The aggregate goal-reached rates (83.5% vs. 82.8%) and overall error rates (16.5% vs. 17.2%) appear nearly identical, and a paired $t$-test on per-shop error rates confirms no significant difference in the mean ($p=0.72$). However, focusing on the mean is misleading. The single-stage model’s comparable average is an artifact of compensating extremes: it achieves very low error on some storefronts while collapsing catastrophically on others. The coefficient of variation, which measures how dispersed per-shop error rates are relative to their average, nearly doubles from 0.46 (two-stage) to 0.81 (single-stage). A coefficient of 0.81 indicates that the typical deviation from the mean is almost as large as the mean itself, meaning that the aggregate error rate is a poor predictor of what any individual storefront will experience. The per-shop error range tells the same story: under two-stage training, the gap between the best and worst shop is 26.3 pp (2.2%–28.5%), whereas under single-stage training it balloons to 68.4 pp (2.9%–71.3%), a $2.6\times$ wider spread. In practical terms, this means a merchant deploying the single-stage model has no reliable guarantee of agent quality: while some storefronts enjoy error rates below 5%, others see more than half of all simulation episodes fail.
As [Table 8](https://arxiv.org/html/2605.14205#A9.T8) shows, our two-stage training framework does not primarily improve the mean; it eliminates tail risk, compressing the error distribution so that no storefront exceeds 30% error and every shop maintains at least 71.5% goal completion.
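To see why equal means can hide very different tail behavior, the sketch below computes the coefficient of variation for two made-up sets of per-shop error rates that share the same mean, and then reproduces the range ratio from the minimum and maximum error rates reported in Table 8. The per-shop lists are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def coefficient_of_variation(rates):
    """Population standard deviation of per-shop rates divided by their mean."""
    return statistics.pstdev(rates) / statistics.mean(rates)

# Hypothetical per-shop error rates (%), chosen only to illustrate the effect:
# both lists have mean 16.0, but the second contains one collapsed shop.
stable = [10.0, 14.0, 16.0, 18.0, 22.0]
spiky = [3.0, 4.0, 5.0, 8.0, 60.0]
cv_stable = coefficient_of_variation(stable)
cv_spiky = coefficient_of_variation(spiky)
print(cv_stable, cv_spiky)

# Range comparison using the min/max error rates reported in Table 8.
two_stage_range = 28.5 - 2.2
single_stage_range = 71.3 - 2.9
print(round(single_stage_range / two_stage_range, 1))  # 2.6
```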
[Figure 8](https://arxiv.org/html/2605.14205#A6.F8) makes this distributional difference visually concrete. The two-stage error bars rise smoothly across shops and stay below 29%, while single-stage bars spike on several storefronts, revealing two distinct catastrophic failure modes. The first is *structural output collapse*: on the worst shop, 51.1% of episodes produce unparseable output, driving the total error to 71.3%; under two-stage training the same shop drops to 24.2% error with only 3.4% other errors, so the entire 47 pp improvement comes from eliminating malformed outputs. The second mode is *degenerate navigation*: another shop reaches 60.3% error with 57.4% of episodes exhausting the step budget in repetitive loops; two-stage training reduces this to 24.5% by enabling more efficient action planning.
These worst cases reflect a systematic pattern. Total *other* errors drop 53% (from 2,340 to 1,094), and the per-shop other-error rate falls from 4.6% mean (std 10.0 pp, max 51.1%) to 2.2% mean (std 2.1 pp, max 10.9%). Across shops, the reduction in total error correlates strongly with the reduction in other errors (Pearson $r=0.69$, $p<10^{-6}$), confirming that structural output failures are the dominant source of variance eliminated by two-stage training. We attribute this to optimization interference: when 256 randomly initialized persona-token embeddings are updated jointly with all backbone parameters, the large, initially noisy gradients can destabilize the structured-generation pathways responsible for producing valid JSON actions before the tokens acquire meaningful signal. The two-stage framework avoids this by freezing the backbone in Stage 1, letting the embeddings converge to coherent representations without corrupting action generation, so that Stage 2 fine-tuning starts from an aligned rather than conflicting initialization.
The remaining error-type panels reinforce this interpretation. Steps-limit rates are comparable in aggregate (13.1% vs. 12.6%) but far more variable under single-stage training (std 10.8 pp vs. 7.3 pp; one shop reaches 57.4%, while no two-stage shop exceeds 25%). The time-limit panel shows a counterintuitive asymmetry: the two-stage model incurs a 1.2% mean time-limit rate, while single-stage reports near-zero. Time-limit errors require the agent to actively render pages and interact with DOM elements long enough to exhaust the wall-clock budget. The single-stage model on its hardest shops fails before reaching this regime, either through unparseable output or step-count exhaustion. The two-stage model engages deeply enough that it occasionally runs out of time while making genuine progress, making the presence of time-limit errors a proxy for *behavioral depth*.
## Appendix G Two-stage training details
(a) Stage 1: persona grounding (backbone frozen, only token embeddings updated).
(b) Stage 2: action-oriented fine-tuning (all parameters updated).
Figure 9: Two-stage persona-grounding SFT examples. Each training example consists of a system prompt specifying the output schema, a user turn containing the persona token, session goal, progress log (memory), and current page DOM, and an assistant turn with a structured JSON action and reasoning trace. Stage 1 teaches the model *what* each token means; Stage 2 teaches it *how* to act on that knowledge.

Both stages use Qwen3-14B as the base architecture, trained on 2 nodes × 8 NVIDIA H200 GPUs (16 GPUs total) with Fully Sharded Data Parallel (FSDP). In Stage 1, the backbone is fully frozen; only the 256 persona-token embeddings are trainable. We train for 1 epoch on approximately 34,000 multi-turn traces generated from 3,600 sessions across 39 shops (each session is on average 10 steps), with an effective batch size of 64 (1 per device × 16 GPUs × 4 gradient-accumulation steps), learning rate $1\times 10^{-5}$ with cosine annealing and 50 warmup steps, weight decay 0.01, and maximum sequence length 16,384 tokens. Loss is computed on assistant turns only. Stage 2 follows the same configuration except that all model parameters are unfrozen and trained. The 39 training shops are fully disjoint from the 42 evaluation storefronts used in [Sections 3.2](https://arxiv.org/html/2605.14205#S3.SS2), [3.3](https://arxiv.org/html/2605.14205#S3.SS3) and [Table 4](https://arxiv.org/html/2605.14205#S3.T4); no buyer appearing in the training set is included in any evaluation. [Figure 9](https://arxiv.org/html/2605.14205#A7.F9) shows a schematic example of the data used for the two-stage training described in [Section 2.3](https://arxiv.org/html/2605.14205#S2.SS3).
## Appendix H Calinski–Harabasz index.
The Calinski–Harabasz (CH) index Caliński and Harabasz [[2](https://arxiv.org/html/2605.14205#bib.bib45)] is defined as

$$\mathrm{CH} = \frac{\mathrm{SS}_{\mathrm{between}}\,/\,(K-1)}{\mathrm{SS}_{\mathrm{within}}\,/\,(N-K)}, \tag{12}$$

where $K$ is the number of clusters (256 in our case), $N$ is the total number of buyers, $\mathrm{SS}_{\mathrm{between}}$ is the sum of squared distances from each cluster centroid to the global centroid weighted by cluster size, and $\mathrm{SS}_{\mathrm{within}}$ is the sum of squared distances from each buyer to its assigned cluster centroid. Higher values indicate tighter, better-separated clusters.
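Equation 12 can be computed directly from its definition. The following is a minimal pure-Python sketch on four toy points, written only to illustrate the formula; the function name and data are hypothetical, and a library implementation would normally be used on real buyer embeddings.

```python
def calinski_harabasz(points, labels):
    """CH index of Equation 12 for points (lists of floats) and cluster labels."""
    n, dim = len(points), len(points[0])
    global_centroid = [sum(p[j] for p in points) / n for j in range(dim)]
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    ss_between = ss_within = 0.0
    for members in clusters.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Squared centroid-to-global distance, weighted by cluster size.
        ss_between += len(members) * sum(
            (c - g) ** 2 for c, g in zip(centroid, global_centroid))
        # Squared distance from every member to its own cluster centroid.
        ss_within += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
tight = calinski_harabasz(points, [0, 0, 1, 1])  # clusters match the geometry
mixed = calinski_harabasz(points, [0, 1, 0, 1])  # clusters cut across it
print(tight, mixed)
```

The same points score 200 when the labels follow the two well-separated groups and 0.02 when they cut across them, which is the sense in which higher values indicate tighter, better-separated clusters.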
## Appendix I Persona token ablation under neutral intents
To further test the effect of persona tokens on agent behavior, we compare simulations with persona-token conditioning against simulations without it. To ensure that the agent is not guided by intent-specific information, both conditions use a neutral, generic intent (“you are interested in product X”). The rest of the experimental setup is identical to [Section 3.2](https://arxiv.org/html/2605.14205#S3.SS2).
Our analysis reveals that removing persona tokens triples the rate of simulation crashes: 7,862 (15.6%) without tokens versus 2,650 (5.3%) with tokens ([Table 9](https://arxiv.org/html/2605.14205#A9.T9)). The gap is dominated by a single error category: StagehandTargetClosedError, a browser-level failure in which the page session terminates mid-interaction. This error occurs $4.6\times$ more frequently without persona tokens (6,579 vs. 1,430). To verify that this gap reflects a difference in navigation behavior rather than a difference in output formatting, we use context-length errors as a control: these errors are triggered when the model’s prompt or response exceeds the token limit, a failure mode that depends entirely on output length and is independent of how the agent navigates the storefront. Context-length errors are virtually identical across conditions (1,119 vs. 1,141; ratio 0.98), confirming that persona tokens do not change *what* the agents generate but rather *how* they interact with the page. The crash reduction is not confined to a few outlier shops: 38 of 42 storefronts exhibit a higher crash rate without tokens, with per-shop differences exceeding 40 pp on the most affected shops ([Figure 10](https://arxiv.org/html/2605.14205#A9.F10), left).
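The headline ratios in this ablation follow directly from the reported counts. The snippet below simply redoes that arithmetic from the Table 9 counts; the variable names and the `ratio` helper are illustrative.

```python
# Counts reported in Table 9, ordered (with persona token, without persona token).
crashes = (2650, 7862)
target_closed = (1430, 6579)
context_length = (1141, 1119)

def ratio(with_token, no_token):
    """With-token count relative to the no-token count."""
    return with_token / no_token

print(round(1 / ratio(*target_closed), 1))   # 4.6  (times more frequent without tokens)
print(round(1 - ratio(*target_closed), 2))   # 0.78 (reduction in target-closed crashes)
print(round(1 / ratio(*context_length), 2))  # 0.98 (context-length control ratio)
print(round(1 - ratio(*crashes), 2))         # 0.66 (reduction in all simulation crashes)
```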
Table 8: Distributional statistics of per-shop error rate across 42 storefronts.

| Statistic | Two-Stage | Single-Stage |
|---|---|---|
| Std of error rate | 7.6pp | 13.9pp |
| Min error rate | 2.2% | 2.9% |
| Max error rate | 28.5% | 71.3% |
| Shops <10% error | 8 | 13 |
| Shops >30% error | 0 | 5 |
| Shops >50% error | 0 | 2 |
| Min goal-reached rate | 71.5% | 28.7% |
| Shops <50% goal reached | 0 | 2 |
| Shops <70% goal reached | 0 | 5 |

Table 9: Persona token ablation under neutral intents. Persona tokens reduce browser crashes by 66% and produce deeper storefront engagement across all behavioral metrics.

| Metric | With Token | No Token | Ratio |
|---|---|---|---|
| *Episode outcomes (all simulations)* | | | |
| Simulation crashes | 2,650 (5.3%) | 7,862 (15.6%) | 0.34× |
| StagehandTargetClosed | 1,430 | 6,579 | 0.22× |
| Context length | 1,141 | 1,119 | 1.02× |
| Goal reached | 39,664 (78.7%) | 37,942 (75.3%) | 1.05× |
| *Behavioral engagement (all simulations)* | | | |
| Add-to-cart events | 16,132 | 13,698 | 1.18× |
| Checkout reached | 7,823 | 6,664 | 1.17× |
| *Navigation depth (succeeded simulations only)* | | | |
| Avg actions / session | 11.1 | 9.1 | 1.23× |
| Avg pages visited | 3.9 | 3.6 | 1.10× |
| Avg products viewed | 1.39 | 1.29 | 1.08× |
| *Crash rate consistency* | | | |
| Shops w/ higher crash rate | 3/42 | 38/42 | - |

StagehandTargetClosedError occurs when the agent’s actions destabilize the browser session, for example, by clicking elements that no longer exist after a page transition, issuing navigation commands faster than the renderer can process, or interacting with the DOM during partial page loads. These failures are deterministic consequences of the agent’s action sequence, not stochastic infrastructure events. In our simulations, both conditions receive only a neutral intent (“you are interested in product X”) that specifies a product but provides no behavioral signal.
Without a persona token, the agent lacks any prior over trajectory structure, so it must select actions conditioned solely on page observations and a generic goal. This yields a high-entropy action distribution that produces erratic navigation, such as frequent page switches, interactions with transient UI elements, and action–observation desynchronization during asynchronous page loads. These are precisely the trajectory patterns that maximize the probability of acting on stale page state. A persona token provides the missing behavioral signal. Trained on real buyer navigation traces, each token encodes a coherent browsing style that constrains the agent’s action distribution toward structured, sequential trajectories. Because storefronts are designed for human navigation, these human-like trajectories are inherently more compatible with the browser environment, explaining why the token reduces StagehandTargetClosed crashes by 78% without affecting context-length errors at all.
Figure 10: Persona token ablation under neutral intents. Left: per-shop infrastructure crash rate (sorted by difference); 38 of 42 shops crash more frequently without persona tokens. Right: error-type breakdown showing that browser crashes (StagehandTargetClosed) are $4.6\times$ more frequent without tokens, while context-length errors remain identical ($\approx 1{,}130$), isolating navigation behavior as the mechanism.

Behavioral engagement metrics provide further evidence that persona tokens produce deeper, more purposeful navigation rather than merely preventing crashes ([Table 9](https://arxiv.org/html/2605.14205#A9.T9)). Across all simulations, persona-conditioned agents add items to cart 18% more often (16,132 vs. 13,698) and reach checkout 17% more frequently (7,823 vs. 6,664). On clean (non-crash) simulations where both conditions complete the session without infrastructure failure, persona-conditioned agents still perform 23% more actions per session (11.1 vs. 9.1), visit 10% more pages (3.9 vs. 3.6), and view 8% more products (1.39 vs. 1.29). The persona token therefore provides a genuine behavioral prior that shapes navigation even in the absence of persona-specific intent: the agent browses more deeply, interacts with more product pages, and progresses further through the purchase funnel, producing sessions that more closely resemble real buyer behavior.
## Appendix J Instruction-following task diversity
To evaluate instruction-following quality ([Table 4](https://arxiv.org/html/2605.14205#S3.T4)), we construct a deterministic benchmark of navigation tasks with varying complexity. Each task specifies a concrete sequence of actions the agent must execute on a live storefront, e.g., searching for specific products, viewing product pages, adding items to cart, and proceeding through checkout. [Table 10](https://arxiv.org/html/2605.14205#A10.T10) shows representative examples ordered by complexity: the simplest tasks require 3 actions (search, view, cart), while the most complex involve 27 actions spanning multiple search queries, product comparisons, cart modifications, and a completed purchase.
Table 10: Representative tasks from the deterministic benchmark, ordered by complexity. Bins denote (Engagement / Exploration / Purchase); *Nav.* counts distinct navigation actions in the instruction.

| Bins (E/X/P) | Nav. | Task description |
|---|---|---|
| 0 / 2 / 1 | 3 | Search for one product, view it, add to cart, end session |
| 0 / 1 / 1 | 5 | Search for one product, view it, add to cart, abandon checkout |
| 1 / 2 / 0 | 7 | Search two products, view each, add one to cart, abandon checkout |
| 2 / 2 / 1 | 10 | Search 3 products, view each, add one, revisit pages, abandon |
| 2 / 2 / 2 | 16 | View 4 products, add one to cart, view 2 more, add another, checkout |
| 2 / 2 / 2 | 27 | Search and view 6 products, add 3 to cart, revisit 2, purchase |