Navigating User Behavior toward Personalized Multimodal Generation
Summary
This paper proposes NaviGen, a framework for personalized multimodal content generation that encodes user behavior into executable instructions using a dual identifier and a two-stage SFT+RL pipeline, improving personalization across product, game, and short-video domains.
Cached at: 06/24/26, 07:45 AM
# Navigating User Behavior toward Personalized Multimodal Generation
Source: [https://arxiv.org/html/2606.24196](https://arxiv.org/html/2606.24196)
Hengji Zhou¹∗, Yufeng Liu¹∗, Ye Liu¹, Yong Xu¹, Lianghao Xia²†, Liqiang Nie²
¹South China University of Technology, ²Harbin Institute of Technology, Shenzhen
hengjizhou01@gmail.com, 202330361751@mail.scut.edu.cn, 202330451251@mail.scut.edu.cn, yxu@scut.edu.cn, aka_xia@foxmail.com, nieliqiang@gmail.com
###### Abstract
Modern AIGC pipelines deliver high-fidelity images and videos but presuppose a well-formed creation instruction, while end users rarely articulate visual details, leaving generators misaligned with user demand. We study personalized content generation, which turns a user's interaction history into an executable instruction for downstream synthesis, and identify two obstacles: behavior must be encoded in a form legible to language reasoning, and the model must acquire instruction-writing skill absent from both pretraining and behavior data. We propose NaviGen, which represents each item with a dual identifier coupling a collaborative code and a textual code as a behavioral substrate and a semantic bridge in one token stream. On this representation, a two-stage SFT+RL pipeline first distills preference reasoning and instruction writing from evolutionarily searched supervision, then aligns generation with user intent through hierarchical and self-consistent rewards. Experiments across product, game, and short-video domains show that NaviGen improves personalized image and video generation, strengthens next-item prediction, and yields more specific, relevant, and visually generatable instructions. Our code is released at: [https://github.com/iLearn-Lab/NaviGen](https://github.com/iLearn-Lab/NaviGen).
∗ Hengji Zhou and Yufeng Liu contributed equally to this work. † Lianghao Xia is the corresponding author.

## 1 Introduction
Multimodal content generation, such as text-to-image posters and short videos, is rapidly becoming a core productivity layer for media, marketing, and e-commerce (Xu et al., [2025](https://arxiv.org/html/2606.24196#bib.bib28); Ling et al., [2026](https://arxiv.org/html/2606.24196#bib.bib26)). A modern creation pipeline refines a textual instruction with a language model and renders it into an image or video via a text-to-vision generator (Yang et al., [2025](https://arxiv.org/html/2606.24196#bib.bib20); Seedance et al., [2026](https://arxiv.org/html/2606.24196#bib.bib21)), where the instruction serves as the central control signal of what is generated and how it looks.
Recent progress along this pipeline broadly falls into three lines. *(i) Generation backbones:* diffusion and autoregressive transformers for text-to-image and text-to-video deliver high-fidelity rendering from textual prompts (Yang et al., [2025](https://arxiv.org/html/2606.24196#bib.bib20); Seedance et al., [2026](https://arxiv.org/html/2606.24196#bib.bib21)). *(ii) Instruction enrichment:* LLM-based prompt expansion and multi-agent creation systems turn short user inputs into detailed, structured instructions that better exploit generator capacity (Xu et al., [2025](https://arxiv.org/html/2606.24196#bib.bib28); An et al., [2026](https://arxiv.org/html/2606.24196#bib.bib27); Dang et al., [2026](https://arxiv.org/html/2606.24196#bib.bib14)). *(iii) Conditional control:* reference-, layout-, or identity-conditioned generation injects external signals for fine-grained control over the output (Zhao et al., [2025b](https://arxiv.org/html/2606.24196#bib.bib25); Ling et al., [2026](https://arxiv.org/html/2606.24196#bib.bib26)).
Figure 1: Personalized multimodal content generation.

However, all of these methods presuppose a well-formed creation instruction as input, leaving a fundamental question unanswered: *whose taste is the content actually for?* End consumers differ widely in taste and rarely articulate visual details, yet whether the content resonates with them is what ultimately matters. Without a path from user signals to a concrete instruction, even the strongest generators produce content that is generic or misaligned with real demand. This gap motivates a new problem we term *personalized content generation*: translating a user's implicit behavior history into a creation instruction that steers the downstream generator toward content the user truly wants.
Realizing this paradigm raises two central challenges, on the input and output sides of the language model respectively. (C1) Representation gap between behavior and language. A user's behavior history must be encoded into a form the LM can reason over, yet no single representation suffices: ID-based encodings preserve behavioral structure but remain opaque in the LM's semantic space, while raw textual metadata is expressive but verbose, tempting the model to paraphrase history rather than infer preference. A workable representation must carry behavioral signal and stay legible to language reasoning at once. (C2) Capability gap between understanding preference and writing instructions. Even given such a representation, *knowing what a user likes* and *writing a good creation instruction* are two distinct skills: the latter is neither cultivated by language pretraining nor reflected in user behavior data, leaving the model with no natural source from which to acquire it.
We propose NaviGen, whose two core designs map one-to-one onto these challenges. To close the representation gap (C1), NaviGen encodes each item with a dual-identifier scheme: a collaborative identifier (CID) captures its behavioral role via residual vector quantization, while a textual identifier (TID) compresses its textual semantics into ordered, standardized terms. Together they give the language model a compact behavioral substrate and a controllable language bridge in a single token stream. To close the capability gap (C2), NaviGen adopts a two-stage SFT+RL pipeline. The SFT stage learns from history-to-instruction supervision synthesized by evolutionary search under an LLM judge, teaching the model to reason about preference evolution rather than paraphrase history. The RL stage jointly optimizes two complementary rewards: a hierarchical CID reward for preference correctness, and a triangular instruction-aware reward defined over the generated instruction, the predicted target semantics, and the ground-truth target semantics, which together drive the model toward generation-ready instructions.
Our contributions are summarized as follows:
- We propose NaviGen, a unified framework that turns user behavior sequences into generation-ready creation instructions for personalized AIGC, bridging behavioral modeling and controllable content generation in a single pipeline.
- We introduce a dual-identifier representation that couples a residual-quantized CID with an ordered, length-flexible TID, jointly providing a compact behavioral substrate and a controllable semantic bridge within one token stream.
- We design a two-stage SFT+RL pipeline that requires no human-written instructions: evolutionary search with an LLM judge synthesizes supervision, while GRPO uses a hierarchical CID reward and a triangular instruction-aware reward enforcing closed-loop self-consistency.
- Across product, game, and short-video domains, NaviGen consistently improves personalized image and video generation quality, CID-space next-item prediction accuracy, and instruction specificity, relevance, and visual generatability.
## 2 Preliminary
Personalized Content Generation. We consider the setting of consumer-facing personalized AIGC, where a multimodal generative model synthesizes visual content tailored to an individual user. Given a textual creation instruction $\mathcal{I}\in\mathcal{T}$, an off-the-shelf generator $g_{\phi}$ produces the final output:

$$\mathcal{O} \;=\; g_{\phi}(\mathcal{I}), \tag{1}$$

where $\mathcal{O}\in\mathcal{Y}$ denotes the generated content in the target modality space (e.g., image or video), and $g_{\phi}$ remains fixed during our training. Under this setting, the quality of personalized generation is fundamentally bottlenecked by the quality of $\mathcal{I}$, while end consumers cannot be expected to author such instructions by hand. This motivates the need for an automatic instruction generator $f_{\theta}$ that produces $\mathcal{I}$ on the user's behalf.
Behavior as Implicit Preference Evidence. To drive $f_{\theta}$ toward user-specific generation, we leverage the user's observed interaction history as an implicit signal of visual preference. For a given user, we denote this history as an ordered sequence

$$\mathcal{H}_{u} \;=\; \langle x_{1}, x_{2}, \ldots, x_{n}\rangle, \tag{2}$$

where each $x_{k}$ is an item the user has previously engaged with (e.g., clicked, viewed, or purchased), associated with its visual and semantic attributes. We treat $\mathcal{H}_{u}$ as *preference evidence*: a record from which the user's latent visual taste can be inferred and projected forward into a creative direction.
Task Formulation: Behavior-Conditioned Instruction Generation. Given a user's history $\mathcal{H}_{u}$, our goal is to learn an instruction generator $f_{\theta}$ that produces a free-form textual instruction:

$$\mathcal{I} \;=\; f_{\theta}(\mathcal{H}_{u}), \quad \max_{\theta}\; P_{\theta}\!\left(\mathcal{I}\mid\mathcal{H}_{u}\right). \tag{3}$$

We say that an instruction $\mathcal{I}$ is *generation-ready* if it satisfies two properties: (1) *preference alignment*: $\mathcal{I}$ faithfully captures the user-specific visual preferences evidenced by $\mathcal{H}_{u}$; and (2) *generation feasibility*: $\mathcal{I}$ is sufficiently concrete and visually grounded to serve as an effective conditioning signal for the downstream generator $g_{\phi}$. The objective of this work is to design $f_{\theta}$ such that, for arbitrary user histories, it consistently emits generation-ready instructions, thereby bridging implicit user behavior and high-quality personalized visual synthesis.
## 3 Method
This section presents the technical details of NaviGen, with the overall architecture shown in Figure [2](https://arxiv.org/html/2606.24196#S3.F2).
### 3.1 Dual-Identifier Behavior Encoding
To make user behavior $\mathcal{H}_{u}$ legible to an LLM, each entry $x_{k}\in\mathcal{H}_{u}$ must be serialized into tokens within the model's vocabulary. A naive choice, directly feeding captions or metadata, is verbose and injects redundant noise that slows optimization and blurs preference signals. We therefore encode each entry with a compact *dual identifier*, decoupling sequence-level identity from semantic grounding so that neither role compromises the other.
Collaborative Identifier (CID). Inspired by LLMs for collaborative filtering (Deng et al., [2025](https://arxiv.org/html/2606.24196#bib.bib1)), the CID encodes an entry's role within user interaction sequences, distilling collaborative patterns observed across the consumer-content interaction graph. The entry's metadata $m_{v}$ is first mapped to a continuous embedding $\mathbf{e}_{v}=\psi(m_{v})$ via a pretrained embedding model $\psi$, then quantized through a multi-layer residual K-means process:

$$s_{\ell} = \operatorname*{arg\,min}_{k}\,\|\mathbf{r}_{\ell}-\mathbf{c}_{\ell}^{k}\|^{2}, \quad \mathbf{r}_{\ell+1} = \mathbf{r}_{\ell}-\mathbf{c}_{\ell}^{s_{\ell}}, \tag{4}$$

where $\mathbf{r}_{1}=\mathbf{e}_{v}$, $\mathbf{c}_{\ell}^{k}$ denotes the $k$-th centroid in the $\ell$-th codebook, and $s_{\ell}\in\{1,\ldots,K_{\text{cb}}\}$ is the discrete code assigned to level $\ell$. The resulting CID is a three-level residual token sequence:

$$\text{cid}(v)=\langle\, s_{1}(v),\; s_{2}(v),\; s_{3}(v)\,\rangle. \tag{5}$$

This hierarchy enables multi-granularity modeling and partial-credit supervision, as matching any level yields a meaningful signal. Each CID token is added to the vocabulary and initialized via dedicated embedding training (Section [3.2.1](https://arxiv.org/html/2606.24196#S3.SS2.SSS1)).
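The residual assignment in Eq. (4) can be illustrated with a minimal sketch. It assumes the per-level codebooks have already been fitted (the K-means fitting itself is omitted), and the toy centroid values and the helper name `residual_quantize` are hypothetical, chosen only for illustration. Codes are 0-based here, whereas the text indexes them from 1.

```python
import numpy as np

def residual_quantize(e, codebooks):
    """Assign one code per level by nearest centroid on the running residual."""
    residual = np.asarray(e, dtype=float)
    codes = []
    for centroids in codebooks:  # one (K_cb, d) array per level
        dists = np.sum((centroids - residual) ** 2, axis=1)
        s = int(np.argmin(dists))            # nearest centroid at this level
        codes.append(s)
        residual = residual - centroids[s]   # pass the remainder to the next level
    return tuple(codes)

# Toy, hand-written codebooks (hypothetical values, not learned by K-means).
codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.1, 0.1], [-0.1, -0.1]]),
]
cid = residual_quantize([10.9, 10.2], codebooks)
print(cid)
```

Each level only has to explain what the previous levels left over, which is why earlier codes capture coarse structure and later codes capture finer detail.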
Textual Identifier (TID). Unlike existing work that treats TIDs as fixed-length targets for next-item prediction (Zhang et al., [2026](https://arxiv.org/html/2606.24196#bib.bib30)), we note that semantically equivalent textual variants may correspond to different terms, making exact next-TID prediction overly restrictive; meanwhile, entries vary in semantic complexity and thus require different numbers of terms. We therefore construct variable-length TIDs by imposing only an upper bound on the number of terms:

$$\text{tid}(v)=[t_{1},t_{2},\ldots,t_{m}],\quad m\leq 10, \tag{6}$$

where each $t_{k}$ is a concise phrase capturing a core semantic dimension (e.g., subject category, key attribute). Terms are ordered by importance and produced by compressing the original caption through an LLM with controlled output constraints. Unlike free-form text, the TID provides a compact, deduplicated, and domain-stable semantic signature that serves as the bridge between sequence-level preference modeling and instruction generation.
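As a rough illustration of the constraints in Eq. (6), the sketch below post-processes a list of candidate terms so that the result is deduplicated, order-preserving, and capped at ten terms. This is an assumption about how such constraints could be enforced after the LLM compression step; the paper does not specify this procedure, and `normalize_tid` is a hypothetical helper.

```python
def normalize_tid(terms, max_terms=10):
    """Deduplicate terms (case- and whitespace-insensitive), keep order, cap length."""
    seen, out = set(), []
    for term in terms:
        key = " ".join(term.lower().split())  # collapse whitespace, lowercase
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out[:max_terms]

tid = normalize_tid(["Fantasy Anime", "elf", "fantasy  anime", ""])
print(tid)
```

Because terms are assumed to arrive ordered by importance, truncation drops only the least important ones.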
Figure 2: Overall architecture of the proposed NaviGen framework for personalized multimodal generation.
### 3.2 Reasoning-Infused Supervised Tuning
NaviGen employs two-stage supervised fine-tuning that progressively builds from identifier embeddings to full reasoning-capable generation.
#### 3.2.1 Cold-Start Embedding Initialization
To prevent randomly initialized CID embeddings from destabilizing pretrained weights through noisy gradients, we decouple representation acquisition from backbone adaptation: all pretrained weights are frozen, and only parameters tied to the new tokens are updated. Let $\mathcal{D}_{\text{init}}$ denote the auxiliary training set with token sequences $\mathbf{s}_{i}$ and target tokens $y_{i}$. The embedding initialization loss is:

$$\mathcal{L}_{\text{init}}=-\sum_{i}\sum_{t}\log p(y_{i,t}\mid\mathbf{s}_{i,<t};\mathcal{E}_{\text{CID}},\mathbf{W}_{\text{out}}), \tag{7}$$

where $\mathcal{E}_{\text{CID}}$ and $\mathbf{W}_{\text{out}}$ are the learnable CID embeddings and output projection layer. Three auxiliary tasks establish bidirectional CID-TID alignment:
CID2TID. Mapping a CID to its corresponding TID, grounding behavioral codes in semantic terms.
TID2CID. Inverse TID-to-CID mapping, constructing behavioral identifiers from semantic signals.
CID2CID. Predicting a target CID from a history of CIDs, capturing sequential behavioral patterns.
#### 3.2.2 Reasoning-Augmented Full Finetuning
The initialization stage equips the model with stable CIDs, but two capabilities essential for behavior-conditioned instruction generation remain absent: (i) translating inferred preferences into generation-ready instructions, and (ii) reasoning over how user interests evolve along the interaction history. To instill both, we unfreeze all parameters and augment the existing objectives with a new CID2INS task and chain-of-thought supervision.
CID2INS. AIGC instructions are synthesized through an evolution-inspired search that progressively refines candidate instructions toward user-aligned visual semantics. Starting from three founder strategies (conservative, balanced, exploratory), each round selects the two strongest candidates via a multi-dimensional judge and produces two offspring through crossover and controlled mutation. Given the candidate trajectory $\{\mathcal{I}^{(r)}\}_{r=0}^{R}$, the final instruction is:

$$\mathcal{I}^{\star}=\operatorname*{arg\,max}_{\mathcal{I}\in\mathcal{P}_{\text{f}}}\;\mathcal{S}_{\text{judge}}(\mathcal{I}), \tag{8}$$

where $\mathcal{P}_{\text{f}}$ is the final population and $\mathcal{S}_{\text{judge}}$ is the LLM-based multi-dimensional scorer. The model is jointly supervised to predict the target TID, anchoring the generated instruction to the intended visual semantics and preventing semantic drift.
Reasoning. Chain-of-thought traces are distilled from a teacher model. For CID2CID, the teacher articulates how user preferences evolve along the historical identifier sequence without explicitly referencing the target item. Formally, given an identifier sequence $h_{1:n}=(h_{1},\ldots,h_{n})$ where $h_{i}$ denotes the CID at step $i$, the teacher distills consecutive preference shifts into a reasoning chain:

$$\mathcal{T}\big(\{h_{i-1}\rightarrow h_{i}\}_{i=2}^{n}\big), \tag{9}$$

where $\mathcal{T}(\cdot)$ aggregates the evolving preference trajectory across the interaction history. For CID2INS, the reasoning summarizes the full evolutionary trajectory, that is, how candidate instructions converged toward the target semantics across rounds. Formally, starting from founder strategies that form the initial population $\mathcal{P}^{(0)}$, each round $r\in\{1,\dots,R\}$ selects the two highest-scoring candidates under $\mathcal{S}_{\text{judge}}$ and updates the population via elitism, crossover, and mutation:

$$\mathcal{P}^{(r)}=\bigl\{\,\mathcal{I}^{(r)}_{\text{elite}},\;\mathcal{I}^{(r)}_{\text{elite}}\oplus\mathcal{I}^{(r)}_{\text{mate}},\;\mu(\mathcal{I}^{(r)}_{\text{elite}})\,\bigr\}, \tag{10}$$

where $\mathcal{I}^{(r)}_{\text{elite}}$ and $\mathcal{I}^{(r)}_{\text{mate}}$ are the top-2 candidates in $\mathcal{P}^{(r-1)}$, $\oplus$ fuses the elite's target-facing anchor with the mate's visual expressiveness, and $\mu(\cdot)$ applies a controlled mutation that preserves the semantic core. The resulting score trajectory across rounds teaches the model to connect evolutionary search dynamics with the final output.
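The population update in Eq. (10) and the final selection in Eq. (8) can be sketched as a short loop. In the paper the judge, crossover, and mutation are LLM-driven; here they are passed in as plain callables, and the `toy_*` stand-ins below are hypothetical string operations used only to make the sketch runnable.

```python
def evolve_instruction(founders, judge, crossover, mutate, rounds=3):
    """Elitism + crossover + mutation over a three-member population."""
    population = list(founders)
    for _ in range(rounds):
        ranked = sorted(population, key=judge, reverse=True)
        elite, mate = ranked[0], ranked[1]           # top-2 of previous round
        population = [elite, crossover(elite, mate), mutate(elite)]
    return max(population, key=judge)                # best of final population

# Hypothetical stand-ins for the LLM judge and LLM-based operators.
TARGET_TERMS = {"anime", "romance", "couple"}

def toy_judge(ins):
    return len(TARGET_TERMS & set(ins.split()))

def toy_crossover(a, b):
    return " ".join(dict.fromkeys(a.split() + b.split()))  # ordered union of words

def toy_mutate(a):
    return a if "couple" in a.split() else a + " couple"

best = evolve_instruction(
    ["anime elf", "anime romance", "fantasy battle"],
    toy_judge, toy_crossover, toy_mutate, rounds=2,
)
print(best)
```

Keeping the elite unchanged in every round means the best judge score never decreases across rounds, which is what makes the score trajectory usable as a reasoning signal.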
Let $\mathcal{D}_{\text{full}}$ denote the reasoning-augmented training set with input sequences $\mathbf{s}_{i}$ and output sequences $\mathbf{y}_{i}$ containing reasoning traces and structured answers. The full finetuning objective is:

$$\mathcal{L}_{\text{full}}=-\sum_{i}\sum_{t}\log p(\mathbf{y}_{i,t}\mid\mathbf{s}_{i,<t};\theta), \tag{11}$$

where $p(\cdot\mid\mathbf{s}_{i,<t};\theta)$ denotes the token-level probability under all trainable parameters $\theta$. The resulting model jointly performs preference-grounded reasoning and behavior-conditioned instruction generation for downstream synthesis.
### 3.3 Multi-Task Reinforcement Learning
Supervised fine-tuning produces a competent generator but does not directly optimize for the quality of personalized outputs. NaviGen applies GRPO to refine the policy $\pi_{\theta}$ under task-specific reward signals that better reflect downstream objectives.
Hierarchical CID Reward. The CID encodes item behavior through three residual levels of decreasing granularity. A match at any level provides a meaningful signal, but coarse agreement is weighted more heavily than fine precision, reflecting the intuition that predicting the right item family matters more than nailing the exact variant. Formally, given a predicted CID $\hat{s}=(\hat{s}_{a},\hat{s}_{b},\hat{s}_{c})$ and ground truth $\tilde{s}=(\tilde{s}_{a},\tilde{s}_{b},\tilde{s}_{c})$, the hierarchical reward is:

$$R_{\text{cid}}=\sum_{\tau\in\{a,b,c\}}w_{\tau}\cdot\mathbb{I}[\hat{s}_{\tau}=\tilde{s}_{\tau}], \tag{12}$$

where $w_{a}$, $w_{b}$, and $w_{c}$ enforce the coarse-to-fine weighting. A small bonus rewards predictions that remain within the valid CID vocabulary even when they do not match the ground truth, encouraging the model to stay within the learned identifier space.
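A minimal sketch of the level-wise reward in Eq. (12) follows. The weight values `(0.5, 0.3, 0.2)` are hypothetical placeholders that merely respect the coarse-to-fine ordering; the paper does not report the actual weights here, and the vocabulary bonus is left out.

```python
def hierarchical_cid_reward(pred, target, weights=(0.5, 0.3, 0.2)):
    """Sum the weights of the levels where predicted and true codes agree."""
    return sum(w for w, p, t in zip(weights, pred, target) if p == t)

# Coarse and middle levels match, the finest level does not.
r_partial = hierarchical_cid_reward((3, 7, 1), (3, 7, 9))
# Only the finest level matches, so the reward is small.
r_fine_only = hierarchical_cid_reward((4, 2, 9), (3, 7, 9))
print(r_partial, r_fine_only)
```

With decreasing weights, getting the coarse code right earns more than getting only the fine code right, which is the partial-credit behavior described above.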
Instruction-Aware Reward. For CID2INS, the reward combines instruction quality assessment with a triangular self-consistency check. An LLM-based judge evaluates the instruction along four dimensions (specificity, creativity, content quality, and visual generatability), aggregated into $R_{\text{qual}}$. Beyond standalone quality, a closed-loop alignment enforces three mutually reinforcing signals: the instruction must anchor to target semantics ($R_{\text{ins}\leftrightarrow\tilde{t}}$), remain self-consistent with its own prediction ($R_{\text{ins}\leftrightarrow\hat{t}}$), and the prediction itself must align with ground truth ($R_{\hat{t}\leftrightarrow\tilde{t}}$). The combined reward is:

$$R_{\text{align}}=\gamma_{1}R_{\text{ins}\leftrightarrow\tilde{t}}+\gamma_{2}R_{\text{ins}\leftrightarrow\hat{t}}+\gamma_{3}R_{\hat{t}\leftrightarrow\tilde{t}}, \tag{13}$$

$$R_{\text{ins}}=\lambda_{1}\cdot R_{\text{qual}}+\lambda_{2}\cdot R_{\text{align}}, \tag{14}$$

where $\gamma_{1},\gamma_{2},\gamma_{3}$ balance the three alignment signals, and $\lambda_{1},\lambda_{2}$ weight quality against consistency.
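The arithmetic of Eqs. (13) and (14) is sketched below, taking the four component scores as already-computed numbers in $[0,1]$. How each pairwise score is obtained is not covered here, and the default `gammas` and `lambdas` are hypothetical values, not the paper's settings.

```python
def instruction_reward(r_qual, r_ins_tgt, r_ins_pred, r_pred_tgt,
                       gammas=(0.4, 0.3, 0.3), lambdas=(0.5, 0.5)):
    """Combine judge quality with the three-way alignment signal."""
    g1, g2, g3 = gammas
    l1, l2 = lambdas
    r_align = g1 * r_ins_tgt + g2 * r_ins_pred + g3 * r_pred_tgt  # Eq. (13)
    return l1 * r_qual + l2 * r_align                              # Eq. (14)

# A fluent instruction that matches the target but whose predicted TID is off.
r_ins = instruction_reward(r_qual=0.8, r_ins_tgt=1.0, r_ins_pred=0.5, r_pred_tgt=0.0)
print(r_ins)
```

In this example the instruction scores well on quality, yet the reward is pulled down because its own predicted semantics disagree with both the instruction and the ground truth.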
Optimization Objective. We optimize the GRPO objective over a group of $G$ completions:

$$\mathcal{J}_{\text{grpo}}=\mathbb{E}_{\mathbf{q}}\bigl[\,\mathcal{L}_{\text{clip}}-\kappa\,\mathbb{D}_{\text{KL}}(\pi_{\theta}\,\|\,\pi_{\text{ref}})\,\bigr], \tag{15}$$

where $\mathcal{L}_{\text{clip}}$ is the clipped surrogate loss averaged over the group, $\rho_{i}$ is the importance sampling ratio, $\epsilon$ the clipping range, and $\kappa$ the KL penalty weight. The group-relative advantage $\hat{A}_{i}$ is driven by a composite reward:

$$R_{\text{task}}=\begin{cases}w_{\text{cid}}R_{\text{cid}}+R_{\text{bonus}},\\[4pt] w_{\text{ins}}R_{\text{ins}}+R_{\text{format}}-R_{\text{penalty}},\end{cases} \tag{16}$$

where $R_{\text{bonus}}$ encourages vocabulary-range adherence, $R_{\text{format}}$ verifies JSON parseability and reasoning completeness, and $R_{\text{penalty}}$ penalizes structural violations. The optimized instruction $\tilde{\mathcal{I}}$ from this stage then serves as the control signal for the multimodal generator $g_{\phi}$ defined in Section [2](https://arxiv.org/html/2606.24196#S2).
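The text does not spell out how the group-relative advantage $\hat{A}_{i}$ is computed from the rewards. The sketch below uses the commonly cited GRPO normalization, subtracting the group mean and dividing by the group standard deviation, which is an assumption here rather than a confirmed detail of NaviGen. The small `eps` term is likewise an illustrative choice to avoid division by zero.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G completions
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards R_task for a group of G = 4 completions of the same prompt.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)
```

Because advantages are relative to the group, a prompt whose completions all receive the same reward contributes no gradient signal under this normalization.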
## 4 Evaluation
Table 1: Statistics of the experimental datasets.

We evaluate NaviGen by answering five research questions. RQ1 compares personalized AIGC generation performance. RQ2 studies the contribution of key modules. RQ3 examines collaborative identifier prediction in the CID space. RQ4 analyzes the effect of SFT and RL steps. RQ5 examines qualitative instruction cases.

Table 2: Overall performance comparison on personalized AIGC instruction generation. Aesthetic and Novelty are averaged over three runs. Excluding Oracle, boldface marks the best image-generation result, while boldface with superscript $\star$ marks the best video-generation result.

### 4.1 Experimental Settings
#### 4.1.1 Datasets and Evaluation Protocols
Table [1](https://arxiv.org/html/2606.24196#S4.T1) reports statistics of the three datasets: Product and Games are Amazon review domains with captions derived from product metadata, while Short Videos comes from the OpenOneRec short-video benchmark, where each item has a textual caption (Zhou et al., [2025](https://arxiv.org/html/2606.24196#bib.bib38)). Each data instance is formed by pairing a user's historical interaction sequence with a target item, and all instances are split into training, validation, and test sets with an 8:1:1 ratio; all methods are evaluated by Recall@K/NDCG@K. For personalized generation, we sample 1,000 image-generation and 100 video-generation cases from the test split of each dataset, while keeping the AIGC instruction modality consistent across methods. We compare NaviGen and baselines along four dimensions: Consistency, measured by image-instruction CLIPScore and, for videos, average CLIPScore over one frame per second (Hessel et al., [2021](https://arxiv.org/html/2606.24196#bib.bib33)); Relevance, measured by cosine similarity between the generated instruction and the equal-weight embedding of history and target item captions; Aesthetic, where a VLM judge scores visual quality in $[0,1]$ (Kao et al., [2017](https://arxiv.org/html/2606.24196#bib.bib31)); and Novelty, where the same judge scores novelty and interestingness in $[0,1]$ (Vargas and Castells, [2011](https://arxiv.org/html/2606.24196#bib.bib32)). Details are provided in Appendix [A.3](https://arxiv.org/html/2606.24196#A1.SS3).
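For reference, Recall@K and NDCG@K take a simple form when each instance has exactly one ground-truth target, as in the history-target pairing described above. The sketch below shows that single-target case; the function names and the toy ranking are illustrative, and the paper's exact evaluation code may differ.

```python
import math

def recall_at_k(ranked, target, k):
    """1 if the target appears in the top-k ranked candidates, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG is the discounted hit."""
    top_k = ranked[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)           # 0-based position of the hit
    return 1.0 / math.log2(rank + 2)

ranked = ["cid_a", "cid_b", "cid_c"]
print(recall_at_k(ranked, "cid_b", 2), ndcg_at_k(ranked, "cid_b", 2))
```

Recall@K only records whether the target was retrieved, while NDCG@K additionally rewards placing it nearer the top of the list.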
#### 4.1.2 Baseline Methods
NaviGen is compared with a comprehensive set of baselines, including i) Personalized Generation Methods: PMG (Shen et al., [2024](https://arxiv.org/html/2606.24196#bib.bib36)), Pigeon (Xu et al., [2025](https://arxiv.org/html/2606.24196#bib.bib28)), RAGAR (Ling et al., [2026](https://arxiv.org/html/2606.24196#bib.bib26)), CIPHER (Gao et al., [2024](https://arxiv.org/html/2606.24196#bib.bib34)), PROSE (Aroca-Ouellette et al., [2025](https://arxiv.org/html/2606.24196#bib.bib12)), TRIPLE (Noh et al., [2026](https://arxiv.org/html/2606.24196#bib.bib35)); ii) Collaborative Filtering Methods: SASRec (Kang and McAuley, [2018](https://arxiv.org/html/2606.24196#bib.bib37)), TIGER (Rajput et al., [2023](https://arxiv.org/html/2606.24196#bib.bib39)), LC-Rec (Zheng et al., [2024](https://arxiv.org/html/2606.24196#bib.bib2)), OpenOneRec-Pretrain (Zhou et al., [2025](https://arxiv.org/html/2606.24196#bib.bib38)); and iii) Prompting and Reference Baselines: NPC removes user-specific evidence and relies only on a generic task prompt, while Oracle conditions on the ground-truth target caption or TID as a non-deployable target-conditioned reference.



Figure 3: Ablation study on image and video generation. Panels: (a) Image Generation; (b) Video Generation.
#### 4.1.3 Implementation Details
All baselines are implemented following their original papers. NaviGen uses Qwen3-1.7B as the base language model. Unless otherwise specified, we use Qwen3.5-Flash for auxiliary method components, including Instruction-Aware Reward, TID, and reasoning generation. For both cold-start embedding initialization and full-parameter finetuning, we set the learning rate to $5\times 10^{-4}$, the warmup ratio to $0.03$, and train for 3 epochs, using AdamW optimization with a cosine learning-rate schedule, a maximum sequence length of 2048, and packed SFT examples. For reinforcement learning, we set the group size to 8, the maximum sequence length to 2048, the weight decay to 0.01, the learning rate to $3\times 10^{-4}$, the number of training steps to 600, and the batch size to 480. We employ ViT-B/32 for image encoding and text-embedding-v4 for text encoding. We use GLM-5V-Turbo as the VLM judge, Z-Image-Turbo for 512$\times$512 image generation, and Open-Sora 1.3 for 720p video generation at 24 fps with a total of 81 frames.
Table 3: Performance comparison on collaborative identifier prediction. Best results are highlighted in bold.
### 4.2 Overall Performance Comparison (RQ1)
We compare NaviGen against representative baselines on personalized image and video generation; Table [2](https://arxiv.org/html/2606.24196#S4.T2) reports the results, with Oracle as a non-deployable target-conditioned reference. Consistent Image-Level Gains. NaviGen achieves the best image-generation performance on most non-oracle comparisons, leading across all metrics on Games and Short Videos and across Novelty, Aesthetic, and Relevance on Product. The only exception is Product Consistency, suggesting that NaviGen mainly improves personalized target alignment, creative specificity, and visual quality while keeping image-instruction consistency competitive. Video-Level Transfer and Trade-offs. NaviGen obtains the best video-generation result in 9 of 12 non-oracle comparisons, indicating that video generation involves a trade-off among frame-level consistency, novelty, visual quality, and preference-specific relevance. Oracle Reference. Oracle attains the highest Relevance by directly conditioning on target semantics, yet NaviGen surpasses it on image Aesthetic for Product and Short Videos and nearly matches its video Consistency, suggesting that target semantics alone do not guarantee superior perceptual generation quality. We provide additional human evaluation details in Appendix [A.2](https://arxiv.org/html/2606.24196#A1.SS2).
Table 4: Ablation study on CID prediction modeling.
### 4.3 Ablation Study (RQ2)
We conduct ablation studies from both personalized generation and CID-space collaborative modeling perspectives, with results shown in Figure [3](https://arxiv.org/html/2606.24196#S4.F3) and Table [4](https://arxiv.org/html/2606.24196#S4.T4). Overall Effect. The full NaviGen variant achieves the strongest or most balanced results across generation relevance, novelty, and CID modeling metrics, indicating that the proposed modules contribute complementary signals. Identifier Grounding. Removing TID grounding or CID initialization weakens both generation and CID modeling performance, while the initialization-only variant performs worst, showing that stable collaborative identifier adaptation must be followed by semantic recovery and full collaborative trajectory modeling. Reasoning and Reward Alignment. Removing reasoning supervision hurts CID trajectory modeling and generation quality, and ablating GRPO leads to a weaker relevance–novelty trade-off and lower CID-space retrieval quality, confirming that transition-level reasoning and multi-task reward alignment jointly improve collaborative-sequence consistency and generation readiness.
### 4.4 Collaborative Identifier Prediction (RQ3)
We further evaluate NaviGen on CID2CID prediction, where the model predicts the next CID from a user's historical CID sequence. This task is not intended to replace TID-based semantic prediction; rather, it isolates whether CID provides an additional view for discovering collaborative signals that are not directly exposed by textual identifiers. Table [3](https://arxiv.org/html/2606.24196#S4.T3) reports Recall and NDCG on the three datasets. NaviGen achieves the best results on most metrics, with clear advantages on Product and Short Videos, while OneRec remains stronger on the top-10 metrics of Games. Although NaviGen is slightly weaker on Games top-10 retrieval, its advantage at R@20 and N@20 indicates broader top-k coverage in the collaborative identifier space, and its consistent gains on Product and Short Videos demonstrate robustness under different sparsity and item-distribution conditions.
Figure 4: Hyperparameter study on GRPO steps\. Panels: \(a\) Product Generation; \(b\) Product Modeling; \(c\) Short Videos Generation; \(d\) Short Videos Modeling\.
### 4\.5 Hyperparameter Study \(RQ4\)
We analyze the training\-step sensitivity of NaviGen in Table [5](https://arxiv.org/html/2606.24196#S4.T5) and Figure [4](https://arxiv.org/html/2606.24196#S4.F4)\.
SFT Steps\. Increasing supervised tuning mainly improves R@20 across datasets, while consistency only fluctuates within a narrow range, suggesting that longer SFT strengthens CID\-space collaborative trajectory modeling without degrading instruction coherence\.
GRPO Steps\. RL updates have a clearer effect on generation quality: relevance generally improves toward later steps and novelty remains stable or slightly increases, despite minor mid\-training fluctuations\.
CID Prediction Stability\. CID collaborative modeling metrics are less sensitive to GRPO, showing mild Product fluctuations and a Short Videos plateau, suggesting that later RL mainly affects generation\-side alignment rather than substantially changing CID\-space collaborative modeling\.
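For readers unfamiliar with GRPO, the following minimal sketch shows the group-relative advantage normalization that GRPO-style training generally relies on: several responses are sampled for the same prompt, and each response's reward is standardized against its group. It does not reproduce the hierarchical or self-consistent rewards used in this paper; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt.

    Responses scoring above the group mean get a positive advantage,
    those below get a negative one. eps (assumed) avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative within a group, a group whose responses all receive the same reward contributes no learning signal.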
### 4\.6 Case Study \(RQ5\)
Figure [5](https://arxiv.org/html/2606.24196#S4.F5) presents a qualitative comparison with TRIPLE on history\-guided generation\. The user’s history evolves from humorous elf interactions to conflict\-centered fantasy anime, while the target TID further introduces romance and emotional conflict\. TRIPLE captures coarse historical cues such as anime and elf, but its instruction remains generic and fails to reflect the target\-side romantic transition\. In contrast, NaviGen leverages CID\-level collaborative interaction transitions and TID\-level semantic grounding to infer the next\-interest direction, preserving the anime/fantasy context while specifying visual cues such as a student couple, a tender moment, and a romantic atmosphere\. The image achieves higher scores across all metrics, suggesting NaviGen distills collaborative interaction trajectories into generation\-ready visual conditions rather than merely copying historical topics\.
Table 5: Hyperparameter study on SFT checkpoint steps\. Ours corresponds to 7128 steps\.
Figure 5: Case study on TRIPLE and our NaviGen\.
## 5 Related Work
Personalized Generation\. Personalized generation adapts content to individual users, styles, or contexts, with one line personalizing visual synthesis from histories, retrieved evidence, or recommendation\-guided signals \(Shen et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib36); Xu et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib28); Ling et al\., [2026](https://arxiv.org/html/2606.24196#bib.bib26)\)\. Another line constructs user\-specific profiles or alignment signals from edits, demonstrations, or behavioral theories \(Gao et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib34); Aroca\-Ouellette et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib12); Noh et al\., [2026](https://arxiv.org/html/2606.24196#bib.bib35)\)\. While these methods improve user responsiveness, they often rely on profile\-like descriptions rather than generation\-ready instructions and provide limited coupling between compact behavioral modeling and explicit semantic grounding; in contrast, NaviGen converts implicit interaction histories into executable AIGC instructions for downstream image and video synthesis\.
Multimodal Generative Models\. Multimodal generative models provide the backbone for turning natural\-language instructions into images, videos, and other visual media \(Xie et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib24); Huang et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib23)\)\. VLMs increasingly support reasoning and content creation \(Hurst et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib16); Jin et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib17)\), while AIGC systems continue to improve prompt alignment, visual fidelity, temporal coherence, and motion control \(Sun et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib18); Seedream et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib19); Seedance et al\., [2026](https://arxiv.org/html/2606.24196#bib.bib21)\)\. However, these models are mainly optimized to follow explicit prompts rather than infer generation conditions from sparse and noisy user behaviors; NaviGen is therefore complementary, translating behavior\-grounded preferences into executable instructions for downstream multimodal synthesis\.
Behavioral Preference Modeling\. Preference modeling learns user interests from behavioral data, with graph and intent\-aware models capturing higher\-order user–item relations \(He et al\., [2020](https://arxiv.org/html/2606.24196#bib.bib3); Zhang et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib4); Zhao et al\., [2025a](https://arxiv.org/html/2606.24196#bib.bib5)\), and sequential or multimodal methods modeling temporal dynamics and heterogeneous content features \(Ren et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib6); Fu et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib7); Lin et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib8)\)\. Recent LLM\-augmented methods further use language models to interpret item descriptions, refine candidate scoring, simulate users, or calibrate preference decisions \(Sun et al\., [2023](https://arxiv.org/html/2606.24196#bib.bib9); Qin et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib10); Ye et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib11)\)\. NaviGen instead transforms behavior\-derived preference signals into personalized creation by connecting behavior\-aware identifier prediction with generation\-ready instruction\.
## 6 Conclusion
We presented NaviGen, a behavior\-aware framework that turns user interaction history into executable instructions for personalized AIGC\. Built on a dual\-identifier item representation, NaviGen couples reasoning\-augmented supervised tuning with multi\-task reinforcement learning to bridge user preference modeling and downstream generation control\. By separating CID\-based behavioral coding from TID\-based semantic grounding, NaviGen treats preference prediction as an intermediate reasoning step rather than an end in itself\. Experiments across product, game, and short\-video domains demonstrate consistent improvements in personalized image and video generation, instruction quality, and collaborative identifier prediction, establishing a unified path from implicit user behavior to controllable multimodal synthesis\.
## 7 Limitations
Personalized generation is bounded by how user preference evidence is collected and used in real applications\. Interaction histories can contain sensitive behavioral signals, and users may not always expect such signals to be transformed into creative generation conditions\. Practical deployments should therefore require explicit opt\-in consent, minimize retained histories, anonymize or aggregate logs whenever possible, and provide clear controls for inspecting, editing, or deleting personalized preference profiles\. We view personalized instruction generation as an assistive layer for creative control rather than a replacement for deployment\-time safety governance\.
## 8 Ethical Considerations
This work uses existing datasets and item captions for offline research evaluation, without collecting new personal data, inferring sensitive demographic attributes, or deploying personalized generation to real users\. The generated instructions are evaluated only as experimental outputs for studying preference\-to\-instruction modeling\. Beyond the privacy and deployment precautions discussed in the limitations, we do not identify additional ethical concerns specific to this work\.
## References
- Unictokens: boosting personalized understanding and generation via unified concept tokens\.Advances in Neural Information Processing Systems38,pp\. 144638–144664\.Cited by:[§1](https://arxiv.org/html/2606.24196#S1.p2.1)\.
- S\. Aroca\-Ouellette, N\. Mackraz, B\. Theobald, and K\. Metcalf \(2025\)Aligning LLMs by predicting preferences from user writing samples\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 1690–1721\.Cited by:[5th item](https://arxiv.org/html/2606.24196#A1.I1.i5.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- Y\. Dang, C\. Qian, X\. Luo, J\. Fan, Z\. Xie, R\. Shi, W\. Chen, C\. Yang, X\. Che, Y\. Tian,et al\.\(2026\)Multi\-agent collaboration via evolving orchestration\.Advances in neural information processing systems38,pp\. 165025–165059\.Cited by:[§1](https://arxiv.org/html/2606.24196#S1.p2.1)\.
- J\. Deng, S\. Wang, K\. Cai, L\. Ren, Q\. Hu, W\. Ding, Q\. Luo, and G\. Zhou \(2025\)OneRec: unifying retrieve and rank with generative recommender and iterative preference alignment\.arXiv preprint arXiv:2502\.18965\.Cited by:[§3\.1](https://arxiv.org/html/2606.24196#S3.SS1.p2.3)\.
- J\. Fu, X\. Ge, X\. Xin, A\. Karatzoglou, I\. Arapakis, K\. Zheng, Y\. Ni, and J\. M\. J\. Joemon \(2025\)Efficient and effective adaptation of multimodal foundation models in sequential recommendation\.IEEE TKDE\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- G\. Gao, A\. Taymanov, E\. Salinas, P\. Mineiro, and D\. Misra \(2024\)Aligning llm agents by learning latent preference from user edits\.Advances in neural information processing systems37,pp\. 136873–136896\.Cited by:[4th item](https://arxiv.org/html/2606.24196#A1.I1.i4.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- X\. He, K\. Deng, X\. Wang, Y\. Li, Y\. Zhang, and M\. Wang \(2020\)Lightgcn: simplifying and powering graph convolution network for recommendation\.InSIGIR,pp\. 639–648\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- J\. Hessel, A\. Holtzman, M\. Forbes, R\. Le Bras, and Y\. Choi \(2021\)Clipscore: a reference\-free evaluation metric for image captioning\.InEMNLP,pp\. 7514–7528\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.24196#S4.SS1.SSS1.p1.2)\.
- W\. Huang, S\. Chen, Z\. Xie, S\. Cao, S\. Tang, Y\. Shen, Q\. Yin, W\. Hu, X\. Wang, Y\. Tang,et al\.\(2025\)Interleaving reasoning for better text\-to\-image generation\.arXiv preprint arXiv:2509\.06945\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- Z\. Jin, W\. Tao, Y\. Li, Y\. Yang, C\. Han, S\. Li, and L\. Liu \(2025\)Large vison\-language foundation model in baidu aigc image advertising\.InKDD,pp\. 2303–2312\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- W\. Kang and J\. McAuley \(2018\)Self\-attentive sequential recommendation\.In2018 IEEE international conference on data mining \(ICDM\),pp\. 197–206\.Cited by:[1st item](https://arxiv.org/html/2606.24196#A1.I2.i1.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1)\.
- Y\. Kao, R\. He, and K\. Huang \(2017\)Deep aesthetic quality assessment with semantic information\.TIP26\(3\),pp\. 1482–1495\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.24196#S4.SS1.SSS1.p1.2)\.
- X\. Lin, R\. Liu, Y\. Cao, L\. Zou, Q\. Li, Y\. Wu, Y\. Liu, D\. Yin, and G\. Xu \(2025\)Contrastive modality\-disentangled learning for multimodal recommendation\.ACM TOIS43\(3\),pp\. 1–31\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- R\. Ling, W\. Wang, Y\. Liu, G\. Guo, H\. Liu, J\. Lu, Q\. Zhang, Y\. Xu, S\. Lu, and Y\. Wang \(2026\)RAGAR: retrieval augmented personalized image generation guided by recommendation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 15278–15286\.Cited by:[3rd item](https://arxiv.org/html/2606.24196#A1.I1.i3.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p2.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- T\. Noh, S\. Jin, H\. Yeo, and K\. Han \(2026\)TRIPLE: theory\-driven integration of planned and habitual behaviors for llm\-based personalization\.InProceedings of the 40th AAAI Conference on Artificial Intelligence \(AAAI\-26\),Cited by:[6th item](https://arxiv.org/html/2606.24196#A1.I1.i6.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- W\. Qin, Y\. Xu, W\. Yu, C\. Shen, X\. Zhang, M\. He, J\. Fan, and J\. Xu \(2025\)More: a mixture of reflectors framework for large language model\-based sequential recommendation\.InRecsys,pp\. 299–308\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- S\. Rajput, N\. Mehta, A\. Singh, R\. Hulikal Keshavan, T\. Vu, L\. Heldt, L\. Hong, Y\. Tay, V\. Tran, J\. Samost,et al\.\(2023\)Recommender systems with generative retrieval\.Advances in Neural Information Processing Systems36,pp\. 10299–10315\.Cited by:[2nd item](https://arxiv.org/html/2606.24196#A1.I2.i2.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1)\.
- X\. Ren, W\. Wei, L\. Xia, L\. Su, S\. Cheng, J\. Wang, D\. Yin, and C\. Huang \(2024\)Representation learning with large language models for recommendation\.InWWW,pp\. 3464–3475\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- T\. Seedance, D\. Chen, L\. Chen, X\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, F\. Cheng, T\. Cheng, Y\. Cheng,et al\.\(2026\)Seedance 2\.0: advancing video generation for world complexity\.arXiv preprint arXiv:2604\.14148\.Cited by:[§1](https://arxiv.org/html/2606.24196#S1.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p2.1),[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- T\. Seedream, Y\. Chen, Y\. Gao, L\. Gong, M\. Guo, Q\. Guo, Z\. Guo, X\. Hou, W\. Huang, Y\. Huang,et al\.\(2025\)Seedream 4\.0: toward next\-generation multimodal image generation\.arXiv preprint arXiv:2509\.20427\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- X\. Shen, R\. Zhang, X\. Zhao, J\. Zhu, and X\. Xiao \(2024\)Pmg: personalized multimodal generation with large language models\.InProceedings of the ACM Web Conference 2024,pp\. 3833–3843\.Cited by:[1st item](https://arxiv.org/html/2606.24196#A1.I1.i1.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- S\. Sun, X\. Liang, B\. Qu, and W\. Gao \(2025\)Content\-rich aigc video quality assessment via intricate text alignment and motion\-aware consistency\.arXiv preprint arXiv:2502\.04076\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- W\. Sun, L\. Yan, X\. Ma, S\. Wang, P\. Ren, Z\. Chen, D\. Yin, and Z\. Ren \(2023\)Is chatgpt good at search? investigating large language models as re\-ranking agents\.arXiv preprint arXiv:2304\.09542\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- S\. Vargas and P\. Castells \(2011\)Rank and relevance in novelty and diversity metrics for recommender systems\.InRecsys,pp\. 109–116\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.24196#S4.SS1.SSS1.p1.2)\.
- J\. Xie, W\. Mao, Z\. Bai, D\. J\. Zhang, W\. Wang, K\. Q\. Lin, Y\. Gu, Z\. Chen, Z\. Yang, and M\. Z\. Shou \(2025\)Show\-o: one single transformer to unify multimodal understanding and generation\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 28240–28264\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p2.1)\.
- Y\. Xu, W\. Wang, Y\. Zhang, B\. Tang, P\. Yan, F\. Feng, and X\. He \(2025\)Personalized image generation with large multimodal models\.InProceedings of the ACM on Web Conference 2025,pp\. 264–274\.Cited by:[2nd item](https://arxiv.org/html/2606.24196#A1.I1.i2.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p2.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1),[§5](https://arxiv.org/html/2606.24196#S5.p1.1)\.
- Z\. Yang, J\. Teng, W\. Zheng, M\. Ding, S\. Huang, J\. Xu, Y\. Yang, W\. Hong, X\. Zhang, G\. Feng,et al\.\(2025\)Cogvideox: text\-to\-video diffusion models with an expert transformer\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 83048–83077\.Cited by:[§1](https://arxiv.org/html/2606.24196#S1.p1.1),[§1](https://arxiv.org/html/2606.24196#S1.p2.1)\.
- Y\. Ye, Z\. Zheng, Y\. Shen, T\. Wang, H\. Zhang, P\. Zhu, R\. Yu, K\. Zhang, and H\. Xiong \(2025\)Harnessing multimodal large language models for multimodal sequential recommendation\.InAAAI,Vol\.39,pp\. 13069–13077\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- Y\. Zhang, L\. Sang, and Y\. Zhang \(2024\)Exploring the individuality and collectivity of intents behind interactions for graph collaborative filtering\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1253–1262\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- Z\. Zhang, J\. She, K\. Cai, B\. Chen, S\. Wang, X\. Luo, Q\. Luo, R\. Tang, H\. Li, K\. Gai,et al\.\(2026\)Unleashing the native recommendation potential: llm\-based generative recommendation via structured term identifiers\.arXiv preprint arXiv:2601\.06798\.Cited by:[§3\.1](https://arxiv.org/html/2606.24196#S3.SS1.p4.2)\.
- C\. Zhao, E\. Yang, Y\. Liang, J\. Zhao, G\. Guo, and X\. Wang \(2025a\)Symmetric graph contrastive learning against noisy views for recommendation\.ACM TOIS43\(3\),pp\. 1–28\.Cited by:[§5](https://arxiv.org/html/2606.24196#S5.p3.1)\.
- Y\. Zhao, L\. Peng, Y\. Yang, Z\. Luo, H\. Li, Y\. Chen, Z\. Yang, X\. He, W\. Zhao, Q\. Lu,et al\.\(2025b\)Local conditional controlling for text\-to\-image diffusion models\.InProceedings of the AAAI conference on artificial intelligence,Vol\.39,pp\. 10492–10500\.Cited by:[§1](https://arxiv.org/html/2606.24196#S1.p2.1)\.
- B\. Zheng, Y\. Hou, H\. Lu, Y\. Chen, W\. X\. Zhao, M\. Chen, and J\. Wen \(2024\)Adapting large language models by integrating collaborative semantics for recommendation\.In2024 IEEE 40th International Conference on Data Engineering,pp\. 1435–1448\.External Links:[Document](https://dx.doi.org/10.1109/ICDE60146.2024.00118)Cited by:[3rd item](https://arxiv.org/html/2606.24196#A1.I2.i3.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1)\.
- G\. Zhou, H\. Bao, J\. Huang, J\. Deng, J\. Zhang, J\. She, K\. Cai, L\. Ren, L\. Ren, Q\. Luo,et al\.\(2025\)OpenOneRec technical report\.arXiv preprint arXiv:2512\.24762\.Cited by:[4th item](https://arxiv.org/html/2606.24196#A1.I2.i4.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.24196#S4.SS1.SSS1.p1.2),[§4\.1\.2](https://arxiv.org/html/2606.24196#S4.SS1.SSS2.p1.1)\.
Figure 6: Comprehensive generation cases across three representative domains, each with its target TID\. Panels: \(a\) Image: Game; \(b\) Image: Product; \(c\) Video: Short Video\.
## Appendix A
### A\.1 Baseline Methods
To ensure a comprehensive study, we compare NaviGen against a broad set of baselines covering personalized generation, collaborative filtering and broader behavioral preference modeling, and prompting\-based references\.
Personalized Generation Methods
- PMG \(Shen et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib36)\): LLM\-extracted user cues condition multimodal generators for personalized content synthesis, guided by behavior\-aware prompts\.
- PIGEON \(Xu et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib28)\): Retrieved preference evidence steers frozen generation agents without model fine\-tuning, preserving efficient deployment workflows\.
- RAGAR \(Ling et al\., [2026](https://arxiv.org/html/2606.24196#bib.bib26)\): Semantic retrieval weights relevant histories, while ranking feedback balances personalization and fidelity across interaction\-rich scenarios\.
- CIPHER \(Gao et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib34)\): Historical user edits are retrieved to infer preferences and align generated outputs with user intent through edit\-aware context matching\.
- PROSE \(Aroca\-Ouellette et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib12)\): Iterative refinement and consistency checks infer user preferences from demonstrations, using structured self\-verification loops\.
- TRIPLE \(Noh et al\., [2026](https://arxiv.org/html/2606.24196#bib.bib35)\): A theory\-driven LLM\-based personalization framework integrating planned and habitual behavior modeling for preference reasoning\.
Behavioral Preference Modeling Methods
- SASRec \(Kang and McAuley, [2018](https://arxiv.org/html/2606.24196#bib.bib37)\): Transformer\-based sequential preference model learning user behavior from item ID sequences via self\-attention\.
- TIGER \(Rajput et al\., [2023](https://arxiv.org/html/2606.24196#bib.bib39)\): A generative item\-ID modeling framework that converts item IDs into tokens and models sequential preference prediction as next\-token generation\.
- LC\-Rec \(Zheng et al\., [2024](https://arxiv.org/html/2606.24196#bib.bib2)\): An ID\-based sequential preference modeling method that enhances representation learning with additional latent/contextual signals\.
- OpenOneRec \(Zhou et al\., [2025](https://arxiv.org/html/2606.24196#bib.bib38)\): It integrates item\-text alignment into an end\-to\-end generative preference modeling framework for scalable preference prediction and reasoning\.
Figure 7: Human evaluation on created content\.
Prompting and Reference Baselines
- NPC: No\-preference conditioning removes user evidence, such as reference images and similar historical items, using a generic prompt for generation to isolate personalization effects\.
- Oracle: Ground\-truth target semantics, such as the target caption or TID, are used as a non\-deployable upper\-bound reference for diagnostic comparison only\.
### A\.2 Human Evaluation on Content
To complement automatic VLM\-based evaluation, we conduct a human study that directly assesses the perceptual quality of NaviGen’s outputs\. Results are averaged over three domains: Product, Short Videos, and Games\. From each dataset, we randomly sample 20 image cases and 5 video cases\. We compare NaviGen with three representative generation baselines, CIPHER, PROSE, and TRIPLE, all of which support both image and video generation\. We recruit 24 student volunteers to rate anonymized and randomly shuffled outputs on a 5\-point Likert scale along two dimensions: Novelty, reflecting the creativity and non\-triviality of the visual interpretation, and Aesthetic quality, capturing visual appeal, composition, clarity, and overall polish\. Scores from 1 to 5 indicate very poor, poor, fair, good, and excellent quality, respectively\. The evaluation takes approximately 1\.5 hours to complete\. As shown in Figure [7](https://arxiv.org/html/2606.24196#A1.F7), NaviGen consistently achieves higher average ratings on both dimensions in image and video settings, suggesting that behavior\-conditioned instructions lead to more creative and visually appealing personalized content\.
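A minimal sketch of how such Likert ratings might be aggregated into per-method, per-dimension averages is given below. It pools all raters and cases into one mean per (method, dimension) pair, which is an assumption about the aggregation; the record layout and the function name are hypothetical, and any data passed in would be the study's own ratings rather than values from this paper.

```python
from collections import defaultdict
from statistics import fmean


def mean_ratings(records):
    """Average 5-point Likert scores per (method, dimension).

    records: iterable of (method, dimension, score) tuples, score in 1..5.
    Pooling across raters and cases is an assumption of this sketch.
    """
    buckets = defaultdict(list)
    for method, dimension, score in records:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of Likert range: {score}")
        buckets[(method, dimension)].append(score)
    return {key: fmean(values) for key, values in buckets.items()}
```

To report a macro-average over the three domains instead, one could add the domain to the grouping key and then average the per-domain means.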
### A\.3 Prompt Templates
We use the following prompt templates, covering six main tasks and two reasoning cases \(shown last\):
- TID Generation: Fig\. [9](https://arxiv.org/html/2606.24196#A1.F9) converts item captions into compact semantic TIDs\.
- Thinking Generation: Fig\. [14](https://arxiv.org/html/2606.24196#A1.F14) generates reasoning from historical TIDs to the target TID\.
- Evolutionary Search: Fig\. [8](https://arxiv.org/html/2606.24196#A1.F8) searches for the best target\-aligned AIGC prompt\.
- Oneshot Distillation: Fig\. [10](https://arxiv.org/html/2606.24196#A1.F10) distills user history, target reasoning, and prompt refinement into final first\-person reasoning\.
- SFT Task Prompts: Fig\. [11](https://arxiv.org/html/2606.24196#A1.F11) defines SFT tasks for ID mapping, next\-item prediction, and AIGC instruction generation\.
- GRPO/RL Task Prompts: Fig\. [12](https://arxiv.org/html/2606.24196#A1.F12) defines preference prediction and instruction generation\.
- CID2CID/CID2INS Reasoning: Fig\. [13](https://arxiv.org/html/2606.24196#A1.F13) shows our reasoning process\.
- Novelty/Aesthetic Judging: Fig\. [15](https://arxiv.org/html/2606.24196#A1.F15) illustrates the evaluation of novelty and aesthetics for image \(a\) and video \(b\) generation\.
Figure 8: Generation prompts create target\-aligned AIGC candidates, while scoring prompts evaluate and select the best final prompt\.
Figure 9: Convert item captions into structured Term IDs\.
Figure 10: Distills user history, target reasoning, and prompt refinement into final first\-person reasoning\.
Figure 11: Four prompts for constructing the SFT tasks\.
Figure 12: Two prompts for constructing the GRPO tasks\.
Figure 13: Cases of reasoning\.
Figure 14: Reasoning from historical TIDs to the target TID\.
Figure 15: AIM\-judge prompts for novelty and aesthetic evaluation\. Both static images \(a\) and dynamic video sequences \(b\) are appraised for their creative interpretation of instructions \(novelty\) and visual polish \(aesthetics\)\. Panels: \(a\) Image generation evaluation; \(b\) Video generation evaluation\.