PersonaVLM: Long-Term Personalized Multimodal LLMs

Hugging Face Daily Papers

Summary

PersonaVLM introduces a personalized multimodal LLM framework that enables long-term user adaptation through memory retention, multi-turn reasoning, and response alignment, outperforming GPT-4o by 5.2% on the new Persona-MME benchmark.

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig. 1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.
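The abstract names three capabilities but gives no implementation detail on this page. The sketch below is only a toy illustration of how a remember / retrieve / align loop could be wired together, not the authors' method. Everything in it is an assumption made for illustration: the names PersonalMemoryStore, remember, retrieve, update_trait, and build_prompt are hypothetical, keyword overlap stands in for whatever retrieval PersonaVLM actually uses, and an exponential moving average stands in for its personality inference.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Memory:
    turn: int  # chronological index of the interaction
    text: str  # summarized content of that interaction


def tokens(text: str) -> set[str]:
    """Lowercased word set used for a toy keyword-overlap similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class PersonalMemoryStore:
    """Hypothetical per-user store; not part of any PersonaVLM release."""

    memories: list[Memory] = field(default_factory=list)
    traits: dict[str, float] = field(default_factory=dict)

    def remember(self, turn: int, summary: str) -> None:
        """(a) Remembering: append a chronological memory entry."""
        self.memories.append(Memory(turn, summary))

    def retrieve(self, query: str, k: int = 2) -> list[Memory]:
        """(b) Reasoning support: rank by word overlap, newer entries win ties."""
        q = tokens(query)
        scored = [(len(q & tokens(m.text)), m.turn, m) for m in self.memories]
        scored = [s for s in scored if s[0] > 0]  # drop memories with no overlap
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return [m for _, _, m in scored[:k]]

    def update_trait(self, name: str, observation: float, rate: float = 0.5) -> None:
        """(c) Response alignment: moving average of an observed trait score."""
        old = self.traits.get(name, observation)
        self.traits[name] = (1 - rate) * old + rate * observation

    def build_prompt(self, query: str) -> str:
        """Combine inferred traits and retrieved memories into a prompt prefix."""
        trait_line = ", ".join(f"{k}={v:.2f}" for k, v in sorted(self.traits.items()))
        memory_lines = "\n".join(
            f"- [turn {m.turn}] {m.text}" for m in self.retrieve(query)
        )
        return f"Traits: {trait_line}\nMemories:\n{memory_lines}\nUser: {query}"


if __name__ == "__main__":
    store = PersonalMemoryStore()
    store.remember(1, "User adopted a corgi named Mochi")
    store.remember(2, "User prefers concise answers")
    store.update_trait("formality", 0.2)
    print(store.build_prompt("What is my corgi called?"))
```

In a real system the retrieved memories and trait estimates would be passed to the MLLM along with the current multimodal input; the string prompt here only marks where that hand-off would happen.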

Cached at: 04/20/26, 08:27 AM


Source: https://huggingface.co/papers/2604.13074



View arXiv page (https://arxiv.org/abs/2604.13074) View PDF (https://arxiv.org/pdf/2604.13074) Project page (https://personavlm.github.io/) GitHub (https://github.com/MiG-NJU/PersonaVLM)

Get this paper in your agent:

hf papers read 2604.13074

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

ClareNie/PersonaVLM-8B (https://huggingface.co/ClareNie/PersonaVLM)

Datasets citing this paper: 2

ClareNie/Persona-MME (https://huggingface.co/datasets/ClareNie/Persona-MME)

ClareNie/PersonaVLM-Dataset (https://huggingface.co/datasets/ClareNie/PersonaVLM-Dataset)


Similar Articles

Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning

arXiv cs.CL

This paper investigates whether assigning personas to large language models induces human-like motivated reasoning, finding that persona-assigned LLMs show up to 9% reduced veracity discernment and are up to 90% more likely to evaluate scientific evidence in ways congruent with their induced political identity, with prompt-based debiasing largely ineffective.

Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

Hugging Face Daily Papers

Researchers introduce the BEHEMOTH benchmark and CluE cluster-based prompt optimization to enable LLMs to extract and retain heterogeneous memory across diverse tasks, achieving 9% gains over prior self-evolving frameworks.