Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Hugging Face Daily Papers

Summary

Researchers introduce the MM-OCEAN dataset and a three-tier evaluation framework for grounded personality reasoning in multimodal LLMs, revealing a 'Prejudice Gap' where models often make correct predictions without proper grounding.

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
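
The abstract names the sample-level metrics but does not define them, so the following is only a minimal sketch of how a "share of correct ratings that lack grounding" figure and a holistic pass rate could be computed from per-sample tier outcomes. The `SampleResult` schema, the function names, and both formulas are assumptions made for illustration; they are not the paper's definitions of PR or HR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Per-sample outcome on the three tiers (hypothetical schema, not from the paper)."""
    rating_correct: bool   # tier 1: Big Five rating judged correct
    reasoning_ok: bool     # tier 2: trait analysis judged acceptable
    grounding_ok: bool     # tier 3: cue-grounding MCQs answered correctly


def ungrounded_correct_share(results: List[SampleResult]) -> float:
    """Assumed definition: among samples with a correct rating,
    the fraction whose grounding tier failed."""
    correct = [r for r in results if r.rating_correct]
    if not correct:
        return 0.0
    ungrounded = sum(1 for r in correct if not r.grounding_ok)
    return ungrounded / len(correct)


def holistic_pass_rate(results: List[SampleResult]) -> float:
    """Assumed definition: fraction of all samples that pass every tier."""
    if not results:
        return 0.0
    passed = sum(
        1 for r in results
        if r.rating_correct and r.reasoning_ok and r.grounding_ok
    )
    return passed / len(results)


# Toy data, invented purely to exercise the functions.
toy = [
    SampleResult(rating_correct=True, reasoning_ok=True, grounding_ok=True),
    SampleResult(rating_correct=True, reasoning_ok=True, grounding_ok=False),
    SampleResult(rating_correct=True, reasoning_ok=False, grounding_ok=False),
    SampleResult(rating_correct=False, reasoning_ok=False, grounding_ok=True),
]

print(ungrounded_correct_share(toy))  # 2 of the 3 correct ratings lack grounding
print(holistic_pass_rate(toy))        # 1 of the 4 samples passes all tiers
```

Under these assumed definitions, a model can score well on rating accuracy while the first number stays high, which is the kind of disconnect the abstract calls the Prejudice Gap.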

Cached at: 05/22/26, 06:38 AM

Source: https://huggingface.co/papers/2605.22109

Get this paper in your agent:

hf papers read 2605.22109

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### anonymous-mm-ocean/MM-OCEAN

Updated about 4 hours ago • 338
