Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
Summary
Researchers introduce the MM-OCEAN dataset and a three-tier evaluation framework for grounded personality reasoning in multimodal LLMs, revealing a 'Prejudice Gap' where models often make correct predictions without proper grounding.
Source: https://huggingface.co/papers/2605.22109
Abstract
Researchers introduce a new task and dataset for evaluating personality reasoning in multimodal language models, revealing significant gaps between accurate predictions and grounded reasoning processes.
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
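To make the sample-level metrics concrete, here is a minimal Python sketch of how two of them might be computed from per-sample pass/fail outcomes on the three tiers. The definitions are assumptions inferred from the abstract's wording, not the paper's formal definitions: PR is taken as the share of correctly rated samples whose grounding tier fails, and HR as the share of all samples that pass all three tiers. The `SampleResult` type and both function names are hypothetical. CR and IR are left out because the abstract does not say how they are defined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SampleResult:
    """Hypothetical per-sample outcome on the three evaluation tiers."""
    rating_correct: bool
    reasoning_correct: bool
    grounding_correct: bool


def prejudice_rate(results: Sequence[SampleResult]) -> float:
    """Assumed definition: fraction of correctly rated samples with failed grounding."""
    rated_ok = [r for r in results if r.rating_correct]
    if not rated_ok:
        return 0.0  # no correct ratings, so nothing to condition on
    ungrounded = sum(1 for r in rated_ok if not r.grounding_correct)
    return ungrounded / len(rated_ok)


def holistic_grounding_rate(results: Sequence[SampleResult]) -> float:
    """Assumed definition: fraction of all samples passing all three tiers."""
    if not results:
        return 0.0
    holistic = sum(
        1
        for r in results
        if r.rating_correct and r.reasoning_correct and r.grounding_correct
    )
    return holistic / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    SampleResult(True, True, True),    # right score, right reason, grounded
    SampleResult(True, True, False),   # right score, not grounded
    SampleResult(True, False, False),  # right score, not grounded
    SampleResult(False, False, True),  # wrong score
]
toy_pr = prejudice_rate(toy_results)
toy_hr = holistic_grounding_rate(toy_results)
```

Under these assumed definitions, a model can score well on rating accuracy while still showing a high PR and a low HR, which is the kind of disconnect the abstract calls the Prejudice Gap.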
Datasets citing this paper: 1
anonymous-mm-ocean/MM-OCEAN