We gave 45 psychological questionnaires to 50 LLMs. What we found was not “personality.”

Reddit r/artificial Papers

Summary

Researchers analyzed the responses of 50 LLMs to 45 psychometric questionnaires and identified a 'Pinocchio Dimension': the degree to which a model endorses items about inner experience. The factor is not a personality trait in the human sense.

What is the “personality” of an LLM? What actually differentiates models psychometrically?

Since LLMs entered public use, researchers have been giving them psychometric questionnaires, with mixed results. Their answers often do not seem to reflect the same psychological constructs these tests measure in humans. So we asked a slightly different question: What do LLM responses to psychometric questionnaires actually reflect?

We analyzed responses to 45 validated psychometric questionnaires completed by 50 different LLMs. The strongest source of variation was whether a model endorsed items about inner experience: emotions, sensations, thoughts, imagery, empathy, and other forms of first-person experience. We call this factor the Pinocchio Dimension.

Importantly, the Pinocchio Dimension is not a classical personality trait. It does not tell us whether a model is “extraverted,” “neurotic,” or “agreeable” in the human sense. Rather, it captures the extent to which a model treats the language of inner experience as self-applicable: whether it responds as if it had feelings, mental imagery, and an inner point of view, or instead as a system that reacts behaviorally to inputs.

Preprint in the comments.
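For readers who want a concrete picture of what "strongest source of variation" can mean, here is a minimal sketch on synthetic data. The post does not say which method was used (factor analysis, PCA, or something else), so this is not the authors' pipeline. It assumes a plain principal component analysis computed by power iteration, and every name and number in it (`simulate_responses`, `first_component`, 20 items, the loadings, the noise level) is hypothetical and chosen only for illustration.

```python
import random


def simulate_responses(n_models=50, n_items=20, n_inner=10, seed=0):
    """Build a synthetic models-by-items matrix (NOT real data).

    Assumption for illustration: the first `n_inner` items ask about inner
    experience and share one latent 'endorsement' score per model; the
    remaining items are pure noise around the scale midpoint.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_models):
        endorsement = rng.gauss(0.0, 1.0)  # latent score for this model
        row = []
        for item in range(n_items):
            loading = 1.0 if item < n_inner else 0.0
            row.append(3.0 + loading * endorsement + rng.gauss(0.0, 0.5))
        rows.append(row)
    return rows


def first_component(matrix, iterations=200):
    """Return (loadings, explained_share) for the first principal component."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Center each item (column) on its mean across models.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Item-by-item sample covariance matrix.
    cov = [
        [
            sum(centered[i][a] * centered[i][b] for i in range(n_rows)) / (n_rows - 1)
            for b in range(n_cols)
        ]
        for a in range(n_cols)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    vec = [1.0 / n_cols ** 0.5] * n_cols
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the variance captured along `vec`.
    eigenvalue = sum(
        vec[a] * sum(cov[a][b] * vec[b] for b in range(n_cols)) for a in range(n_cols)
    )
    total_variance = sum(cov[a][a] for a in range(n_cols))
    return vec, eigenvalue / total_variance


responses = simulate_responses()
loadings, share = first_component(responses)
print("share of variance on first component:", round(share, 2))
print("loadings, inner-experience items:", [round(x, 2) for x in loadings[:10]])
print("loadings, other items:", [round(x, 2) for x in loadings[10:]])
```

In this toy setup the first component loads almost entirely on the items that were simulated as inner-experience items, which is the kind of pattern the post describes. Real questionnaire data would need more care, for example reverse-scored items, ordinal scales, and a principled choice of the number of factors.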

Similar Articles

Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?

arXiv cs.CL

This paper investigates whether fine-tuning LLMs on long-form essays with associated Big Five personality profiles stabilizes questionnaire responses and can induce target profiles, finding that while variance reduces, accuracy on the full five-dimensional profile remains near chance.

Evaluating LLMs as Human Surrogates in Controlled Experiments

arXiv cs.CL

This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.