
# HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

Source: https://arxiv.org/html/2601.10198

Xintao Wang1, Jian Yang1, Weiyuan Li1, Rui Xie1, Jen-tse Huang3, Jun Gao2, Shuai Huang2, Yueping Kang2, Yuanli Guo1, Hongwei Feng1, Yanghua Xiao1

1Fudan University
2Hello Group
3Johns Hopkins University

{xtwang21, 24210240375, 25210980069, 25210980167}@m.fudan.edu.cn
{hwfeng, shawyh, guoyuanli}@fudan.edu.cn
[email protected]
{huang.shuai, kang.yueping}@hellogroup.com
[email protected]

## Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HumanLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from approximately 12,000 academic papers and synthesize 11,359 scenarios where 2–5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.90) while revealing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4× fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling—simulating not just what humans do, but the psychological processes generating those behaviors. Our dataset, code, and model are available at: https://github.com/YJGoodbye2024/HumanLLM

## 1 Introduction

With the rapid scaling of training data, Large Language Models (LLMs) have achieved remarkable progress in anthropomorphism—simulating human-like characteristics and social phenomena. Role-Playing Language Agents (RPLAs) have evolved from conceptual frameworks into practical applications, enabling digital clones, AI companions, and society simulation. As these applications advance, LLM anthropomorphism increasingly requires moving beyond shallow behavioral mimicry toward deeper cognitive and emotional fidelity—what we term psychological alignment.

However, existing approaches model personality as isolated label-to-behavior mappings—"extroverted" maps to "talkative," "agreeable" maps to "cooperative"—without capturing how multiple cognitive patterns interact to produce behavior. We define a pattern as a psychologically documented regularity in human cognition or behavior—either a stable personality trait (e.g., "assertive") or a context-triggered social-cognitive process (e.g., "spotlight effect").

In reality, a talkative person may fall silent when the spotlight effect is activated; an assertive individual may yield under conformity pressure. Human behavior emerges from the dynamic interplay of multiple patterns, not from any single trait in isolation. Current methods—whether prompting-based, fine-tuning-based, or activation steering—all treat traits independently, leading to personality drift and the "personality illusion" where models report traits while behaving inconsistently.

To address this, we propose the HumanLLM framework, treating cognitive patterns not as isolated labels but as interacting causal forces. HumanLLM refers to the overall framework; we use "HumanLLM dataset" and "HumanLLM-8B/32B" when referring specifically to the data artifact and the fine-tuned models, respectively.

Our key insight is that exposing models to scenarios where multiple patterns reinforce, conflict, or modulate one another allows them to learn multi-pattern dynamics implicitly, without architectural modifications.

Following Lewin's field theory, we decompose human cognition into two dimensions: (1) Personality Traits—stable individual characteristics, and (2) Social-Cognitive Patterns—context-triggered mechanisms. We collect 244 patterns (100 personality traits from Goldberg's Big Five markers and 144 social-cognitive patterns from established psychological research), each developed through systematic review of approximately 50 academic papers.

We then construct 11,359 scenarios involving 2–6 characters, each containing 2–5 patterns that may align (e.g., "self-serving bias" reinforcing "overconfidence effect"), conflict (e.g., "assertive" versus "conformity"), or interact conditionally (e.g., "talkative" suppressed by "spotlight effect"). For each scenario, we synthesize multi-turn conversations where each turn comprises inner thoughts, physical actions, and verbal expressions.
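The scenario structure described above can be sketched as a small schema. This is a minimal illustration, assuming hypothetical field names (`Turn`, `Scenario`, `interaction`); the released dataset's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str
    inner_thoughts: str   # private cognition, hidden from other characters
    actions: str          # physical behavior
    dialogue: str         # verbal expression

@dataclass
class Scenario:
    description: str
    characters: List[str]   # 2-6 characters per scenario
    patterns: List[str]     # 2-5 interacting patterns
    interaction: str        # "reinforce", "conflict", or "modulate"
    conversation: List[Turn] = field(default_factory=list)

# Example of a conditional interaction: "talkative" suppressed by
# the "spotlight effect".
scene = Scenario(
    description="Team meeting where a junior engineer must voice disagreement",
    characters=["Avery", "Blake"],
    patterns=["talkative", "spotlight effect"],
    interaction="modulate",
)
scene.conversation.append(Turn(
    speaker="Avery",
    inner_thoughts="Everyone will notice if I stumble over this.",
    actions="glances down at notes",
    dialogue="Maybe we could... revisit the timeline?",
))
```

Each turn bundles the three expression channels the paper names (inner thoughts, actions, dialogue), so pattern interplay can surface in private cognition even when the spoken line stays neutral.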

To ensure faithful pattern expression and enable systematic evaluation, we design dual-level checklists: pattern-level checklists (12–15 items per pattern) capture universal behavioral indicators; scenario-level checklists (2–6 items per character) specify expected behavioral tendencies under each multi-pattern configuration.
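One plausible way to aggregate the dual-level checklists is to score each response as the fraction of items it satisfies, then average within each level. This averaging rule is an assumption for illustration (the paper does not specify its aggregation), and the example uses fewer items than the stated 12-15 per pattern.

```python
from typing import Dict, List

def checklist_score(judgments: List[bool]) -> float:
    """Fraction of checklist items a response satisfies."""
    return sum(judgments) / len(judgments) if judgments else 0.0

def dual_level_score(
    pattern_judgments: Dict[str, List[bool]],   # per pattern (12-15 items)
    scenario_judgments: Dict[str, List[bool]],  # per character (2-6 items)
) -> Dict[str, float]:
    # Assumed aggregation: simple mean over patterns / characters.
    pattern = [checklist_score(j) for j in pattern_judgments.values()]
    scenario = [checklist_score(j) for j in scenario_judgments.values()]
    return {
        "pattern_fidelity": sum(pattern) / len(pattern),
        "multi_pattern_dynamics": sum(scenario) / len(scenario),
    }

scores = dual_level_score(
    pattern_judgments={"spotlight effect": [True, True, False, True]},
    scenario_judgments={"Avery": [True, False]},
)
# scores == {"pattern_fidelity": 0.75, "multi_pattern_dynamics": 0.5}
```

Keeping the two levels separate is what lets the evaluation distinguish faithful single-pattern behavior from emergent multi-pattern dynamics, rather than collapsing both into one holistic score.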

Our training pipeline consists of supervised fine-tuning on the synthesized conversations. We evaluate across in-domain, out-of-domain, and mixed settings to assess generalization, with additional validation on external benchmarks including LifeChoice and CroSS-MR.
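For supervised fine-tuning, a multi-character conversation has to be flattened into chat-format training examples. The sketch below is one illustrative convention, not the paper's actual formatting: the simulated character's turns become assistant targets carrying thoughts, actions, and dialogue, while other characters' lines become user context. All tag names (`[Thought]`, `[Action]`, `[Say]`) are assumptions.

```python
def to_sft_messages(system_persona, turns, agent):
    """Flatten a multi-character conversation into chat-format SFT messages.

    The simulated character's turns become assistant targets; every other
    character's dialogue becomes user-side context.
    """
    messages = [{"role": "system", "content": system_persona}]
    for t in turns:
        if t["speaker"] == agent:
            content = (f"[Thought] {t['thoughts']} "
                       f"[Action] {t['actions']} "
                       f"[Say] {t['dialogue']}")
            messages.append({"role": "assistant", "content": content})
        else:
            messages.append({"role": "user",
                             "content": f"{t['speaker']}: {t['dialogue']}"})
    return messages

turns = [
    {"speaker": "Blake", "thoughts": "", "actions": "",
     "dialogue": "Avery, what do you think of the plan?"},
    {"speaker": "Avery",
     "thoughts": "Everyone is watching me now.",
     "actions": "shifts in seat",
     "dialogue": "It... mostly works, I think."},
]
messages = to_sft_messages(
    "You are Avery: talkative, but prone to the spotlight effect.",
    turns, agent="Avery",
)
```

Training only on the target character's turns keeps the loss focused on expressing that character's pattern configuration rather than on reproducing the whole scene.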

Our contributions are as follows:

(1) We introduce HumanLLM, a framework that systematically leverages psychological cognitive patterns to enhance LLM anthropomorphism, shifting from isolated trait simulation toward modeling the dynamic interplay of human cognition.

(2) We construct a comprehensive dataset comprising 244 patterns and 11,359 scenarios with multi-turn, multi-character conversations. Each pattern is grounded in approximately 50 academic papers (over 12,000 papers in total), ensuring psychological rigor and scientific validity.

(3) We propose dual-level checklists that enable systematic evaluation at both pattern-level and scenario-level granularities, providing a principled framework for assessing generalization to unseen psychological patterns.

## 2 Related Work

Recent advances in large language models have catalyzed significant progress in role-playing language agents (RPLAs). Early work established foundational architectures: generative agents with memory, planning, and reflection modules have been employed to simulate human behavior in interactive environments, while Character-LLM proposed experience reconstruction to train agents embodying historical figures. Subsequent efforts focused on systematic benchmarking and enhancement: ChatHaruhi leveraged memory-based dialogue control for fictional characters, and CoSER curated authentic dialogues from 771 books using "given-circumstance acting" methodology.

For persona induction, three main approaches have emerged: (1) prompting-based methods that assign personality traits through instructions; (2) fine-tuning approaches that embed personas through training on character-specific data; and (3) activation steering via persona vectors that manipulate neural representations corresponding to specific traits.

A parallel line of research evaluates LLMs through the lens of psychological constructs. Theory of Mind (ToM) benchmarks such as ToMBench assess social cognitive abilities, revealing that GPT-4 lags behind humans by over 10%, with trivial task modifications causing significant performance degradation. Emotional intelligence benchmarks adopt psychology-grounded frameworks to evaluate emotional understanding and application, finding substantial gaps between LLMs and humans. Moral reasoning has been assessed through ETHICS and MoralBench, the latter grounded in Moral Foundations Theory.

Research on cognitive biases reveals that LLMs exhibit human-like irrationality but with divergent patterns. Personality assessment using validated instruments (BFI, MBTI) demonstrates that LLMs can manifest measurable traits, though self-report validity remains questionable. Critically, recent work cautions that LLMs do not reliably simulate human psychology and fail to generalize across semantically equivalent scenarios.

## 3 HumanLLM Dataset

This section introduces the HumanLLM dataset, a psychologically grounded resource for training and evaluating anthropomorphic language models. We describe pattern collection, pattern data construction, scenario and conversation generation, and dual-level checklist design.

### 3.1 Pattern Collection

Following theoretical foundations established in Lewin's Person-Environment framework, we compile patterns along two complementary dimensions.

#### Personality Traits (Person Dimension)

We adopt Goldberg's 100 Unipolar Markers, a psychometrically validated lexicon of trait adjectives mapping onto the Big Five dimensions (Extraversion, Agreeableness, Conscientiousness, Emotional Stability, Intellect), with 20 descriptors per dimension.


#### Social-Cognitive Patterns (Environment Dimension)

We curate situationally activated psychological mechanisms through a systematic review of established theoretical traditions, including cognitive biases, social influence, evolutionary psychology, and motivation research. From an initial pool of 232 documented patterns, we apply two filtering criteria: (1) sufficient empirical validation and (2) non-redundancy with other patterns. This yields 144 social-cognitive patterns.

### 3.2 Pattern Data Construction

Pattern data are structured representations of psychological patterns. We construct pattern data through a two-stage pipeline: literature retrieval followed by LLM-based synthesis.

#### Literature Retrieval

For each of the 244 patterns, we employ Gemini Deep Search to identify approximately 50 relevant academic papers. The search is guided by three retrieval dimensions: (1) foundational definitions from seminal works, (2) mechanistic explanations from theoretical and empirical studies, and (3) real-world applications from applied research. Retrieved references are filtered manually to remove irrelevant entries. Full-text documents are obtained through open-access APIs (Semantic Scholar, arXiv, OpenAlex, PubMed, Crossref); when full text is unavailable, abstracts are retained. This process yields a corpus of approximately 12,000 papers across all patterns.
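The three retrieval dimensions can be operationalized as query templates per pattern, as in the sketch below. The template wording and function names are assumptions for illustration; each resulting query could then be issued against the open-access APIs named above (e.g., Semantic Scholar's paper-search endpoint).

```python
# Illustrative query templates for the three retrieval dimensions;
# the actual prompts given to the search system are not specified.
RETRIEVAL_DIMENSIONS = {
    "definition": "seminal definition of {pattern}",
    "mechanism": "psychological mechanism underlying {pattern}",
    "application": "real-world applications of {pattern}",
}

def build_queries(pattern: str) -> dict:
    """Expand one pattern name into one search query per dimension."""
    return {dim: tmpl.format(pattern=pattern)
            for dim, tmpl in RETRIEVAL_DIMENSIONS.items()}

queries = build_queries("spotlight effect")
# e.g., queries["definition"] == "seminal definition of spotlight effect"
```

Running this over all 244 patterns, retrieving ~50 papers each, and deduplicating would account for the ~12,000-paper corpus the paper reports.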

#### Pattern Synthesis

We employ Gemini 2.5 Pro to summarize each pattern's literature corpus into a structured representation. Critically, the model is instructed to extract and summarize information exclusively from the provided 50 papers, rather than generating content from its parametric knowledge. Following the construct validity framework, each pattern is organized into three components: (1) Definition—a precise characterization grounded in authoritative sources; (2) Core Mechanisms—psychological explanations for why the pattern occurs; and (3) Real-World Manifestations—observable behavioral indicators across diverse contexts.
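The three-component structured representation maps naturally onto a record type. This is a minimal sketch; the class name `PatternCard` and the `kind` field are hypothetical, and the example text paraphrases a well-known pattern rather than quoting the dataset.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PatternCard:
    name: str
    kind: str                   # "personality_trait" or "social_cognitive"
    definition: str             # grounded in authoritative sources
    core_mechanisms: List[str]  # why the pattern occurs
    manifestations: List[str]   # observable behavioral indicators

spotlight = PatternCard(
    name="spotlight effect",
    kind="social_cognitive",
    definition=("The tendency to overestimate how much others notice "
                "one's own appearance and behavior."),
    core_mechanisms=["egocentric anchoring on one's own experience"],
    manifestations=["reluctance to speak after a minor public mistake"],
)
```

Separating mechanisms from manifestations mirrors the paper's goal: the model should learn the process generating a behavior, not just the surface behavior itself.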
