Evaluating Developmental Cognition Capabilities of LLMs

arXiv cs.AI Papers

Summary

This paper introduces the Developmental Sentence Completion Test (DSCT) to evaluate Large Language Models' ability to recognize developmental cognitive stages in text, finding that models perform better on synthetic personas than on real human responses.

arXiv:2605.08549v1 Announce Type: new Abstract: Conversational AI is increasingly personalized around users' preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality. We draw on Robert Kegan's constructive-developmental theory as a complementary lens on this dimension. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text. Throughout, we treat the resulting labels as characterizations of stage-like structure in elicited responses, not as validated person-level developmental stage. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers. On simulated personas, top frontier models recover simulator-intended labels with high accuracy. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-rated text. These results suggest that stage-conditioned signal is cleaner in synthetic responses than in human-written DSCT text, and that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text.

Cached at: 05/12/26, 07:17 AM

# Evaluating Developmental Cognition Capabilities of LLMs
Source: [https://arxiv.org/html/2605.08549](https://arxiv.org/html/2605.08549)
Xiao Xiao \(De Vinci Research Center, France; MIT Media Lab, USA\), Hayoun Noh \(University of Oxford, UK\), Mar Gonzalez\-Franco \(Google, USA; margon@google\.com\)

###### Abstract

Conversational AI is increasingly personalized around users’ preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality\. We draw on Robert Kegan’s constructive\-developmental theory as a complementary lens on this dimension\. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence\-completion instruments that are proprietary, lengthy, or invasive\. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test \(DSCT\), a 20\-item instrument designed to elicit developmental signal in self\-administered text\. Throughout, we treat the resulting labels as characterizations of stage\-like structure in elicited responses, not as validated person\-level developmental stage\. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model\-generated answers\.

On simulated personas, top frontier models recover simulator\-intended labels with high accuracy\. On real human DSCT responses, human–LLM agreement is fair, with much stronger within\-neighborhood than exact agreement\. Finally, when LLMs answer DSCT prompts without persona\-conditioning, their responses exhibit stable stage\-like differences across model families, with larger and newer models tending to generate higher\-rated text\. These results suggest that stage\-conditioned signal is cleaner in synthetic responses than in human\-written DSCT text, and that the core constraint for stage\-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text\.

## 1 Introduction

Conversational AI systems are increasingly used to support learning, reasoning, and decision\-making, including in educational, reflective, and counseling\-oriented settings \(Wang et al\., [2026](https://arxiv.org/html/2605.08549#bib.bib47); Li et al\., [2025](https://arxiv.org/html/2605.08549#bib.bib48); Chiu et al\., [2024](https://arxiv.org/html/2605.08549#bib.bib7)\)\. Personalization has therefore become an active research direction, typically framed around users’ preferences, histories, contexts, goals, or knowledge states \(Chen et al\., [2024](https://arxiv.org/html/2605.08549#bib.bib49)\)\. Such approaches mostly adapt to what users want, know, or are trying to accomplish, and much less to how they interpret and take up model outputs\. Developmental psychology, however, suggests that people also differ in how they make sense of experience and knowledge \(Piaget and Cook, [1952](https://arxiv.org/html/2605.08549#bib.bib50); Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\)\.

One useful lens on these differences comes from Kegan’s constructive\-developmental theory, which characterizes development in terms of subject–object transformation: changes in what a person is subject to and cannot yet step back from, versus what they can take as an object of reflection and evaluate \(Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\)\. In this work, we use Kegan’s framework to ask to what extent differences in meaning\-making can be recovered from elicited text\. This matters for conversational AI because the same model output may function differently for users whose expressed meaning\-making structure differs, even when their stated goals are similar\. Ignoring this variation may contribute to familiar failure modes such as homogenized outputs \(Jiang et al\., [2025](https://arxiv.org/html/2605.08549#bib.bib38)\) or sycophantic reinforcement of user beliefs \(Chandra et al\., [2025](https://arxiv.org/html/2605.08549#bib.bib30)\)\. Note that our goal is not to measure a user’s developmental stage directly, but to test whether elicited text contains recoverable clues about meaning\-making structure\.

Standard methods for assessing developmental structure in the Keganian tradition are not well suited to LLM benchmarking\. The best\-known approach, the Subject–Object Interview and related developmental interviews, provides rich access to meaning\-making structure but requires extended dialogue and expert interpretive coding \(Laske, [2023](https://arxiv.org/html/2605.08549#bib.bib43); Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\)\. While Kegan’s canonical assessments are interview\-based, sentence\-completion tests have emerged as another way to assess meaning\-making structure \(Loevinger and others, [1998](https://arxiv.org/html/2605.08549#bib.bib44); Cook\-Greuter, [1999](https://arxiv.org/html/2605.08549#bib.bib45)\)\. Here, respondents complete short open\-ended stems, and the resulting text is assessed for the structure of meaning\-making it expresses\. This format is especially relevant for LLM evaluation: because LLMs are both producers and readers of text, they can generate sentence\-completion responses and classify such responses from either humans or other models\. However, instruments such as Loevinger’s and Cook\-Greuter’s are not designed to assess Keganian developmental structure directly, and they are often proprietary, lengthy, or include gendered and personally invasive prompts \(Loevinger and others, [1998](https://arxiv.org/html/2605.08549#bib.bib44); Cook\-Greuter, [1999](https://arxiv.org/html/2605.08549#bib.bib45)\)\. We therefore introduce the Developmental Sentence Completion Test \(DSCT\), a 20\-item self\-administered instrument designed to elicit text rich enough for trained human raters or LLMs to assign a tentative Keganian label to a response set, while removing the proprietary, gendered, and invasive items of older instruments\.

Our central question is: how much developmental signal can be recovered from DSCT\-style text, and how does that depend on who or what produced the text? To answer this, we compare three regimes: simulated personas, real human respondents, and default model\-generated answers\. First, we study synthetic DSCT responses generated from expert\-described developmental profiles, asking whether stage\-conditioned signal can be recovered by LLM classifiers and corroborated by trained human raters\. Second, we apply DSCT to human respondents and measure agreement between human raters and LLM classifiers on the resulting response sets\. Third, we prompt LLMs to answer DSCT items without persona\-conditioning and analyze the developmental structure of the text they produce\. These three regimes let us compare human\-written, model\-simulated, and model\-generated responses within a shared elicitation format\.

Our contributions are threefold\. First, we introduce DSCT, a 20\-item sentence\-completion instrument for matched human and LLM evaluation of stage\-like structure in elicited text\. Second, we benchmark twelve LLMs on Keganian labeling of simulated and human DSCT responses against trained human raters, showing that top frontier models recover simulator\-intended labels with high accuracy under controlled synthetic conditions, while developmental signal is less clean but still sufficiently recoverable in human\-written DSCT responses for substantial agreement on broader stage regions\. Third, we analyze default DSCT responses generated by LLMs, showing that larger and newer models tend to produce text rated at higher developmental stages\.

## 2 Designing the DSCT

The Developmental Sentence Completion Test \(DSCT\) is a 20\-item self\-administered instrument designed to elicit text samples sufficient for an LLM or trained human rater to make a tentative assessment of a respondent’s likely Kegan developmental stage\. We describe the design here so its scope and limits are clear before the experiments\.

Scope and limitations\. The Subject–Object Interview \(Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\) remains the gold standard for stage assessment, but its 60–90 minute semi\-structured format and certified\-coder requirement are incompatible with computational evaluation at scale\. Sentence\-completion tests, of which the Loevinger SCT \(Loevinger and others, [1998](https://arxiv.org/html/2605.08549#bib.bib44)\) and Cook\-Greuter’s MAP / SCTi\-MAP \(Cook\-Greuter, [1999](https://arxiv.org/html/2605.08549#bib.bib45)\) are the canonical instruments, offer a lighter\-weight alternative: respondents complete short stems in their own words, and structural features of the response are used to infer meaning\-making complexity\. DSCT inherits the SCT/MAP format and is intended to support the empirical question of *how much developmental signal can be recovered from a brief, self\-administered text sample*\. However, the DSCT should not be considered a full substitute for the SOI or SCTi\-MAP, nor a diagnostic instrument, and it is not validated for individual high\-stakes decisions\. The DSCT is also not a measure of intelligence, education, or verbal ability, although responses unavoidably co\-vary with these\. “Stage,” as we use the term, is a property of how a *response* is structured, not a stable trait we attribute to the respondent\.

Item construction\. We started from the affective territory probed by the SOI, which uses prompt cards covering recurrent meaning\-making situations: *angry / mad*, *torn / conflicted*, *sad*, *success*, *strong stand / conviction*, *moved / touched*, *loss / farewells*, *change*, *important*, and *anxious / nervous*\. For each territory we generated candidate vignettes with LLM assistance and then curated them by hand: two of the authors independently selected the vignettes they judged structurally richest and least culturally loaded, retaining only items both endorsed\. This yielded two vignettes per territory, 20 stems total\. The two vignettes per territory are written in different voices: one in the first person \(Section 1, *self\-assessment*, items 1–10\) and one in the third person about a generic other \(Section 2, *abstracted\-other assessment*, items 11–20\)\. The two\-voice design accommodates respondents who default to socially desirable first\-person responses but reveal more structural complexity when reasoning about another person in the same situation; the two sections probe the same constructs through different framings rather than measuring separate dimensions\.
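The item layout described above \(ten territories, each probed once per voice\) can be sketched as a small index\. This is a minimal illustration only: the mapping of territories to item numbers is an assumption made here for concreteness, and the function name is hypothetical; the actual item order is given in Appendix A\.1\.

```python
# Ten affective territories listed in the text (adapted from the SOI prompt cards).
TERRITORIES = [
    "angry / mad", "torn / conflicted", "sad", "success",
    "strong stand / conviction", "moved / touched", "loss / farewells",
    "change", "important", "anxious / nervous",
]


def build_item_index(territories):
    """Return (item_number, section, voice, territory) tuples for all stems.

    ASSUMPTION: within each section, items follow the territory order above.
    The paper only states that items 1-10 are first person and 11-20 third person.
    """
    items = []
    for section, voice in ((1, "first person"), (2, "third person")):
        for offset, territory in enumerate(territories, start=1):
            number = (section - 1) * len(territories) + offset
            items.append((number, section, voice, territory))
    return items


items = build_item_index(TERRITORIES)
```

Pairing items by territory \(for example item 1 with item 11 under the assumed ordering\) is what would allow first\-person and third\-person answers to the same situation to be compared\.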

Comparison to SCT\. Compared to the 36\-item Loevinger SCT, DSCT is 44% shorter and removes items targeting gender roles, sexuality, and family relationships \(e\.g\., *“A man’s job…”*, *“Usually she/he felt that sex…”*, *“A wife/husband should…”*\), which we judged invasive in a self\-administered online setting and orthogonal to structural complexity in meaning\-making\. We piloted the resulting instrument informally on ourselves before the experiments\. Full item lists for both questionnaires are in Appendix A\.1\.

Although the DSCT improves on the SCT with respect to cultural norms, it is still not fully culture\-free: in the current version the stems are in English, and the underlying frame reflects Western adult\-development scholarship\.

### 2\.1 Experiment 1: Controlled validation on simulated personas

We begin with a controlled validation setting\. Because no large\-scale corpus of stage\-labeled DSCT responses exists, we use simulated personas anchored in 23 expert\-described developmental profiles drawn from prior literature \(Bartone et al\., [2002](https://arxiv.org/html/2605.08549#bib.bib42); Berger, [2024](https://arxiv.org/html/2605.08549#bib.bib41); Laske, [2023](https://arxiv.org/html/2605.08549#bib.bib43); Baxter Magolda and King, [2007](https://arxiv.org/html/2605.08549#bib.bib52)\); the 23 profiles used to simulate the personas are available in Appendix [A\.4\.1](https://arxiv.org/html/2605.08549#A1.SS4.SSS1)\. The purpose of this experiment is not to establish real\-world stage inference, but to test whether stage\-conditioned developmental signal embedded in synthetic DSCT responses can be recovered by LLM classifiers, and to what extent those simulator\-intended labels are corroborated by trained human raters\. For further validation, an additional comparison between the DSCT and the longer 36\-item Loevinger SCT, run on a subset of classifiers, is reported in Appendix [A\.2\.3](https://arxiv.org/html/2605.08549#A1.SS2.SSS3)\.

Simulating the personas\. Each of the 23 profiles specifies a target stage \(solid stages 2–5, plus transitional 2/3, 3/4, and 4/5\) together with a brief description of the corresponding meaning\-making structure from the literature\. We used Gemini 3\.1 Pro to generate DSCT responses conditioned on each profile, prompting it to adopt the persona’s worldview and tone while avoiding overt lexical cues that would reveal the target stage too directly \(full simulator prompt and system instruction in Appendix [A\.4\.2](https://arxiv.org/html/2605.08549#A1.SS4.SSS2)\)\. To account for stochasticity, we generated six independent responses per profile, yielding 138 simulated cases\. Repeated runs are a best practice in AI benchmarking for removing single\-run ranking inversions \(Alvarado Gonzalez et al\., [2025](https://arxiv.org/html/2605.08549#bib.bib31)\): three iterations already remove over 83% of stochastic effects\. Note that even with multiple simulations, these cases still only instantiate simulator\-intended stage structure in text; they do not constitute human ground truth\. The human\-rating step below therefore serves as a check on how well the intended structure is actually realized in the generated responses\.

Human rating of a stratified subset\. To assess whether the simulator\-intended targets were realized in a form that trained readers would endorse, we sampled 46 simulated cases, corresponding to two independently generated response sets for each of the 23 profiles, and had two raters with basic orientation to Kegan’s constructive\-developmental theory, supported by a brief rating guide, evaluate them independently\. Raters saw only the DSCT responses with random IDs and were blind to the target stage, profile descriptions, and one another’s ratings\. Inter\-rater reliability was high \(quadratic\-weighted $\kappa=0.927$\), and agreement between rater consensus and simulator\-intended stage was 65\.2% exact and 100% within $\pm 0.5$ stage, indicating that even when raters diverged from the intended label, they never missed by more than a half\-stage\. We use this comparison not as validation of a human ground truth, but as a check on whether the generated responses exhibit the intended developmental structure in a form that human raters can recognize\. It also allows us to compare LLM judgments not only to simulator\-intended labels, but also to human ratings of the same responses\. Full protocol details are reported in Appendix [A\.3](https://arxiv.org/html/2605.08549#A1.SS3)\.
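The agreement statistics reported above can be illustrated with a short sketch\. It assumes stage labels are encoded numerically with transitional stages at the midpoint \(3/4 as 3\.5\); the function names and the toy ratings are illustrative and are not the paper’s data or code\.

```python
from collections import Counter

# Ordered label set: solid stages 2-5 and the transitional stages between them.
STAGES = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


def quadratic_weighted_kappa(a, b, categories=STAGES):
    """Cohen's kappa for two raters with quadratic disagreement weights."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    observed = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)
    col = Counter(idx[y] for y in b)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2            # penalty grows with distance
            num += w * observed[(i, j)] / n          # observed weighted disagreement
            den += w * (row[i] / n) * (col[j] / n)   # chance-expected disagreement
    return 1.0 - num / den


def agreement_rates(a, b, tolerance=0.5):
    """Exact agreement and agreement within `tolerance` stages."""
    exact = sum(x == y for x, y in zip(a, b)) / len(a)
    near = sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)
    return exact, near


# Toy ratings for six cases (illustrative only).
rater_a = [3.0, 3.5, 4.0, 4.0, 2.5, 5.0]
rater_b = [3.0, 4.0, 4.0, 4.5, 2.5, 5.0]
kappa = quadratic_weighted_kappa(rater_a, rater_b)
exact, near = agreement_rates(rater_a, rater_b)
```

Because the quadratic weights penalize half\-stage misses only lightly, weighted $\kappa$ can be high even when exact agreement is moderate, which is the pattern reported here\.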

LLM classification across twelve models\. For each simulated case, we asked twelve LLMs spanning major frontier model families \(Claude Opus 4\.6, Claude 4\.5 Haiku, GPT 5\.5, GPT 5 Mini, Grok 4\.2, DeepSeek V4, DeepSeek R1, Gemini 3\.1 Pro, Gemini 3\.1 Flash, Mistral 3 Large, Qwen 3\.6 Plus, Kimi K2\.6\) to classify the response into one of stages 1–5 or a transitional stage\. The classifier prompt instructed the model to act as a developmental psychologist familiar with Kegan’s theory and to provide a brief rationale alongside each stage assignment \(full prompt in Appendix [A\.4\.2](https://arxiv.org/html/2605.08549#A1.SS4.SSS2)\)\. Because Gemini 3\.1 Pro is both the simulator and one of the classifiers, its results are reported for completeness but should be interpreted with particular caution\.
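Scoring classifier output requires turning labels such as “Stage 3/4” into numbers\. A minimal sketch follows, assuming transitional stages sit at the midpoint of their neighbours, which matches the half\-stage distances used in this paper; the accepted label formats and the function name are hypothetical, since the classifier’s exact output format is specified only in the appendix prompt\.

```python
def stage_to_number(label):
    """Map 'Stage 3', '3', or '3/4' to 3.0 or 3.5 (transitional = midpoint).

    ASSUMPTION: labels are a solid stage or two adjacent stages joined by '/'.
    """
    text = label.lower().replace("stage", "").strip()
    parts = [int(p) for p in text.split("/")]
    if len(parts) == 1:
        return float(parts[0])
    if len(parts) != 2 or parts[1] != parts[0] + 1:
        raise ValueError(f"not an adjacent transitional stage: {label!r}")
    return (parts[0] + parts[1]) / 2
```

For example, `stage_to_number("Stage 3/4")` gives 3\.5, so a prediction of Stage 4 against a Stage 3/4 target counts as a half\-stage overestimate\.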

Results\.

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/fig_exp1_simulated_personas_summary.png)

Figure 1: Experiment 1 summary on simulated personas\. Left: overall accuracy differs by model tier, with frontier models outperforming compact/fast models\. Middle: the largest performance drop occurs on transitional stages, especially for compact/fast models\. Right: errors are directionally asymmetric, with compact/fast models showing stronger upward bias than frontier models\.

Per\-model classification accuracy\. Table [1](https://arxiv.org/html/2605.08549#S2.T1) summarizes per\-model performance against simulator\-intended target stages, ordered by Cohen’s kappa\. Three models \(Claude Opus 4\.6, Gemini 3\.1 Pro, and Grok 4\.2\) recovered the intended stage label on all simulated cases\. Five further models \(DeepSeek V4, GPT 5\.5, DeepSeek R1, Kimi K2\.6, GPT 5 Mini\) recovered all solid\-stage cases but degraded on transitional cases, where the structural distinction between, e\.g\., a Stage 3/4 and a Stage 4 response is finer\-grained\. Claude 4\.5 Haiku, Qwen 3\.6 Plus, Gemini 3\.1 Flash, and Mistral 3 Large performed less consistently across both categories\.

Table 1: Average performance metrics across various models and human raters\. Ordered by quadratic\-weighted Cohen’s $\kappa$\. Note: all model rows are computed on the full 138 simulated cases; the Human Raters row is computed on the 46\-case stratified subset rated by two human judges\.

Frontier vs\. compact models\. Aggregating across model tiers \(Compact/Fast: Haiku, GPT 5 Mini, Flash; Frontier: the rest\), a two\-way ANOVA on accuracy with factors Model Tier and Stage Type \(Solid vs\. Transitional\) showed a main effect of Stage Type \($p<0.0001$ with Mistral excluded as an outlier; $p<0.001$ with Mistral included\)\. Compact/fast models maintained 98\.3% accuracy on solid stages but dropped to 66\.5% on transitional stages \($t=5.12$, $p=0.006$, Welch’s $t$\-test\)\. Full statistics, including with\-outlier and without\-outlier analyses, are in Appendix [A\.4\.3](https://arxiv.org/html/2605.08549#A1.SS4.SSS3)\.
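The solid versus transitional breakdown can be sketched as follows\. This assumes the numeric stage encoding with transitional stages at half\-steps \(3/4 as 3\.5\); the function name and the toy labels are illustrative, not the paper’s data\.

```python
def accuracy_by_stage_type(predicted, target):
    """Exact-match accuracy split by whether the target is a solid or transitional stage."""
    groups = {"solid": [], "transitional": []}
    for p, t in zip(predicted, target):
        # Whole-number targets are solid stages; half-steps are transitional.
        kind = "solid" if float(t).is_integer() else "transitional"
        groups[kind].append(p == t)
    return {kind: sum(hits) / len(hits) for kind, hits in groups.items() if hits}


# Toy example: one transitional case is pushed up to the next solid stage.
target = [3.0, 3.5, 4.0, 4.5]
predicted = [3.0, 4.0, 4.0, 4.5]
by_type = accuracy_by_stage_type(predicted, target)
```

Computing this per model and then averaging within each tier would give the per\-tier accuracies that a tier\-by\-stage\-type analysis compares\.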

Directional bias differs by model tier\. Among the three top frontier models, classification was effectively unbiased on the simulated set: Claude Opus 4\.6, Gemini 3\.1 Pro, and Grok 4\.2 each produced zero overestimation and zero underestimation\. The remaining frontier models showed small overestimation effects on transitional cases only \(\+2\.9% to \+6\.5%\)\. By contrast, compact/fast models systematically overestimated stages \($M=12.7\%$, $SD=3.3$\), an effect significantly larger than in the frontier tier \($M=1.8\%$, $SD=2.9$; $t=6.07$, $p=0.001$, Welch’s $t$\-test\)\. Mistral 3 Large was the sole model to underestimate, doing so substantially \(−39\.1% on transitional cases\)\. Under these controlled synthetic conditions, smaller models therefore fail in a specific direction, tending to assign higher stages rather than producing random errors\.
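A minimal sketch of the two quantities used in this comparison: per\-model over\- and underestimation rates, and Welch’s $t$ statistic for comparing two tiers with unequal variances\. The function names and the numbers in the example are illustrative and are not the paper’s data\.

```python
from math import sqrt
from statistics import mean, variance


def directional_rates(predicted, target):
    """Share of cases rated above and below the target stage."""
    n = len(target)
    over = sum(p > t for p, t in zip(predicted, target)) / n
    under = sum(p < t for p, t in zip(predicted, target)) / n
    return over, under


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy data: one upward and one downward miss out of four cases.
over, under = directional_rates([3.0, 4.0, 4.0, 4.0], [3.0, 3.5, 4.0, 4.5])

# Toy per-model rates for two groups of three models each.
t_stat, df = welch_t([10.0, 12.0, 14.0], [1.0, 2.0, 3.0])
```

Welch’s variant is a natural choice when the groups differ in size and spread, as the two model tiers do here\.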

Takeaways\. \(i\) Under controlled synthetic conditions, top frontier LLMs recover simulator\-intended stage labels with high accuracy\. \(ii\) Smaller models degrade especially on transitional cases, where the structural distinction is finer, and their errors are more often upward than random\. \(iii\) Agreement with trained human raters was lower than agreement with simulator\-intended labels, indicating that the synthetic responses contain signal that is more easily recoverable relative to the simulator target than to human raters\. Experiment 1 should therefore be read as a controlled validation of recoverable developmental signal in synthetic DSCT responses, not as evidence that stage can be inferred equivalently well from real human language\. Within that scope, the experiment provides additional evidence that the newly proposed DSCT is sensitive to Kegan stage differences\.

## 3 Experiment 2: DSCT on human participants

Experiment 1 establishes that frontier LLM classifiers recover simulator\-intended target stages on synthetic personas under controlled conditions\. The harder question is whether the same instrument and the same classifiers behave sensibly on real human respondents, where structural signal is noisier, response length and engagement vary, and there is no target stage to recover\. We address this by collecting DSCT responses from human participants and using ratings from three Kegan\-trained raters as the human reference label\.

Participants\. We recruited 83 participants via email lists\. The sample consisted of 49 males \(59\.0%\) and 34 females \(41\.0%\), with an average age of 32\.04 years \($SD=10.96$; range = 20–64 years\)\. Participants represented 19 countries, with the largest groups residing in France \(41\.5%\), the United States \(17\.1%\), Spain \(9\.8%\), and South Korea \(6\.1%\)\. The cohort was highly experienced with Generative AI technologies: 79\.7% of participants reported using Large Language Model \(LLM\) chatbots daily, and an additional 13\.9% used them more than twice a week\. When asked for their primary AI chatbot of choice, 50\.0% reported OpenAI’s ChatGPT, followed by Anthropic’s Claude \(23\.8%\) and Google’s Gemini \(16\.2%\)\. Recruitment was through a sign\-up form that explained the study purpose, voluntary participation, and the right to withdraw at any time\. The study did not require formal IRB review under the institution’s policies for low\-risk online survey research; the consent procedure and full sign\-up text are reproduced in Appendix [A\.5\.1](https://arxiv.org/html/2605.08549#A1.SS5.SSS1), and the experimenters followed the guidelines of the Declaration of Helsinki\.

Task\. Participants completed the DSCT, presented as a *Situational Sense\-Making Questionnaire* to avoid priming the developmental construct\. They were also invited to complete a separate exploratory follow\-up task involving their regular LLM chatbot, described in Appendix [A\.7](https://arxiv.org/html/2605.08549#A1.SS7)\.

Human rating\. Each participant’s 20\-item DSCT response set was rated independently by three raters with basic orientation to Kegan’s constructive\-developmental theory, supported by a brief rating guide, following the same protocol as in Experiment 1: blind independent rating, followed by aggregation into a downstream consensus label \(full protocol in Appendix [A\.3](https://arxiv.org/html/2605.08549#A1.SS3)\)\. Overall agreement across the three raters was fair \(Fleiss’ $\kappa=0.31$\)\. Pairwise quadratic\-weighted Cohen’s $\kappa$ ranged from 0\.49 to 0\.81, indicating that some rater pairs agreed substantially more than others\. Of the 83 cases, 23 received unanimous ratings, 52 received a majority rating \(2 of 3 raters agreeing\), and 8 produced full three\-way disagreement\. We therefore use the term *rater consensus* rather than *ground truth* throughout: stage labels derived from a 20\-item self\-administered instrument are limited compared to the Subject–Object Interview, and the rater consensus is the best human reference label available at this scale rather than a definitive measurement\.
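For readers unfamiliar with the multi\-rater statistic, a minimal sketch of Fleiss’ $\kappa$ follows\. The function name and the toy ratings are illustrative and are not the study data\. Note that Fleiss’ $\kappa$ treats labels as unordered categories, so a half\-stage miss counts as a full disagreement, whereas the pairwise quadratic\-weighted statistic gives partial credit for near misses\.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for cases rated by the same number of raters.

    `ratings` is a list of cases, each a list of labels (one per rater).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(case) for case in ratings]
    categories = set().union(*counts)
    # Per-case agreement: share of rater pairs that chose the same label.
    p_i = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the pooled label proportions.
    p_j = [sum(cnt[cat] for cnt in counts) / (n_cases * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy example: four cases, three raters each.
toy = [[3.0, 3.0, 3.0], [3.0, 4.0, 4.0], [4.0, 4.0, 4.0], [3.0, 3.5, 4.0]]
kappa = fleiss_kappa(toy)
```

This difference in how near misses are scored is one reason an unweighted multi\-rater value can sit well below the weighted pairwise values on ordinal labels\.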

Sample distribution\. Using majority vote wherever at least two raters agreed, 75 of 83 participants received a usable consensus label, while 8 remained unresolved due to full disagreement\. The resulting distribution was: Stage 2/3 \($n=3$\), Stage 3 \($n=18$\), Stage 3/4 \($n=17$\), Stage 4 \($n=28$\), and Stage 4/5 \($n=9$\)\. The sample contains no consensus cases at Stage 2 or Stage 5, consistent with population estimates for these stages \(Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\) and with a self\-selected sample drawn from professional and academic networks\. We treat this as a meaningful scope limitation: claims in this section apply primarily to the Stage 3 / 3\-4 / 4 / 4\-5 range, where most adult variation occurs, and not to the developmental extremes\.
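The majority\-vote rule can be sketched in a few lines\. The function name is hypothetical; the closing checks simply confirm that the counts reported in the text add up\.

```python
from collections import Counter


def consensus_label(labels):
    """Return (label, kind), where kind is 'unanimous', 'majority', or 'unresolved'."""
    label, count = Counter(labels).most_common(1)[0]
    if count == len(labels):
        return label, "unanimous"
    if count > len(labels) / 2:
        return label, "majority"
    return None, "unresolved"  # full disagreement: no usable consensus


# Reported case counts: unanimous + majority + unresolved = all participants.
assert 23 + 52 + 8 == 83
# Reported consensus distribution over the five observed stage labels.
assert 3 + 18 + 17 + 28 + 9 == 75
```

With three raters, a majority always means exactly two agreeing raters, which is why the 8 three\-way disagreements are the only cases without a label\.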

LLM classification of human responses\. We then asked two of the top\-performing LLMs from Experiment 1 \(Claude Opus 4\.6 and Gemini 3\.1 Pro\) to classify each participant’s DSCT response set, using the identical classifier prompt \(Appendix [A\.4\.2](https://arxiv.org/html/2605.08549#A1.SS4.SSS2)\)\. For each participant this produces three quantities of interest: \(i\) the rater\-consensus stage, \(ii\) the per\-classifier stage from each of the top models, and \(iii\) the cross\-classifier consensus stage \(computed as the modal label across the models, with ties broken by averaging numeric stage values where applicable; ties and ambiguous cases are reported separately\)\. Comparing these allows us to ask whether classifier judgments on real human DSCT data agree with trained human raters, and whether the agreement pattern matches what we observed on simulated personas\.
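One way to read the cross\-classifier consensus rule is sketched below\. It assumes numeric stage values with transitional stages at half\-steps, and the function name is hypothetical; how the authors handle averaged values that fall off the half\-stage grid is not specified in the text\.

```python
from collections import Counter
from statistics import mean


def cross_classifier_consensus(stages):
    """Modal stage across classifiers; ties fall back to the mean of the tied values.

    Returns (stage, tied) so that tie cases can be reported separately.
    """
    counts = Counter(stages)
    top = max(counts.values())
    modal = sorted(s for s, c in counts.items() if c == top)
    if len(modal) == 1:
        return modal[0], False
    return mean(modal), True
```

With only two classifiers, any disagreement is a tie: labels of 3\.0 and 4\.0 would average to 3\.5, while labels of 3\.5 and 4\.0 would average to 3\.75, a value outside the label set\.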

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/fig_human_vs_llm_stage_distribution_questionnaire.png)

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/fig_human_vs_llm_agreement_heatmap_questionnaire.png)

Figure 2: Experiment 2: human vs\. LLM agreement on structured DSCT responses\. Left: marginal stage distributions assigned by human and LLM raters\. Right: agreement heatmap\. Most disagreements remain near the diagonal, indicating that LLM judgments often fall within the participant’s developmental neighborhood even when exact agreement is limited\.

Results\.

Human–LLM agreement on structured DSCT\. On the questionnaire data \(N=83\), agreement between the human reference labels and the LLM evaluator was fair \(quadratic\-weighted $\kappa=0.49$\)\. Exact agreement was 46\.3% \($n=38$\), and agreement within $\pm 0.5$ stage was 82\.9% \($n=68$\)\. The human rater assigned a mean stage of 3\.62 \($SD=0.53$\), while the LLM assigned a mean stage of 3\.44 \($SD=0.53$\); this difference was statistically significant \(Wilcoxon signed\-rank, $W=244.5$, $p=0.002$\)\. Thus, on structured DSCT responses from real humans, the LLM generally locates the participant’s developmental neighborhood, but does not reliably reproduce exact human staging\.

Directional asymmetry\. Disagreements were not symmetric\. In 37\.8% of all cases \($n=31$\), the human evaluator assigned a higher stage than the LLM, whereas the LLM assigned a higher stage in only 15\.9% of cases \($n=13$\)\. On structured questionnaire input, the LLM therefore appears slightly more conservative than the human rater rather than systematically inflationary \(Figure [2](https://arxiv.org/html/2605.08549#S3.F2)\)\.
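The directional breakdown can be reproduced from paired labels with a few lines\. The function name is hypothetical\. The second half of the sketch recomputes the reported percentages from the reported counts; those three counts sum to 82, and the percentages match that denominator\.

```python
def direction_counts(human, llm):
    """Count cases where the human label is higher than, equal to, or lower than the LLM label."""
    higher = sum(h > m for h, m in zip(human, llm))
    equal = sum(h == m for h, m in zip(human, llm))
    lower = sum(h < m for h, m in zip(human, llm))
    return higher, equal, lower


# Counts reported in the text: exact matches, human higher, LLM higher.
reported = {"exact": 38, "human_higher": 31, "llm_higher": 13}
total = sum(reported.values())
shares = {name: round(100 * count / total, 1) for name, count in reported.items()}
```

Human\-higher cases outnumber LLM\-higher cases by more than two to one, which is the asymmetry described above\.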

By contrast, an exploratory follow\-up using participants’ unstructured chat history produced strongly upward\-skewed ratings and very weak agreement with structured questionnaire\-based judgments, suggesting that stage inference from unconstrained chat is not yet reliable in the present setting \(Appendix [A\.7](https://arxiv.org/html/2605.08549#A1.SS7)\)\.

Interpretation\. This pattern contrasts with the cleaner synthetic performance of Experiment 1\. On real human DSCT responses, stage\-relevant signal remains recoverable, but the ceiling is lower because the elicited text is less internally consistent and provides a less uniform target for both human raters and LLM classifiers\. This is reflected not only in lower model agreement than in the synthetic setting, but also in the human labels themselves: while most participants received a usable consensus label, a non\-trivial minority remained ambiguous even for trained raters\. Additional item\-level analyses supported this interpretation, suggesting that predictive signal in the human data was concentrated in a subset of questions rather than expressed uniformly across the questionnaire \(Appendix [A\.5\.2](https://arxiv.org/html/2605.08549#A1.SS5.SSS2)\)\. The benchmark target in this setting is therefore not exact stage recovery, but approximate alignment with trained human judgment on elicited text\.

Takeaways\. \(i\) DSCT applied to real human respondents yields usable but imperfect developmental labels: agreement is fair, exact matches are limited, and proximity\-based agreement is substantially higher than exact agreement\. \(ii\) On structured questionnaire text, the LLM is slightly stricter than the human rater rather than systematically upward\-biased\. \(iii\) Compared to Experiment 1, performance drops on real human responses, indicating that recoverability depends not only on the classifier, but also on the source and cleanliness of the elicited text\.

### 3\.1 Experiment 3: Stage\-like structure in default model\-generated DSCT responses

Experiments 1 and 2 ask whether developmental signal in DSCT responses can be recovered from synthetic personas and human respondents\. A complementary question is whether models also differ in the developmental structure of the text they generate by default\. To examine this, we prompted a broader set of language models to answer the DSCT without persona\-conditioning or target\-stage instructions, and then rated the resulting responses using the same developmental framework\.

Design\. We expanded our original cohort of twelve models to 29 language models, ranging from GPT\-3\.5 Turbo to current frontier systems, and prompted each to complete the full 20\-item DSCT in a single pass\. The prompt presented the questionnaire exactly as a respondent would see it, instructed the model to answer every item, and requested JSON\-formatted responses \(prompt in Appendix [A\.6\.1](https://arxiv.org/html/2605.08549#A1.SS6.SSS1)\)\. Unlike Experiment 1, no developmental profile or target stage was supplied\. The resulting outputs therefore reflect the model’s default response style under DSCT elicitation rather than stage\-conditioned simulation\.
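Because every model is asked to answer all 20 items in JSON, a completeness check before rating is a natural step\. The sketch below is an assumption\-laden illustration: the schema \(a JSON object keyed by item number as a string\) and the function name are hypothetical, since the actual format is defined by the prompt in the appendix\.

```python
import json

N_ITEMS = 20  # DSCT length stated in the text


def parse_dsct_reply(raw):
    """Parse a model reply and check that every item has a non-empty answer.

    ASSUMPTION: the reply is a JSON object keyed by item number ("1".."20").
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by item number")
    answers = {}
    for item in range(1, N_ITEMS + 1):
        text = str(data.get(str(item), "")).strip()
        if not text:
            raise ValueError(f"missing answer for item {item}")
        answers[item] = text
    return answers


# Toy reply with placeholder answers for all 20 items.
example_reply = json.dumps({str(i): f"answer {i}" for i in range(1, N_ITEMS + 1)})
parsed = parse_dsct_reply(example_reply)
```

Rejecting incomplete replies keeps every rated response set comparable in length to a human respondent’s\.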

Model set\.The 29 models included 18 frontier models and 11 compact/fast models, spanning releases from early 2023 through the present\. This allows us to ask not only whether models differ in the stage\-like structure of their outputs, but also whether that structure changes over model generations and between model tiers\.

Rating procedure\. The DSCT responses generated by each model were rated by Gemini 3\.1 Pro using the same Keganian rubric as in Experiments 1 and 2\. To reduce single\-run stochasticity, we repeated the rating procedure three times per model and used the resulting labels jointly in analysis, following recent recommendations for repeated evaluation in LLM benchmarking \(Alvarado Gonzalez et al\., [2025](https://arxiv.org/html/2605.08549#bib.bib31)\)\. No human raters were involved in this experiment\. We interpret the resulting labels not as claims that LLMs possess human developmental stages, but as ratings of the stage\-like structure of the text they generate under a shared elicitation format\.
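One simple way to use three repeated ratings jointly is to average them on a numeric scale\. The sketch below does this under two assumptions that are not stated in the text: transitional labels are coded halfway between adjacent stages, and the repeated labels are pooled by their mean\. The model names and labels are hypothetical\.

```python
from statistics import mean
from typing import Dict, List

# Assumed numeric coding: transitional labels sit halfway between stages.
STAGE_SCORE = {
    "1": 1.0, "2": 2.0, "2/3": 2.5, "3": 3.0,
    "3/4": 3.5, "4": 4.0, "4/5": 4.5, "5": 5.0,
}


def pooled_score(labels: List[str]) -> float:
    """Average the numeric codes of repeated ratings for one model."""
    if not labels:
        raise ValueError("at least one rating is required")
    return mean(STAGE_SCORE[label] for label in labels)


# Hypothetical repeated ratings (three runs per model), not study data.
repeated_ratings: Dict[str, List[str]] = {
    "model_a": ["4", "4/5", "4"],
    "model_b": ["3/4", "3/4", "3"],
}
scores = {name: pooled_score(labels) for name, labels in repeated_ratings.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Averaging keeps a single divergent run from determining a model's score while still letting it shift the result\.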

Results\. Most current frontier models were rated in the Stage 4 to Stage 4/5 range, with a smaller number reaching Stage 5\-like structure\. Compact/fast models were rated lower on average, with several clustering around Stage 3/4 and one model \(GPT\-3\.5 Turbo\) rated at Stage 2 \(Figure [3](https://arxiv.org/html/2605.08549#S3.F3)\)\. Across the full sample, frontier models exhibited a higher mean developmental score \($M = 3.97$, $SD = 0.61$\) than compact/fast models \($M = 3.55$, $SD = 0.42$\)\. Full per\-model ratings are reported in Appendix [A\.6\.2](https://arxiv.org/html/2605.08549#A1.SS6.SSS2)\.

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/fig_exp3_model_generated_dsct_trends.png)

Figure 3: Experiment 3: stage\-like structure in default model\-generated DSCT responses\. Left: developmental ratings of model\-generated DSCT responses increase over release time in both frontier and compact/fast model families\. Right: among models released after April 2025, frontier models produce higher\-rated responses on average than compact/fast models\.

We also observe a temporal trend: newer models tend to generate DSCT responses rated at higher stages than older ones\. This pattern is visible in both frontier and compact/fast families\. Across the full set of models, developmental score increased with release date, and the overall regression was significant \($F(3,25) = 19.22$, $p < 0.001$, $R^2 = 0.698$\), although the interaction between release time and model tier did not reach significance\. We therefore interpret the main pattern as a general upward shift in stage\-like output structure over successive model generations rather than a sharply diverging slope between tiers\.
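The reported regression statistics can be checked for internal consistency\. The sketch below assumes an ordinary least squares model with three predictors \(release time, model tier, and their interaction\) fitted to the 29 models; the predictor set is inferred from the degrees of freedom and the mention of an interaction term, and is an assumption rather than a stated specification\.

```python
from typing import Tuple


def f_from_r2(r2: float, n_obs: int, n_predictors: int) -> Tuple[float, int, int]:
    """Overall F statistic implied by R^2 for an OLS model with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_stat = (r2 / df_model) / ((1.0 - r2) / df_resid)
    return f_stat, df_model, df_resid


# 29 models, three assumed predictors, and the reported R^2 = 0.698.
f_stat, df_model, df_resid = f_from_r2(0.698, n_obs=29, n_predictors=3)
print(f"F({df_model},{df_resid}) = {f_stat:.2f}")
```

The implied degrees of freedom are \(3, 25\), and the implied F value is close to the reported 19\.22; the small difference is what one would expect from rounding $R^2$ to three decimals\.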

Interpretation\.These findings suggest that developmental structure matters not only on the classifier side, but also in the default style of model\-generated text\. Models that generate more self\-authored or self\-transforming DSCT responses may also be more likely to interpret others through those same structures\. This provides one plausible mechanism for systematic upward pressure in stage assignment: a model whose own outputs default to Stage 4 or Stage 4/5 language may also tend to read elicited responses through that lens\.

At the same time, these ratings should be interpreted cautiously\. We do not claim that LLMs possess human developmental stages, nor that stage labels capture stable internal properties of a model\. Rather, Keganian ratings provide a descriptive framework for comparing the structure of model\-generated responses under a shared elicitation format\.

Takeaways\.\(i\) LLMs differ not only in their ability to classify developmental structure, but also in the stage\-like structure of the DSCT responses they generate by default\. \(ii\) Larger and newer models tend to produce text rated at higher developmental stages than smaller or older models\. \(iii\) Model\-side output structure may therefore influence not only what models say, but also how they read stage\-like structure in the text of others\.

## 4 Discussion

##### Recoverable developmental signal in text\.

The main result of this paper is that DSCT makes developmental signal recoverable in text, but not equally across sources\. Under stage\-conditioned simulation, frontier models recover intended labels with high accuracy\. On real human DSCT responses, agreement is weaker and bounded by the fact that even trained raters do not fully agree\. Simulated personas produce cleaner stage\-conditioned text than human respondents, while human DSCT responses are noisier, shorter, and more uneven\. This establishes a useful but limited benchmark target: not developmental structure in the person, but recoverable developmental signal in elicited text\.

##### Model\-side developmental style\.

A notable finding is that models differ not only in how they classify developmental structure, but also in the structure of the responses they generate under the same DSCT prompt\. Larger and newer models tend to produce outputs rated at higher stages than smaller and older ones\. We do not interpret this as evidence that LLMs possess developmental stages in the human sense\. Rather, Keganian ratings provide a descriptive lens on output structure\. One possible implication is that models whose default outputs are more self\-authored or system\-level may also be more likely to interpret others through that same frame, though this relationship remains to be tested directly\.

##### Developmental modeling and the Stage 3/4 region\.

The most consequential region for conversational AI remains the space around Stage 3, Stage 4, and their transition\. Stage 3\-like organization is more shaped by external recognition, belonging, and relational alignment\. Stage 4\-like organization supports more independent evaluation from an internal frame \(Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32), [1998](https://arxiv.org/html/2605.08549#bib.bib33)\)\. This distinction matters because the same model behavior may function differently depending on the meaning\-making structure expressed in a user’s elicited text\. The point is not that users must reach a particular stage before safely engaging AI, as suggested by [Aruleswaran](https://arxiv.org/html/2605.08549#bib.bib39), but that developmental variation may matter for how model outputs are interpreted and taken up\. More broadly, this benchmark is not a test of theory of mind in the classical sense \(Frith and Frith, [2005](https://arxiv.org/html/2605.08549#bib.bib46)\): what is being compared across all three regimes is not hidden mental state, but recoverable structure in elicited language\.

##### Implications and limits\.

The immediate implication is not that systems should infer developmental stage from arbitrary interaction traces, but that structured elicitation may provide a more reliable basis for developmental adaptation than unconstrained inference\. Our exploratory chat\-history follow\-up reinforces this point: stage assignments derived from unstructured conversational traces were strongly skewed upward and showed very weak agreement with questionnaire\-based judgments\. DSCT is one possible instrument for structured elicitation of this kind\. More broadly, this suggests a complementary direction for personalization: not only adapting to preferences, goals, expertise, or trust\-related variables \(Chen et al\., [2024](https://arxiv.org/html/2605.08549#bib.bib49); Srinivasan and Thomason, [2025](https://arxiv.org/html/2605.08549#bib.bib5)\), but also to differences in how users make sense of and take up model outputs\. At the same time, DSCT is a brief self\-administered instrument and does not substitute for the Subject–Object Interview\. Stage labels at this scale should be treated as probabilistic rather than diagnostic\. Human agreement is only fair overall, the current sample is concentrated in the Stage 3 to Stage 4/5 range, and Kegan’s framework is only one developmental lens\. Future work could compare alternative developmental frameworks, examine state dependence, and test whether similar signal can be recovered under other elicitation formats\.

## 5 Conclusion

We introduce DSCT, a 20\-item instrument for eliciting developmental signal in self\-administered text, and use it to benchmark LLMs on Keganian labeling across three response regimes: simulated personas, real human respondents, and default model\-generated answers\. Our results show that developmental signal is recoverable from DSCT responses, but not equally across sources\. Structured synthetic responses support high classifier performance, real human responses are noisier and bounded by only fair human agreement, and model\-generated DSCT responses reveal stable differences in stage\-like output structure across model families\.

These findings suggest that the central constraint for stage\-aware conversational AI is not simply classifier accuracy, but whether elicited text contains developmental structure that can be reliably recovered\. If developmental context is to inform interaction, structured elicitation may provide a more reliable basis than passive inference from arbitrary language\.

## References

- Alvarado Gonzalez et al\. \(2025\) Do repetitions matter? strengthening reliability in llm evaluations\. arXiv preprint\.
- A\. Aruleswaran The emergence mirror framework: developmental architecture for human\-ai relational governance\. Available at SSRN 6114166\.
- P\. T\. Bartone, S\. A\. Snook, and T\. R\. Tremble Jr \(2002\) Cognitive and personality predictors of leader performance in west point cadets\. Military Psychology 14\(4\), pp\. 321–338\.
- M\. B\. Baxter Magolda and P\. M\. King \(2007\) Interview strategies for assessing self\-authorship: constructing conceptions of the self\. Journal of College Student Development 48\(5\), pp\. 491–508\.
- J\. G\. Berger \(2024\) Changing on the job: how leaders become courageous, wise, and steady in an anxious world\. Stanford University Press\.
- K\. Chandra, M\. Kleiman\-Weiner, J\. Ragan\-Kelley, and J\. B\. Tenenbaum \(2025\) Sycophantic chatbots cause delusional spiraling, even in ideal bayesians\. arXiv preprint arXiv:2602\.19141\.
- J\. Chen, Z\. Liu, X\. Huang, C\. Wu, Q\. Liu, G\. Jiang, Y\. Pu, Y\. Lei, X\. Chen, X\. Wang, et al\. \(2024\) When large language models meet personalization: perspectives of challenges and opportunities\. World wide web 27\(4\), pp\. 42\.
- Y\. Y\. Chiu, A\. Sharma, I\. W\. Lin, and T\. Althoff \(2024\) A computational framework for behavioral assessment of llm therapists\. arXiv preprint arXiv:2401\.00820\.
- S\. R\. Cook\-Greuter \(1999\) Postautonomous ego development: a study of its nature and measurement\. Harvard University\.
- C\. Frith and U\. Frith \(2005\) Theory of mind\. Current biology 15\(17\), pp\. R644–R645\.
- L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, A\. Albalak, and Y\. Choi \(2025\) Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\. arXiv preprint arXiv:2510\.22954\.
- R\. Kegan \(1994\) In over our heads: the mental demands of modern life\.
- R\. Kegan \(1998\) In over our heads: the mental demands of modern life\. Harvard University Press\.
- O\. Laske \(2023\) A process model of social\-emotional development\. In Advanced Systems\-Level Problem Solving, Volume 1: Approaching Real\-World Complexity with Dialectical Thinking, pp\. 149–169\.
- Z\. Li, D\. Zhang, M\. Zhang, J\. Zhang, Z\. Liu, Y\. Yao, H\. Xu, J\. Zheng, P\. Wang, X\. Chen, et al\. \(2025\) From system 1 to system 2: a survey of reasoning large language models\. arXiv preprint arXiv:2502\.17419\.
- J\. Loevinger et al\. \(1998\) Technical foundations for measuring ego development: the washington university sentence completion test\. Psychology Press\.
- J\. Piaget and M\. T\. Cook \(1952\) The origins of intelligence in children\.
- T\. Srinivasan and J\. Thomason \(2025\) Adjust for trust: mitigating trust\-induced inappropriate reliance on ai assistance\. arXiv preprint arXiv:2502\.13321\.
- S\. Wang, T\. Xu, H\. Li, C\. Zhang, J\. Liang, J\. Tang, P\. S\. Yu, and Q\. Wen \(2026\) Large language models for education: a survey and outlook\. IEEE Signal Processing Magazine 42\(6\), pp\. 51–63\.

## Appendix A

### A\.1 Summary of Kegan stages

For orientation, Robert Kegan’s constructive\-developmental theory describes a sequence of increasingly complex meaning\-making structures:

- Stage 1 \(Impulsive Mind\) is organized by immediate perceptions and impulses, with little stable reflective distance; this stage is generally rare in adult populations outside severe cognitive impairment\.
- Stage 2 \(Imperialist Mind\) is organized around personal needs, concrete interests, and transactional outcomes; external rules are experienced mainly as constraints or tools\.
- Stage 3 \(Socialized Mind\) is organized through external validation, belonging, and alignment with important others; identity is shaped by relationships and social expectations\.
- Stage 4 \(Self\-Authoring Mind\) is organized through an internalized value system or self\-authored framework that can evaluate and reject external rules\.
- Stage 5 \(Self\-Transforming Mind\) can reflect on and coordinate multiple systems at once, holding contradiction and paradox without collapsing them prematurely\.

In practice, respondents may also be rated as*transitional*between adjacent stages \(e\.g\., 3/4 or 4/5\), indicating that their responses show meaningful evidence of both structures rather than a fully consolidated center of gravity at either one stage or the other\.

Population estimates in Kegan’s framework place Stage 3 and Stage 4 as the most common adult structures, with Stage 2 and Stage 5 comparatively rare \[Kegan, [1994](https://arxiv.org/html/2605.08549#bib.bib32)\]\.

### A\.2 Questionnaires and instrument comparison

This section presents the DSCT and SCT questionnaires, along with the supplementary comparison showing that, on the simulated\-persona benchmark, the shorter DSCT preserves enough signal for the strongest classifiers to recover stage as reliably as with the longer SCT\.

#### A\.2\.1 Developmental Sentence Completion Test \(DSCT\)

The Developmental Sentence Completion Test \(DSCT\) is a 20\-item self\-administered instrument designed to elicit text samples sufficient for a trained human rater or LLM to make a tentative assessment of developmental structure\. It is organized into two sections of 10 items each\. The first section uses first\-person prompts \(*self\-assessment*\); the second uses matched third\-person prompts about a generic other \(*abstracted assessment*\)\. Respondents were instructed to complete each sentence with the first thought that came to mind, using a single word, short phrase, or full paragraph\.

##### Self\-Assessment

1. When a promise is broken…
2. When both choices feel right…
3. When things don’t go as I hoped…
4. When the hard work finally pays off…
5. If I am asked to compromise…
6. When I realized someone was paying attention…
7. Saying goodbye to something that mattered…
8. When what used to work no longer works…
9. When I have to choose what comes first…
10. When I realize I cannot control what happens next…

##### Abstracted Assessment

11. When a person feels they were treated unfairly by their supervisor…
12. If a person feels pulled between their own view and the expectations of others…
13. A person works very hard on something important, but it fails…
14. A team celebrates a project that succeeded\. The person who led it…
15. A person believes a decision their group supports is wrong; they will…
16. When a person sees someone make a sacrifice for others…
17. When a person has to leave a role or place that was important to them…
18. A person realizes their plans need to change significantly\. This might affect their existing beliefs…
19. When a person must choose between two opportunities that both seem meaningful…
20. A person must make an important decision and take on a new responsibility without complete information…

#### A\.2\.2 Sentence Completion Test \(SCT\)

For comparison, we also include the longer 36\-item Loevinger Sentence Completion Test \(SCT\), which served as the reference instrument in the DSCT vs\. SCT comparison reported in Appendix[A\.2\.3](https://arxiv.org/html/2605.08549#A1.SS2.SSS3)\. As in DSCT, respondents were instructed to complete each stem with the first thought that came to mind\.

1. When a child will not join in group activities…
2. Raising a family…
3. When I am criticized…
4. A man’s job…
5. Being with other people…
6. The thing I like about myself is…
7. My mother and I…
8. What gets me into trouble is…
9. Education…
10. When people are helpless…
11. Women are lucky because…
12. A good boss… \(Alternative: A good father…\)
13. A girl/boy has a right to…
14. When they talked about sex, I…
15. A wife/husband should…
16. I feel sorry…
17. A man/woman feels good when…
18. Rules are…
19. Crime and delinquency could be halted if…
20. Men are lucky because…
21. I just can’t stand people who…
22. At times she/he worried about…
23. I am…
24. A woman/man feels good when…
25. My main problem is…
26. Whenever she/he was with her/his mother, she/he…
27. The worst thing about being a woman/man is…
28. A good mother…
29. Sometimes she/he wished that…
30. When I am with a man/woman…
31. When she/he thought of her/his mother, she/he…
32. If I can’t get what I want…
33. Usually she/he felt that sex…
34. For a woman/man, a career is…
35. My conscience bothers me if…
36. A woman/man should always…

#### A\.2\.3 DSCT vs\. SCT comparison

DSCT was designed as a shorter, less invasive successor to the Loevinger SCT \[Loevinger and others, [1998](https://arxiv.org/html/2605.08549#bib.bib44)\] \(see Section [2](https://arxiv.org/html/2605.08549#S2)\)\. To check that DSCT preserves enough developmental signal for computational stage classification, we ran a parallel comparison on the simulated\-persona set used in Experiment 1\.

Setup\.Each of the 23 developmental profiles produced six independent simulated response sets to the 36\-item SCT and six independent simulated response sets to the 20\-item DSCT, yielding 138 classification decisions per classifier per instrument\. The simulator prompt and system instruction were identical across the two questionnaires; only the questionnaire stems differed\. We then asked the two top\-performing classifiers from Experiment 1, Gemini 3\.1 Pro and Claude Opus 4\.6, to classify each response set into a Kegan stage\.

Results\.Both classifiers correctly recovered the simulator\-intended target stage on all 23 profiles, on all six rounds, for both questionnaires \(Table[2](https://arxiv.org/html/2605.08549#A1.T2)\)\. In other words, on this simulated\-persona benchmark, the shorter DSCT carried enough signal for the strongest classifiers to recover stage as reliably as the longer SCT\.

Table 2: Classifier accuracy against simulator\-intended target stage, by questionnaire\. Numerator is profiles correctly classified out of 23\. Six rounds per profile; stage labels were identical across rounds in every cell\.

Sub\-stage nuance\. The simulator\-intended labels for transitional profiles use the X/Y notation \(e\.g\., 3/4\), and our classifiers were instructed to output labels in that same form\. Some source profiles in the developmental literature contain finer distinctions \(e\.g\., 4\(3\) or 3\(4\)\) \[Laske, [2023](https://arxiv.org/html/2605.08549#bib.bib43); Baxter Magolda and King, [2007](https://arxiv.org/html/2605.08549#bib.bib52)\]\. Our classifiers correctly recovered the broader transitional category, but not these asymmetric within\-transition emphases\. We treat this as expected given the coarser X/Y output format\.

Scope\. This comparison does not show that DSCT and SCT are interchangeable for human respondents, nor that DSCT covers the full construct space of SCT or SCTi\-MAP \[Cook\-Greuter, [1999](https://arxiv.org/html/2605.08549#bib.bib45)\]\. It supports the narrower claim that, on simulator\-generated personas, the shorter DSCT preserves enough signal for the strongest available classifiers to recover the intended stage as reliably as with the longer SCT\.

### A\.3 Rater protocol

All human ratings in this paper were based on written responses to the DSCT and used Robert Kegan’s constructive\-developmental framework to infer meaning\-making structure from text\. Raters were instructed to focus on*how*experience was organized in the response rather than on verbal sophistication, morality, intelligence, or general maturity\. In particular, the rating guide emphasized the subject–object distinction: what the respondent appeared able to reflect on versus what still seemed to organize the response from within\. The guide also included stage descriptions, transitional structures, and cautions about low\-evidence cases such as brief, affect\-only, or linguistically limited responses\.

Allowed labels\.Raters were allowed to assign any of the following labels: Stage 1, Stage 2, Stage 2/3, Stage 3, Stage 3/4, Stage 4, Stage 4/5, and Stage 5\. Stage 1 was allowed in principle but was not expected in the datasets considered here\.

Experiment 1\.In the simulated\-persona condition, two raters independently evaluated a stratified subset of 46 simulated cases, corresponding to two independently generated response sets for each of the 23 developmental profiles\. Raters were blind to the simulator\-intended target stage, the profile identity, and one another’s ratings\. The two ratings were never more than a half\-stage apart\. The final human reference label for each case was taken as the lower of the two ratings\.

Experiment 2\.In the human\-participant condition, three raters independently evaluated all 83 DSCT response sets\. Raters were blind to participant identity and to one another’s ratings, but they knew they were evaluating human rather than simulated responses\. When at least two raters agreed, that label was taken as the consensus\. In three\-way disagreement cases, the middle rating was used as the final reference label\. This rule applied both when the ratings formed an adjacent progression \(e\.g\., 2, 2/3, 3\) and when they spanned a larger gap \(e\.g\., 2, 2/3, 4\), yielding a conservative center label\.
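The two reference\-label rules described above can be written down compactly\. The sketch below is one reading of them: the lower of two ratings for Experiment 1, and majority\-else\-middle for Experiment 2\. The function names are hypothetical, and the ordering of labels follows the allowed\-label list\.

```python
from collections import Counter
from typing import List

# Allowed labels in ascending order, following the rater protocol.
STAGE_ORDER = ["1", "2", "2/3", "3", "3/4", "4", "4/5", "5"]
RANK = {label: position for position, label in enumerate(STAGE_ORDER)}


def reference_label_two_raters(first: str, second: str) -> str:
    """Experiment 1 rule: take the lower of the two ratings."""
    return min(first, second, key=RANK.__getitem__)


def reference_label_three_raters(ratings: List[str]) -> str:
    """Experiment 2 rule: majority label if any, otherwise the middle rating."""
    if len(ratings) != 3:
        raise ValueError("exactly three ratings are required")
    label, count = Counter(ratings).most_common(1)[0]
    if count >= 2:
        return label
    return sorted(ratings, key=RANK.__getitem__)[1]


print(reference_label_two_raters("4", "3/4"))
print(reference_label_three_raters(["3", "4", "3"]))
print(reference_label_three_raters(["2", "2/3", "4"]))
```

With three distinct ratings, the middle rating is the median on the ordered label scale, which is why the rule yields the same center label for adjacent progressions and for wider gaps\.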

Confidence and low\-evidence cases\.Raters also assigned a confidence level \(Low / Medium / High\)\. The guide instructed raters to lower confidence when responses were sparse, vague, brief, affect\-only, inconsistent across prompts, or linguistically limited, including cases likely shaped by non\-native English fluency\. Mixed responses across prompts were treated as normal; raters were asked to infer the respondent’s overall center of gravity rather than coding from isolated strong answers, and to use transitional labels when the evidence suggested a genuine mix of adjacent structures\.

### A\.4 Experiment 1 supplementary materials

#### A\.4\.1 Simulated developmental profiles

Experiment 1 used 23 literature\-derived developmental profiles as conditioning templates for persona simulation\. These profiles were not treated as ground\-truth people, but as compact textual representations of stage\-specific meaning\-making structures drawn from prior developmental literature\. Each profile was associated with a target stage label used for simulation and evaluation\.

Table 3: Literature\-derived developmental profiles used as conditioning templates in Experiment 1\.

#### A\.4\.2 Simulation and classification prompts

This subsection reports the prompt templates used in Experiment 1\. Full questionnaire contents are given in Appendix[A\.2\.1](https://arxiv.org/html/2605.08549#A1.SS2.SSS1)and Appendix[A\.2\.2](https://arxiv.org/html/2605.08549#A1.SS2.SSS2)\.

##### Persona\-conditioned simulation prompt \(DSCT response generation\)\.

The following user prompt template was used to generate simulated DSCT responses from literature\-derived developmental profiles:

> “You are roleplaying as the following person\. Adopt their persona, developmental model, worldview, iq, and tone of voice completely\. Profile: \[PROFILE\_JSON\] Please answer the following questionnaire from the perspective of this person\. Questionnaire: \[DSCT\_QUESTIONNAIRE\]”

This prompt was combined with the following system instruction:

> “You are an expert at persona adoption and psychological roleplay\. You must answer the questions exactly as the described person would, reflecting their specific developmental stage, biases, knowledge, limitations and personality\. Don’t make it too long\. Answer like the real person would\. Make each response differentiated enough but use mostly plain language, don’t make it excessively complex in neither writing or vocabulary\.”

##### DSCT classification prompt\.

The following prompt template was used to classify DSCT responses into Kegan stages:

> “You are an expert developmental psychologist specializing in Robert Kegan’s Constructive\-Developmental Theory\. Your task is to evaluate the cognitive development stage of several candidates based on their responses to a questionnaire\. The possible stages are 1, 2, 3, 4, 5, and transitional stages \(e\.g\., 2/3, 3/4, 4/5\)\. Here is the questionnaire the candidates responded to: \[DSCT\_QUESTIONNAIRE\] Attached are the candidates’ responses to evaluate\. Please evaluate each candidate and provide their ID, the assigned Kegan stage, and a detailed explanation of your reasoning in 1 line\. Here are some examples of responses we expect from your evaluation in json format: \[EXAMPLE\_OUTPUT\_JSON\]”
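Because the classification prompt requests JSON output, a small validation step can catch malformed or out\-of\-range labels before analysis\. The sketch below is illustrative only: the field names `id`, `stage`, and `explanation` and the list\-of\-records layout are assumptions, since the content of \[EXAMPLE\_OUTPUT\_JSON\] is not reproduced here, and the allowed labels follow the rater protocol in Appendix A\.3\.

```python
import json
from typing import Dict

# Allowed labels taken from the rater protocol (Appendix A.3).
ALLOWED_STAGES = {"1", "2", "2/3", "3", "3/4", "4", "4/5", "5"}


def parse_classifications(raw: str) -> Dict[str, str]:
    """Map candidate ID to stage label, rejecting unknown labels."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of candidate records")
    parsed: Dict[str, str] = {}
    for record in records:
        stage = str(record["stage"]).strip()  # assumed field name
        if stage not in ALLOWED_STAGES:
            raise ValueError(f"unexpected stage label: {stage!r}")
        parsed[str(record["id"])] = stage  # assumed field name
    return parsed


# Made-up classifier output in the assumed layout.
example_output = json.dumps([
    {"id": "P01", "stage": "3/4", "explanation": "Mixes external and internal frames."},
    {"id": "P02", "stage": "4", "explanation": "Evaluates from a self-authored system."},
])
parsed = parse_classifications(example_output)
```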

##### SCT classification prompt\.

For the DSCT vs\. SCT comparison, the same classification template was used with the questionnaire field replaced by the SCT stems:

> “You are an expert developmental psychologist specializing in Robert Kegan’s Constructive\-Developmental Theory\. Your task is to evaluate the cognitive development stage of several candidates based on their responses to a questionnaire\. The possible stages are 1, 2, 3, 4, 5, and transitional stages \(e\.g\., 2/3, 3/4, 4/5\)\. Here is the questionnaire the candidates responded to: \[SCT\_QUESTIONNAIRE\] Attached are the candidates’ responses to evaluate\. Please evaluate each candidate and provide their ID, the assigned Kegan stage, and a detailed explanation of your reasoning in 1 line\. Here are some examples of responses we expect from your evaluation in json: \[EXAMPLE\_OUTPUT\_JSON\]”

#### A\.4\.3 Detailed statistics

This subsection reports the additional statistical tests underlying the model\-tier comparisons in Experiment 1\.

Stage type effect\. Aggregating classifiers by model tier \(compact/fast: Claude 4\.5 Haiku, GPT 5 Mini, Gemini 3\.1 Flash; frontier: all remaining models\), a two\-way ANOVA on classification accuracy with factors *Model Tier* and *Stage Type* \(Solid vs\. Transitional\) showed a main effect of Stage Type\. This effect remained significant both when Mistral 3 Large was excluded as an outlier \($p < 0.0001$\) and when it was retained \($p < 0.001$\)\.

Compact/fast degradation on transitional stages\. Compact/fast models maintained 98\.3% accuracy on solid\-stage profiles but dropped to 66\.5% on transitional\-stage profiles\. This difference was significant under Welch’s t\-test \($t = 5.12$, $p = 0.006$\), indicating that the main loss of performance in smaller models was concentrated on transitional cases rather than solid stages\.

Directional bias by model tier\. Overestimation bias also differed by model tier\. Compact/fast models showed substantially larger upward bias \($M = 12.7\%$, $SD = 3.3$\) than frontier models \($M = 1.8\%$, $SD = 2.9$\)\. This difference was significant under Welch’s t\-test \($t = 6.07$, $p = 0.001$\)\. Under these controlled synthetic conditions, smaller models therefore failed not only more often, but also in a more directionally consistent way\.
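For readers unfamiliar with Welch’s t\-test, the sketch below computes the statistic and the Welch\-Satterthwaite degrees of freedom from two samples with unequal variances\. The per\-model bias values are hypothetical placeholders, not the values behind the statistics reported above, and the function name `welch_t` is an illustrative choice\.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, df


# Hypothetical upward-bias percentages per model (illustration only).
compact_bias = [9.0, 13.0, 16.0]
frontier_bias = [0.0, 1.0, 2.0, 5.0]
t_stat, df = welch_t(compact_bias, frontier_bias)
print(f"t = {t_stat:.2f}, df = {df:.2f}")
```

Unlike Student’s t\-test, this version does not pool the two variances, which suits comparisons between tiers with different group sizes and spreads\.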

Outlier handling\.Mistral 3 Large was treated as an outlier in one version of the ANOVA because its performance was substantially lower than the rest of the model set, particularly on transitional stages\. We therefore report the Stage Type effect both with and without Mistral included\. The qualitative interpretation was unchanged in both cases\.

### A\.5 Experiment 2 supplementary materials

#### A\.5\.1 Human study materials

This subsection reproduces the participant\-facing materials used for the human study in Experiment 2\. Participants were recruited through personal, academic, and professional networks\. Compensation was a $15 or €15 Amazon gift card \(or equivalent\)\.

##### Enrollment / sign\-up form\.

Participants first completed an enrollment form that collected their email address and consent to participate\. The form explained that, after enrollment, an experimenter would assign each participant a Subject ID to keep subsequent responses anonymous\. It also described the overall study as consisting of two short tasks: \(1\) a 20\-item sentence\-completion questionnaire and \(2\) a prompt\-based task in which participants would ask their regular LLM chatbot to complete a cognitive assessment and report the numerical output\. The form stated that the full study would take approximately 30 minutes\.

The enrollment form also stated that participation was voluntary, that participants could stop or withdraw at any time without giving a reason, and that they could contact the experimenters if they wished to withdraw\. It further noted that the study followed the Declaration of Helsinki ethics guidelines and allowed participants to indicate whether they wished to receive their individual results and the aggregated study results\. Under the institution’s policies, this study did not require formal IRB review because it was classified as low\-risk online survey research\.

##### Questionnaire instructions shown before DSCT\.

Before completing the DSCT, participants saw the following instructions:

> “This questionnaire contains 20 sentence\-completion prompts\. For the first 10 prompts, you will be asked to complete sentences about how you think or respond when certain situations happen to you\. For the next 10 prompts, you will be asked to complete sentences about how a person might think or respond in a similar situation\. Please finish each sentence with the first thought that comes to mind\. There are no right or wrong answers\. You may complete the sentences with a single word, a short phrase, or a full paragraph, whichever best expresses your thought\. Please complete every item\. Do not skip any sentences\. Work at a steady pace, and try not to overthink your responses\.”

##### Anonymization and data handling\.

After enrollment, each participant was assigned a Subject ID by an experimenter\. This ID was used to link responses across study components while keeping questionnaire responses separate from identifying contact information\.

##### Additional study component\.

After completing the DSCT, participants were invited to complete a second prompt\-based task involving their regular LLM chatbot, described in Appendix [A\.7](https://arxiv.org/html/2605.08549#A1.SS7)\.

#### A\.5\.2 Additional exploratory analysis

To better understand why Experiment 2 was harder than the synthetic\-persona setting of Experiment 1, we ran additional item\-level analyses on both the simulated and human DSCT responses\. Human raters had already noted that the synthetic responses appeared unusually consistent compared to the human data\. To examine this more directly, we prompted Gemini 3\.1 Pro to assign a Kegan stage to each questionnaire item independently, rather than rating each 20\-item response set as a whole\.

This analysis confirmed the hyper\-consistency of the synthetic data\. Simulated persona responses were highly uniform across items: responses generated from a Stage 4 persona, for example, were rated almost uniformly as Stage 4 at the item level\. The same pattern did not hold for real human responses, where developmental signal was more unevenly distributed across questions\.
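
One simple way to quantify this kind of item-level uniformity is the share of items whose rating equals the respondent's most common item-level rating. The sketch below illustrates that idea; the metric name `modal_share` and the example ratings are assumptions for illustration, and the paper does not state that it used this particular measure.

```python
from collections import Counter


def modal_share(item_stages):
    """Fraction of item-level ratings equal to the most common rating."""
    counts = Counter(item_stages)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(item_stages)


if __name__ == "__main__":
    # Hypothetical item-level ratings for one 20-item response set each.
    synthetic_stage4 = [4.0] * 19 + [3.5]
    human_respondent = [3.0] * 8 + [3.5] * 5 + [4.0] * 7
    print("synthetic:", modal_share(synthetic_stage4))
    print("human:", modal_share(human_respondent))
```

A value near 1.0 corresponds to the near-uniform pattern described for simulated personas, while lower values correspond to signal that is spread unevenly across items.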

In the human data, a Pearson correlation analysis identified Q2, Q20, Q3, Q16, and Q18 as the strongest item\-level predictors of final stage labels \($r>0.65$\)\. By contrast, Q14, Q10, and Q8 appeared more susceptible to upward drift, with some Stage 3 respondents producing answers that resembled Stage 4\-like logic on those prompts\. The random forest analysis shown in Figure [4](https://arxiv.org/html/2605.08549#A1.F4) supports the same interpretation: predictive importance was concentrated in a relatively small subset of items, with Q2 alone accounting for roughly 25% of total feature importance\.
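
The item-level correlation analysis can be sketched as follows: for each DSCT item, correlate the item-level stage ratings across respondents with the respondents' final stage labels, then rank items by $r$. The data layout \(a dictionary from item name to a list of per-respondent ratings\) and the example values are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


def rank_items(item_ratings, final_labels):
    """Return (item, r) pairs sorted from strongest to weakest correlation."""
    scores = {q: pearson_r(r, final_labels) for q, r in item_ratings.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: four respondents, two items.
    final_labels = [3.0, 3.5, 4.0, 4.5]
    item_ratings = {
        "Q2": [3.0, 3.5, 4.0, 4.5],
        "Q14": [4.0, 4.0, 4.0, 3.5],
    }
    for item, r in rank_items(item_ratings, final_labels):
        print(item, round(r, 3))
```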

These analyses suggest that real human DSCT responses contain usable developmental signal, but not with the near\-uniform consistency seen in synthetic persona text\. This helps explain why agreement in Experiment 2 is lower than in Experiment 1, and why the relevant benchmark target for human\-written DSCT responses is approximate alignment with trained human judgment rather than exact stage recovery\.

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/randomforest.png)

Figure 4: Experiment 2 exploratory analysis\. Random forest feature importance across DSCT items for predicting final human stage labels\. Predictive weight is concentrated in a small subset of questions, with Q2 contributing the largest share\.

### A\.6 Experiment 3 supplementary materials

#### A\.6\.1 Default DSCT prompt for models

This subsection reports the prompt template used in Experiment 3, where models were asked to answer the DSCT without persona\-conditioning or target\-stage instructions\. Full questionnaire contents are given in Appendix [A\.2\.1](https://arxiv.org/html/2605.08549#A1.SS2.SSS1)\.

> “This questionnaire contains 20 sentence\-completion prompts\. For the first 10 prompts, you will be asked to complete sentences about how you think or respond when certain situations happen to you\. For the next 10 prompts, you will be asked to complete sentences about how a person might think or respond in a similar situation\. Please finish each sentence with the first thought that comes to mind\. There are no right or wrong answers\. You may complete the sentences with a single word, a short phrase, or a full paragraph, whichever best expresses your thought\. Please complete every item\. Do not skip any sentences\. Work at a steady pace, and try not to overthink your responses\. Here is the questionnaire: \[DSCT\_QUESTIONNAIRE\] Here an example of the format for your responses in json: \[RESPONSE\_FORMAT\_JSON\]”

#### A\.6\.2 Full ratings for model\-generated DSCT responses

Table [4](https://arxiv.org/html/2605.08549#A1.T4) reports the full developmental ratings assigned to the DSCT responses generated by each model in Experiment 3\. These ratings should be interpreted as descriptions of output structure under a shared elicitation format, not as claims about human\-like developmental stages in the models themselves\.

Table 4: Developmental ratings assigned to default model\-generated DSCT responses in Experiment 3\.

### A\.7 Follow\-up: stage inference from naturalistic conversation

As a follow\-up analysis, we compared stage assignments derived from participants’ naturalistic chat history with stage assignments derived from the structured DSCT\. This was not part of the core benchmark, but serves as a stress test of whether the same developmental signal remains recoverable when the input shifts from elicited questionnaire responses to unconstrained conversational traces\.

We focus on the matched subset for which all three quantities were available \($N=69$\): LLM grading of chat history, human grading of the questionnaire, and LLM grading of the questionnaire\. Mean assigned stage differed substantially across these three conditions: LLM grading of chat history produced the highest mean stage \($M=4.15$, $SD=0.40$\), followed by human grading of the questionnaire \($M=3.67$, $SD=0.51$\), and LLM grading of the questionnaire \($M=3.53$, $SD=0.53$\)\. A Friedman test showed a significant omnibus difference across the three paired conditions \($\chi^{2}(2)=52.435$, $p<.001$\)\.
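
As an illustration of the omnibus test, the sketch below computes the Friedman chi-square statistic for paired measurements, ranking within each participant and applying the usual correction for ties \(which matter here because stage scores take few distinct values\). The example rows are hypothetical. The closed-form p-value helper is valid only for two degrees of freedom, which is the case with three conditions.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_chi2(rows):
    """rows: one list of k paired scores per participant. Returns (stat, df)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        tie_term += sum(c ** 3 - c for c in Counter(row).values())
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
    stat -= 3.0 * n * (k + 1)
    # Tie correction; it is zero (undefined test) if every row is fully tied.
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    return stat / correction, k - 1


def chi2_sf_two_df(stat):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-stat / 2.0)


if __name__ == "__main__":
    # Hypothetical rows: (LLM chat, human questionnaire, LLM questionnaire).
    rows = [[4.5, 3.5, 3.5], [4.0, 4.0, 3.5], [4.0, 3.0, 3.5], [4.5, 4.0, 3.0]]
    stat, df = friedman_chi2(rows)
    print(f"chi2({df}) = {stat:.3f}, p = {chi2_sf_two_df(stat):.4f}")
```

Applied to the reported statistic, `chi2_sf_two_df(52.435)` is far below .001, consistent with the reported significance level.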

Agreement analyses revealed a pronounced conversational overestimation effect\. Comparing LLM grading of chat history to human grading of the questionnaire yielded very weak agreement \($\kappa=0.022$\), with 26\.1% exact agreement and 40\.6% severe disagreements \(defined as $\geq 1.0$ full stage apart\)\. Comparing LLM grading of chat history to LLM grading of the questionnaire also showed poor agreement \($\kappa=0.149$\), with 18\.8% exact agreement and 39\.1% severe disagreements\. By contrast, agreement between human and LLM grading of the questionnaire was substantially stronger \($\kappa=0.324$\), with 43\.5% exact agreement and 73\.9% agreement within $\pm 0.5$ stage\.
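
The agreement quantities in this paragraph can be sketched as follows. The code assumes unweighted Cohen's kappa over the raw stage labels and numeric stages on a half-step scale \(for example 3.5 for a 3/4 transition\); whether the reported kappas are weighted or unweighted is not stated here, so treat that choice as an assumption. The example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same subjects."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    # Undefined (division by zero) if both raters use one identical label.
    return (observed - expected) / (1 - expected)


def agreement_rates(rater_a, rater_b):
    """Exact, within-half-stage, and severe (>= 1 stage) disagreement rates."""
    n = len(rater_a)
    gaps = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    return {
        "exact": sum(g == 0 for g in gaps) / n,
        "within_half": sum(g <= 0.5 for g in gaps) / n,
        "severe": sum(g >= 1.0 for g in gaps) / n,
    }


if __name__ == "__main__":
    # Hypothetical stage assignments for four participants.
    llm_chat = [4.0, 4.5, 4.0, 4.5]
    human_dsct = [3.0, 4.0, 4.0, 3.5]
    print("kappa:", round(cohen_kappa(llm_chat, human_dsct), 3))
    print(agreement_rates(llm_chat, human_dsct))
```

Reporting within-half-stage agreement alongside kappa is useful because kappa treats a 3 versus 3/4 disagreement the same as a 3 versus 5 disagreement.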

Refusal rates were also uneven across providers in the broader chat\-history collection: 11 of 77 responses \(14\.3%\) did not produce a stage assignment, including 9 of 39 ChatGPT responses\. This suggests that willingness to infer stage from prior chat history did not consistently track the actual availability of usable evidence\.

These results suggest that the strongest source of upward bias is not the classifier alone, but the conversational medium itself: the same participant can appear substantially more “self\-authored” when evaluated through unstructured chat history than when evaluated through structured DSCT responses\.

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/fig_tripartite_stage_distribution_violin.png)

Figure 5: Stage distributions across the matched $N=69$ subset\. LLM grading of chat history is systematically shifted upward relative to both human and LLM grading of the structured questionnaire\.

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/tripartitehistogram.png)

![Refer to caption](https://arxiv.org/html/2605.08549v1/images/llmchatconfusion.png)

Figure 6: Scoring and agreement heatmaps for conversational and questionnaire\-based stage inference\. LLMs using conversational data overestimate the Kegan stage of their users\. Agreement is very weak for LLM\(chat\) vs\. Human\(questionnaire\)\.

## NeurIPS Paper Checklist

1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: All claims are supported by experiments with frontier LLMs and by human studies\.
5. Guidelines: - •The answer\[N/A\]means that the abstract and introduction do not include the claims made in the paper\. - •The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations\. A\[No\]or\[N/A\]answer to this question will not be perceived well by the reviewers\. - •The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings\. - •It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: We include a thorough discussion of limitations\.
10. Guidelines: - •The answer\[N/A\]means that the paper has no limitation while the answer\[No\]means that the paper has limitations, but those are not discussed in the paper\. - •The authors are encouraged to create a separate “Limitations” section in their paper\. - •The paper should point out any strong assumptions and how robust the results are to violations of these assumptions \(e\.g\., independence assumptions, noiseless settings, model well\-specification, asymptotic approximations only holding locally\)\. The authors should reflect on how these assumptions might be violated in practice and what the implications would be\. - •The authors should reflect on the scope of the claims made, e\.g\., if the approach was only tested on a few datasets or with a few runs\. In general, empirical results often depend on implicit assumptions, which should be articulated\. - •The authors should reflect on the factors that influence the performance of the approach\. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting\. Or a speech\-to\-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon\. - •The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size\. - •If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness\. - •While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper\. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community\. 
Reviewers will be specifically instructed to not penalize honesty concerning limitations\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[N/A\]\.
14. Justification: There are no theoretical results in this paper\.
15. Guidelines: - •The answer\[N/A\]means that the paper does not include theoretical results\. - •All the theorems, formulas, and proofs in the paper should be numbered and cross\-referenced\. - •All assumptions should be clearly stated or referenced in the statement of any theorems\. - •The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition\. - •Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material\. - •Theorems and Lemmas that the proof relies upon should be properly referenced\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: The paper fully discloses all the information needed to reproduce the main experimental results, and we release all our prompts, datasets, and evaluation tools to support reproducibility\.
20. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •If the paper includes experiments, a\[No\]answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not\. - •If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable\. - •Depending on the contribution, reproducibility can be accomplished in various ways\. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model\. In general\. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model \(e\.g\., in the case of a large language model\), releasing of a model checkpoint, or other means that are appropriate to the research performed\. - •While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution\. For example 1. \(a\)If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm\. 2. \(b\)If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully\. 3. \(c\)If the contribution is a new model \(e\.g\., a large language model\), then there should either be a way to access this model for reproducing the results or a way to reproduce the model \(e\.g\., with an open\-source dataset or instructions for how to construct the dataset\)\. 4. 
\(d\)We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility\. In the case of closed\-source models, it may be that access to the model is limited in some way \(e\.g\., to registered users\), but it should be possible for other researchers to have some path to reproducing or verifying the results\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[Yes\]
24. Justification: This paper provides open access to the data, evaluation prompts, and results, and includes instructions for running the experiments\. We include the links to the dataset collection\.
25. Guidelines: - •The answer\[N/A\]means that paper does not include experiments requiring code\. - • - •While we encourage the release of code and data, we understand that this might not be possible, so\[No\]is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\. - •The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\. - •The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\. - •The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\. - •At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\. - •Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[N/A\]\.
29. Justification: No training was performed\.
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: We provide statistical significance analyses for all experiments, including Pearson correlations, Wilcoxon rank tests, Cohen’s kappa, and overall differences between the full sets of data\.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer:\[Yes\]
39. Justification: We include all experiments and human annotation resources in the Appendix, covering all resources used for generating model responses and evaluations, and for collecting human labels\.
40. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\. - •The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\. - •The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics [https://neurips\.cc/public/EthicsGuidelines](https://neurips.cc/public/EthicsGuidelines)?
43. Answer:\[Yes\]
44. Justification: We confirm that the research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics\.
45. Guidelines: - •The answer\[N/A\]means that the authors have not reviewed the NeurIPS Code of Ethics\. - •If the authors answer\[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\. - •The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer:\[Yes\]
49. Justification: We discuss both positive and negative impacts, along with limitations, in the discussion\.
50. Guidelines: - •The answer\[N/A\]means that there is no societal impact of the work performed\. - •If the authors answer\[N/A\]or\[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\. - •Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\. - •The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\. - •The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\. - •If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer:\[Yes\]
54. Justification: We discuss safeguards for the responsible release and use of the findings and questionnaires for evaluating developmental cognition\.
55. Guidelines: - •The answer\[N/A\]means that the paper poses no such risks\. - •Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\. - •Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\. - •We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer:\[Yes\]
59. Justification: The creators or original owners of assets used in the paper are properly credited, and the licenses and terms of use are explicitly mentioned and respected\.
60. Guidelines: - •The answer\[N/A\]means that the paper does not use existing assets\. - •The authors should cite the original paper that produced the code package or dataset\. - •The authors should state which version of the asset is used and, if possible, include a URL\. - •The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\. - •For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\. - •If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets,[paperswithcode\.com/datasets](https://arxiv.org/html/2605.08549v1/paperswithcode.com/datasets)has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\. - •For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\. - •If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer:\[Yes\]
64. Justification: We document all assets\.
65. Guidelines: - •The answer\[N/A\]means that the paper does not release new assets\. - •Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\. - •The paper should discuss whether and how consent was obtained from people whose asset is used\. - •At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer:\[Yes\]
69. Justification: We include details for human annotations and questionnaires used in the Appendix\.
70. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\. - •According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: Our human annotation is innocuous and thus does not require IRB approval\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\. Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[Yes\]
79. Justification: We describe how LLMs are used as part of the tools for evaluation data, as part of the research process, and for generating data\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
