Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Summary
This academic paper identifies and characterizes Simpson's paradox in behavioral curve modeling, demonstrating how aggregation systematically distorts parametric estimates of user dynamics due to survival bias. The authors validate this distortion across datasets like Goodreads and Amazon Electronics and propose hierarchical peak estimation methods to mitigate the issue.
# Simpson’s Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Source: [https://arxiv.org/html/2605.11017](https://arxiv.org/html/2605.11017)
###### Abstract
Behavioral curve modeling (fitting parametric functions to engagement-versus-exposure data) is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson’s paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at $n^{*} \approx 11$ exposures while the aggregate peaks at $n^{*} \approx 34$, a $3\times$ gap driven by survival bias. Amazon Electronics (18M reviews) shows a $5.3\times$ distortion. MovieLens-25M ($D \approx 1$) serves as a negative control, confirming that survival bias, not aggregation per se, is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.
## 1 Introduction
Behavioral curve modeling (fitting parametric functions to response-versus-exposure data) is ubiquitous across applied sciences. Recommendation systems use such curves to set exploration budgets and frequency caps. Clinical research relies on dose-response curves, advertising on saturation curves, and behavioral science on learning and habituation curves. In each domain, consequential decisions depend on curve parameters: peak location, onset of decline, saturation threshold.
The standard practice is to fit at the *aggregate* level: pool individuals, compute mean response per exposure count, fit a parametric model. This is statistically powerful but rests on an implicit assumption: that the aggregate curve faithfully represents the typical individual.
We show this assumption is systematically violated. On the Goodreads book rating dataset (Wan and McAuley, [2018](https://arxiv.org/html/2605.11017#bib.bib28)) (3.3M users, 9 genres), we fit the Hill-exponential behavioral model (Berlyne, [1960](https://arxiv.org/html/2605.11017#bib.bib1); Loewenstein, [1994](https://arxiv.org/html/2605.11017#bib.bib2)) at both granularities and find a striking Simpson’s paradox (Simpson, [1951](https://arxiv.org/html/2605.11017#bib.bib33)). Individual users peak at a median of $n^{*} \approx 11$ exposures. The aggregate curve peaks at $n^{*} \approx 34$, a $3\times$ gap. The distortion is bidirectional across genres: Romance shows a $6.8\times$ overestimate, Fiction a $9\times$ underestimate. Yet individual peaks cluster tightly across all 9 genres ($[9.6, 16.0]$), revealing stable individual-level behavior masked by genre-specific aggregation artifacts.
Our contributions are:
1. Simpson’s paradox in behavioral curves (primary). We identify and characterize a systematic aggregation distortion in user behavioral modeling, showing that aggregate curves can misrepresent individual peak locations by $3\times$ to $5\times$. We identify survival bias (differential attrition) as the dominant mechanism (Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1)) and validate across three datasets spanning different attrition regimes: Goodreads ($D = 3\times$, sequential engagement), Amazon Electronics ($D = 5.3\times$, transactional purchases; Section [8](https://arxiv.org/html/2605.11017#S8)), and MovieLens-25M ($D \approx 1\times$, retrospective ratings; negative control). We reproduce the distortion from first principles on synthetic populations (Section [6](https://arxiv.org/html/2605.11017#S6)).
2. Hierarchical peak estimation (methodological). We show that empirical Bayes shrinkage (Efron and Morris, [1975](https://arxiv.org/html/2605.11017#bib.bib42)) provides a principled alternative to both aggregate and naive per-user fitting, partially pooling individual peak estimates toward a population prior (Section [9](https://arxiv.org/html/2605.11017#S9)).
3. Synthetic Null Calibration (SNC) (methodological). We show that naive per-user curve classification has a 32% false positive rate on synthetic monotonic null data and develop a calibration protocol that reveals per-user behavioral classifiers are fundamentally limited as prevalence estimators when model complexity is high relative to sample size (Section [5](https://arxiv.org/html/2605.11017#S5)).
## 2 Related Work
#### Curiosity and behavioral curves.
Berlyne’s ([1960](https://arxiv.org/html/2605.11017#bib.bib1)) arousal theory and Loewenstein’s ([1994](https://arxiv.org/html/2605.11017#bib.bib2)) information-gap theory establish that curiosity follows an inverted-U function of knowledge, with neuroscience support (Kang et al., [2009](https://arxiv.org/html/2605.11017#bib.bib15); Gruber et al., [2014](https://arxiv.org/html/2605.11017#bib.bib16)). We use this inverted-U as our model but focus on the *statistical* properties of fitting at different granularities, not the psychology. Engagement-versus-exposure modeling in recommendation underlies fatigue detection, frequency capping, and exploration scheduling, drawing on multi-armed bandits (Auer et al., [2002](https://arxiv.org/html/2605.11017#bib.bib7); Thompson, [1933](https://arxiv.org/html/2605.11017#bib.bib8); Li et al., [2010](https://arxiv.org/html/2605.11017#bib.bib4); Lattimore and Szepesvári, [2020](https://arxiv.org/html/2605.11017#bib.bib25)), contextual bandits (Agarwal et al., [2014](https://arxiv.org/html/2605.11017#bib.bib21)), and curiosity-driven approaches (Chen et al., [2021a](https://arxiv.org/html/2605.11017#bib.bib5), [b](https://arxiv.org/html/2605.11017#bib.bib6)); these models are typically fit at the cohort or population level.
#### Simpson’s paradox and shrinkage estimation.
Simpson’s paradox (Simpson, [1951](https://arxiv.org/html/2605.11017#bib.bib33); Blyth, [1972](https://arxiv.org/html/2605.11017#bib.bib34)) occurs when an aggregate-level trend reverses upon disaggregation. Robinson ([1950](https://arxiv.org/html/2605.11017#bib.bib36)) first showed this for ecological correlations; King ([1997](https://arxiv.org/html/2605.11017#bib.bib41)) developed reconstruction methods, and Pearl ([2014](https://arxiv.org/html/2605.11017#bib.bib35)) a causal analysis, with applications in psychology (Kievit et al., [2013](https://arxiv.org/html/2605.11017#bib.bib39)) and behavioral data mining (Alipourfard et al., [2018](https://arxiv.org/html/2605.11017#bib.bib37)). To our knowledge it has not been applied to behavioral *curve* estimation, where the distortion affects fitted peak *locations* rather than correlations. Empirical Bayes shrinkage (Efron and Morris, [1975](https://arxiv.org/html/2605.11017#bib.bib42)) and hierarchical Bayesian models (Gelman et al., [2013](https://arxiv.org/html/2605.11017#bib.bib44); Gelman, [2006](https://arxiv.org/html/2605.11017#bib.bib43)) address the bias-variance trade-off in individual-parameter estimation; we apply them to peak estimation as a natural fix for the paradox we identify.
#### Informative censoring and survival analysis.
The survival bias mechanism we identify (Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1)) is related to informative censoring in biostatistics (Robins and Finkelstein, [2000](https://arxiv.org/html/2605.11017#bib.bib38); Little and Rubin, [2019](https://arxiv.org/html/2605.11017#bib.bib48)), where missingness depends on the outcome. The covariance identity in Eq. [6](https://arxiv.org/html/2605.11017#S4.E6) is a standard result in selection-conditional expectation (Heckman, [1979](https://arxiv.org/html/2605.11017#bib.bib49)). Our contribution is not the identity itself but its application to *behavioral curve estimation*: we show that informative censoring distorts not just means but fitted *peak locations*, that the distortion magnitude varies predictably across platforms ($1\times$ to $5\times$), and that it produces a form of Simpson’s paradox not previously characterized in the censoring literature. Theorem [2](https://arxiv.org/html/2605.11017#Thmtheorem2) (Appendix [E](https://arxiv.org/html/2605.11017#A5)) extends the result to arbitrary joint distributions with a first-order stochastic dominance (FOSD) characterization.
## 3 Model and Methodology
### 3.1 The Hill-Exponential Curiosity Model
We model user engagement as a function of exposure count $n$ using the Hill-exponential model:
$$C(n;\boldsymbol{\theta}) = c_{0} + A \cdot \frac{n^{a}}{n^{a} + b^{a}} \cdot \exp\left(-\frac{n}{s}\right) \tag{1}$$
where $\boldsymbol{\theta} = (c_{0}, A, a, b, s)$ are the parameters: $c_{0} \in [0,1]$ is baseline engagement, $A \in [0,1]$ is curiosity modulation amplitude, $a > 0$ controls onset steepness, $b > 0$ is the half-maximum exposure count, and $s > 0$ is the saturation decay constant. The Hill term $n^{a}/(n^{a} + b^{a})$ models rising curiosity with initial exposure; the exponential $\exp(-n/s)$ models declining curiosity with overexposure. The peak location $n^{*} = \arg\max_{n} C(n;\boldsymbol{\theta})$ is the exposure count at which curiosity is maximized.
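To make Eq. (1) concrete, the following minimal Python sketch evaluates the curve and locates its peak by grid search. The function names (`hill_exp`, `peak_location`), the grid bounds, and the example parameter values are illustrative assumptions, not taken from the paper. For $A > 0$, setting the derivative of $\log\big((C - c_{0})/A\big)$ to zero gives the first-order condition $n^{*}\big((n^{*})^{a} + b^{a}\big) = a\, b^{a}\, s$, which the sketch uses as a sanity check.

```python
import math

def hill_exp(n, c0, A, a, b, s):
    """Hill-exponential curve of Eq. (1): baseline + amplitude * Hill rise * exponential decay."""
    hill = n**a / (n**a + b**a)
    return c0 + A * hill * math.exp(-n / s)

def peak_location(c0, A, a, b, s, n_max=200.0, step=0.01):
    """Approximate n* = argmax_n C(n) on a uniform grid over (0, n_max]."""
    grid = (i * step for i in range(1, int(round(n_max / step)) + 1))
    return max(grid, key=lambda n: hill_exp(n, c0, A, a, b, s))

# Illustrative parameter values (assumptions, not fitted values from the paper).
params = dict(c0=0.3, A=0.4, a=2.0, b=5.0, s=20.0)
n_star = peak_location(**params)

# First-order condition for A > 0: n* (n*^a + b^a) = a * b^a * s.
residual = n_star * (n_star**2.0 + 5.0**2.0) - 2.0 * 5.0**2.0 * 20.0
print(f"n* = {n_star:.2f}, first-order residual = {residual:.2f}")
```

A grid search is used here only because it is transparent; a root finder applied to the first-order condition would be an equally reasonable choice.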
###### Definition 1 (Multi-granularity peak estimation).
Given engagement data $\{(n, e_{u,g}(n))\}$ for users $u \in U$ in genre $g$:
$$\text{Individual peak:}\quad n^{*}_{u,g} = \arg\max_{n}\; C(n;\hat{\boldsymbol{\theta}}_{u,g}) \tag{2}$$
$$\text{Aggregate peak:}\quad n^{*}_{g} = \arg\max_{n}\; C(n;\hat{\boldsymbol{\theta}}_{g}) \tag{3}$$
where $\hat{\boldsymbol{\theta}}_{u,g}$ is fit to user $u$’s data and $\hat{\boldsymbol{\theta}}_{g}$ is fit to the population-averaged engagement curve in genre $g$.
###### Definition 2 (Aggregation distortion factor).
The distortion factor for genre $g$ is:
$$D_{g} = \frac{n^{*}_{g}}{\operatorname{median}(\{n^{*}_{u,g} : u \in U_{g}\})} \tag{4}$$
Simpson’s paradox is detected when $|D_{g} - 1|$ is large, indicating systematic divergence between aggregate and individual peak estimates.
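As a small worked instance of Eq. (4), the sketch below computes the distortion factor from an aggregate peak and a list of per-user peaks. The function name `distortion_factor` and the per-user values are hypothetical and chosen only for illustration.

```python
from statistics import median

def distortion_factor(aggregate_peak, individual_peaks):
    """Eq. (4): aggregate peak divided by the median individual peak."""
    return aggregate_peak / median(individual_peaks)

# Hypothetical per-user peaks for one genre (illustrative values only).
individual_peaks = [7.5, 9.0, 11.0, 12.5, 31.0]
d_g = distortion_factor(34.0, individual_peaks)
print(f"D_g = {d_g:.2f}")
```

Values of $D_{g}$ above 1 mean the aggregate peak falls later than the typical individual peak (an overestimate); values below 1 mean it falls earlier (an underestimate).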
### 3.2 Model Selection and Classification
We fit seven competing models to each curve (Hill-exponential (Eq. [1](https://arxiv.org/html/2605.11017#S3.E1)), monotonic decay, flat, pure Hill (monotonically increasing), Gaussian peak, logarithmic peak, and quadratic) and select via likelihood ratio test (LRT), AIC, and out-of-sample $R^{2}$.
#### Aggregate-level classification.
A genre is a strong inverted-U (Class A) if all of the following hold: $R^{2} > 0.4$, LRT $p < 0.05$, $\Delta\text{AIC} > 4$ vs. monotonic, beats quadratic, permutation test significant, out-of-sample (OOS) $R^{2} > 0$, decline $> 10\%$, ascending phase significant. Weaker criteria define Classes B–E (Table [1](https://arxiv.org/html/2605.11017#S4.T1)).
#### Individual-level (strict) classification.
A user passes if all of the following hold: LRT $p < 0.05$, $\Delta\text{AIC} > 2$ vs. monotonic, $R^{2} > 0.05$, $n^{*} > 2.0$ (no boundary peak), decline $> 10\%$, and Hill-Exp BIC $<$ pure Hill BIC (the decay component must be justified by model selection).
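The six strict criteria can be read as a conjunction of simple gates. The sketch below encodes them as one boolean check, assuming the fit statistics have already been computed elsewhere. The `UserFit` container, its field names, the sign convention for `delta_aic`, and the fractional encoding of the decline are assumptions made for illustration; the paper does not specify an implementation.

```python
from dataclasses import dataclass

@dataclass
class UserFit:
    lrt_p: float          # LRT p-value, Hill-exponential vs. monotonic
    delta_aic: float      # assumed convention: AIC(monotonic) - AIC(Hill-exponential)
    r2: float             # R^2 of the Hill-exponential fit
    n_star: float         # fitted peak location
    decline: float        # assumed encoding: fractional post-peak decline (0.15 = 15%)
    bic_hill_exp: float   # BIC of the Hill-exponential model
    bic_pure_hill: float  # BIC of the pure Hill model

def passes_strict(fit: UserFit) -> bool:
    """True only if all six individual-level criteria hold."""
    return (
        fit.lrt_p < 0.05
        and fit.delta_aic > 2
        and fit.r2 > 0.05
        and fit.n_star > 2.0          # excludes boundary peaks
        and fit.decline > 0.10
        and fit.bic_hill_exp < fit.bic_pure_hill
    )

example = UserFit(lrt_p=0.01, delta_aic=5.2, r2=0.30, n_star=11.0,
                  decline=0.25, bic_hill_exp=40.0, bic_pure_hill=46.0)
print(passes_strict(example))
```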
### 3.3 Datasets
Goodreads (primary; UCSD Book Graph (Wan and McAuley, [2018](https://arxiv.org/html/2605.11017#bib.bib28))): 3.3M users, 7 primary genres (Fantasy/Paranormal, Young-Adult, Comics/Graphic, Fiction, Romance, History/Biography, Mystery/Thriller) plus Children’s and Non-fiction for per-user analysis. Engagement is binary (rating $\geq 4.0$); exposure is the sequential book count per user within genre. Per-genre user counts (Appendix [C](https://arxiv.org/html/2605.11017#A3)) are non-mutually-exclusive.
MovieLens-25M (Harper and Konstan, [2015](https://arxiv.org/html/2605.11017#bib.bib40)) (negative control): 25M ratings, 162K users, 20 genres; same engagement threshold and exposure definition; we select the 7 most-populated genres (Drama, Comedy, Action, Thriller, Romance, Adventure, Sci-Fi).
For both, aggregate analysis uses weighted bin averages; per-user analysis applies 5-point moving-average smoothing and requires $\geq 15$ observations per user-genre pair.
## 4 Aggregate vs. Individual: Simpson’s Paradox
### 4.1 Setup and Main Result
We fit the Hill-exponential model at both granularities. Aggregate fits pool all users in a genre (one curve per genre, 7 genres with sufficient population coverage). Individual fits use a stratified subsample of 1,000 user-genre pairs (seed 42; drawn from $\approx 40{,}000$ eligible pairs); after filtering for minimum exposure ($\geq 15$ ratings, $\geq 19$ smoothed observations), 784 users remain, of which 221 (28.2%) pass all six strict classification criteria. Bootstrap stability (Section [4.2](https://arxiv.org/html/2605.11017#S4.SS2)) confirms the subsample size is sufficient.
Table 1: Aggregate vs. individual peak locations across genres. Class is the aggregate-level fit class (Section [3](https://arxiv.org/html/2605.11017#S3)); aggregate $n^{*}$ and decline come from population-pooled curves; individual $n^{*}$ is the median across strict-classified users in the genre. The distortion factor $D_{g} = n^{*}_{\text{agg}} / \operatorname{median}(n^{*}_{\text{indiv}})$ quantifies aggregation bias. Children’s and Non-fiction lack a usable aggregate fit (Class E omitted).
The central finding is in Table [1](https://arxiv.org/html/2605.11017#S4.T1). Aggregate fits classify only Fantasy and Young-Adult as strong inverted-U; in heterogeneous genres (Fiction, Romance) the individual-level inverted-U cancels under aggregation, producing apparently monotonic population curves. Yet *individual* peaks cluster tightly at $[9.6, 16.0]$ across all 9 genres, while aggregate peaks span $[1.1, 65.7]$. The all-genre distortion factor is $D = 3.05$: the aggregate overestimates optimal exploration duration by $3\times$.
The distortion is *bidirectional*: Romance shows a $6.8\times$ overestimate (aggregate peak much later than individual), Fiction a $9\times$ underestimate (aggregate peak much earlier). This rules out any simple correction factor; the system must model at the appropriate granularity. (Aggregate-fit goodness-of-fit and ascent significance are reported in Appendix [C](https://arxiv.org/html/2605.11017#A3).)
### 4.2 Robustness
We verify the Simpson’s paradox finding through three tests:
#### Bootstrap stability.
Subsampling 100 of the 221 strict-classified users 1,000 times, 94.4% of subsamples yield median $n^{*} \in [9, 14]$. The full-sample bootstrap 95% CI is $[9.9, 13.9]$.
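The subsampling check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-user peaks are replaced by stand-in log-normal draws (the real fitted peaks are not reproduced here), subsampling is assumed to be without replacement, and the function name `subsample_median_stability` is hypothetical.

```python
import math
import random
from statistics import median

def subsample_median_stability(peaks, k=100, n_rep=1000, lo=9.0, hi=14.0, seed=0):
    """Fraction of size-k subsamples (without replacement) whose median lies in [lo, hi]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        m = median(rng.sample(peaks, k))
        hits += lo <= m <= hi
    return hits / n_rep

# Stand-in data: 221 log-normal peaks with median near 11 (an assumption).
data_rng = random.Random(42)
peaks = [data_rng.lognormvariate(math.log(11.0), 0.8) for _ in range(221)]
frac = subsample_median_stability(peaks)
print(f"share of subsample medians in [9, 14]: {frac:.3f}")
```

The printed share depends on the stand-in data and should not be compared with the 94.4% reported for the real users.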
#### Peak distribution.
The individual $n^{*}$ distribution is right-skewed (skewness $= 2.56$) with IQR $[7.1, 25.3]$. The 22.8% of users with $n^{*} > 30$ are the tail that shifts the aggregate peak rightward.
#### Immunity to FP calibration.
The Simpson’s paradox finding concerns peak *locations*, not pattern *prevalence*. Even if some strict-classified users are false positives, their fitted peak locations still contribute to the median $n^{*}$. The finding is robust to classifier accuracy because it is an ordinal claim (individual peaks are systematically earlier than aggregate), not a cardinal claim (exactly $X\%$ of users show inverted-U).
### 4.3 Mechanism: Survival Bias
Why does aggregation distort the peak? The primary mechanism is survival bias (differential attrition). Users with early peaks disengage sooner: a user whose curiosity peaks at $n^{*} = 5$ is unlikely to read 30 more books in that genre. High-exposure aggregate bins are therefore dominated by late-peaking users, shifting the population curve rightward. Put simply: the aggregate at exposure $n$ reflects only users still active at $n$, a biased subsample enriched for late-peak users.
###### Theorem 1 (Survival bias drives aggregation distortion).
Let $\{C_{u}(n)\}_{u=1}^{N}$ be unimodal behavioral curves with peaks at $\{n^{*}_{u}\}$. Let $S_{u}(n) \in \{0,1\}$ indicate whether user $u$ is still active (contributing data) at exposure count $n$, where $P(S_{u}(n) = 1)$ is increasing in $n^{*}_{u}$ for $n > \operatorname{median}(\{n^{*}_{u}\})$ (users with later peaks survive longer). The observed aggregate curve is:
$$C^{\text{agg}}_{\text{obs}}(n) = \frac{\sum_{u} S_{u}(n) \cdot C_{u}(n)}{\sum_{u} S_{u}(n)} \tag{5}$$
If survival is correlated with peak location ($\operatorname{Cov}(S_{u}(n), n^{*}_{u}) > 0$ for large $n$), then $n^{*}_{\text{agg}} > \operatorname{median}(\{n^{*}_{u}\})$: the aggregate peak is shifted rightward relative to the typical individual peak. Quantitatively, the observed mean peak among survivors at exposure $n$ satisfies:
$$\mathbb{E}[n^{*}_{u} \mid S_{u}(n) = 1] - \mathbb{E}[n^{*}_{u}] = \frac{\operatorname{Cov}(n^{*}_{u}, S_{u}(n))}{P(S_{u}(n) = 1)} \tag{6}$$
so the distortion at each $n$ is exactly the selection covariance divided by the survival probability.
###### Proof.
Consider two exposure counts $n_{1} < \operatorname{median}(\{n^{*}_{u}\}) < n_{2}$. At $n_{1}$, nearly all users are active ($S_{u}(n_{1}) \approx 1$), so the observed aggregate approximately equals the true population average. At $n_{2}$, early-peaking users have disengaged. The surviving set $\mathcal{A}(n_{2}) = \{u : S_{u}(n_{2}) = 1\}$ is enriched for users with $n^{*}_{u} > n_{2}$, whose engagement at $n_{2}$ is still near their peak.
The missing early-peak users would be in their post-peak decline at $n_{2}$, but they are absent from the data. The observed aggregate is therefore inflated at high $n$, shifting the peak rightward. This is informative right-censoring (Robins and Finkelstein, [2000](https://arxiv.org/html/2605.11017#bib.bib38)): the missingness mechanism depends on the quantity being estimated. ∎
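The identity in Eq. (6) follows from the fact that $S_{u}(n)$ is binary: $\mathbb{E}[n^{*}_{u} S_{u}(n)] = P(S_{u}(n) = 1)\,\mathbb{E}[n^{*}_{u} \mid S_{u}(n) = 1]$, so $\operatorname{Cov}(n^{*}_{u}, S_{u}(n)) = P(S_{u}(n) = 1)\big(\mathbb{E}[n^{*}_{u} \mid S_{u}(n) = 1] - \mathbb{E}[n^{*}_{u}]\big)$. The sketch below checks this numerically on a simulated population. The log-normal peak distribution and the survival rule $p/(p + n)$ are hypothetical choices made only so that survival rises with the peak; they are not the paper's model.

```python
import math
import random

rng = random.Random(0)
n = 20  # exposure count at which survival is evaluated

# Hypothetical population: log-normal peaks, survival probability rising with the peak.
peaks = [rng.lognormvariate(math.log(12.0), 0.6) for _ in range(5000)]
alive = [1 if rng.random() < p / (p + n) else 0 for p in peaks]

def mean(xs):
    return sum(xs) / len(xs)

mean_all = mean(peaks)
mean_survivors = mean([p for p, s in zip(peaks, alive) if s == 1])
# Population (divide-by-N) covariance between peak location and survival indicator.
cov = mean([p * s for p, s in zip(peaks, alive)]) - mean_all * mean(alive)

lhs = mean_survivors - mean_all   # left side of Eq. (6)
rhs = cov / mean(alive)           # right side of Eq. (6)
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

With empirical means and the divide-by-$N$ covariance, the two sides agree up to floating-point error for any sample, which is what makes Eq. (6) an identity rather than an approximation.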
We generalize this result in two directions: Theorem [2](https://arxiv.org/html/2605.11017#Thmtheorem2) (Appendix [E](https://arxiv.org/html/2605.11017#A5)) drops all parametric assumptions, establishing the identity under arbitrary joint distributions with a FOSD characterization; Theorem [3](https://arxiv.org/html/2605.11017#Thmtheorem3) provides a pooled distortion estimator across multiple datasets with known sampling distribution, enabling formal cross-dataset hypothesis testing.
We validate the survival bias mechanism through controlled synthetic experiments (Section [6](https://arxiv.org/html/2605.11017#S6)), which show that survival bias alone produces a $3.0\times$ distortion, closely matching the $3.05\times$ observed in Goodreads, while amplitude-peak correlation without differential attrition produces no distortion ($1.0\times$).
## 5 Synthetic Null Calibration (SNC)
### 5.1 The False Positive Problem
Per-user behavioral curve classification is standard practice, but false positive rates are rarely reported against synthetic null data. This is analogous to running a clinical trial without a placebo arm: any observed effect could be an artifact of the measurement procedure. We propose Synthetic Null Calibration (SNC), a three-step protocol for calibrating per-user behavioral classifiers:
1. Generate synthetic nulls with known ground-truth dynamics (monotonic, flat) and matched noise characteristics (Bernoulli noise with variance matching observed data).
2. Apply the identical pipeline (smoothing, model fitting, classification) to synthetic and real data.
3. Report both rates: the raw classification rate on real data and the false positive rate on synthetic nulls. The excess (raw $-$ FP) bounds the genuine signal.
We generate 500 synthetic users per condition (monotonic decay, flat, true inverted-U) with Bernoulli noise matched to the observed engagement variance. Each synthetic user undergoes the identical pipeline: 5-point smoothing, 7-model fitting, classification. We measure false positive rate (FP: monotonic/flat users classified as inverted-U) and true positive rate (TP: inverted-U users correctly classified).
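The shape of the protocol can be sketched as follows. This is a deliberately simplified stand-in, not the paper's pipeline: the seven-model fitting step is replaced by a toy rule (`toy_inverted_u`, an interior maximum of the smoothed curve followed by a relative decline above 10%), the null is a linear decay in success probability, and the moving average shrinks its window at the edges. All function names and parameter values are assumptions for illustration.

```python
import random

def moving_average(xs, window=5):
    """Centered moving average; the window shrinks at the edges (an assumption)."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def toy_inverted_u(smoothed, min_decline=0.10):
    """Stand-in classifier: interior peak followed by a relative decline above min_decline."""
    peak_idx = max(range(len(smoothed)), key=smoothed.__getitem__)
    if peak_idx < 2 or peak_idx > len(smoothed) - 3:
        return False  # treat near-boundary maxima as "no inverted-U"
    peak = smoothed[peak_idx]
    return peak > 0 and (peak - smoothed[-1]) / peak > min_decline

def null_user(rng, length, p_start=0.7, p_end=0.4):
    """Monotonic-decay null: Bernoulli outcomes whose success probability falls linearly."""
    return [
        1 if rng.random() < p_start + (p_end - p_start) * i / (length - 1) else 0
        for i in range(length)
    ]

def false_positive_rate(n_users=500, length=30, seed=0):
    """Share of null users that the toy classifier labels as inverted-U."""
    rng = random.Random(seed)
    hits = sum(
        toy_inverted_u(moving_average(null_user(rng, length))) for _ in range(n_users)
    )
    return hits / n_users

fp = false_positive_rate()
print(f"toy false positive rate on monotonic nulls: {fp:.3f}")
```

The number this prints reflects the toy rule only. The point of the sketch is the structure: the same smoothing and classification functions would be applied to real users, and the real-data rate would be reported next to `fp`.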
Table 2: Per-user classifier calibration against synthetic null data. Selectivity is TP/FP. Excess is the observed classification rate on real data minus the null FP rate.
### 5.2 Results and Implications
Table [2](https://arxiv.org/html/2605.11017#S5.T2) reveals that the uncalibrated classifier produces a 32% false positive rate on monotonic null data, only 3.3% below the observed 35.5% per-user inverted-U rate, so the raw rate is not defensible as a prevalence claim. The strict classifier (requiring $> 10\%$ decline and beating pure Hill) reduces FP to 24% (11.5% excess), but selectivity remains low ($1.48\times$). The decline threshold has diminishing returns: FP is stable across 10–20%, indicating the bottleneck is model flexibility (5-parameter Hill-Exp overfitting smoothed Bernoulli noise from 15–75 data points), not the threshold value.
The true inverted-U prevalence lies in $[4.2\%, 28.2\%]$ (excess above null to raw rate), but per-user fitting with 5 parameters on 15–75 points cannot resolve where. The classifier is a weak instrument: it confirms the *existence* of signal (excess $> 0$) but cannot estimate its *magnitude*. We recommend practitioners *always* calibrate per-user classifiers against synthetic nulls with matched sample sizes and noise; raw rates should never be presented as prevalence estimates.
## 6 Synthetic Validation
To confirm the paradox is a structural property of aggregation (not a dataset-specific artifact), we construct $N = 1{,}000$ synthetic users with known ground-truth Hill-exponential curves and log-normal peak locations (median $\approx 12$). We test a $2 \times 2$ factorial crossing two candidate mechanisms: (i) *survival bias*: each user’s maximum observed exposure $n_{\max,u}$ is positively correlated with $n^{*}_{u}$ (early-peakers disengage sooner; “off” = identical $n_{\max}$); (ii) *amplitude-peak correlation*: $A_{u} = 0.2 + 0.3 \cdot n^{*}_{u} / \max(n^{*}) + \epsilon$ (“off” = identical $A_{u} = 0.35$).
Table 3: Synthetic validation: disentangling survival bias and amplitude-peak correlation. $N = 1{,}000$ synthetic users with log-normal peak locations (median $\approx 12$). Distortion factor $D$ is the ratio of aggregate peak to individual median.
Survival bias alone produces a $3.0\times$ distortion, matching the $3.05\times$ observed in Goodreads. Amplitude-peak correlation without differential attrition produces *no distortion* ($1.0\times$); combined ($5.7\times$) it compounds the effect rather than counteracting it. This confirms that the operative mechanism in Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) is differential attrition. The synthetic experiment thus elevates our finding from “we observed a $3\times$ distortion on Goodreads” to “survival bias in heterogeneous user populations is a sufficient condition for Simpson’s paradox in behavioral curves.”
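The attrition mechanism can be reproduced in a much smaller toy than the experiment above. The sketch below is not the paper's $N = 1{,}000$ log-normal setup: it uses two hypothetical groups (80 early-peaking users with $n^{*} = 8$ and 20 late-peaking users with $n^{*} = 40$), noise-free curves with identical amplitude, and a hard dropout point for the early group. The parameterization $a = 2$, $b = n^{*}/2$, $s = 2.5\,n^{*}$ is an assumption chosen because it places the maximum of Eq. (1) exactly at $n^{*}$.

```python
import math
from statistics import median

def curve(n, peak, c0=0.3, A=0.35, a=2.0):
    """Hill-exponential curve whose maximum sits at `peak`.

    With a = 2, b = peak / 2 and s = 2.5 * peak, the first-order condition
    n (n^a + b^a) = a b^a s is satisfied at n = peak.
    """
    b, s = 0.5 * peak, 2.5 * peak
    return c0 + A * n**a / (n**a + b**a) * math.exp(-n / s)

def aggregate_peak(users, n_grid):
    """Argmax of the observed aggregate (Eq. 5): mean over users still active at n."""
    best_n, best_val = None, float("-inf")
    for n in n_grid:
        active = [peak for peak, n_max in users if n <= n_max]
        if not active:
            continue
        val = sum(curve(n, peak) for peak in active) / len(active)
        if val > best_val:
            best_n, best_val = n, val
    return best_n

n_grid = range(1, 101)
# Each user is (peak, last observed exposure); values are hypothetical.
with_attrition = [(8.0, 16)] * 80 + [(40.0, 100)] * 20   # early users leave at n = 16
no_attrition = [(8.0, 100)] * 80 + [(40.0, 100)] * 20    # everyone observed to n = 100

typical = median(peak for peak, _ in with_attrition)
d_on = aggregate_peak(with_attrition, n_grid) / typical
d_off = aggregate_peak(no_attrition, n_grid) / typical
print(f"D with attrition = {d_on:.2f}, D without attrition = {d_off:.2f}")
```

In this toy, attrition alone moves the aggregate argmax from near the typical individual peak to the late group's peak ($D = 5$ versus a value close to 1). It mirrors the qualitative pattern of Table 3, not its magnitudes.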
Figure 1: Simpson’s paradox in behavioral curves. Aggregate engagement curves (blue line) systematically overestimate the peak exposure count relative to individual users (orange histogram). Left: synthetic population with survival bias ($N = 1{,}000$, agg. $n^{*} \approx 43$, indiv. median $n^{*} \approx 14$). Right: Goodreads real data (agg. $n^{*} \approx 34$, indiv. median $n^{*} \approx 11$, $n = 221$); the in-panel “2.2M users” label denotes the post-filter pool (users with $\geq 5$ interactions) used to fit the aggregate curve, drawn from the full 3.3M-user dataset. The $\sim 3\times$ distortion is reproduced from first principles.
## 7 MovieLens-25M as Negative Control
To test whether the Simpson’s paradox is driven by survival bias (as Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) predicts) rather than aggregation per se, we apply the full analysis to MovieLens-25M (Harper and Konstan, [2015](https://arxiv.org/html/2605.11017#bib.bib40)), where survival bias is expected to be weak.
Table 4: MovieLens-25M analysis (negative control). Unlike Goodreads, aggregate and individual peaks are well-aligned ($D \approx 1$), consistent with weaker survival bias in retrospective rating data.
Table [4](https://arxiv.org/html/2605.11017#S7.T4) contrasts sharply with Goodreads. All 7 genres show strong aggregate inverted-U curves ($R^{2} > 0.99$) and modest distortion ($D \in [0.53, 1.28]$, all-genre $D = 0.93$). Individual and aggregate peaks are well-aligned: medians in $[7.0, 10.0]$, aggregates in $[5.2, 12.6]$.
This is theoretically predicted by Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1). MovieLens is a *retrospective rating* platform where exposure counts reflect past behavior rather than ongoing engagement. The differential attrition that drives the Goodreads paradox is therefore weaker. The negative control confirms that survival bias, not aggregation per se, is the operative mechanism.
## 8 Amazon Electronics: Large-Scale Replication
Does the paradox generalize beyond books and movies? We apply the full analysis to Amazon Electronics (Hou et al., [2024](https://arxiv.org/html/2605.11017#bib.bib46)): 18.05M reviews, 1.88M users, 43 product categories. Electronics purchases represent *transactional* behavior with strong survival bias. Broad explorers exhaust interest quickly; category specialists accumulate deep histories.
Table 5: Amazon Electronics analysis. Individual peaks are from strict per-user Hill-exponential classification ($\geq 10$ interactions per user-category, LRT $p < 0.05$, $\Delta\text{AIC} > 2$, $R^{2} > 0.05$). The aggregate peak is model-derived (Hill-exponential fit to the reliable-region aggregate curve, $R^{2} = 0.909$). The $5.3\times$ gap confirms the Goodreads finding at larger scale.
Table [5](https://arxiv.org/html/2605.11017#S8.T5) reveals a $5.3\times$ Simpson’s distortion (95% CI $[4.3, 8.0]$). We fit the same Hill-exponential model used for Goodreads to the Electronics aggregate curve (62 reliable bins with $\geq 1{,}000$ observations each, $R^{2} = 0.909$), obtaining a model-derived aggregate peak at $n^{*} = 55.2$. Strictly classified individual users peak at a median of $n^{*} \approx 10.4$, closely matching the Goodreads individual peak ($n^{*} \approx 11$) despite the very different domain.
Of 27,586 fitted user-category pairs (from 140,482 eligible), 50.3% pass strict inverted-U gates. The finding is robust: under binary engagement, the aggregate curve shows *no inverted-U at all*; pure Hill (saturating) wins over Hill-exponential by BIC. This is a qualitative Simpson’s reversal: individual users show clear peaks while the aggregate merely saturates, a stronger form of the paradox than a simple peak-location shift.
#### Three-dataset summary.
Across Goodreads ($D = 3\times$, sequential engagement), MovieLens ($D \approx 1\times$, retrospective ratings), and Amazon Electronics ($D = 5.3\times$, transactional purchases), the distortion magnitude tracks the strength of survival bias exactly as Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) predicts (Figure [2](https://arxiv.org/html/2605.11017#S8.F2)). This provides strong evidence that the Simpson’s paradox in behavioral curves is a general phenomenon, not a dataset-specific artifact.
Figure 2: Simpson’s paradox gap ratio across three datasets. Distortion magnitude tracks survival bias strength: minimal in retrospective ratings (MovieLens), moderate in sequential engagement (Goodreads), and extreme in transactional purchases (Electronics).
## 9 Hierarchical Peak Estimation
The Simpson’s paradox identifies the problem; hierarchical modeling provides a principled solution. We use empirical Bayes shrinkage (Efron and Morris, [1975](https://arxiv.org/html/2605.11017#bib.bib42); Gelman et al., [2013](https://arxiv.org/html/2605.11017#bib.bib44)) to partially pool individual peak estimates toward a population mean, balancing individual noise against aggregate bias. Across all 9 Goodreads genres, the hierarchical estimate ($n^{*} = 11.8$) closely tracks the naive individual median ($11.4$) while being far more stable than the aggregate ($34.2$). The mean shrinkage weight $\bar{w} = 0.68$ means typical users retain 68% of their individual estimate. Full model specification, per-genre results, and bootstrap details are in Appendix [G](https://arxiv.org/html/2605.11017#A7).
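The paper's full specification lives in its Appendix G and is not reproduced here. As a generic illustration of the idea, the sketch below implements textbook normal-normal empirical Bayes shrinkage: each user's peak estimate is pulled toward the grand mean with weight $w_{u} = \tau^{2} / (\tau^{2} + \sigma_{u}^{2})$, where $\sigma_{u}^{2}$ is that user's estimation variance and $\tau^{2}$ is a method-of-moments estimate of between-user variance. The function name `shrink_peaks`, the moment estimator, and the example numbers are assumptions.

```python
from statistics import mean, pvariance

def shrink_peaks(estimates, variances):
    """Normal-normal empirical Bayes: shrink each estimate toward the grand mean.

    Returns (shrunk_estimate, weight) pairs; weight is the share of the
    individual estimate that is retained.
    """
    mu = mean(estimates)
    # Method-of-moments between-user variance, floored at zero.
    tau2 = max(pvariance(estimates) - mean(variances), 0.0)
    results = []
    for est, var in zip(estimates, variances):
        w = tau2 / (tau2 + var) if (tau2 + var) > 0 else 0.0
        results.append((w * est + (1.0 - w) * mu, w))
    return results

# Hypothetical per-user peak estimates and their estimation variances.
estimates = [8.0, 10.0, 12.0, 30.0]
variances = [4.0, 4.0, 4.0, 100.0]
shrunk = shrink_peaks(estimates, variances)
for est, (value, w) in zip(estimates, shrunk):
    print(f"raw {est:5.1f} -> shrunk {value:5.2f} (weight {w:.2f})")
```

Under this weighting, a noisy late-peak estimate (the fourth user) is pulled strongly toward the population mean, while precisely estimated users keep most of their individual value, which is the same reading given to $\bar{w} = 0.68$ above.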
## 10 Implications
Any system using aggregate curves to estimate individual behavioral parameters may systematically misallocate resources. Consider the Goodreads example: the aggregate peak at $n^{*} = 34.2$ versus the individual median of $\approx 11$ means a recommender would allocate $\approx 23$ extra exposures per user per genre, all in the post-peak satiation zone. Analogous distortions arise wherever heterogeneous individuals have differing peak locations and differential attrition: clinical dose-response, advertising saturation, learning science.
We recommend a simple diagnostic protocol:
1. Use individual-level fits when per-user data are sufficient ($\geq 30$ observations, calibrated classifier).
2. Fall back to cohort-level fits when data are sparse but a paradox is detected.
3. Use the aggregate *only* when $D \approx 1$, that is, when individual and aggregate peaks align.
The aggregate should be a fallback, not the default\.
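The three-step protocol can be written as a small decision rule. The sketch below is one possible reading of it: the function name `choose_granularity` and, in particular, the tolerance `tol` used to decide whether $D \approx 1$ are hypothetical, since the text does not give a numeric cutoff.

```python
def choose_granularity(obs_per_user, classifier_calibrated, distortion, tol=0.25):
    """Pick a fitting granularity following the three-step protocol.

    `tol` is a hypothetical cutoff for treating D as approximately 1.
    """
    if obs_per_user >= 30 and classifier_calibrated:
        return "individual"
    if abs(distortion - 1.0) > tol:
        return "cohort"      # sparse data, but a paradox is detected
    return "aggregate"       # sparse data and D close to 1

print(choose_granularity(40, True, 3.05))
print(choose_granularity(12, True, 3.05))
print(choose_granularity(12, False, 0.93))
```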
## 11Limitations
#### Three datasets, one ecosystem family\.
Our findings are validated on Goodreads \(books,D=3×D=3\\times\), Amazon Electronics \(purchases,D=5\.3×D=5\.3\\times\), and MovieLens\-25M \(movies,D≈1×D\\approx 1\\times\)\. The three datasets span a wide range of survival bias strengths and confirm the mechanism, but Goodreads and Electronics share the Amazon ecosystem \(Goodreads was acquired by Amazon in 2013\), potentially sharing user demographics or recommendation effects\. MovieLens provides an independent platform, but all three involve discrete item consumption\. Replication on non\-Amazon platforms \(streaming, clinical dose\-response\) would further strengthen generalizability\.
#### Per\-user classifier weakness\.
The strict classifier has selectivity of only1\.48×1\.48\\times, meaning per\-user classification should be treated with caution\. Our Simpson’s paradox finding is robust to this weakness \(it concerns peak locations, not prevalence\), but any system deploying per\-user curve fits should use the calibration methodology we propose\.
#### Binary engagement\.
We operationalize engagement as a binary threshold (rating $\geq 4.0$). Preliminary analysis with continuous engagement (normalized ratings) yields consistent aggregate classifications and similar individual peak locations, but we report only the binary results as our primary analysis. Continuous engagement signals in other domains (dwell time, completion rate) might yield different curve shapes and different distortion magnitudes.
#### Static analysis\.
We fit a single curve per user\-genre pair\. In practice, user behavior evolves over time\. Temporal extensions \(sliding\-window fits, change\-point detection\) are a natural next step\.
#### Temporal confounding\.
Goodreads is a static snapshot\. Users who joined early experienced a different catalog and platform than later users\. Sequential exposure count does not control for cohort effects or self\-selection \(users who read 50 fantasy books chose to\)\. The Simpson’s paradox finding is robust to this concern \(it is about aggregation mathematics, not causal claims about exposure effects\), but the fitted curve parameter values could be confounded by temporal factors\.
## 12 Conclusion
Aggregate behavioral curves can systematically misrepresent individual dynamics through Simpson’s paradox. Across three datasets spanning different attrition regimes, the distortion magnitude tracks survival bias strength: Goodreads ($3\times$), Amazon Electronics ($5.3\times$ under strict classification), and MovieLens ($\approx 1\times$, negative control). On Electronics, aggregation not only shifts the peak but reverses the sign of the exposure-engagement relationship (aggregate Pearson correlation $+0.345$ vs. 44.2% of individuals declining). We contribute hierarchical Bayesian peak estimation as a principled fix, and Synthetic Null Calibration (SNC), which reveals that per-user behavioral classifiers can have false positive rates (32% here) that are rarely reported. The mechanism applies wherever heterogeneous individuals exhibit differential attrition; we recommend hierarchical modeling as the default and aggregate-vs-individual comparison as a standard diagnostic.
#### Reproducibility\.
## Acknowledgments and Disclosure of Funding
## References
- A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014). Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML), pp. 1638–1646.
- N. Alipourfard, P. G. Fennell, and K. Lerman (2018). Using Simpson’s paradox to discover interesting patterns in behavioral data. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), Vol. 12.
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2–3), pp. 235–256.
- D. E. Berlyne (1960). Conflict, arousal, and curiosity. McGraw-Hill, New York.
- C. R. Blyth (1972). On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association 67(338), pp. 364–366.
- J. Chen, X. Xin, J. Wu, and X. He (2021a). Curiosity-driven recommendation strategy. In The Web Conference, pp. 2354–2365.
- M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2021b). Values of user exploration in recommender systems. In Proceedings of the 15th ACM Conference on Recommender Systems (RecSys), pp. 85–95.
- B. Efron and C. Morris (1975). Data analysis using Stein’s estimator and its generalizations. Journal of the American Statistical Association 70(350), pp. 311–319.
- A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian data analysis. 3rd edition, CRC Press.
- A. Gelman (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3), pp. 515–534.
- P. I. Good (2000). Permutation, parametric, and bootstrap tests of hypotheses. Springer Series in Statistics.
- M. J. Gruber, B. D. Gelman, and C. Ranganath (2014). States of curiosity modulate hippocampus-dependent learning via the dopaminergic circuit. Neuron 84(2), pp. 486–496.
- F. M. Harper and J. A. Konstan (2015). The MovieLens datasets: history and context. In ACM Transactions on Interactive Intelligent Systems, Vol. 5, pp. 1–19.
- J. J. Heckman (1979). Sample selection bias as a specification error. Econometrica 47(1), pp. 153–161.
- Y. Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley (2024). Bridging language and items for retrieval and recommendation. In Findings of the Association for Computational Linguistics: NAACL 2024.
- M. J. Kang, M. Hsu, I. M. Krajbich, G. Loewenstein, S. M. McClure, J. T. Wang, and C. F. Camerer (2009). The wick in the candle of learning: epistemic curiosity activates reward circuitry and enhances memory. Psychological Science 20(8), pp. 963–973.
- R. A. Kievit, W. E. Frankenhuis, L. J. Waldorp, and D. Borsboom (2013). Simpson’s paradox in psychological science: a practical guide. Frontiers in Psychology 4, pp. 513.
- G. King (1997). A solution to the ecological inference problem: reconstructing individual behavior from aggregate data. Princeton University Press.
- T. Lattimore and C. Szepesvári (2020). Bandit algorithms. Cambridge University Press.
- L. Li, W. Chu, J. Langford, and R. E. Schapire (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 661–670.
- R. J. A. Little and D. B. Rubin (2019). Statistical analysis with missing data. 3rd edition, Wiley.
- G. Loewenstein (1994). The psychology of curiosity: a review and reinterpretation. Psychological Bulletin 116(1), pp. 75–98.
- J. Pearl (2014). Comment: understanding Simpson’s paradox. The American Statistician 68(1), pp. 8–13.
- J. M. Robins and D. M. Finkelstein (2000). Correcting for non-compliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 56(3), pp. 779–788.
- W. S. Robinson (1950). Ecological correlations and the behavior of individuals. American Sociological Review 15(3), pp. 351–357.
- M. Shaked and J. G. Shanthikumar (2007). Stochastic orders. Springer, New York.
- E. H. Simpson (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B 13(2), pp. 238–241.
- W. R. Thompson (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4), pp. 285–294.
- M. Wan and J. McAuley (2018). Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), pp. 86–94.
## Appendix A Proof of Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) (Extended)
We provide additional detail on the survival bias mechanism. Let each user’s curve have the form $C_u(n) = A_u \cdot h\!\left(\frac{n - n^*_u}{w_u}\right)$ where $h$ is a unimodal shape function with $h(0) = 1$ and $h(x) \to 0$ as $|x| \to \infty$. The key is that the *observed* aggregate is not a simple average over all users, but a survival-weighted average:

$$C^{\text{agg}}_{\text{obs}}(n) = \frac{\sum_u S_u(n) \cdot C_u(n)}{\sum_u S_u(n)} \tag{7}$$

where $S_u(n) = \mathbf{1}[n \leq n_{\max,u}]$ indicates whether user $u$ has data at exposure count $n$. Crucially, $n_{\max,u}$ is correlated with $n^*_u$: users whose curiosity peaks later tend to accumulate more exposures before disengaging.
At low $n$, $S_u(n) \approx 1$ for most users, so the observed aggregate reflects the full population. At high $n$, only users with large $n_{\max,u}$ (and thus typically large $n^*_u$) contribute. This creates a composition shift: the population contributing to the aggregate changes with $n$, enriching for late-peak users at high exposure counts.
In the synthetic validation, we confirm this by showing that survival bias alone (without any amplitude-peak correlation) produces a $3.0\times$ distortion, while amplitude-peak correlation alone (without differential attrition) produces no distortion ($1.0\times$).
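The composition shift in Eq. (7) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's synthetic validation: the Gaussian shape for $h$, the peak range, the curve width, and the attrition rule tying $n_{\max,u}$ to $n^*_u$ are all hypothetical choices, so the size of the resulting shift should not be compared with the $3.0\times$ figure reported above.

```python
import numpy as np


def survival_weighted_aggregate(curves, n_max, n_grid):
    """Eq. (7): average C_u(n) over the users that still have data at n."""
    surv = (n_grid[None, :] <= n_max[:, None]).astype(float)  # S_u(n)
    with np.errstate(invalid="ignore", divide="ignore"):
        return (surv * curves).sum(axis=0) / surv.sum(axis=0)


def simulate(n_users=2000, max_n=80, width=8.0, seed=0):
    rng = np.random.default_rng(seed)
    n_grid = np.arange(1, max_n + 1)
    peaks = rng.uniform(5.0, 40.0, n_users)  # n*_u, hypothetical range
    # Gaussian bump as the unimodal shape h, with A_u = 1 and w_u = width.
    curves = np.exp(-0.5 * ((n_grid[None, :] - peaks[:, None]) / width) ** 2)
    # Differential attrition (assumed rule): later-peaking users stay longer.
    n_max_biased = np.ceil(2.0 * peaks + rng.uniform(0.0, 5.0, n_users))
    # Control: every user is observed over the whole grid (no survival bias).
    n_max_control = np.full(n_users, float(max_n))
    biased = survival_weighted_aggregate(curves, n_max_biased, n_grid)
    control = survival_weighted_aggregate(curves, n_max_control, n_grid)
    return {
        "median_individual_peak": float(np.median(peaks)),
        "aggregate_peak_biased": int(n_grid[np.nanargmax(biased)]),
        "aggregate_peak_control": int(n_grid[np.nanargmax(control)]),
    }


print(simulate())
```

With attrition tied to peak location, the aggregate argmax moves to the right of both the median individual peak and the control aggregate, even though no individual curve has changed.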
## Appendix B Synthetic Calibration Details
### B.1 Synthetic User Generation
Monotonic users are generated with $e(n) \sim \text{Bernoulli}(c_0 + A \cdot e^{-n/s})$ where $c_0 \sim U(0.2, 0.5)$, $A \sim U(0.1, 0.3)$, $s \sim U(10, 50)$. Flat users use $e(n) \sim \text{Bernoulli}(c_0)$ with $c_0 \sim U(0.3, 0.7)$. True inverted-U users use Eq. [1](https://arxiv.org/html/2605.11017#S3.E1) with $c_0 \sim U(0.1, 0.3)$, $A \sim U(0.15, 0.4)$, $a \sim U(1, 3)$, $b \sim U(3, 15)$, $s \sim U(15, 60)$.
Each synthetic user has a random exposure count drawn from the empirical distribution of real user exposure counts \(15–75\)\. The identical preprocessing pipeline \(5\-point smoothing, binary threshold at 4\.0\) is applied to both synthetic and real data\.
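A minimal sketch of the two null generators described above (monotonic and flat users) might look as follows. Several details are assumptions: exposure indices are taken to start at $n = 1$, exposure counts are drawn uniformly from 15 to 75 as a stand-in for the empirical distribution, and the function names are hypothetical. The inverted-U generator is omitted because it depends on Eq. 1, and the smoothing and thresholding steps of the preprocessing pipeline are not reproduced.

```python
import numpy as np


def monotonic_user(n_exposures, rng):
    """Binary engagement with a decaying success probability c0 + A*exp(-n/s)."""
    c0 = rng.uniform(0.2, 0.5)
    amp = rng.uniform(0.1, 0.3)
    scale = rng.uniform(10.0, 50.0)
    n = np.arange(1, n_exposures + 1)  # assumption: exposures indexed from 1
    prob = c0 + amp * np.exp(-n / scale)  # at most 0.8, so always a valid probability
    return rng.binomial(1, prob)


def flat_user(n_exposures, rng):
    """Binary engagement with a constant success probability c0."""
    c0 = rng.uniform(0.3, 0.7)
    return rng.binomial(1, c0, size=n_exposures)


def generate_users(n_users, seed=0):
    rng = np.random.default_rng(seed)
    users = []
    for i in range(n_users):
        # Assumed stand-in: uniform integer in [15, 75] instead of the empirical distribution.
        length = int(rng.integers(15, 76))
        kind = "monotonic" if i % 2 == 0 else "flat"
        maker = monotonic_user if kind == "monotonic" else flat_user
        users.append({"kind": kind, "engagement": maker(length, rng)})
    return users


users = generate_users(4, seed=1)
print([(u["kind"], len(u["engagement"])) for u in users])
```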
### B.2 Power Analysis
We assess detection power as a function of effect amplitude \(peak\-to\-trough difference in engagement probability\) and sample size \(observations per exposure bin\), using 30 repetitions per condition:
Table 6: Detection power (fraction of 30 synthetic datasets correctly classified as inverted-U) by effect amplitude and observations per exposure bin.

The method reliably detects effects with amplitude $\geq 10\%$ at realistic sample sizes ($\geq 500$ observations per bin). Below 5%, detection is conservative (no false positives, but misses weak effects).
## Appendix C Additional Genre Results
### C.1 Aggregate Fit Quality
Table [7](https://arxiv.org/html/2605.11017#A3.T7) reports goodness-of-fit and ascent significance for the aggregate-level Hill-exponential fits supporting Table [1](https://arxiv.org/html/2605.11017#S4.T1).

Table 7: Aggregate-level fit diagnostics across 7 genres. OOS $R^2$ is computed on held-out exposure bins (every 5th bin); ascent $p$-value is the slope-significance test for the pre-peak rise.

Two anomalies are worth noting. Comics fails ascent significance ($p = 0.317$) despite visual evidence of a rise: the peak at $n^* \approx 4$ leaves only 3 pre-peak bins, insufficient for reliable slope estimation. History and Mystery have negative OOS $R^2$, indicating the model captures noise rather than signal at the aggregate level.
### C.2 Parameter Identifiability
High bootstrap correlations between Hill-Exp parameters are structurally expected: $c_0$–$s$ ($\rho = -0.979$), $A$–$b$ ($\rho = -0.861$). Despite individual parameter uncertainty, the peak location $n^*$ is well-determined. For Fantasy, the bootstrap 95% CI for $n^*$ is $[1.8, 5.2]$, a tight interval despite the parameter correlations.
### C.3 Methodology Validation
We validate the aggregate classification pipeline on 6 synthetic scenarios: strong inverted-U, weak inverted-U, noisy inverted-U, mixed population, monotonic decay, and flat. All 6 are correctly classified (strong/weak $\to$ Class A; mixed $\to$ Class C; monotonic/flat $\to$ not A/B).
## Appendix D Cross-Genre Consistency and Permutation Test
A key strength of the Simpson’s paradox finding is its consistency across genres (Figure [3](https://arxiv.org/html/2605.11017#A4.F3)). We test this with a within-user permutation test that correctly accounts for the dependence structure (users who read multiple genres).

Figure 3: Cross-genre consistency of individual peak locations. Box plots show the distribution of individual $n^*$ values for strict-classified users in each genre. Red diamonds indicate the aggregate peak $n^*_g$. Individual medians (horizontal bars) cluster tightly while aggregate peaks vary wildly.

#### Permutation test design.
Under the null hypothesis $H_0$ that individual peak locations are artifacts of the fitting procedure (determined by noise and model flexibility, not systematic user behavior), a user’s fitted peak in genre $A$ is exchangeable with their fitted peak in genre $B$. We permute each multi-genre user’s peak values across their genres [Good, [2000](https://arxiv.org/html/2605.11017#bib.bib45)], preserving the within-user correlation structure: (1) compute the observed range $R_{\text{obs}} = \max_g(\text{median}_g) - \min_g(\text{median}_g)$ of genre-level median peaks; (2) for each of 10,000 permutations, shuffle each multi-genre user’s peak values across their genres, recompute genre medians, and compute the permuted range $R_{\text{perm}}$; (3) $p\text{-value} = P(R_{\text{perm}} \leq R_{\text{obs}})$.
This test is strictly more valid than an independence\-based bound: users reading both Fantasy and Romance create statistical dependence between those genres’ peak distributions, and the within\-user permutation preserves this structure exactly\. Single\-genre users contribute identical values in every permutation \(conservative\)\.
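The three steps of the permutation design can be sketched as follows. This is an illustration under assumptions, not the authors' code: the input layout (one dict per user mapping genre to fitted peak) and the function names are hypothetical, and the pure-Python loop favors readability over speed.

```python
import numpy as np


def genre_median_range(peaks_by_user, genres):
    """Range (max - min) of genre-level median peaks."""
    medians = []
    for g in genres:
        vals = [p[g] for p in peaks_by_user if g in p]
        if vals:
            medians.append(float(np.median(vals)))
    return max(medians) - min(medians)


def within_user_permutation_test(peaks_by_user, n_perm=10_000, seed=0):
    """Shuffle each user's peaks across that user's own genres only."""
    rng = np.random.default_rng(seed)
    genres = sorted({g for p in peaks_by_user for g in p})
    r_obs = genre_median_range(peaks_by_user, genres)  # step (1)
    hits = 0
    for _ in range(n_perm):  # step (2)
        permuted = []
        for p in peaks_by_user:
            user_genres = list(p.keys())
            shuffled = rng.permutation([p[g] for g in user_genres])
            permuted.append(dict(zip(user_genres, shuffled)))
        if genre_median_range(permuted, genres) <= r_obs:
            hits += 1
    return r_obs, hits / n_perm  # step (3): P(R_perm <= R_obs)


# Toy input with hypothetical values; the third user reads a single genre.
toy = [
    {"fantasy": 10.0, "romance": 12.0},
    {"fantasy": 11.0, "romance": 13.0},
    {"mystery": 9.0},
]
r_obs, p_value = within_user_permutation_test(toy, n_perm=200)
print(r_obs, p_value)
```

Because values are only reshuffled within a user, a single-genre user contributes the same value in every permutation, which matches the conservative behavior noted above.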
#### Results\.
On Goodreads (9 genres, 440–1,250 multi-genre users with valid Hill-Exp fits depending on sampling), the permutation test does not reach significance ($p > 0.2$ across runs), and the direction of the effect is unstable across stratified samples. This reflects the test’s limited power: with noisy per-user peak estimates and moderate multi-genre overlap, the cross-genre signal-to-noise ratio is insufficient. We report this result as genuinely inconclusive.
Crucially, the Simpson’s paradox finding does not require cross\-genre consistency\. It requires only that individual peaks are systematically earlier than the aggregate, which holds across all 9 genres independently\. The permutation test addresses a secondary question \(are individual peaks genre\-invariant?\) orthogonal to the primary finding\.
## Appendix E Extended Theoretical Results
###### Theorem 2 (Non-parametric survival distortion).
Let $(n^*_u, S_u(n))$ be jointly distributed under any law $P$ with $E_P[S_u(n)] > 0$, with no further parametric assumptions. Define the selection-conditional measure on peaks by $d\mu_S(t) := P(n^*_u = t, S_u(n) = 1)/E_P[S_u(n)]$. Then:

$$\int t\,d\mu_S(t) - \int t\,dP_{n^*}(t) = \frac{\mathrm{Cov}_P(n^*_u, S_u(n))}{E_P[S_u(n)]} \tag{8}$$

Moreover, if $\mu_S$ first-order stochastically dominates $P_{n^*}$, the LHS is $\geq 0$, with equality iff $\mu_S \equiv P_{n^*}$.
###### Proof\.
$\int t\,d\mu_S = E[n^*_u S_u(n)]/E[S_u(n)]$. Subtract $E[n^*_u]$ and apply $E[XY] - E[X]E[Y] = \mathrm{Cov}(X, Y)$. The FOSD implication follows from Shaked and Shanthikumar [[2007](https://arxiv.org/html/2605.11017#bib.bib47)]. ∎
This generalizes Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) by (i) dropping i.i.d. across users, (ii) admitting any joint law (continuous, discrete, or mixed), and (iii) providing a measure-theoretic form suitable for FOSD and Wasserstein extensions.
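The identity in Eq. (8) holds exactly for an empirical distribution, which makes it easy to check numerically. The sketch below evaluates both sides on a handful of made-up peak values with a binary survival indicator at a fixed exposure count; the function name and the numbers are hypothetical.

```python
import numpy as np


def selection_shift(peaks, survived):
    """Return (LHS, RHS) of Eq. (8) for an empirical sample.

    LHS: mean peak among surviving users minus mean peak among all users.
    RHS: Cov(n*, S) / E[S], using population (1/N) moments.
    """
    peaks = np.asarray(peaks, dtype=float)
    survived = np.asarray(survived, dtype=float)
    s_bar = survived.mean()  # E[S]
    lhs = (peaks * survived).sum() / survived.sum() - peaks.mean()
    cov = (peaks * survived).mean() - peaks.mean() * s_bar
    return lhs, cov / s_bar


# Hypothetical sample: only the three latest-peaking users survive.
lhs, rhs = selection_shift([5, 8, 11, 20, 34], [0, 0, 1, 1, 1])
print(lhs, rhs)  # both approximately 6.0667
```

Here the surviving users have a mean peak of $65/3$ against a population mean of $15.6$, so both sides equal roughly $6.07$: the shift is entirely accounted for by the covariance between peak location and survival.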
###### Theorem 3 (Pooled distortion across datasets).
Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be independent datasets, each satisfying Theorem [2](https://arxiv.org/html/2605.11017#Thmtheorem2) with correlation $\rho_d := \mathrm{Corr}(n^*_u, S_u(n))$, standard deviation $\sigma_d := \mathrm{SD}(n^*_u)$, survival rate $\bar{S}_d := E[S_u(n)]$, and sample size $n_d$. Define the sample-weighted pooled estimator:

$$\hat{\Delta} := \sum_{d=1}^{k} \frac{n_d}{N} \cdot \rho_d \sigma_d \sqrt{\frac{1 - \bar{S}_d}{\bar{S}_d}}, \quad N := \sum_d n_d. \tag{9}$$

Then: (1) $E[\hat{\Delta}] = \sum_d (n_d/N)\Delta_d$ (unbiased); (2) $\hat{\Delta} \geq 0$ with equality iff $\rho_d = 0$ for all $d$ (sign-pinned); (3) $\mathrm{Var}(\hat{\Delta}) \leq N^{-1} \max_d[(\rho_d \sigma_d)^2 (1 - \bar{S}_d)/\bar{S}_d]$; (4) under $H_0: \rho_d = 0\;\forall d$, $\sqrt{N}\,\hat{\Delta} \to_d \mathcal{N}(0, V)$, enabling a one-sided $z$-test.
###### Proof\.
\(1\) Linearity of expectation\. \(2\) Each summand is non\-negative\. \(3\) Independence across datasets plus Cauchy–Schwarz per term\. \(4\) Per\-dataset CLT; sum of independent normals is normal\. ∎
Theorem [3](https://arxiv.org/html/2605.11017#Thmtheorem3) directly addresses the single-dataset generalizability concern: applied to our three datasets (Goodreads, MovieLens, Electronics), the pooled estimator $\hat{\Delta} > 0$ with the null hypothesis rejected at $p < 0.001$.
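For a binary survival indicator, $\mathrm{SD}(S) = \sqrt{\bar{S}(1 - \bar{S})}$, so each summand $\rho_d \sigma_d \sqrt{(1 - \bar{S}_d)/\bar{S}_d}$ in Eq. (9) equals the per-dataset shift $\mathrm{Cov}/E[S]$ from Eq. (8). A minimal sketch of the pooled estimator follows; the function names are hypothetical and the input values are invented for illustration, not taken from the paper's datasets.

```python
import math


def per_dataset_shift(rho, sigma, s_bar):
    """One summand of Eq. (9): rho * sigma * sqrt((1 - S_bar) / S_bar)."""
    return rho * sigma * math.sqrt((1.0 - s_bar) / s_bar)


def pooled_distortion(datasets):
    """Sample-weighted pooled estimator.

    `datasets` is a list of (rho_d, sigma_d, s_bar_d, n_d) tuples.
    """
    total_n = sum(n for _, _, _, n in datasets)
    return sum(
        (n / total_n) * per_dataset_shift(rho, sigma, s_bar)
        for rho, sigma, s_bar, n in datasets
    )


# Hypothetical inputs: one dataset with survival bias, one without.
example = [(0.5, 10.0, 0.2, 100), (0.0, 5.0, 0.5, 300)]
print(pooled_distortion(example))  # 2.5
```

In the example, the first dataset contributes $0.25 \times 0.5 \times 10 \times 2 = 2.5$ and the second contributes nothing because its correlation is zero. The weighted sum is non-negative whenever every $\rho_d$ is non-negative.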
## Appendix F Amazon Electronics: Robustness Checks
### F.1 K-Core Sensitivity
To verify that the $5.3\times$ gap is not an artifact of the minimum-interaction threshold, we repeat the strict classification pipeline at $k$-core values of 5, 10, and 20 (minimum interactions per user-category pair).

Table 8: K-core sensitivity analysis. The bin-argmax gap is stable at $\approx 4.3\times$ across all thresholds; the Hill-exp model-derived gap ($5.3\times$) is computed at $k = 10$.

The gap is robust across $k$-core thresholds, confirming that the Simpson’s paradox is not driven by sparse user-category pairs.
### F.2 Selection Bias in Failed Fits
A concern is that users who fail the strict classifier might be undetected inverted-U users, inflating the gap. We compare covariates of failed-fit pairs ($n = 113{,}126$) against strict-classified A+B pairs ($n = 13{,}876$):
- Rating variance: failed = 0.10 vs. A+B = 1.17. Failed users are near-constant raters (predominantly 5/5), not hidden inverted-U users.
- First rating: failed = 4.71 vs. A+B = 3.94. Failed users start with high engagement that never declines.

The 50.3% inverted-U rate is a defensible *lower bound*: excluded users are saturated raters, not inverted-U users whose signal falls below the classifier’s sensitivity.
### F.3 Permutation Null Test
To confirm that the aggregate Hill-exponential fit captures genuine structure rather than noise, we run a permutation test: shuffle user-category assignments 500 times and refit the aggregate curve.
- Real data: $R^2 = 0.914$
- Permuted mean: $R^2 = 0.047$
- $p = 0/500$ permutations

The aggregate engagement-by-exposure structure is highly significant against random shuffling.
### F.4 Additional Figures
Figure 4: Amazon Electronics aggregate engagement curve with Hill-exponential fit overlay (43 main categories, 62 reliable bins). Model-derived peak at $n^* = 55.2$ ($R^2 = 0.909$); strict individual median $n^* \approx 10.4$, yielding a $5.3\times$ distortion.

Figure 5: Distribution of individual peak locations on Amazon Electronics (strict classification, 13,876 A+B users). Individual peaks cluster near $n^* \approx 10$ while the model-derived aggregate peak is at $n^* = 55.2$.
## Appendix G Hierarchical Model Details
### G.1 Per-Genre Hierarchical Estimates
Table 9: Comparison of peak estimation methods on Goodreads. Hierarchical estimates are closer to individual peaks than the aggregate, while being more stable than naive per-user fits (lower IQR). $\bar{w}$ is the mean shrinkage weight in the genre.
### G.2 Empirical Bayes Estimation
Population parameters are estimated from the observed per-user peaks via moment matching on the log scale:

$$\hat{\mu}_{\text{pop}} = \frac{1}{N}\sum_{u=1}^{N} \log \hat{n^*}_u \tag{10}$$

$$\hat{\sigma}_{\text{pop}}^2 = \max\!\left(\text{Var}(\log \hat{n^*}_u) - \frac{1}{N}\sum_{u=1}^{N} \sigma_{\text{obs},u}^2,\; 0.01\right) \tag{11}$$

The variance decomposition subtracts the estimated observation noise from the total variance to recover the population variance. The floor of $0.01$ prevents degenerate shrinkage when the population is highly homogeneous.
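The moment-matching step in Eqs. (10) and (11) is short enough to sketch directly. The function name is hypothetical, and the choice of the sample variance (`ddof=1`) for the total-variance term is an assumption, since the text does not say which variance estimator is used.

```python
import numpy as np


def empirical_bayes_population(peak_hats, sigma_obs, floor=0.01):
    """Estimate population mean and variance of log-peaks by moment matching.

    peak_hats: per-user fitted peaks (positive values).
    sigma_obs: per-user standard deviation of the log-peak estimate.
    """
    log_peaks = np.log(np.asarray(peak_hats, dtype=float))
    sigma_obs = np.asarray(sigma_obs, dtype=float)
    mu_pop = float(log_peaks.mean())  # Eq. (10)
    total_var = float(log_peaks.var(ddof=1))  # assumption: sample variance
    noise_var = float(np.mean(sigma_obs ** 2))
    var_pop = max(total_var - noise_var, floor)  # Eq. (11), floored at 0.01
    return mu_pop, var_pop


# Hypothetical peaks whose logs are 1, 2, 3 (total variance 1.0).
peaks = np.exp([1.0, 2.0, 3.0])
print(empirical_bayes_population(peaks, [0.5, 0.5, 0.5]))  # approx (2.0, 0.75)
print(empirical_bayes_population(peaks, [2.0, 2.0, 2.0]))  # approx (2.0, 0.01): floor applies
```

The second call shows the floor in action: when the estimated observation noise exceeds the total variance, the subtraction would go negative and the floor of 0.01 is returned instead.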
### G.3 Bootstrap Standard Errors
Per-user observation uncertainty $\sigma_{\text{obs},u}$ is estimated via 200 bootstrap resamples of each user’s engagement sequence. For each resample, we refit the Hill-exponential model and record $\log \hat{n^*}$. The standard deviation of these bootstrap log-peaks is $\sigma_{\text{obs},u}$. Users with fewer than 10 valid bootstrap fits are assigned $\sigma_{\text{obs},u} = 1.0$ (high uncertainty), ensuring heavy shrinkage toward the population mean.
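The bootstrap procedure can be sketched with the curve fit abstracted behind a callable. Several things here are assumptions: `fit_peak` is a hypothetical stand-in for the Hill-exponential refit (it should return a positive peak location, or `None` or raise when the fit fails), and resampling (exposure, engagement) pairs with replacement is one plausible reading of "resamples of each user's engagement sequence".

```python
import numpy as np


def bootstrap_log_peak_sd(exposures, engagement, fit_peak,
                          n_boot=200, min_valid=10, fallback=1.0, seed=0):
    """Standard deviation of bootstrap log-peaks, with a high-uncertainty fallback."""
    rng = np.random.default_rng(seed)
    exposures = np.asarray(exposures)
    engagement = np.asarray(engagement)
    log_peaks = []
    for _ in range(n_boot):
        # Assumed scheme: resample observation pairs, then restore exposure order.
        idx = np.sort(rng.integers(0, len(exposures), size=len(exposures)))
        try:
            peak = fit_peak(exposures[idx], engagement[idx])
        except (RuntimeError, ValueError):
            continue  # treat a failed fit as an invalid resample
        if peak is not None and np.isfinite(peak) and peak > 0:
            log_peaks.append(np.log(peak))
    if len(log_peaks) < min_valid:
        return fallback  # fewer than 10 valid fits: assign sigma_obs = 1.0
    return float(np.std(log_peaks, ddof=1))


# Stand-in fitters (hypothetical) to exercise both branches.
n = np.arange(1, 31)
e = np.tile([1, 0], 15)
print(bootstrap_log_peak_sd(n, e, lambda x, y: None))  # 1.0 (fallback)
print(bootstrap_log_peak_sd(n, e, lambda x, y: 10.0))  # approximately 0.0
```

Returning the fallback of 1.0 on the log scale gives such users a large observation variance, which is what drives the heavy shrinkage described above.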
## NeurIPS Paper Checklist
1. Claims
   - Answer: [Yes]
   - Justification: Abstract and Section [1](https://arxiv.org/html/2605.11017#S1) state all three contributions with scope (3 datasets, $3\times$–$5\times$ distortion range). Limitations in Section [11](https://arxiv.org/html/2605.11017#S11).
2. Limitations
   - Answer: [Yes]
   - Justification: Section [11](https://arxiv.org/html/2605.11017#S11) discusses ecosystem overlap, classifier weakness, binary engagement, static analysis, and temporal confounding.
3. Theory assumptions and proofs
   - Answer: [Yes]
   - Justification: Theorem [1](https://arxiv.org/html/2605.11017#Thmtheorem1) includes assumptions and proof in Section [4](https://arxiv.org/html/2605.11017#S4); extended proof in Appendix [A](https://arxiv.org/html/2605.11017#A1). Theorems [2](https://arxiv.org/html/2605.11017#Thmtheorem2) and [3](https://arxiv.org/html/2605.11017#Thmtheorem3) with proofs in Appendix [E](https://arxiv.org/html/2605.11017#A5).
4. Experimental result reproducibility
   - Answer: [Yes]
   - Justification: Section [3](https://arxiv.org/html/2605.11017#S3) specifies all classification criteria, datasets, engagement thresholds, smoothing, and minimum observation requirements. Code released.
5. Open access to data and code
   - Answer: [Yes]
6. Experimental setting/details
   - Answer: [Yes]
   - Justification: Section [3](https://arxiv.org/html/2605.11017#S3) specifies classification criteria, model selection pipeline, and dataset details.
7. Experiment statistical significance
   - Answer: [Yes]
   - Justification: Bootstrap 95% CIs for median peak estimates (Section [4](https://arxiv.org/html/2605.11017#S4)). Hill-exp aggregate fit CI $[4.3, 8.0]$ for Electronics gap (Section [8](https://arxiv.org/html/2605.11017#S8)). Permutation test with 10,000 permutations (Appendix [D](https://arxiv.org/html/2605.11017#A4)).
8. Experiments compute resources
   - Answer: [N/A]
   - Justification: All experiments are CPU-only curve fitting; no GPU or significant compute required. Total runtime under 2 hours on a single CPU.
9. Code of ethics
   - Answer: [Yes]
   - Justification: Research uses only publicly available datasets. No human subjects, no privacy concerns.
10. Broader impacts
    - Answer: [N/A]
    - Justification: This is a statistical methodology paper identifying a measurement artifact. The primary impact is improving the accuracy of behavioral models. No direct negative societal impact.
11. Safeguards
    - Answer: [N/A]
    - Justification: No models or datasets with high misuse risk are released. Uses only publicly available datasets.
12. Licenses for existing assets
    - Answer: [Yes]
    - Justification: Goodreads dataset cited as Wan and McAuley [[2018](https://arxiv.org/html/2605.11017#bib.bib28)]; MovieLens-25M cited as Harper and Konstan [[2015](https://arxiv.org/html/2605.11017#bib.bib40)]; Amazon Electronics cited as Hou et al. [[2024](https://arxiv.org/html/2605.11017#bib.bib46)]. All publicly available for research.
13. New assets
    - Answer: [N/A]
    - Justification: No new datasets introduced. Analysis code released via anonymous repository.
14. Crowdsourcing and research with human subjects
    - Answer: [N/A]
    - Justification: No human subjects research. Analysis of existing public datasets.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Answer: [N/A]
    - Justification: No human subjects research.
16. Declaration of LLM usage
    - Answer: [N/A]
    - Justification: LLMs are not used as a component of the core methodology. The research involves statistical curve fitting and aggregation analysis.