ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agents


Summary

ProfileFoundry introduces a deterministic synthetic person-object dataset of 100,000 profiles with 700k+ events, designed to evaluate LLM agents on privacy, memory, and tool-use while ensuring inspectability and consistency.

arXiv:2606.26403v1 Announce Type: new Abstract: Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or redistribute responsibly, while independently generated fake fields rarely preserve the cross-field and temporal consistency needed for controlled evaluation. We present PROFILEFOUNDRY, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales. Each object combines a typed current snapshot, household, family, and employer links, snapshot-aligned events, normalized relational views, and generation provenance. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges. We report evidence in separate categories: selected population-marginal comparisons, per-object invariant checks, release-wide referential and temporal closure, and coincidence/provenance screens. PROFILEFOUNDRY is not a population-fidelity model, a rendered-text corpus, or a formal privacy mechanism. Instead, it is a responsible synthetic source layer for constructing downstream foundation-model evaluations involving memory, privacy, document understanding, record linkage, and agent state while keeping the synthetic person behind each artifact inspectable

Cached at: 06/26/26, 05:15 AM

# A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agents
Source: [https://arxiv.org/html/2606.26403](https://arxiv.org/html/2606.26403)
###### Abstract

Foundation\-model research increasingly needs data about people: user state, personal histories, relationships, contact\-like fields, documents, and longitudinal updates\. Real user data is difficult to share, perturb, audit, or redistribute responsibly, while independently generated fake fields rarely preserve the cross\-field and temporal consistency needed for controlled evaluation\. We present ProfileFoundry, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales\. Each object combines a typed current snapshot; household, family, and employer links; snapshot\-aligned events; normalized relational views; and generation provenance\. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges\. We report evidence in separate categories: selected population\-marginal comparisons, per\-object invariant checks, release\-wide referential and temporal closure, and coincidence/provenance screens\. ProfileFoundry is not a population\-fidelity model, a rendered\-text corpus, or a formal privacy mechanism\. Instead, it is a responsible synthetic source layer for constructing downstream foundation\-model evaluations involving memory, privacy, document understanding, record linkage, and agent state while keeping the synthetic person behind each artifact inspectable\.

## 1 Introduction

Research on language\-model privacy, memorization, and user\-state behavior has become increasingly dataset\-dependent\. Memorization studies have used inserted canaries to measure exposure \(Carlini et al\., [2019](https://arxiv.org/html/2606.26403#bib.bib1)\), web\-scale extraction attacks to recover verbatim training snippets containing public PII \(Carlini et al\., [2021](https://arxiv.org/html/2606.26403#bib.bib2)\), and real user traces to study privacy inference beyond memorization \(Staab et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib26)\)\. PII detection and redaction work has introduced synthetic span\-labeled corpora such as SPY and Nemotron\-PII \(Savkin et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib24); NVIDIA Corporation, [2025](https://arxiv.org/html/2606.26403#bib.bib17)\)\. Long\-term memory and personalization benchmarks use multi\-session conversations, user histories, or curated profiles to test whether models can recall, update, and apply user state \(Maharana et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib13); Wu et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib31); Salemi et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib23); Jiang et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib11)\)\. These lines of work show that synthetic or semi\-synthetic personal data is already central to NLP research, especially when real personal data is unsafe or unreleasable\.

These datasets leave a common lower\-level gap\. Canary studies provide controlled secrets, but not coherent people\. Extraction and inference studies often depend on real web or social traces, which are difficult to redistribute, perturb, or audit as synthetic identities\. PII corpora provide labeled spans in rendered text, but usually do not expose the underlying person graph that generated those spans\. Memory and personalization benchmarks provide fixed conversations, histories, or user profiles for a particular evaluation, but not a scalable population from which new privacy, retrieval, dialogue, document, or agent\-state datasets can be derived under the same schema and seed\. The missing artifact is therefore not another single task benchmark; it is a reusable source layer: an auditable population of internally consistent synthetic people whose identifiers, relationships, employers, addresses, timelines, and provenance can be rendered into many downstream study designs\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/pf_fig_a_complete_object.png)

Figure 1: A sample released Person Object from the en\-US locale\.

ProfileFoundry is designed as this base layer\. It provides structured adult Person Objects whose demographics, contacts, households, family links, employers, addresses, identifiers, and event histories are generated and validated together\. Researchers can transform these objects into task\-specific artifacts such as PII\-laced text, memory entries, retrieval documents, forms, dialogue states, linkage pairs, perturbation sets, or exposure corpora while retaining a known source object and deterministic generation path\. Existing resources support important pieces of this workflow, but they generally expose different layers: rendered text, task instances, learned or simulated tables, local fake fields, or domain\-specific longitudinal records\.

ProfileFoundry does not claim that structured synthetic profiles are new\. Its distinct contribution is to separate and release a broader source artifact: a versioned schema, represented\-person and employer graph, snapshot\-aligned events, normalized analytical views, deterministic generator, and release\-level evidence\. Appendix Tables [18](https://arxiv.org/html/2606.26403#A1.T18) and [19](https://arxiv.org/html/2606.26403#A1.T19) record the artifact\-level distinction\.

ProfileFoundry is especially relevant now because language systems increasingly act through tools, memory, retrieval, and persistent user state\. A source object makes controlled confounders possible: two people can share a household, employer, surname, or city without being the same person; an earlier address or job can be superseded by a later event; and rendered spans can retain links to the fields and events that produced them\. The present paper establishes the source layer and its released evidence, not downstream model effectiveness\.

Our contributions are:

- Person Object abstraction: a reusable Person Object with snapshot fields, graph links, typed events, reserved document hooks, and generation provenance for constructing sensitive\-data\-like foundation\-model evaluations without releasing real user traces\.
- Internally consistent linked generation: the generator first commits to household roles, represented relationships, shared addresses, employers, and snapshot facts, then derives person records, graph edges, foreign keys, and snapshot\-aligned events from those commitments so that linked fields stay mutually consistent, as they would for a real person\.
- Executable generator and SDK: a Python package and CLI for deterministic profile and household generation, scaled release builds, validation, export, and release\-rebuild workflows\.
- Audited reference release: the 100K release includes canonical JSONL, a complete viewer Parquet file, flat snapshots, normalized relational tables, manifest hashes, and a dataset card, accompanied by validation, leakage, and report\-quality evidence\.
- Artifact\-accountability protocol: a release audit that separates distributional gaps, declared consistency, referential and temporal closure, coincidence screens, reserved\-domain email checks, provenance, and documentation\-drift evidence\.

## 2 Related Work

Synthetic personal data spans privacy, personalization, agent evaluation, record linkage, statistical disclosure control, and domain simulation, but these areas usually release different artifacts\. Privacy corpora expose rendered text and labels; memory benchmarks expose fixed histories or tasks; population and tabular systems expose linked records, learned relationships, or domain simulations; and fake\-data libraries expose localized fields or schema\-generated records\. ProfileFoundry targets the reusable source layer: schema\-governed Person Objects with inspectable links, typed state\-changing events, normalized exports, provenance, and release\-level evidence\.

#### Privacy\-rich text and PII corpora\.

Language\-model privacy work shows that models can memorize sensitive strings and infer private attributes without verbatim reproduction \(Carlini et al\., [2019](https://arxiv.org/html/2606.26403#bib.bib1); [2021](https://arxiv.org/html/2606.26403#bib.bib2); Staab et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib26)\)\. Privasis, PANORAMA, SynthPAI, SPY, Nemotron\-PII, Gretel’s multilingual financial PII data, and PIIBench address releasable private text, PII/PHI detection, span labeling, de\-identification, memorization, or corpus unification \(Kim et al\., [2026](https://arxiv.org/html/2606.26403#bib.bib12); Selvam and Ghosh, [2025](https://arxiv.org/html/2606.26403#bib.bib25); Yukhymenko et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib32); Savkin et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib24); NVIDIA Corporation, [2025](https://arxiv.org/html/2606.26403#bib.bib17); Gretel AI, [2024](https://arxiv.org/html/2606.26403#bib.bib7); Jha, [2026](https://arxiv.org/html/2606.26403#bib.bib10)\)\. Several use profiles internally and provide substantial human or benchmark evaluation\. ProfileFoundry differs at the exposed layer: it releases the structured identities, households, employers, relationships, identifiers, events, normalized views, and provenance from which text corpora can be derived, but it does not itself release rendered prose or span labels\.

#### Personalization, memory, and private\-user benchmarks\.

PersonaBench, LaMP, LoCoMo, LongMemEval, and PersonaMem evaluate personal\-information QA, personalized tasks, long\-term conversational memory, temporal reasoning, knowledge updates, or user\-aware response generation \(Tan et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib27); Salemi et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib23); Maharana et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib13); Wu et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib11)\)\. LoCoMo and PersonaMem have meaningful temporal structure, and PersonaBench uses a social graph during construction\. Their primary interfaces remain fixed conversations, documents, or tasks\. ProfileFoundry instead makes source state reusable through stable profile IDs, represented\-person links, employer IDs, relationship edges, and typed histories\.

#### Personas and simulated behavior\.

PersonaChat, Persona Hub, and generative\-agent systems use persona facts, descriptions, memories, reflection, planning, or social behavior as conditioning signals \(Zhang et al\., [2018](https://arxiv.org/html/2606.26403#bib.bib34); Ge et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib5); Park et al\., [2023](https://arxiv.org/html/2606.26403#bib.bib18)\)\. Generative Agents in particular model social behavior over simulated time\. ProfileFoundry is narrower in behavior and broader in release structure: it provides inspectable personal\-state objects and links rather than a behavioral simulation or persona\-prompt collection\.

#### Synthetic populations, record linkage, and tabular synthesis\.

Pseudopeople exposes stable simulant, household, and employer identifiers across simulated administrative records and is the closest population\-level comparator \(Haddock et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib8); pseudopeople Contributors, [2026](https://arxiv.org/html/2606.26403#bib.bib38)\)\. Synthea provides longitudinal, linked patient records within healthcare \(Walonoski et al\., [2018](https://arxiv.org/html/2606.26403#bib.bib30)\)\. Febrl and GeCo support generation, corruption, and linkage workflows \(Christen, [2008](https://arxiv.org/html/2606.26403#bib.bib3); Tran et al\., [2013](https://arxiv.org/html/2606.26403#bib.bib28)\)\. synthpop, SDV, PrivBayes, and PrivSyn provide statistical, relational, sequential, or differentially private synthesis \(Nowok et al\., [2016](https://arxiv.org/html/2606.26403#bib.bib16); Patki et al\., [2016](https://arxiv.org/html/2606.26403#bib.bib19); Zhang et al\., [2017](https://arxiv.org/html/2606.26403#bib.bib33); [2021](https://arxiv.org/html/2606.26403#bib.bib35)\)\. ProfileFoundry does not replace these systems; it packages a multi\-locale, NLP\-facing person\-state layer with canonical objects, normalized views, represented links, typed events, and release evidence\.

#### From fake fields to coupled objects\.

Faker generates localized values and composite profiles, while current Mimesis schemas can express foreign\-key references between generated schemas \(Faker Contributors, [2025](https://arxiv.org/html/2606.26403#bib.bib4); Mimesis Contributors, [2026](https://arxiv.org/html/2606.26403#bib.bib36)\)\. ProfileFoundry uses fake\-data providers at the leaf level but adds release\-specific cross\-field constraints, represented household and employer commitments, snapshot\-aligned histories, deterministic identifiers, and audit artifacts\. Its novelty is therefore the combination and exposure of these layers, not the first generation of synthetic names, profiles, households, or longitudinal records\.

Appendix Tables [17](https://arxiv.org/html/2606.26403#A1.T17), [18](https://arxiv.org/html/2606.26403#A1.T18), and [19](https://arxiv.org/html/2606.26403#A1.T19) summarize the closest adjacent resources and give the resource\-by\-resource evidence rubric\.

## 3 Object Contract

ProfileFoundry generates from a constrained object space\. A Person Object is not a bag of independent fake fields; it is a typed adult record whose snapshot fields, household references, employer links, event history, normalized rows, and provenance are generated as mutually constrained commitments\. Figure [1](https://arxiv.org/html/2606.26403#S1.F1) shows the object at the level a downstream NLP system would consume: current fields, linked household members, snapshot\-aligned address and job histories, release rows, and seed and manifest metadata\.

The canonical schema has four surfaces\. The *snapshot* surface contains identity, contact, addresses, employment, education, finance, health, government IDs, household ID, family graph, events, reserved document hooks, and generation metadata\. The *graph* surface contains household membership, spouse or partner links, parent–adult\-child links, sibling links, colleague links, and employer IDs\. The *temporal* surface contains typed events: birth, education, move, job\_change, marriage, divorce, name\_change, and credit\_event\. The *provenance* surface records global seed, profile seed, SDK version, generation date, exported timestamp, and reference manifest hash\.

The schema is also an interoperability contract\. The in\-memory source is implemented with Pydantic models and exported as JSON Schema for non\-Python consumers \(Pydantic Contributors, [2025](https://arxiv.org/html/2606.26403#bib.bib21)\)\. Canonical JSONL preserves complete nested objects, while Parquet views expose the same source record as row\-counted relational tables\. This separation lets one source object seed an agent\-memory store, rendered document, PII\-tagged passage, linkage pair, or perturbation set without discarding provenance\. Appendix Table [8](https://arxiv.org/html/2606.26403#A1.T8) maps each schema group to the release evidence that supports it\.

## 4 Constrained Generation

### 4\.1 Household\-First Linkage

The generator implements the object contract as the cascade summarized in Appendix Figure [3](https://arxiv.org/html/2606.26403#A1.F3)\. It first samples a household plan, turns that plan into member hints, materializes each person under those hints, closes links, constructs events from the finalized snapshot, and then exports audited release files\. This ordering creates the conditions under which marital pins, adult\-child slots, shared surnames, shared addresses, family edges, employer reuse, historical rows, and temporal constraints can be made coherent before release validation\.

Generation begins with a locale\-specific household composition table\. For en\-US, the eight implemented weights are single households at 0\.285, couples without children at 0\.265, couples with represented adult children at 0\.190, single\-parent households with represented adult children at 0\.090, cohabiting households without children at 0\.050, cohabiting households with represented adult children at 0\.030, multigenerational households at 0\.045, and unrelated\-adult households at 0\.045\. The weights sum to one\. The “child” slots in v1\.0 are adult children living with parents, normally age 18–28\.
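As a minimal sketch of this first step, the snippet below draws one household composition from the en-US weights quoted above using a seeded random generator. The dictionary keys and the function name `sample_composition` are illustrative labels chosen here, not identifiers from the package, and the real generator may derive its per-household seed differently.

```python
import random

# Weights copied from the en-US composition table described in the text.
# The keys are hypothetical labels, not the package's own identifiers.
EN_US_COMPOSITION_WEIGHTS = {
    "single": 0.285,
    "couple_no_children": 0.265,
    "couple_with_adult_children": 0.190,
    "single_parent_with_adult_children": 0.090,
    "cohabiting_no_children": 0.050,
    "cohabiting_with_adult_children": 0.030,
    "multigenerational": 0.045,
    "unrelated_adults": 0.045,
}


def sample_composition(seed: int) -> str:
    """Draw one household composition deterministically from a seed."""
    rng = random.Random(seed)  # private generator, no global state touched
    labels = list(EN_US_COMPOSITION_WEIGHTS)
    weights = list(EN_US_COMPOSITION_WEIGHTS.values())
    return rng.choices(labels, weights=weights, k=1)[0]
```

Because the generator is seeded per call, the same seed always yields the same composition, which is the property a deterministic release needs at this stage.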

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/pf_constraint_influence_graph.png)

Figure 2: Constraint influence graph, the full dependency map: which factor constrains which, and whether the effect is a hard gate, a soft prior, or both\.

Figure [2](https://arxiv.org/html/2606.26403#S4.F2) shows the dependency map behind this household\-first design and why it is not equivalent to sampling marital status and then adding dependents\. Composition determines which role slots exist\. Role slots then carry sex hints, age bands, marital pins, surname rules, and shared\-address context into the person factory\. The family builder closes the represented graph: head\-spouse mutual links, cohabiting partner links, parent\-child links from head and spouse to every adult\-child slot, sibling links among adult children, and grandparent\-to\-head links in multigenerational households\. When a married or cohabiting partner is not represented, the partner field is explicitly marked external rather than silently omitted or assigned a dangling profile ID\.

Several small constraints are deliberately visible in the generated object\. Spouse ages are sampled within an 11\-year band around the head with mass near small gaps; adult\-child ages are capped by the youngest represented parent; adult siblings are spaced by at least three current\-age years when possible; co\-residents share a home phone when one is present; and emergency contacts prefer spouse, parent, sibling, then another household adult\. Current employers are assigned from a deterministic locale pool of 250 employers; compatible working adults in a household share an employer with probability 0\.25\. These rules support internal coherence but are modeling heuristics, not claims about every locale’s real household structure\. Appendix Figure [12](https://arxiv.org/html/2606.26403#A1.F12) shows the resulting employer and colleague\-edge structure in the release\.

### 4\.2 Field Materialization and Partial Replay

Within each member slot, fields are generated in dependency order\. Age is sampled first and gates downstream choices\. Education zeroes out levels whose minimum completion age exceeds the sampled age: Bachelor requires age 21, Master 22, and Doctorate 25\. Marital status is sampled by age band and sex, then adjusted by feasibility rules: below the configured modeling floor the state is Single; former\-partner states are impossible under 19; widowhood under 40 is multiplied by 0\.05; cohabiting is multiplied by 0\.40 for ages 35–44 and by 0\.10 for ages 45–54; and cohabiting is zero at 55\+\. These are generator assumptions rather than legal or demographic ground truth\. Appendix Figures [4](https://arxiv.org/html/2606.26403#A1.F4) and [5](https://arxiv.org/html/2606.26403#A1.F5) provide the detailed visual atlas\.
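The hard gates and soft multipliers above can be sketched as follows. The minimum ages and the cohabiting multipliers come from the text; the prior weights passed in, the renormalization step, and the function names are assumptions made for illustration and may not match the package.

```python
# Minimum completion ages for the three gated levels, as stated in the text.
MIN_COMPLETION_AGE = {"Bachelor": 21, "Master": 22, "Doctorate": 25}


def gate_education(prior: dict[str, float], age: int) -> dict[str, float]:
    """Zero out infeasible levels, then renormalize the remaining mass.

    Renormalizing is an assumption of this sketch; ungated levels
    (for example High School) pass through with a minimum age of 0.
    """
    gated = {
        level: (0.0 if age < MIN_COMPLETION_AGE.get(level, 0) else weight)
        for level, weight in prior.items()
    }
    total = sum(gated.values())
    if total == 0:
        raise ValueError("no feasible education level for this age")
    return {level: weight / total for level, weight in gated.items()}


def cohabiting_multiplier(age: int) -> float:
    """Age-dependent soft multiplier on the cohabiting state, per the text."""
    if age >= 55:
        return 0.0   # hard gate: cohabiting is zero at 55+
    if 45 <= age <= 54:
        return 0.10
    if 35 <= age <= 44:
        return 0.40
    return 1.0       # younger ages are left unadjusted in this sketch
```

With a hypothetical prior such as `{"High School": 0.4, "Bachelor": 0.35, "Master": 0.2, "Doctorate": 0.05}`, a 21-year-old keeps only the High School and Bachelor mass, rescaled to sum to one.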

Occupation sampling uses education and age gates over title tiers\. At age 30, a High School profile has most eligible title weight in Entry/Service and Skilled/Technical titles, whereas a Bachelor profile shifts mass toward Professional/Analyst and Management/Executive titles\. Master and Doctorate profiles are hard\-blocked from Entry/Service, while some overqualification paths remain possible at small weights; Appendix Figure [6](https://arxiv.org/html/2606.26403#A1.F6) summarizes which field combinations are blocked, bent, or kept as weighted outliers\. Title tier bounds salary; salary percentile drives finance tier; and credit scores use locale\-specific bureau scales\. The US release uses FICO 300–850 with tier centers 480, 580, 680, 740, and 790, while non\-US locales use explicit scale labels in the schema\. These coupled heuristics can encode socioeconomic assumptions and are documented as limitations\.

After link closure, temporal history is constructed from the finalized snapshot rather than sampled independently\. This avoids stale timeline rows caused by linkages or current fields changing after draft events were sampled\. The event backfill always includes birth, appends the current move when an address exists, appends the current job\-change when employment exists, aligns represented\-spouse marriage dates, emits name\-change and divorce events when the snapshot implies them, and samples prior moves or jobs only when enough lifetime span exists\. Validation checks that the latest covered move and job events agree with current address and employment, that address rows retain source\-event IDs, and that events do not predate DOB\. This is snapshot\-aligned history with partial replay over declared fields, not complete event sourcing of the whole Person Object\. Appendix Figure [7](https://arxiv.org/html/2606.26403#A1.F7) shows a concrete example\.
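Two of the validation checks named above (no event before the date of birth, and the latest move agreeing with the current address) can be sketched against a plain dictionary. The field names `date_of_birth`, `current_address_id`, `events`, `event_id`, `type`, `date`, and `address_id` are hypothetical stand-ins and should be mapped to the released schema before use.

```python
from datetime import date


def check_snapshot_alignment(profile: dict) -> list[str]:
    """Return human-readable violations; an empty list means the checks pass."""
    problems = []
    dob = profile["date_of_birth"]
    events = profile["events"]

    # No event may predate the date of birth.
    for event in events:
        if event["date"] < dob:
            problems.append(f"event {event['event_id']} predates DOB")

    # The latest move event must agree with the current address.
    moves = [e for e in events if e["type"] == "move"]
    if moves:
        latest = max(moves, key=lambda e: e["date"])
        if latest["address_id"] != profile["current_address_id"]:
            problems.append("latest move disagrees with current address")
    return problems


# Tiny hand-built example; all identifiers here are made up.
example = {
    "date_of_birth": date(1990, 5, 1),
    "current_address_id": "addr-2",
    "events": [
        {"event_id": "e1", "type": "birth", "date": date(1990, 5, 1)},
        {"event_id": "e2", "type": "move", "date": date(2012, 3, 1), "address_id": "addr-1"},
        {"event_id": "e3", "type": "move", "date": date(2020, 7, 15), "address_id": "addr-2"},
    ],
}
```

The same pattern extends to job-change events and current employment; it checks replay only for the covered fields, which matches the partial-replay scope the text describes.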

Contact and ID fields are also downstream of earlier commitments\. Work email depends on the final employer name and employer ID; phone formatting depends on locale and address region; personal and work emails use reserved profilefoundry\.example domains following RFC 2606 \(Eastlake and Panitz, [1999](https://arxiv.org/html/2606.26403#bib.bib22)\)\. Social handles are age\-gated, LinkedIn requires age 16, banned\-platform rules are locale\-aware, and platform selection changes with age rather than being flat across the adult range\.
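A downstream consumer may want to confirm that every email it renders stays inside the reserved namespace. The check below is a small sketch that relies only on the fact that RFC 2606 reserves the `.example` top-level domain; the function name is hypothetical, and the check is deliberately looser than full email-syntax validation.

```python
RESERVED_SUFFIX = ".example"  # top-level domain reserved by RFC 2606


def uses_reserved_domain(email: str) -> bool:
    """True when the address has exactly one '@' and a .example domain."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or "@" in local:
        return False  # missing '@', empty local part, or more than one '@'
    return domain.lower().endswith(RESERVED_SUFFIX)
```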

## 5 Release

We distribute ProfileFoundry in two complementary forms: an executable Python package and a fixed 100K reference dataset\. The package supports incremental fixes and task\-specific generation under the same schema; ProfileFoundry\-Synthetic\-Person\-Objects provides a stable, citable population without requiring users to rerun generation\. Hosted access conditions and package metadata are governed by the accompanying artifacts, so the paper does not depend on an unrestricted\-hosting claim\.

### 5\.1 Python Package Release

The package can be installed with "pip install profilefoundry"\. It emits the Person Object schema described above, so a generated profile can be consumed as nested JSON, converted into normalized tables, or used as a seed object for downstream NLP artifacts\. Table [5\.2](https://arxiv.org/html/2606.26403#S5.SS2) summarizes the command guide\.

### 5\.2 100K Reference Release

ProfileFoundry\-Synthetic\-Person\-Objects is the fixed reference artifact for direct use and comparison\. It is intended for users who want a stable population without rerunning the generator, while the package supports task\-specific generation under the same object contract\. Table [5\.2](https://arxiv.org/html/2606.26403#S5.SS2) lists the release contents and reproducibility pins\.

The release is deliberately more than a flat profile table\. The canonical JSONL preserves nested Person Objects, while the Parquet views expose scalar, temporal, relational, household, employer, education, address, social\-handle, and allergy views for downstream analysis\. The 14\-file local bundle contains 709,228 events, 518,564 directed relationships, 167,089 addresses, 111,955 employment rows, 74,738 education rows, 40,338 households, and 52,491 employers\. Appendix Figure [8](https://arxiv.org/html/2606.26403#A1.F8) maps the release inventory and object topology, Appendix Table [5](https://arxiv.org/html/2606.26403#A1.T5) gives the complete row\-counted inventory, and Appendix Figure [9](https://arxiv.org/html/2606.26403#A1.F9) reports profile\-level coverage\.

The reference bundle and associated reports are rebuilt through the repository release workflow, "python scripts/run\_full\_core\.py \-\-generation\-date 2026\-05\-24 \-\-exported\-at 2026\-05\-24 \-\-skip\-hibp"\. The fixed data identity is pinned by its manifest identifier, per\-file hashes, row counts, seed, and dates\. Appendix Figure [18](https://arxiv.org/html/2606.26403#A1.F18) and Table [16](https://arxiv.org/html/2606.26403#A1.T16) record the reproducibility pin and the command\-level verification checklist\.
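A consumer of the fixed release can check per-file hashes before trusting a local copy. The sketch below assumes SHA-256 digests and a plain mapping from relative path to expected hex digest; the paper does not state the hash algorithm or the manifest layout, so both are assumptions, as are the function names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, expected: dict[str, str]) -> list[str]:
    """Return the relative paths whose file is missing or whose hash differs."""
    mismatches = []
    for rel_path, want in expected.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != want:
            mismatches.append(rel_path)
    return mismatches
```

An empty return value means every listed file is present and matches; row counts, seed, and dates would need separate checks against the manifest.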

Table 2: Release contents\.

## 6 Audit

The ProfileFoundry\-Synthetic\-Person\-Objects audit asks whether the reference release is usable as a synthetic person\-object substrate: whether its limited population comparisons are disclosed, its declared references and covered histories resolve, its objects satisfy the implemented rules, its coincidence risks are screened, and its reports remain tied to the artifact\. These forms of evidence answer different questions and are not combined into one quality score\.

### 6\.1 Population Fit and Declared Consistency

For the five full\-validation locales \(US, UK, IN, CA, and AU\), the validator compares generated age\-by\-sex, education, and marital\-status bucket shares with public reference tables\. For each marginal, it reports the largest absolute bucket\-share difference, an $L_{\infty}$ marginal gap rather than a Kolmogorov–Smirnov statistic\. Separately, it checks whether every generated object in those locales satisfies the declared structural and covered replay invariants\. IE, NZ, and PH are included in the release but excluded from the locked marginal\-fit table because their reference coverage is lighter\.

The locked targets were maximum gap $\leq 0.10$ per attribute and mean gap $\leq 0.07$ per locale\. The release does not meet the mean target, and we report the miss directly: locale means range from 0\.074 to 0\.089, with IN largest because its male and female age gaps are both approximately 0\.124\. In contrast, all 90,000 profiles in the five full\-validation locales pass the declared consistency suite\. This pass establishes agreement with implemented constraints; it does not independently validate the realism of those constraints, joint distributions, household topology, or event rates\.
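The marginal-gap statistic is simple enough to state in a few lines. The sketch below computes the largest absolute bucket-share difference for one marginal and averages several such gaps; treating the locale mean as a plain average over per-attribute gaps is an assumption of this sketch, and the function names are hypothetical.

```python
def linf_marginal_gap(generated: dict[str, float], reference: dict[str, float]) -> float:
    """Largest absolute bucket-share difference between two marginals.

    Buckets missing from either side are treated as having share 0.0.
    """
    buckets = set(generated) | set(reference)
    return max(abs(generated.get(b, 0.0) - reference.get(b, 0.0)) for b in buckets)


def mean_gap(gaps: list[float]) -> float:
    """Plain average of per-attribute gaps for one locale (assumed aggregation)."""
    return sum(gaps) / len(gaps)
```

Unlike a Kolmogorov–Smirnov statistic, this compares bucket shares directly rather than cumulative distributions, so it does not depend on any ordering of the buckets.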

### 6\.2 Object, Linkage, and Temporal Closure

The object audit covers age gates, address validity after DOB, phone and locale rules, reserved\-domain contacts, identifier uniqueness, employer foreign keys, relationship endpoints, household closure, whole\-household selection, and partial replay\. Across the full 100K normalized release, relationship source and target endpoints have zero misses; employment rows and current profile employer references have zero missing employer foreign keys; household member counts sum to 100,000; and represented spouse links are mutual in 49,072 of 49,072 cases\. Parent–child, partner, sibling, household\-member, and colleague reciprocal commitments have zero reported reverse\-edge misses\. External spouse references are explicit sentinel cases rather than broken edges\.
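Referential closure and reciprocity can be checked directly on a directed edge list. The sketch below assumes each edge is a dictionary with `source_id`, `target_id`, and `relation` keys, which are hypothetical column names; the reverse-edge check is meant for symmetric relations such as spouse or sibling, since parent and child edges would carry different labels in each direction.

```python
def dangling_endpoints(edges: list[dict], profile_ids: set[str]) -> list[dict]:
    """Edges whose source or target is not a released profile ID."""
    return [
        e for e in edges
        if e["source_id"] not in profile_ids or e["target_id"] not in profile_ids
    ]


def missing_reverse_edges(edges: list[dict], relation: str) -> list[tuple[str, str]]:
    """Directed (source, target) pairs of a symmetric relation lacking the reverse."""
    pairs = {
        (e["source_id"], e["target_id"])
        for e in edges
        if e["relation"] == relation
    }
    return sorted((s, t) for (s, t) in pairs if (t, s) not in pairs)
```

Both functions return empty lists on a closed graph, which corresponds to the zero-miss results reported for the release.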

Temporal checks cover 709,228 typed events\. All 167,089 address rows retain source\-event identifiers, every profile has one current address, and no event predates date of birth\. Latest covered move and job\-change events agree with current address and employment\. These checks establish source linkage and partial replay for covered fields; they do not establish complete event sourcing or realism of real\-world transition rates\.

### 6\.3 Coincidence, Collision, and Drift Screens

We screen coincidence and drift separately from privacy claims: 7 repeated name\+DOB tuples, 1,038 repeated name\+birth\-city tuples, 0 personal\-email self\-collisions, 0 email\-syntax findings, and 342 Wikidata Bloom flags\. The Bloom filter covers 683,897 humans with known birth dates and at least five sitelinks, indexing name\|birth\_year and name\|birth\_city keys at target false\-positive rate $10^{-4}$ \(Vrandečić and Krötzsch, [2014](https://arxiv.org/html/2606.26403#bib.bib29)\)\. Reserved profilefoundry\.example domains make email evidence syntax/uniqueness\-only; a report\-quality verifier checks manuscript counts, figures, validation, leakage, and manifest metadata for drift\. Appendix Figures [13](https://arxiv.org/html/2606.26403#A1.F13), [10](https://arxiv.org/html/2606.26403#A1.F10), and [14](https://arxiv.org/html/2606.26403#A1.F14)–[17](https://arxiv.org/html/2606.26403#A1.F17) and Appendix Tables [10](https://arxiv.org/html/2606.26403#A1.T10), [6](https://arxiv.org/html/2606.26403#A1.T6), [7](https://arxiv.org/html/2606.26403#A1.T7), [11](https://arxiv.org/html/2606.26403#A1.T11)–[15](https://arxiv.org/html/2606.26403#A1.T15), and [12](https://arxiv.org/html/2606.26403#A1.T12) give the supporting ledger, closure, inventory, leakage, and provenance evidence\.
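To give a sense of scale for a Bloom filter at this false-positive target, the sketch below applies the textbook sizing formulas for an optimally configured filter. The paper does not report its filter size, hash count, or key normalization, so the functions and the example key count are illustrative assumptions only.

```python
import math


def bloom_parameters(n_keys: int, fp_rate: float) -> tuple[int, int]:
    """Textbook sizing: bits m = -n ln p / (ln 2)^2, hashes k = (m / n) ln 2."""
    bits = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_keys * math.log(2)))
    return bits, hashes


def screen_keys(name: str, birth_year: int, birth_city: str) -> list[str]:
    """Build the two key shapes named in the text (no normalization applied)."""
    return [f"{name}|{birth_year}", f"{name}|{birth_city}"]
```

At a target rate of $10^{-4}$ these formulas give roughly 19 bits per key and 13 hash functions, regardless of how many keys are indexed.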

## 7 Downstream Use as an NLP Substrate

ProfileFoundry supports downstream datasets by rendering audited fields, links, and events into text or records; intervening through masking, corruption, updates, withholding, or temporal shifts; and evaluating outputs against canonical profile, field, relationship, and event IDs\. This covers document understanding, memory and agent\-state evaluation, privacy and PII rendering, and record linkage while preserving deterministic provenance and controlled near\-misses: shared households, employers, cities, or surnames without identity; stale facts superseded by later events; and graph\-grounded ambiguous pairs\. Appendix Table [9](https://arxiv.org/html/2606.26403#A1.T9) summarizes recommended, caveated, and discouraged uses\.
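The render step can keep provenance by recording which source field produced each span of text. The sketch below is one possible shape for that, assuming a flat profile dictionary with a `profile_id` key and a template made of literal strings and `("field", name)` slots; none of these names come from the package.

```python
def render_with_spans(profile: dict, template: list) -> tuple[str, list[dict]]:
    """Render literal strings and ("field", name) slots, recording spans.

    Each span stores character offsets plus the profile ID and field name
    that produced the text, so outputs can be traced back to the source.
    """
    text = ""
    spans = []
    for part in template:
        if isinstance(part, str):
            text += part  # literal text carries no provenance
            continue
        _, field = part
        value = str(profile[field])
        spans.append({
            "start": len(text),
            "end": len(text) + len(value),
            "profile_id": profile["profile_id"],
            "field": field,
        })
        text += value
    return text, spans
```

Masking or corrupting a span then becomes an edit at known offsets, and an evaluator can score model output against the recorded profile ID and field name rather than against surface strings.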

## 8 Conclusion

ProfileFoundry argues for a different unit of synthetic personal data: not isolated fake fields, fixed personas, or unreleasable real traces, but schema\-governed people whose identities, households, links, histories, exports, and provenance can be inspected together\. This matters for stateful NLP because memory, privacy, document, agent, and linkage evaluations often depend on the same hidden requirement: a coherent source person that can be rendered, perturbed, partially replayed, and audited\.

The v1\.0 release makes that substrate concrete through an executable generator and a 100K reference population with normalized views, manifest hashes, validation reports, closure checks, coincidence screens, and reproducibility commands\. The release is intentionally not presented as a perfect population model, a formal privacy mechanism, or a completed downstream benchmark\. Its contribution is a reusable, accountable baseline for building evaluations in which the synthetic person behind each artifact remains visible\.

## 9 Limitations

ProfileFoundry v1\.0 is English\-only, adult\-only, and limited to eight locales; Faker and project\-specific references constrain field coverage and cultural fidelity\. Binary sex/gender conditioning, surname sharing, household composition, education–occupation mappings, salary/credit heuristics, disability/health categories, and social\-platform rules are simplifying assumptions that may be narrow or stereotyped; Appendix Table [3](https://arxiv.org/html/2606.26403#A1.T3) lists risks and disclosures\. Household “children” are represented adults, so pediatric, school, custody, guardian\-consent, and child\-safety workflows are unsupported; family links are household\-local with external\-partner sentinels, limiting extended kin and non\-household graphs\. Several full\-validation locales miss marginal targets, and the population audit covers univariate age\-by\-sex, education, and marital\-status marginals rather than joint distributions, household composition, age gaps, graph degrees, employer structure, or event rates\. Invariants test implemented rules, not realism; temporal history is snapshot\-backfilled partial replay\. The paper includes no downstream benchmark, human study, or household\-first ablation; finance/health/ID/salary fields are synthetic plausibility attributes, and package/card/report/manifest metadata require synchronization\.

## 10 Ethical Considerations

Synthetic person objects can still be misused for impersonation, fraud rehearsal, spam, credential testing, or misleading demonstrations\. The release uses reserved email domains, excludes minors, records synthetic provenance, and publishes collision and notable\-person coincidence screens\. These safeguards do not provide differential privacy, proof of non\-resemblance, or authorization to use generated contact\-like fields outside controlled research\. Generated phone numbers, government identifiers, addresses, names, and other identity\-like values should not be used to contact, authenticate, evaluate, or make decisions about real people\.

Because demographic and socioeconomic rules may reproduce stereotypes, downstream studies should report which fields and locales they use, preserve synthetic labeling and provenance, audit group\-conditioned outcomes, and avoid treating the resource as demographic ground truth\. Derived text or documents should retain the dataset card, license, and relevant risk disclosures\. ProfileFoundry should not be used to train or validate consequential decision systems about real people\.

## 11 Acknowledgments

We gratefully acknowledge the open\-source and open\-science infrastructure that made ProfileFoundry practical\. Faker provides important localized synthetic\-data providers used at the leaf\-field level, while Pydantic and JSON Schema support the typed object contract and public schema export\. We also thank the maintainers of the scientific Python and columnar\-data ecosystem, especially NumPy, pandas, and PyArrow, as well as Hugging Face tooling and the LaTeX venue templates used to package and disseminate the release\.

## References

- N\. Carlini, C\. Liu, Ú\. Erlingsson, J\. Kos, and D\. Song \(2019\) The secret sharer: evaluating and testing unintended memorization in neural networks\. In 28th USENIX Security Symposium, pp\. 267–284\. External Links: [Link](https://www.usenix.org/conference/usenixsecurity19/presentation/carlini) Cited by: [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Carlini, F\. Tramer, E\. Wallace, M\. Jagielski, A\. Herbert\-Voss, K\. Lee, A\. Roberts, T\. Brown, D\. Song, Ú\. Erlingsson, A\. Oprea, and C\. Raffel \(2021\) Extracting training data from large language models\. In 30th USENIX Security Symposium, pp\. 2633–2650\. External Links: [Link](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting) Cited by: [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Christen \(2008\) Febrl: a freely available record linkage system with a graphical user interface\. In Proceedings of the Second Australasian Workshop on Health Data and Knowledge Management, pp\. 17–25\. Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- DataCebo, Inc\. \(2026\) Synthetic data vault documentation: single\-table, multi\-table, and sequential data\. Note: Documentation accessed 2026\-06\-18\. External Links: [Link](https://docs.sdv.dev/sdv) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2)\.
- D\. Eastlake and A\. Panitz \(1999\) RFC 2606: reserved top level DNS names\. Internet Engineering Task Force\. External Links: [Link](https://www.rfc-editor.org/rfc/rfc2606) Cited by: [§4\.2](https://arxiv.org/html/2606.26403#S4.SS2.p4.1)\.
- Faker Contributors \(2025\) Faker: python package that generates fake data\. External Links: [Link](https://faker.readthedocs.io/) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px5.p1.1)\.
- T\. Ge, J\. Hu, L\. Wang, X\. Wang, S\. Chen, and F\. Wei \(2024\) Scaling synthetic data creation with 1,000,000,000 personas\. External Links: 2406\.20094, [Link](https://arxiv.org/abs/2406.20094) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px3.p1.1)\.
- Gretel AI \(2024\) Synthetic PII finance multilingual\. Note: [https://huggingface\.co/datasets/gretelai/synthetic\_pii\_finance\_multilingual](https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual) Hugging Face dataset\. Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Haddock, A\. Pletcher, N\. Blair\-Stahn, O\. Keyes, M\. Kappel, S\. Bachmeier, et al\. \(2024\) Simulated data for census\-scale entity resolution research without privacy restrictions: a large\-scale dataset generated by individual\-based modeling\. Gates Open Research 8, pp\. 36\. External Links: [Document](https://dx.doi.org/10.12688/gatesopenres.15418.2) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- P\. Jha \(2026\) PIIBench: a unified multi\-source benchmark corpus for personally identifiable information detection\. External Links: 2604\.15776, [Link](https://arxiv.org/abs/2604.15776) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Jiang, Z\. Hao, Y\. M\. Cho, B\. Li, Y\. Yuan, S\. Chen, L\. Ungar, C\. J\. Taylor, and D\. Roth \(2025\) Know me, respond to me: benchmarking LLMs for dynamic user profiling and personalized responses at scale\. In Conference on Language Modeling\. External Links: [Link](https://openreview.net/forum?id=6ox8XZGOqP) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Kim, N\. Mireshghallah, M\. Duan, R\. Xin, S\. S\. Li, J\. Jung, D\. Acuna, Q\. Pang, H\. Xiao, G\. E\. Suh, S\. Oh, Y\. Tsvetkov, P\. W\. Koh, and Y\. Choi \(2026\) Privasis: synthesizing the largest “public” private dataset from scratch\. External Links: 2602\.03183, [Link](https://arxiv.org/abs/2602.03183) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of LLM agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp\. 13851–13870\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.747) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px2.p1.1)\.
- Mimesis Contributors \(2026\) Mimesis schema: foreign\-key references and structured generation\. Note: Documentation accessed 2026\-06\-18\. External Links: [Link](https://mimesis.name/master/schema.html) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px5.p1.1)\.
- B\. Nowok, G\. M\. Raab, and C\. Dibben \(2016\) synthpop: bespoke creation of synthetic data in R\. Journal of Statistical Software 74\(11\), pp\. 1–26\. External Links: [Document](https://dx.doi.org/10.18637/jss.v074.i11) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- NVIDIA Corporation \(2025\) Nemotron\-PII\. Note: [https://huggingface\.co/datasets/nvidia/Nemotron\-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII) Synthetic, persona\-grounded dataset for PII/PHI detection\. Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology\. External Links: [Document](https://dx.doi.org/10.1145/3586183.3606763) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Patki, R\. Wedge, and K\. Veeramachaneni \(2016\) The synthetic data vault\. In 2016 IEEE International Conference on Data Science and Advanced Analytics, pp\. 399–410\. External Links: [Document](https://dx.doi.org/10.1109/DSAA.2016.49) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- pseudopeople Contributors \(2026\) Pseudopeople documentation\. Note: Documentation accessed 2026\-06\-18\. External Links: [Link](https://pseudopeople.readthedocs.io/en/latest/) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- Pydantic Contributors \(2025\) Pydantic: data validation using python type hints\. External Links: [Link](https://docs.pydantic.dev/) Cited by: [§3](https://arxiv.org/html/2606.26403#S3.p3.1)\.
- A\. Salemi, S\. Mysore, M\. Bendersky, and H\. Zamani \(2024\) LaMP: when large language models meet personalization\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics\. External Links: [Link](https://arxiv.org/abs/2304.11406) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Savkin, T\. Ionov, and V\. Konovalov \(2025\) SPY: enhancing privacy with synthetic PII detection dataset\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 4: Student Research Workshop\), pp\. 236–246\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.naacl-srw.23) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Selvam and A\. Ghosh \(2025\) PANORAMA: a synthetic PII\-laced dataset for studying sensitive data memorization in LLMs\. External Links: [Link](https://arxiv.org/abs/2505.12238) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Staab, M\. Vero, M\. Balunovic, and M\. Vechev \(2024\) Beyond memorization: violating privacy via inference with large language models\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=kmn0BhQk7p) Cited by: [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Tan, L\. Yang, Z\. Liu, Z\. Liu, R\. Murthy, T\. M\. Awalgaonkar, J\. Zhang, W\. Yao, M\. Zhu, S\. Kokane, S\. Savarese, H\. Wang, C\. Xiong, and S\. Heinecke \(2025\) PersonaBench: evaluating AI models on understanding personal information through accessing synthetic private user data\. External Links: 2502\.20616, [Link](https://arxiv.org/abs/2502.20616) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Tran, D\. Vatsalan, and P\. Christen \(2013\) GeCo: an online personal data generator and corruptor\. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pp\. 2473–2476\. External Links: [Document](https://dx.doi.org/10.1145/2505515.2508207) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Vrandečić and M\. Krötzsch \(2014\) Wikidata: a free collaborative knowledgebase\. Communications of the ACM 57\(10\), pp\. 78–85\. External Links: [Document](https://dx.doi.org/10.1145/2629489) Cited by: [§6\.3](https://arxiv.org/html/2606.26403#S6.SS3.p1.1)\.
- J\. Walonoski, M\. Kramer, J\. Nichols, A\. Quina, C\. Moesel, D\. Hall, C\. Duffett, K\. Dube, T\. Gallagher, and S\. McLachlan \(2018\) Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record\. Journal of the American Medical Informatics Association 25\(3\), pp\. 230–238\. External Links: [Document](https://dx.doi.org/10.1093/jamia/ocx079) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2410.10813) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§1](https://arxiv.org/html/2606.26403#S1.p1.1), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Yukhymenko, R\. Staab, M\. Vero, and M\. Vechev \(2024\) A synthetic dataset for personal attribute inference\. In Advances in Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=1nqfIQIQBf) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhang, G\. Cormode, C\. M\. Procopiuc, D\. Srivastava, and X\. Xiao \(2017\) PrivBayes: private data release via bayesian networks\. ACM Transactions on Database Systems 42\(4\), pp\. 25:1–25:41\. External Links: [Document](https://dx.doi.org/10.1145/3134428) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Zhang, E\. Dinan, J\. Urbanek, A\. Szlam, D\. Kiela, and J\. Weston \(2018\) Personalizing dialogue agents: I have a dog, do you have pets too?\. In Proceedings of ACL, pp\. 2204–2213\. External Links: [Document](https://dx.doi.org/10.18653/v1/P18-1205) Cited by: [Table 18](https://arxiv.org/html/2606.26403#A1.T18.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Zhang, T\. Wang, N\. Li, J\. Honorio, M\. Backes, S\. He, J\. Chen, and Y\. Zhang \(2021\) PrivSyn: differentially private data synthesis\. In 30th USENIX Security Symposium, pp\. 929–946\. External Links: [Link](https://www.usenix.org/conference/usenixsecurity21/presentation/zhang-zhikun) Cited by: [Table 19](https://arxiv.org/html/2606.26403#A1.T19.2), [§2](https://arxiv.org/html/2606.26403#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix A Supplementary Evidence and Reference Material

This appendix collects the supporting evidence for ProfileFoundry and follows four buckets\. Appendix [A\.1](https://arxiv.org/html/2606.26403#A1.SS1) documents the generation logic: the age\-gated constraint atlas, the education–career–finance signature, the outlier policy, snapshot\-aligned partial replay, and modeling assumptions\. Appendix [A\.2](https://arxiv.org/html/2606.26403#A1.SS2) describes the released artifacts, covering the Python package command surface, the 100K reference set, its inventory and object topology, per\-profile coverage, household and employer graph structure, temporal surface, schema, and use guidance\. Appendix [A\.3](https://arxiv.org/html/2606.26403#A1.SS3) gives the audit evidence: validation against public marginals, reference\-data provenance, referential and temporal closure, leakage and collision screening, invariant families, claim\-to\-evidence mapping, and reproducibility\. Appendix [A\.4](https://arxiv.org/html/2606.26403#A1.SS4) provides the resource\-by\-resource related\-work comparison\. Counts are descriptive of the fixed reference release unless a caption states otherwise\. Internal consistency, distributional fit, coincidence screening, formal privacy, and downstream utility remain separate forms of evidence\.

### A\.1 Generation Logic

These figures expand the constrained\-generation mechanics summarized in the main text\. Age is the master gate that conditions education, marital state, occupation, and finance; the generator then preserves plausible rare combinations rather than collapsing every profile toward the modal path, and finally reconstructs history backward from the finalized snapshot\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/pf_fig_b_constraint_cascade.png)

Figure 3: Constrained cascade generation\. ProfileFoundry carries constraints forward from reference tables and household plans into person fields, represented\-link closure, snapshot\-aligned temporal backfill, and export\-time evidence checks\. Household\-first generation is an engineering design choice; without an ablation, this paper does not claim causal superiority over every alternative\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_d_age_atlas.png)

Figure 4: Age\-gated constraint atlas for the en\-US generator rules\. Age is the master gate: each panel is a conditional distribution computed after age gating and renormalization, with cells shown as implemented probabilities multiplied by 100 and dots marking structural zeros\. Higher degrees and former\-partner states only become reachable at the configured minimum ages\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_e_edu_career_finance.png)

Figure 5: Education to career tier to finance signature\. Education indices gate which career tiers a title can be drawn from \(ribbons, weighted at age 30\); the chosen tier then bounds salary, which in turn drives the finance tier and credit\-score scale\. Rare cross\-tier crossings are retained as weighted outliers rather than removed\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_g_outlier_policy.png)

Figure 6: Outlier policy: what is blocked, bent, or common\. ProfileFoundry separates the impossible from the merely rare: hard gates remove contradictions outright, weighted bends keep low\-probability but plausible cases \(for example, an 18\-year\-old graduate\), and ordinary mass covers common combinations\.

#### Generator mechanics\.

The generator deliberately preserves some rare combinations after feasibility checks\. These weighted outliers are different from contradictions: a doctorate holder may still appear outside high\-professional work, a high\-income person may still have a lower credit tier, and household members may have different employers unless a shared\-employer draw fires\. The validator blocks impossible or unsupported combinations, but it does not collapse every profile toward the modal path\.
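The distinction between hard gates and weighted outliers can be sketched in a few lines of Python\. This is a minimal illustration only: the weight table, the minimum ages, the `YOUNG_GRADUATE_FACTOR` bend, and the function names are all hypothetical and do not reproduce the generator's actual rules\.

```python
import random

# Hypothetical base weights and minimum ages; not the generator's real tables.
EDUCATION_WEIGHTS = {"secondary": 0.50, "bachelor": 0.35, "doctorate": 0.15}
MIN_AGE = {"secondary": 18, "bachelor": 18, "doctorate": 24}
# Rare-but-plausible cases keep a small weight instead of being removed.
YOUNG_GRADUATE_FACTOR = 0.02


def gated_distribution(age: int) -> dict:
    """Apply hard age gates, bend rare cases, then renormalize."""
    weights = {}
    for level, base in EDUCATION_WEIGHTS.items():
        if age < MIN_AGE[level]:
            weights[level] = 0.0  # hard gate: structural zero
        elif level == "bachelor" and age < 21:
            weights[level] = base * YOUNG_GRADUATE_FACTOR  # weighted bend
        else:
            weights[level] = base  # ordinary mass
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no education level is reachable at age {age}")
    return {level: w / total for level, w in weights.items()}


def sample_education(age: int, rng: random.Random) -> str:
    """Draw one education level from the gated, renormalized distribution."""
    dist = gated_distribution(age)
    levels = list(dist)
    return rng.choices(levels, weights=[dist[lv] for lv in levels], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(gated_distribution(18))
    print([sample_education(18, rng) for _ in range(5)])
```

In this sketch an 18\-year\-old can never receive a doctorate, because that cell is a structural zero, but can occasionally receive a bachelor's degree, because that cell keeps a small positive weight after renormalization\.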

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_f_temporal_replay.png)

Figure 7: Snapshot\-aligned temporal backfill\. Histories are reconstructed backward from the finalized snapshot rather than sampled independently; the current address and job are always appended so present state stays consistent with its past\. The validator confirms that the latest covered move and job\-change events agree with current address and employment; this is partial replay over declared fields, not complete event sourcing\.
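A minimal Python sketch of the backward reconstruction idea follows\. It assumes a simplified move\-only history; the event dictionary layout, the gap range between moves, and the names `backfill_moves` and `check_replay` are hypothetical and are not the package's API\.

```python
import random
from datetime import date, timedelta


def backfill_moves(current_city, dob, snapshot, past_cities, rng):
    """Build move events backward from the snapshot.

    `past_cities` lists earlier residences, most recent first. The last
    event in the returned chronological list always lands on `current_city`.
    """
    events = []
    cursor = snapshot
    destination = current_city
    for origin in past_cities:
        # Step backward by an assumed gap of one to five years.
        cursor = cursor - timedelta(days=rng.randint(365, 5 * 365))
        if cursor <= dob:
            break  # never emit an event that predates date of birth
        events.append({"date": cursor, "from": origin, "to": destination})
        destination = origin
    events.reverse()  # chronological order, oldest first
    return events


def check_replay(events, current_city, dob):
    """Validator: no event before DOB, and the latest move matches the snapshot."""
    if not events:
        return True
    return all(e["date"] > dob for e in events) and events[-1]["to"] == current_city


if __name__ == "__main__":
    rng = random.Random(7)
    history = backfill_moves("A", date(1980, 1, 1), date(2026, 1, 1), ["B", "C"], rng)
    for event in history:
        print(event)
    print(check_replay(history, "A", date(1980, 1, 1)))
```

Working backward from the snapshot makes agreement between the latest event and the current state true by construction, so the validator acts as a regression check rather than a repair step\.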
#### Assumption ledger\.

Table [3](https://arxiv.org/html/2606.26403#A1.T3) records the high\-level modeling assumptions that downstream users should disclose when using generator outputs\.

Table 3: High\-level modeling assumptions and disclosure guidance\. These are heuristics or priors, not verified descriptions of every locale\.

### A\.2 Released Artifacts: Package and 100K Reference Set

The release ships as both an executable Python package and a fixed 100K reference bundle\. Table [4](https://arxiv.org/html/2606.26403#A1.T4) lists the package command surface; the remaining figures and tables describe the reference set, which is an object graph rather than a flat profile table: one canonical Person Object fans out into normalized address, employment, education, event, household, employer, and relationship views\.

Table 4: Package command surface\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_i_release_topology.png)

Figure 8: Release inventory and object topology\. Raw row counts orient the reader on a log scale spanning 36\.8K–709K rows across the twelve normalized views, while the rows\-per\-profile multipliers and hub edges show that the release is an object graph: one canonical profile expands into addresses, employment, education, typed events, and relationship edges\. This panel consolidates what were previously separate inventory and overview figures\.

Table 5: Row\-counted release views\. MANIFEST\.json records file hashes and row counts; the dataset card defines corresponding viewer configs\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_j_profile_coverage.png)

Figure 9: Multi\-surface coverage and per\-profile density\. Every profile carries a current address and event history; 95\.9% sit on a relationship edge, and many simultaneously carry employment, education, credit, social, and allergy context\. Optional sparsity is explicit, and 25,392 profiles jointly include all six analytical surfaces\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_k_household_graph.png)

Figure 10: Households resolved into a directed relationship graph\. Three quarters of the 40,338 households hold multiple represented adults, and household\-membership, family, partner, and colleague relations resolve into 518,564 directed edges\. Endpoint and reciprocal\-link checks confirm the graph closes; the composition and edge mix are tabulated in Table [6](https://arxiv.org/html/2606.26403#A1.T6)\.

Table 6: Household compositions and relationship\-edge mix\. Of 40,338 households, 30,312 are multi\-person and 15,044 contain at least three represented adults\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_l_temporal_surface.png)

Figure 11: Temporal release surface\. Typed events project into address and employment histories: every address row carries source\-event identifiers, every profile has exactly one current address, and no event predates date of birth\. The typed\-event composition and selected age summaries are tabulated in Table [7](https://arxiv.org/html/2606.26403#A1.T7)\.

Table 7: Typed\-event composition and selected temporal sanity summaries\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_m_employer_network.png)

Figure 12: Employer context exported as resolvable entities rather than free\-text names\. Current and historical employment resolves to 52,491 employer IDs and produces 295,982 directed colleague edges with zero missing foreign keys, so shared\-employer co\-membership can be queried as graph structure rather than inferred from matching strings\.

Table 8: Schema coverage and corresponding evidence\.

#### Release use guidance\.

Table [9](https://arxiv.org/html/2606.26403#A1.T9) summarizes recommended, caveated, and discouraged uses of the released package and reference population\.

Table 9: Use guidance for downstream users\.

### A\.3 Audit

Audit evidence is reported as several distinct forms rather than a single pass/fail summary\. Each generation stage carries its own validator family into the release audit \(Figure [14](https://arxiv.org/html/2606.26403#A1.F14)\): distributional fit against public marginals, referential and temporal closure, leakage and collision screening, structural invariants, and reproducibility are stated separately\.

#### Evidence overview\.

Figure [13](https://arxiv.org/html/2606.26403#A1.F13) and Table [10](https://arxiv.org/html/2606.26403#A1.T10) map each headline claim about the release to the concrete public artifact that backs it and to the metric or check that verifies it\. The figure is a visual ledger of this correspondence; the table states the same mapping in full\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_n_claim_evidence_ledger.png)

Figure 13: Claim\-to\-evidence ledger\. Each headline capability of the release \(structured, executable, consistent, linked, temporal, leakage\-audited, reproducible, and documentation\-checked\) maps to a concrete public artifact and a verifying metric or check\. This visual ledger summarizes the same correspondence detailed textually in Table [10](https://arxiv.org/html/2606.26403#A1.T10)\.

Table 10: Artifact evidence summary\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_h_audit_attachment.png)

Figure 14: Audit attachment map\. Every generation stage carries its own validator family into the release audit, so validation, replay, referential\-integrity, manifest, and leakage checks attach to the part of the release object they verify rather than to a single flat table\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_o_validation.png)

Figure 15: Validation target audit: honest misses alongside invariant pass\. Distributional fit is disclosed separately from invariant consistency: all 90,000 full\-validation profiles pass the declared\-consistency checks, while marginal gaps that exceed the per\-attribute target are published rather than tuned away\. Per\-locale detail is given in Table [11](https://arxiv.org/html/2606.26403#A1.T11)\.

Table 11: Validation interpretation by locale\. The declared consistency and invariant pass rate is 100% for all five full\-validation locales\.

The full\-validation report uses public age\-by\-sex, education, and marital\-status reference tables for US, UK, IN, CA, and AU\. The metric is a maximum absolute bucket\-share discrepancy \($L_\infty$ marginal gap\), not a KS statistic\. The report does not tune the generator repeatedly against the target because that would overfit the release to its own audit\. IE, NZ, and PH are included in the release but excluded from the locked marginal\-fit and consistency table in v1\.0\.
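For concreteness, the maximum absolute bucket\-share discrepancy can be computed as in the following Python sketch\. The bucket names and numbers are made up for illustration, and `linf_marginal_gap` is a hypothetical helper rather than a function from the released package\.

```python
def linf_marginal_gap(generated_counts: dict, target_shares: dict) -> float:
    """Maximum absolute difference between generated and target bucket shares.

    `generated_counts` maps bucket -> count in the synthetic population;
    `target_shares` maps bucket -> reference share (summing to about 1).
    Buckets missing on either side are treated as zero.
    """
    total = sum(generated_counts.values())
    if total == 0:
        raise ValueError("generated_counts is empty")
    buckets = set(generated_counts) | set(target_shares)
    return max(
        abs(generated_counts.get(b, 0) / total - target_shares.get(b, 0.0))
        for b in buckets
    )


if __name__ == "__main__":
    # Made-up marital-status example, not values from the validation report.
    counts = {"single": 30, "married": 50, "divorced": 20}
    target = {"single": 0.33, "married": 0.48, "divorced": 0.19}
    print(round(linf_marginal_gap(counts, target), 4))
```

Unlike a KS statistic, this measure compares per\-bucket shares directly and does not depend on any ordering of the buckets, which suits unordered categories such as marital status\.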

Table 12: Reference\-data provenance summary\. The generator can fall back to committed bootstrap marginals; richer derived tables are regenerated from source APIs when available\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_p_closure.png)

Figure 16: Release\-wide referential and temporal closure\. Relationship endpoints resolve to released profiles, employer references resolve to employer rows, and represented spouse and parent–child links close reciprocally; exact checks are stated individually rather than summarized as “no errors\.” The full check inventory is given in Table [13](https://arxiv.org/html/2606.26403#A1.T13)\.

| Graph or entity check | Result | Temporal/source check | Result |
| --- | --- | --- | --- |
| Missing relationship source endpoints | 0 | Address rows with source event IDs | 167,089 / 167,089 |
| Missing relationship target endpoints | 0 | Current addresses | 100,000 / 100,000 |
| Missing employer foreign keys | 0 | Events dated before DOB | 0 |
| Represented spouse links mutual | 49,072 / 49,072 | Current address rows | 100,000 |
| Parent–child reciprocal misses | 0 | Historical address rows | 67,089 |
| Sibling reverse\-edge misses | 0 | Current employment rows | 61,428 |
| Partner reverse\-edge misses | 0 | Historical employment rows | 50,527 |
| Colleague reverse\-edge misses | 0 | Total typed events | 709,228 |
| Household member sum | 100,000 / 100,000 | Events/profile, mean/median/90th/max | 7\.09 / 7 / 11 / 17 |

Table 13: Referential and temporal closure checks over the fixed release\. External spouse references \(17,844\) are explicit sentinel cases rather than missing endpoints\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_q_leakage.png)

Figure 17: Collision and coincidence screening\. The release publishes exact within\-release collision checks, a Wikidata Bloom notable\-person screen, and reserved\-domain email syntax and uniqueness evidence\. These are separate screens over common names and places and do not constitute a formal privacy guarantee; denominators and interpretation are given in Table [14](https://arxiv.org/html/2606.26403#A1.T14)\.

Table 14: Interpretation of the release leakage and collision audits\.
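The graph\-side checks in Table 13 are simple set operations over edge lists\. The Python sketch below shows one possible shape for them, assuming edges are dictionaries with hypothetical `source`, `target`, and `type` keys and that symmetric relation types reuse the same label in both directions\. It does not model asymmetric parent–child pairs or external\-spouse sentinels, and `closure_report` is an illustrative name, not part of the release tooling\.

```python
def closure_report(
    profile_ids,
    edges,
    reciprocal_types=frozenset({"spouse", "sibling", "partner", "colleague"}),
):
    """Count dangling endpoints and missing reverse edges in a directed edge list."""
    ids = set(profile_ids)
    edge_set = {(e["source"], e["target"], e["type"]) for e in edges}
    return {
        # Endpoints that do not resolve to a released profile.
        "missing_source": sum(e["source"] not in ids for e in edges),
        "missing_target": sum(e["target"] not in ids for e in edges),
        # Symmetric relations whose reverse edge is absent.
        "reverse_misses": sum(
            1
            for (s, t, kind) in edge_set
            if kind in reciprocal_types and (t, s, kind) not in edge_set
        ),
    }


if __name__ == "__main__":
    ids = ["a", "b", "c"]
    edges = [
        {"source": "a", "target": "b", "type": "spouse"},
        {"source": "b", "target": "a", "type": "spouse"},
        {"source": "a", "target": "c", "type": "colleague"},  # reverse edge missing
        {"source": "c", "target": "z", "type": "household"},  # dangling target
    ]
    print(closure_report(ids, edges))
```

Reporting each count separately, rather than a single pass/fail flag, mirrors the table's practice of stating exact checks individually\.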
#### Leakage methodology\.

The Wikidata filter is constructed over humans with a known birth date and at least five sitelinks\. The released filter covers birth years 1850–2015, contains 683,897 records, and indexes 667,179 name\-year keys plus 631,907 name\-city keys\. Bloom filters introduce false positives; the configured target false\-positive rate is $10^{-4}$\. Because the filter is a screening tool over common names and places, its output is best interpreted as a conservative coincidence rate\.
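To make the screening mechanism concrete, the following Python sketch implements a small Bloom filter sized from an expected item count and a target false\-positive rate using the standard formulas $m = -n \ln p / (\ln 2)^2$ and $k = (m/n) \ln 2$\. The SHA\-256 double\-hashing scheme, the lowercase key normalization, and the names `BloomFilter` and `screen_keys` are assumptions made for illustration; the paper does not describe the released filter's internals at this level\.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized for a target false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(1, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing from one SHA-256 digest (an illustrative choice).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def screen_keys(name: str, birth_year: int, birth_city: str) -> list:
    """Build name|birth_year and name|birth_city keys (normalization is assumed)."""
    norm = name.strip().lower()
    return [f"{norm}|{birth_year}", f"{norm}|{birth_city.strip().lower()}"]


if __name__ == "__main__":
    notable = BloomFilter(expected_items=1000, fp_rate=1e-4)
    for key in screen_keys("Ada Lovelace", 1815, "London"):
        notable.add(key)
    # A synthetic profile is flagged if either of its keys hits the filter.
    print(any(k in notable for k in screen_keys("ada lovelace", 1815, "Paris")))
```

A Bloom filter never misses a key that was inserted, so every flag is either a genuine key coincidence or a false positive\. That one\-sided error is why the flag count reads as a conservative upper screen rather than an exact match count\.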

The HIBP prototype audit is intentionally not a release metric\. The public\-email\-domain prototype demonstrated that realistic first\.last@provider patterns can overlap breached\-account corpora even when no real identity was copied\. The current release changes the design by using reserved \*\.profilefoundry\.example domains, so the correct release audit is syntax and uniqueness rather than breached\-account lookup\.

Table 15: Invariant families used to support the declared\-suite consistency claim\.

![Refer to caption](https://arxiv.org/html/2606.26403v1/figures/app_r_reproducibility.png)

Figure 18: Reproducibility pin\. The release records the global seed, generation date, export timestamp, reference\-data hash, row counts, and per\-file SHA\-256 hashes; a release verifier compares the local and published bundles so that any drift is caught\. The corresponding command\-level checklist is given in Table [16](https://arxiv.org/html/2606.26403#A1.T16)\.

Table 16: Reproducibility checklist\. The fixed data release is keyed by seed, generation date, exported timestamp, manifest identifier, and per\-file hashes; external version labels must be reconciled before archival submission\.
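A per\-file hash comparison of the kind described in Figure 18 can be sketched with the Python standard library alone\. The manifest layout assumed here \(a `files` mapping from file name to a `sha256` entry\) is hypothetical and may not match the actual MANIFEST\.json; `sha256_of` and `find_drift` are illustrative names, not the release verifier's commands\.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large exports need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(bundle_dir, manifest: dict) -> list:
    """Compare local files against manifest hashes; return (name, reason) pairs.

    Assumes a hypothetical layout: {"files": {"<name>": {"sha256": "<hex>"}}}.
    """
    drift = []
    for name, expected in manifest["files"].items():
        path = Path(bundle_dir) / name
        if not path.exists():
            drift.append((name, "missing"))
        elif sha256_of(path) != expected["sha256"]:
            drift.append((name, "hash mismatch"))
    return drift


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "profiles.csv"
        sample.write_bytes(b"profile_id\np-000001\n")
        manifest = {"files": {"profiles.csv": {"sha256": sha256_of(sample)}}}
        print(find_drift(tmp, manifest))  # no drift expected
        sample.write_bytes(b"profile_id\np-000002\n")
        print(find_drift(tmp, manifest))  # content changed after pinning
```

A check like this detects byte\-level drift only\. It says nothing about whether version labels in the package, dataset card, and report agree, which the checklist treats as a separate reconciliation step\.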

### A\.4 Comparison with Adjacent Resources

#### Comparison rubric\.

The main\-text comparison is intentionally descriptive\. The rubric below distinguishes public artifacts from generation\-time internals, persistent identifiers from ordinary co\-occurrence, explicit state change from timestamps, localized fields from multilingual text, and formal privacy from quality or provenance checks\. The resource rows are kept individual rather than grouped into heterogeneous families\.

Table 17: Closest adjacent resources, described by what users can inspect in the released artifact\. Generation\-only structures are not counted as released source objects\. Internal consistency, statistical fidelity, human quality evaluation, leakage screening, provenance, and differential privacy are not collapsed into one “audit” mark\.

Table 18: Text, privacy, memory, persona, and behavior resources\. Sources: \(Kim et al\., [2026](https://arxiv.org/html/2606.26403#bib.bib12); Selvam and Ghosh, [2025](https://arxiv.org/html/2606.26403#bib.bib25); Yukhymenko et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib32); Tan et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib27); Savkin et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib24); NVIDIA Corporation, [2025](https://arxiv.org/html/2606.26403#bib.bib17); Gretel AI, [2024](https://arxiv.org/html/2606.26403#bib.bib7); Jha, [2026](https://arxiv.org/html/2606.26403#bib.bib10); Salemi et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib23); Maharana et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib13); Wu et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2606.26403#bib.bib11); Zhang et al\., [2018](https://arxiv.org/html/2606.26403#bib.bib34); Ge et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib5); Park et al\., [2023](https://arxiv.org/html/2606.26403#bib.bib18)\)\.

Table 19: Population, linkage, tabular, domain\-simulation, and fake\-data resources\. Sources: \(Haddock et al\., [2024](https://arxiv.org/html/2606.26403#bib.bib8); pseudopeople Contributors, [2026](https://arxiv.org/html/2606.26403#bib.bib38); Tran et al\., [2013](https://arxiv.org/html/2606.26403#bib.bib28); Christen, [2008](https://arxiv.org/html/2606.26403#bib.bib3); Nowok et al\., [2016](https://arxiv.org/html/2606.26403#bib.bib16); Patki et al\., [2016](https://arxiv.org/html/2606.26403#bib.bib19); DataCebo, Inc\., [2026](https://arxiv.org/html/2606.26403#bib.bib37); Zhang et al\., [2017](https://arxiv.org/html/2606.26403#bib.bib33); [2021](https://arxiv.org/html/2606.26403#bib.bib35); Walonoski et al\., [2018](https://arxiv.org/html/2606.26403#bib.bib30); Faker Contributors, [2025](https://arxiv.org/html/2606.26403#bib.bib4); Mimesis Contributors, [2026](https://arxiv.org/html/2606.26403#bib.bib36)\)\.
