PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

arXiv cs.CL 04/20/26, 04:00 AM Papers

pii-detection benchmark-dataset nlp privacy multi-source information-extraction

Summary

PIIBench consolidates ten publicly available datasets into a unified benchmark corpus for detecting personally identifiable information (PII) in natural language text. It addresses the lack of standardized evaluation for PII detection, where existing corpora use mutually incompatible annotation schemes, a gap that matters for privacy-preserving NLP applications.

arXiv:2604.15776v1. Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial-domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits that preserve the source distribution. To establish baseline difficulty, we evaluate eight published systems spanning a rule-based engine (Microsoft Presidio), general-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single-source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.
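
To make the pipeline steps in the abstract more concrete, here is a minimal Python sketch of label normalization to a canonical BIO scheme, frequency-based suppression of rare entity types, and span-level F1. It is not the authors' code. The `LABEL_MAP` entries, the `min_mentions` threshold, and all function names are hypothetical, and the metric shown is a micro-averaged exact-match span F1, which is an assumption about how the paper scores spans.

```python
from collections import Counter

# Hypothetical label map: these source labels and canonical names are
# illustrative assumptions, not PIIBench's actual mapping.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAILADDRESS": "EMAIL",
    "E-MAIL": "EMAIL",
    "FAX": "FAX_NUMBER",
}


def normalize_tag(tag):
    """Map one source BIO tag (e.g. 'B-PER') to its canonical form."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")  # split at the first hyphen only
    canonical = LABEL_MAP.get(label.upper())
    # Labels with no canonical counterpart fall back to 'O' in this sketch.
    return f"{prefix}-{canonical}" if canonical else "O"


def suppress_rare(sequences, min_mentions):
    """Relabel entity types with fewer than min_mentions mentions as 'O'."""
    counts = Counter(t[2:] for seq in sequences for t in seq if t.startswith("B-"))
    keep = {label for label, n in counts.items() if n >= min_mentions}
    return [[t if t == "O" or t[2:] in keep else "O" for t in seq]
            for seq in sequences]


def extract_spans(tags):
    """Return the set of (start, end_exclusive, label) spans in a BIO sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans  # stray I- tags without a preceding B- are ignored


def span_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over spans that match exactly in boundaries and label."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
source = [["B-PER", "I-PER", "O", "B-EMAILADDRESS"],
          ["B-FAX", "O", "B-PERSON_NAME"]]
gold = suppress_rare([[normalize_tag(t) for t in seq] for seq in source],
                     min_mentions=2)
pred = [["B-PERSON", "O", "O", "O"], ["O", "O", "B-PERSON"]]
print(gold)
print(span_f1(gold, pred))
```

In this toy run only `PERSON` survives suppression, and the first prediction misses the span boundary, so it counts as both a false positive and a false negative. Exact-match scoring of this kind is strict, which is one reason span-level F1 can be low when a system's label inventory or span boundaries differ from the benchmark's.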

Cached at: 04/20/26, 08:29 AM

Source: https://arxiv.org/abs/2604.15776

Similar Articles

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

A Pāninian Foundation for Indic Language Processing

UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

Meddies PII: An Open Multilingual De-identification Model for Clinical Text
