PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
Summary
PIIBench is a unified multi-source benchmark corpus for detecting personally identifiable information (PII) across diverse data sources. It addresses the need for standardized evaluation of PII detection, which is critical for privacy-preserving NLP applications.
Source: https://arxiv.org/abs/2604.15776
Similar Articles
IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
IndustryBench-MIPU is a large-scale benchmark for multi-image industrial product understanding, evaluating 9 MLLMs and revealing a completeness gap where precision is high but attribute recovery is low.
MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval
MMed-Bench-IR is a heterogeneous benchmark for multilingual medical information retrieval across six languages, evaluating cross-lingual alignment, concept discrimination, and evidence retrieval. It reveals severe performance drops for non-English queries, highlighting gaps in existing English-only evaluations.
A Pāninian Foundation for Indic Language Processing
This paper proposes a benchmark suite grounded in Pāṇinian grammar to unify Indic language processing across languages, aiming to improve accuracy, data efficiency, and transferability.
UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
UsefulBench introduces a domain-specific benchmark dataset that distinguishes between document relevance and usefulness for information retrieval, showing that similarity-based IR systems conflate these concepts while LLMs can address this but lack domain expertise.
Meddies PII: An Open Multilingual De-identification Model for Clinical Text
Meddies PII is an open multilingual model and dataset for clinical text de-identification, designed to remove patient identifiers while preserving clinical facts. It uses synthetic data generated with dynamic prompting to handle diverse real-world formats.