Meddies PII: An Open Multilingual De-identification Model for Clinical Text

Reddit r/LocalLLaMA 06/08/26, 11:08 AM Models

clinical de-identification multilingual open-source healthcare privacy synthetic-data

Summary

Meddies PII is an open multilingual model and dataset for clinical text de-identification, designed to remove patient identifiers while preserving clinical facts. It uses synthetic data generated with dynamic prompting to handle diverse real-world formats.

A clinical AI model does not need to know who the patient is to reason clinically. It needs the symptoms, medications, lab results, diagnosis history, and treatment course. The problem is that in real medical records, those facts usually sit next to identifiers: names, record IDs, insurance numbers, addresses, phone numbers, admission dates, department names.

So clinical de-identification has a double contract:

1. Do not let patient identifiers leak.
2. Do not destroy the clinical facts that still need to be used.

That second part is easy to underestimate. If a model misses a date of birth, the privacy boundary fails. If it removes "creatinine 86 µmol/L" or "metformin 500 mg," the downstream clinical record loses meaning. Both are failures, but they have different consequences.

We built Meddies PII for this problem. It is an open research model and dataset for multilingual clinical de-identification. The dataset is synthetic and built with dynamic prompting, varying language, document type, document label, note length, text format, edge case, and identifier family across generations.

The goal is not one pretty template. The goal is stable extraction behavior across the messy surfaces hospital data actually appears in: rushed notes, nursing forms, JSON/XML exports, multilingual text, administrative records, and chat-style prompts.

Meddies PII is not a complete de-identification product. Hospitals still need policy, audit logs, local validation, human escalation paths, and deployment controls. But we think this is a useful starting point: open enough to inspect, careful enough to discuss honestly, and built from the reality that clinical AI needs more than benchmark performance to be deployable.
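To make the double contract concrete, here is a minimal sketch of how both failure modes could be checked on a single note. It is not the Meddies PII evaluation code: the function name `check_double_contract`, the `ContractReport` type, the placeholder labels such as `[NAME]`, and the sample note are all hypothetical, and plain substring matching stands in for whatever span-level scoring a real evaluation would use.

```python
from dataclasses import dataclass, field


@dataclass
class ContractReport:
    """Outcome of checking one redacted note against both halves of the contract."""
    leaked_identifiers: list[str] = field(default_factory=list)
    lost_clinical_facts: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # The note passes only if nothing leaked AND nothing clinical was destroyed.
        return not self.leaked_identifiers and not self.lost_clinical_facts


def check_double_contract(
    redacted: str,
    identifiers: list[str],
    clinical_facts: list[str],
) -> ContractReport:
    """Hypothetical helper: naive, case-insensitive substring check.

    identifiers    -- strings that must NOT appear in the redacted text
    clinical_facts -- strings that MUST still appear in the redacted text
    """
    lowered = redacted.casefold()
    leaked = [s for s in identifiers if s.casefold() in lowered]
    lost = [s for s in clinical_facts if s.casefold() not in lowered]
    return ContractReport(leaked_identifiers=leaked, lost_clinical_facts=lost)


# Invented sample data; the bracketed labels are illustrative placeholders only.
note_identifiers = ["Jane Doe", "MRN 0042817"]
note_facts = ["creatinine 86 µmol/L", "metformin 500 mg"]

good = "[NAME], [RECORD_ID]: creatinine 86 µmol/L, continue metformin 500 mg."
over_redacted = "[NAME], [RECORD_ID]: creatinine [ID], continue metformin 500 mg."
leaky = "Jane Doe, [RECORD_ID]: creatinine 86 µmol/L, continue metformin 500 mg."

for label, text in [("good", good), ("over_redacted", over_redacted), ("leaky", leaky)]:
    report = check_double_contract(text, note_identifiers, note_facts)
    print(label, report.ok, report.leaked_identifiers, report.lost_clinical_facts)
```

Reporting the two lists separately, rather than folding them into one score, mirrors the point above that the two failures have different consequences: a leak is a privacy failure, while a lost lab value or dose is a loss of clinical utility.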
Full post: [https://meddies.ai/research/meddies-pii](https://meddies.ai/research/meddies-pii)

Demo: [https://huggingface.co/spaces/Meddies/meddies-pii-extractor](https://huggingface.co/spaces/Meddies/meddies-pii-extractor)

Model: [https://huggingface.co/Meddies/meddies-pii](https://huggingface.co/Meddies/meddies-pii)

Dataset: [https://huggingface.co/datasets/Meddies/meddies-pii](https://huggingface.co/datasets/Meddies/meddies-pii)


Similar Articles

maziyarpanahi/openmed

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

MEDSYN: Benchmarking Multi-Evidence Synthesis in Complex Clinical Cases for Multimodal Large Language Models

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
