Meddies PII: An Open Multilingual De-identification Model for Clinical Text
Summary
Meddies PII is an open multilingual model and dataset for clinical text de-identification, designed to remove patient identifiers while preserving clinical facts. It relies on synthetic data generated with dynamic prompting so that the model can handle diverse real-world formats.
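To make "remove patient identifiers while preserving clinical facts" concrete, here is a minimal sketch of the redaction step that typically follows identifier detection. It is not Meddies PII's API: the `redact` helper, the `Span` tuple layout, and the `NAME`/`DATE` labels are hypothetical, and the spans are hand-written stand-ins for what a de-identification model would predict. The sketch also assumes the spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each identifier span with a [LABEL] placeholder."""
    out = text
    # Apply replacements right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


note = "Patient John Smith, seen 2024-03-01, has type 2 diabetes."
name, date = "John Smith", "2024-03-01"

# Stand-in for model output: in practice these offsets would come
# from a de-identification model, not from string search.
spans = [
    (note.index(name), note.index(name) + len(name), "NAME"),
    (note.index(date), note.index(date) + len(date), "DATE"),
]

redacted = redact(note, spans)
print(redacted)  # Patient [NAME], seen [DATE], has type 2 diabetes.
```

The identifiers are replaced while the clinical fact ("type 2 diabetes") is left untouched, which is the property the summary describes.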
Similar Articles
maziyarpanahi/openmed
OpenMed is an open-source local-first healthcare AI toolkit that provides entity extraction, PII de-identification, and over 1,000 specialized medical models, all running on-device with no cloud dependency.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
IndicMedDialog is a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages, with a fine-tuned model for personalized symptom elicitation. The dataset is derived from MDDial, enhanced with LLM-generated synthetic consultations and expert verification, supporting multilingual healthcare AI.
PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
PIIBench presents a unified multi-source benchmark corpus for detecting personally identifiable information (PII) across diverse data sources. This resource addresses the need for standardized evaluation in PII detection tasks, which is critical for privacy-preserving NLP applications.
MEDSYN: Benchmarking Multi-Evidence Synthesis in Complex Clinical Cases for Multimodal Large Language Models
MEDSYN is a multilingual multimodal benchmark for evaluating MLLMs on complex clinical cases with up to 7 distinct visual evidence types per case. The study reveals that while frontier models match human experts on differential diagnosis generation, all MLLMs show significant gaps in final diagnosis selection due to poor synthesis of heterogeneous clinical evidence.
From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
This paper introduces MedTPE, a method for efficient, lossless prompt compression of electronic health records for large language models, significantly reducing token length and inference latency in clinical prediction tasks.