PII data to LLM
Summary
Discusses the risks and considerations of sending Personally Identifiable Information (PII) to large language models.
Similar Articles
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
A unified survey of pretraining data exposure (PDE) in large language models, covering membership inference, data contamination, and security implications, with a review of attack and defense methods.
@pallavishekhar_: https://x.com/pallavishekhar_/status/2058460434035060758
Explains what large language models actually do (next-token prediction) and why they sound confident even when wrong. Offers a mental model and verification checklist for using LLMs safely.
Can LLMs Take Retrieved Information with a Grain of Salt?
This paper investigates how large language models adapt to the certainty of retrieved information, identifying systematic limitations in handling uncertainty. It proposes an interaction strategy that reduces obedience errors by 25% without modifying model weights.
Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning
This paper presents a cross-domain benchmark for federated fine-tuning of large language models on private data, evaluating LoRA, QLoRA, and IA3 strategies on healthcare and finance datasets. Results show that federated fine-tuning comes close to centralized performance and outperforms isolated learning, supporting its viability for adapting LLMs when data cannot be shared.
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.