Ask HN: How to avoid sensitive data being part of LLM training data?

Jan 01, 2024 - news.ycombinator.com
The post raises the concern of ensuring that sensitive and personally identifiable information (PII) does not become part of large language model (LLM) training data. It notes that while manual verification is feasible for small datasets, it becomes impractical at scale, and asks what methods exist to mask or filter out PII and other sensitive data in large datasets.

Key takeaways:

  • Sensitive data and PII must be kept out of LLM training data.
  • Manual verification of training data is feasible only when the dataset is small.
  • As dataset size grows, filtering out PII and sensitive data becomes much harder.
  • A scalable method to mask or filter PII in large datasets is needed.
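One common starting point for the scalable masking the post asks about is rule-based redaction, where regex patterns replace recognizable PII spans with placeholder tokens before the text enters a training corpus. The sketch below is illustrative only (the patterns and `mask_pii` helper are not from the post); production pipelines typically combine such rules with NER-based tools like Microsoft Presidio or spaCy.

```python
import re

# Illustrative patterns for a few common PII types. Real-world PII is
# messier than this; treat these as a minimal baseline, not a guarantee.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each matched PII span with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(mask_pii(record))  # Contact Jane at [EMAIL] or [PHONE].
```

Because each document is processed independently, this kind of filter parallelizes trivially across a large corpus (e.g. one map step per shard), which addresses the scale concern the post raises.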
