Extracting Training Data from ChatGPT

Researchers have discovered a method to extract training data from OpenAI's language model, ChatGPT, by prompting it to repeat a word indefinitely. Despite the model being designed to prevent the regurgitation of training data, the researchers were able to extract a significant amount of data, including email addresses, phone numbers, and verbatim text. The team suggests that this demonstrates a vulnerability in the model, and highlights the need for more comprehensive testing of base models and systems in production.

The researchers argue that while it's possible to patch the specific exploit they used, this doesn't address the underlying vulnerability - that language models like ChatGPT can memorize and regurgitate training data. They suggest that understanding and addressing this issue is a significant challenge in ensuring the safety of machine learning systems. The findings were responsibly disclosed to OpenAI and the creators of other public models studied in the paper.

Key takeaways:

A paper has been released demonstrating an attack that can extract significant amounts of training data from the language model, ChatGPT, by prompting the model with a repeated word.
The attack reveals a vulnerability in ChatGPT, which despite being aligned to prevent the regurgitation of training data, can be manipulated to emit such data.
The researchers argue that testing only the aligned model can mask vulnerabilities, and that it is important to test base models and systems in production to verify that they sufficiently patch exploits.
While the specific exploit used in the paper can be patched, the underlying vulnerability of language models memorizing training data is harder to address, indicating a need for more research and experimentation in the security analysis of machine-learning models.

Extracting Training Data from ChatGPT

Key takeaways:

Comments (0)

Newsletter