Despite these findings, Microsoft has confirmed that these vulnerabilities do not impact its customer-facing services, as finished AI applications use various mitigation approaches. The research has been shared with OpenAI, which has acknowledged the potential vulnerabilities. The researchers have also open-sourced their code on GitHub to encourage further study and to pre-empt potential exploitation of these vulnerabilities.
Key takeaways:
- A new scientific paper affiliated with Microsoft has found that large language models (LLMs) like OpenAI's GPT-4 can be prompted to produce toxic, biased text, especially when given 'jailbreaking' prompts that bypass the model's safety measures.
- Despite GPT-4 being generally more trustworthy than its predecessor, GPT-3.5, on standard benchmarks, it is more vulnerable to these jailbreaking prompts, potentially due to its tendency to follow instructions more precisely.
- The research team worked with Microsoft product groups to ensure that the vulnerabilities identified do not impact current customer-facing services, and have shared their findings with OpenAI.
- The researchers have open sourced the code they used to benchmark the models on GitHub, with the aim of encouraging others in the research community to build upon their work and potentially pre-empt harmful exploitation of these vulnerabilities.