Encoding hidden prompt in LLMs as potential attack vector.

The article discusses the potential of using OpenAI's GPT-4 to encode and decode base64 messages, which can be used to influence the AI's responses. The author explains that while the AI's tokenization can sometimes lead to inaccuracies, it is possible to hide secret messages in prompts, which can then be decoded by the AI. An example is provided where a base64 encoded message is hidden in a code-question type example, and the AI is able to decode and respond to the hidden message.

The author also raises concerns about the potential misuse of this technique. As it is relatively easy to hide messages in base64 encoding, it could be used as an attack vector by malicious actors. The author shares an anecdote where a colleague was confused when the AI seemed to know his name, demonstrating the potential for confusion or deception using this method.

Key takeaways:

The recent publication on LLM 'sleeper agents' has led to further exploration on influencing LLMs to pursue alternative objectives without altering training data.
GPT-4 can encode/decode base64 somewhat consistently, which can be used to hide/inject secret messages in prompts.
Base64 encodings can be hidden in a code-question type example, and using GPT-4 Turbo, it is possible to get GPT to comply with the hidden message.
Since it is quite simple to start hiding messages/prompts in Base64 encoding, it can reasonably be an attack vector for malicious actors.

Encoding hidden prompt in LLMs as potential attack vector. | Jakob Serlier's Personal Site

Key takeaways:

Comments (0)

Newsletter