However, the models are not yet fully safe or aligned. They can still produce toxic or biased outputs, make up facts, and generate inappropriate content. OpenAI acknowledges these limitations and is working on improving the models. The company is also researching how to align the models with the values of specific populations and how to handle the societal implications of those decisions.
Key takeaways:
- InstructGPT models are better at following instructions than GPT-3 and generate fewer toxic outputs.
- The models were trained using reinforcement learning from human feedback (RLHF), a technique that uses human preferences as a reward signal to fine-tune the models (see the sketch after this list).
- Despite significant progress, InstructGPT models are not fully safe; they can still generate toxic or biased outputs and make up facts.
- OpenAI is working on improving the alignment of their models to better serve the needs of specific populations and reduce biases.
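To make the RLHF idea above a little more concrete, here is a minimal, illustrative sketch of the reward-modeling step: human comparisons between two responses train a model to assign higher scores to preferred responses, and that score later serves as the reward signal when fine-tuning the language model. This is not OpenAI's actual implementation; the embedding dimension, network shape, and random stand-in tensors are assumptions made purely for illustration.

```python
# Minimal sketch of the reward-modeling step in RLHF (illustrative only).
# A human ranks two responses to the same prompt; the reward model is
# trained so the preferred response receives a higher score.
# All tensors below are random stand-ins for real prompt/response embeddings.

import torch
import torch.nn as nn

EMBED_DIM = 128  # hypothetical embedding size

class RewardModel(nn.Module):
    """Maps a (prompt + response) embedding to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

reward_model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    # In practice these would be embeddings of the human-preferred response
    # and the rejected response for the same prompt.
    chosen = torch.randn(32, EMBED_DIM)
    rejected = torch.randn(32, EMBED_DIM)

    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)

    # Pairwise preference loss: push the preferred response's reward above
    # the rejected one's (a Bradley-Terry style ranking objective).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then supplies the reward signal for fine-tuning
# the language model with a policy-gradient method such as PPO.
```

The key design choice is that humans only rank outputs rather than write reference answers, which makes collecting the preference data far cheaper than full demonstrations while still steering the model toward behavior people prefer.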