However, the models are not yet fully safe or aligned. They can still produce toxic or biased outputs, make up facts, and generate inappropriate content. OpenAI acknowledges these limitations and is working on improving the models. The company is also researching how to align the models with the values of specific populations and how to handle the societal implications of those decisions.
Key takeaways:
- InstructGPT models are better at following instructions than GPT-3 and generate fewer toxic outputs.
- The models were trained using reinforcement learning from human feedback (RLHF), a technique that uses human preferences as a reward signal to fine-tune the models (see the sketch after this list).
- Despite significant progress, InstructGPT models are not fully safe; they can still generate toxic or biased outputs and make up facts.
- OpenAI is working on improving the alignment of their models to better serve the needs of specific populations and reduce biases.
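To make the RLHF idea above a little more concrete, here is a minimal, illustrative sketch of the reward-modeling step: human comparisons between two responses train a model to assign higher scores to preferred responses, and that score later serves as the reward signal when fine-tuning the language model. This is not OpenAI's actual implementation; the embedding dimension, network shape, and random stand-in tensors are assumptions made purely for illustration.

```python
# Minimal sketch of the reward-modeling step in RLHF (illustrative only).
# A human ranks two responses to the same prompt; the reward model is
# trained so the preferred response receives a higher score.
# All tensors below are random stand-ins for real prompt/response embeddings.

import torch
import torch.nn as nn

EMBED_DIM = 128  # hypothetical embedding size

class RewardModel(nn.Module):
    """Maps a (prompt + response) embedding to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

reward_model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    # In practice these would be embeddings of the human-preferred response
    # and the rejected response for the same prompt.
    chosen = torch.randn(32, EMBED_DIM)
    rejected = torch.randn(32, EMBED_DIM)

    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)

    # Pairwise preference loss: push the preferred response's reward above
    # the rejected one's (a Bradley-Terry style ranking objective).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then supplies the reward signal for fine-tuning
# the language model with a policy-gradient method such as PPO.
```

The key design choice is that humans only rank outputs rather than write reference answers, which makes collecting the preference data far cheaper than full demonstrations while still steering the model toward behavior people prefer.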