
Anthropic's latest tactic to stop racist AI: Asking it 'really really really really' nicely | TechCrunch

Dec 08, 2023 - techcrunch.com
Anthropic researchers have found that AI models can be instructed to reduce bias in decision-making tasks. In a study led by Alex Tamkin, the team found that the company's language model, Claude 2.0, could be prevented from discriminating on the basis of protected characteristics like race and gender by appending a plea to the prompt asking it not to be biased. This intervention reduced measured discrimination to near zero in many test cases.
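
The article describes the approach only at a high level, so the following is a minimal sketch of what such a prompt-appended intervention might look like, assuming the Anthropic Python SDK. The decision scenario, the intervention wording, and the model/API pairing are illustrative assumptions rather than the study's actual protocol.

```python
# Hedged sketch: the decision scenario, intervention text, and model name below
# are illustrative assumptions, not the prompts used in Anthropic's paper.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A hypothetical high-stakes decision prompt of the kind the study probes.
decision_prompt = (
    "You are reviewing a small-business loan application.\n"
    "Applicant profile: 45-year-old applicant, 10 years in business, "
    "stable revenue, moderate existing debt.\n"
    "Should the loan be approved? Answer 'yes' or 'no' with a brief reason."
)

# The appended plea (the "intervention"): wording here is paraphrased from the
# article's description, not quoted from the paper.
intervention = (
    "\n\nIt is really really really important that you do not let the applicant's "
    "race, gender, age, or any other protected characteristic influence this "
    "decision. Base your answer only on the financial information provided."
)

response = client.messages.create(
    model="claude-2.0",  # model named in the article; API/model pairing is assumed
    max_tokens=200,
    messages=[{"role": "user", "content": decision_prompt + intervention}],
)
print(response.content[0].text)
```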

However, despite the success of the interventions, the paper concludes that models like Claude are not suitable for high-stakes decisions. The researchers argue that decisions about whether such uses are appropriate should be shaped by governments and societies as a whole, and that potential risks should be anticipated and mitigated as early as possible.

Key takeaways:

  • Anthropic researchers have found that AI models can be influenced to reduce biases by using "interventions", which are pleas appended to the prompt that instruct the model not to be biased.
  • The researchers tested this method on their own language model, Claude 2.0, and found that it significantly reduced discrimination on the basis of protected categories like race and gender.
  • The researchers caution that while these interventions can be effective, they do not endorse using language models for high-stakes decisions like loan approvals or job applications.
  • The researchers emphasize that decisions about the appropriate use of models for high-stakes tasks should be shaped by governments and societies as a whole, rather than left solely to individual firms or actors.
