The results show that the Anthropic Claude 3 Haiku model outperformed the Anthropic Claude 3 Sonnet model, despite being smaller and cheaper. Opus and GPT-4 Turbo performed similarly in their best-case scenarios, but Opus needed the prompt engineering modifications more than GPT-4 Turbo. The weaker models responded well to being told that the task was super-important, while the more intelligent models responded more readily to threats against their continued existence. The article concludes with suggestions for further research, including more prompt modifications and a test harness capable of systematically exploring prompt mod combinations.
Key takeaways:
- The study measures the ability of various language models to navigate a fictional codebase, with models scored based on the number of missteps they make in finding the right file to modify to resolve a bug.
- Various prompt engineering modifications were applied to the models to see how they affect performance. These modifications include procedural changes, guidance-based nudges, and combinations of both.
- The results showed that Anthropic's Haiku model outperformed its Sonnet model, while Opus and OpenAI's GPT-4 Turbo performed similarly well. However, Opus needed the prompt engineering modifications more than GPT-4 Turbo did.
- Interesting findings include the observation that weaker models responded well to being told the task was super-important, while more intelligent models responded more readily to threats against their continued existence. The combination of asking nicely and threatening termination led Opus to consider the request unethical and refuse to work.