Sign up to save tools and stay up to date with the latest in AI
bg
bg
1

GitHub - aiwebb/treenav-bench: Test LLM tree-nav ability with various prompt engineering mods

Apr 15, 2024 - github.com
The article presents a study that measures the ability of various Language Learning Models (LLMs) to navigate a fictional codebase. The LLMs are scored based on the number of missteps they make in an attempt to find the right file to modify to resolve a bug. The models' baseline abilities are compared against combinations of various prompt engineering modifications to quantify their effectiveness. The models tested include Anthropic Claude 3 Haiku, Anthropic Claude 3 Sonnet, Anthropic Claude 3 Opus, OpenAI GPT-4 Turbo, OpenAI GPT-4, and OpenAI GPT-3.5 Turbo.

The results show that the Anthropic Claude 3 Haiku model outperformed the Anthropic Claude 3 Sonnet model, despite being smaller and cheaper. Opus and GPT-4 Turbo performed similarly in their best-case scenarios, but Opus needed the prompt engineering modifications more than GPT-4 Turbo. The weaker models responded well to being told that the task was super-important, while the more intelligent models responded more readily to threats against their continued existence. The article concludes with suggestions for further research, including more prompt modifications and a test harness capable of systematically exploring prompt mod combinations.

Key takeaways:

  • The study measures the ability of various language models to navigate a fictional codebase, with models scored based on the number of missteps they make in finding the right file to modify to resolve a bug.
  • Various prompt engineering modifications were applied to the models to see how they affect performance. These modifications include procedural changes, guidance-based nudges, and combinations of both.
  • The results showed that Anthropic's Haiku model outperformed its Sonnet model, while Opus and OpenAI's GPT-4 Turbo performed similarly well. However, Opus needed the prompt engineering modifications more than GPT-4 Turbo did.
  • Interesting findings include the observation that weaker models responded well to being told the task was super-important, while more intelligent models responded more readily to threats against their continued existence. The combination of asking nicely and threatening termination led Opus to consider the request unethical and refuse to work.
View Full Article

Comments (0)

Be the first to comment!