Feature Story
GitHub - lechmazur/divergent: LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.
Dec 30, 2024 · github.com
The results show varying performance among the models, with o1-preview achieving the highest score of 4.79 and GPT-4o scoring the lowest at 3.73. The percentage of repeated words is also analyzed, revealing that Llama 3.3 70B and o1-preview had no repeats, while GPT-4o had a high repeat rate of 23.68%. This repetition rate helps explain GPT-4o's lower performance. The benchmark provides insights into the creativity and distinctiveness capabilities of different LLMs, highlighting areas for improvement in generating unique and unrelated words.
Key takeaways
- The LLM Divergent Thinking Creativity Benchmark evaluates originality and fluency by having LLMs generate distinct words unrelated to an initial list.
- Each LLM generates 2,200 words, evaluated by four LLMs for distinctiveness and adherence to rules.
- Higher scores indicate better performance, with o1-preview achieving the highest score of 4.79.
- GPT-4o scored lowest at 3.73, partly because 23.68% of its words were repeats.
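To make the repeat-rate figure concrete, here is a minimal Python sketch of how such a percentage could be computed for one model's word list. The summary does not say how the benchmark defines a repeat, so this assumes a repeat is any word that already appeared earlier in the same list, compared case-insensitively. The function names `repeat_rate` and `starts_with_letter` are hypothetical and are not taken from the repository.

```python
from collections import Counter


def repeat_rate(words):
    """Return the percentage of words that duplicate an earlier word.

    Assumption: comparison is case-insensitive and ignores surrounding
    whitespace. The benchmark's actual definition may differ.
    """
    normalized = [w.strip().lower() for w in words if w.strip()]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    # Every occurrence beyond the first counts as one repeat.
    repeats = sum(count - 1 for count in counts.values())
    return 100.0 * repeats / len(normalized)


def starts_with_letter(words, letter):
    """Check the 'starts with a given letter' rule for every word."""
    target = letter.lower()
    return all(w.strip().lower().startswith(target) for w in words if w.strip())


sample = ["apple", "anchor", "Apple", "atlas"]
print(repeat_rate(sample))              # one repeat out of four words
print(starts_with_letter(sample, "a"))
```

Under this definition a list with no duplicates yields 0, matching the "no repeats" result reported for Llama 3.3 70B and o1-preview.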