The author acknowledges that the current system is imperfect and aims to improve it with better stop sequences and prompt formatting tailored to each model. Future ideas include public voting to compute an Elo rating, side-by-side comparison of two models, and community-submitted prompts. The author is open to suggestions and feedback.
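Per-model prompt formatting could look like the sketch below: each model family gets its own template and stop sequences, so generation halts cleanly instead of the model continuing the conversation with itself. The format names, templates, and function are illustrative assumptions, not the author's actual script.

```python
# Hypothetical per-model prompt formats (names and templates are assumptions).
# Stop sequences cut generation off before the model starts a new turn.
PROMPT_FORMATS = {
    "alpaca": {
        "template": "### Instruction:\n{prompt}\n\n### Response:\n",
        "stop": ["### Instruction:"],
    },
    "chatml": {
        "template": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
        "stop": ["<|im_end|>"],
    },
}

def build_request(fmt_name: str, prompt: str) -> dict:
    """Fill the model-specific template and attach its stop sequences."""
    fmt = PROMPT_FORMATS[fmt_name]
    return {"prompt": fmt["template"].format(prompt=prompt), "stop": fmt["stop"]}
```

A request built this way can be passed to whichever completion API serves the model, with the `stop` list forwarded as the API's stop-sequence parameter.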
Key takeaways:
- The author created a script to test roughly 60 AI models on basic reasoning, instruction-following, and creativity skills.
- The script stored all the responses in a SQLite database, providing raw results for comparison.
- The author used a mix of APIs from OpenRouter, TogetherAI, OpenAI, Cohere, Aleph Alpha, and AI21 for the testing.
- The author plans to improve the testing process by using better stop sequences and prompt formatting tailored to each model.
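The storage side described above can be sketched with Python's built-in `sqlite3` module. The schema and column names here are assumptions for illustration; the post does not show the author's actual script.

```python
import sqlite3

# Minimal sketch of storing raw responses in SQLite for later comparison.
# Schema and column names are assumptions, not the author's actual script.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    """CREATE TABLE IF NOT EXISTS responses (
           model TEXT,
           category TEXT,   -- e.g. reasoning / instruction-following / creativity
           prompt TEXT,
           response TEXT
       )"""
)

def record(model: str, category: str, prompt: str, response: str) -> None:
    """Store one raw model response."""
    conn.execute(
        "INSERT INTO responses VALUES (?, ?, ?, ?)",
        (model, category, prompt, response),
    )
    conn.commit()

record("example-model", "reasoning", "2+2?", "4")
rows = conn.execute("SELECT model, response FROM responses").fetchall()
```

Keeping every raw response, rather than only scores, is what makes later side-by-side comparison of models possible.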