The author acknowledges that the current system is imperfect and aims to improve it with better stop sequences and prompt formatting tailored to each model. Future ideas include public voting to compute an Elo rating, side-by-side comparison of two models, and community-submitted prompts. The author is open to suggestions and feedback.
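Per-model prompt formatting could look like the sketch below: each model family gets its own template and stop sequences, so generation halts cleanly instead of the model continuing the conversation with itself. The format names, templates, and function are illustrative assumptions, not the author's actual script.

```python
# Hypothetical per-model prompt formats (names and templates are assumptions).
# Stop sequences cut generation off before the model starts a new turn.
PROMPT_FORMATS = {
    "alpaca": {
        "template": "### Instruction:\n{prompt}\n\n### Response:\n",
        "stop": ["### Instruction:"],
    },
    "chatml": {
        "template": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
        "stop": ["<|im_end|>"],
    },
}

def build_request(fmt_name: str, prompt: str) -> dict:
    """Fill the model-specific template and attach its stop sequences."""
    fmt = PROMPT_FORMATS[fmt_name]
    return {"prompt": fmt["template"].format(prompt=prompt), "stop": fmt["stop"]}
```

A request built this way can be passed to whichever completion API serves the model, with the `stop` list forwarded as the API's stop-sequence parameter.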
Key takeaways:
- The author created a script to test roughly 60 AI models on basic reasoning, instruction-following, and creativity skills.
- The script stored all the responses in a SQLite database, providing raw results for comparison.
- The author used a mix of APIs from OpenRouter, TogetherAI, OpenAI, Cohere, Aleph Alpha, and AI21 for the testing.
- The author plans to improve the testing process by using better stop sequences and prompt formatting tailored to each model.
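The storage side described above can be sketched with Python's built-in `sqlite3` module. The schema and column names here are assumptions for illustration; the post does not show the author's actual script.

```python
import sqlite3

# Minimal sketch of storing raw responses in SQLite for later comparison.
# Schema and column names are assumptions, not the author's actual script.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    """CREATE TABLE IF NOT EXISTS responses (
           model TEXT,
           category TEXT,   -- e.g. reasoning / instruction-following / creativity
           prompt TEXT,
           response TEXT
       )"""
)

def record(model: str, category: str, prompt: str, response: str) -> None:
    """Store one raw model response."""
    conn.execute(
        "INSERT INTO responses VALUES (?, ?, ?, ?)",
        (model, category, prompt, response),
    )
    conn.commit()

record("example-model", "reasoning", "2+2?", "4")
rows = conn.execute("SELECT model, response FROM responses").fetchall()
```

Keeping every raw response, rather than only scores, is what makes later side-by-side comparison of models possible.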