The experiment was inspired by Pylon, a company that helps B2B businesses support their customers on the apps those customers already use. Pylon was already running a classifier to identify resolved tickets but wanted to improve its accuracy. The article also discusses the challenges of classifying customer conversations, including language variability, nuanced context, and evolving dialogues, and it highlights how the relative importance of recall and precision depends on the specific use case.
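For reference, the precision, recall, and F1 figures cited below follow the standard definitions. Here is a minimal sketch of how they can be computed from a labeled evaluation set; the function and variable names are illustrative, not taken from the original experiment:

```python
def classification_metrics(y_true, y_pred, positive="resolved"):
    """Compute precision, recall, and F1 for a binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    # Precision: of the tickets predicted "resolved", how many actually were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly "resolved" tickets, how many the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Intuitively, a recall-heavy model errs toward flagging tickets as resolved and misses few genuinely resolved ones, while a precision-heavy model flags fewer tickets but is right more often when it does.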
Key takeaways:
- In an experiment to identify the best model for classifying customer support tickets as resolved or not, Gemini Pro proved to be the best performing model with an accuracy of 74% and an F1 score of 76.69%.
- While Claude 2.1 showed a high recall rate, GPT-4 Turbo demonstrated high precision, making GPT-4 Turbo a good choice for tasks where precision is crucial (and Claude 2.1 a better fit where recall matters more).
- GPT-3.5 Turbo had the lowest accuracy of all the models at 57%, with a recall of 35.71% and an F1 score of 48.19%.
- Future work includes trying other examples in the prompts to understand why GPT-4 and GPT-3.5 didn't generalize as well as Gemini Pro (a rough sketch of this few-shot approach follows this list), and potentially fine-tuning a model on the dataset derived from Pylon's classifier data for better performance and reliability.
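As a rough illustration of the few-shot prompting approach mentioned in the last takeaway, here is a hedged sketch of how such a resolved/unresolved classifier might be prompted. The prompt wording, the example conversations, and the `call_model` hook are all hypothetical, not taken from Pylon's implementation:

```python
FEW_SHOT_PROMPT = """You are classifying customer support conversations.
Answer with exactly one word: RESOLVED or UNRESOLVED.

Conversation: "Thanks, that fixed it!" -> RESOLVED
Conversation: "Still seeing the error after the update." -> UNRESOLVED

Conversation: "{conversation}" ->"""

def classify_ticket(conversation: str, call_model) -> bool:
    """Return True if the model judges the ticket resolved.

    `call_model` is a hypothetical hook: any function that takes a prompt
    string and returns the model's text completion, so the same few-shot
    prompt can be sent to Gemini Pro, GPT-4 Turbo, Claude, etc.
    """
    reply = call_model(FEW_SHOT_PROMPT.format(conversation=conversation))
    return reply.strip().upper().startswith("RESOLVED")
```

Varying the in-context examples in a prompt like this, one model at a time, is one way to probe why some models generalize better than others before committing to fine-tuning.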