The experiment was inspired by Pylon, a company that helps B2B businesses support their customers on the apps those customers already use. Pylon was already running a classifier to identify resolved tickets but wanted to improve its accuracy. The article also discusses the challenges of classifying customer conversations, including language variability, nuanced context, and evolving dialogues, and it highlights how the relative importance of recall and precision depends on the specific use case.
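For reference, the precision, recall, and F1 figures cited below follow the standard definitions. Here is a minimal sketch of how they can be computed from a labeled evaluation set; the function and variable names are illustrative, not taken from the original experiment:

```python
def classification_metrics(y_true, y_pred, positive="resolved"):
    """Compute precision, recall, and F1 for a binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    # Precision: of the tickets predicted "resolved", how many actually were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly "resolved" tickets, how many the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Intuitively, a recall-heavy model errs toward flagging tickets as resolved and misses few genuinely resolved ones, while a precision-heavy model flags fewer tickets but is right more often when it does.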
Key takeaways:
- In an experiment to identify the best model for classifying customer support tickets as resolved or not, Gemini Pro proved to be the best performing model with an accuracy of 74% and an F1 score of 76.69%.
- While Claude 2.1 showed a high recall rate, GPT-4 Turbo demonstrated high precision, making GPT-4 Turbo a good choice for tasks where precision is crucial (and Claude 2.1 a better fit where recall matters more).
- GPT-3.5 Turbo had the lowest accuracy of all the models at 57%, with a recall of 35.71% and an F1 score of 48.19%.
- Future work includes trying other examples in the prompts to understand why GPT-4 and GPT-3.5 didn't generalize as well as Gemini Pro (a rough sketch of this few-shot approach follows this list), and potentially fine-tuning a model on the dataset derived from Pylon's classifier data for better performance and reliability.
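As a rough illustration of the few-shot prompting approach mentioned in the last takeaway, here is a hedged sketch of how such a resolved/unresolved classifier might be prompted. The prompt wording, the example conversations, and the `call_model` hook are all hypothetical, not taken from Pylon's implementation:

```python
FEW_SHOT_PROMPT = """You are classifying customer support conversations.
Answer with exactly one word: RESOLVED or UNRESOLVED.

Conversation: "Thanks, that fixed it!" -> RESOLVED
Conversation: "Still seeing the error after the update." -> UNRESOLVED

Conversation: "{conversation}" ->"""

def classify_ticket(conversation: str, call_model) -> bool:
    """Return True if the model judges the ticket resolved.

    `call_model` is a hypothetical hook: any function that takes a prompt
    string and returns the model's text completion, so the same few-shot
    prompt can be sent to Gemini Pro, GPT-4 Turbo, Claude, etc.
    """
    reply = call_model(FEW_SHOT_PROMPT.format(conversation=conversation))
    return reply.strip().upper().startswith("RESOLVED")
```

Varying the in-context examples in a prompt like this, one model at a time, is one way to probe why some models generalize better than others before committing to fine-tuning.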