The author emphasizes the importance of these evaluations because errors carry potentially serious consequences in a project that draws conclusions from press release data. The next step is to implement the evaluation criteria in code and apply them to both the fine-tuned LLMs and the API-driven proprietary models. The author plans to draw on intimate knowledge of the data, gained from personally annotating every item, to guide this process.
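The author doesn't show the evaluation code itself; the sketch below illustrates what coding one criterion (accuracy) might look like, assuming gold annotations and model outputs are flat dicts of named fields. The schema (`company`, `event_type`, `date`) and the normalized exact-match scoring are illustrative assumptions, not the author's actual setup.

```python
from typing import Dict, List

def field_accuracy(gold: List[Dict[str, str]], predicted: List[Dict[str, str]]) -> Dict[str, float]:
    """Compute per-field exact-match accuracy across a set of examples.

    Assumes each press release has been reduced to a flat dict of fields;
    the real schema is not specified in the post.
    """
    counts: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for gold_item, pred_item in zip(gold, predicted):
        for field, gold_value in gold_item.items():
            counts[field] = counts.get(field, 0) + 1
            # Normalize whitespace and case so trivial formatting
            # differences don't count as errors.
            if str(pred_item.get(field, "")).strip().lower() == str(gold_value).strip().lower():
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / counts[field] for field in counts}

# Example: one gold annotation vs. one model output (hypothetical fields).
gold = [{"company": "Acme Corp", "event_type": "acquisition", "date": "2024-03-01"}]
pred = [{"company": "acme corp", "event_type": "merger", "date": "2024-03-01"}]
print(field_accuracy(gold, pred))  # {'company': 1.0, 'event_type': 0.0, 'date': 1.0}
```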
Key takeaways:
- The author is evaluating the performance of fine-tuned large language models (LLMs) that convert press release text into structured data.
- Several evaluation criteria are considered: accuracy, handling of out-of-domain data, interpretation of vague terms, robustness to spelling variations, and handling of complex stories.
- The author is particularly interested in how the models handle edge cases and is considering using hard examples to generate synthetic data to improve performance (see the sketch after this list).
- The next step is to code these evaluation criteria and test them on the fine-tuned LLMs and API-driven proprietary models.
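The post doesn't say how synthetic data would be generated from the hard examples. One plausible approach, sketched below under that assumption, is to perturb the text of each hard example (here with crude spelling noise, echoing the spelling-variation criterion) while reusing its gold labels, so every hard case yields several additional training pairs; a stronger variant would prompt a capable model to rewrite the hard examples instead.

```python
import random
from typing import Dict, List, Tuple

def spelling_variants(text: str, rng: random.Random, n_variants: int = 3) -> List[str]:
    """Generate noisy copies of a text by swapping adjacent characters in a random word.

    A crude stand-in for the spelling variations seen in real press releases;
    a production version would target known confusions (e.g. company-name misspellings).
    """
    variants = []
    for _ in range(n_variants):
        words = text.split()
        if not words:
            variants.append(text)
            continue
        idx = rng.randrange(len(words))
        w = words[idx]
        if len(w) > 3:
            pos = rng.randrange(len(w) - 1)
            w = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
        words[idx] = w
        variants.append(" ".join(words))
    return variants

def augment_hard_examples(
    hard_examples: List[Tuple[str, Dict[str, str]]], seed: int = 0
) -> List[Tuple[str, Dict[str, str]]]:
    """Expand each (text, gold_fields) hard example into perturbed copies with the same labels."""
    rng = random.Random(seed)
    augmented = []
    for text, fields in hard_examples:
        for variant in spelling_variants(text, rng):
            augmented.append((variant, fields))
    return augmented

# Hypothetical hard example; the real data and schema are the author's own.
hard = [("Acme Corp announces acquisition of Globex", {"company": "Acme Corp", "event_type": "acquisition"})]
for text, fields in augment_hard_examples(hard):
    print(text, "->", fields)
```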