Most of the score dimensions are deterministic, but some newer ones use a language model for scoring. This introduces a meta-problem: how do you score the scoring prompt itself? For now, outputs are manually scanned as a sanity check. The article notes that no fine-tuning is being done yet, since satisfactory results are achieved with prompting alone.
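The mix of deterministic and LLM-judged dimensions might be sketched as follows. This is a hypothetical illustration, not the article's implementation: `score_output`, the dimension names, and the `llm_judge` callable (a stand-in for a real model call) are all assumptions.

```python
# Hypothetical sketch: a scorecard mixing deterministic checks with one
# LLM-judged dimension. `llm_judge` stands in for a real model call that
# returns a 0-1 quality score.

def score_output(output: str, llm_judge=None) -> dict:
    scores = {
        # Deterministic dimensions: cheap, repeatable checks.
        "non_empty": 1.0 if output.strip() else 0.0,
        "within_length": 1.0 if len(output) <= 500 else 0.0,
    }
    if llm_judge is not None:
        # LLM-judged dimension: delegated to the judge callable.
        scores["llm_quality"] = llm_judge(output)
    return scores

# Stub judge for demonstration; a real one would prompt a model.
print(score_output("A concise answer.", llm_judge=lambda text: 0.9))
```

Keeping the judge as a pluggable callable makes the deterministic dimensions testable without any model access, which matters when the judge itself is the thing under evaluation.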
Key takeaways:
- A script and a shared input library are used for head-to-head comparison of a new candidate prompt against the existing production prompt.
- Each run is driven by a configuration that includes the prompt, the LLM to use, the temperature, and so on.
- Most score dimensions are deterministic, but newer ones integrate an LLM for scoring.
- Outputs are manually scanned as a sanity check, and fine-tuning is not yet being done, since good results are achieved with prompting alone.
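The harness described in the takeaways above can be sketched like this. Everything here is an assumption beyond the article's outline: the config shape (prompt, model, temperature) follows the bullets, while `compare_prompts`, `run_prompt`, and `score_fn` are hypothetical names standing in for the real model call and scorer.

```python
# Hypothetical sketch of a head-to-head prompt comparison over a shared
# input library. `run_prompt` and `score_fn` are stubs for the real
# model call and scoring function.
import statistics

def compare_prompts(candidate_cfg, production_cfg, input_library,
                    run_prompt, score_fn):
    """Score both configs over the same input library; higher mean wins."""
    means = {}
    for name, cfg in (("candidate", candidate_cfg),
                      ("production", production_cfg)):
        outputs = (run_prompt(cfg, item) for item in input_library)
        means[name] = statistics.mean(score_fn(o) for o in outputs)
    means["winner"] = max(("candidate", "production"), key=means.get)
    return means

# Demo with stubs: a toy length-based score and a templating "model".
cfg_a = {"prompt": "Summarize: {input}", "model": "some-llm", "temperature": 0.2}
cfg_b = {"prompt": "{input}", "model": "some-llm", "temperature": 0.2}
fake_run = lambda cfg, item: cfg["prompt"].replace("{input}", item)
fake_score = lambda out: float(len(out))
print(compare_prompts(cfg_a, cfg_b, ["alpha", "beta"], fake_run, fake_score))
```

Running both configs over the same input library keeps the comparison apples-to-apples: any score difference comes from the prompt and settings, not from differing inputs.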