The study also proposed ranking responses by their QE scores for each target language, addressing the noise introduced by low-quality translations. The resulting fine-tuned multilingual dialogue evaluation models correlated strongly with human judgments. The authors suggest that future research could evaluate generative model responses in different languages using annotators familiar with the culture associated with each language.
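One plausible reading of this ranking step, sketched below: for each target language, score every (source, translated response) pair with a reference-free QE metric, rank the pairs, and keep only a top fraction. The function names, the 0.5 fraction, and the length-ratio stand-in for a real QE model (e.g. a COMET-style estimator) are assumptions for illustration, not the paper's code.

```python
from typing import List, Tuple

def qe_score(source: str, translation: str) -> float:
    """Stand-in for a reference-free QE model (e.g. a COMET-QE-style scorer).
    A crude length-ratio heuristic is used so the sketch runs end to end."""
    if not source or not translation:
        return 0.0
    ratio = len(translation) / len(source)
    return max(0.0, 1.0 - abs(1.0 - ratio))

def rank_by_qe(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str, float]]:
    """Rank (source, translated response) pairs for one target language,
    best-estimated translations first."""
    scored = [(src, mt, qe_score(src, mt)) for src, mt in pairs]
    return sorted(scored, key=lambda item: item[2], reverse=True)

def keep_top_fraction(pairs: List[Tuple[str, str]],
                      fraction: float = 0.5) -> List[Tuple[str, str]]:
    """Discard the lowest-ranked translations; the fraction is illustrative."""
    ranked = rank_by_qe(pairs)
    cutoff = max(1, int(len(ranked) * fraction))
    return [(src, mt) for src, mt, _ in ranked[:cutoff]]
```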
Key takeaways:
- Researchers have developed a new approach to address the lack of multilingual data and of open-source multilingual dialogue systems, using a multilingual pretrained encoder-based language model and Machine Translation (MT).
- Simply fine-tuning a pretrained multilingual encoder model on translated data does not outperform the existing baseline; using MT Quality Estimation (QE) metrics to curate the translated data is more effective.
- The authors proposed using QE scores to rank responses for each target language, which provides a standardized filtering criterion and improves the approach's scalability to new languages.
- The study suggests that filtering out low-quality translations can narrow the performance gap with ChatGPT and even outperform it on select correlation metrics (a sketch of this filtering step follows the list). Future research could evaluate generative model responses in different languages using annotators familiar with the culture associated with each language.
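As a rough illustration of the curation step above, translated training examples whose estimated translation quality falls below a threshold could be dropped before fine-tuning. The field names (`src_context`, `mt_response`, etc.) and the 0.8 cut-off are assumptions for the sketch, not values from the study; `qe_score` is the same stand-in scorer used in the ranking sketch earlier.

```python
from typing import Dict, Iterable, List

def qe_score(source: str, translation: str) -> float:
    """Stand-in for a reference-free QE model (crude length-ratio heuristic)."""
    if not source or not translation:
        return 0.0
    return max(0.0, 1.0 - abs(1.0 - len(translation) / len(source)))

def filter_translations(examples: Iterable[Dict[str, str]],
                        threshold: float = 0.8) -> List[Dict[str, str]]:
    """Keep only translated (context, response) pairs whose estimated
    translation quality meets the threshold on both sides."""
    return [
        ex for ex in examples
        if qe_score(ex["src_context"], ex["mt_context"]) >= threshold
        and qe_score(ex["src_response"], ex["mt_response"]) >= threshold
    ]
```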