Furthermore, the authors explore distilling the ranking capabilities of ChatGPT into a smaller, specialized model. This distilled model, trained on 10,000 ChatGPT-generated training examples, outperforms monoT5 (trained on 400,000 annotated MS MARCO examples) on the BEIR benchmark. The study highlights the effectiveness of LLMs at relevance ranking and shows that their capabilities can be distilled into more efficient models. Code for reproducing the study's results is available on GitHub.
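One common way to set up this kind of distillation is to treat the teacher's ranking as a set of pairwise preferences and train the student to reproduce them. The sketch below is illustrative only: the function name, data, and pair-expansion scheme are assumptions, not the paper's actual code.

```python
# Hypothetical sketch: expanding a ChatGPT-produced ranking into pairwise
# training labels for a small student ranker. Names and data are
# illustrative assumptions, not taken from the paper's implementation.

from itertools import combinations

def permutation_to_pairs(query, passages, ranking):
    """Expand a teacher ranking (list of passage indices, best first)
    into (query, preferred, dispreferred) training triples."""
    pairs = []
    for hi, lo in combinations(ranking, 2):
        # `hi` precedes `lo` in the teacher's ranking, so the student
        # should learn to score passages[hi] above passages[lo].
        pairs.append((query, passages[hi], passages[lo]))
    return pairs

pairs = permutation_to_pairs(
    "what causes tides",
    ["The moon's gravity ...", "Stock prices rose ...", "Tides are driven by ..."],
    ranking=[2, 0, 1],  # teacher judged passage 2 most relevant
)
# 3 passages expand into 3 ordered preference pairs
```

A pairwise loss (e.g. a RankNet-style objective) over such triples then lets a compact cross-encoder mimic the teacher's ordering without any human relevance labels.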
Key takeaways:
- Large Language Models (LLMs) like ChatGPT and GPT-4 can perform relevance ranking in Information Retrieval (IR) tasks competitively, even surpassing some supervised methods.
- GPT-4 outperforms the fully fine-tuned monoT5-3B model on various IR benchmarks, including MS MARCO, TREC datasets, BEIR datasets, and Mr. TyDi, a benchmark covering low-resource languages.
- A small specialized model trained on 10K ChatGPT-generated examples outperforms monoT5 trained on 400K annotated MS MARCO examples on the BEIR benchmark.
- The study explores the potential for distilling the ranking capabilities of ChatGPT into a specialized model, with code available for reproduction on GitHub.
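The listwise reranking idea behind these results can be sketched as follows: number the candidate passages in the prompt, ask the LLM to answer with a permutation of their identifiers, and parse that permutation back into an ordering. The prompt wording, output format, and repair logic below are assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical sketch of listwise LLM reranking: passages are numbered in
# the prompt, the model replies with a permutation such as "[2] > [1] > [3]",
# and the reply is parsed back into an ordering. The prompt text and the
# repair strategy for malformed replies are illustrative assumptions.

import re

def build_prompt(query, passages):
    lines = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return (
        f"Rank the following passages by relevance to the query: {query}\n"
        + "\n".join(lines)
        + "\nAnswer with identifiers only, e.g. [2] > [1] > [3]."
    )

def parse_permutation(response, num_passages):
    """Extract 0-based passage indices from the model's reply.

    Out-of-range and duplicate identifiers are dropped; any passages the
    model omitted are appended in their original order.
    """
    order = []
    for token in re.findall(r"\[(\d+)\]", response):
        idx = int(token) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    order.extend(i for i in range(num_passages) if i not in order)
    return order

# A malformed reply: duplicate [3], out-of-range [5], passage 2 missing.
order = parse_permutation("[3] > [1] > [3] > [5]", num_passages=3)
```

The defensive parsing matters in practice: LLM replies occasionally repeat or omit identifiers, and a reranker must still emit a complete permutation.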