Furthermore, the authors explore distilling the ranking capabilities of ChatGPT into a smaller, specialized model. This distilled model, trained on 10,000 ChatGPT-generated training examples, outperforms monoT5 (trained on 400,000 annotated MS MARCO examples) on the BEIR benchmark. The study highlights the effectiveness of LLMs at relevance ranking and shows that their capabilities can be distilled into more efficient models. Code for reproducing the study's results is available on GitHub.
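One common way to set up this kind of distillation is to treat the teacher's ranking as a set of pairwise preferences and train the student to reproduce them. The sketch below is illustrative only: the function name, data, and pair-expansion scheme are assumptions, not the paper's actual code.

```python
# Hypothetical sketch: expanding a ChatGPT-produced ranking into pairwise
# training labels for a small student ranker. Names and data are
# illustrative assumptions, not taken from the paper's implementation.

from itertools import combinations

def permutation_to_pairs(query, passages, ranking):
    """Expand a teacher ranking (list of passage indices, best first)
    into (query, preferred, dispreferred) training triples."""
    pairs = []
    for hi, lo in combinations(ranking, 2):
        # `hi` precedes `lo` in the teacher's ranking, so the student
        # should learn to score passages[hi] above passages[lo].
        pairs.append((query, passages[hi], passages[lo]))
    return pairs

pairs = permutation_to_pairs(
    "what causes tides",
    ["The moon's gravity ...", "Stock prices rose ...", "Tides are driven by ..."],
    ranking=[2, 0, 1],  # teacher judged passage 2 most relevant
)
# 3 passages expand into 3 ordered preference pairs
```

A pairwise loss (e.g. a RankNet-style objective) over such triples then lets a compact cross-encoder mimic the teacher's ordering without any human relevance labels.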
Key takeaways:
- Large Language Models (LLMs) like ChatGPT and GPT-4 can perform relevance ranking in Information Retrieval (IR) tasks competitively, even surpassing some supervised methods.
- GPT-4 outperforms the fully fine-tuned monoT5-3B model on various IR benchmarks, including MS MARCO, TREC datasets, BEIR datasets, and Mr. TyDi, a benchmark covering low-resource languages.
- A small specialized model trained on 10K ChatGPT-generated examples outperforms monoT5 trained on 400K annotated MS MARCO examples on the BEIR benchmark.
- The study explores the potential for distilling the ranking capabilities of ChatGPT into a specialized model, with code available for reproduction on GitHub.
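The listwise reranking idea behind these results can be sketched as follows: number the candidate passages in the prompt, ask the LLM to answer with a permutation of their identifiers, and parse that permutation back into an ordering. The prompt wording, output format, and repair logic below are assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical sketch of listwise LLM reranking: passages are numbered in
# the prompt, the model replies with a permutation such as "[2] > [1] > [3]",
# and the reply is parsed back into an ordering. The prompt text and the
# repair strategy for malformed replies are illustrative assumptions.

import re

def build_prompt(query, passages):
    lines = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return (
        f"Rank the following passages by relevance to the query: {query}\n"
        + "\n".join(lines)
        + "\nAnswer with identifiers only, e.g. [2] > [1] > [3]."
    )

def parse_permutation(response, num_passages):
    """Extract 0-based passage indices from the model's reply.

    Out-of-range and duplicate identifiers are dropped; any passages the
    model omitted are appended in their original order.
    """
    order = []
    for token in re.findall(r"\[(\d+)\]", response):
        idx = int(token) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    order.extend(i for i in range(num_passages) if i not in order)
    return order

# A malformed reply: duplicate [3], out-of-range [5], passage 2 missing.
order = parse_permutation("[3] > [1] > [3] > [5]", num_passages=3)
```

The defensive parsing matters in practice: LLM replies occasionally repeat or omit identifiers, and a reranker must still emit a complete permutation.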