To address this, the authors propose a new contamination detection method called the LLM decontaminator. For each test case, the method first uses embedding similarity search to retrieve the top-k most similar items from the training set, then prompts an advanced LLM to judge whether each of the k resulting pairs is a rephrasing of the same content. The LLM decontaminator was found to be more effective at flagging rephrased samples than existing methods. The authors encourage the community to adopt stronger decontamination tools and develop fresh one-time exams to accurately evaluate LLMs.
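The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding model is abstracted into precomputed vectors, and `judge` stands in for the LLM call (in the paper's pipeline, a prompt to an advanced LLM such as GPT-4 asking whether a pair is a rephrasing). All function names here are hypothetical.

```python
import numpy as np

def top_k_similar(test_emb, train_embs, k=3):
    # Stage 1: cosine similarity between one test embedding and all
    # training embeddings; return indices of the k closest training items.
    test_n = test_emb / np.linalg.norm(test_emb)
    train_n = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = train_n @ test_n
    return np.argsort(sims)[::-1][:k]

def detect_contamination(test_cases, test_embs, train_items, train_embs,
                         judge, k=3):
    # Stage 2: for each test case, ask the judge (an LLM in the paper's
    # method; any callable here) whether each top-k pair is a rephrasing.
    flagged = set()
    for i, emb in enumerate(test_embs):
        for j in top_k_similar(emb, train_embs, k):
            if judge(test_cases[i], train_items[j]):
                flagged.add(int(j))
    return flagged
```

Keeping k small matters in practice: the expensive LLM judgment runs only on the k candidates per test case that the cheap similarity search surfaces, rather than on the full cross-product of test and training items.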
Key takeaways:
- The authors introduced Llama-rephraser, a 13B model that reaches GPT-4-level performance on major benchmarks simply by training on rephrased test samples or on test samples translated into another language.
- They proposed a new contamination detection method called "LLM decontaminator" which uses embedding similarity search and an advanced LLM to identify and remove rephrased samples from the training set.
- The LLM decontaminator was found to detect rephrased contamination more effectively than existing methods such as n-gram overlap and embedding similarity search.
- The authors applied the LLM decontaminator to real-world datasets and found a significant number of rephrased samples, suggesting that contamination may be more prevalent than previously thought.