Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

The author asserts that training on documents is a misleading term used by startups, and instead, they are likely using RAG with Llamaindex, which is considered the best option. The author suggests testing out a script that creates question and answer pairs with gpt-4 for qLoRA, although it's noted that this method has not been successful for private document knowledgebases due to the need for large amounts of repetitive data.

The author strongly advises against feeding a set of documents into fine-tuning, stating that it only results in learning the patterns within those documents. The author has personally disproven this method multiple times due to a persistent client who has been misled into believing it's effective.

Key takeaways

Training on documents is a misleading term used by many startups, the actual process involves using RAG and Llamaindex.
Llamaindex is the best option for most startups with working products.
Creating question and answer pairs with gpt-4 and using it for qLoRA might work, but it requires a lot of data and repeated concepts.
Feeding a set of documents into fine tuning does not work, it only helps in learning the patterns in those documents.

Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

Key takeaways

Discussion (0)