The article further discusses the costs of training and fine-tuning models, stating that training a 13 billion parameter model on 1.4 trillion tokens costs around $1 million, while fine-tuning is significantly cheaper. It also offers guidance on GPU memory requirements, noting that a 7 billion parameter model needs about 14 GB of GPU memory. The article concludes by highlighting the benefits of batching LLM requests, which can improve throughput by more than 10x, and by outlining the memory needed for output generation.
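Batching is easy to try locally. The sketch below uses Hugging Face transformers; the model name, prompts, and generation settings are illustrative assumptions rather than anything from the article.

```python
# A minimal sketch of batched generation with Hugging Face transformers.
# Model, prompts, and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"           # decoder-only models pad on the left
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Summarize the plot of Hamlet.",
    "Explain what a vector store is.",
    "List three uses of GPUs.",
]

# One generate() call serves all prompts at once, amortizing the per-request
# weight reads; this is where the >10x throughput gain over sequential
# requests comes from.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```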
Key takeaways:
- Appending "Be Concise" to your prompt can save 40-90% of the cost, since responses are billed by the token and shorter outputs cost less (see the cost sketch after this list).
- GPT-3.5-Turbo is significantly cheaper to use than GPT-4, and looking information up in a vector store is more cost-effective than asking an LLM to generate it (see the retrieval sketch below).
- Training your own LLM is possible but expensive: roughly $1 million to train a 13 billion parameter model on 1.4 trillion tokens. Fine-tuning, by contrast, costs comparatively little (a back-of-envelope calculation follows the list).
- Understanding GPU memory is crucial when self-hosting, as LLMs push a GPU's memory to its limit: beyond the weights themselves, the memory needed for generation grows in direct proportion to the maximum number of tokens you want to generate (the memory sketch below makes the arithmetic concrete).
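To put the "Be Concise" and model-choice points in numbers, here is a rough per-response cost comparison. The prices and token counts are loud assumptions (USD per 1K output tokens, not from the article); check current pricing before relying on them.

```python
# Back-of-envelope cost comparison for output tokens. All numbers are
# illustrative assumptions, not figures from the article.
PRICE_PER_1K_OUTPUT = {
    "gpt-3.5-turbo": 0.002,  # assumed price
    "gpt-4": 0.06,           # assumed price
}

def response_cost(model: str, output_tokens: int) -> float:
    """Cost of a single response, given the number of tokens generated."""
    return output_tokens * PRICE_PER_1K_OUTPUT[model] / 1000

verbose, concise = 600, 120  # assumed token counts; "Be Concise" shortens output
for model in PRICE_PER_1K_OUTPUT:
    v, c = response_cost(model, verbose), response_cost(model, concise)
    print(f"{model}: verbose ${v:.4f} vs concise ${c:.4f} ({1 - c / v:.0%} saved)")
```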
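For the retrieval point, a minimal nearest-neighbour sketch over precomputed embeddings; the random vectors and the 384-dimension size are stand-ins for real embedding-model output, which is far cheaper per query than asking an LLM to regenerate the answer.

```python
# Minimal sketch of "look it up instead of generating it": cosine-similarity
# search over precomputed document embeddings. Vectors here are random
# stand-ins for real embedding-model output.
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 384))  # assumed: 1000 docs, 384-dim
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def retrieve(query_embedding: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_embeddings @ q
    return np.argsort(scores)[::-1][:k]

query = rng.normal(size=384)
print(retrieve(query))  # indices of the best-matching documents
```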
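The ~$1 million training figure can be sanity-checked with the common ~6 × params × tokens FLOPs estimate. The GPU peak throughput, utilization, and hourly price below are assumptions chosen only to show the arithmetic, and they land in the same ballpark as the article's figure.

```python
# Back-of-envelope training cost via the ~6 * params * tokens FLOPs estimate.
params = 13e9    # 13B parameters (from the article)
tokens = 1.4e12  # 1.4T training tokens (from the article)
flops = 6 * params * tokens

peak_flops = 312e12        # assumed: A100 BF16 peak, FLOP/s
utilization = 0.45         # assumed: realistic model FLOPs utilization
price_per_gpu_hour = 4.50  # assumed: on-demand cloud price, USD

gpu_hours = flops / (peak_flops * utilization) / 3600
# Prints roughly 216,000 GPU-hours and ~$970k -- close to the ~$1M figure.
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price_per_gpu_hour:,.0f}")
```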
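Finally, the memory point in numbers: weights at 2 bytes per parameter in fp16 (which reproduces the 7B → ~14 GB figure), plus a key-value cache that grows linearly with the tokens in flight. The layer and hidden sizes below are LLaMA-7B's published shape; treat the whole formula as an approximation.

```python
# Rough GPU memory estimate: fp16 weights plus a KV cache that scales
# linearly with the number of tokens being generated.
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, hidden_size: int, n_tokens: int,
                bytes_per_value: int = 2) -> float:
    # 2 tensors (key and value) stored per layer, per token
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value / 1e9

print(f"7B weights: ~{weights_gb(7e9):.0f} GB")  # ~14 GB, as the article states
# LLaMA-7B shape: 32 layers, hidden size 4096; 2048 tokens in flight
print(f"KV cache, 2048 tokens: ~{kv_cache_gb(32, 4096, 2048):.2f} GB")
```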