The author suggests using the safetensors format, which reduces memory usage during loading and is safer and faster than pickle-based checkpoints. They also recommend the device_map feature, which splits the model across the available devices (GPUs, CPU RAM, and disk) so that all of the machine's memory can be put to use. To further reduce memory consumption, the author suggests quantizing Falcon 180B to a lower precision. They conclude that, with quantization and 100 GB of memory, Falcon 180B can run on a reasonably affordable computer. For faster inference or fine-tuning, a GPU such as the RTX 4090 or RTX 3090 (both 24 GB) is recommended.
Key takeaways:
- The Technology Innovation Institute (TII) of Abu Dhabi has released a new model, Falcon 180B, a 180-billion-parameter model that has demonstrated superior performance and ranks first on Hugging Face's Open LLM Leaderboard.
- Running Falcon 180B on a standard computer is challenging because of the model's size and compute requirements. It can nonetheless run on consumer hardware, given a memory upgrade and a quantized version of the model.
- Quantizing Falcon 180B to a lower precision, such as 4-bit, cuts its memory consumption enough that the model can run on a reasonably affordable computer with 100 GB of memory.
- For fast inference or fine-tuning, a GPU such as the RTX 4090 or RTX 3090 (both 24 GB) is recommended. Without a GPU, fine-tuning is too slow, but inference remains possible with a high-end CPU and software optimized for CPU inference, such as llama.cpp.
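As a rough sanity check on the 100 GB figure, weight memory scales linearly with precision. The sketch below (a back-of-the-envelope estimate only — the parameter count is rounded to 180B, and it ignores the extra memory needed for activations and the KV cache at inference time) shows why 4-bit quantization is what brings the model under the 100 GB threshold:

```python
PARAMS = 180e9  # approximate parameter count of Falcon 180B (rounded)

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Estimated weight memory at common precisions
for name, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(PARAMS, bits):.0f} GB")
```

At float16 the weights alone need roughly 360 GB, at 8-bit roughly 180 GB, and at 4-bit roughly 90 GB — the only precision of the three that fits in 100 GB of memory, which matches the article's recommendation.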