The tool can be used by entering the model ID of a Hugging Face model, uploading a JSON config, or entering the model size directly. It supports bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2) quantization. The results can vary depending on the model, input data, CUDA version, and which quantization is used. The author has cross-checked what the website reports against actual usage on their RTX 4090 & 2060, and all numbers are within 500 MB.
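For intuition only, here is a minimal back-of-the-envelope sketch (not the tool's actual code) of how a weights-only memory estimate can follow from a parameter count and a quantization format. The bits-per-weight figures for the GGML formats and the `overhead_gb` fudge factor are rough assumptions, not values taken from the tool.

```python
# Hypothetical sketch: weights-only memory estimate by quantization format.
# GGML bits-per-weight values are rough averages, since the real formats mix
# block scales with quantized values; these numbers are assumptions.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "bnb_int8": 8.0,
    "bnb_int4": 4.0,
    "ggml_QK_8": 8.5,
    "ggml_QK_6": 6.6,
    "ggml_QK_5": 5.5,
    "ggml_QK_4": 4.6,
    "ggml_QK_2": 2.6,
}

def weight_memory_gb(n_params_billions: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Weights-only estimate in GB; overhead_gb is a fudge factor for the
    CUDA context and buffers (an assumption, not the tool's formula)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return n_params_billions * 1e9 * bytes_per_weight / 1024**3 + overhead_gb

# e.g. a 7B model in bnb int4: ~3.3 GB of weights plus overhead
print(f"{weight_memory_gb(7, 'bnb_int4'):.1f} GB")
```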
Key takeaways:
- This tool calculates how much GPU memory you need for training or inference of any LLM, taking quantization, inference frameworks, and QLoRA into account.
- It helps determine the maximum context length your GPU can handle, the type of finetuning you can do, the maximum batch size you can use during finetuning, and what is consuming your GPU memory (see the sketch after this list).
- The tool supports bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2) quantization. The GGML formats are inference-only, while bnb int8/int4 can be used for both training & inference.
- The results can vary depending on your model, input data, CUDA version, and which quantization you are using. The author has tried to take these into account & keep the results within 500 MB of actual usage.
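On the maximum-context-length point, a minimal sketch of the kind of bound involved: the KV cache grows linearly with context length and batch size, so whatever memory is left after loading the weights caps how long a context fits. The function names, the fp16 element size, and the example layer/hidden-size figures below are illustrative assumptions, not the tool's internals.

```python
# Hypothetical sketch: bound the context length by the KV-cache cost.
# Each token stores a key and a value vector per layer.

def kv_cache_gb(context_len: int, batch_size: int, n_layers: int,
                hidden_size: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB, assuming fp16 (2-byte) cache entries."""
    # 2x for keys and values
    elems = 2 * n_layers * hidden_size * context_len * batch_size
    return elems * bytes_per_elem / 1024**3

def max_context_len(free_gb: float, batch_size: int, n_layers: int,
                    hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Upper bound on context length that fits in free_gb at this batch size."""
    per_token_gb = kv_cache_gb(1, batch_size, n_layers, hidden_size, bytes_per_elem)
    return int(free_gb / per_token_gb)

# e.g. ~10 GB left after loading a Llama-2-7B-like model (32 layers, hidden size 4096)
print(max_context_len(free_gb=10, batch_size=1, n_layers=32, hidden_size=4096))
```

The same per-token cost can be read the other way: fix the context length you need and divide the free memory by the cost of one full sequence to bound the batch size.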