The tool can be used by entering the model ID of a Hugging Face model, uploading a JSON config, or entering the model size directly. It supports bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2) quantization. The results can vary depending on the model, input data, CUDA version, and which quantization is used. The author has cross-checked what the website reports against actual usage on their RTX 4090 & 2060, and all numbers are within 500 MB.
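For intuition only, here is a minimal back-of-the-envelope sketch (not the tool's actual code) of how a weights-only memory estimate can follow from a parameter count and a quantization format. The bits-per-weight figures for the GGML formats and the `overhead_gb` fudge factor are rough assumptions, not values taken from the tool.

```python
# Hypothetical sketch: weights-only memory estimate by quantization format.
# GGML bits-per-weight values are rough averages, since the real formats mix
# block scales with quantized values; these numbers are assumptions.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "bnb_int8": 8.0,
    "bnb_int4": 4.0,
    "ggml_QK_8": 8.5,
    "ggml_QK_6": 6.6,
    "ggml_QK_5": 5.5,
    "ggml_QK_4": 4.6,
    "ggml_QK_2": 2.6,
}

def weight_memory_gb(n_params_billions: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Weights-only estimate in GB; overhead_gb is a fudge factor for the
    CUDA context and buffers (an assumption, not the tool's formula)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return n_params_billions * 1e9 * bytes_per_weight / 1024**3 + overhead_gb

# e.g. a 7B model in bnb int4: ~3.3 GB of weights plus overhead
print(f"{weight_memory_gb(7, 'bnb_int4'):.1f} GB")
```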
Key takeaways:
- This tool calculates how much GPU memory you need for training or inference of any LLM, taking quantization, inference frameworks, and QLoRA into account.
- It helps determine the maximum context length your GPU can handle, the type of finetuning you can do, the maximum batch size you can use during finetuning, and what is consuming your GPU memory (see the sketch after this list).
- The tool supports bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2) quantization. The GGML formats are inference-only, while bnb int8/int4 can be used for both training & inference.
- The results can vary depending on your model, input data, CUDA version, and which quantization you are using. The author has tried to take these into account & keep the results within 500 MB of actual usage.
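On the maximum-context-length point, a minimal sketch of the kind of bound involved: the KV cache grows linearly with context length and batch size, so whatever memory is left after loading the weights caps how long a context fits. The function names, the fp16 element size, and the example layer/hidden-size figures below are illustrative assumptions, not the tool's internals.

```python
# Hypothetical sketch: bound the context length by the KV-cache cost.
# Each token stores a key and a value vector per layer.

def kv_cache_gb(context_len: int, batch_size: int, n_layers: int,
                hidden_size: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB, assuming fp16 (2-byte) cache entries."""
    # 2x for keys and values
    elems = 2 * n_layers * hidden_size * context_len * batch_size
    return elems * bytes_per_elem / 1024**3

def max_context_len(free_gb: float, batch_size: int, n_layers: int,
                    hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Upper bound on context length that fits in free_gb at this batch size."""
    per_token_gb = kv_cache_gb(1, batch_size, n_layers, hidden_size, bytes_per_elem)
    return int(free_gb / per_token_gb)

# e.g. ~10 GB left after loading a Llama-2-7B-like model (32 layers, hidden size 4096)
print(max_context_len(free_gb=10, batch_size=1, n_layers=32, hidden_size=4096))
```

The same per-token cost can be read the other way: fix the context length you need and divide the free memory by the cost of one full sequence to bound the batch size.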