The largest and best model of the Llama 2 family has 70 billion parameters. One fp16 parameter weighs 2 bytes. Loading Llama 2 70B requires 140 GB of memory (70 billion * 2 bytes).
In a previous article, I showed how you can run a 180-billion-parameter model, Falcon 180B, on 100 GB of CPU RAM thanks to quantization.

Falcon 180B: Can It Run on Your Computer? (September 11, 2023)
Llama 2 70B is substantially smaller than Falcon 180B.
Can it entirely fit into a single consumer GPU?
This is challenging. Let's say that a high-end consumer GPU, such as the NVIDIA RTX 3090* or 4090*, has a maximum of 24 GB of VRAM. If we quantize Llama 2 70B to 4-bit precision, we still need 35 GB of memory (70 billion * 0.5 bytes). That doesn't fit into one consumer GPU, but it could fit across two of them.
With GPTQ quantization, we can further reduce the precision to 3-bit without losing much of the model's performance. A 3-bit parameter weighs 0.375 bytes in memory, so Llama 2 70B quantized to 3-bit would still weigh 26.25 GB (70 billion * 0.375 bytes). It doesn't fit into one consumer GPU either.

Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL (August 30, 2023)
We could reduce the precision to 2-bit. At 0.25 bytes per parameter, the model would shrink to 17.5 GB and fit into 24 GB of VRAM, but the performance of the model would also drop significantly, according to previous studies on 2-bit quantization.
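Here is the arithmetic behind these figures in a few lines of Python, as a quick back-of-the-envelope check. It ignores activation memory, the KV cache, and quantization overhead such as scales and zero-points, so real usage is a bit higher:

```python
# Back-of-the-envelope memory footprint of Llama 2 70B at different precisions.
NUM_PARAMS = 70e9  # 70 billion parameters

for bits in (16, 4, 3, 2):
    bytes_per_param = bits / 8
    total_gb = NUM_PARAMS * bytes_per_param / 1e9
    fits = "yes" if total_gb <= 24 else "no"
    print(f"{bits}-bit: {total_gb:.2f} GB (fits in 24 GB of VRAM: {fits})")

# 16-bit: 140.00 GB (fits in 24 GB of VRAM: no)
# 4-bit: 35.00 GB (fits in 24 GB of VRAM: no)
# 3-bit: 26.25 GB (fits in 24 GB of VRAM: no)
# 2-bit: 17.50 GB (fits in 24 GB of VRAM: yes)
```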
To avoid losing too much of the model's performance, we could quantize the important parts of the model to a higher precision and the less important parts to a lower precision: the model would be quantized with mixed precision.
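To get a feel for what mixed precision buys in terms of memory, here is a hypothetical split. The layer grouping, fractions, and bit widths below are made up for illustration; they are not the assignment ExLlamaV2 would actually choose:

```python
# Hypothetical mixed-precision budget for a 70B model: keep the most sensitive
# weights at a higher precision and quantize the rest more aggressively.
# The groups, fractions, and bit widths are illustrative only.
NUM_PARAMS = 70e9

groups = {
    # name: (fraction of parameters, bits per weight)
    "sensitive layers": (0.15, 4.0),
    "remaining layers": (0.85, 2.4),
}

avg_bits = sum(frac * bits for frac, bits in groups.values())
total_gb = NUM_PARAMS * avg_bits / 8 / 1e9

print(f"average precision: {avg_bits:.2f} bits per weight")
print(f"approximate model size: {total_gb:.2f} GB")
# average precision: 2.64 bits per weight
# approximate model size: 23.10 GB
```

At an average of about 2.6 bits per weight, the whole 70B model fits within the 24 GB of a single RTX 3090/4090.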
ExLlamaV2 (MIT license) implements mixed-precision quantization.
In this article, I show how to use ExLlamaV2 to quantize models with mixed precision. In particular, we will see how to quantize Llama 2 70B to an average precision lower than 3-bit. For smaller GPUs, I show how to quantize Llama 2 13B with mixed precision. I also benchmark ExLlamaV2's computational cost for quantization. We will see that the resulting models are very fast for inference.
The notebook demonstrating mixed-precision quantization of Llama 2 with ExLlamaV2 is available here:
Note: The links marked with a “*” are Amazon affiliate links.
To quantize models with mixed precision and run them, we need to install ExLlamaV2. Note that this is a very young project (2 weeks old at the time of writing this article). Bugs are expected, but I’ve found that the project works well enough to be useful.
Install it from source:
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt

We will download models from Hugging Face Hub. We need to install transformers:
pip install transformers

We aim to run models on consumer GPUs.
Llama 2 70B: We target 24 GB of VRAM. NVIDIA RTX 3090/4090 GPUs would work.
The NVIDIA RTX 3090* is less expensive but slower than the RTX 4090*. If you do a lot of AI experiments, I recommend the RTX 4090*. It will save you a lot of time.
You cannot run the model on the free version of Google Colab; only the A100 of Google Colab PRO has enough VRAM.
Llama 2 13B: We target 12 GB of VRAM. Many GPUs with at least 12 GB of VRAM are available, such as the RTX 3060/3080/4060/4080.
If you are looking for a GPU under $500, the RTX 4060 Ti 16 GB* has the best value. It is fast and has 16 GB of VRAM.
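Before quantizing anything, the original fp16 checkpoint must be on disk. One way to fetch it is sketched below with huggingface_hub (installed as a dependency of transformers); it assumes you have accepted Meta's license for the gated Llama 2 repository and are logged in to the Hugging Face Hub:

```python
# Sketch: download the fp16 Llama 2 13B checkpoint from the Hugging Face Hub.
# Assumes access to the gated meta-llama repository and prior authentication
# (e.g., with `huggingface-cli login`).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-2-13b-hf",  # use meta-llama/Llama-2-70b-hf for the 70B model
    local_dir="./Llama-2-13b-hf",
)
print(f"Model downloaded to: {local_path}")
```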
The quantization algorithm used by ExLlamaV2 is similar to GPTQ. But instead of choosing one precision type, ExLlamaV2 tries different precision types for each layer while measuring the quantization errors. All the tries and their associated error rates are saved. Then, given a target precision provided by the user, ExLlamaV2 picks a precision for each layer's module so that, on average over the whole model, the target precision is reached with the lowest possible error.
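The snippet below is only a toy illustration of that selection step, not ExLlamaV2's actual code: the modules, candidate precisions, and error values are made up, and it assumes equal-sized modules, whereas the real implementation measures errors on calibration data inside its conversion script.

```python
# Toy illustration of mixed-precision selection (NOT ExLlamaV2's real code).
# Each module has candidate (bits, measured error) options; we pick one option
# per module so that the average precision stays at or below the target
# while the total error is as low as possible. Modules are assumed equal-sized.
from itertools import product

target_bpw = 3.0

# Made-up measurements: module name -> list of (bits, error) candidates.
candidates = {
    "attention": [(4.0, 0.01), (3.0, 0.20), (2.0, 0.60)],  # sensitive to quantization
    "mlp":       [(4.0, 0.02), (3.0, 0.05), (2.0, 0.08)],  # tolerates low precision
    "head":      [(6.0, 0.00), (4.0, 0.03), (3.0, 0.10)],
}

best_choice, best_error = None, float("inf")
for combo in product(*candidates.values()):
    avg_bits = sum(bits for bits, _ in combo) / len(combo)
    total_error = sum(err for _, err in combo)
    if avg_bits <= target_bpw and total_error < best_error:
        best_choice, best_error = combo, total_error

for module, (bits, err) in zip(candidates, best_choice):
    print(f"{module}: {bits} bits (error {err})")
# attention: 4.0 bits (error 0.01)
# mlp: 2.0 bits (error 0.08)
# head: 3.0 bits (error 0.1)
```

Even in this toy example, the sensitive attention module keeps 4-bit while the MLP drops to 2-bit, and the average still lands on the 3-bit target.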