A minimal Introduction to Quantization

Published

August 4, 2024

For the last couple of weeks, I’ve been considering writing some introductory content on quantization. After exploring a bit more, I realized there are already many great resources for it! Rather than write an in-depth introduction to the topic, I’ll give a couple of high-level explanations and link to relevant resources. I hope you find this useful! Feel free to leave a star on the GitHub repository if you do.

What is Quantization?

When we talk about models such as GPT-4, we’re referring to neural networks with billions of parameters. Each of these parameters is a number that needs to be stored with some precision. For instance, during training, parameters are usually stored as 32-bit floating-point numbers. However, for deployment and inference, we don’t need that level of precision and can therefore use fewer bits to store these numbers.

What do the different data types represent?

The following table shows the range of numbers and the precision that can be represented with different data types:

| Data Type | Range of numbers     | Precision                   |
|-----------|----------------------|-----------------------------|
| float32   | -3.4e38 to 3.4e38    | ~7 decimal digits           |
| float16   | -65504 to 65504      | ~3 decimal digits           |
| bfloat16  | -3.39e38 to 3.39e38  | ~3 decimal digits           |
| int8      | -128 to 127          | 0 decimal digits (integers) |
| int4      | -8 to 7              | 0 decimal digits (integers) |
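A quick way to get an intuition for these limits is to cast a few values to different data types. The sketch below uses NumPy (my choice for illustration; note that base NumPy has no bfloat16 type):

```python
import numpy as np

# float32 keeps roughly 7 significant digits, float16 only about 3
print(np.float32(0.123456789))  # 0.12345679
print(np.float16(0.123456789))  # 0.1235

# float16 overflows past ~65504, while bfloat16 keeps (almost) the float32 range
print(np.float16(70000.0))      # inf

# int8 only stores whole numbers in [-128, 127]
print(np.int8(100))             # 100
```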

How much memory does a model need?

Models come in all sizes! Llama 3.1, for example, came out in three sizes: 8B, 70B, and 405B. Let’s go through a quick estimate of how much memory would be needed to load a model:

  • 8B means that the model has 8 billion parameters.
  • If you want to use the model for inference, you would use 16-bit numbers (e.g., bfloat16) to store the parameters.
  • So we have 8 billion parameters, each one using 16 bits (or 2 bytes).

A quick estimate is calculated as:

\[ \text{needed\_bytes} = \text{bytes\_per\_parameter} \times \text{number\_of\_parameters} \]

For the 8B model, using 16 bits (2 bytes) per parameter, we would need:

\[ \text{needed\_bytes} = 2 \times 8 \times 10^9 = 16{,}000{,}000{,}000 \text{ bytes} = 16\,\text{GB} \]

Note that this is a very rough estimate and it’s just to load the model. You also need to take into account the memory needed for the input and output tensors, as well as the memory needed for the intermediate computations. For example, using long sequences would require more memory than using short sequences.
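If it helps, here is the same estimate as a tiny Python sketch (the function name is mine, just for illustration):

```python
def estimate_model_memory_gb(num_parameters, bits_per_parameter):
    """Rough memory needed just to load the weights (ignores activations, KV cache, etc.)."""
    bytes_per_parameter = bits_per_parameter / 8
    needed_bytes = bytes_per_parameter * num_parameters
    return needed_bytes / 1e9  # using 1GB = 1e9 bytes, as in the estimate above

# Llama 3.1 8B in bfloat16 (16 bits per parameter)
print(estimate_model_memory_gb(8e9, 16))  # 16.0
```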

Useful Napkin Math

Without going into too much detail, the following table shows the memory needed to load 2B, 8B, 70B, and 405B models using different data types:

| Model Size | float32 | float16 | int8  | int4  |
|------------|---------|---------|-------|-------|
| 2B         | 8GB     | 4GB     | 2GB   | 1GB   |
| 8B         | 32GB    | 16GB    | 8GB   | 4GB   |
| 70B        | 280GB   | 140GB   | 70GB  | 35GB  |
| 405B       | 1620GB  | 810GB   | 405GB | 202GB |
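If you want to reproduce these numbers yourself, a small loop with the helper sketched in the previous section is enough (again, a rough estimate, not an exact measurement):

```python
model_sizes = {"2B": 2e9, "8B": 8e9, "70B": 70e9, "405B": 405e9}
dtypes = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}

for name, num_parameters in model_sizes.items():
    row = {
        dtype: f"{estimate_model_memory_gb(num_parameters, bits):.0f}GB"
        for dtype, bits in dtypes.items()
    }
    print(name, row)
```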

For reference, an H100 has 80GB of memory, so loading Llama 3.1 405B in 8-bit integers (roughly 405GB) would require at least a full node of 8 H100s (640GB).

Once again, consider that these are just estimates. For training, you would need additional memory to store the gradients and optimizer states. For more precise calculations, please review the following resources:

Let’s Talk More About Quantization

Going from 32-bit floating-point numbers to 16-bit floating-point numbers is a common practice. However, you can also use 8-bit integers, 4-bit integers, or even ternary numbers! For certain architectures, such as Mixture of Experts models, even sub-1-bit-per-parameter quantization has been explored.

Some quick things to take into account

  • As you go from 32-bit to 16-bit to 8-bit, you lose precision. This means that the model will not be able to represent the same range of numbers (or as many values within that range) as before. Below 8 bits, model quality tends to degrade more noticeably. However, 8-bit and 4-bit models are very popular in the community, and there are significant efforts to push these even further (see the short sketch after this list for what quantization error looks like).
  • There are many quantization methods (AQLM, AWQ, bitsandbytes, GGUF, HQQ, etc.) and there is no single best method. The best method depends on the model, the target number of bits, the target hardware, and a few other factors. The transformers docs have a nice table comparing the features of the different quantization methods.
  • Smaller quants will use less memory, but they are not necessarily faster. This is a bit counterintuitive: on one hand, you move fewer bytes from memory, but on the other hand, some quantization methods add overhead to the computation. For example, bitsandbytes (as far as I know) does not support 4-bit compute and converts the 4-bit integers to half precision as needed.
  • Evaluating quantization precisely is not trivial. I don’t think there’s been too much discussion about this, but the recent Llama 3.1 405B release led to a situation in which different API providers were serving the same model with different quality. Fireworks AI wrote a blog post about evaluating quantization quality through different methods.
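To make the precision discussion above a bit more concrete, here is a minimal sketch of symmetric (absmax) 8-bit quantization. This is just one simple scheme for illustration, not what any particular library does under the hood:

```python
import numpy as np

def quantize_absmax_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    quantized = np.round(weights / scale).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Map the int8 values back to floats; the rounding error is the quantization error."""
    return quantized.astype(np.float32) * scale

weights = np.random.randn(4).astype(np.float32)
quantized, scale = quantize_absmax_int8(weights)
print(weights)                            # original float32 weights
print(dequantize_int8(quantized, scale))  # close, but not identical
```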

Where to learn about quantization?

Here are some resources I recommend.

If you prefer video format, there are two free courses from DeepLearning.AI + Hugging Face.

  • Quantization Fundamentals: This course shows how to quantize open access models, how to optimize any model (independently of its modality), and how to do downcasting.
  • Quantization in Depth: This course goes deeper into implementing quantization from scratch and building a general-purpose quantizer.

Quantization can also be mixed with training. In 2023, QLoRA, a method that combines a parameter-efficient fine-tuning technique (LoRA in particular) with quantization, made it possible to fine-tune 7B models even on free Google Colab instances! QLoRA is nowadays well integrated across the ecosystem (e.g., in transformers, trl for RLHF, axolotl, etc.). You can read its original blog post for more information about it.
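As a rough sketch of how QLoRA looks in practice with transformers, bitsandbytes, and peft (the model id and hyperparameters below are placeholders of mine; check the libraries’ docs for the details):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit (NF4) with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder: swap in the model you want to fine-tune
    quantization_config=bnb_config,
)

# Attach small trainable LoRA adapters on top of the 4-bit weights
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters is trainable
```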

Thanks for reading!