With the initial excitement around AI cooling down, interest in running generative AI models independently is growing. Freely available open-weight models are catching up with commercial offerings, making it increasingly appealing to host your own Large Language Model (LLM) at home or in a business setting. However, the best models usually demand high-end hardware and consume significant energy. Models like Llama 3.1 with 405 billion parameters are beyond reach for most users, even those with powerful workstations or servers, and medium-sized businesses might hesitate to invest in twelve H100 GPUs at around 30,000 Euros each.
Language models are trained at high numerical precision, but after training this precision can be reduced through quantization without much quality loss. Common quantization methods shrink weight storage from 16 bits per weight to 4, and newer methods get by with 2 bits or less. Because less data has to be moved and processed, quantized models compute responses faster, lowering electricity costs and allowing them to run on local machines and slower processors.
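A quick back-of-the-envelope calculation shows what these bit widths mean for memory. The sketch below uses a hypothetical 70-billion-parameter model as an example and ignores the small per-block scale factors that real quantized formats add:

```python
# Rough weight-storage estimate for an LLM at different bit widths.
# Ignores per-block scales/zero-points that real quantized formats store.

def weight_storage_gb(n_params: float, bits_per_weight: int) -> float:
    """GB needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # a 70-billion-parameter model, as an example

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_storage_gb(N_PARAMS, bits):6.1f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB, 2-bit: 17.5 GB
```

At 4 bits per weight, such a model drops from 140 GB to 35 GB of weights, which is why it can suddenly fit on a single 48-GB GPU.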
Smaller models like Mistral Large 2, with 123 billion parameters, or Nvidia's Nemotron, with 70 billion parameters, are nearly as capable as the largest models. Quantized, Mistral Large 2 runs on a single 80-GB GPU like an A100 or H100, and Nemotron on a 48-GB GPU such as an A6000, which costs about 6,000 Euros. Quantization even allows models with GPT-3.5-level performance to run locally on smartphones.
This article provides an overview of different quantization methods and which frameworks and hardware platforms use them efficiently. Quantization is a technique that compresses AI models, making them smaller and more manageable without significant loss in performance. It is crucial for deploying AI systems on devices with limited resources, such as smartphones or personal computers, and helps in reducing operational costs by lowering energy consumption.
Quantization involves reducing the precision of the model’s weights. Initially, models are trained with high precision, typically using 16-bit floating-point numbers. Post-training, these weights can be converted to lower precision formats like 8-bit or even 4-bit integers. This reduction in precision significantly decreases the model size and speeds up computations, making it feasible to deploy large models on less powerful hardware.
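The conversion itself can be sketched in a few lines. Below is a minimal symmetric per-tensor quantization to 8-bit integers in plain Python; real quantizers work per-channel or per-block and handle outlier weights more carefully, but the principle is the same: store small integers plus one scale factor, and reconstruct the weight as integer times scale.

```python
# Symmetric post-training quantization sketch: map weights to int8
# values in [-127, 127] plus a single float scale per tensor.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95, -0.58]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(q)        # [41, -127, 7, 93, -57] -- small integers instead of floats
print(max_err)  # rounding error is bounded by scale / 2
```

Each weight now occupies one byte instead of two (or four), and the worst-case reconstruction error per weight is half the scale step.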
For mobile devices, quantization is essential. It allows complex models to run efficiently on smartphones, enabling advanced AI applications without relying on cloud computing. This is not only cost-effective but also enhances privacy, as data processing can occur locally on the device.
Quantization is also beneficial for CPUs, which are generally less powerful than GPUs for AI tasks. By reducing the model size, quantized models can perform satisfactorily on CPUs, making high-quality AI accessible to a broader audience without the need for expensive GPU setups.
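One reason quantized models hold up well on CPUs is that the heavy inner loops can run almost entirely in integer arithmetic, which CPUs handle cheaply, with a single floating-point rescale at the end. The toy sketch below demonstrates this for a dot product, reusing the symmetric int8 scheme from above; production kernels do the same idea with SIMD integer instructions.

```python
# Toy int8 dot product: accumulate in integers, rescale once at the end.

def quantize_int8(xs: list[float]) -> tuple[list[int], float]:
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

def int8_dot(qa: list[int], sa: float, qb: list[int], sb: float) -> float:
    acc = sum(a * b for a, b in zip(qa, qb))  # pure integer accumulation
    return acc * sa * sb                      # one float multiply at the end

a = [0.5, -1.0, 0.25, 0.75]
b = [1.0, 0.5, -0.5, 2.0]
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)

exact = sum(x * y for x, y in zip(a, b))
approx = int8_dot(qa, sa, qb, sb)
print(exact, approx)  # nearly identical, despite all multiplies being integer
```

The result stays close to the full-precision value even though every multiply-accumulate inside the loop used only integers.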
In conclusion, quantization is a powerful technique that enables the deployment of large language models on a variety of devices, from high-end servers to everyday smartphones. It reduces the resource requirements and operational costs, making AI technology more accessible and sustainable. As AI continues to evolve, quantization will play a critical role in ensuring that powerful models can be used efficiently across different platforms and devices.