Since the arrival of large language models (LLMs), they have increasingly been used to automate tasks such as translation, text classification, and customer service. To use an LLM, users typically send a request to a centralized server, which processes it and sends back a response.
However, this method is expensive, energy-intensive, and often slow. Because users’ data passes through the provider’s servers, it is also exposed to potential leaks, or to loss if a system fails. To overcome these challenges, researchers have developed a technique for compressing the models themselves.
Engineers at Princeton and Stanford Engineering have proposed a new algorithm that trims redundancies and reduces the precision of an LLM’s layers. A model compressed this way can be stored and run locally on a device like a phone or laptop.
Trimming redundancy means removing excess data that doesn’t actively contribute to the output. Reducing the precision of layers means storing their numbers with fewer bits (such as 8-bit or 16-bit values) while still yielding nearly the same results.
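To make "reducing precision" concrete, here is a minimal Python sketch of uniform quantization. It is an illustration only, not the CALDERA method: the `quantize` and `dequantize` helpers, the sample weights, and the 4-bit setting are all assumptions chosen for the example.

```python
def quantize(values, bits):
    """Map floats to integer codes on 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights, stored with only 4 bits each (16 possible levels).
weights = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88]
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest level costs at most half a step per value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight now needs only a small integer code plus a shared scale and offset, at the cost of a rounding error of at most half a quantization step.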
The compressed model would not just perform nearly as accurately as the uncompressed version, but would also increase privacy, save energy, and lower costs. The new algorithm, CALDERA (Calibration Aware Low precision DEcomposition with low Rank Adaptation), will be presented in December.
“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said coauthor Rajarshi Saha.
“So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs.”
While CALDERA is not the first algorithm to compress LLMs, it combines two properties: low precision and low rank. The low-rank component reduces redundancies in the LLM weight matrices, while the low-precision component reduces the number of bits used to store them.
“Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually,” said Saha.
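The idea of combining the two properties can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' algorithm: the weight matrix `W` is made up, the low-rank part is a single rank-one term found by power iteration, its factors are kept at full precision, and no calibration data is used. All function names are hypothetical.

```python
import math


def quantize_matrix(m, bits):
    """Round every entry to one of 2**bits evenly spaced levels."""
    flat = [x for row in m for x in row]
    lo, hi = min(flat), max(flat)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [[lo + round((x - lo) / scale) * scale for x in row] for row in m]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


def rank_one(m, iterations=50):
    """Approximate m by an outer product l * r^T using power iteration."""
    rows, cols = len(m), len(m[0])
    r = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        l = [sum(x * rj for x, rj in zip(row, r)) for row in m]  # l = m r
        r = [sum(m[i][j] * l[i] for i in range(rows)) for j in range(cols)]  # r = m^T l
        norm = math.sqrt(sum(x * x for x in r)) or 1.0
        r = [x / norm for x in r]
    l = [sum(x * rj for x, rj in zip(row, r)) for row in m]
    return l, r


# Hypothetical 3x3 weight matrix.
W = [[0.9, -0.4, 0.3],
     [-0.7, 0.2, 0.6],
     [0.1, 0.8, -0.5]]

Q = quantize_matrix(W, bits=2)   # low-precision part
residual = subtract(W, Q)        # what quantization alone loses
l, r = rank_one(residual)        # low-rank correction for that loss
low_rank = [[li * rj for rj in r] for li in l]

quant_error = frobenius(residual)
combined_error = frobenius(subtract(residual, low_rank))
print(quant_error, combined_error)
```

In this toy setup the low-rank term absorbs part of the error left by coarse quantization, so the combined approximation `Q + l r^T` is closer to `W` than `Q` alone.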
To work with CALDERA, the researchers drew on the large collections of information that are used to train LLMs. The models themselves are composed of matrices, grids of numbers that store what the model has learned.
The researchers tested their algorithm on open-source large language models released by Meta AI. The team found that the low-rank component can further improve methods that rely on low precision alone.
The engineers evaluated the compressed language models on several sets of benchmark tasks. They observed an improvement of up to 5%, a significant margin for metrics of this kind.
“I think it’s encouraging and a bit surprising that we were able to get such good performance in this compression scheme,” said coauthor Andrea Goldsmith. “By taking advantage of the weight matrix rather than just using a generic compression algorithm for the bits that are representing the weight matrix, we were able to do much better.”
Journal Reference
- Saha, R., Sagan, N., Srivastava, V., Goldsmith, A. J., & Pilanci, M. (2024). Compressing Large Language Models using Low Rank and Low Precision Decomposition. ArXiv. DOI: 10.48550/arXiv.2405.18886