Quantization
1039 papers with code • 10 benchmarks • 18 datasets
Quantization is a promising technique for reducing the computation cost of neural network training by replacing high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
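To make the float-to-fixed-point mapping above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy; the helper names are illustrative and not taken from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float32 array to int8."""
    scale = np.abs(x).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 representation."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(x - x_hat).max())
```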
Most implemented papers
FastText.zip: Compressing text classification models
We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory.
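A key ingredient of this compression approach is product quantization (PQ) of the embedding matrix: each embedding is split into sub-vectors, and each sub-vector is replaced by the index of its nearest k-means centroid. A minimal NumPy/scikit-learn sketch of that idea (function names are illustrative, not from the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_encode(emb, n_subvectors=4, n_centroids=256):
    """Product-quantize an embedding matrix: each row is split into
    n_subvectors chunks, and each chunk is replaced by a centroid index."""
    n, d = emb.shape
    d_sub = d // n_subvectors
    codebooks, codes = [], []
    for s in range(n_subvectors):
        chunk = emb[:, s * d_sub:(s + 1) * d_sub]
        km = KMeans(n_clusters=n_centroids, n_init=4).fit(chunk)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_.astype(np.uint8))      # one byte per sub-vector
    return codebooks, np.stack(codes, axis=1)

def pq_decode(codebooks, codes):
    """Reconstruct approximate embeddings from the stored centroid indices."""
    return np.hstack([codebooks[s][codes[:, s]] for s in range(len(codebooks))])

emb = np.random.randn(1000, 64).astype(np.float32)
codebooks, codes = pq_encode(emb)
approx = pq_decode(codebooks, codes)
```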
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes.
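The integer-arithmetic-only scheme this line of work popularized represents a real value r as r ≈ scale · (q − zero_point), with a floating-point scale and an integer zero point chosen so that the representable range contains zero. A minimal sketch of that affine quantizer (variable names are illustrative):

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Asymmetric quantization: r ~= scale * (q - zero_point), q in [0, 2^b - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    rmin, rmax = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # range must contain 0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.uniform(-1.0, 3.0, size=(2, 5)).astype(np.float32)
q, s, z = affine_quantize(x)
print(affine_dequantize(q, s, z))
```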
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that together reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
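The "trained quantization" stage of this pipeline is a weight-sharing step: the weights that survive pruning are clustered with k-means, and each weight stores only a small cluster index. A minimal sketch of that step using scikit-learn (the paper additionally fine-tunes the shared centroids by gradient descent, which is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_weight_sharing(weights, n_clusters=16):
    """Cluster the nonzero weights of a pruned layer into n_clusters shared
    values (4-bit indices for 16 clusters) and return the quantized layer."""
    mask = weights != 0                        # preserve the pruning pattern
    nonzero = weights[mask].reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=4).fit(nonzero)
    shared = km.cluster_centers_.flatten()     # codebook of shared weight values
    quantized = np.zeros_like(weights)
    quantized[mask] = shared[km.labels_]
    return quantized, shared

w = np.random.randn(64, 64).astype(np.float32)
w[np.abs(w) < 0.5] = 0.0                       # crude magnitude pruning
w_q, codebook = kmeans_weight_sharing(w)
```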
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients.
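The core operator in DoReFa-Net is a k-bit quantizer for values in [0, 1], quantize_k(r) = round((2^k − 1)·r) / (2^k − 1), applied to transformed weights and clipped activations. A minimal NumPy sketch of that operator and the weight transformation (the straight-through gradient estimator used during training is omitted):

```python
import numpy as np

def quantize_k(r, k):
    """Quantize r in [0, 1] to k bits: round to the nearest of 2^k levels."""
    n = 2 ** k - 1
    return np.round(r * n) / n

def dorefa_weights(w, k):
    """DoReFa-style weight quantization: squash weights into [0, 1] with tanh,
    quantize to k bits, then map back to [-1, 1]."""
    t = np.tanh(w)
    r = t / (2 * np.abs(t).max()) + 0.5
    return 2 * quantize_k(r, k) - 1

w = np.random.randn(3, 3).astype(np.float32)
print(dorefa_weights(w, k=2))
```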
Billion-scale similarity search with GPUs
Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures.
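The system described in this paper is distributed as the FAISS library; below is a minimal CPU usage sketch with an exact (non-quantized) index. Available index types and GPU support depend on how faiss is installed; quantization-based indexes such as IVF-PQ follow the same add/search pattern.

```python
import numpy as np
import faiss  # library released with this work

d, nb, nq, k = 64, 10000, 5, 4
xb = np.random.random((nb, d)).astype("float32")   # database vectors
xq = np.random.random((nq, d)).astype("float32")   # query vectors

index = faiss.IndexFlatL2(d)          # exact L2 search, no compression
index.add(xb)                         # add database vectors
distances, ids = index.search(xq, k)  # k nearest neighbors per query
print(ids)
```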
QLoRA: Efficient Finetuning of Quantized LLMs
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU.
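QLoRA freezes the base model in 4-bit NF4 precision (with double quantization) and trains LoRA adapters on top. A minimal loading sketch using the Hugging Face transformers/bitsandbytes integration; the model id is a placeholder and the exact option names depend on library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, following the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",          # placeholder model id
    quantization_config=bnb_config,
)
# LoRA adapters (e.g., via the peft library) are then attached and trained
# on top of the frozen 4-bit base model.
```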
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient.
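GPTQ itself solves a layer-wise reconstruction problem using approximate second-order (Hessian) information; the sketch below is only the simple round-to-nearest, per-channel baseline that the paper compares against, not the GPTQ algorithm.

```python
import numpy as np

def rtn_quantize_weights(W, num_bits=4):
    """Round-to-nearest, per-output-channel, asymmetric weight quantization.
    This is the naive baseline; GPTQ improves on it by compensating remaining
    weights using approximate second-order information."""
    qmax = 2 ** num_bits - 1
    wmin = W.min(axis=1, keepdims=True)
    wmax = W.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    zero = np.round(-wmin / scale)
    Q = np.clip(np.round(W / scale) + zero, 0, qmax)
    return (Q - zero) * scale            # dequantized weights, for error measurement

W = np.random.randn(8, 16).astype(np.float32)
W_q = rtn_quantize_weights(W)
print("mean abs error:", np.abs(W - W_q).mean())
```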
GLM-130B: An Open Bilingual Pre-trained Model
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.