GLM-130B: An Open Bilingual Pre-trained Model

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly with loss spikes and divergence. In this paper, we introduce the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, a performance advantage that is not observed in OPT-175B or BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training, with almost no performance loss, making it the first among 100B-scale models and, more importantly, allowing its effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at \url{https://github.com/THUDM/GLM-130B/}.
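The abstract's claim of "INT4 quantization without post-training" refers to weight-only quantization that needs no calibration data. As a minimal illustration of the general idea (a generic symmetric absmax scheme, not the exact method used for GLM-130B):

```python
import numpy as np

def quantize_int4_absmax(w: np.ndarray):
    """Symmetric per-row absmax quantization of a weight matrix to 4 bits.

    Illustrative sketch only: one FP scale per output row, integers in
    [-7, 7]. No calibration data is needed, which is why such schemes
    are called quantization "without post-training".
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate FP32 weight matrix from INT4 codes."""
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix; the per-element error is bounded
# by half a quantization step (scale / 2).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quantize_int4_absmax(w)
err = np.abs(w - dequantize(q, s)).max()
```

In practice the quantized integers would be packed two-per-byte for storage; the sketch keeps them in `int8` for readability.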

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Long-Context Understanding | Ada-LEval (BestAnswer) | ChatGLM3-6b-32k | 1k | 39.8 | # 7 |
| | | | 2k | 18.8 | # 7 |
| | | | 4k | 9.0 | # 7 |
| | | | 6k | 5.0 | # 6 |
| | | | 8k | 3.4 | # 6 |
| | | | 12k | 0.9 | # 9 |
| | | | 16k | 0.5 | # 9 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | ChatGLM2-6b-32k | 1k | 31.2 | # 10 |
| | | | 2k | 10.9 | # 9 |
| | | | 4k | 4.5 | # 10 |
| | | | 6k | 1.6 | # 10 |
| | | | 8k | 1.6 | # 10 |
| | | | 12k | 0.0 | # 10 |
| | | | 16k | 0.3 | # 10 |
| Long-Context Understanding | Ada-LEval (TSort) | ChatGLM3-6b-32k | 2k | 2.3 | # 9 |
| | | | 4k | 2.4 | # 8 |
| | | | 8k | 2.0 | # 9 |
| | | | 16k | 0.7 | # 10 |
| Long-Context Understanding | Ada-LEval (TSort) | ChatGLM2-6b-32k | 2k | 0.9 | # 10 |
| | | | 4k | 0.2 | # 10 |
| | | | 8k | 0.7 | # 10 |
| | | | 16k | 0.9 | # 9 |
| Language Modelling | BIG-bench-lite | GLM-130B (0-shot) | Accuracy | 13.31 | # 3 |
| Language Modelling | BIG-bench-lite | GLM-130B (1-shot) | Accuracy | 14.91 | # 2 |
| Language Modelling | BIG-bench-lite | GLM-130B (3-shot) | Accuracy | 15.11 | # 1 |
| Language Modelling | CLUE (AFQMC) | GLM-130B | Accuracy | 71.2 | # 1 |
| Language Modelling | CLUE (AFQMC) | ERNIE 3.0 Titan-260B | Accuracy | 69.0 | # 2 |
| Language Modelling | CLUE (C3) | GLM-130B | Accuracy | 77.5 | # 1 |
| Language Modelling | CLUE (C3) | ERNIE 3.0 Titan-260B | Accuracy | 54.9 | # 2 |
| Language Modelling | CLUE (CMNLI) | GLM-130B | Accuracy | 77.0 | # 1 |
| Language Modelling | CLUE (CMNLI) | ERNIE 3.0 Titan-260B | Accuracy | 51.7 | # 2 |
| Language Modelling | CLUE (CMRC2018) | GLM-130B | Accuracy | 55.7 | # 1 |
| Language Modelling | CLUE (CMRC2018) | ERNIE 3.0 Titan-260B | Accuracy | 16.6 | # 2 |
| Language Modelling | CLUE (DRCD) | GLM-130B | Accuracy | 77.1 | # 1 |
| Language Modelling | CLUE (DRCD) | ERNIE 3.0 Titan-260B | Accuracy | 29.5 | # 2 |
| Language Modelling | CLUE (OCNLI_50K) | GLM-130B | Accuracy | 74.7 | # 1 |
| Language Modelling | CLUE (OCNLI_50K) | ERNIE 3.0 Titan-260B | Accuracy | 44.6 | # 2 |
| Language Modelling | CLUE (WSC1.1) | GLM-130B | Accuracy | 83.9 | # 1 |
| Language Modelling | CLUE (WSC1.1) | ERNIE 3.0 Titan-260B | Accuracy | 81.1 | # 2 |
| Language Modelling | FewCLUE (BUSTM) | GLM-130B | Accuracy | 77.5 | # 1 |
| Language Modelling | FewCLUE (BUSTM) | ERNIE 3.0 Titan-260B | Accuracy | 64.4 | # 2 |
| Language Modelling | FewCLUE (CHID-FC) | GLM-130B | Accuracy | 90.1 | # 1 |
| Language Modelling | FewCLUE (CHID-FC) | ERNIE 3.0 Titan-260B | Accuracy | 87.1 | # 2 |
| Language Modelling | FewCLUE (CLUEWSC-FC) | GLM-130B | Accuracy | 77.4 | # 1 |
| Language Modelling | FewCLUE (CLUEWSC-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.5 | # 2 |
| Language Modelling | FewCLUE (EPRSTMT) | GLM-130B | Accuracy | 92.5 | # 1 |
| Language Modelling | FewCLUE (EPRSTMT) | ERNIE 3.0 Titan-260B | Accuracy | 88.8 | # 2 |
| Language Modelling | FewCLUE (OCNLI-FC) | GLM-130B | Accuracy | 73.8 | # 1 |
| Language Modelling | FewCLUE (OCNLI-FC) | ERNIE 3.0 Titan-260B | Accuracy | 53.8 | # 2 |
| Language Modelling | LAMBADA | GLM-130B (bidirectional attention) | Accuracy | 80.2 | # 12 |
| Multi-task Language Understanding | MMLU | GLM-130B | Average (%) | 44.8 | # 72 |
| Language Modelling | The Pile | GLM-130B | Bits per byte | 0.634 | # 1 |
| Language Modelling | The Pile | Jurassic-1 | Bits per byte | 0.65 | # 2 |
| Language Modelling | The Pile | GPT-3 | Bits per byte | 0.742 | # 4 |
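The Pile entries above are scored in bits per byte rather than accuracy: the model's cross-entropy is normalized by the byte length of the text, so models with different tokenizers remain comparable (lower is better). A minimal sketch of the conversion, with made-up numbers rather than leaderboard figures:

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert average cross-entropy (nats per token) to bits per byte.

    Total nats = loss * n_tokens; divide by ln(2) to get bits,
    then by the byte count of the evaluated text.
    """
    return loss_nats_per_token * n_tokens / (math.log(2) * n_bytes)

# Hypothetical corpus slice: 1.9 nats/token over 1M tokens spanning 4.3M bytes.
bpb = bits_per_byte(1.9, 1_000_000, 4_300_000)  # ~0.64 bits per byte
```

Because the byte count is fixed for a given test set, a model with a more aggressive tokenizer (fewer tokens, higher per-token loss) is not unfairly favored by this metric.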
