TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	2k	18.8	# 7
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	4k	9.0	# 7
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	8k	3.4	# 6
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	16k	0.5	# 9
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	1k	39.8	# 7
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	6k	5.0	# 6
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM3-6b-32k	12k	0.9	# 9
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	2k	10.9	# 9
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	4k	4.5	# 10
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	8k	1.6	# 10
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	16k	0.3	# 10
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	1k	31.2	# 10
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	6k	1.6	# 10
Long-Context Understanding	Ada-LEval (BestAnswer)	ChatGLM2-6b-32k	12k	0.0	# 10
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM3-6b-32k	2k	2.3	# 9
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM3-6b-32k	4k	2.4	# 8
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM3-6b-32k	8k	2.0	# 9
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM3-6b-32k	16k	0.7	# 10
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM2-6b-32k	2k	0.9	# 10
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM2-6b-32k	4k	0.2	# 10
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM2-6b-32k	8k	0.7	# 10
Long-Context Understanding	Ada-LEval (TSort)	ChatGLM2-6b-32k	16k	0.9	# 9
Language Modelling	BIG-bench-lite	GLM-130B (3-shot)	Accuracy	15.11	# 1
Language Modelling	BIG-bench-lite	GLM-130B (0-shot)	Accuracy	13.31	# 3
Language Modelling	BIG-bench-lite	GLM-130B (1-shot)	Accuracy	14.91	# 2
Language Modelling	CLUE (AFQMC)	GLM-130B	Accuracy	71.2	# 1
Language Modelling	CLUE (AFQMC)	ERNIE 3.0 Titan-260B	Accuracy	69.0	# 2
Language Modelling	CLUE (C3)	ERNIE 3.0 Titan-260B	Accuracy	54.9	# 2
Language Modelling	CLUE (C3)	GLM-130B	Accuracy	77.5	# 1
Language Modelling	CLUE (CMNLI)	ERNIE 3.0 Titan-260B	Accuracy	51.7	# 2
Language Modelling	CLUE (CMNLI)	GLM-130B	Accuracy	77.0	# 1
Language Modelling	CLUE (CMRC2018)	ERNIE 3.0 Titan-260B	Accuracy	16.6	# 2
Language Modelling	CLUE (CMRC2018)	GLM-130B	Accuracy	55.7	# 1
Language Modelling	CLUE (DRCD)	GLM-130B	Accuracy	77.1	# 1
Language Modelling	CLUE (DRCD)	ERNIE 3.0 Titan-260B	Accuracy	29.5	# 2
Language Modelling	CLUE (OCNLI_50K)	GLM-130B	Accuracy	74.7	# 1
Language Modelling	CLUE (OCNLI_50K)	ERNIE 3.0 Titan-260B	Accuracy	44.6	# 2
Language Modelling	CLUE (WSC1.1)	ERNIE 3.0 Titan-260B	Accuracy	81.1	# 2
Language Modelling	CLUE (WSC1.1)	GLM-130B	Accuracy	83.9	# 1
Language Modelling	FewCLUE (BUSTM)	GLM-130B	Accuracy	77.5	# 1
Language Modelling	FewCLUE (BUSTM)	ERNIE 3.0 Titan-260B	Accuracy	64.4	# 2
Language Modelling	FewCLUE (CHID-FC)	ERNIE 3.0 Titan-260B	Accuracy	87.1	# 2
Language Modelling	FewCLUE (CHID-FC)	GLM-130B	Accuracy	90.1	# 1
Language Modelling	FewCLUE (CLUEWSC-FC)	GLM-130B	Accuracy	77.4	# 1
Language Modelling	FewCLUE (CLUEWSC-FC)	ERNIE 3.0 Titan-260B	Accuracy	53.5	# 2
Language Modelling	FewCLUE (EPRSTMT)	GLM-130B	Accuracy	92.5	# 1
Language Modelling	FewCLUE (EPRSTMT)	ERNIE 3.0 Titan-260B	Accuracy	88.8	# 2
Language Modelling	FewCLUE (OCNLI-FC)	GLM-130B	Accuracy	73.8	# 1
Language Modelling	FewCLUE (OCNLI-FC)	ERNIE 3.0 Titan-260B	Accuracy	53.8	# 2
Language Modelling	LAMBADA	GLM-130B (bidirectional attention)	Accuracy	80.2	# 12
Multi-task Language Understanding	MMLU	GLM-130B	Average (%)	44.8	# 72
Language Modelling	The Pile	GLM-130B	Bits per byte	0.634	# 1
Language Modelling	The Pile	GPT-3	Bits per byte	0.742	# 4
Language Modelling	The Pile	Jurassic-1	Bits per byte	0.65	# 2
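
The CLUE and FewCLUE rows above list a GLM-130B score and an ERNIE 3.0 Titan-260B score for each dataset. As a rough illustration of how large the gaps are, the sketch below computes the per-dataset accuracy margin. The values are copied from the table; the names `SCORES` and `margins` are hypothetical and not part of any released toolkit.

```python
# Accuracy pairs copied from the CLUE / FewCLUE rows of the table:
# dataset -> (GLM-130B, ERNIE 3.0 Titan-260B)
SCORES = {
    "CLUE (AFQMC)": (71.2, 69.0),
    "CLUE (C3)": (77.5, 54.9),
    "CLUE (CMNLI)": (77.0, 51.7),
    "CLUE (CMRC2018)": (55.7, 16.6),
    "CLUE (DRCD)": (77.1, 29.5),
    "CLUE (OCNLI_50K)": (74.7, 44.6),
    "CLUE (WSC1.1)": (83.9, 81.1),
    "FewCLUE (BUSTM)": (77.5, 64.4),
    "FewCLUE (CHID-FC)": (90.1, 87.1),
    "FewCLUE (CLUEWSC-FC)": (77.4, 53.5),
    "FewCLUE (EPRSTMT)": (92.5, 88.8),
    "FewCLUE (OCNLI-FC)": (73.8, 53.8),
}


def margins(scores):
    """Return {dataset: GLM-130B accuracy minus ERNIE accuracy}, in points."""
    return {name: round(glm - ernie, 1) for name, (glm, ernie) in scores.items()}


gaps = margins(SCORES)

# Print datasets from the smallest to the largest margin.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<24}{gap:+6.1f}")
```

On these twelve datasets the margin is positive everywhere, ranging from 2.2 points on CLUE (AFQMC) to 47.6 points on CLUE (DRCD).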

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-big-bench-lite)](https://paperswithcode.com/sota/language-modelling-on-big-bench-lite?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-afqmc)](https://paperswithcode.com/sota/language-modelling-on-clue-afqmc?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-c3)](https://paperswithcode.com/sota/language-modelling-on-clue-c3?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-cmnli)](https://paperswithcode.com/sota/language-modelling-on-clue-cmnli?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-cmrc2018)](https://paperswithcode.com/sota/language-modelling-on-clue-cmrc2018?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-drcd)](https://paperswithcode.com/sota/language-modelling-on-clue-drcd?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-ocnli-50k)](https://paperswithcode.com/sota/language-modelling-on-clue-ocnli-50k?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-clue-wsc1-1)](https://paperswithcode.com/sota/language-modelling-on-clue-wsc1-1?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-fewclue-bustm)](https://paperswithcode.com/sota/language-modelling-on-fewclue-bustm?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-fewclue-chid-fc)](https://paperswithcode.com/sota/language-modelling-on-fewclue-chid-fc?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-fewclue-cluewsc-fc)](https://paperswithcode.com/sota/language-modelling-on-fewclue-cluewsc-fc?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-fewclue-eprstmt)](https://paperswithcode.com/sota/language-modelling-on-fewclue-eprstmt?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-fewclue-ocnli-fc)](https://paperswithcode.com/sota/language-modelling-on-fewclue-ocnli-fc?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-the-pile)](https://paperswithcode.com/sota/language-modelling-on-the-pile?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/long-context-understanding-on-ada-leval)](https://paperswithcode.com/sota/long-context-understanding-on-ada-leval?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/long-context-understanding-on-ada-leval-tsort)](https://paperswithcode.com/sota/long-context-understanding-on-ada-leval-tsort?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/language-modelling-on-lambada)](https://paperswithcode.com/sota/language-modelling-on-lambada?p=glm-130b-an-open-bilingual-pre-trained-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glm-130b-an-open-bilingual-pre-trained-model/multi-task-language-understanding-on-mmlu)](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu?p=glm-130b-an-open-bilingual-pre-trained-model)`
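
Every badge row above follows the same template: a shields.io endpoint image that points at a Papers with Code badge URL, wrapped in a link to the matching leaderboard. A minimal sketch of that template follows. It is inferred from the rows themselves rather than from any documented API, and the function name `pwc_badge` is hypothetical.

```python
BADGE_HOST = "https://img.shields.io/endpoint.svg"
PWC = "https://paperswithcode.com"


def pwc_badge(paper_slug: str, task_slug: str) -> str:
    """Build badge markdown in the shape used by the rows above.

    Both slugs are assumed to be already URL-safe, as they are in the table.
    """
    badge_url = f"{BADGE_HOST}?url={PWC}/badge/{paper_slug}/{task_slug}"
    leaderboard_url = f"{PWC}/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({leaderboard_url})"


PAPER = "glm-130b-an-open-bilingual-pre-trained-model"
print(pwc_badge(PAPER, "language-modelling-on-the-pile"))
```

Called with the paper slug and `language-modelling-on-the-pile`, this reproduces the markdown listed for The Pile.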

GLM-130B: An Open Bilingual Pre-trained Model

5 Oct 2022 · Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, WenGuang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly loss spikes and divergence. In this paper, we describe the training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resulting GLM-130B model significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, a performance advantage that is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE 3.0 Titan 260B, the largest Chinese language model, across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post-training and with almost no performance loss, making it the first 100B-scale model to do so and, more importantly, allowing effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible, and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
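
The hardware claim in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only. As an assumption, it ignores activations, attention caches, quantization scales, and any layers kept at higher precision, and it reads "24G" and "11G" as 24 GB and 11 GB per card. The result is therefore a lower bound on memory, not a measured footprint.

```python
PARAMS = 130e9  # parameter count stated in the abstract


def weight_gb(params: float, bits: int) -> float:
    """Weight storage in GB (10^9 bytes), ignoring all runtime overhead."""
    return params * bits / 8 / 1e9


# Aggregate memory of the two GPU setups named in the abstract, in GB.
SETUPS = {
    "4x RTX 3090 (24G)": 4 * 24,
    "8x RTX 2080 Ti (11G)": 8 * 11,
}

for bits in (16, 8, 4):
    need = weight_gb(PARAMS, bits)
    fits = [name for name, vram in SETUPS.items() if need <= vram]
    print(f"{bits:>2}-bit weights: {need:6.1f} GB, fits on: {fits or 'neither setup'}")
```

Under these assumptions, INT4 weights take about 65 GB, which leaves roughly 31 GB and 23 GB of headroom on the two setups, while 8-bit (130 GB) and 16-bit (260 GB) weights exceed both.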

Code

Repository	Stars
thudm/glm-130b (official)	7,610
thudm/chatglm-6b	39,275
thudm/chatglm2-6b	15,480
thudm/chatglm	12,035
thudm/chatglm3	12,035

See all 10 implementations

Tasks

Language Modelling

Long-Context Understanding

Multi-task Language Understanding

Quantization

Datasets

MMLU

The Pile

BIG-bench

LAMBADA

CLUE

FewCLUE

Results from the Paper

Ranked #1 on Language Modelling on CLUE (OCNLI_50K)


Methods

Adam • Attention Dropout • AWARE • BPE • Cosine Annealing • Dense Connections • Dropout • ERNIE • Fixed Factorized Attention • GELU • GLM • GPT-3 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Weight Decay
