Language Models are Unsupervised Multitask Learners
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
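The zero-shot behaviour described above amounts to conditioning the language model on a task-formatted prompt (here, a document followed by a question) and letting it generate the answer as a continuation. The snippet below is a minimal sketch of that setup using the Hugging Face transformers library and the public gpt2 checkpoint; the prompt template and decoding settings are illustrative assumptions, not the paper's exact CoQA evaluation protocol.

```python
# Minimal sketch of zero-shot reading comprehension by prompting a language
# model, in the spirit of the paper's "document plus questions" setup.
# Assumes the Hugging Face `transformers` package and the public "gpt2"
# checkpoint; prompt wording and decoding settings are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

document = (
    "The Apollo 11 mission landed the first humans on the Moon in July 1969. "
    "Neil Armstrong was the first person to step onto the lunar surface."
)
question = "Who was the first person to walk on the Moon?"

# Condition on the document plus the question; the model continues after "A:".
prompt = f"{document}\nQ: {question}\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,                     # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens and cut at the first newline.
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
print(answer.split("\n")[0].strip())
```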
Datasets

Introduced in the paper: WebText

Used in the paper: GLUE, Natural Questions, Penn Treebank, WikiText-2, WikiText-103, CNN/Daily Mail, WSC, CoQA, LAMBADA, One Billion Word Benchmark, Children's Book Test (CBT), decaNLP, BookTest, Text8, SIMMC2.0

Results from the Paper
Ranked #1 on Language Modelling on enwik8 (using extra training data)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Document Summarization | CNN / Daily Mail | GPT-2 | ROUGE-1 | 29.34 | #26 |
| Document Summarization | CNN / Daily Mail | GPT-2 | ROUGE-2 | 8.27 | #26 |
| Document Summarization | CNN / Daily Mail | GPT-2 | ROUGE-L | 26.58 | #26 |
| Language Modelling | enwik8 | GPT-2 (48 layers, h=1600) | Bits per character (BPC) | 0.93 | #1 |
| Language Modelling | enwik8 | GPT-2 (48 layers, h=1600) | Number of params | 1542M | #1 |
| Language Modelling | LAMBADA | GPT-2 1.5B (zero-shot) | Accuracy | 63.24 | #29 |
| Language Modelling | LAMBADA | GPT-2 1.5B (zero-shot) | Perplexity | 8.63 | #10 |
| Language Modelling | One Billion Word | GPT-2 | Perplexity (PPL) | 42.16 | #21 |
| Language Modelling | One Billion Word | GPT-2 | Number of params | 1.54B | #1 |
| Language Modelling | Penn Treebank (word level) | GPT-2 | Test perplexity | 35.76 | #3 |
| Language Modelling | Penn Treebank (word level) | GPT-2 | Number of params | 1542M | #2 |
| Dialogue State Tracking | SIMMC2.0 | GPT-2 | Slot F1 | 81.7 | #4 |
| Dialogue State Tracking | SIMMC2.0 | GPT-2 | Act F1 | 94.5 | #4 |
| Response Generation | SIMMC2.0 | GPT-2 | BLEU | 19.2 | #5 |
| Language Modelling | Text8 | GPT-2 | Bits per character (BPC) | 0.98 | #1 |
| Language Modelling | Text8 | GPT-2 | Number of params | 1542M | #1 |
| Language Modelling | WikiText-103 | GPT-2 Large | Test perplexity | 22.05 | #46 |
| Language Modelling | WikiText-103 | GPT-2 Large | Number of params | 774M | #8 |
| Language Modelling | WikiText-103 | GPT-2 Medium | Test perplexity | 26.37 | #63 |
| Language Modelling | WikiText-103 | GPT-2 Medium | Number of params | 355M | #10 |
| Language Modelling | WikiText-103 | GPT-2 Full | Test perplexity | 17.48 | #25 |
| Language Modelling | WikiText-103 | GPT-2 Full | Number of params | 1542M | #6 |
| Language Modelling | WikiText-103 | GPT-2 Small | Test perplexity | 37.50 | #79 |
| Language Modelling | WikiText-103 | GPT-2 Small | Number of params | 124M | #39 |
| Language Modelling | WikiText-2 | GPT-2 (small) | Test perplexity | 29.41 | #9 |
| Language Modelling | WikiText-2 | GPT-2 (small) | Number of params | 117M | #7 |
| Language Modelling | WikiText-2 | GPT-2 (medium) | Test perplexity | 22.76 | #8 |
| Language Modelling | WikiText-2 | GPT-2 (medium) | Number of params | 345M | #5 |
| Language Modelling | WikiText-2 | GPT-2 (large) | Test perplexity | 19.93 | #7 |
| Language Modelling | WikiText-2 | GPT-2 (large) | Number of params | 762M | #3 |
| Language Modelling | WikiText-2 | GPT-2 | Test perplexity | 18.34 | #6 |
| Language Modelling | WikiText-2 | GPT-2 | Number of params | 1542M | #1 |
| Coreference Resolution | Winograd Schema Challenge | GPT-2-XL 1.5B | Accuracy | 70.7 | #33 |
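Most of the language-modelling rows above report test perplexity or bits per character (BPC). As a rough guide to how such numbers are obtained, the sketch below scores a text sample with a pretrained GPT-2 checkpoint from Hugging Face transformers and converts the mean per-token cross-entropy into perplexity and BPC; the official benchmark figures additionally rely on dataset-specific preprocessing and detokenization that this simplified example does not reproduce.

```python
# Hedged sketch: per-token perplexity and bits-per-character for a text
# sample under a pretrained GPT-2 checkpoint (Hugging Face `transformers`).
# The leaderboard numbers in the table use benchmark-specific preprocessing
# that this simplified example does not attempt to match.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Language models can be evaluated by how well they predict held-out text."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # With labels=input_ids the model returns the mean cross-entropy (in nats)
    # over the predicted (shifted) tokens.
    loss = model(input_ids, labels=input_ids).loss.item()

n_predicted = input_ids.shape[1] - 1              # number of predicted tokens
total_nats = loss * n_predicted
perplexity = math.exp(loss)                       # per-token perplexity
bpc = total_nats / (len(text) * math.log(2))      # bits per character

print(f"perplexity per token: {perplexity:.2f}")
print(f"bits per character:   {bpc:.3f}")
```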