TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Question Answering	Natural Questions	RETRO + DPR (full)	EM	45.5	# 9
Language Modelling	WikiText-103	RETRO (7.5B)	Test perplexity	2.4	# 1
Language Modelling	WikiText-103	RETRO (7.5B)	Number of params	7532M	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-language-models-by-retrieving-from/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=improving-language-models-by-retrieving-from)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/improving-language-models-by-retrieving-from/question-answering-on-natural-questions)](https://paperswithcode.com/sota/question-answering-on-natural-questions?p=improving-language-models-by-retrieving-from)`

Improving language models by retrieving from trillions of tokens

8 Dec 2021 · Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
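The retrieval step described in the abstract (split the input into chunks, then fetch the most similar chunks from a database) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-tokens `embed` function is a toy stand-in for the frozen BERT retriever, the function names `split_into_chunks`, `embed`, `cosine` and `retrieve` are hypothetical, and the chunk size and number of neighbours are illustrative parameters. The encoder and chunked cross-attention stages are not covered.

```python
import math
from collections import Counter


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks; the last one may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed(chunk):
    """Toy embedding (assumption): token counts instead of a frozen BERT vector."""
    return Counter(chunk)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing tokens
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_chunk, database, k):
    """Return the k database chunks most similar to the query chunk."""
    query = embed(query_chunk)
    ranked = sorted(database, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical retrieval database of pre-chunked text.
database = [
    ["the", "cat", "sat", "on"],
    ["stock", "prices", "fell", "today"],
    ["a", "cat", "sat", "down"],
]

tokens = ["the", "cat", "sat", "up", "prices", "fell", "again", "today"]
for chunk in split_into_chunks(tokens, chunk_size=4):
    # In a retrieval-enhanced model, these neighbours would condition the
    # prediction of the tokens that follow the chunk.
    print(chunk, "->", retrieve(chunk, database, k=1))
```

A brute-force scan like this only works for tiny databases; at the scale named in the abstract, an approximate nearest-neighbour index would be needed instead.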

Code

labmlai/annotated_deep_learning_pap…

↳ View annotated code at labml.ai

48,096

lucidrains/RETRO-pytorch

827

Tasks

Language Modelling

Question Answering

Retrieval

Datasets

Natural Questions

WikiText-2

WikiText-103

The Pile

LAMBADA

MassiveText

Results from the Paper

Ranked #1 on Language Modelling on WikiText-103 (using extra training data)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Question Answering	Natural Questions	RETRO + DPR (full)	EM	45.5	# 9
Language Modelling	WikiText-103	RETRO (7.5B)	Test perplexity	2.4	# 1
Language Modelling	WikiText-103	RETRO (7.5B)	Number of params	7532M	# 4

Methods

Absolute Position Encodings • Adam • Attention Dropout • BERT • BPE • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • GPT-3 • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Linear Warmup With Linear Decay • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Transformer • Weight Decay • WordPiece
