TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	enwiki8	PAR Transformer 24B	Bit per Character (BPC)	1.11	# 1
Sentiment Analysis	SST-2 Binary classification	PAR BERT Base	Accuracy	91.6	# 50
Language Modelling	Text8	PAR Transformer 24B	Bit per Character (BPC)	1.18	# 13
Language Modelling	WikiText-103	PAR Transformer Large	Test perplexity	18.4	# 35
Language Modelling	WikiText-103	PAR Transformer Base	Test perplexity	22.7	# 48

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pay-attention-when-required/language-modelling-on-enwiki8-1)](https://paperswithcode.com/sota/language-modelling-on-enwiki8-1?p=pay-attention-when-required)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pay-attention-when-required/language-modelling-on-text8)](https://paperswithcode.com/sota/language-modelling-on-text8?p=pay-attention-when-required)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pay-attention-when-required/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=pay-attention-when-required)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pay-attention-when-required/sentiment-analysis-on-sst-2-binary)](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary?p=pay-attention-when-required)`

Pay Attention when Required

9 Sep 2020 · Swetha Mandava, Szymon Migacz, Alex Fit Florea

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed the PAR Transformer. It needs 35% less compute time than Transformer-XL, which is achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, and it retains the perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
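The block-replacement idea in the abstract can be illustrated with a small sketch. It writes a model as a string of block types, `s` for self-attention and `f` for feed-forward, and counts how many self-attention blocks a variant drops relative to an interleaved baseline. The 16-layer baseline, the choice to keep 6 self-attention blocks, and the decision to keep the earliest ones are all assumptions made for illustration; the abstract only states the ~63% figure and does not give the actual block ordering. The helper names are hypothetical.

```python
def interleaved(num_layers: int) -> str:
    """Baseline ordering: every layer is self-attention ('s') then feed-forward ('f')."""
    return "sf" * num_layers


def replace_attention(pattern: str, keep: int) -> str:
    """Keep the first `keep` self-attention blocks and turn the rest into feed-forward blocks.

    Keeping the *first* blocks is an illustrative assumption, not the paper's stated ordering.
    """
    out = []
    kept = 0
    for block in pattern:
        if block == "s" and kept >= keep:
            out.append("f")  # this self-attention block is replaced
        else:
            if block == "s":
                kept += 1
            out.append(block)
    return "".join(out)


def replaced_fraction(baseline: str, variant: str) -> float:
    """Fraction of the baseline's self-attention blocks that the variant no longer has."""
    return 1 - variant.count("s") / baseline.count("s")


baseline = interleaved(16)                      # assumed 16-layer baseline
variant = replace_attention(baseline, keep=6)   # assumed: 6 self-attention blocks kept
print(variant)
print(f"{replaced_fraction(baseline, variant):.1%}")  # 62.5%, in line with the ~63% quoted above
```

Under these assumptions the total block count stays at 32, so the saving comes entirely from swapping expensive blocks for cheaper ones rather than from making the model shallower.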

Code

NVIDIA/DeepLearningExamples

12,603

Jmkernes/PAR-Transformer-XL

Tasks

Language Modelling

Paraphrase Identification

Question Answering

Sentiment Analysis

Datasets

GLUE

SST

SQuAD

SST-2

WikiText-2

WikiText-103

Text8

Results from the Paper

Ranked #1 on Language Modelling on enwiki8

Methods

Absolute Position Encodings • Adam • Adaptive Input Representations • Adaptive Softmax • Attention Dropout • BERT • BPE • Cosine Annealing • Dense Connections • Dropout • GELU • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Linear Warmup With Linear Decay • Multi-Head Attention • PAR Transformer • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Transformer-XL • Variational Dropout • Weight Decay • WordPiece
