TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	enwik8	Transformer-LS (large)	Bit per Character (BPC)	0.97	# 8
Language Modelling	enwik8	Transformer-LS (large)	Number of params	110M	# 11
Language Modelling	enwik8	Transformer-LS (small)	Bit per Character (BPC)	0.99	# 12
Language Modelling	enwik8 dev	Transformer-LS (small)	Bit per Character (BPC)	1.01	# 1
Language Modelling	Text8	Transformer-LS (small)	Bit per Character (BPC)	1.09	# 7
Language Modelling	Text8 dev	Transformer-LS (small)	Bit per Character (BPC)	1.03	# 1
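
For readers unfamiliar with the metric in the table above: bits per character (BPC) is the average negative log2-likelihood a model assigns to each character, so lower is better. The sketch below shows the conversion from a mean cross-entropy in nats, which is what most training frameworks report. The function names are illustrative and not taken from the paper or its code.

```python
import math

def bits_per_character(mean_nll_nats):
    """Convert a mean per-character negative log-likelihood in nats to bits."""
    return mean_nll_nats / math.log(2)

def bpc_from_probs(probs):
    """BPC from the probabilities a model assigned to each correct character."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model that assigns probability 0.5 to every correct character scores 1 BPC.
print(bits_per_character(math.log(2)))
print(bpc_from_probs([0.5, 0.25]))
```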

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/long-short-transformer-efficient-transformers/language-modelling-on-enwik8-dev)](https://paperswithcode.com/sota/language-modelling-on-enwik8-dev?p=long-short-transformer-efficient-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/long-short-transformer-efficient-transformers/language-modelling-on-text8-dev)](https://paperswithcode.com/sota/language-modelling-on-text8-dev?p=long-short-transformer-efficient-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/long-short-transformer-efficient-transformers/language-modelling-on-text8)](https://paperswithcode.com/sota/language-modelling-on-text8?p=long-short-transformer-efficient-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/long-short-transformer-efficient-transformers/language-modelling-on-enwiki8)](https://paperswithcode.com/sota/language-modelling-on-enwiki8?p=long-short-transformer-efficient-transformers)`

Long-Short Transformer: Efficient Transformers for Language and Vision

NeurIPS 2021 · Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224x224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls .
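
The abstract's description of the mechanism can be illustrated with a rough, single-head, bidirectional NumPy sketch. It is not the released NVIDIA implementation, and several details are assumptions made for illustration: the projection weights `w_proj`, the rank `r` of the long-range summary, the one-sided window size `window`, and the use of a parameter-free layer norm on each branch as a stand-in for the dual normalization. Each query attends to its local window plus `r` projected summaries, so the per-query cost does not grow with the sequence length.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm over the feature axis (assumption: no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def long_short_attention(q, k, v, w_proj, window):
    """q, k, v: (n, d) arrays; w_proj: (d, r) hypothetical projection weights."""
    n, d = q.shape

    # Long-range branch: input-dependent ("dynamic") projection that compresses
    # the n keys and values into r summary vectors.
    p = softmax(k @ w_proj, axis=0)        # (n, r), normalized over the sequence
    k_long = layer_norm(p.T @ k)           # (r, d)
    v_long = layer_norm(p.T @ v)           # (r, d)

    # Short-term branch: keys and values normalized separately, so both
    # branches enter the shared softmax at a comparable scale.
    k_loc = layer_norm(k)
    v_loc = layer_norm(v)

    out = np.empty_like(q, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = np.concatenate([k_loc[lo:hi], k_long])   # at most 2*window + 1 + r rows
        vals = np.concatenate([v_loc[lo:hi], v_long])
        attn = softmax(keys @ q[i] / np.sqrt(d), axis=0)
        out[i] = attn @ vals
    return out

rng = np.random.default_rng(0)
n, d, r = 16, 8, 4
x = rng.normal(size=(n, d))
out = long_short_attention(x, x, x, rng.normal(size=(d, r)), window=2)
print(out.shape)
```

The explicit loop keeps the sketch readable; a practical implementation would batch the windowed computation and add multiple heads, learned projections, and causal masking for the autoregressive case.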

Code

Repository	Stars
NVIDIA/transformer-ls (official)	219
keonlee9420/Comprehensive-Transform…	313
lucidrains/long-short-transformer	117

Tasks

Language Modelling

Datasets

ImageNet

IMDb Movie Reviews

LRA

ListOps

Text8

Results from the Paper

Ranked #1 on Language Modelling on enwik8 dev

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
