In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
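As a quick illustration of the dialogue use case, below is a minimal sketch of querying a Llama 2-Chat checkpoint through the Hugging Face `transformers` library. The repo id `meta-llama/Llama-2-7b-chat-hf` and the `[INST]` prompt template reflect the released chat checkpoints, but treat the exact names as assumptions to verify against the official model card; the weights are gated behind Meta's license.

```python
# Minimal sketch: generating a chat reply with Llama 2-Chat via transformers.
# Assumes access to the gated "meta-llama/Llama-2-7b-chat-hf" checkpoint and
# that the `accelerate` package is installed (needed for device_map="auto").
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 2-Chat was fine-tuned with an [INST] ... [/INST] dialogue template
# around each user turn.
prompt = "[INST] What is the capital of France? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```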

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Question Answering | BoolQ | Llama 2 70B (0-shot) | Accuracy | 85.0 | #17 |
| Question Answering | BoolQ | Llama 2 7B (0-shot) | Accuracy | 77.4 | #29 |
| Question Answering | BoolQ | Llama 2 13B (0-shot) | Accuracy | 81.7 | #24 |
| Question Answering | BoolQ | Llama 2 34B (0-shot) | Accuracy | 83.7 | #20 |
| Arithmetic Reasoning | GSM8K | Llama 2 70B (1-shot) | Accuracy | 56.8 | #110 |
| Arithmetic Reasoning | GSM8K | Llama 2 70B (1-shot) | Parameters (Billion) | 70 | #86 |
| Sentence Completion | HellaSwag | Llama 2 7B (0-shot) | Accuracy | 77.2 | #45 |
| Sentence Completion | HellaSwag | Llama 2 13B (0-shot) | Accuracy | 80.7 | #39 |
| Sentence Completion | HellaSwag | Llama 2 34B (0-shot) | Accuracy | 83.3 | #29 |
| Sentence Completion | HellaSwag | Llama 2 70B (0-shot) | Accuracy | 85.3 | #21 |
| Code Generation | HumanEval | Llama 2 34B (0-shot) | Pass@1 | 22.6 | #95 |
| Code Generation | HumanEval | Llama 2 70B (0-shot) | Pass@1 | 29.9 | #79 |
| Code Generation | HumanEval | Llama 2 7B (0-shot) | Pass@1 | 12.8 | #115 |
| Code Generation | HumanEval | Llama 2 13B (0-shot) | Pass@1 | 18.3 | #100 |
| Math Word Problem Solving | MAWPS | Llama 2-Chat | Accuracy (%) | 82.4 | #15 |
| Code Generation | MBPP | Llama 2 13B (0-shot) | Accuracy | 30.6 | #77 |
| Code Generation | MBPP | Llama 2 70B (0-shot) | Accuracy | 45.0 | #63 |
| Code Generation | MBPP | Llama 2 34B (0-shot) | Accuracy | 33.0 | #76 |
| Code Generation | MBPP | Llama 2 7B (0-shot) | Accuracy | 20.8 | #84 |
| Multi-task Language Understanding | MMLU | Llama 2 34B (5-shot) | Average (%) | 62.6 | #46 |
| Multi-task Language Understanding | MMLU | Llama 2 7B (5-shot) | Average (%) | 45.3 | #70 |
| Multi-task Language Understanding | MMLU | Llama 2 13B (5-shot) | Average (%) | 54.8 | #59 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional Medicine) | Llama 2 7B-Chat | Accuracy | 40.07 | #6 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional Medicine) | Llama 2 7B | Accuracy | 43.38 | #5 |
| Question Answering | Natural Questions | Llama 2 70B (1-shot) | EM | 33.0 | #23 |
| Question Answering | PIQA | Llama 2 70B (0-shot) | Accuracy | 82.8 | #12 |
| Question Answering | PIQA | Llama 2 34B (0-shot) | Accuracy | 81.9 | #19 |
| Question Answering | PIQA | Llama 2 13B (0-shot) | Accuracy | 80.5 | #26 |
| Question Answering | PIQA | Llama 2 7B (0-shot) | Accuracy | 78.8 | #33 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | BLEU-2 | 0.075 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | BLEU-4 | 0.009 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-1 | 0.184 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-2 | 0.043 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-L | 0.142 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | METEOR | 0.149 | #2 |
| Math Word Problem Solving | SVAMP | Llama 2-Chat | Execution Accuracy | 69.2 | #8 |
| Question Answering | TriviaQA | Llama 2 70B (1-shot) | EM | 85.0 | #5 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | BLEU-2 | 0.019 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | BLEU-4 | 0.002 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-1 | 0.103 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-2 | 0.060 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-L | 0.009 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | METEOR | 0.052 | #2 |
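The HumanEval and MBPP rows report Pass@1: the probability that a single sampled program passes all unit tests for a problem. A standard way to estimate pass@k from n samples per problem is the unbiased estimator of Chen et al. (2021); the sketch below is a generic implementation of that estimator, not code from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass all unit tests
    k: budget of completions considered
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k = 1 this reduces to c/n, the fraction of correct samples:
print(pass_at_k(200, 26, 1))  # ≈ 0.13
```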
