In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
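As a quick illustration of the dialogue use case, below is a minimal sketch of querying a Llama 2-Chat checkpoint through the Hugging Face `transformers` library. The repo id `meta-llama/Llama-2-7b-chat-hf` and the `[INST]` prompt template reflect the released chat checkpoints, but treat the exact names as assumptions to verify against the official model card; the weights are gated behind Meta's license.

```python
# Minimal sketch: generating a chat reply with Llama 2-Chat via transformers.
# Assumes access to the gated "meta-llama/Llama-2-7b-chat-hf" checkpoint and
# that the `accelerate` package is installed (needed for device_map="auto").
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 2-Chat was fine-tuned with an [INST] ... [/INST] dialogue template
# around each user turn.
prompt = "[INST] What is the capital of France? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```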

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Question Answering | BoolQ | Llama 2 70B (0-shot) | Accuracy | 85.0 | #17 |
| Question Answering | BoolQ | Llama 2 7B (0-shot) | Accuracy | 77.4 | #29 |
| Question Answering | BoolQ | Llama 2 13B (0-shot) | Accuracy | 81.7 | #24 |
| Question Answering | BoolQ | Llama 2 34B (0-shot) | Accuracy | 83.7 | #20 |
| Arithmetic Reasoning | GSM8K | Llama 2 70B (1-shot) | Accuracy | 56.8 | #110 |
| Arithmetic Reasoning | GSM8K | Llama 2 70B (1-shot) | Parameters (Billion) | 70 | #86 |
| Sentence Completion | HellaSwag | Llama 2 7B (0-shot) | Accuracy | 77.2 | #45 |
| Sentence Completion | HellaSwag | Llama 2 13B (0-shot) | Accuracy | 80.7 | #39 |
| Sentence Completion | HellaSwag | Llama 2 34B (0-shot) | Accuracy | 83.3 | #29 |
| Sentence Completion | HellaSwag | Llama 2 70B (0-shot) | Accuracy | 85.3 | #21 |
| Code Generation | HumanEval | Llama 2 34B (0-shot) | Pass@1 | 22.6 | #95 |
| Code Generation | HumanEval | Llama 2 70B (0-shot) | Pass@1 | 29.9 | #79 |
| Code Generation | HumanEval | Llama 2 7B (0-shot) | Pass@1 | 12.8 | #115 |
| Code Generation | HumanEval | Llama 2 13B (0-shot) | Pass@1 | 18.3 | #100 |
| Math Word Problem Solving | MAWPS | Llama 2-Chat | Accuracy (%) | 82.4 | #15 |
| Code Generation | MBPP | Llama 2 13B (0-shot) | Accuracy | 30.6 | #77 |
| Code Generation | MBPP | Llama 2 70B (0-shot) | Accuracy | 45.0 | #63 |
| Code Generation | MBPP | Llama 2 34B (0-shot) | Accuracy | 33.0 | #76 |
| Code Generation | MBPP | Llama 2 7B (0-shot) | Accuracy | 20.8 | #84 |
| Multi-task Language Understanding | MMLU | Llama 2 34B (5-shot) | Average (%) | 62.6 | #46 |
| Multi-task Language Understanding | MMLU | Llama 2 7B (5-shot) | Average (%) | 45.3 | #70 |
| Multi-task Language Understanding | MMLU | Llama 2 13B (5-shot) | Average (%) | 54.8 | #59 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional Medicine) | Llama 2 7B-Chat | Accuracy | 40.07 | #6 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional Medicine) | Llama 2 7B | Accuracy | 43.38 | #5 |
| Question Answering | Natural Questions | Llama 2 70B (1-shot) | EM | 33.0 | #23 |
| Question Answering | PIQA | Llama 2 70B (0-shot) | Accuracy | 82.8 | #12 |
| Question Answering | PIQA | Llama 2 34B (0-shot) | Accuracy | 81.9 | #19 |
| Question Answering | PIQA | Llama 2 13B (0-shot) | Accuracy | 80.5 | #26 |
| Question Answering | PIQA | Llama 2 7B (0-shot) | Accuracy | 78.8 | #33 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | BLEU-2 | 0.075 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | BLEU-4 | 0.009 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-1 | 0.184 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-2 | 0.043 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | ROUGE-L | 0.142 | #2 |
| Question Answering | PubChemQA | Llama 2 7B-Chat | METEOR | 0.149 | #2 |
| Math Word Problem Solving | SVAMP | Llama 2-Chat | Execution Accuracy | 69.2 | #8 |
| Question Answering | TriviaQA | Llama 2 70B (1-shot) | EM | 85.0 | #5 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | BLEU-2 | 0.019 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | BLEU-4 | 0.002 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-1 | 0.103 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-2 | 0.060 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | ROUGE-L | 0.009 | #2 |
| Question Answering | UniProtQA | Llama 2 7B-Chat | METEOR | 0.052 | #2 |
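The HumanEval and MBPP rows report Pass@1: the probability that a single sampled program passes all unit tests for a problem. A standard way to estimate pass@k from n samples per problem is the unbiased estimator of Chen et al. (2021); the sketch below is a generic implementation of that estimator, not code from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass all unit tests
    k: budget of completions considered
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k = 1 this reduces to c/n, the fraction of correct samples:
print(pass_at_k(200, 26, 1))  # ≈ 0.13
```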
