PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
Google Research, 2022
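Most of the results below are few-shot: the model sees k labeled exemplars in its prompt (e.g. "few-shot, k=5") and receives no gradient updates. As a concrete illustration, here is a minimal sketch of how such a k-shot prompt is typically assembled; the Q/A template, field names, and example pairs are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of k-shot prompt construction, as used throughout the
# results below (e.g. "few-shot, k=5"). The Q/A template is an
# illustrative assumption, not the paper's exact prompt format.

def build_kshot_prompt(exemplars, query, k=5):
    """Concatenate k labeled examples ahead of the unlabeled query."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
    blocks.append(f"Q: {query}\nA:")  # the model continues from here
    return "\n\n".join(blocks)

exemplars = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    # ... more (question, answer) pairs drawn from the task's training split
]
prompt = build_kshot_prompt(exemplars, "What is the capital of Japan?", k=2)
print(prompt)
```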
Results from the Paper
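Sketches of how the Pass@1 and EM metrics reported below are computed are given after the table.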
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data | Benchmark |
|---|---|---|---|---|---|---|---|
| Memorization | BIG-bench (Hindu Knowledge) | PaLM-540B (few-shot, k=5) | Accuracy | 95.4 | # 1 | | |
| Memorization | BIG-bench (Hindu Knowledge) | PaLM-62B (few-shot, k=5) | Accuracy | 77.7 | # 3 | | |
| Common Sense Reasoning | BIG-bench (Known Unknowns) | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | # 1 | | |
| Auto Debugging | BIG-bench Lite | PaLM 62B (few-shot, k=5) | Exact string match | 38.2 | # 1 | | |
| Auto Debugging | BIG-bench Lite | PaLM 540B (few-shot, k=5) | Exact string match | 38.2 | # 1 | | |
| Auto Debugging | BIG-bench Lite | PaLM 8B (few-shot, k=5) | Exact string match | 14.7 | # 3 | | |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Novel Concepts) | PaLM-62B (few-shot, k=5) | Accuracy | 59.4 | # 3 | | |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Novel Concepts) | PaLM-540B (few-shot, k=5) | Accuracy | 71.9 | # 1 | | |
| Logical Reasoning | BIG-bench (StrategyQA) | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | # 1 | | |
| Logical Reasoning | BIG-bench (StrategyQA) | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | # 3 | | |
| Common Sense Reasoning | BIG-bench (Winowhy) | PaLM-540B (few-shot, k=5) | Accuracy | 65.9 | # 1 | | |
| Common Sense Reasoning | BIG-bench (Winowhy) | PaLM-62B (few-shot, k=5) | Accuracy | 61.0 | # 3 | | |
| Question Answering | BoolQ | PaLM 540B (finetuned) | Accuracy | 92.2 | # 2 | | |
| Natural Language Inference | CommitmentBank | PaLM 540B (finetuned) | F1 | 100 | # 1 | | |
| Natural Language Inference | CommitmentBank | PaLM 540B (finetuned) | Accuracy | 100 | # 1 | | |
| Question Answering | COPA | PaLM 540B (finetuned) | Accuracy | 100 | # 1 | | |
| Extreme Summarization | GEM-XSum | T5-XXL | ROUGE-2 | 21.0 | # 3 | | |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-62B | ROUGE-2 | 18.5 | # 4 | | |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-62B | Parameters | 62 B | # 3 | | |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-540B | ROUGE-2 | 21.2 | # 2 | | |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-540B | Parameters | 540 B | # 2 | | |
| Sentence Completion | HellaSwag | PaLM-540B (Few-Shot) | Accuracy | 83.8 | # 25 | | |
| Sentence Completion | HellaSwag | PaLM-540B (0-shot) | Accuracy | 83.4 | # 28 | | |
| Sentence Completion | HellaSwag | PaLM-540B (1-shot) | Accuracy | 83.6 | # 26 | | |
| Code Generation | HumanEval | PaLM 62B | Pass@1 | 15.9 | # 108 | | |
| Code Generation | HumanEval | PaLM 540B | Pass@1 | 26.2 | # 87 | | |
| Code Generation | HumanEval | PaLM-cont 62B | Pass@1 | 23.7 | # 91 | | |
| Code Generation | HumanEval | PaLM 8B | Pass@1 | 3.6 | # 127 | | |
| Language Modelling | LAMBADA | PaLM-540B (Zero-Shot) | Accuracy | 77.9 | # 15 | | |
| Language Modelling | LAMBADA | PaLM-540B (Few-Shot) | Accuracy | 89.7 | # 1 | | |
| Language Modelling | LAMBADA | PaLM-540B (One-Shot) | Accuracy | 81.8 | # 9 | | |
| Code Generation | MBPP | PaLM 540B | Accuracy | 36.8 | # 73 | | |
| Code Generation | MBPP | PaLM Coder 540B | Accuracy | 47 | # 58 | | |
| Multi-task Language Understanding | MGSM | PaLM 540B | Average (%) | 55.0 | # 6 | | |
| Multi-task Language Understanding | MMLU | PaLM | Average (%) | 69.3 | # 33 | | |
| Question Answering | MultiRC | PaLM 540B (finetuned) | F1 | 90.1 | # 1 | | |
| Question Answering | MultiRC | PaLM 540B (finetuned) | EM | 69.2 | # 1 | | |
| Question Answering | Natural Questions | PaLM-540B (Zero-Shot) | EM | 21.2 | # 36 | | |
| Question Answering | Natural Questions | PaLM-540B (Few-Shot, k=64) | EM | 39.6 | # 19 | | |
| Question Answering | Natural Questions | PaLM-540B (One-Shot) | EM | 29.3 | # 28 | | |
| Question Answering | OBQA | PaLM 540B (zero-shot) | Accuracy | 53.4 | # 8 | | |
| Question Answering | OBQA | PaLM 62B (zero-shot) | Accuracy | 50.4 | # 9 | | |
| Reading Comprehension | RACE | PaLM 8B (zero-shot) | Accuracy (High) | 42.3 | # 14 | | |
| Reading Comprehension | RACE | PaLM 8B (zero-shot) | Accuracy (Middle) | 57.9 | # 14 | | |
| Reading Comprehension | RACE | PaLM 540B (zero-shot) | Accuracy (High) | 49.1 | # 8 | | |
| Reading Comprehension | RACE | PaLM 540B (zero-shot) | Accuracy (Middle) | 68.1 | # 7 | | |
| Reading Comprehension | RACE | PaLM 62B (zero-shot) | Accuracy (High) | 47.5 | # 10 | | |
| Reading Comprehension | RACE | PaLM 62B (zero-shot) | Accuracy (Middle) | 64.3 | # 9 | | |
| Common Sense Reasoning | ReCoRD | PaLM 540B (finetuned) | F1 | 94.6 | # 2 | | |
| Common Sense Reasoning | ReCoRD | PaLM 540B (finetuned) | EM | 94.0 | # 4 | | |
| Natural Language Inference | RTE | PaLM 540B (finetuned) | Accuracy | 95.7 | # 2 | | |
| Natural Language Inference | RTE | PaLM 540B (0-shot) | Accuracy | 72.9 | # 49 | | |
| Natural Language Inference | RTE | PaLM 540B (1-shot) | Accuracy | 78.7 | # 41 | | |
| Natural Language Inference | RTE | PaLM 540B (5-shot) | Accuracy | 79.6 | # 38 | | |
| Question Answering | TriviaQA | PaLM-540B (Few-Shot) | EM | 81.4 | # 7 | | |
| Question Answering | TriviaQA | PaLM-540B (One-Shot) | EM | 81.4 | # 7 | | |
| Question Answering | TriviaQA | PaLM-540B (Zero-Shot) | EM | 76.9 | # 11 | | |
| Cross-Lingual Question Answering | TyDiQA-GoldP | PaLM-540B (CoT) | EM | 52.9 | # 7 | | |
| Question Answering | WebQuestions | PaLM-540B (Few-Shot) | EM | 43.5 | # 5 | | |
| Question Answering | WebQuestions | PaLM-540B (One-Shot) | EM | 22.6 | # 14 | | |
| Question Answering | WebQuestions | PaLM-540B (Zero-Shot) | EM | 10.6 | # 18 | | |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (5-shot) | Accuracy | 89.5 | # 11 | | |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (0-shot) | Accuracy | 89.1 | # 12 | | |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (finetuned) | Accuracy | 100 | # 1 | | |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (1-shot) | Accuracy | 86.3 | # 16 | | |
| Common Sense Reasoning | WinoGrande | PaLM 62B (0-shot) | Accuracy | 77.0 | # 18 | | |
| Common Sense Reasoning | WinoGrande | PaLM 540B (0-shot) | Accuracy | 81.1 | # 12 | | |
| Common Sense Reasoning | WinoGrande | PaLM-cont 62B (0-shot) | Accuracy | 77.0 | # 18 | | |
| Word Sense Disambiguation | Words in Context | PaLM 540B (finetuned) | Accuracy | 78.8 | # 2 | | |
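The HumanEval rows above report Pass@1. The standard pass@k estimator (Chen et al., 2021) draws n completions per problem, counts the c that pass the unit tests, and averages an unbiased per-problem estimate over the benchmark. The sketch below implements that formula; the sampling setup (value of n, decoding temperature) is not given in this table and is assumed here.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), computed as a numerically stable
    running product rather than with explicit binomials."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-sample draw passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 sampled completions for one problem, 50 of which pass the tests:
print(pass_at_k(n=200, c=50, k=1))  # 0.25; for k=1 this is simply c/n
```

The benchmark score is the mean of `pass_at_k` over all problems; for k=1 it reduces to the fraction of sampled completions that pass.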
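Several question-answering rows (Natural Questions, TriviaQA, WebQuestions, TyDiQA-GoldP, MultiRC, ReCoRD) report EM, i.e. exact match. A common convention, inherited from the SQuAD evaluation script, normalizes the prediction and each gold answer before comparison; whether every leaderboard above applies exactly this normalization is an assumption.

```python
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop the articles a/an/the, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """EM scores 1 if the normalized prediction equals any gold answer."""
    return any(normalize_answer(prediction) == normalize_answer(g)
               for g in gold_answers)

print(exact_match("The Eiffel Tower!", ["Eiffel Tower"]))  # True
```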