TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Instruction Following	IFEval	GPT-4	Prompt-level strict-accuracy	76.89	# 1
Instruction Following	IFEval	GPT-4	Inst-level strict-accuracy	83.57	# 1
Instruction Following	IFEval	GPT-4	Prompt-level loose-accuracy	79.3	# 1
Instruction Following	IFEval	GPT-4	Inst-level loose-accuracy	85.37	# 1
Instruction Following	IFEval	PaLM 2 S	Prompt-level strict-accuracy	43.07	# 2
Instruction Following	IFEval	PaLM 2 S	Inst-level strict-accuracy	55.76	# 2
Instruction Following	IFEval	PaLM 2 S	Prompt-level loose-accuracy	46.95	# 2
Instruction Following	IFEval	PaLM 2 S	Inst-level loose-accuracy	59.11	# 2

Instruction-Following Evaluation for Large Language Models

14 Nov 2023 · Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
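To make the idea of a "verifiable instruction" concrete, here is a minimal sketch of how the two examples quoted in the abstract could be checked programmatically. The function names, the whitespace-based word count, and the whole-word, case-insensitive keyword match are illustrative assumptions, not the checkers from the official repository.

```python
import re


def check_min_words(response: str, min_words: int) -> bool:
    """Hypothetical checker for "write in more than N words".

    Assumption: words are whitespace-separated tokens.
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Hypothetical checker for "mention the keyword K at least N times".

    Assumption: whole-word, case-insensitive matches are counted.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count


if __name__ == "__main__":
    response = "AI helps people. AI learns from data. Try AI today."
    print(check_keyword_frequency(response, "AI", 3))
    print(check_min_words(response, 400))
```

Because each check is a deterministic function of the response text, the result does not depend on a human rater or an evaluator LLM, which is the property the abstract emphasizes.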

Code

google-research/google-research (official) · 32,809 stars

deepseek-ai/deepseek-llm · 1,123 stars

Tasks

Instruction Following

Datasets

Introduced in the Paper:

IFEval

Results from the Paper

Ranked #1 on Instruction Following on IFEval
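The leaderboard reports both prompt-level and instruction-level ("Inst-level") accuracy. Since each prompt contains one or more verifiable instructions, the two numbers can differ. The sketch below shows one way to aggregate them, assuming that a prompt counts as correct only when all of its instructions are followed and that instruction-level accuracy is the fraction of individual instructions followed. The function names and the toy data are hypothetical, and the loose variants are not modeled here.

```python
from typing import List


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Percentage of prompts whose instructions were ALL followed."""
    return 100.0 * sum(all(per_prompt) for per_prompt in results) / len(results)


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Percentage of individual instructions followed, pooled over all prompts."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    return 100.0 * sum(flat) / len(flat)


if __name__ == "__main__":
    # Toy data: one inner list per prompt, one boolean per verifiable instruction.
    results = [[True], [True, False], [True, True, True]]
    print(f"prompt-level: {prompt_level_accuracy(results):.2f}")
    print(f"inst-level:   {instruction_level_accuracy(results):.2f}")
```

Under these assumptions, prompt-level accuracy can never exceed instruction-level accuracy, which is consistent with the reported numbers for both models.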
