Grounded SCAN poses a simple task in which an agent must execute action sequences based on synthetic language instructions.
20 PAPERS • NO BENCHMARKS YET
This dataset evaluates the instruction-following ability of large language models. It contains 500+ prompts with verifiable instructions such as "write an article with more than 800 words" and "wrap your response with double quotation marks" (a sketch of such checks follows this entry).
11 PAPERS • 1 BENCHMARK
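The two example instructions above are objectively checkable. Below is a minimal sketch of how such checks could be implemented, not the dataset's actual evaluation code; the function names and the whitespace word-splitting heuristic are illustrative assumptions.

```python
# Illustrative checkers for the two example instructions above.
# These are assumptions for illustration, not the dataset's official code.

def more_than_n_words(response: str, n: int = 800) -> bool:
    """Check: 'write an article with more than 800 words'."""
    return len(response.split()) > n

def wrapped_in_double_quotes(response: str) -> bool:
    """Check: 'wrap your response with double quotation marks'."""
    stripped = response.strip()
    return len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"')

if __name__ == "__main__":
    response = '"' + " ".join(["word"] * 801) + '"'
    print(more_than_n_words(response))         # True
    print(wrapped_in_double_quotes(response))  # True
```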
MultI-Modal In-Context Instruction Tuning (MIMIC-IT) is a dataset for instruction tuning of multi-modal models, motivated by the interleaved-format pretraining data used upstream of the Flamingo model. Each data sample consists of a queried image-instruction-answer triplet, with the instruction and answer tailored to the image, plus a context: a series of image-instruction-answer triplets that contextually correlate with the queried triplet, emulating the relationship between the context and the queried image-text pair found in the MMC4 dataset (an illustrative sample layout follows this entry).
9 PAPERS • NO BENCHMARKS YET
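A minimal sketch of the sample layout described above, assuming a simple in-memory representation; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
# A sketch of one MIMIC-IT sample: a queried image-instruction-answer triplet
# plus in-context triplets that correlate with it. Class and field names are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Triplet:
    image: str        # path or identifier of the image
    instruction: str  # instruction tailored to the image (and its context)
    answer: str       # answer to the instruction

@dataclass
class MimicItSample:
    query: Triplet                                         # the queried triplet
    context: List[Triplet] = field(default_factory=list)   # correlated triplets

sample = MimicItSample(
    query=Triplet("img_0042.jpg", "What is the dog doing?", "Catching a frisbee."),
    context=[Triplet("img_0041.jpg", "What animal is in the picture?", "A dog.")],
)
print(len(sample.context))  # 1
```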
UGIF is a multilingual, multi-modal, UI-grounded dataset for step-by-step task completion on smartphones. It contains 523 natural language instructions, each paired with sequences of UI screens and actions, in eight languages, showing how to execute the task (an illustrative record layout follows this entry).
7 PAPERS • NO BENCHMARKS YET
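A minimal sketch of what one UGIF record could look like under the description above; the field names, file names, and action strings are illustrative assumptions, not the dataset's actual schema.

```python
# A sketch of one UGIF record: a natural language instruction paired with a
# sequence of (UI screen, action) steps in one of the eight languages. All
# field names and values are illustrative assumptions, not the actual schema.
record = {
    "instruction": "Turn on Wi-Fi",  # natural language task description
    "language": "en",                # one of the eight languages
    "steps": [
        {"screen": "settings_home.png", "action": "tap('Network & internet')"},
        {"screen": "network_menu.png",  "action": "tap('Wi-Fi')"},
        {"screen": "wifi_menu.png",     "action": "toggle('Use Wi-Fi', on=True)"},
    ],
}
print(f"{len(record['steps'])} steps to complete: {record['instruction']}")
```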
Bactrian-X is a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. The instructions were obtained from alpaca-52k and dolly-15k and translated into 52 languages (52 languages × 67k instances = 3.4M instances).
4 PAPERS • NO BENCHMARKS YET
CIDAR contains 10,000 instructions and their outputs. The dataset was created by selecting around 9,109 samples from the Alpagasus dataset and translating them to Arabic using ChatGPT, then appending around 891 Arabic grammar instructions from the website Ask the teacher. All 10,000 samples were reviewed by around 12 reviewers (the composition arithmetic is sketched below).
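A one-line sanity check of the composition described above; the variable names are illustrative assumptions, and the counts come directly from the description.

```python
# Sanity-check CIDAR's composition using the counts stated in the description
# (variable names are illustrative assumptions).
alpagasus_translated = 9_109  # samples selected from Alpagasus, translated to Arabic
grammar_instructions = 891    # Arabic grammar instructions from Ask the teacher
assert alpagasus_translated + grammar_instructions == 10_000
```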
InstructOpenWiki is a substantial instruction tuning dataset for open-world information extraction (IE), enriched with a comprehensive corpus, extensive annotations, and diverse instructions.
This is the sequential instructions dataset from "Understanding the Effects of RLHF on LLM Generalisation and Diversity". The dataset is in the alpaca_eval format.
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them for a variety of downstream tasks.
tamil-alpaca is a repository that includes Tamil-translated versions of the Alpaca dataset and a subset of the OpenOrca dataset.