The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. Over time the collection was extended with a 1,000,000 question dataset, a natural language generation dataset, a passage ranking dataset, keyphrase extraction dataset, crawling dataset, and a conversational search.
823 PAPERS • 7 BENCHMARKS
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
163 PAPERS • 2 BENCHMARKS
CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
157 PAPERS • 2 BENCHMARKS
PubLayNet is a dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated.
100 PAPERS • 1 BENCHMARK
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
99 PAPERS • NO BENCHMARKS YET
The Free Music Archive (FMA) is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.
95 PAPERS • 2 BENCHMARKS
TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the key characteristics of pandemic search is the accelerated rate of change: the topics of interest evolve as the pandemic progresses and the scientific literature in the area explodes. The COVID-19 pandemic provides an opportunity to capture this progression as it happens. TREC-COVID, in creating a test collection around COVID-19 literature, is building infrastructure to support new research and technologies in pandemic search.
64 PAPERS • 1 BENCHMARK
MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 task types are Bitext mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity and Summarisation. The 56 datasets contain varying text lengths and they are grouped into three categories: Sentence to sentence, Paragraph to paragraph, and Sentence to paragraph.
51 PAPERS • 8 BENCHMARKS
MedleyDB, is a dataset of annotated, royalty-free multitrack recordings. It was curated primarily to support research on melody extraction. For each song melody f₀ annotations are provided as well as instrument activations for evaluating automatic instrument recognition. The original dataset consists of 122 multitrack songs out of which 108 include melody annotations.
41 PAPERS • NO BENCHMARKS YET
A large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community.
40 PAPERS • 5 BENCHMARKS
InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.
38 PAPERS • NO BENCHMARKS YET
The MSLR-WEB10K dataset consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. It is a subset of the MSLR-WEB30K dataset.
35 PAPERS • NO BENCHMARKS YET
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.
32 PAPERS • NO BENCHMARKS YET
The MSLR-WEB30K dataset consists of 30,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair.
31 PAPERS • 1 BENCHMARK
The MQ2007 dataset consists of queries, corresponding retrieved documents and labels provided by human experts. The possible relevance labels for each document are “relevant”, “partially relevant”, and “not relevant”.
30 PAPERS • NO BENCHMARKS YET
The MQ2008 dataset is a dataset for Learning to Rank. It contains 800 queries with labelled documents.
27 PAPERS • NO BENCHMARKS YET
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
26 PAPERS • NO BENCHMARKS YET
A large scale dataset to enable the transfer step, exploiting the Natural Questions dataset.
25 PAPERS • 1 BENCHMARK
GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The 360 excerpts are the result of the following combinations:
24 PAPERS • NO BENCHMARKS YET
The Yahoo! Learning to Rank Challenge dataset consists of 709,877 documents encoded in 700 features and sampled from query logs of the Yahoo! search engine, spanning 29,921 queries.
Composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).
20 PAPERS • NO BENCHMARKS YET
The iKala dataset is a singing voice separation dataset that comprises of 252 30-second excerpts sampled from 206 iKala songs (plus 100 hidden excerpts reserved for MIREX data mining contest). The music accompaniment and the singing voice are recorded at the left and right channels respectively. Additionally, the human-labeled pitch contours and timestamped lyrics are provided.
20 PAPERS • 1 BENCHMARK
The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755 and thereby categorized into historical collection. The images are annotated at word level and contain approximately 5,000 words.
19 PAPERS • NO BENCHMARKS YET
HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.
A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.
18 PAPERS • NO BENCHMARKS YET
A dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing.
17 PAPERS • NO BENCHMARKS YET
The beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
15 PAPERS • NO BENCHMARKS YET
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
Enhances QuAC by adapting it to an open-retrieval setting. It is an aggregation of three existing datasets: (1) the QuAC dataset that offers information-seeking conversations, (2) the CANARD dataset that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions.
TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions of the Trip Database health web search engine.
JuICe is a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data.
13 PAPERS • NO BENCHMARKS YET
The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.
13 PAPERS • 2 BENCHMARKS
Ohsumed includes medical abstracts from the MeSH categories of the year 1991. In [Joachims, 1997] were used the first 20,000 documents divided in 10,000 for training and 10,000 for testing. The specific task was to categorize the 23 cardiovascular diseases categories. After selecting the such category subset, the unique abstract number becomes 13,929 (6,286 for training and 7,643 for testing). As current computers can easily manage larger number of documents we make available all 34,389 cardiovascular diseases abstracts out of 50,216 medical abstracts contained in the year 1991.
11 PAPERS • 2 BENCHMARKS
Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.
10 PAPERS • NO BENCHMARKS YET
ClariQ is an extension of the Qulac dataset with additional new topics, questions, and answers in the training set. The test set is completely unseen and newly collected. Like Qulac, ClariQ consists of single-turn conversations (initial_request, followed by clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations.
9 PAPERS • NO BENCHMARKS YET
QUASAR-S is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 37,362 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The answer to each question is restricted to be another software entity, from an output vocabulary of 4874 entities.
8 PAPERS • NO BENCHMARKS YET
CLIRMatrix is a large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. It includes:
7 PAPERS • NO BENCHMARKS YET
OpenMIC-2018 is an instrument recognition dataset containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive. Each example is a 10-second excerpt which has been partially labeled for the presence or absence of 20 instrument classes by annotators on a crowd-sourcing platform.
7 PAPERS • 1 BENCHMARK
The dataset contains:
ArCOV-19 is an Arabic COVID-19 Twitter dataset that covers the period from 27th of January till 30th of April 2020. ArCOV-19 is the first publicly-available Arabic Twitter dataset covering COVID-19 pandemic that includes over 1M tweets alongside the propagation networks of the most-popular subset of them (i.e., most-retweeted and -liked).
6 PAPERS • NO BENCHMARKS YET
The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.
6 PAPERS • 1 BENCHMARK
GiantMIDI-Piano contains 10,854 unique piano solo pieces composed by 2,786 composers. GiantMIDI-Piano contains 34,504,873 transcribed notes, and contains metadata information of each music piece.
Microsoft Research Social Media Conversation Corpus consists of 127M context-message-response triples from the Twitter FireHose, covering the 3-month period June 2012 through August 2012. Only those triples where context and response were generated by the same user were extracted. To minimize noise, only triples that contained at least one frequent bigram that appeared more than 3 times in the corpus was selected. This produced a corpus of 29M Twitter triples.
TutorialBank is a publicly available dataset which aims to facilitate NLP education and research. The dataset consists of links to over 6,300 high-quality resources on NLP and related fields. The corpus’s magnitude, manual collection and focus on annotation for education in addition to research differentiates it from other corpora.
ClueWeb22 is the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high quality, large scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier CLUEWeb corpora, the ClUEWeb22 corpus is larger, more varied, of higher-quality, and aligned with the document distributions in commercial web search. Besides raw HTML, the dataset includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed cleaned document text.
5 PAPERS • NO BENCHMARKS YET
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
The Spotify Music Streaming Sessions Dataset (MSSD) consists of 160 million streaming sessions with associated user interactions, audio features and metadata describing the tracks streamed during the sessions, and snapshots of the playlists listened to during the sessions.
5 PAPERS • 1 BENCHMARK
SciRepEval is a comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search.