The Extended Yale B database contains 2414 frontal-face images with size 192×168 over 38 subjects and about 64 images per subject. The images were captured under different lighting conditions and various facial expressions.
181 PAPERS • 1 BENCHMARK
The "Flying Chairs" are a synthetic dataset with optical flow ground truth. It consists of 22872 image pairs and corresponding flow fields. Images show renderings of 3D chair models moving in front of random backgrounds from Flickr. Motions of both the chairs and the background are purely planar.
181 PAPERS • NO BENCHMARKS YET
SUNCG is a large-scale dataset of synthetic 3D scenes with dense volumetric annotations.
TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split into 30,132 training videos and 511 testing videos, with an average of 470,9 frames.
181 PAPERS • 2 BENCHMARKS
MNIST-M is created by combining MNIST digits with the patches randomly extracted from color photos of BSDS500 as their background. It contains 59,001 training and 90,001 test images.
180 PAPERS • 1 BENCHMARK
StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. It includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Questions in StrategyQA are short, topic-diverse, and cover a wide range of strategies.
178 PAPERS • 1 BENCHMARK
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions are be provided for each image.
177 PAPERS • 4 BENCHMARKS
The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Natural Language Learning. It originates from a series of shared tasks organized at the Conferences of Natural Language Learning.
176 PAPERS • 49 BENCHMARKS
MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance.
176 PAPERS • 1 BENCHMARK
BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
175 PAPERS • 6 BENCHMARKS
The MOTChallenge datasets are designed for the task of multiple object tracking. There are several variants of the dataset released each year, such as MOT15, MOT17, MOT20.
175 PAPERS • 8 BENCHMARKS
Colored MNIST is a synthetic binary classification task derived from MNIST.
174 PAPERS • NO BENCHMARKS YET
OTB-2015, also referred as Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for evaluating visual tracking. Image Source: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
174 PAPERS • 1 BENCHMARK
Youtube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 for testing. The training and validation videos have pixel-level ground truth annotations for every 5th frame (6 fps). It also contains Instance Segmentation annotations. It has more than 7,800 unique objects, 190k high-quality manual annotations and more than 340 minutes in duration.
174 PAPERS • 10 BENCHMARKS
ENZYMES is a dataset of 600 protein tertiary structures obtained from the BRENDA enzyme database. The ENZYMES dataset contains 6 enzymes.
173 PAPERS • 1 BENCHMARK
Mip-NeRF 360 is an extension to the Mip-NeRF that uses a non-linear parameterization, online distillation, and a novel distortion-based regularize to overcome the challenge of unbounded scenes. The dataset consists of 9 scenes with 5 outdoors and 4 indoors, each containing a complex central object or area with a detailed background.
MUSAN is a corpus of music, speech and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises.
172 PAPERS • NO BENCHMARKS YET
WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their content.
172 PAPERS • 1 BENCHMARK
The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three subsets: training set (64 relations), validation set (16 relations) and test set (20 relations).
170 PAPERS • 4 BENCHMARKS
The Lip Reading in the Wild (LRW) dataset a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, whose boundary is centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class while the validation and test sets contain 50 utterances.
170 PAPERS • 7 BENCHMARKS
Objaverse is a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category.
170 PAPERS • 1 BENCHMARK
The WebVision dataset is designed to facilitate the research on learning visual representation from noisy web data. It is a large scale web images dataset that contains more than 2.4 million of images crawled from the Flickr website and Google Images search.
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.
MARS (Motion Analysis and Re-identification Set) is a large scale video based person reidentification dataset, an extension of the Market-1501 dataset. It has been collected from six near-synchronized cameras. It consists of 1,261 different pedestrians, who are captured by at least 2 cameras. The variations in poses, colors and illuminations of pedestrians, as well as the poor image quality, make it very difficult to yield high matching accuracy. Moreover, the dataset contains 3,248 distractors in order to make it more realistic. Deformable Part Model and GMMCP tracker were used to automatically generate the tracklets (mostly 25-50 frames long).
169 PAPERS • 2 BENCHMARKS
MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 years old.
169 PAPERS • 8 BENCHMARKS
The UCF-QNRF dataset is a crowd counting dataset and it contains large diversity both in scenes, as well as in background types. It consists of 1535 images high-resolution images from Flickr, Web Search and Hajj footage. The number of people (i.e., the count) varies from 50 to 12,000 across images.
169 PAPERS • 1 BENCHMARK
The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed are from the Lancaster-Oslo/Bergen Corpus of British English. It includes contributions from 657 writers making a total of 1,539 handwritten pages comprising of 115,320 words and is categorized as part of modern collection. The database is labeled at the sentence, line, and word levels.
168 PAPERS • 2 BENCHMARKS
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
168 PAPERS • NO BENCHMARKS YET
The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, user simulation learning, among other tasks in large-scale virtual assistants. Besides these, the dataset has unseen domains and services in the evaluation set to quantify the performance in zero-shot or few shot settings.
168 PAPERS • 3 BENCHMARKS
WMT 2016 is a collection of datasets used in shared tasks of the First Conference on Machine Translation. The conference builds on ten previous Workshops on statistical Machine Translation.
168 PAPERS • 18 BENCHMARKS
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practise.
LabelMe database is a large collection of images with ground truth labels for object detection and recognition. The annotations come from two different sources, including the LabelMe online annotation tool.
167 PAPERS • 1 BENCHMARK
NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to learn over time to read the web. NELL has accumulated over 50 million candidate beliefs by reading the web, and it is considering these at different levels of confidence. NELL has high confidence in 2,810,379 of these beliefs.
166 PAPERS • 4 BENCHMARKS
AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds sourced from the AudioSet dataset. Annotators were provided the audio tracks together with category hints (and with additional video hints if needed).
165 PAPERS • 9 BENCHMARKS
The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.
165 PAPERS • 1 BENCHMARK
The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visit from specialist doctors on the cognitive test scores of premature infants. The datasets is first used for benchmarking treatment effect estimation algorithms in Hill [35], where selection bias is induced by removing non-random subsets of the treated individuals to create an observational dataset, and the outcomes are generated using the original covariates and treatments. It contains 747 subjects and 25 variables.
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies were performed at Beth Israel Deaconess Medical Center in Boston, MA.
165 PAPERS • 2 BENCHMARKS
Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer pairs in total. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.
FairFace is a face image dataset which is race balanced. It contains 108,501 images from 7 different race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups.
164 PAPERS • 1 BENCHMARK
Libri-Light is a collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio.
164 PAPERS • 2 BENCHMARKS
The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and over 270, 000 training frames. Moreover, both the frame-level and pixel-level ground truth of abnormal events are annotated in this dataset.
164 PAPERS • 4 BENCHMARKS
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
163 PAPERS • 1 BENCHMARK
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
163 PAPERS • 2 BENCHMARKS
The Face Detection Dataset and Benchmark (FDDB) dataset is a collection of labeled faces from Faces in the Wild dataset. It contains a total of 5171 face annotations, where images are also of various resolution, e.g. 363x450 and 229x410. The dataset incorporates a range of challenges, including difficult pose angles, out-of-focus faces and low resolution. Both greyscale and color images are included.
ATOMIC is an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment").
162 PAPERS • NO BENCHMARKS YET
CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios:
161 PAPERS • 15 BENCHMARKS
Dataset of hate speech annotated on Internet forum posts in English at sentence-level. The source forum in Stormfront, a large online community of white nacionalists. A total of 10,568 sentence have been been extracted from Stormfront and classified as conveying hate speech or not.
161 PAPERS • 1 BENCHMARK
KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular KITTI dataset, providing more comprehensive semantic/instance labels in 2D and 3D, richer 360 degree sensory information (fisheye images and pushbroom laser scans), very accurate and geo-localized vehicle and camera poses, and a series of new challenging benchmarks.
161 PAPERS • 6 BENCHMARKS