PAWS-X contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
160 PAPERS • 2 BENCHMARKS
The Sports-1M dataset consists of over a million videos from YouTube. The videos in the dataset can be obtained through the YouTube URLs specified by the authors. Approximately 7% (as of 2016) of the videos have been removed by the YouTube uploaders since the dataset was compiled. However, the dataset still contains over a million videos spanning 487 sports-related categories, with 1,000 to 3,000 videos per category. The videos are automatically labeled with the 487 sports classes using the YouTube Topics API, which analyzes the text metadata associated with the videos (e.g., tags and descriptions). Approximately 5% of the videos are annotated with more than one class.
Gowalla is a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using the public API; it consists of 196,591 nodes and 950,327 edges. A total of 6,442,890 check-ins of these users were collected over the period February 2009 to October 2010.
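As a loading sketch, the check-in records can be read into a dataframe; the file name and column order below (user, check-in time, latitude, longitude, location id) are assumptions based on the commonly distributed tab-separated release, not a guaranteed layout.

```python
import pandas as pd

# Minimal sketch for loading the Gowalla check-ins; file name and column
# order are assumptions (tab-separated, no header).
checkins = pd.read_csv(
    "loc-gowalla_totalCheckins.txt.gz",
    sep="\t",
    names=["user", "checkin_time", "latitude", "longitude", "location_id"],
    parse_dates=["checkin_time"],
)

print(checkins["user"].nunique(), "users,", len(checkins), "check-ins")
```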
158 PAPERS • 4 BENCHMARKS
HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different populations, acquired and stored by different modalities.
158 PAPERS • 3 BENCHMARKS
The UCY dataset consists of real pedestrian trajectories with rich multi-human interaction scenarios captured at 2.5 Hz (Δt = 0.4 s). It is composed of three sequences (Zara01, Zara02, and UCY), taken in public spaces from a top-down view.
158 PAPERS • 1 BENCHMARK
Visual Commonsense Reasoning (VCR) is a large-scale dataset for cognition-level visual understanding. Given a challenging question about an image, a machine must perform two sub-tasks: answer correctly and provide a rationale justifying its answer. The VCR dataset contains over 212K (training), 26K (validation) and 25K (testing) questions, answers and rationales derived from 110K movie scenes.
158 PAPERS • 13 BENCHMARKS
The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges; nodes are indexed from 0. The features json files contain the features of articles: each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network, the number of nodes and edges is listed along with some other descriptive statistics.
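A minimal loading sketch following the file layout described above; the concrete file names are assumptions (they differ per topic), and only the described columns and keys are relied on.

```python
import json

import networkx as nx
import pandas as pd

# File names are placeholders for one of the topic networks (e.g. chameleon).
edges = pd.read_csv("chameleon_edges.csv")        # two columns of 0-indexed node ids
targets = pd.read_csv("chameleon_target.csv")     # node id + average monthly traffic
with open("chameleon_features.json") as f:
    features = json.load(f)                       # page id -> list of noun-feature ids

graph = nx.from_pandas_edgelist(edges, *edges.columns[:2])
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
print("target columns:", targets.columns.tolist())
print("features for node 0:", features["0"][:10])
```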
158 PAPERS • 2 BENCHMARKS
YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2,000 long untrimmed videos from 89 cooking recipes; on average, each distinct recipe has 22 videos. The procedure steps for each video are annotated with temporal boundaries and described by imperative English sentences. The videos were downloaded from YouTube and are all in the third-person viewpoint. The videos are unconstrained: the recipes can be performed by individuals in their own homes with unfixed cameras. YouCook2 contains rich recipe types and various cooking styles from all over the world.
158 PAPERS • 7 BENCHMARKS
CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
157 PAPERS • 2 BENCHMARKS
This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions.
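As a generic illustration of how such pairwise preferences can be used (this is not the MT-bench tooling itself), the sketch below aggregates hypothetical records, with hypothetical field names, into per-model win rates.

```python
from collections import defaultdict

# Hypothetical preference records; real field names may differ.
records = [
    {"model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_a"},
    {"model_a": "gpt-4", "model_b": "vicuna-13b", "winner": "model_a"},
    {"model_a": "gpt-3.5", "model_b": "llama-13b", "winner": "tie"},
]

wins, games = defaultdict(int), defaultdict(int)
for r in records:
    a, b = r["model_a"], r["model_b"]
    games[a] += 1
    games[b] += 1
    if r["winner"] == "model_a":
        wins[a] += 1
    elif r["winner"] == "model_b":
        wins[b] += 1

for model in sorted(games):
    print(model, f"win rate: {wins[model] / games[model]:.2f}")
```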
157 PAPERS • NO BENCHMARKS YET
SentEval is a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference and sentence similarity. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. The toolkit comes with scripts to download and preprocess datasets, and an easy interface to evaluate sentence encoders.
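A minimal sketch of the SentEval evaluation interface: the user supplies `prepare` and `batcher` callbacks and calls `eval` on a list of tasks. The task path and the trivial placeholder encoder below are assumptions for illustration only, not a real sentence encoder.

```python
import numpy as np
import senteval

def prepare(params, samples):
    # Build vocabularies or load pretrained embeddings for a real encoder here.
    pass

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one vector per sentence.
    # Placeholder "encoder": sentence length as a 1-dim feature.
    return np.array([[float(len(sent))] for sent in batch])

params = {"task_path": "SentEval/data", "usepytorch": False, "kfold": 5}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["MR", "CR", "SST2", "STS14"])
print(results)
```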
HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in the train set and 9,658 in the test set) and 600 HOI categories constructed from 80 object categories and 117 verb classes. HICO-DET provides more than 150k annotated human-object pairs. V-COCO provides 10,346 images (2,533 for training, 2,867 for validation and 4,946 for testing) and 16,199 person instances. Each person is annotated with 29 action categories, and there are no interaction labels involving objects.
156 PAPERS • 5 BENCHMARKS
ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment annotations. It covers 55 common object categories with about 51,300 unique 3D models. The 12 object categories of PASCAL 3D+, a popular computer vision 3D benchmark dataset, are all covered by ShapeNetCore.
156 PAPERS • 1 BENCHMARK
The IARPA Janus Benchmark A (IJB-A) database was developed to make the face recognition task more challenging by collecting facial images with wide variations in pose, illumination, expression, resolution and occlusion. IJB-A is constructed by collecting 5,712 images and 2,085 videos from 500 identities, with an average of 11.4 images and 4.2 videos per identity.
155 PAPERS • 2 BENCHMARKS
AFW (Annotated Faces in the Wild) is a face detection dataset that contains 205 images with 468 faces. Each face image is labeled with at most 6 landmarks with visibility labels, as well as a bounding box.
154 PAPERS • 1 BENCHMARK
This benchmark builds on recent data collection efforts by domain experts and provides a unified collection of datasets with evaluation metrics and train/test splits that are representative of real-world distribution shifts.
154 PAPERS • NO BENCHMARKS YET
CINIC-10 is a dataset for image classification. It has a total of 270,000 images, 4.5 times that of CIFAR-10. It is constructed from two different sources: ImageNet and CIFAR-10. Specifically, it was compiled as a bridge between CIFAR-10 and ImageNet. It is split into three equal subsets - train, validation, and test - each of which contains 90,000 images.
153 PAPERS • 3 BENCHMARKS
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence level sentiment analysis and emotion recognition in online videos. CMU-MOSEI contains more than 65 hours of annotated video from more than 1000 speakers and 250 topics.
153 PAPERS • 2 BENCHMARKS
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 200K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology — the set of linguistic features that each language expresses — such that the authors expect models performing well on this set to generalize across a large number of the languages in the world.
153 PAPERS • 1 BENCHMARK
The 3DMATCH benchmark evaluates how well descriptors (both 2D and 3D) can establish correspondences between RGB-D frames of different views. The dataset contains 2D RGB-D patches and 3D patches (local TDF voxel grid volumes) of wide-baseline correspondences.
151 PAPERS • 3 BENCHMARKS
The Annotated Facial Landmarks in the Wild (AFLW) is a large-scale collection of annotated face images gathered from Flickr, exhibiting a large variety in appearance (e.g., pose, expression, ethnicity, age, gender) as well as general imaging and environmental conditions. In total about 25K faces are annotated with up to 21 landmarks per image.
151 PAPERS • 11 BENCHMARKS
The Breakfast Actions Dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different kitchens. The dataset is one of the largest fully annotated datasets available. The actions are recorded “in the wild” as opposed to a single controlled lab environment. It consists of over 77 hours of video recordings.
151 PAPERS • 5 BENCHMARKS
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
151 PAPERS • 1 BENCHMARK
In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop clothes retrieval. This is a large subset of DeepFashion, containing large pose and scale variations. It also has large diversity, large quantities, and rich annotations.
150 PAPERS • 2 BENCHMARKS
PASCAL-5i is a dataset used to evaluate few-shot segmentation. The dataset is sub-divided into 4 folds, each containing 5 classes. A fold contains labeled samples from 5 classes that are used for evaluating the few-shot learning method. The remaining 15 classes are used for training.
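A small sketch of the commonly used fold convention (an assumption here, not taken from the description above): fold i reserves PASCAL VOC classes 5i+1 through 5i+5 for few-shot evaluation and trains on the remaining 15.

```python
def pascal_5i_split(fold: int):
    """Return (train_classes, test_classes) for a PASCAL-5i fold, assuming
    the usual convention of 20 VOC classes split into 4 folds of 5."""
    assert 0 <= fold <= 3
    test_classes = list(range(5 * fold + 1, 5 * fold + 6))
    train_classes = [c for c in range(1, 21) if c not in test_classes]
    return train_classes, test_classes

train_cls, test_cls = pascal_5i_split(fold=0)
print("evaluate on:", test_cls)   # [1, 2, 3, 4, 5]
print("train on:", train_cls)     # the remaining 15 classes
```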
150 PAPERS • 1 BENCHMARK
The SALIency in CONtext (SALICON) dataset contains 10,000 training images, 5,000 validation images and 5,000 test images for saliency prediction. This dataset has been created by annotating saliency in images from MS COCO. The ground-truth saliency annotations include fixations generated from mouse trajectories. To improve the data quality, isolated fixations with low local density have been excluded. The training and validation sets, provided with ground truth, contain the following data fields: image, resolution and gaze. The testing data contains only the image and resolution fields.
150 PAPERS • 5 BENCHMARKS
SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.
150 PAPERS • NO BENCHMARKS YET
This dataset consists of more than 210k videos covering 310 audio classes.
150 PAPERS • 3 BENCHMARKS
The PAMAP2 Physical Activity Monitoring dataset contains data of 18 different physical activities (such as walking, cycling, playing soccer, etc.), performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor. The dataset can be used for activity recognition and intensity estimation, while developing and applying algorithms of data processing, segmentation, feature extraction and classification.
149 PAPERS • 1 BENCHMARK
ViZDoom is an AI research platform based on the classic first-person shooter game Doom. The most popular game mode is probably the so-called Death Match, where several players join in a maze and fight against each other. After a fixed time, the match ends and all the players are ranked by their FRAG scores, defined as kills minus suicides. During the game, each player can access various observations, including the first-person view screen pixels, the corresponding depth map and segmentation map (pixel-wise object labels), the bird's-eye-view maze map, etc. The valid actions include almost all the keyboard and mouse controls a human player can use, covering moving, turning, jumping, shooting, changing weapons, etc. ViZDoom can run a game either synchronously or asynchronously, i.e., the game core either waits until all players' actions are collected or runs at a constant frame rate without waiting.
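A minimal sketch of the control loop described above, using the ViZDoom Python API; the scenario config path is an assumption (example .cfg files ship with the package), and the random policy is a placeholder for a learning agent.

```python
import random

import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/basic.cfg")   # path assumed; basic.cfg defines 3 buttons
game.set_mode(vzd.Mode.PLAYER)            # synchronous: engine waits for make_action
game.init()

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()
    pixels = state.screen_buffer          # first-person view screen pixels
    action = random.choice([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    reward = game.make_action(action)     # returns the reward for this step

print("episode return:", game.get_total_reward())
game.close()
```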
149 PAPERS • 3 BENCHMARKS
The Visual Relationship Dataset (VRD) contains 4,000 images for training and 1,000 for testing, annotated with visual relationships. Bounding boxes are annotated with labels drawn from 100 unary predicates; these labels refer to animals, vehicles, clothes and generic objects. Pairs of bounding boxes are annotated with labels drawn from 70 binary predicates; these labels refer to actions, prepositions, spatial relations, comparatives or preposition phrases. The dataset has 37,993 instances of visual relationships and 6,672 types of relationships. 1,877 instances of relationships occur only in the test set, and they are used to evaluate the zero-shot learning scenario.
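For illustration only (this is not the dataset's actual file format), a relationship can be represented as a <subject, predicate, object> triplet with boxes, and the zero-shot test instances are exactly those whose triplet type never appears in training:

```python
from typing import NamedTuple, Set, Tuple

class Relationship(NamedTuple):
    subject_label: str
    subject_box: Tuple[int, int, int, int]   # x1, y1, x2, y2
    predicate: str
    object_label: str
    object_box: Tuple[int, int, int, int]

def is_zero_shot(rel: Relationship, train_triplets: Set[Tuple[str, str, str]]) -> bool:
    # A test relationship is "zero-shot" if its triplet type never occurs in training.
    return (rel.subject_label, rel.predicate, rel.object_label) not in train_triplets

train_triplets = {("person", "ride", "horse")}
rel = Relationship("person", (10, 20, 110, 220), "ride", "elephant", (90, 40, 300, 260))
print(is_zero_shot(rel, train_triplets))   # True: this triplet type is unseen in training
```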
148 PAPERS • 7 BENCHMARKS
The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers.
147 PAPERS • 1 BENCHMARK
The Cambridge Learner Corpus First Certificate in English (CLC FCE) dataset consists of short texts, written by learners of English as an additional language in response to exam prompts eliciting free-text answers and assessing mastery of the upper-intermediate proficiency level. The texts have been manually error-annotated using a taxonomy of 77 error types. The full dataset consists of 323,192 sentences. The publicly released subset of the dataset, named FCE-public, consists of 33,673 sentences split into test and training sets of 2,720 and 30,953 sentences, respectively.
146 PAPERS • 1 BENCHMARK
The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos comprising 133,827 frames.
146 PAPERS • 6 BENCHMARKS
YouTubeVIS is a new dataset tailored for tasks like simultaneous detection, segmentation and tracking of object instances in videos. It is collected on top of YouTubeVOS, currently the largest video object segmentation dataset.
146 PAPERS • 2 BENCHMARKS
BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a suite of 23 challenging tasks from BIG-Bench that were found to be beyond the capabilities of current language models. These are tasks on which, in prior evaluations, language models did not outperform the average human rater.
145 PAPERS • 3 BENCHMARKS
DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset is human-annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence. DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document. Along with the human-annotated data, the dataset provides large-scale distantly supervised data.
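A minimal iteration sketch over DocRED-style records; the field names used below (sents, vertexSet, labels with h/t/r/evidence) follow my reading of the public release and should be treated as assumptions, as should the file name.

```python
import json

# train_annotated.json: assumed name of the human-annotated training file.
with open("train_annotated.json") as f:
    docs = json.load(f)

doc = docs[0]
entities = doc["vertexSet"]            # each entity is a list of coreferent mentions
for label in doc["labels"]:
    head = entities[label["h"]][0]["name"]   # head entity (first mention's surface form)
    tail = entities[label["t"]][0]["name"]   # tail entity
    print(head, label["r"], tail, "| evidence sentences:", label["evidence"])
```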
144 PAPERS • 4 BENCHMARKS
Paraphrase Adversaries from Word Scrambling (PAWS) is a dataset containing 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.
144 PAPERS • NO BENCHMARKS YET
Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. This dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from the MS COCO dataset) and the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for at most 10 rounds.
144 PAPERS • 6 BENCHMARKS
WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus.
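As a generic illustration of how two-speaker mixtures of this kind are formed (this is not the official wsj0-2mix generation scripts), one utterance can be scaled to a target signal-to-noise ratio and added to the other:

```python
import numpy as np

def mix_pair(s1: np.ndarray, s2: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix two utterances so that s1 is louder than the scaled s2 by snr_db dB."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    scale = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + scale * s2

# Placeholder signals stand in for two WSJ0 utterances at 16 kHz.
rng = np.random.default_rng(0)
mixture = mix_pair(rng.standard_normal(16000), rng.standard_normal(16000), snr_db=2.5)
print(mixture.shape)
```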
144 PAPERS • 2 BENCHMARKS
aPY is a coarse-grained dataset composed of 15339 images from 3 broad categories (animals, objects and vehicles), further divided into a total of 32 subcategories (aeroplane, …, zebra).
CrowdHuman is a large, richly annotated human detection dataset containing 15,000, 4,370 and 5,000 images collected from the Internet for training, validation and testing respectively. This is more than a 10× increase over previous challenging pedestrian detection datasets such as CityPersons. The total number of persons is also noticeably larger than in the others, with ∼340k person and ∼99k ignore-region annotations in the CrowdHuman training subset.
143 PAPERS • 2 BENCHMARKS
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.
The WebNLG corpus comprises sets of triples describing facts (entities and relations between them) together with the corresponding facts expressed as natural language text. The corpus contains sets with up to 7 triples each, along with one or more reference texts for each set. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.
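A hypothetical example (not taken verbatim from the corpus) of the data-to-text structure, plus a trivial baseline verbalizer for illustration:

```python
# Hypothetical WebNLG-style example: a set of triples plus reference texts.
example = {
    "triples": [
        ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
        ("Alan_Bean", "occupation", "Test_pilot"),
    ],
    "references": [
        "Alan Bean, who was born in Wheeler, Texas, worked as a test pilot.",
    ],
}

def verbalize(triples):
    # Trivial template baseline: concatenate each triple as a crude sentence.
    return " ".join(
        f"{s.replace('_', ' ')} {p} {o.replace('_', ' ')}." for s, p, o in triples
    )

print(verbalize(example["triples"]))
```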
143 PAPERS • 17 BENCHMARKS
Celeb-DF is a large-scale challenging dataset for deepfake forensics. It includes 590 original videos collected from YouTube with subjects of different ages, ethnic groups and genders, and 5639 corresponding DeepFake videos.
142 PAPERS • NO BENCHMARKS YET
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
142 PAPERS • 3 BENCHMARKS
The Hateful Memes dataset is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. Images were licensed from Getty Images so that researchers can use the dataset to support their work.
142 PAPERS • 1 BENCHMARK
The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos where a template consists of a varying number of still images and video frames from different sources. These images and videos are collected from the Internet and are totally unconstrained, with large variations in pose, illumination, image quality etc. In addition, the dataset comes with protocols for 1-to-1 template-based face verification, 1-to-N template-based open-set face identification, and 1-to-N open-set video face identification.
142 PAPERS • 5 BENCHMARKS