D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankKitchen and CARLA.
401 PAPERS • 16 BENCHMARKS
Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations.
184 PAPERS • 4 BENCHMARKS
FairFace is a face image dataset which is race balanced. It contains 108,501 images from 7 different race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups.
164 PAPERS • 1 BENCHMARK
ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode is probably the so-called Death Match, where several players join in a maze and fight against each other. After a fixed time, the match ends and all the players are ranked by the FRAG scores defined as kills minus suicides. During the game, each player can access various observations, including the first-person view screen pixels, the corresponding depth-map and segmentation-map (pixel-wise object labels), the bird-view maze map, etc. The valid actions include almost all the keyboard-stroke and mouse-control a human player can take, accounting for moving, turning, jumping, shooting, changing weapon, etc. ViZDoom can run a game either synchronously or asynchronously, indicating whether the game core waits until all players’ actions are collected or runs in a constant frame rate without waiting.
149 PAPERS • 3 BENCHMARKS
TORCS (The Open Racing Car Simulator) is a driving simulator. It is capable of simulating the essential elements of vehicular dynamics such as mass, rotational inertia, collision, mechanics of suspensions, links and differentials, friction and aerodynamics. Physics simulation is simplified and is carried out through Euler integration of differential equations at a temporal discretization level of 0.002 seconds. The rendering pipeline is lightweight and based on OpenGL that can be turned off for faster training. TORCS offers a large variety of tracks and cars as free assets. It also provides a number of programmed robot cars with different levels of performance that can be used to benchmark the performance of human players and software driving agents. TORCS was built with the goal of developing Artificial Intelligence for vehicular control and has been used extensively by the machine learning community ever since its inception.
89 PAPERS • NO BENCHMARKS YET
LogiQA consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform by far worse than human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
71 PAPERS • NO BENCHMARKS YET
ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules.
41 PAPERS • NO BENCHMARKS YET
CoAID include diverse COVID-19 healthcare misinformation, including fake news on websites and social platforms, along with users' social engagement about such news. CoAID includes 4,251 news, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.
38 PAPERS • NO BENCHMARKS YET
A large dataset of musculoskeletal radiographs containing 40,561 images from 14,863 studies, where each study is manually labeled by radiologists as either normal or abnormal.
35 PAPERS • NO BENCHMARKS YET
GazeFollow is a large-scale dataset annotated with the location of where people in images are looking. It uses several major datasets that contain people as a source of images: 1, 548 images from SUN, 33, 790 images from MS COCO, 9, 135 images from Actions 40, 7, 791 images from PASCAL, 508 images from the ImageNet detection challenge and 198, 097 images from the Places dataset. This concatenation results in a challenging and large image collection of people performing diverse activities in many everyday scenarios.
34 PAPERS • NO BENCHMARKS YET
PearRead is a dataset of scientific peer reviews. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.
32 PAPERS • NO BENCHMARKS YET
The SentiCap dataset contains several thousand images with captions with positive and negative sentiments. These sentimental captions are constructed by the authors by re-writing factual descriptions. In total there are 2000+ sentimental captions.
26 PAPERS • NO BENCHMARKS YET
Evidence Inference is a corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs.
25 PAPERS • NO BENCHMARKS YET
ROPES is a QA dataset which tests a system's ability to apply knowledge from a passage of text to a new situation. A system is presented a background passage containing a causal or qualitative relation(s), a novel situation that uses this background, and questions that require reasoning about effects of the relationships in the back-ground passage in the context of the situation.
23 PAPERS • NO BENCHMARKS YET
An open database for sharing robotic experience, which provides an initial pool of 15 million video frames, from 7 different robot platforms, and study how it can be used to learn generalizable models for vision-based robotic manipulation.
21 PAPERS • NO BENCHMARKS YET
Obstacle Tower is a high fidelity, 3D, 3rd person, procedurally generated environment for reinforcement learning. An agent playing Obstacle Tower must learn to solve both low-level control and high-level planning problems in tandem while learning from pixels and a sparse reward signal. Unlike other benchmarks such as the Arcade Learning Environment, evaluation of agent performance in Obstacle Tower is based on an agent’s ability to perform well on unseen instances of the environment.
19 PAPERS • 6 BENCHMARKS
ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed by an active agent, the action(s) it performs and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event.
19 PAPERS • NO BENCHMARKS YET
A benchmark which bridges the gap between freely available, documented, and motivated artificial benchmarks and properties of real industrial problems. The resulting industrial benchmark (IB) has been made publicly available to the RL community by publishing its Java and Python code, including an OpenAI Gym wrapper, on Github.
12 PAPERS • NO BENCHMARKS YET
The TrajNet Challenge represents a large multi-scenario forecasting benchmark. The challenge consists on predicting 3161 human trajectories, observing for each trajectory 8 consecutive ground-truth values (3.2 seconds) i.e., t−7,t−6,…,t, in world plane coordinates (the so-called world plane Human-Human protocol) and forecasting the following 12 (4.8 seconds), i.e., t+1,…,t+12. The 8-12-value protocol is consistent with the most trajectory forecasting approaches, usually focused on the 5-dataset ETH-univ + ETH-hotel + UCY-zara01 + UCY-zara02 + UCY-univ. Trajnet extends substantially the 5-dataset scenario by diversifying the training data, thus stressing the flexibility and generalization one approach has to exhibit when it comes to unseen scenery/situations. In fact, TrajNet is a superset of diverse datasets that requires to train on four families of trajectories, namely 1) BIWI Hotel (orthogonal bird’s eye flight view, moving people), 2) Crowds UCY (3 datasets, tilted bird’s eye view
12 PAPERS • 2 BENCHMARKS
BLVD is a large scale 5D semantics dataset collected by the Visual Cognitive Computing and Intelligent Vehicles Lab. This dataset contains 654 high-resolution video clips owing 120k frames extracted from Changshu, Jiangsu Province, China, where the Intelligent Vehicle Proving Center of China (IVPCC) is located. The frame rate is 10fps/sec for RGB data and 3D point cloud. The dataset contains fully annotated frames which yield 249,129 3D annotations, 4,902 independent individuals for tracking with the length of overall 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are contained in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime).
9 PAPERS • NO BENCHMARKS YET
CHALET is a 3D house simulator with support for navigation and manipulation. Unlike existing systems, CHALET supports both a wide range of object manipulation, as well as supporting complex environemnt layouts consisting of multiple rooms. The range of object manipulations includes the ability to pick up and place objects, toggle the state of objects like taps or televesions, open or close containers, and insert or remove objects from these containers. In addition, the simulator comes with 58 rooms that can be combined to create houses, including 10 default house layouts. CHALET is therefore suitable for setting up challenging environments for various AI tasks that require complex language understanding and planning, such as navigation, manipulation, instruction following, and interactive question answering.
The UKP Argument Annotated Essays corpus consists of argument annotated persuasive essays including annotations of argument components and argumentative relations.
8 PAPERS • NO BENCHMARKS YET
Atari-HEAD is a dataset of human actions and eye movements recorded while playing Atari videos games. For every game frame, its corresponding image frame, the human keystroke action, the reaction time to make that action, the gaze positions, and immediate reward returned by the environment were recorded. The gaze data was recorded using an EyeLink 1000 eye tracker at 1000Hz. The human subjects are amateur players who are familiar with the games. The human subjects were only allowed to play for 15 minutes and were required to rest for at least 15 minutes before the next trial. Data was collected from 4 subjects, 16 games, 175 15-minute trials, and a total of 2.97 million frames/demonstrations.
7 PAPERS • NO BENCHMARKS YET
Engine degradation simulation was carried out using C-MAPSS. Four different were sets simulated under different combinations of operational conditions and fault modes. Records several sensor channels to characterize fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.
6 PAPERS • 3 BENCHMARKS
OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill class from bottle”, 3. “give class to person” ⟩). Given this, OMICS offers useful knowledge about hierarchism of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of the OMICS tuples.
6 PAPERS • NO BENCHMARKS YET
The Flick Cropping Dataset consists of high quality cropping and pairwise ranking annotations used to evaluate the performance of automatic image cropping approaches.
5 PAPERS • NO BENCHMARKS YET
Dem@Care is providing the following datasets, which are collected during lab and home experiments. The data collection took place in the Greek Alzheimer’s Association for Dementia and Related Disorders in Thessaloniki, Greece and in participants’ homes. The datasets include video and audio recordings as well as data from physiological sensors. Moreover, they include data from sleep, motion and plug sensors.
4 PAPERS • NO BENCHMARKS YET
GolfDB is a high-quality video dataset created for general recognition applications in the sport of golf, and specifically for the task of golf swing sequencing.
CUHK Image Cropping is a dataset for image cropping. The photos are of varying aesthetic quality and span a variety of image categories, including animal, architecture, human, landscape, night, plant and man-made objects. Each image is manually cropped by three expert photographers (graduate students in art whose primary medium is photography) to form three training sets. There are 1,000 photos in the dataset.
3 PAPERS • NO BENCHMARKS YET
The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. This ability is referred to as geometric planning. Recently, many interactive environments have been proposed to evaluate intelligent agents on various skills, however, none of them cater to the needs of geometric planning. PackIt is a virtual environment to evaluate and potentially learn the ability to do geometric planning, where an agent needs to take a sequence of actions to pack a set of objects into a box with limited space.
3 PAPERS • 1 BENCHMARK
Covid-HeRA is a dataset for health risk assessment and severity-informed decision making in the presence of COVID19 misinformation. It is a benchmark dataset for risk-aware health misinformation detection, related to the 2019 coronavirus pandemic. Social media posts (Twitter) are annotated based on the perceived likelihood of health behavioural changes and the perceived corresponding risks from following unreliable advice found online.
2 PAPERS • NO BENCHMARKS YET
PICO is a framework to formulate a well-defined focused clinical question. This framework identifies the sentences in a given medical text that belong to the four components: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O). The PubMed PICO Element Detection dataset is a dataset for evaluating models that automatically detect PICO elements.
A View From Somewhere (AVFS)—a dataset of 638,180 face similarity judgments over 4,921 faces. Each judgment corresponds to the odd-one-out (i.e., least similar) face in a triplet of faces and is accompanied by both the identifier and demographic attributes of the annotator who made the judgment.
1 PAPER • NO BENCHMARKS YET
A large-scale comprehensive collection of dashcam videos collected by vehicles on DiDi's platform. D2-City contains more than 10000 video clips which deeply reflect the diversity and complexity of real-world traffic scenarios in China.
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.
This dataset consists of 5808 dialogues, based on 2236 unique scenarios. Each dialogue is converted into two training examples in the dataset, showing the complete conversation from the perspective of each agent. The perspectives differ on their input goals, output choice, and in special tokens marking whether a statement was read or written.
The Spaceship dataset is a dataset for evaluating agents’ ability to learn to solve a class of physics-based tasks. The tasks consist on a spaceship that has to reach a the mothership in 11 steps, in an environment where static planets exert gravitational forces on the spaceship, which induce complex non-linear dynamics on the motion over the 11 steps.
Multi-agent pursuit in matrix world (pursuitMW) is a partially observable Markov game (POMG) between a swarm of pursuers and a swarm of evaders. Algorithms can be developed for the pursuers, evaders, or both of them.