MuJoCo (multi-joint dynamics with contact) is a physics engine used to implement environments to benchmark Reinforcement Learning methods.
1,381 PAPERS • 2 BENCHMARKS
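A minimal sketch of stepping a model with the official `mujoco` Python bindings (call names assumed from the current bindings; older projects use mujoco-py instead):

```python
import mujoco

# A tiny hand-written model: one free-falling sphere.
XML = """
<mujoco>
  <worldbody>
    <body name="ball" pos="0 0 1">
      <freejoint/>
      <geom type="sphere" size="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

for _ in range(1000):            # ~2 seconds at the default 2 ms timestep
    mujoco.mj_step(model, data)

print(data.qpos)                 # free-joint pose of the ball after the rollout
```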
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes environments such as Algorithmic, Atari, Box2D, Classic Control, MuJoCo, Robotics, and Toy Text.
1,205 PAPERS • 3 BENCHMARKS
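A minimal interaction loop with Gym's classic API (this uses the pre-Gymnasium signature, where reset() returns only the observation and step() returns four values):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy
    obs, reward, done, info = env.step(action)   # classic 4-tuple step API
    total_reward += reward
env.close()
print("episode return:", total_reward)
```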
CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. It provides sensors in the form of RGB cameras (with customizable positions), ground-truth depth maps, ground-truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation).
1,055 PAPERS • 3 BENCHMARKS
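A hedged sketch of connecting to a running CARLA server, spawning a vehicle and attaching an RGB camera, following the standard Python API (assumes a simulator is already listening on localhost:2000):

```python
import random
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn a random vehicle at a random map spawn point.
blueprints = world.get_blueprint_library()
vehicle_bp = random.choice(blueprints.filter("vehicle.*"))
spawn_point = random.choice(world.get_map().get_spawn_points())
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)

# Attach an RGB camera and save frames to disk as they arrive.
camera_bp = blueprints.find("sensor.camera.rgb")
camera = world.spawn_actor(
    camera_bp,
    carla.Transform(carla.Location(x=1.5, z=2.4)),
    attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```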
D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankaKitchen and CARLA.
401 PAPERS • 16 BENCHMARKS
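A minimal sketch of loading a D4RL dataset through its Gym wrapper (environment names follow the published D4RL naming scheme):

```python
import gym
import d4rl  # registers the offline-RL environments with gym

env = gym.make("halfcheetah-medium-v2")

dataset = env.get_dataset()            # raw dataset: observations, actions, rewards, ...
print(dataset["observations"].shape)

# Transition-formatted dataset for Q-learning style offline methods.
qdata = d4rl.qlearning_dataset(env)    # adds next_observations / terminals keys
print(sorted(qdata.keys()))
```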
The Arcade Learning Environment (ALE) is an object-oriented framework that allows researchers to develop AI agents for Atari 2600 games. It is built on top of the Atari 2600 emulator Stella and separates the details of emulation from agent design.
352 PAPERS • 57 BENCHMARKS
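A hedged sketch of driving an Atari 2600 game directly through ale-py (the ROM filename is a placeholder; ROMs must be supplied separately):

```python
from ale_py import ALEInterface

ale = ALEInterface()
ale.setInt("random_seed", 123)
ale.loadROM("breakout.bin")        # hypothetical local ROM path

actions = ale.getLegalActionSet()
total_reward = 0.0
while not ale.game_over():
    a = actions[0]                 # a fixed (no-op) action, purely for illustration
    total_reward += ale.act(a)     # act() returns the reward for that frame
ale.reset_game()
print("episode score:", total_reward)
```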
The DeepMind Control Suite (DMCS) is a set of simulated continuous control environments with a standardized structure and interpretable rewards. The tasks are written in Python and powered by the MuJoCo physics engine, making them easy to use and modify. Control Suite tasks include Pendulum, Acrobot, Cart-pole, Cart-k-pole, Ball in cup, Point-mass, Reacher, Finger, Hopper, Fish, Cheetah, Walker, Manipulator, Manipulator extra, Stacker, Swimmer, Humanoid, Humanoid_CMU and LQR.
313 PAPERS • 11 BENCHMARKS
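A minimal sketch of loading a Control Suite task with dm_control and stepping it with random actions sampled from the action spec:

```python
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Uniform random actions within the spec's bounds.
    action = np.random.uniform(action_spec.minimum,
                               action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)

print(time_step.reward, list(time_step.observation.keys()))
```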
AirSim is a simulator for drones, cars and more, built on Unreal Engine. It is open-source, cross-platform, and supports software-in-the-loop simulation with popular flight controllers such as PX4 and ArduPilot, as well as hardware-in-the-loop simulation with PX4, for physically and visually realistic simulations. It is developed as an Unreal plugin that can simply be dropped into any Unreal environment. An experimental Unity plugin also exists.
240 PAPERS • NO BENCHMARKS YET
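A hedged sketch of the AirSim multirotor client API (assumes an AirSim simulation is already running; call names follow the published Python client):

```python
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()                        # blocking take-off
client.moveToPositionAsync(-10, 10, -10, 5).join()  # fly to (x, y, z) in NED at 5 m/s

state = client.getMultirotorState()
print(state.kinematics_estimated.position)
```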
AI2-THOR is an interactive environment for embodied AI. It contains four types of scenes (kitchen, living room, bedroom and bathroom), each with 30 rooms, where every room is unique in terms of furniture placement and item types. There are over 2,000 unique objects for AI agents to interact with.
186 PAPERS • 1 BENCHMARK
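A hedged sketch of stepping an AI2-THOR scene through its Python controller (keyword-style actions follow the newer ai2thor API; FloorPlan1 is one of the kitchen scenes):

```python
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")
event = controller.step(action="MoveAhead")
event = controller.step(action="RotateRight", degrees=90)

print(event.metadata["agent"]["position"])
print(len(event.metadata["objects"]), "objects reported in metadata")
controller.stop()
```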
ViZDoom is an AI research platform based on the classic first-person shooter game Doom. The most popular game mode is probably the so-called Death Match, where several players join a maze and fight against each other. After a fixed time, the match ends and all players are ranked by their FRAG scores, defined as kills minus suicides. During the game, each player can access various observations, including the first-person view screen pixels, the corresponding depth map and segmentation map (pixel-wise object labels), the bird's-eye-view maze map, etc. The valid actions include almost all the keyboard and mouse controls a human player can use, covering moving, turning, jumping, shooting, changing weapons, etc. ViZDoom can run a game either synchronously or asynchronously, i.e., the game core either waits until all players' actions are collected or runs at a constant frame rate without waiting.
149 PAPERS • 3 BENCHMARKS
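A minimal sketch of the ViZDoom game loop (the config path is a placeholder pointing at one of the scenario files shipped with the package):

```python
import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("scenarios/basic.cfg")   # hypothetical path to a bundled scenario
game.init()

# Each action is a list of button states; here: move left, move right, attack.
actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()              # screen buffer, depth map, game variables, ...
    reward = game.make_action(random.choice(actions))

print("total reward:", game.get_total_reward())
game.close()
```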
TORCS (The Open Racing Car Simulator) is a driving simulator. It is capable of simulating the essential elements of vehicular dynamics such as mass, rotational inertia, collision, mechanics of suspensions, links and differentials, friction and aerodynamics. Physics simulation is simplified and is carried out through Euler integration of differential equations at a temporal discretization level of 0.002 seconds. The rendering pipeline is lightweight and based on OpenGL that can be turned off for faster training. TORCS offers a large variety of tracks and cars as free assets. It also provides a number of programmed robot cars with different levels of performance that can be used to benchmark the performance of human players and software driving agents. TORCS was built with the goal of developing Artificial Intelligence for vehicular control and has been used extensively by the machine learning community ever since its inception.
89 PAPERS • NO BENCHMARKS YET
RLBench is an ambitious large-scale benchmark and learning environment designed to facilitate research in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning.
87 PAPERS • 3 BENCHMARKS
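A heavily hedged sketch of launching an RLBench task and stepping it with random joint velocities, based on the older RLBench API (module paths and action modes changed in later releases, so treat the names below as assumptions):

```python
import numpy as np
from rlbench.environment import Environment
from rlbench.action_modes import ActionMode, ArmActionMode
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import ReachTarget

action_mode = ActionMode(ArmActionMode.ABS_JOINT_VELOCITY)
env = Environment(action_mode, obs_config=ObservationConfig(), headless=True)
env.launch()

task = env.get_task(ReachTarget)
descriptions, obs = task.reset()           # natural-language task descriptions + first obs

# 7 joint velocities + 1 gripper action (assumed for the default Panda arm).
action = np.random.normal(size=8)
obs, reward, terminate = task.step(action)

env.shutdown()
```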
Brax is a differentiable physics engine that simulates environments made up of rigid bodies, joints, and actuators. Brax is written in JAX and is designed for use on acceleration hardware. It is both efficient for single-device simulation, and scalable to massively parallel simulation on multiple devices, without the need for pesky datacenters.
72 PAPERS • NO BENCHMARKS YET
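A hedged sketch of rolling out a Brax environment with JAX (names follow the original brax.envs API; newer releases reorganize these modules):

```python
import jax
from brax import envs

env = envs.create(env_name="ant")
reset_fn = jax.jit(env.reset)
step_fn = jax.jit(env.step)

rng = jax.random.PRNGKey(0)
state = reset_fn(rng)
for _ in range(100):
    rng, key = jax.random.split(rng)
    # Uniform random actions in [-1, 1] of the environment's action size.
    action = jax.random.uniform(key, (env.action_size,), minval=-1.0, maxval=1.0)
    state = step_fn(state, action)

print(state.reward)
```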
ALFWorld contains interactive TextWorld environments (Côté et al.) that parallel embodied worlds in the ALFRED dataset (Shridhar et al.). The aligned environments allow agents to reason and learn high-level policies in an abstract space before solving embodied tasks through low-level actuation.
65 PAPERS • NO BENCHMARKS YET
Jericho is a learning environment for man-made Interactive Fiction (IF) games.
46 PAPERS • NO BENCHMARKS YET
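A hedged sketch of playing an Interactive Fiction game with Jericho (the game file path is a placeholder; Z-machine game files are obtained separately):

```python
from jericho import FrotzEnv

env = FrotzEnv("zork1.z5")                 # hypothetical path to a game file
obs, info = env.reset()
print(obs)

obs, reward, done, info = env.step("open mailbox")
valid_actions = env.get_valid_actions()    # template-based valid-action detection
print(reward, info["score"], valid_actions[:5])
```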
Kubric is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
42 PAPERS • 1 BENCHMARK
The General Video Game AI (GVGAI) framework is widely used in research; it features a corpus of over 100 single-player games and 60 two-player games. These are fairly small games, each focusing on specific mechanics or skills the players should be able to demonstrate, including clones of classic arcade games such as Space Invaders, puzzle games like Sokoban, adventure games like Zelda, or game-theory problems such as the Iterated Prisoner's Dilemma. All games are real-time and require players to make decisions in only 40 ms at every game tick, although not all games explicitly reward or require fast reactions; in fact, some of the best game-playing approaches use the time at the beginning of the game to run Breadth-First Search in puzzle games in order to find an accurate solution. However, given the large variety of games (many of which are stochastic and difficult to predict accurately), scoring systems and termination conditions, all unknown to the players, highly adaptive general agents are required.
34 PAPERS • NO BENCHMARKS YET
Benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
31 PAPERS • 2 BENCHMARKS
ManiSkill is a large-scale learning-from-demonstrations benchmark for articulated object manipulation with visual input (point cloud and image). ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects, and each task is carefully designed for learning manipulation on a single category of objects. ManiSkill is equipped with high-quality demonstrations to facilitate learning-from-demonstrations approaches and to evaluate common baseline algorithms. ManiSkill aims to encourage the robot learning community to further explore learning generalizable object manipulation skills.
24 PAPERS • NO BENCHMARKS YET
The StarCraft II Learning Environment (SC2LE) is a reinforcement learning environment based on the game StarCraft II. The environment consists of three sub-components: a Linux StarCraft II binary, the StarCraft II API and PySC2. The StarCraft II API allows programmatic control of StarCraft II. It can be used to start a game, get observations, take actions, and review replays. PySC2 is a Python environment that wraps the StarCraft II API to ease the interaction between Python reinforcement learning agents and StarCraft II. It defines an action and observation specification, and includes a random agent and a handful of rule-based agents as examples. It also includes some mini-games as challenges and visualization tools to understand what the agent can see and do.
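A hedged sketch of creating an SC2LE mini-game environment with PySC2 and stepping it with a no-op action (argument names follow the published PySC2 API; a local StarCraft II installation is assumed):

```python
from pysc2.env import sc2_env
from pysc2.lib import actions, features

with sc2_env.SC2Env(
        map_name="MoveToBeacon",
        players=[sc2_env.Agent(sc2_env.Race.terran)],
        agent_interface_format=features.AgentInterfaceFormat(
            feature_dimensions=features.Dimensions(screen=84, minimap=64)),
        step_mul=8) as env:
    timesteps = env.reset()
    while not timesteps[0].last():
        no_op = actions.FUNCTIONS.no_op()   # the simplest valid function call
        timesteps = env.step([no_op])
```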
Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.
ManiSkill2 is the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills. It includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D input data simulated by fully dynamic engines.
22 PAPERS • NO BENCHMARKS YET
Gibson is an open-source perceptual and physics simulator to explore active and real-world perception. The Gibson Environment is used for Real-World Perception Learning.
21 PAPERS • NO BENCHMARKS YET
MINOS is a simulator designed to support the development of multisensory models for goal-directed navigation in complex indoor environments. MINOS leverages large datasets of complex 3D environments and supports flexible configuration of multimodal sensor suites.
The NetHack Learning Environment (NLE) is a Reinforcement Learning environment based on NetHack 3.6.6. It is designed to provide a standard reinforcement learning interface to the game, and comes with tasks that function as a first step to evaluate agents on this new environment. NetHack is one of the oldest and arguably most impactful videogames in history, as well as being one of the hardest roguelikes currently being played by humans. It is procedurally generated, rich in entities and dynamics, and overall an extremely challenging environment for current state-of-the-art RL agents, while being much cheaper to run compared to other challenging testbeds. Through NLE, the authors wish to establish NetHack as one of the next challenges for research in decision making and machine learning.
19 PAPERS • 1 BENCHMARK
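A minimal sketch of the NLE gym interface, following the NLE README (task names such as NetHackScore-v0 are registered when nle is imported):

```python
import gym
import nle  # registers the NetHack tasks with gym

env = gym.make("NetHackScore-v0")
obs = env.reset()           # observation is a dict of glyphs, stats, messages, ...
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
env.render()
```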
Obstacle Tower is a high-fidelity, 3D, third-person, procedurally generated environment for reinforcement learning. An agent playing Obstacle Tower must learn to solve both low-level control and high-level planning problems in tandem while learning from pixels and a sparse reward signal. Unlike other benchmarks such as the Arcade Learning Environment, evaluation of agent performance in Obstacle Tower is based on an agent's ability to perform well on unseen instances of the environment.
19 PAPERS • 6 BENCHMARKS
Physion is a visual and physical prediction benchmark to measure the performance of machine learning models on making predictions about commonplace real world physical events. In realistically simulating a wide variety of physical phenomena -- rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, projectile motion -- this dataset presents a more comprehensive challenge than existing benchmarks. Moreover, the dataset also contains human responses for the stimuli so that model predictions can be directly compared to human judgments.
15 PAPERS • NO BENCHMARKS YET
PlasticineLab is a differentiable physics benchmark, which includes a diverse collection of soft-body manipulation tasks. In each task, the agent uses manipulators to deform the plasticine into a desired configuration. The underlying physics engine supports differentiable elastic and plastic deformation using the DiffTaichi system, posing many under-explored challenges to robotic agents.
SMACv2 (StarCraft Multi-Agent Challenge v2) is a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation.
14 PAPERS • NO BENCHMARKS YET
SUMMIT is a high-fidelity simulator that facilitates the development and testing of crowd-driving algorithms. By leveraging the open-source OpenStreetMap map database and a heterogeneous multi-agent motion prediction model developed in our earlier work, SUMMIT simulates dense, unregulated urban traffic for heterogeneous agents at any worldwide location that OpenStreetMap supports. SUMMIT is built as an extension of CARLA and inherits from it the physical and visual realism for autonomous driving simulation. SUMMIT supports a wide range of applications, including perception, vehicle control, planning, and end-to-end learning.
13 PAPERS • NO BENCHMARKS YET
MineRL BASALT is an RL competition on solving human-judged tasks. The tasks in this competition do not have a pre-defined reward function: the goal is to produce trajectories that are judged by real humans to be effective at solving a given task.
12 PAPERS • NO BENCHMARKS YET
RTMV is a large-scale synthetic dataset for novel view synthesis consisting of ∼300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 × 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures.
12 PAPERS • 1 BENCHMARK
A rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al.).
11 PAPERS • NO BENCHMARKS YET
RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance them on top of one another.
11 PAPERS • 3 BENCHMARKS
The StarCraft Multi-Agent Challenges+ benchmark requires agents to learn to complete multi-stage tasks and to use environmental factors without precise reward functions. The previous challenge (SMAC), recognized as a standard benchmark for Multi-Agent Reinforcement Learning, is mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. The study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features; for example, agents need to position themselves behind protective structures to make it harder for enemies to attack.
11 PAPERS • 13 BENCHMARKS
Griddly is an environment for grid-world based research. Griddly provides a highly optimized game state and rendering engine with a flexible high-level interface for configuring environments. Not only does Griddly offer simple interfaces for single, multi-player and RTS games, but also multiple methods of rendering, configurable partial observability and interfaces for procedural content generation.
10 PAPERS • NO BENCHMARKS YET
Mario AI was a benchmark environment for reinforcement learning. The gameplay in Mario AI, as in Nintendo's original version, consists of moving the controlled character, Mario, through two-dimensional levels viewed from the side. Mario can walk and run to the right and left, jump, and (depending on which state he is in) shoot fireballs. Gravity acts on Mario, making it necessary to jump over cliffs to get past them. Mario can be in one of three states: Small, Big (can kill enemies by jumping onto them), and Fire (can shoot fireballs).
iGibson 2.0 is an open-source simulation environment that supports the simulation of a more diverse set of household tasks through three key innovations. First, iGibson 2.0 supports object states, including temperature, wetness level, cleanliness level, and toggled and sliced states, necessary to cover a wider range of tasks. Second, iGibson 2.0 implements a set of predicate logic functions that map the simulator states to logic states like Cooked or Soaked. Additionally, given a logic state, iGibson 2.0 can sample valid physical states that satisfy it. This functionality can generate potentially infinite instances of tasks with minimal effort from the users. The sampling mechanism allows our scenes to be more densely populated with small objects in semantically meaningful locations. Third, iGibson 2.0 includes a virtual reality (VR) interface to immerse humans in its scenes to collect demonstrations.
This dataset consists of multiple indoor and outdoor experiments with gNB-UE links of up to 30 m. In each experiment, we fixed the location of the gNB and moved the UE in increments of roughly one degree. The accompanying table specifies the direction of user movement with respect to the gNB-UE link, the distance resolution, and the number of user locations for which channel measurements were conducted. The outdoor 30 m data also contains blockage between 3.9 m and 4.8 m. At each location, we scan the transmission beam and collect data for each beam. By doing so, we can obtain the full OFDM channels for different locations along the moving trajectory with all beam angles. Moreover, we use 240 kHz subcarrier spacing, consistent with the 5G NR numerology at FR2, so the collected data is a true reflection of what a 5G UE will see.
9 PAPERS • NO BENCHMARKS YET
CHALET is a 3D house simulator with support for navigation and manipulation. Unlike existing systems, CHALET supports both a wide range of object manipulations and complex environment layouts consisting of multiple rooms. The range of object manipulations includes the ability to pick up and place objects, toggle the state of objects like taps or televisions, open or close containers, and insert or remove objects from these containers. In addition, the simulator comes with 58 rooms that can be combined to create houses, including 10 default house layouts. CHALET is therefore suitable for setting up challenging environments for various AI tasks that require complex language understanding and planning, such as navigation, manipulation, instruction following, and interactive question answering.
Bridge Data is a large multi-domain and multi-task dataset containing 7,200 demonstrations of a robot performing 71 kitchen tasks across 10 environments with varying lighting, robot positions, and backgrounds. The data is collected using a low-cost yet versatile 6-DoF WidowX250 robot arm. It can be used to boost the generalization of robotic skills and to empirically study how it can improve the learning of new tasks in new environments.
8 PAPERS • NO BENCHMARKS YET
LANI is a 3D navigation environment and corpus, where an agent navigates between landmarks. LANI contains 27,965 crowd-sourced instructions for navigation in an open environment. Each datapoint includes an instruction, a human-annotated ground-truth demonstration trajectory, and an environment with various landmarks and lakes. The train/dev/test split is 19,758/4,135/4,072. Each environment specification defines the placement of 6–13 landmarks within a square grass field of size 50 m × 50 m.
EvoGym is a large-scale benchmark for co-optimizing the design and control of soft robots.
7 PAPERS • NO BENCHMARKS YET
MengeROS is an open-source crowd simulation tool for robot navigation that integrates Menge with ROS. It extends Menge to introduce one or more robot agents into a crowd of pedestrians. Each robot agent is controlled by external ROS-compatible controllers. MengeROS has been used to simulate crowds with up to 1000 pedestrians and 20 robots.
NeoRL is a collection of environments and datasets for offline reinforcement learning with a special focus on real-world applications. The design follows real-world properties such as conservative behavior policies, limited amounts of data, high-dimensional state and action spaces, and the highly stochastic nature of the environments. The datasets include robotics, industrial control, finance trading and city management tasks with real-world properties, and come in three dataset sizes and three data-quality levels to mimic the datasets encountered in offline RL scenarios. Users can use the datasets to evaluate offline RL algorithms under near real-world conditions.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, the datasets are provided with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established. This is a dataset accompanying the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.
A multivariate spatio-temporal benchmark dataset for meteorological forecasting based on real-time observation data from ground weather stations.
7 PAPERS • 16 BENCHMARKS
The AtariARI (Atari Annotated RAM Interface) is an environment for representation learning. The Atari Arcade Learning Environment (ALE) does not explicitly expose any ground-truth state information. However, ALE does expose the RAM state (128 bytes per timestep), which is used by the game programmer to store important state information such as the location of sprites, the state of the clock, or the current room the agent is in. To extract these variables, the dataset creators consulted commented disassemblies (or source code) of Atari 2600 games which were made available by Engelhardt and Jentzsch and CPUWIZ. The dataset creators were able to find and verify important state variables for a total of 22 games. Once this information was acquired, combining it with the ALE interface produced a wrapper that can automatically output a state label for every example frame generated from the game. The dataset creators make this available with an easy-to-use gym wrapper, which returns this information with every step.
6 PAPERS • NO BENCHMARKS YET
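A hedged sketch of wrapping an Atari environment with the AtariARI wrapper, which surfaces the labelled RAM state variables through the step info dict (module path follows the project README):

```python
import gym
from atariari.benchmark.wrapper import AtariARIWrapper

env = AtariARIWrapper(gym.make("MsPacmanNoFrameskip-v4"))
obs = env.reset()
obs, reward, done, info = env.step(1)
print(info["labels"])   # e.g. player/enemy coordinates decoded from RAM
```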
CARL (context adaptive RL) provides highly configurable contextual extensions to several well-known RL environments. It is designed to test an agent's generalization capabilities in scenarios where intra-task generalization is important.