SHAPES is a dataset of synthetic images designed to benchmark how well systems understand spatial and logical relations among multiple objects. The dataset consists of complex questions about arrangements of colored shapes. The questions are built around compositions of concepts and relations, e.g. "Is there a red shape above a circle?" or "Is a red shape blue?". Questions contain between two and four attributes, object types, or relationships. There are 244 questions and 15,616 images in total, with all questions having a yes and no answer (and corresponding supporting image). This eliminates the risk of learning biases.
111 PAPERS • 1 BENCHMARK
UCI Machine Learning Repository is a collection of over 550 datasets.
52 PAPERS • 9 BENCHMARKS
The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since then it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a
32 PAPERS • 2 BENCHMARKS
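The 1-nearest neighbor baseline mentioned in the UCR entry above can be sketched as follows. This is a minimal illustration, not the archive's reference implementation: the Euclidean distance, the function name `one_nn_predict`, and the toy series are all assumptions made for the example.

```python
import numpy as np


def one_nn_predict(train_X, train_y, test_X):
    """Label each test series with the label of its nearest training series.

    Assumes equal-length series and Euclidean distance (an assumption;
    the entry above does not name a distance measure).
    """
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    diffs = test_X[:, None, :] - train_X[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the closest training series for every test series.
    return train_y[np.argmin(dists, axis=1)]


# Hypothetical toy data: two training series, two test series.
train_X = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
train_y = ["flat_low", "flat_high"]
test_X = [[0.1, -0.1, 0.0], [0.9, 1.2, 1.0]]
predictions = one_nn_predict(train_X, train_y, test_X)
```

A baseline this small is useful mainly as a sanity check: a new method should be compared against it on the same train/test split.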
The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units (ICU). Each record consists of roughly 48 hours of multivariate time series data with up to 37 features recorded at various times from the patients during their stay such as respiratory rate, glucose etc.
19 PAPERS • 5 BENCHMARKS
Caenorhabditis elegans is a roundworm commonly used as a model organism in the study of genetics. The movement of these worms is known to be a useful indicator for understanding behavioural genetics. Brown et al. [1] describe a system for recording the motion of worms on an agar plate and measuring a range of human-defined features [2]. It has been shown that the space of shapes Caenorhabditis elegans adopts on an agar plate can be represented by combinations of six base shapes, or eigenworms. Once the worm outline is extracted, each frame of worm motion can be captured by six scalars representing the amplitudes along each dimension when the shape is projected onto the six eigenworms. Using data collected for the work described in [1], we address the problem of classifying individual worms as wild-type or mutant based on the time series. The data were extracted from the C. elegans behavioural database [3]. We have 259 cases, which we split 131 train and 128 test. We have truncated e
17 PAPERS • 1 BENCHMARK
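The eigenworm projection described in the C. elegans entry above can be illustrated with a small sketch. It assumes each frame's shape is a fixed-length vector (48 values here, a hypothetical choice) and that the six eigenworms form an orthonormal basis. The basis below is random and serves only to demonstrate the arithmetic; it is not the published eigenworm set.

```python
import numpy as np


def project_onto_eigenworms(shape_vector, eigenworms):
    """Return one amplitude per eigenworm for a single frame.

    eigenworms: array of shape (6, n_points) with orthonormal rows (assumed).
    shape_vector: array of shape (n_points,) describing the worm's posture.
    """
    return np.asarray(eigenworms) @ np.asarray(shape_vector)


# Build a hypothetical orthonormal basis of six vectors with 48 points each.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(48, 6)))
eigenworms = q.T  # shape (6, 48), rows are orthonormal

# Compose a frame from known amplitudes, then recover them by projection.
true_amplitudes = np.array([1.0, -0.5, 0.25, 0.0, 2.0, -1.0])
frame = eigenworms.T @ true_amplitudes
recovered = project_onto_eigenworms(frame, eigenworms)
```

Applying the projection to every frame of a recording yields a six-channel time series, which is the form the classification task above operates on.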
This dataset is composed of two collections of heartbeat signals derived from two famous PhysioNet datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The number of samples in both collections is large enough for training a deep neural network.
8 PAPERS • NO BENCHMARKS YET
The original dataset for "ECG5000" is a 20-hour long ECG downloaded from PhysioNet. The name is BIDMC Congestive Heart Failure Database (chfdb) and it is record "chf07". It was originally published in "Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23)". The data was pre-processed in two steps: (1) extract each heartbeat, (2) make each heartbeat equal length using interpolation. This dataset was originally used in the paper "A general framework for never-ending learning from time series streams", DAMI 29(6). After that, 5,000 heartbeats were randomly selected. The patient has severe congestive heart failure and the class values were obtained by automated annotation.
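The second pre-processing step above (making each heartbeat equal length using interpolation) might look like the sketch below. Linear interpolation, the target length of 5 samples, and the helper name `resample_to_length` are assumptions for illustration; the entry does not state which interpolation method or length was actually used.

```python
import numpy as np


def resample_to_length(beat, target_len):
    """Linearly interpolate a 1-D heartbeat onto target_len evenly spaced points."""
    beat = np.asarray(beat, dtype=float)
    # Place old and new samples on a common normalized time axis [0, 1].
    old_positions = np.linspace(0.0, 1.0, num=len(beat))
    new_positions = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_positions, old_positions, beat)


# Two hypothetical heartbeats of different lengths.
beats = [[0.0, 1.0, 0.0], [0.0, 0.5, 1.0, 0.5, 0.0]]
equal_length = np.stack([resample_to_length(b, 5) for b in beats])
```

After resampling, all heartbeats share one length and can be stacked into a single array for training.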
5 PAPERS • 3 BENCHMARKS
ECG200
3 PAPERS • 2 BENCHMARKS
State-level data for the US economy. The changes in the number of employees based on one million employees active in the US during the COVID-19 pandemic are gathered from Homebase (Bartik et al. 2020). We further enriched the data with the state-level policies as an indication of extreme events (e.g., the state’s business closure order).
2 PAPERS • 1 BENCHMARK
The collected dataset consists of multivariate time series (MTS) data from several bank ATMs, along with the annotations that operators made whenever they performed a maintenance task on any of the machines.
1 PAPER • NO BENCHMARKS YET
Recorded with a Husky A200 wheeled UGV, BorealTC contains 116 min of Inertial Measurement Unit (IMU), motor current, and wheel odometry data, focusing on typical boreal forest terrains, notably snow, ice, and silty loam. The dataset also includes experiments on asphalt and flooring. All runs were recorded in Forêt Montmorency and on the main campus of Université Laval, Quebec City, Québec, Canada.
1 PAPER • 1 BENCHMARK
The dataset is a collection of real-world industrial vibration data from a brownfield CNC milling machine. Acceleration was measured with a tri-axial accelerometer (Bosch CISS Sensor) mounted inside the machine. The X-, Y-, and Z-axes of the accelerometer were recorded at a sampling rate of 2 kHz. Both normal and anomalous data were collected for 4 different timeframes, each lasting 5 months from February 2019 until August 2021, and labelled accordingly. The data can be used to investigate the scalability of models and to research process variations, as the anomaly impact differs. In total there is data from three different CNC milling machines, each executing 15 processes. For a detailed description of the data and experimental set-up, please refer to the paper: https://doi.org/10.1016/j.procir.2022.04.022
All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.
The data generated from this study are grouped into 3 main types: (1) participant demographic and clinical data, (2) sensor data from the different devices, as well as clinical scores and metadata related to the tasks performed, and (3) participant diaries collected during the in-clinic and at-home phases of the study. Throughout the data tables, timestamps are provided as UNIX epoch/POSIX time.
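Because the timestamps in these tables are UNIX epoch/POSIX time, they usually need converting before analysis. The sketch below assumes the values are in seconds (not milliseconds) and converts to UTC; both are assumptions, since the entry does not state the unit or time zone.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts_seconds):
    """Convert a UNIX epoch timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc)


# The epoch origin and one day (86,400 seconds) later.
origin = epoch_to_utc(0)
next_day = epoch_to_utc(86_400)
```

If the values turn out to be in milliseconds, dividing by 1000 before calling the helper gives the same result.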
Recorded with a Husky A200 wheeled UGV, the Vulpi 2021 dataset contains 13 min of Inertial Measurement Unit (IMU), motor current, and wheel odometry data, focusing on agricultural terrains. The dataset includes experiments on concrete, a dirt road, a ploughed terrain and an unploughed terrain that were all recorded on an experimental farm in San Cassiano, Lecce, Italy.
Data for the paper "Quantifying yeast colony morphologies with feature engineering from time-lapse photography" by A. Goldschmidt et al. (https://arxiv.org/abs/2201.05259).
The Tufts fNIRS to Mental Workload (fNIRS2MW) open-access dataset is a new dataset for building machine learning classifiers that can consume a short window (30 seconds) of multivariate fNIRS recordings and predict the mental workload intensity of the user during that window.
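Cutting a recording into the 30-second windows a classifier would consume can be sketched as below. The sampling rate (10 Hz), channel count (8), non-overlapping windows, and the function name `split_into_windows` are hypothetical choices for illustration; they are not taken from the dataset's documentation.

```python
import numpy as np


def split_into_windows(recording, sampling_rate_hz, window_seconds=30.0):
    """Split a (time, channels) recording into non-overlapping fixed-length windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    recording = np.asarray(recording)
    window_len = int(round(sampling_rate_hz * window_seconds))
    n_windows = recording.shape[0] // window_len
    trimmed = recording[: n_windows * window_len]
    return trimmed.reshape(n_windows, window_len, recording.shape[1])


# Hypothetical recording: 65 seconds at 10 Hz with 8 channels.
demo = np.zeros((650, 8))
windows = split_into_windows(demo, sampling_rate_hz=10.0)
```

Each resulting window has shape (samples, channels) and would be paired with one workload label for that window.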
0 PAPER • NO BENCHMARKS YET