The Human Activity Recognition Dataset has been collected from 30 subjects performing six different activities (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Laying). It consists of inertial sensor data that was collected using a smartphone carried by the subjects.
273 PAPERS • 3 BENCHMARKS
The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.
165 PAPERS • 1 BENCHMARK
The PAMAP2 Physical Activity Monitoring dataset contains data of 18 different physical activities (such as walking, cycling, playing soccer, etc.), performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor. The dataset can be used for activity recognition and intensity estimation, while developing and applying algorithms of data processing, segmentation, feature extraction and classification.
149 PAPERS • 1 BENCHMARK
The M4 dataset is a collection of 100,000 time series used for the fourth edition of the Makridakis forecasting Competition. The M4 dataset consists of time series of yearly, quarterly, monthly and other (weekly, daily and hourly) data, which are divided into training and test sets. The minimum numbers of observations in the training test are 13 for yearly, 16 for quarterly, 42 for monthly, 80 for weekly, 93 for daily and 700 for hourly series. The participants were asked to produce the following numbers of forecasts beyond the available data that they had been given: six for yearly, eight for quarterly, 18 for monthly series, 13 for weekly series and 14 and 48 forecasts respectively for the daily and hourly ones.
85 PAPERS • NO BENCHMARKS YET
The First Temporal Benchmark Designed to Evaluate Real-time Anomaly Detectors Benchmark
62 PAPERS • 1 BENCHMARK
The Shifts Dataset is a dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, `in-the-wild' distributional shifts and pose interesting challenges with respect to uncertainty estimation.
43 PAPERS • 1 BENCHMARK
UK-DALE is an open-access dataset from the UK recording Domestic Appliance-Level Electricity to conduct research on disaggregation algorithms, with data describing not just the aggregate demand per building but also the `ground truth' demand of individual appliances. It was built at a sample rate of 16 kHz for the whole-house and at 1/6 Hz for individual appliances. This is the first open access UK dataset at this temporal resolution. It wAS recorded from five houses, one of which was recorded for 655 days.
41 PAPERS • NO BENCHMARKS YET
The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and will focus on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a
32 PAPERS • 2 BENCHMARKS
This dataset includes time-series data generated by accelerometer and gyroscope sensors (attitude, gravity, userAcceleration, and rotationRate). It is collected with an iPhone 6s kept in the participant's front pocket using SensingKit which collects information from Core Motion framework on iOS devices. All data is collected in 50Hz sample rate. A total of 24 participants in a range of gender, age, weight, and height performed 6 activities in 15 trials in the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing.
29 PAPERS • NO BENCHMARKS YET
This dataset details the energy consumption of appliances in a low-energy building over 4.5 months. Data was collected at 10-minute intervals.
28 PAPERS • NO BENCHMARKS YET
This dataset is from DeepHawkes: Bridging the Gap between Prediction and Understanding of Information Cascades, CIKM 2017. It includes Weibo tweets and their retweets posted in a day.
16 PAPERS • 1 BENCHMARK
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR) -motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
15 PAPERS • 4 BENCHMARKS
Abstract: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
The Argoverse 2 Motion Forecasting Dataset is a curated collection of 250,000 scenarios for training and validation. Each scenario is 11 seconds long and contains the 2D, birds-eye-view centroid and heading of each tracked object sampled at 10 Hz.
14 PAPERS • NO BENCHMARKS YET
The rounD dataset introduces a fresh compilation of natural road user trajectory data from German roundabouts, gathered using drone technology to navigate past usual challenges such as occlusions inherent in traditional traffic data collection methods. It includes traffic data from three unique locations, capturing the movement and categorizing each road user by type. Advanced computer vision algorithms are applied to ensure high positional accuracy. This dataset is highly adaptable for a variety of applications, including predicting road user behavior, driver modeling, scenario-based safety evaluations for automated driving systems, and the data-driven creation of Highly Automated Driving (HAD) system components.
11 PAPERS • NO BENCHMARKS YET
A new dataset of handwritten text with fine-grained annotations at the character level and report results from an initial user evaluation.
9 PAPERS • NO BENCHMARKS YET
4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors 1 mounted to the ceiling of the OR, with one frame-per-second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
8 PAPERS • 1 BENCHMARK
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enh
8 PAPERS • 2 BENCHMARKS
This dataset is composed of two collections of heartbeat signals derived from two famous PhysioNet datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The number of samples in both collections is large enough for training a deep neural network.
8 PAPERS • NO BENCHMARKS YET
The time series segmentation benchmark (TSSB) currently contains 75 annotated time series (TS) with 1-9 segments. Each TS is constructed from one of the UEA & UCR time series classification datasets. We group TS by label and concatenate them to create segments with distinctive temporal patterns and statistical properties. We annotate the offsets at which we concatenated the segments as change points (CPs). Addtionally, we apply resampling to control the dataset resolution and add approximate, hand-selected window sizes that are able to capture temporal patterns.
The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. The collected time series contain a few natural anomalies though the majority of the anomalies are artificial . The dataset was first used in an anomaly detection contest preceding the ACM SIGKDD conference 2021. Each of the time series contains exactly one, occasionally subtle anomaly after a given time stamp. The data before that timestamp can be considered normal. The time series collected in the UCR Anomaly Archive can be categorized into 12 types originating from the four domains human medicine, meteorology, biology and industry. The distribution across the domains is highly imbalanced with around 64% of the times series being collected in human medicine applications, 22% in biology, 9% in industry and 5% being air temperature measurements. The time series within a single type (e.g. ECG) are not completely unique, but differ in terms of injected an
Prediction of Finger Flexion IV Brain-Computer Interface Data Competition The goal of this dataset is to predict the flexion of individual fingers from signals recorded from the surface of the brain (electrocorticography (ECoG)). This data set contains brain signals from three subjects, as well as the time courses of the flexion of each of five fingers. The task in this competition is to use the provided flexion information in order to predict finger flexion for a provided test set. The performance of the classifier will be evaluated by calculating the average correlation coefficient r between actual and predicted finger flexion.
7 PAPERS • 1 BENCHMARK
The Multi-domain Mobile Video Physiology Dataset (MMPD), comprising 11 hours(1152K frames) of recordings from mobile phones of 33 subjects. The dataset was designed to capture videos with greater representation across skin tone, body motion, and lighting conditions. MMPD is comprehensive with eight descriptive labels and can be used in conjunction with the rPPG-toolbox and PhysBench. MMPD is widely used for rPPG tasks and remote heart rate estimation. To access the dataset, you are supposed to download this data release agreement and request downloading by email.
7 PAPERS • NO BENCHMARKS YET
SEN12MS-CR-TS is a multi-modal and multi-temporal data set for cloud removal. It contains time-series of paired and co-registered Sentinel-1 and cloudy as well as cloud-free Sentinel-2 data from European Space Agency's Copernicus mission. Each time series contains 30 cloudy and clear observations regularly sampled throughout the year 2018. Our multi-temporal data set is readily pre-processed and backward-compatible with SEN12MS-CR.
Five datasets used in NeurTraL-AD paper: \textit{RacketSports (RS).} Accelerometer and gyroscope recording of players playing four different racket sports. Each sport is designated as a different class. \textit{Epilepsy (EPSY).} Accelerometer recording of healthy actors simulating four different activity classes, one of them being an epileptic shock. \textit{Naval air training and operating procedures standardization (NAT).} Positions of sensors mounted on different body parts of a person performing activities. There are six different activity classes in the dataset. \textit{Character trajectories (CT).} Velocity trajectories of a pen on a WACOM tablet. There are $20$ different characters in this dataset. \textit{Spoken Arabic Digits (SAD).} MFCC features of ten arabic digits spoken by $88$ different speakers.
UI-PRMD is a data set of movements related to common exercises performed by patients in physical therapy and rehabilitation programs. The data set consists of 10 rehabilitation exercises. A sample of 10 healthy individuals repeated each exercise 10 times in front of two sensory systems for motion capturing: a Vicon optical tracker, and a Kinect camera. The data is presented as positions and angles of the body joints in the skeletal models provided by the Vicon and Kinect mocap systems.
7 PAPERS • 2 BENCHMARKS
VISUELLE is a repository build upon the data of a real fast fashion company, Nunalie, and is composed of 5577 new products and about 45M sales related to fashion seasons from 2016-2019. Each product in VISUELLE is equipped with multimodal information: its image, textual metadata, sales after the first release date, and three related Google Trends describing category, color and fabric popularity.
We introduce a new dataset, Watch and Learn Time-lapse (WALT), consisting of multiple (4K and 1080p) cameras capturing urban environments over a year.
A multivariate spatio-temporal benchmark dataset for meteorological forecasting based on real-time observation data from ground weather stations.
7 PAPERS • 16 BENCHMARKS
Context There's a story behind every dataset and here's your opportunity to share yours.
7 PAPERS • 3 BENCHMARKS
Engine degradation simulation was carried out using C-MAPSS. Four different were sets simulated under different combinations of operational conditions and fault modes. Records several sensor channels to characterize fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.
6 PAPERS • 3 BENCHMARKS
Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services, ranging from demand response measures, tailored energy feedback and smart home/building automation. To design such services, train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings, that represent over 250,000 monitored appliance uses. The data is accessible in an eas
6 PAPERS • NO BENCHMARKS YET
WeatherBench 2 is an update to the global, medium-range (1–14 day) weather forecasting benchmark proposed by rasp_weatherbench_2020, designed with the aim to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework, publicly available training, ground truth and baseline data as well as a continuously updated website with the latest metrics and state-of-the-art models.
Nearly 10,000 km² of free high-resolution and paired multi-temporal low-resolution satellite imagery of unique locations which ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities.
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery for multi-modal autonomous driving applications. Compared to existing autonomous driving task datasets, DurLAR has the following novel features:
5 PAPERS • NO BENCHMARKS YET
Data Description The training data contains twelve-lead ECGs. The validation and test data contains twelve-lead, six-lead, four-lead, three-lead, and two-lead ECGs:
5 PAPERS • 2 BENCHMARKS
A large-scale multi-modal dataset to facilitate research and studies that concentrate on vision-wireless systems. The Vi-Fi dataset is a large-scale multi-modal dataset that consists of vision, wireless and smartphone motion sensor data of multiple participants and passer-by pedestrians in both indoor and outdoor scenarios. In Vi-Fi, vision modality includes RGB-D video from a mounted camera. Wireless modality comprises smartphone data from participants including WiFi FTM and IMU measurements.
AnoShift is a large-scale anomaly detection benchmark, which focuses on splitting the test data based on its temporal distance to the training set, introducing three testing splits: IID, NEAR, and FAR. This testing scenario proves to capture the in-time performance degradation of anomaly detection methods for classical to masked language models.
4 PAPERS • 1 BENCHMARK
ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation environments with the presence of environmental non-targeted variations, such as air turbidity and light condition changes, as well as targeted object changes in industrial indoor environments. By collecting data in simulations, multi-modal sensor data and precise ground truth labels are obtainable such as the RGB image, depth image, semantic segmentation, change segmentation, camera poses, and 3D reconstructions. While the previous online SCD datasets evaluate models given well-aligned image pairs, ChangeSim also provides raw unpaired sequences that present an opportunity to develop an online SCD model in an end-to-end manner, considering both pairing and detection. Experiments show that even the latest pair-based SCD models suffer from the bottleneck of the pairing process, and it gets worse when the environment contains the non-targeted variations.
4 PAPERS • 2 BENCHMARKS
IowaRain is a dataset of rainfall events for the state of Iowa (2016-2019) acquired from the National Weather Service Next Generation Weather Radar (NEXRAD) system and processed by a quantitative precipitation estimation system. The dataset presented in this study could be used for better disaster monitoring, response and recovery by paving the way for both predictive and prescriptive modeling
4 PAPERS • NO BENCHMARKS YET
The generation of data-driven prognostics models requires the availability of datasets with run-to-failure trajectories. In order to contribute to the development of these methods, the dataset provides a new realistic dataset of run-to-failure trajectories for a small fleet of aircraft engines under realistic flight conditions. The damage propagation modelling used for the generation of this synthetic dataset builds on the modeling strategy from previous work . The dataset was generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model. The data set is been provided by the Prognostics CoE at NASA Ames in collaboration with ETH Zurich and PARC.
The PRONOSTIA (also called FEMTO) bearing dataset consists of 17 accelerated run-to-failures on a small bearing test rig. Both acceleration and temperature data was collected for each experiment.
The dataset contains a collection of physiological signals (EEG, GSR, PPG) obtained from an experiment of the auditory attention on natural speech. Ethical Approval was acquired for the experiment. Details of the experiment can be found here https://phyaat.github.io/experiment
4 PAPERS • 4 BENCHMARKS
Data The data for this Challenge are from multiple sources: CPSC Database and CPSC-Extra Database INCART Database PTB and PTB-XL Database The Georgia 12-lead ECG Challenge (G12EC) Database Undisclosed Database The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018. The test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) of 12-ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.
The largest and most realistic dataset available for TCC. It consists of 600 real-world videos recorded with a high-resolution mobile phone camera shooting 1824 x 1368 sized pictures. The length of these videos ranges from 3 to 17 frames (7.3 on average, the median is 7.0 and mode is 8.5). Ground truth information is present only for the last frame in each video (i.e., the shot frame), and was collected using a gray surface calibration target.
This dataset contains simulations of a complex, large-scale chemical plant proposed by Downs and Vogel (1993). As described by Reinartz, Kulahci and Ravn (2021):
The data we use include 366 monthly series, 427 quarterly series and 518 yearly series. They were supplied by both tourism bodies (such as Tourism Australia, the Hong Kong Tourism Board and Tourism New Zealand) and various academics, who had used them in previous tourism forecasting studies (please refer to the acknowledgements and details of the data sources and availability).
This meta-dataset is composed of previously known datasets.