The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students) which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
6,980 PAPERS • 52 BENCHMARKS
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format and the structure of training and testing splits with the original MNIST.
2,781 PAPERS • 17 BENCHMARKS
The JAFFE dataset consists of 213 images of different facial expressions from 10 different Japanese female subjects. Each subject was asked to do 7 facial expressions (6 basic facial expressions and neutral) and the images were annotated with average semantic ratings on each facial expression by 60 annotators.
87 PAPERS • 4 BENCHMARKS
The original ionosphere dataset from UCI machine learning repository is a binary classification dataset with dimensionality 34. There is one attribute having values all zeros, which is discarded. So the total number of dimensions are 33. The ‘bad’ class is considered as outliers class and the ‘good’ class as inliers.
62 PAPERS • 2 BENCHMARKS
pathbased is a 3-cluster data set. The data set consists of a circular cluster with an opening near the bottom and two Gaussian distributed clusters inside. Each cluster contains 100 data points.
24 PAPERS • 2 BENCHMARKS
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
21 PAPERS • 1 BENCHMARK
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
15 PAPERS • 7 BENCHMARKS
face image datasets
9 PAPERS • 2 BENCHMARKS
97 synthetic datasets consists of 97 datasets (as illustrated in the figure) and can be used to test graph-based clustering algorithms.
3 PAPERS • 1 BENCHMARK
This failure dataset contains information on the events collected in the OpenStack cloud computing platform during three different campaigns of fault-injection experiments performed with three different workloads.
2 PAPERS • NO BENCHMARKS YET
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.
2 PAPERS • 1 BENCHMARK
This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. https://archive.ics.uci.edu/ml/datasets/online+retail
The resources for this dataset can be found at https://www.openml.org/d/182