TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset that aims to stimulate progress in QA research over more complex and realistic tabular and textual data, especially data requiring numerical reasoning.
49 PAPERS • 1 BENCHMARK
GitTables is a corpus of currently 1M relational tables extracted from CSV files in GitHub covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.
13 PAPERS • NO BENCHMARKS YET
The SIND dataset is based on 4K video captured by drones, providing information including traffic participant trajectories, traffic light status, and high-definition maps.
10 PAPERS • NO BENCHMARKS YET
This dataset is a collection of labelled PCAP files, both encrypted and unencrypted, across 10 applications, as well as a pandas dataframe in HDF5 format containing detailed metadata summarizing the connections from those files. It was created to assist the development of machine learning tools that would allow operators to see the traffic categories of both encrypted and unencrypted traffic flows. In particular, features of the network packet traffic timing and size information (both inside of and outside of the VPN) can be leveraged to predict the application category that generated the traffic.
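As a rough sketch of how the HDF5 metadata might be explored with pandas (the file name, HDF5 key, and column names below are assumptions, not part of the dataset's documented schema):

```python
# Sketch: explore the connection-metadata dataframe with pandas.
# The file name and column names are assumptions for illustration.
import pandas as pd

df = pd.read_hdf("vpn_metadata.h5")  # pass key= explicitly if the store holds several objects

# Inspect available columns, then summarize flows per (assumed) application label
print(df.columns.tolist())
print(df["application"].value_counts())  # hypothetical label column

# Example: keep only VPN-tunnelled flows (hypothetical boolean column)
vpn_flows = df[df["is_vpn"]]
print(vpn_flows[["duration", "total_bytes"]].describe())  # hypothetical timing/size features
```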
4 PAPERS • NO BENCHMARKS YET
The eICU Collaborative Research Database is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology.
The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for e-commerce products.
3 PAPERS • NO BENCHMARKS YET
SKAB is designed for evaluating anomaly detection algorithms. The benchmark currently includes 30+ datasets plus Python modules for evaluating algorithms. Each dataset is a multivariate time series collected from the sensors installed on the testbed. All instances are labeled, supporting evaluation of both outlier detection and changepoint detection.
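A minimal sketch of how one SKAB-style file could be loaded and scored with a naive detector; the file path, separator, and column names are assumptions:

```python
# Sketch: score a simple threshold detector on one SKAB-style file.
# The file path, separator, and column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("valve1/0.csv", sep=";", parse_dates=["datetime"], index_col="datetime")

labels = df["anomaly"]                      # assumed ground-truth outlier labels (0/1)
features = df.drop(columns=["anomaly", "changepoint"], errors="ignore")

# Naive detector: flag points far from the per-sensor rolling mean
z = (features - features.rolling(60, min_periods=1).mean()).abs() / features.std()
pred = (z.max(axis=1) > 3).astype(int)

tp = ((pred == 1) & (labels == 1)).sum()
print("precision:", tp / max(pred.sum(), 1), "recall:", tp / max(labels.sum(), 1))
```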
3 PAPERS • 2 BENCHMARKS
GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown formats. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, 50,032 of which support IRTs.
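A minimal sketch of how a YAML IRT could be parsed; the file path and field names follow GitHub's issue-form conventions and are assumptions here:

```python
# Sketch: parse a YAML issue report template (GitHub issue form).
# The path and the exact fields present are assumptions for illustration.
import yaml  # PyYAML

with open(".github/ISSUE_TEMPLATE/bug_report.yml") as f:
    template = yaml.safe_load(f)

print(template.get("name"), "-", template.get("description"))
for element in template.get("body", []):
    # Each form element has a type (markdown, textarea, input, dropdown, ...) and attributes
    print(element.get("type"), element.get("attributes", {}).get("label"))
```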
2 PAPERS • NO BENCHMARKS YET
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
The original dataset was provided by Orange telecom in France and contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset comprises observations from 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with a time granularity of 30 minutes and a spatial granularity of 6 coarse regions in Paris, France. In other words, it is a multivariate time series dataset.
The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption. Tables are annotated with types (leaderboard, ablation, irrelevant) and cells of relevant tables are annotated with semantic roles (such as “paper model”, “competing model”, “dataset”, “metric”).
1 PAPER • NO BENCHMARKS YET
The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs, and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.
This dataset includes Direct Borohydride Fuel Cell (DBFC) impedance and polarization tests on anodes with Pd/C, Pt/C, and Pd-decorated Ni–Co/rGO catalysts. The data cover different concentrations of sodium borohydride (SBH), applied voltages, and anode catalyst loadings, together with experimental details of the electrochemical analysis. Voltage, power density, and cell resistance change as a function of SBH weight percent (%), applied voltage, and anode catalyst loading, and are evaluated from polarization and impedance curves using an appropriate equivalent circuit of the fuel cell. The data thus capture the electrochemical behavior of the cell, which is useful for simulation, power-source investigation, and in-depth analysis in DBFC research.
The dataset was generated from a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
The dataset contains two Pareto fronts:
- the Pareto front for the 2-objective problem
- the Pareto front for the 3-objective problem
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.
MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.
We present a comprehensive dataset comprising a vast collection of raw mineral samples for the purpose of mineral recognition. The dataset encompasses more than 5,000 distinct mineral species and incorporates subsets for zero-shot and few-shot learning. In addition to the samples themselves, some entries in the dataset are accompanied by supplementary natural language descriptions, size measurements, and segmentation masks. For detailed information on each sample, please refer to the minerals_full.csv file.
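As an illustration, the per-sample metadata could be inspected as below; the column names used for species and subset membership are assumptions:

```python
# Sketch: inspect the per-sample metadata shipped with the dataset.
# Column names ("species", "subset") are assumptions for illustration.
import pandas as pd

meta = pd.read_csv("minerals_full.csv")
print(meta.shape, meta.columns.tolist())

# How many distinct mineral species, and how large are the zero-/few-shot subsets?
print(meta["species"].nunique())
print(meta["subset"].value_counts())
```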
US Macroeconomic dataset containing 14 monthly time series. They have various lengths but all end in 1988. The variables are: consumer price index, industrial production, nominal GNP, velocity, employment, interest rate, nominal wages, GNP deflator, money stock, real GNP, stock prices (S&P 500), GNP per capita, real wages, and unemployment.
This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate these datasets and analyse them are shared in the Github Repository for this work.
This dataset covers Nafion 112 membrane standard tests and MEA activation tests of a PEM fuel cell under various operating conditions. It includes two general electrochemical analysis methods, polarization and impedance curves. The effects of different H2/O2 gas pressures, different voltages, and various humidity conditions are considered in several steps. The behavior of the PEM fuel cell during distinct operating-condition tests, the activation procedure, and the operating conditions before and after activation can be deduced from the data. In the polarization curves, voltage and power density change as a function of H2/O2 flows and relative humidity. The resistance of the fuel cell's equivalent circuit can be calculated from the impedance data. The experimental response of the cell is therefore evident in the presented data, which is useful for in-depth analysis, simulation, and material-performance investigation in PEM fuel cell research.
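One common first step with such impedance data is to read the ohmic resistance off the high-frequency intercept of the Nyquist curve. The sketch below assumes a spectrum stored with frequency, real, and imaginary columns (file and column names are hypothetical) and is not the full equivalent-circuit fit described above:

```python
# Sketch: estimate the ohmic resistance as the high-frequency intercept of the Nyquist plot.
# File name and column names are hypothetical; a full analysis would fit the equivalent circuit.
import numpy as np
import pandas as pd

eis = pd.read_csv("impedance_100rh_0p6V.csv")   # hypothetical spectrum: freq, Z_re, Z_im
eis = eis.sort_values("freq", ascending=False)

z_re = eis["Z_re"].to_numpy()
z_im = eis["Z_im"].to_numpy()

# Ohmic resistance ~ Re(Z) where Im(Z) crosses zero at high frequency
idx = np.argmin(np.abs(z_im[: len(z_im) // 2]))  # search the high-frequency half
print("estimated ohmic resistance:", z_re[idx])
```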
This repository contains a dataset and machine learning algorithms to distinguish poisoned water from clean water using smartphone-embedded Wi-Fi CSI (channel state information) data.
The dataset provides information about 450 HYIPs (high-yield investment programs) collected between November 2020 and September 2021. The dataset was analyzed and the results are discussed in the paper.
Our SRSD (Feynman) datasets are designed to discuss the performance of Symbolic Regression for Scientific Discovery. We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used to evaluate the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets.
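To illustrate what a "reasonably realistic sampling range" looks like, the sketch below samples one well-known physics formula (Newtonian gravitation) over log-uniform, physically plausible ranges; the formula choice and the ranges are illustrative assumptions, not the datasets' actual specification:

```python
# Sketch: generate samples for one physics formula over log-uniform, physically plausible ranges.
# The chosen formula and ranges are illustrative; the actual SRSD ranges differ per equation.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

G = 6.674e-11                      # gravitational constant
m1 = 10 ** rng.uniform(20, 30, n)  # kg, planet-to-star scale (assumed range)
m2 = 10 ** rng.uniform(20, 30, n)
r = 10 ** rng.uniform(8, 12, n)    # m, orbital-distance scale (assumed range)

F = G * m1 * m2 / r**2             # target variable for symbolic regression
table = np.column_stack([m1, m2, r, F])
np.savetxt("srsd_like_gravity.txt", table)
```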
The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents
1 PAPER • 1 BENCHMARK
It contains data from two different sources: Food.com, a well-known American recipe site, and Planeat, an Italian site that lets users plan recipes to reduce food waste. The dataset is divided into two parts: embeddings, which can be used directly to reproduce the work and obtain suggestions, and raw data, which must first be processed into embeddings.
The buildingSMART Data Dictionary (bSDD) is an online service that hosts classifications and their properties, allowed values, units and translations. The bSDD allows linking between all the content inside the database. It provides a standardized workflow to guarantee data quality and information consistency.
This file contains the data and code for the publication "The Federal Reserve's Response to the Global Financial Crisis and Its Long-Term Impact: An Interrupted Time-Series Natural Experimental Analysis" by A. C. Kamkoum, 2023.
It is a Kaggle competition on stroke prediction, with a heavily imbalanced dataset.
The dataset contains about 48K contracts whose source code is openly available on Etherscan.
Here I provide the datasets I used for this analysis. They include the tweets I streamed using the Tweepy package in Python during the peak of the wildfire season in late summer/early fall of 2020.
Data Set Name: Rice Dataset (Cammeo and Osmancik) Abstract: A total of 3810 images of rice grains were taken for the two species (Cammeo and Osmancik), the images were processed, and feature inferences were made. 7 morphological features were obtained for each grain of rice.
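A hedged sketch of a baseline classifier on the seven morphological features; the CSV path and column names (including "Class" as the label) are assumptions:

```python
# Sketch: baseline classification of the two rice varieties from morphological features.
# The CSV path and column names ("Class" as the label) are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("rice_cammeo_osmancik.csv")
X = df.drop(columns=["Class"])      # the 7 morphological features
y = df["Class"]                     # "Cammeo" or "Osmancik"

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```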
0 PAPER • NO BENCHMARKS YET
The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks involve diverse atomic actions related to six task categories (i.e. Entry and manipulation, Formatting, Management, Charts, Pivot Table, and Formula).
0 PAPER • 1 BENCHMARK