ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 230 million purchasable compounds in ready-to-dock, 3D formats. ZINC also contains over 750 million purchasable compounds that can be searched for analogs.
199 PAPERS • 5 BENCHMARKS
MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance.
176 PAPERS • 1 BENCHMARK
The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.
41 PAPERS • 5 BENCHMARKS
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
31 PAPERS • 3 BENCHMARKS
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense features" that represent chemical descriptors, such as molecular weight, solubility or surface area, and 272,776 "sparse features" that represent chemical substructures (ECFP10, DFS6, DFS8; stored in Matrix Market Format). Machine learning methods can either use sparse or dense data or combine them. For each sample there are 12 binary labels that represent the outcome (active/inactive) of 12 different toxicological experiments. Note that the label matrix contains many missing values (NAs). The original data source and Tox21 challenge site is https://tripod.nih.gov/tox21/challenge/.
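Because the Tox21 label matrix contains missing values, a multi-task loss is usually computed over the observed labels only. The following is a minimal sketch of that idea, not the challenge's official evaluation code: the function name `masked_bce`, the use of NaN or `None` to mark a missing label, and the clipping constant `eps` are all assumptions made for illustration.

```python
import math


def masked_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over observed labels only.

    probs  -- rows of predicted probabilities, one column per assay
    labels -- rows of 0/1 labels; NaN or None marks a missing value (assumed encoding)
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            # Skip entries with no measured outcome.
            if y is None or (isinstance(y, float) and math.isnan(y)):
                continue
            # Clip to avoid log(0) for overconfident predictions.
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    if count == 0:
        raise ValueError("no observed labels to average over")
    return total / count


# Toy example with 2 samples and 2 assays; one label is missing.
example_probs = [[0.5, 0.9], [0.5, 0.5]]
example_labels = [[1, math.nan], [0, 1]]
example_loss = masked_bce(example_probs, example_labels)
```

The missing entry contributes nothing to either the sum or the count, so a prediction for an unmeasured assay cannot affect the loss.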
25 PAPERS • 5 BENCHMARKS
PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. Based on the PubChemQC, we define a meaningful ML task of predicting DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs. The HOMO-LUMO gap is one of the most practically-relevant quantum chemical properties of molecules since it is related to reactivity, photoexcitation, and charge transport. Moreover, predicting the quantum chemical property only from 2D molecular graphs without their 3D equilibrium structures is also practically favorable. This is because obtaining 3D equilibrium structures requires DFT-based geometry optimization, which is expensive on its own.
15 PAPERS • 1 BENCHMARK
ESOL is a water solubility prediction dataset consisting of 1128 samples.
9 PAPERS • 3 BENCHMARKS
The $O_2$Perm dataset is created from the Membrane Society of Australasia portal. It uses monomers as polymer graphs to predict the property of oxygen permeability. Its limited size (595 polymers) makes property prediction challenging.
3 PAPERS • NO BENCHMARKS YET
Cylinder in Crossflow is a synthetic dataset of unsteady laminar flow past a cylinder, which generates a vortex shedding pattern known as a von Kármán vortex street. The governing equations for this system are the incompressible Navier-Stokes equations. The cylinder has a diameter of 1 and the free stream velocity is 1. The kinematic viscosity $\nu$ is varied such that the Reynolds number is between 100 and 400. Symmetry boundary conditions are applied at the top and bottom edges of the domain and an open pressure boundary condition is applied at the outlet. Solutions are generated on an unstructured mesh of 6384 quad elements.
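The relation between the stated Reynolds number range and the kinematic viscosity can be made concrete with a short sketch. It assumes the standard definition $Re = U D / \nu$ with the cylinder diameter as the characteristic length, which the description does not state explicitly; the function name `kinematic_viscosity` is hypothetical.

```python
def kinematic_viscosity(reynolds, velocity=1.0, diameter=1.0):
    """Solve Re = U * D / nu for nu (assumed definition of Re)."""
    if reynolds <= 0:
        raise ValueError("Reynolds number must be positive")
    return velocity * diameter / reynolds


# With U = 1 and D = 1, the viscosity is simply the reciprocal of Re.
nu_at_re_100 = kinematic_viscosity(100.0)
nu_at_re_400 = kinematic_viscosity(400.0)
```

Under that assumption, the Reynolds number range of 100 to 400 corresponds to $\nu$ between 0.0025 and 0.01.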
2 PAPERS • NO BENCHMARKS YET
The GlassTemp dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of glass transition temperature. A material's glass transition temperature denotes the temperature range over which the glass transition takes place.
2 PAPERS • 1 BENCHMARK
The MeltingTemp dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of polymer melting temperature.
DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes designed for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.
1 PAPER • NO BENCHMARKS YET
OCB contains two graph datasets, Ckt-Bench-101 and Ckt-Bench-301, for representation learning over analog circuits. Ckt-Bench-101 and Ckt-Bench-301 contain graphs (DAGs) that represent analog circuits and provide their corresponding graph-level properties: DC gain (Gain), bandwidth (BW), phase margin (PM), and Figure of Merit (FoM), which characterize the circuit performance.
This dataset covers standard tests of the Nafion 112 membrane and MEA activation tests of a PEM fuel cell under various operating conditions. It includes two common electrochemical analysis methods: polarization and impedance curves. The effects of different H2/O2 gas pressures, different voltages, and various humidity conditions are considered in several steps. The data allow conclusions about the behavior of the PEM fuel cell during distinct operating-condition tests, during the activation procedure, and under different operating conditions before and after activation. In the polarization curves, voltage and power density change as a function of H2/O2 flow and relative humidity. The resistances of the equivalent circuit used for the fuel cell can be calculated from the impedance data. The experimental response of the cell is thus evident in the data, which are useful for in-depth analysis, simulation, and material performance investigation in PEM fuel cell research.
The PolyDensity dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of polymer density.
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problems using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact
hERG is a large-scale biophysics federated molecular dataset related to cardiac toxicity. It consists of 10,572 compounds, with an average of 29.39 nodes and 94.09 edges in each graph.