MoleculeNet is a large-scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high-quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open-source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance.
176 PAPERS • 1 BENCHMARK
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
55 PAPERS • 6 BENCHMARKS
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense features" that represent chemical descriptors, such as molecular weight, solubility or surface area, and 272,776 "sparse features" that represent chemical substructures (ECFP10, DFS6, DFS8; stored in Matrix Market Format). Machine learning methods can use either the sparse or the dense data, or combine them. For each sample there are 12 binary labels that represent the outcome (active/inactive) of 12 different toxicological experiments. Note that the label matrix contains many missing values (NAs). The original data source and Tox21 challenge site is https://tripod.nih.gov/tox21/challenge/.
25 PAPERS • 5 BENCHMARKS
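Because the Tox21 label matrix described above contains missing values, per-task metrics are usually computed only over the samples that actually have a label for that task. The following is a minimal sketch of that masking idea, assuming missing labels are encoded as `None`; the function name `masked_accuracy` and the toy data are hypothetical and not part of the Tox21 release.

```python
def masked_accuracy(labels, predictions):
    """Per-task accuracy that skips missing labels (encoded as None).

    labels:      list of rows, one row per sample, one entry per task
    predictions: same shape, with a 0/1 prediction for every entry
    Returns one score per task, or None if a task has no labels at all.
    """
    n_tasks = len(labels[0])
    scores = []
    for task in range(n_tasks):
        # Keep only (label, prediction) pairs where the label is present.
        pairs = [
            (row[task], pred[task])
            for row, pred in zip(labels, predictions)
            if row[task] is not None
        ]
        if pairs:
            scores.append(sum(y == p for y, p in pairs) / len(pairs))
        else:
            scores.append(None)
    return scores


# Hypothetical toy data: 3 samples, 2 tasks, two missing labels.
toy_labels = [[1, None], [0, 1], [None, 0]]
toy_predictions = [[1, 0], [1, 1], [0, 0]]
toy_scores = masked_accuracy(toy_labels, toy_predictions)
```

The same masking applies to a training loss: entries without a label contribute nothing, so a compound measured in only a few of the 12 assays can still be used.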
The QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. In the QM7 subset, only molecules with up to 23 atoms are included. These atoms consist of carbon (C), nitrogen (N), oxygen (O), and sulfur (S). The total number of molecules in the QM7 dataset is 7165. Each molecule is represented using the Coulomb matrix, which captures the interactions between atoms.
20 PAPERS • 1 BENCHMARK
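The Coulomb matrix mentioned in the QM7 entry is commonly defined with diagonal entries 0.5 · Z_i^2.4 and off-diagonal entries Z_i · Z_j / |R_i − R_j|, where Z is the nuclear charge and R the atomic position. The sketch below illustrates that definition in plain Python. The function name, the optional zero-padding argument `size`, and the toy molecule are assumptions made for illustration, not the code used to build QM7.

```python
import math


def coulomb_matrix(charges, coords, size=None):
    """Build a Coulomb matrix as a list of lists.

    charges: nuclear charges Z_i, one per atom
    coords:  (x, y, z) positions, one per atom, in atomic units (bohr)
    size:    optional padded dimension (assumption: zero padding so that
             molecules of different sizes share one shape, e.g. 23 for QM7)
    """
    n = len(charges)
    dim = n if size is None else size
    if dim < n:
        raise ValueError("size must be at least the number of atoms")
    matrix = [[0.0] * dim for _ in range(dim)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: fit to the energy of the isolated atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical toy input: two hydrogen nuclei placed 1.4 bohr apart.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
h2_padded = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)], size=4)
```

The matrix depends only on interatomic distances, so it does not change when the molecule is translated or rotated, but it does depend on the order in which the atoms are listed.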
The BACE dataset focuses on inhibitors of human beta-secretase 1 (BACE-1). It includes both quantitative (IC50 values) and qualitative (binary labels) binding results. The dataset comprises small molecule inhibitors across a wide range of affinities, spanning three orders of magnitude (from nanomolar to micromolar IC50 values). Specifically, it provides 154 BACE inhibitors for affinity prediction, 20 for pose prediction, and 34 for free energy prediction.
17 PAPERS • 4 BENCHMARKS
The BBBP dataset comes from a study focused on modeling and predicting the permeability of the blood-brain barrier. The BBBP dataset contains binary labels indicating whether a compound can penetrate the blood-brain barrier (BBB) or not. Researchers use this dataset to develop and evaluate machine learning methods for predicting BBB permeability. It’s a critical task because understanding which compounds can cross the BBB is essential for drug discovery and designing therapeutics for neurological conditions.
17 PAPERS • 5 BENCHMARKS
SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information includes side effect frequency, drug and side effect classifications, and links to further information, such as drug–target relations.
16 PAPERS • 4 BENCHMARKS
The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM).
15 PAPERS • 5 BENCHMARKS
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status.
14 PAPERS • 3 BENCHMARKS
ToxCast is an initiative by the U.S. Environmental Protection Agency (EPA) aimed at predicting the potential toxicity of various chemical compounds. It involves high-throughput screening assays that evaluate thousands of chemicals across multiple biological endpoints. These endpoints cover a wide range of effects, including cell cycle disruptions, interactions with steroid receptors, and cytotoxicity.
12 PAPERS • 4 BENCHMARKS
Molecule3D is a new benchmark that includes a dataset with precise ground-state geometries of approximately 4 million molecules derived from density functional theory (DFT). It also provides a set of software tools for data processing, splitting, training, and evaluation.
11 PAPERS • 2 BENCHMARKS
ESOL is a water solubility prediction dataset consisting of 1128 samples.
9 PAPERS • 3 BENCHMARKS
The FreeSolv database offers a curated collection of experimental and calculated hydration free energies for small molecules in water. It includes both experimental values obtained from prior literature and calculated values based on simulations. The goal is to provide accurate hydration free energy data, which is essential for understanding solvation properties and interactions of molecules in aqueous environments.
9 PAPERS • 2 BENCHMARKS
The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a refined nearest-neighbor analysis. The MUV dataset is specifically designed for the validation of virtual screening techniques.
The QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited-state energies of small molecules. The QM8 dataset consists of 21,786 molecules. These molecules are a subset of the GDB-17 database, which enumerates 166 billion organic molecules. The subset includes molecules with up to 8 heavy atoms (C, N, O, and F).
6 PAPERS • 1 BENCHMARK
The PCBA dataset is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening (HTS) assays.
3 PAPERS • 2 BENCHMARKS
A benchmark for molecular machine learning where improvements in model performance can be immediately observed in the throughput of promising molecules synthesized in the lab. Photoswitches are a versatile class of molecule for medical and renewable energy applications where a molecule's efficacy is governed by its electronic transition wavelengths.
3 PAPERS • NO BENCHMARKS YET
This dataset is a multi-labelled SMILES odor dataset with 138 odor descriptors. This dataset was created for replicating the paper: A principal odor map unifies diverse tasks in olfactory perception.
1 PAPER • 1 BENCHMARK
A.2.1 AN OPEN, LARGE-SCALE DATASET FOR ZERO-SHOT DRUG DISCOVERY DERIVED FROM PUBCHEM We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database, and the largest collection of readily available chemical data. We take assays ranging from 2004 to 2018-05. It initially comprises 224,290,250 records of molecule-bioassay activity, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. We find that some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by applying majority voting. Molecule-bioassay pairs with ties are discarded. This step yields our final bioactivity dataset, which features 223,219,241 records of molecule-bioassay activity, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecules range up to CID 132472079. The dataset has 3 di
1 PAPER • NO BENCHMARKS YET
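The deduplication step described in the PubChem-derived entry above (one activity value per molecule-bioassay pair by majority vote, with tied pairs discarded) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the record layout `(molecule_id, assay_id, label)`, the label strings, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Collapse repeated measurements to one label per (molecule, assay) pair.

    records: iterable of (molecule_id, assay_id, label) tuples
    Returns a dict mapping (molecule_id, assay_id) to the majority label.
    Pairs whose top two labels are tied are dropped.
    """
    grouped = defaultdict(list)
    for molecule_id, assay_id, label in records:
        grouped[(molecule_id, assay_id)].append(label)

    resolved = {}
    for pair, labels in grouped.items():
        ranked = Counter(labels).most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        resolved[pair] = ranked[0][0]
    return resolved


# Hypothetical toy records: molecule 1 has a clear majority in assay 10,
# molecule 2 is tied in assay 10, molecule 3 has a single record in assay 11.
toy_records = [
    (1, 10, "active"),
    (1, 10, "active"),
    (1, 10, "inactive"),
    (2, 10, "active"),
    (2, 10, "inactive"),
    (3, 11, "inactive"),
]
toy_resolved = majority_vote(toy_records)
```

Discarding ties rather than breaking them arbitrarily removes some records and can remove whole molecules or assays, which is consistent with the slightly smaller counts reported after this step.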