JetClass is a new large-scale dataset to facilitate deep learning research in particle physics. It consists of 100M particle jets for training, 5M for validation and 20M for testing. The dataset contains 10 classes of jets, simulated with MadGraph + Pythia + Delphes. A detailed description of the JetClass dataset is presented in the paper Particle Transformer for Jet Tagging. An interface to use the dataset is provided here.
9 PAPERS • 1 BENCHMARK
JetNet is a particle cloud dataset containing gluon, top quark, and light quark jets saved in .csv format.
9 PAPERS • NO BENCHMARKS YET
Dataset of 50,000 top quark-antiquark (ttbar) events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to a pileup of 200 on average. The dataset consists of detector hits as the input, generator particles as the ground truth and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged particle tracker, an electromagnetic and hadron calorimeter. Pythia8 and Delphes3 were used for the simulation.
7 PAPERS • NO BENCHMARKS YET
CMD is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data were generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.
6 PAPERS • NO BENCHMARKS YET
The dataset consists of many runs of the same quantum circuit on different IBM quantum machines. We used 9 different machines, and on each of them we ran 2000 executions of the circuit. The circuit contains 9 different measurement steps. To obtain the 9 outcome distributions, for each execution, parts of the circuit are appended 9 times (in the same call to the IBM API, and thus in the shortest possible time), measuring a new step each time. The calls to the IBM API followed two different strategies. The first was adopted to maximize the number of calls to the interface: the code was parallelized with as many runs as possible, and each run even used 8000 shots, which were treated as 8 groups of 1000 shots taken from the memory to compute the probabilities. The second strategy was slower, without parallelization and with a minimum waiting time between subsequent executions; it was adopted to obtain executions that are more uniformly distributed in time.
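The 8 x 1000 grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the per-shot memory is modelled as a plain list of outcome strings, the function name `batch_distributions` is hypothetical, and the memory here is filled with synthetic random outcomes rather than real device data.

```python
import random
from collections import Counter


def batch_distributions(memory, batch_size=1000):
    """Split per-shot outcomes into consecutive batches and return one
    empirical probability distribution (outcome -> frequency) per batch."""
    if len(memory) % batch_size != 0:
        raise ValueError("memory length must be a multiple of batch_size")
    distributions = []
    for start in range(0, len(memory), batch_size):
        batch = memory[start:start + batch_size]
        counts = Counter(batch)
        distributions.append(
            {outcome: n / batch_size for outcome, n in counts.items()}
        )
    return distributions


# Synthetic stand-in for the per-shot memory of one 8000-shot run
# (assumption: outcomes are bitstrings; real data would come from the device).
rng = random.Random(0)
memory = [rng.choice(["00", "01", "10", "11"]) for _ in range(8000)]

dists = batch_distributions(memory)
print(len(dists))  # one distribution per group of 1000 shots
```

Each of the resulting distributions can then be handled like the outcome of a separate 1000-shot execution.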
Dataset of high-pT jets from simulations of LHC proton-proton collisions
5 PAPERS • NO BENCHMARKS YET
Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.
4 PAPERS • NO BENCHMARKS YET
This dataset is the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms.
3 PAPERS • NO BENCHMARKS YET
This dataset contains the data for the paper 'Using Multiple Instance Learning for Explainable Solar Flare Prediction'.
2 PAPERS • NO BENCHMARKS YET
Multirotor gym environment for learning control policies for various unmanned aerial vehicles.
Contains data of parametric PDEs
PDEBench provides a diverse and comprehensive set of benchmarks for scientific machine learning, including challenging and realistic physical problems. The repository consists of the code used to generate the datasets, to upload and download the datasets from the data repository, and to train and evaluate different machine learning models as baselines. PDEBench features a much wider range of PDEs than existing benchmarks and includes realistic and difficult problems (both forward and inverse) as well as larger ready-to-use datasets comprising various initial and boundary conditions and PDE parameters. Moreover, PDEBench was created to make the source code extensible, and we invite active participation to improve and extend the benchmark.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established. This is a dataset accompanying the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
2 PAPERS • 1 BENCHMARK
Dataset of low-fidelity solutions of the RANS equations over airfoils.
1 PAPER • NO BENCHMARKS YET
This dataset provides neutron and gamma-ray pulse signals for pulse shape discrimination experiments. Several traditional and recently proposed pulse shape discrimination algorithms are used to perform pulse shape discrimination on the raw pulse signals and on noise-enhanced datasets. These algorithms include zero-crossing (ZC), charge comparison (CC), falling edge percentage slope (FEPS), frequency gradient analysis (FGA), pulse-coupled neural network (PCNN), ladder gradient (LG), and heterogeneous quasi-continuous spiking cortical model (HQC-SCM). This dataset also provides the source code of all these pulse shape discrimination methods, together with the source code of schematic pulse shape discrimination performance evaluation and anti-noise performance evaluation.
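As an illustration of one of the listed methods, charge comparison can be sketched as the ratio of the charge in the pulse tail to the total charge. The code below is a minimal sketch and not the dataset's own implementation: the function names, the two-exponential pulse model, the decay constants, and the tail offset are all assumptions chosen for the example.

```python
import math


def charge_comparison(pulse, tail_offset):
    """Return the tail-to-total charge ratio of a baseline-subtracted pulse.

    The tail starts `tail_offset` samples after the pulse maximum
    (assumption: the total is integrated over the whole pulse).
    """
    peak = max(range(len(pulse)), key=lambda i: pulse[i])
    total = sum(pulse)
    tail = sum(pulse[peak + tail_offset:])
    return tail / total


def synthetic_pulse(slow_fraction, n_samples=200, tau_fast=5.0, tau_slow=50.0):
    """Toy scintillation pulse: a mix of a fast and a slow exponential decay."""
    return [
        (1.0 - slow_fraction) * math.exp(-t / tau_fast)
        + slow_fraction * math.exp(-t / tau_slow)
        for t in range(n_samples)
    ]


# Assumption for the toy model: neutron pulses carry a larger slow component
# than gamma-ray pulses, so their tail-to-total ratio is larger.
gamma_ratio = charge_comparison(synthetic_pulse(slow_fraction=0.1), tail_offset=20)
neutron_ratio = charge_comparison(synthetic_pulse(slow_fraction=0.3), tail_offset=20)
print(gamma_ratio, neutron_ratio)
```

A threshold on this ratio then separates the two pulse types; on real signals the tail offset and threshold would have to be tuned to the detector.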
DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes designed for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.
Neural network model files and Madgraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences", arXiv:2108.11481; 2022 Mach. Learn.: Sci. Technol. 3 015021. Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models
This dataset contains synthetic particle decays simulated using the PhaseSpace library. All simulated decay topologies have a common root particle of mass 100 (arbitrary units). Intermediate particles are selected at random with replacement from the following masses: [90, 80, 70, 50, 25, 20, 10]. Final state particles, which make up the leaf nodes of generated topologies, are drawn with replacement from the following masses: [1, 2, 3, 5, 12]. For each intermediate particle (including the root), we limit the minimum number of children to two and the maximum to five. The dataset contains the resulting simulated particle physics decays, with information about the detected particles (leaves) to be used as input, and Lowest Common Ancestor Generations (LCAGs) to be used as training targets.
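To make the training target concrete, the sketch below computes a matrix of lowest-common-ancestor generations for a small hand-written decay tree. It rests on assumptions that may differ from the dataset's conventions: leaves are given generation 0, a parent's generation is one more than the largest generation among its children, the diagonal is set to 0, and the tree is encoded as nested Python lists that record only leaf masses. The function name `lcag_matrix` is hypothetical.

```python
from itertools import combinations


def lcag_matrix(tree):
    """Return (leaves, matrix) for a tree given as nested lists.

    A leaf is any non-list value (here: its mass). matrix[i][j] is the
    generation of the lowest common ancestor of leaves i and j.
    """
    leaves = []  # leaf values in left-to-right order
    pairs = {}   # (i, j) -> generation of the lowest common ancestor

    def visit(node):
        # Returns (generation of node, indices of the leaves below it).
        if not isinstance(node, list):
            leaves.append(node)
            return 0, [len(leaves) - 1]
        results = [visit(child) for child in node]
        generation = 1 + max(g for g, _ in results)
        # Leaves under different children first meet at this node.
        for (_, left), (_, right) in combinations(results, 2):
            for i in left:
                for j in right:
                    pairs[(i, j)] = pairs[(j, i)] = generation
        return generation, [i for _, idx in results for i in idx]

    visit(tree)
    n = len(leaves)
    matrix = [[pairs.get((i, j), 0) for j in range(n)] for i in range(n)]
    return leaves, matrix


# Root with three children: an intermediate decaying to leaves (1, 2),
# a leaf of mass 5, and an intermediate with a nested intermediate (3, 12)
# plus a leaf of mass 1.
leaves, matrix = lcag_matrix([[1, 2], 5, [[3, 12], 1]])
for row in matrix:
    print(row)
```

The matrix is symmetric, and only the leaf information would be visible to a model at inference time.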
This paper details the design of an autonomous alignment and tracking platform to mechanically steer directional horn antennas in a sliding correlator channel sounder setup for 28 GHz V2X propagation modeling. A pan-and-tilt subsystem facilitates uninhibited rotational mobility along the yaw and pitch axes, driven by open-loop servo units and orchestrated via inertial motion controllers. A geo-positioning subsystem augmented in accuracy by real-time kinematics enables navigation events to be shared between a transmitter and receiver over an Apache Kafka messaging middleware framework with fault tolerance. Herein, our system demonstrates a 3D geo-positioning accuracy of 17 cm, an average principal axes positioning accuracy of 1.1 degrees, and an average tracking response time of 27.8 ms. Crucially, fully autonomous antenna alignment and tracking facilitates continuous series of measurements, a unique yet critical necessity for millimeter wave channel modeling in vehicular networks. The
We present a structured benchmark dataset for a representative vibroacoustic problem: predicting the frequency response of vibrating plates. The vibrating plates benchmark dataset consists of 12,000 varied plate designs in total, together with the accompanying vibration patterns that arise when the plates are excited by a harmonic force. These vibration patterns give the vibration velocity orthogonal to the plate surface at every location of the plate. The plate designs incorporate randomly placed beadings, which are indentations in the plate surface. The beadings stiffen the plates and completely change the resulting vibration patterns. Additionally, the size, thickness, and damping loss factor of the plates are varied.
pd4ml is a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron- and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics.
Laser powder bed fusion (LPBF) is an additive manufacturing (3D printing) process for metals. RAISE-LPBF is a large dataset on the effect of laser power and laser dot speed in 316L stainless steel bulk material. Both process parameters are independently sampled for each scan line from a continuous distribution, so interactions of different parameter choices can be investigated. Process monitoring comprises on-axis high-speed (20k FPS) video. The data can be used to derive statistical properties of LPBF, as well as to build anomaly detectors.
0 PAPER • NO BENCHMARKS YET
SynD is a synthetic energy dataset with a focus on residential buildings. This dataset is the result of a custom simulation process that relies on power traces of household appliances. The output of the simulations is the power consumption of 21 household appliances as well as the household-wide consumption (i.e. mains). Therefore, SynD can be used for Non-Intrusive Load Monitoring, also referred to as Energy Disaggregation.