AstroBenchmarking

AstroHackWeek2020 Hack proposal #Astro-MNIST

ABOUT THE PROJECT

ASTRO + ML BEST PRACTICES

DATA

CODE

SIMULATIONS

DeepBench

Simulation library for very simple simulations to benchmark machine learning algorithms.

Access the repo here.

SkyPy

This package contains methods for modelling the Universe, galaxies and the Milky Way. Also included are methods for generating observed data.

Access the repo here.

Galaxy2Galaxy

This is a library of models, datasets, and utilities for building generative models of astronomical images. In addition to models based on Variational Auto-Encoders, Self-Attention GANs, PixelCNNs and Normalizing Flows, it also provides packages that can be used to generate useful datasets:

SIMULATED DATASETS

Bologna Lens Factory

The BLF collects simulated gravitational lenses of different kinds and from different projects and makes them available to the community for any use. Datasets come in different formats (tables, maps, images), and some projects consist of simulated observations of gravitational lensing systems that mimic the observing capabilities of existing or future facilities. The available datasets target lenses on a broad range of scales: lensing by galaxies, by galaxy clusters, and by the large-scale structure of the universe. You can access these datasets here.

Illustris Simulations

“The IllustrisTNG project is an ongoing series of large, cosmological magnetohydrodynamical simulations of galaxy formation. TNG aims to illuminate the physical processes that drive galaxy formation: to understand when and how galaxies evolve into the structures that are observed in the night sky, and to make predictions for current and future observational programs. The simulations use a state of the art numerical code which includes a comprehensive physical model and runs on some of the largest supercomputers in the world”.

MAGIC Gamma Telescope Dataset

This dataset, available on the UCI Machine Learning Repository, consists of 19020 Monte Carlo-generated instances of simulated registration of high energy gamma particles in a ground-based atmospheric Cherenkov telescope using the imaging technique.

SKA Science Data Challenge #1 Dataset

The data consists of 9 simulated SKA continuum images in FITS format in total intensity of the same field at 3 frequencies and 3 telescope integrations. It can be found here.

Observing Dark Worlds

Training data for Kaggle’s (now closed) challenge “Observing Dark Worlds” can be found here. The challenge involved predicting the center of each dark matter halo in 120 simulated test skies, with the training set being 300 files of simulated skies containing 300-740 galaxies each.

Nyx Cosmological Simulation Data

A sample dataset based on the Lawrence Berkeley National Laboratory’s compressible cosmological hydrodynamics simulation code Nyx is available here.

Microlensing Data Challenge

These data are associated with the Microlensing Data challenge (about which more information can be found here) and consist of light curves simulated as those expected from the WFIRST survey.

Quijote Simulations

“The Quijote simulations are a set of 43100 full N-body simulations. They are designed for two main tasks: (1) Quantify the information content on cosmological observables; (2) Provide enough statistics to train machine learning algorithms. But they can be used for a large variety of problems.”

REAL DATASETS

AstroML

AstroML contains various datasets, which are briefly reviewed below. It also provides routines for downloading and working with astronomical datasets. For more details, see the AstroML documentation.

Sloan Digital Sky Survey (SDSS) Data

The survey obtained photometry for hundreds of millions of stars, quasars, and galaxies, and spectra for several million of these objects. In addition, the second phase of the survey performed repeated imaging over a small portion of the sky, called Stripe 82, enabling the study of the time-variation of many objects.

SDSS photometric data are observed through five filters, u, g, r, i, and z. A visualization of the range of these filters is shown below:

There are several other available datasets to choose from: SDSS corrected spectra, the SDSS spectroscopic sample, and the SDSS DR7 Quasar Catalog; data from other surveys, such as the NASA-Sloan Atlas and Stripe 82 Standards + 2MASS; time-domain data, such as RR Lyrae or LIGO data; and the WMAP temperature map.

For more, please see here.
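The five SDSS filters mentioned above are often combined into colors, that is, magnitude differences between adjacent bands, which are common input features for machine learning models. The following is a minimal sketch of that calculation. The magnitudes are made-up illustrative values, not SDSS measurements, and the helper name `adjacent_colors` is hypothetical.

```python
# The five SDSS photometric bands, ordered from blue to red.
BANDS = ("u", "g", "r", "i", "z")


def adjacent_colors(mags):
    """Return the colors between adjacent bands, e.g. {'u-g': ..., 'g-r': ...}.

    `mags` maps each band name to a magnitude. A color is the bluer-band
    magnitude minus the redder-band magnitude.
    """
    return {
        f"{blue}-{red}": mags[blue] - mags[red]
        for blue, red in zip(BANDS[:-1], BANDS[1:])
    }


# Hypothetical magnitudes for a single object (illustrative values only).
example = {"u": 19.5, "g": 18.25, "r": 17.75, "i": 17.5, "z": 17.25}

colors = adjacent_colors(example)
for name, value in colors.items():
    print(f"{name}: {value:+.2f}")
```

Colors depend less on an object's overall brightness than raw magnitudes do, which is one reason they are often preferred as features.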

Example notebooks

AstroML also contains many example notebooks that can help the user extract and use the available datasets:

KiDS Data Release 3 Quasar Catalog

The catalog, prepared by Nakoneczny et al., includes around 190000 quasar candidates. These candidates were identified by applying a random forest classifier to a cleaned KiDS inference dataset.

Datasets for GalSim

A dataset of real galaxies extracted from the HST COSMOS survey and compiled by Rachel Mandelbaum et al. for use with GalSim to produce realistic simulations can be found in the associated Zenodo repo.

HTRU2 Dataset

The HTRU2 dataset, available on the UCI Machine Learning Repository, describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). The dataset contains 17898 instances (1639 of which are pulsars) and 9 attributes (mean of the integrated profile, mean of the DM-SNR curve, etc.).
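The counts quoted above imply a strong class imbalance, which matters when HTRU2 is used as a benchmark. The sketch below uses only the two numbers given in the text to work out the pulsar fraction, the accuracy of a trivial classifier that always predicts "not a pulsar", and inverse-frequency class weights. The weighting formula (total / (number of classes × class count)) is one common heuristic, assumed here for illustration rather than prescribed by the dataset.

```python
# Counts quoted in the dataset description above.
n_total = 17898
n_pulsars = 1639
n_non_pulsars = n_total - n_pulsars

# Fraction of positive (pulsar) examples.
pulsar_fraction = n_pulsars / n_total

# Accuracy of a classifier that always predicts the majority class.
majority_baseline_accuracy = n_non_pulsars / n_total

# Inverse-frequency class weights (a common heuristic, assumed here).
n_classes = 2
class_weights = {
    "pulsar": n_total / (n_classes * n_pulsars),
    "non_pulsar": n_total / (n_classes * n_non_pulsars),
}

print(f"pulsar fraction:            {pulsar_fraction:.4f}")
print(f"majority-baseline accuracy: {majority_baseline_accuracy:.4f}")
print(f"class weights:              {class_weights}")
```

Because always predicting the majority class already scores above 90% accuracy, metrics such as precision, recall, or F1 on the pulsar class are usually more informative than accuracy for this dataset.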

Solar Flare Dataset

The Solar Flare dataset, also available on the UCI Machine Learning Repository, consists of 1389 instances, each of which captures features of one active region on the Sun. The provided attributes can be used to predict the number of flares of each of 3 classes (C, M, and X) within a 24-hour period.

PLAsTiCC Challenge Dataset

This data was originally used for the PLAsTiCC challenge on Kaggle in order to identify promising methods to classify variable and transient light curves. It has now been unblinded and is available on a Zenodo repo.

NOAO Survey Program Archives

The NOAO survey archives provide access to data from multiple surveys focusing broadly on the deep “blank” sky, nearby galaxies/clusters of galaxies and stellar populations in local group galaxies.

Open Exoplanet Catalog

A database of all discovered extrasolar planets. The codebook and steps to access the data are available on the associated GitHub repo.

Penn State University Astrostatistics Data

This webpage hosts several prepackaged astronomical datasets (including univariate, multivariate, images, model selection, spectra, etc.) for statistical analysis. Two example datasets on the site are the Shapley galaxy redshift catalog and SDSS quasar catalog.

Galaxy Zoo Data

Galaxy Zoo was a citizen science project with the purpose of classifying the morphologies of ~1 million galaxies imaged by the SDSS. The classification data, along with measurements of bulge size, presence of bars, and the structure of spiral arms, are now available here on Kaggle.

Optical Gravitational Lensing Experiment (OGLE) data

OGLE is an observing project that started in 1992. It has the longest ground-based observational dataset of the Southern sky, specifically of the Magellanic Clouds. OGLE provides PNG images of the light curves of periodic variable stars. If a pulsating star has more than one pulsation period, folded light curves are given for each period. Additional data are also available for each star: the pulsation period(s) and the Fourier parameters from the light curve decomposition. The light curves in the PNG images are in the I filter. The catalogs differ somewhat from one release to the next. Here, we list the OGLE Collection of Variable Stars and OGLE-III.
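The folded light curves mentioned above are obtained by mapping each observation time onto a phase between 0 and 1 using the pulsation period. The following is a minimal sketch of phase folding on a synthetic sinusoidal light curve. The period, observation times, and magnitudes are hypothetical values chosen for illustration, and the helper name `fold` is not part of any OGLE tool.

```python
import math


def fold(times, period, t0=0.0):
    """Return the phase in [0, 1) of each observation time.

    `period` and `t0` (reference epoch) must be in the same units as `times`.
    """
    return [((t - t0) / period) % 1.0 for t in times]


# Synthetic example: a sinusoidal variable with a hypothetical 0.5-day period.
period = 0.5  # days
times = [0.0, 0.125, 0.75, 1.625, 2.25]  # days, irregularly sampled
mags = [15.0 + 0.3 * math.sin(2 * math.pi * t / period) for t in times]

phases = fold(times, period)

# Sorting by phase gives the folded light curve, ready for plotting.
folded = sorted(zip(phases, mags))
for phase, mag in folded:
    print(f"phase={phase:.3f}  mag={mag:.3f}")
```

For a star with several pulsation periods, the same `fold` call can be repeated once per period to produce one folded light curve each.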

OGLE Collection of Variable Stars (OCVS)

OGLE-III Data

Minor Planet Center (MPC) Data

The MPC site hosts a large volume of data regarding the orbits of thousands of small bodies in the Solar System. You can access data about the orbits of all the asteroids or comets in the MPC database, along with much more!

Photo-z catalogs

The photometric redshift catalogs presented in Beck et al. 2017, which can be used to probe color coverage and photometric errors, can be accessed at this GitHub repo. They are one of the products of the third edition of the COIN Residence Program.

LEARNING AND USEFUL READS

WANT TO CONTRIBUTE?