AstroHackWeek2020 Hack proposal #Astro-MNIST
Data Acquisition
It would be useful to other researchers looking into your work if you also included SQL code or any other code you used to query the data.
Data Preprocessing
It may also be useful to provide images or tables showing the head of the data at each step of the preprocessing.
Data Description
Perform exploratory analysis! Include visualizations from this analysis. This can be included in the repo page as additional content for readers.
Datasets in the Context of ML
Mention the programming language, OS type and version, and version of software packages used.
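One way to collect this information is sketched below, using only the Python standard library. The function name `environment_report` and the package list are illustrative assumptions, not part of any existing tool; replace the list with the packages your dataset code actually depends on.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return a dict with the Python version, OS details, and package versions."""
    report = {
        "python": platform.python_version(),
        "os": platform.system(),
        "os_version": platform.release(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report[name] = "not installed"
    return report


# Hypothetical package list, chosen only for illustration.
for key, value in environment_report(["numpy", "astropy", "scikit-learn"]).items():
    print(f"{key}: {value}")
```

The printed lines can be pasted into the dataset description so that readers can reproduce the software environment.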
Neural networks
Simulation library for very simple simulations to benchmark machine learning algorithms.
Access the repo here.
This package contains methods for modelling the Universe, galaxies and the Milky Way. Also included are methods for generating observed data.
Access the repo here.
This is a library of models, datasets, and utilities for building generative models of astronomical images. Alongside models based on Variational Auto-Encoders, Self-Attention GANs, PixelCNNs and Normalizing Flows, it also has packages that can be used to generate useful datasets:
Framework for building image datasets using GalSim, a framework for simulating astronomical objects like stars or galaxies (read more about GalSim here).
Tools for building an image dataset from HSC Public data release.
The BLF collects simulated gravitational lenses of different kinds and from different projects and makes them available to the community for any possible usage. Data sets can be in different formats (tables, maps, images) and some projects consist of simulated observations of gravitational lensing systems, mimicking the observing capabilities of existing or future facilities. Available datasets target lenses on a broad range of scales - lensing by galaxies, galaxy clusters and the large scale structure of the universe. You can access these datasets here.
“The IllustrisTNG project is an ongoing series of large, cosmological magnetohydrodynamical simulations of galaxy formation. TNG aims to illuminate the physical processes that drive galaxy formation: to understand when and how galaxies evolve into the structures that are observed in the night sky, and to make predictions for current and future observational programs. The simulations use a state of the art numerical code which includes a comprehensive physical model and runs on some of the largest supercomputers in the world”.
This dataset, available on the UCI Machine Learning Repository, consists of 19020 Monte Carlo-generated instances of simulated registration of high energy gamma particles in a ground-based atmospheric Cherenkov telescope using the imaging technique.
The data consists of 9 simulated SKA continuum images in FITS format in total intensity of the same field at 3 frequencies and 3 telescope integrations. It can be found here.
Training data for Kaggle’s (now closed) challenge “Observing Dark Worlds” can be found here. The challenge involved predicting the center of each dark matter halo in 120 simulated test skies, with the training set being 300 files of simulated skies containing 300-740 galaxies each.
A sample dataset based on the Lawrence Berkeley National Laboratory’s compressible cosmological hydrodynamics simulation code Nyx is available here.
These data are associated with the Microlensing Data challenge (about which more information can be found here) and consist of light curves simulated as those expected from the WFIRST survey.
“The Quijote simulations are a set of 43100 full N-body simulations. They are designed for two main tasks: (1) Quantify the information content on cosmological observables; (2) Provide enough statistics to train machine learning algorithms. But they can be used for a large variety of problems.”
AstroML contains various datasets, which we review below. It provides routines for downloading and working with astronomical datasets. For more details, see its documentation.
The survey obtained photometry for hundreds of millions of stars, quasars, and galaxies, and spectra for several million of these objects. In addition, the second phase of the survey performed repeated imaging over a small portion of the sky, called Stripe 82, enabling the study of the time-variation of many objects.
SDSS photometric data are observed through five filters, u, g, r, i, and z. A visualization of the range of these filters is shown below:
There are several other datasets to choose from: SDSS Corrected Spectra, SDSS Spectroscopic Sample, and the SDSS DR7 Quasar Catalog; data from other surveys, such as the NASA-Sloan Atlas and Stripe 82 Standards + 2MASS; time-domain data, such as RR Lyrae or LIGO data; and the WMAP temperature map.
For more, please see here.
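For machine learning, the five SDSS magnitudes (u, g, r, i, z) mentioned above are often converted into colors, that is, differences between magnitudes in adjacent filters. The sketch below shows that step in plain Python. The function name `adjacent_colors` and the example magnitudes are made up for illustration and do not come from any catalog or from astroML.

```python
SDSS_BANDS = ("u", "g", "r", "i", "z")


def adjacent_colors(magnitudes):
    """Compute the four adjacent-band colors from a dict of ugriz magnitudes."""
    colors = {}
    for bluer, redder in zip(SDSS_BANDS[:-1], SDSS_BANDS[1:]):
        colors[f"{bluer}-{redder}"] = magnitudes[bluer] - magnitudes[redder]
    return colors


# Made-up magnitudes, for illustration only.
example = {"u": 19.5, "g": 18.0, "r": 17.5, "i": 17.25, "z": 17.0}
print(adjacent_colors(example))
```

Colors are less sensitive to an object's overall brightness than the raw magnitudes, which is one reason they are a common choice of input feature.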
AstroML also contains many example notebooks that can help the user extract and use the available datasets:
One of many examples available in the astroML package is a notebook for using a Convolutional Neural Network for classifying SDSS galaxy images. It can be accessed here. Note that there are many more examples available.
An example (by Stephen Portillo) of using a dataset from astroML (RR Lyrae) with a decision tree algorithm can be found here.
The catalog, prepared by Nakoneczny et al., includes around 190000 quasar candidates. These candidates were identified using a random forest classifier from among a cleaned KiDS inference dataset.
A dataset of real galaxies extracted from the HST COSMOS survey and compiled by Rachel Mandelbaum et al. for use with GalSim to produce realistic simulations can be found in the associated Zenodo repo.
The HTRU2 dataset, available on the UCI Machine Learning Repository, describes a sample of pulsar candidates collected during the High Time Resolution Survey (South). The dataset contains 17898 instances (1639 being pulsars) and 9 attributes (mean of the integrated profile, mean of the DM-SNR curve, etc.).
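The counts quoted above imply a strong class imbalance, which is worth quantifying before training a classifier. The sketch below computes the pulsar fraction and one common choice of class weights, total / (number of classes × class count), which is the same heuristic as scikit-learn's "balanced" class weights. The function name `class_balance` is hypothetical.

```python
def class_balance(n_total, n_positive):
    """Return the negative count, positive fraction, and balanced class weights."""
    n_negative = n_total - n_positive
    positive_fraction = n_positive / n_total
    # Balanced weighting: n_total / (n_classes * n_in_class), with two classes.
    weights = {
        "pulsar": n_total / (2 * n_positive),
        "non_pulsar": n_total / (2 * n_negative),
    }
    return n_negative, positive_fraction, weights


# Counts taken from the HTRU2 description above.
n_negative, fraction, weights = class_balance(17898, 1639)
print(n_negative)  # 16259
print(f"pulsar fraction: {fraction:.3f}")
print(weights)
```

With fewer than one in ten candidates being a real pulsar, plain accuracy is a misleading metric here; precision and recall, or a weighted loss, are usually more informative.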
The Solar Flare dataset, also available on the UCI Machine Learning Repository, consists of 1389 instances, each of which captures features for one active region on the Sun. The number of flares in each of 3 classes (C, M, and X) within a 24-hour period can be predicted using the provided attributes.
This data was originally used for the PLAsTiCC challenge on Kaggle in order to identify promising methods to classify variable and transient light curves. It has now been unblinded and is available on a Zenodo repo.
The NOAO survey archives provide access to data from multiple surveys focusing broadly on the deep “blank” sky, nearby galaxies/clusters of galaxies and stellar populations in local group galaxies.
A database of all discovered extrasolar planets. The codebook and steps to access the data are available on the associated GitHub repo.
This webpage hosts several prepackaged astronomical datasets (including univariate, multivariate, images, model selection, spectra, etc.) for statistical analysis. Two example datasets on the site are the Shapley galaxy redshift catalog and SDSS quasar catalog.
Galaxy Zoo was a citizen science project with the purpose of classifying the morphologies of ~1 million galaxies imaged by the SDSS. The classification data, along with measurements of bulge size, presence of bars, and the structure of spiral arms, are now available here on Kaggle.
OGLE is an observing project that started in 1992. It has the longest ground-based observational data set of the Southern sky, specifically of the Magellanic Clouds. OGLE provides PNG images of the light curves of periodic variable stars, observed in the I filter. If a pulsating star has more than one pulsation period, folded light curves are given for each period. Additional data are also available for each star: the pulsation period(s) and the Fourier parameters from the light curve decomposition. The catalogs differ somewhat from one release to the next. Here, we have the OGLE Collection of Variable Stars and OGLE-III.
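For readers unfamiliar with folded light curves, the sketch below shows the basic operation: each observation time is mapped to a phase, the fractional part of (time - epoch) / period, so that all cycles overlap. This is a generic illustration in plain Python, not OGLE's own pipeline; the function name `fold_light_curve` and the sample values are made up.

```python
def fold_light_curve(times, magnitudes, period, epoch=0.0):
    """Return (phase, magnitude) pairs sorted by phase, with phase in units of the period."""
    if period <= 0:
        raise ValueError("period must be positive")
    folded = [
        (((t - epoch) / period) % 1.0, m)
        for t, m in zip(times, magnitudes)
    ]
    return sorted(folded)


# Made-up observation times (days) and magnitudes, for illustration only.
times = [0.0, 0.5, 1.25, 2.75]
mags = [15.0, 15.2, 15.6, 15.4]
for phase, mag in fold_light_curve(times, mags, period=2.0):
    print(phase, mag)
```

For a star with several pulsation periods, the same function can be called once per period to produce one folded curve for each.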
The MPC site hosts a large volume of data regarding the orbits of thousands of small bodies in the Solar System. You can access data about the orbits of all the asteroids or comets in the MPC database, along with much more!
The photometric redshift catalogs presented in Beck et al. 2017, which can be used to probe color coverage and photometric errors, can be accessed at this GitHub repo. They are one of the products of the third edition of the COIN Residence Program.