Toolkits

The Toolkits page provides pre-trained machine learning models for various chemical applications. In addition to the prediction models, data mining methods for clustering and outlier detection are also supported. All ChemAI toolkits listed below are publicly available on the Toolkits page without registration.


1. Band Gap Prediction

Summary

This toolkit is a pre-trained gradient boosting tree regression (GBTR) model that predicts experimental band gaps. The GBTR was trained on 3,895 inorganic materials provided in this paper. Given the chemical formula of a material, it predicts the experimentally measured band gap of the material in eV. On 779 randomly selected materials, it showed a mean absolute error (MAE) of 0.213 eV and achieved an R2 score of 0.909. All resources, including the training dataset and the trained prediction model, are publicly available on the Toolkits page of ChemAI.
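The two accuracy figures quoted for the toolkits on this page, MAE and R2 score, can be computed as in the minimal sketch below. The function names and the sample values are illustrative assumptions; they are not ChemAI code or ChemAI data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative band gaps in eV (assumed values, not from the ChemAI dataset).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(measured, predicted)
r2 = r2_score(measured, predicted)
print(f"MAE = {mae:.3f} eV, R2 = {r2:.3f}")
```

A lower MAE and an R2 score closer to 1 both indicate better agreement between predicted and measured values.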

2. Formation Energy Prediction

Summary

This toolkit is a pre-trained GBTR that predicts the experimental formation energy of inorganic materials. The GBTR was trained on 1,196 inorganic materials to predict their standard enthalpies of formation measured by high-temperature calorimetry. It predicts the experimental formation energy from the chemical formula of the material. On 239 randomly selected materials, the prediction error and R2 score of this toolkit were 0.059 eV/atom and 0.907, respectively.
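The formula-based toolkits take only a chemical formula as input, so the formula string has to be turned into numbers before a regression model can use it. How ChemAI does this internally is not described here; the sketch below only shows one simple possibility, converting a flat formula into element amounts and atomic fractions. The helper names `parse_formula` and `element_fractions` are hypothetical, and parentheses or hydrates are deliberately not handled.

```python
import re

# One element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
_WHOLE = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'Bi2Te3'."""
    if not _WHOLE.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, amount in _TOKEN.findall(formula):
        # A missing amount means one atom; repeated symbols are summed.
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def element_fractions(formula):
    """Return {element: atomic fraction}, with fractions summing to 1."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return {element: amount / total for element, amount in counts.items()}


print(parse_formula("SrTiO3"))
print(element_fractions("Bi0.5Sb1.5Te3"))
```

Decimal amounts are accepted so that doped compositions can be written directly in the formula.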

3. Thermoelectricity Prediction

Summary

  • Machine learning model: DopNet

  • Training dataset: MRL Dataset (Not available)

  • Required input: Chemical formula

  • Target property: ZT (Figure of merit)

  • Test MAE: 0.063

  • Test R2 Score: 0.867

This toolkit predicts the ZT (figure of merit) of a material from its chemical formula. For an input chemical formula, it predicts a ZT value for each temperature from 300 K to 800 K. We trained DopNet to improve the prediction accuracy by explicitly identifying dopant effects in thermoelectric materials. To train DopNet, 480 thermoelectric materials covering various material systems were manually collected from the MRL Database.
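For readers unfamiliar with the target property, the sketch below computes ZT from its standard definition, ZT = S^2 * sigma * T / kappa, where S is the Seebeck coefficient, sigma the electrical conductivity, kappa the thermal conductivity, and T the absolute temperature. This is not how the toolkit obtains ZT (it predicts ZT directly from the formula); the function name and the input values are assumptions chosen for illustration.

```python
def figure_of_merit(seebeck_v_per_k, conductivity_s_per_m, thermal_w_per_mk, temperature_k):
    """Dimensionless thermoelectric figure of merit ZT = S^2 * sigma * T / kappa."""
    return (seebeck_v_per_k ** 2) * conductivity_s_per_m * temperature_k / thermal_w_per_mk


# Assumed illustrative values, not measurements from the MRL Database:
# S = 200 microvolts per kelvin, sigma = 1e5 S/m, kappa = 1.5 W/(m K), T = 300 K.
zt_300 = figure_of_merit(200e-6, 1.0e5, 1.5, 300.0)
print(f"ZT at 300 K = {zt_300:.2f}")
```

Because T appears in the numerator, ZT depends on temperature even for fixed transport properties, which is why the toolkit reports one value per temperature.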

4. Band Gap Prediction of Perovskites

Summary

Given the crystal structure of a hybrid organic-inorganic perovskite as a crystallographic information file (CIF), this toolkit predicts the HSE band gap by integrating elemental attributes and structural information of the perovskite. A graph neural network with an attention mechanism was trained on a calculated dataset containing 1,345 hybrid perovskites and their HSE band gaps. On a test dataset, this toolkit showed a prediction error of 0.138 eV and achieved an R2 score of 0.901.

5. Band Gap Correction

Summary

  • Machine learning model: Graph attention network (GAT)

  • Training dataset: PRB-270 dataset

  • Required inputs: Crystal structure (.cif) and GGA band gap (eV)

  • Target property: G0W0 band gap (eV)

  • Test MAE: 0.168 eV

  • Test R2 Score: 0.951

The purpose of this toolkit is to provide more accurate band gaps from GGA band gaps. To this end, a GAT was trained to predict the G0W0 band gap from the crystal structure and the GGA band gap. We trained the GAT on 270 crystal structures provided in this paper. On a test dataset, this toolkit showed a prediction error of 0.172 eV and an R2 score of 0.951.

6. Prediction of Absorption Max

Summary

This toolkit was trained on 17,295 chromophore-solvent pairs to predict their absorption max. Two graph neural networks were trained to embed the molecular structures of the chromophore and the solvent. From the chromophore and solvent embeddings, a fully-connected neural network was trained to predict the absorption max from chromophore-solvent interactions. On 3,459 randomly selected chromophore-solvent pairs, this toolkit showed a prediction error of 21.253 nm and achieved an R2 score of 0.902.

7. Data Clustering

Summary

  • Method: DBSCAN

  • Required inputs: Data points (.xlsx), distance threshold, and quantity threshold

The purpose of this toolkit is to group data points without label information, a task called clustering. Clustering is useful for gaining insight into which data points are similar and which are different. Given data points and hyperparameters, the toolkit groups the data points based on their distribution.
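To make the two hyperparameters concrete, here is a minimal pure-Python sketch of DBSCAN. It assumes that the "distance threshold" corresponds to the neighborhood radius (usually called eps) and the "quantity threshold" to the minimum number of neighbors, counting the point itself, needed for a dense region (usually called min_samples or MinPts); that mapping is an interpretation, not something the toolkit documents. Reading the .xlsx input is left out, and the points are written inline.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels


# Two tight groups and one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

A larger distance threshold merges nearby groups, while a larger quantity threshold turns sparse groups into noise, so the two values are usually tuned together.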

8. Outlier Detection

Summary

An outlier is an abnormal data point, and the prediction accuracy of machine learning methods can be improved by removing outliers from the training dataset. The purpose of this toolkit is to label each data point as an inlier or an outlier. After running this toolkit, a new dataset containing the inlier and outlier labels is provided through ChemAI.
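The detection method used by this toolkit is not stated here. As an illustration of what inlier and outlier labeling looks like, the sketch below swaps in Tukey's interquartile range (IQR) rule on a single numeric column; the function name, the 1.5 multiplier, and the sample values are assumptions and should not be read as the toolkit's actual behavior.

```python
import statistics


def label_outliers(values, k=1.5):
    """Label each value 'inlier' or 'outlier' using Tukey's IQR fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ["inlier" if lower <= v <= upper else "outlier" for v in values]


# Illustrative target values; the last one is far from the rest.
values = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 5.0]
labels = label_outliers(values)

# Keep only the inliers, as one might do before training a model.
cleaned = [v for v, label in zip(values, labels) if label == "inlier"]
print(labels)
print(cleaned)
```

The labels are returned in the same order as the input, so they can be attached to the original dataset as an extra column.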