Toolkits

The Toolkits page provides pre-trained machine learning models for various chemical applications. In addition to the prediction models, data mining methods for clustering and outlier detection are also supported. All ChemAI toolkits listed below are publicly available on the Toolkits page without registration.

Machine learning prediction
Data pre-processing
- Data Clustering
- Outlier Detection

1. Band Gap Prediction

Summary

Machine learning model: Gradient boosting tree regression (GBTR)
Training dataset: EBG Dataset
Required input: Chemical formula
Target property: Experimental band gap (eV)
Test MAE: 0.213 eV
Test R2 Score: 0.909

This toolkit is a pre-trained GBTR model that predicts the experimental band gap. The GBTR was trained on 3,895 inorganic materials provided in this paper. Given the chemical formula of a material, it predicts the experimentally measured band gap in eV. On 779 randomly selected materials, it achieved a mean absolute error (MAE) of 0.213 eV and an R2 score of 0.909. All resources, including the training dataset and the trained prediction model, are publicly available on the Toolkits page of ChemAI.
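The source does not say how a chemical formula is turned into model inputs. As a minimal sketch of one common choice, the code below parses a flat formula into element fractions that a tree-based regressor could consume. The function names `parse_formula` and `element_fractions` are hypothetical, and the parser deliberately rejects parentheses and other notation it does not understand.

```python
import re

# Assumed token shapes: an element symbol followed by an optional amount.
_ELEMENT = r"[A-Z][a-z]?"
_AMOUNT = r"\d+(?:\.\d+)?"
_TOKEN = re.compile(rf"({_ELEMENT})({_AMOUNT})?")
_FORMULA = re.compile(rf"(?:{_ELEMENT}(?:{_AMOUNT})?)+")


def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'Fe2O3'."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    amounts = {}
    for element, amount in _TOKEN.findall(formula):
        # A missing amount means one atom, as in 'GaAs'.
        value = float(amount) if amount else 1.0
        amounts[element] = amounts.get(element, 0.0) + value
    return amounts


def element_fractions(formula):
    """Normalise the amounts so that they sum to 1."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return {element: value / total for element, value in amounts.items()}


fractions = element_fractions("Fe2O3")
```

A fixed-length feature vector can then be built by placing each fraction at the position of its element in a chosen element list. Whether the toolkit uses this representation or richer elemental descriptors is not stated in the source.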

2. Formation Energy Prediction

Summary

Machine learning model: Gradient boosting tree regression (GBTR)
Training dataset: EFE Dataset
Required input: Chemical formula
Target property: Experimental formation energy (eV/atom)
Test MAE: 0.059 eV/atom
Test R2 Score: 0.907

This toolkit is a pre-trained GBTR model that predicts the experimental formation energy of inorganic materials. The GBTR was trained on 1,196 inorganic materials to predict their standard enthalpies of formation measured by high-temperature calorimetry. It predicts the experimental formation energy from the chemical formula of a material. On 239 randomly selected materials, this toolkit achieved an MAE of 0.059 eV/atom and an R2 score of 0.907.
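The MAE and R2 figures quoted for the toolkits can be reproduced in form with the sketch below, assuming the standard definitions of mean absolute error and the coefficient of determination. The test-set sizes quoted on this page are each about one fifth of the stated dataset size, which would fit a random 20% hold-out, but the source does not describe the split, so `test_fraction=0.2` and the helper names are assumptions. The numbers in `y_true` and `y_pred` are placeholders, not toolkit results.

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Placeholder values used only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

train_ids, test_ids = holdout_split(range(1196))
```

An R2 score of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.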

3. Thermoelectricity Prediction

Summary

Machine learning model: DopNet
Training dataset: MRL Dataset (Not available)
Required input: Chemical formula
Target property: ZT (Figure of merit)
Test MAE: 0.063
Test R2 Score: 0.867

This toolkit predicts the ZT (figure of merit) of a material from its chemical formula. For an input chemical formula, it predicts a ZT value at each temperature from 300 K to 800 K. We trained DopNet to improve the prediction accuracy by explicitly identifying dopant effects in thermoelectric materials. To train DopNet, 480 thermoelectric materials covering various material systems were manually collected from the MRL Database.
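To make the idea of treating dopants separately more concrete, the sketch below splits a composition into host elements and dopants by amount, and builds the list of temperatures to query. The source does not state how DopNet decides what counts as a dopant or which temperature step the toolkit uses, so the `threshold` of 0.1, the 100 K `step`, the function names, and the example composition are all illustrative assumptions.

```python
def split_host_and_dopants(amounts, threshold=0.1):
    """Split {element: amount} into (host, dopants) by a cutoff amount.

    Elements at or below the threshold are treated as dopants.
    The threshold value is a hypothetical choice for illustration.
    """
    host = {el: a for el, a in amounts.items() if a > threshold}
    dopants = {el: a for el, a in amounts.items() if a <= threshold}
    return host, dopants


def temperature_grid(start=300, stop=800, step=100):
    """Temperatures in kelvin from start to stop inclusive (step assumed)."""
    return list(range(start, stop + 1, step))


# Illustrative composition: a host with a small amount of a third element.
composition = {"Mg": 2.0, "Si": 0.97, "Bi": 0.03}
host, dopants = split_host_and_dopants(composition)
temperatures = temperature_grid()

# One query row per temperature; a trained model would map each row to a ZT.
queries = [(host, dopants, t) for t in temperatures]
```

Keeping the host and dopant parts apart lets a model encode them with separate inputs, so that a small dopant amount is not drowned out by the much larger host amounts.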

4. Band Gap Prediction of Perovskites

Summary

Machine learning model: Graph attention network (GAT)
Training dataset: HOIP dataset
Required input: Crystal structure (.cif)
Target property: HSE band gap (eV)
Test MAE: 0.138 eV
Test R2 Score: 0.901

Given the crystal structure of a hybrid organic-inorganic perovskite in crystallographic information file (CIF) format, this toolkit predicts the HSE band gap by integrating elemental attributes and structural information of the perovskite. A graph neural network with an attention mechanism was trained on a computed dataset containing 1,345 hybrid perovskites and their HSE band gaps. On a test dataset, this toolkit achieved an MAE of 0.138 eV and an R2 score of 0.901.
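The attention mechanism mentioned above can be illustrated with a single aggregation step in plain Python. This sketch follows the commonly used graph attention formulation (a shared linear map, a LeakyReLU-scored pair of node vectors, and a softmax over neighbours). It is not taken from the toolkit, and the node features, the weight matrix `W`, and the attention vector `a` are hypothetical placeholders.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gat_layer(features, neighbours, W, a):
    """One attention-weighted update; neighbours[i] should include i itself."""
    z = [matvec(W, h) for h in features]  # shared linear transform
    updated = []
    for i, nbrs in enumerate(neighbours):
        # Unnormalised score for each neighbour j from the pair [z_i || z_j].
        scores = [
            leaky_relu(sum(c * v for c, v in zip(a, z[i] + z[j]))) for j in nbrs
        ]
        # Numerically stable softmax over the neighbourhood.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alphas = [w / total for w in weights]
        # Weighted sum of the transformed neighbour features.
        updated.append(
            [sum(al * z[j][d] for al, j in zip(alphas, nbrs)) for d in range(len(z[i]))]
        )
    return updated


# Three atoms with 2-dimensional placeholder features and self-loops.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = [[0, 1], [0, 1, 2], [1, 2]]
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = gat_layer(features, neighbours, identity, [0.0, 0.0, 0.0, 0.0])
```

With an all-zero attention vector every neighbour gets the same weight, so the update reduces to a plain average. Training the attention vector lets the network weight some neighbouring atoms more heavily than others.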

5. Band Gap Correction

Summary

Machine learning model: Graph attention network (GAT)
Training dataset: PRB-270 dataset
Required inputs: Crystal structure (.cif) and GGA band gap (eV)
Target property: G0W0 band gap (eV)
Test MAE: 0.168 eV
Test R2 Score: 0.951

This toolkit provides more accurate band gaps from GGA band gaps. To this end, a GAT was trained to predict the G0W0 band gap from the crystal structure and the GGA band gap. The GAT was trained on 270 crystal structures provided in this paper. On a test dataset, this toolkit showed a prediction error of 0.172 eV and an R2 score of 0.951.

6. Prediction of Absorption Max

Summary

Machine learning model: Graph Interaction Network of GCN
Training dataset: CHR-SOLV dataset
Required inputs: SMILES of chromophore and solvent
Target property: Absorption max (nm)
Test MAE: 21.253 nm
Test R2 Score: 0.902

This toolkit was trained on 17,295 chromophore and solvent pairs to predict their absorption max. Two graph neural networks were trained to embed the molecular structures of the chromophore and the solvent. From the chromophore and solvent embeddings, a fully-connected neural network was trained to predict the absorption max from chromophore and solvent interactions. On 3,459 randomly selected chromophore and solvent pairs, this toolkit achieved an MAE of 21.253 nm and an R2 score of 0.902.

7. Data Clustering

Summary

Method: DBSCAN
Required inputs: Data points (.xlsx), distance threshold, and quantity threshold

The purpose of this toolkit is to group data points without label information, which is called clustering. Clustering helps reveal which data are similar and which are different. For given data points and hyperparameters, it groups the data points based on their distribution.

8. Outlier Detection

Summary

Method: Local outlier factor (LOF)
Required inputs: Data points (.xlsx) and the number of neighbors

An outlier is an abnormal data point, and the prediction accuracy of machine learning methods can be improved by removing outliers from the training dataset. This toolkit labels each data point as an inlier or an outlier. After running this toolkit, a new dataset containing inlier and outlier labels is provided through ChemAI.
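The two data pre-processing methods named on this page, DBSCAN and LOF, can be sketched in plain Python as follows. This is a simplified illustration rather than the toolkit's implementation. It assumes that the "distance threshold" and "quantity threshold" inputs correspond to DBSCAN's neighbourhood radius (`eps`) and minimum point count (`min_pts`), and that the "number of neighbors" input is the LOF parameter `k`. The LOF version assumes distinct points and ignores ties in the k-th neighbour distance. The sample points are made up for the demonstration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(nbrs[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(nbrs[j]) >= min_pts:
                queue.extend(nbrs[j])  # j is a core point, keep expanding
    return labels


def local_outlier_factor(points, k):
    """Return the LOF score of each point (about 1 for inliers, larger for outliers)."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    knn = [sorted((math.dist(points[i], points[j]), j)
                  for j in range(n) if j != i)[:k] for i in range(n)]
    k_dist = [nb[-1][0] for nb in knn]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for _, j in knn[i]) / (len(knn[i]) * lrd[i])
            for i in range(n)]


# Two tight groups of four points and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
scores = local_outlier_factor(points, k=3)
```

In this example the isolated point is labelled as noise by DBSCAN and receives an LOF score well above 1, while the points inside the two groups score close to 1. Turning LOF scores into inlier and outlier labels requires a cutoff, which the source does not specify.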