maica.ml

Base Classes

The maica.ml.base module includes the base classes of the machine learning algorithms. It provides a wrapper class for Scikit-learn models and an abstract class for PyTorch models.

class Model(alg_type: str, alg_name: str)

An abstract class of machine learning algorithms in MAICA. All machine learning algorithms in MAICA should inherit this class.

class SKLearnModel(alg_name: str, **kwargs)

A wrapper class of the machine learning algorithms in the Scikit-learn library. This class provides a generic interface to the fit() and predict() functions of the Scikit-learn algorithms.

fit(inputs: numpy.ndarray, targets: numpy.ndarray)

Fit model parameters for the given input and target data.

Parameters
  • inputs – (numpy.ndarray) The input data of the training dataset.

  • targets – (numpy.ndarray) The target data of the training dataset.

Returns

Trained model.

predict(inputs: numpy.ndarray)

Predict target values of the given input data.

Parameters

inputs – (numpy.ndarray) The input data of the dataset.

Returns

Predicted values for the given inputs.

save(path_model_file: str)

Save model parameters to the file at path_model_file.

Parameters

path_model_file – (str) The path of the model file.

load(path_model_file: str)

Load model parameters from the file at path_model_file.

Parameters

path_model_file – (str) The path of the model file.
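
A minimal usage sketch of SKLearnModel combining the fit(), predict(), and save() functions described above. The algorithm name 'gbtr' and the file path are illustrative placeholders; valid algorithm names are defined in maica.core.env:

    import numpy

    from maica.ml.base import SKLearnModel

    # Toy data: 100 examples with 4 input features and one target value each.
    inputs = numpy.random.rand(100, 4)
    targets = numpy.random.rand(100)

    # 'gbtr' is an illustrative algorithm name; valid names are defined in maica.core.env.
    model = SKLearnModel(alg_name='gbtr')
    model.fit(inputs, targets)

    preds = model.predict(inputs)          # Predicted values for the given inputs.
    model.save('save/model_gbtr.joblib')   # Illustrative file path.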

class PyTorchModel(alg_name: str)

An abstract class of the machine learning algorithms in the PyTorch library. In MAICA, all PyTorch algorithms should inherit this class.

gpu()

Move model parameters from CPU to GPU.

Returns

Model object (self) on GPU.

save(path_model_file: str)

Save model parameters to the file at path_model_file.

Parameters

path_model_file – (str) The path of the model file.

load(path_model_file: str)

Load model parameters from the file at path_model_file.

Parameters

path_model_file – (str) The path of the model file.

training: bool

Machine Learning Utilities

The maica.ml.util module provides essential functions for training configuration and model reuse. Most deep learning algorithms in MAICA are based on this module.

get_batch_sizes(n_data: int, train_setting: int)

Return candidate batch sizes for the stochastic gradient descent method according to the number of data and the training setting.

Parameters
  • n_data – (int) The number of data points that will be used for training.

  • train_setting – (int) A hyperparameter level determining the training settings.

Returns

A list of candidate batch sizes.

get_init_lrs(hparam_setting: int)

Return initial learning rates according to the given hyperparameter setting level.

Parameters

hparam_setting – (int) A hyperparameter level determining the scope of the hyperparameter optimization of the model.

Returns

A list of candidate learning rates.
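
A brief sketch of how these two utilities might be combined to enumerate candidate training configurations; the setting levels below are illustrative values:

    from maica.ml.util import get_batch_sizes, get_init_lrs

    # The setting levels are illustrative values.
    batch_sizes = get_batch_sizes(n_data=1000, train_setting=1)
    init_lrs = get_init_lrs(hparam_setting=1)

    # A simple grid over the candidate training configurations.
    configs = [(batch_size, init_lr) for batch_size in batch_sizes for init_lr in init_lrs]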

get_data_loader(*data: object, batch_size: int = 8, shuffle: bool = False)

Generate a data loader object for the given dataset. If the given data is numpy.ndarray, it returns a torch.utils.data.DataLoader object. If the data is maica.data.GraphDataset, it returns a torch_geometric.data.DataLoader object to iterate over the graph-structured data.

Parameters
  • data – (object) The dataset to be iterated by the data loader.

  • batch_size – (int, optional) The batch size of the data loader (default = 8).

  • shuffle – (bool, optional) An option to randomly shuffle the data at each iteration of the data loader (default = False).

Returns

A data loader object of type torch.utils.data.DataLoader or torch_geometric.data.DataLoader.
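
A minimal sketch of building a data loader from NumPy arrays, assuming the inputs and targets can be passed as separate positional arrays:

    import numpy

    from maica.ml.util import get_data_loader

    inputs = numpy.random.rand(64, 8)    # 64 examples with 8 features each.
    targets = numpy.random.rand(64, 1)

    # For numpy.ndarray data, a torch.utils.data.DataLoader is returned.
    data_loader = get_data_loader(inputs, targets, batch_size=8, shuffle=True)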

get_optimizer(model_params: torch._C.Generator, gd_name: str, init_lr: float = 0.001, l2_reg: float = 1e-06)

Return a gradient descent optimizer to fit model parameters.

Parameters
  • model_params – (torch.Generator) Model parameters to be trained by the generated optimizer.

  • gd_name – (str) A name of the gradient descent method to fit model parameters (defined in maica.core.env).

  • init_lr – (float, optional) An initial learning rate of the gradient descent optimizer (default = 0.001).

  • l2_reg – (float, optional) A coefficient of the L2 regularization in model parameters (default = 1e-06).

Returns

A gradient descent optimizer to fit model parameters.

get_loss_func(loss_func: str)

Return a loss function to evaluate the model performance.

Parameters

loss_func – (str) A name of the loss function to evaluate model performance (defined in maica.core.env).

Returns

A loss function object to evaluate the model performance.
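
A short sketch of preparing an optimizer and a loss function for a PyTorch-based MAICA model; the names 'adam' and 'mae' are taken from the defaults used elsewhere in this documentation, and the full lists of valid names are defined in maica.core.env:

    from maica.ml.util import get_loss_func, get_optimizer

    # 'model' is assumed to be a PyTorch-based MAICA model (e.g., FCNN below).
    optimizer = get_optimizer(model.parameters(), gd_name='adam',
                              init_lr=1e-3, l2_reg=1e-6)
    criterion = get_loss_func('mae')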

get_model(alg_name: str, **kwargs)

Get a machine learning model for the given algorithm name and model hyperparameters. The names of the algorithms are defined in maica.core.env.

Parameters
  • alg_name – (str) A name of the machine learning algorithm (defined in maica.core.env).

  • kwargs – (optional) A dictionary containing model hyperparameters.

Returns

A machine learning model.
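
A minimal sketch of constructing a model by name; 'fcnn' is an illustrative algorithm name, and the keyword arguments follow the FCNN constructor described below:

    from maica.ml.util import get_model

    # 'fcnn' is an illustrative algorithm name; valid names are defined in maica.core.env.
    # The keyword arguments are passed to the model as hyperparameters.
    model = get_model('fcnn', dim_in=32, dim_out=1)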

save_eval_results(task_name: str, model: maica.ml.base.Model, dataset_test: maica.data.base.Dataset, preds: numpy.ndarray)

Save the model parameters and the prediction results as a model file and an Excel file.

Parameters
  • task_name – (str) A name of your task.

  • model – (ml.base.Model) The model to be evaluated.

  • dataset_test – (data.base.Dataset) The dataset used for the evaluation.

  • preds – (numpy.ndarray) Prediction results of the model for the dataset.

save_interpretation(model: maica.ml.base.Model, file_name: str)

Save interpretable information of the machine learning algorithms. For non-interpretable algorithms, it raises AssertionError. The following algorithms support this function.

  • Decision Tree Regression (ALG_DCTR).

  • Symbolic Regression (ALG_SYMR).

  • Gradient Boosting Tree Regression (ALG_GBTR).

Parameters
  • model – (ml.base.Model) A machine learning algorithm to generate interpretation about the prediction.

  • file_name – (str) The path of the file to store the generated interpretation.

Neural Networks

The maica.ml.nn module provides implementations of essential feedforward neural networks. The algorithms in this module are used to predict target values from feature vectors and chemical formulas.

class FCNN(dim_in: int, dim_out: int)

A fully-connected neural network with three hidden layers and one output layer. Batch normalization is applied to each hidden layer to accelerate model training.

forward(x: torch.Tensor)

Predict target values for the given data x.

Parameters

x – (torch.tensor) A tensor containing input data of the model.

Returns

A tensor containing predicted values.

fit(data_loader: torch.utils.data.dataloader.DataLoader, optimizer: torch.optim.optimizer.Optimizer, criterion: object)

Fit the model parameters for the given dataset using the data loader, the optimizer, and the loss function. It performs one pass (epoch) of parameter optimization over the entire dataset.

Parameters
  • data_loader – (torch.utils.data.DataLoader) A data loader to sample the data from the training dataset.

  • optimizer – (torch.optim.Optimizer) An optimizer to fit the model parameters.

  • criterion – (object) A loss function to evaluate the prediction performance of the model.

Returns

Training loss.

predict(data_loader: object)

Predict target values for the given dataset in the data loader.

Parameters

data_loader – (torch.utils.data.DataLoader) A data loader to sample the data from the dataset.

Returns

A NumPy array containing the predicted values.
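
A minimal training sketch for FCNN using the utilities from maica.ml.util described above; the dataset shapes and the number of epochs are illustrative:

    import numpy

    from maica.ml.nn import FCNN
    from maica.ml.util import get_data_loader, get_loss_func, get_optimizer

    # Toy dataset: 128 examples with 16 input features and one target value each.
    inputs = numpy.random.rand(128, 16)
    targets = numpy.random.rand(128, 1)

    model = FCNN(dim_in=16, dim_out=1)
    data_loader = get_data_loader(inputs, targets, batch_size=8, shuffle=True)
    optimizer = get_optimizer(model.parameters(), gd_name='adam', init_lr=1e-3)
    criterion = get_loss_func('mae')

    # Each call to fit() performs one pass (epoch) over the training dataset.
    for epoch in range(100):
        train_loss = model.fit(data_loader, optimizer, criterion)

    preds = model.predict(data_loader)  # NumPy array of predicted values.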

training: bool

class Autoencoder(dim_in: int, dim_latent: int)

A neural network to generate latent embeddings of input data. It is trained to minimize the difference between the input and its reconstructed output rather than to predict target values. The training problem of the autoencoder can be defined by:

\[\theta^* = \arg\min_{\theta} ||\mathbf{x} - \mathbf{x}^{'}||_2^2,\]

where \(\mathbf{x}\) is the input data, and \(\mathbf{x}^{'}\) is the predicted value of the autoencoder.

forward(x: torch.Tensor)

Perform encoding and decoding for the given data x.

Parameters

x – (torch.tensor) The input data of the model.

Returns

The decoded input.

enc(x: torch.Tensor)

Generate the latent embedding of the input data x. This is called encoding in the autoencoders.

Parameters

x – (torch.tensor) The input data of the model.

Returns

Latent embedding of the input data x.

dec(z: torch.Tensor)

Restore the input data from the latent embedding of the input data. This is called decoding in the autoencoders.

Parameters

z – (torch.tensor) The latent embedding.

Returns

Restored input data from the given latent embedding z.

fit(data_loader: torch.utils.data.dataloader.DataLoader, optimizer: torch.optim.optimizer.Optimizer)

Fit the model parameters for the given dataset using the data loader and the optimizer. It performs one pass (epoch) of parameter optimization over the entire dataset.

Parameters
  • data_loader – (torch.utils.data.DataLoader) A data loader to sample the data from the training dataset.

  • optimizer – (torch.optim.Optimizer) An optimizer to fit the model parameters.

Returns

Training loss.

predict(data_loader: object)

Predict target values for the given dataset in the data loader.

Parameters

data_loader – (torch.utils.data.DataLoader) A data loader to sample the data from the dataset.

Returns

A NumPy array containing the predicted values.
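
A short sketch of training the Autoencoder and extracting latent embeddings; the data shapes, latent dimension, and number of epochs are illustrative:

    import numpy
    import torch

    from maica.ml.nn import Autoencoder
    from maica.ml.util import get_data_loader, get_optimizer

    inputs = numpy.random.rand(128, 16)

    model = Autoencoder(dim_in=16, dim_latent=4)
    data_loader = get_data_loader(inputs, batch_size=8, shuffle=True)
    optimizer = get_optimizer(model.parameters(), gd_name='adam')

    # The autoencoder is trained to reconstruct its inputs; no target values are needed.
    for epoch in range(100):
        train_loss = model.fit(data_loader, optimizer)

    # Latent embeddings of the input data.
    embeddings = model.enc(torch.tensor(inputs, dtype=torch.float))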

training: bool

Graph Neural Networks

The maica.ml.gnn module includes various implementations of graph neural networks based on the torch_geometric library. It provides pre-defined graph neural networks for structure-based predictions.

class GNN(dim_out: int, alg_name: str)

Abstract class of the graph neural networks. It defines two functions for parameter optimization and prediction.

fit(data_loader: torch_geometric.data.dataloader.DataLoader, optimizer: torch.optim.optimizer.Optimizer, criterion: object)

Fit the model parameters for the given dataset using the data loader, the optimizer, and the loss function. It performs one pass (epoch) of parameter optimization over the entire dataset.

Parameters
  • data_loader – (torch_geometric.data.DataLoader) A data loader to sample the data from the training dataset.

  • optimizer – (torch.optim.Optimizer) An optimizer to fit the model parameters.

  • criterion – (object) A loss function to evaluate the prediction performance of the model.

Returns

Training loss.

predict(data_loader: torch_geometric.data.dataloader.DataLoader)

Predict target values for the given dataset in the data loader.

Parameters

data_loader – (torch_geometric.data.DataLoader) A data loader to sample the data from the dataset.

Returns

A NumPy array containing the predicted values.

training: bool

class GCN(n_node_feats: int, dim_out: int, n_graphs: int = 1, readout: str = 'mean')

Graph convolutional network (GCN) from the “Semi-supervised Classification with Graph Convolutional Networks” paper.

forward(g: torch_geometric.data.batch.Batch)

Predict target values for the given Batch object.

Parameters

g – (torch_geometric.data.Batch) An input Batch object of the torch_geometric.data.Data objects.

Returns

Target values.

training: bool

class GAT(n_node_feats: int, dim_out: int, n_graphs: int = 1, readout: str = 'mean')

Graph attention network (GAT) from the “Graph Attention Networks” paper.

forward(g: torch_geometric.data.batch.Batch)

Predict target values for the given Batch object.

Parameters

g – (torch_geometric.data.Batch) An input Batch object of the torch_geometric.data.Data objects.

Returns

Target values.

training: bool

class GIN(n_node_feats: int, dim_out: int, n_graphs: int = 1, readout: str = 'mean')

Graph isomorphism network (GIN) from the “How Powerful are Graph Neural Networks?” paper.

forward(g: torch_geometric.data.batch.Batch)

Predict target values for the given Batch object.

Parameters

g – (torch_geometric.data.Batch) An input Batch object of the torch_geometric.data.Data objects.

Returns

Target values.

training: bool

class CGCNN(n_node_feats: int, n_edge_feats: int, dim_out: int, n_graphs: int = 1, readout: str = 'mean')

Crystal graph convolutional neural network (CGCNN) from the “Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties” paper.

forward(g: torch_geometric.data.batch.Batch)

Predict target values for the given Batch object.

Parameters

g – (torch_geometric.data.Batch) An input Batch object of the torch_geometric.data.Data objects.

Returns

Target values.

training: bool
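
A minimal training sketch for a graph neural network. The variable dataset is assumed to be a maica.data.GraphDataset, and the number of node features, epochs, and other values are illustrative:

    from maica.ml.gnn import GCN
    from maica.ml.util import get_data_loader, get_loss_func, get_optimizer

    # 'dataset' is assumed to be a maica.data.GraphDataset; get_data_loader then
    # returns a torch_geometric data loader for the graph-structured data.
    data_loader = get_data_loader(dataset, batch_size=32, shuffle=True)

    n_node_feats = 16  # Number of node features in the dataset (illustrative value).
    model = GCN(n_node_feats=n_node_feats, dim_out=1)
    optimizer = get_optimizer(model.parameters(), gd_name='adam', init_lr=1e-3)
    criterion = get_loss_func('mae')

    for epoch in range(300):
        train_loss = model.fit(data_loader, optimizer, criterion)

    preds = model.predict(data_loader)  # NumPy array of predicted values.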

Transfer Learning

This module includes several functions to perform transfer learning using neural networks. Popular transfer learning methods, called ‘retrain head’ and ‘fine tuning’, are provided as built-in functions.

tl_retrain_head(path_source_model, model_target: maica.ml.base.Model, dataset_target: maica.data.base.Dataset, gd_name: str = 'adam', init_lr: float = 0.001, l2_reg: float = 1e-06, loss_func: str = 'mae', max_epoch: int = 300)

Perform transfer learning based on the feature extractor of the source model. The feature extraction layers of the source model are frozen during training on the target dataset; only the prediction layer, called the head, is trained. This transfer learning method is called ‘retrain head’.

Parameters
  • path_source_model – (str) The path of the model file of the source model.

  • model_target – (ml.base.Model) Target model that will be used for transfer learning.

  • dataset_target – (data.base.Dataset) Target dataset for transfer learning.

  • gd_name – (str, optional) A name of the optimizer to train the target model (default = 'adam').

  • init_lr – (float, optional) An initial learning rate of the gradient descent optimizer (default = 1e-3).

  • l2_reg – (float, optional) A coefficient of the L2 regularization in model parameters (Default = 1e-6).

  • loss_func – (str, optional) A name of the loss function to evaluate the model performance (default = 'mae', mean absolute error).

  • max_epoch – (int, optional) The maximum number of epochs for the model parameter optimization (default = 300).

Returns

Trained model.
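
A hedged sketch of the retrain-head workflow. The import path, file path, and the model_target and dataset_target objects are assumptions, since the documentation above does not state them:

    # NOTE: the import path of the transfer learning functions is an assumption.
    from maica.ml.tl import tl_retrain_head

    # 'model_target' and 'dataset_target' are assumed to be a MAICA model and dataset
    # that are compatible with the source model stored at the given path.
    model = tl_retrain_head('save/source_model.pt', model_target, dataset_target,
                            gd_name='adam', init_lr=1e-3, loss_func='mae', max_epoch=300)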

tl_fine_tuning(path_source_model, model_target: maica.ml.base.Model, dataset_target: maica.data.base.Dataset, gd_name: str = 'adam', init_lr: float = 1e-06, l2_reg: float = 1e-06, loss_func: str = 'mae', max_epoch: int = 300)

Perform transfer learning based on a pre-trained source model. The target model is initialized by the model parameters of the source model. After the initialization, the target model is trained on the target dataset with a small learning rate. This transfer learning method is called ‘fine tuning’.

Parameters
  • path_source_model – (str) The path of the model file of the source model.

  • model_target – (ml.base.Model) Target model that will be used for transfer learning.

  • dataset_target – (data.base.Dataset) Target dataset for transfer learning.

  • gd_name – (str, optional) A name of the optimizer to train the target model (default = 'adam').

  • init_lr – (float, optional) An initial learning rate of the gradient descent optimizer (default = 1e-6).

  • l2_reg – (float, optional) A coefficient of the L2 regularization in model parameters (Default = 1e-6).

  • loss_func – (str, optional) A name of the loss function to evaluate the model performance (default = 'mae', mean absolute error).

  • max_epoch – (int, optional) The maximum number of epochs for the model parameter optimization (default = 300).

Returns

Trained model.
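
A hedged sketch of the fine-tuning workflow, under the same assumptions as the retrain-head example above (assumed import path, file path, and target model/dataset objects):

    # NOTE: the import path of the transfer learning functions is an assumption.
    from maica.ml.tl import tl_fine_tuning

    # The small default learning rate (1e-6) keeps the parameters initialized from the
    # source model close to their pre-trained values during training on the target data.
    model = tl_fine_tuning('save/source_model.pt', model_target, dataset_target,
                           init_lr=1e-6, max_epoch=300)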