maica.data

The maica.data module provides the essential utilities and classes for loading chemical data from data files. This module abstracts all chemical data into numerical vectors or mathematical graphs. It supports the following four essential data types.

  • Numerical vector.

  • Chemical formula.

  • Molecular structure.

  • Crystal structure.

In addition to these essential data types, the module can handle composite data that combines them. This is useful for processing complex chemical data, such as molecule-to-molecule interactions and crystal structures under specific conditions.

Base Module

This module contains the abstract class of the dataset objects. All dataset objects in MAICA inherit from this abstract class.

class Dataset

An abstract class for the Dataset classes in MAICA.

read_data_file(path_data_file: str)

Read a data file from the given path_data_file. Only the xlsx and csv extensions are accepted.

Parameters

path_data_file – (str) A path of the data file.

Returns

Pandas and NumPy array objects of the data file.
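
The extension restriction can be pictured with a small check like the one below. This is only a sketch and not MAICA's implementation: the helper name check_data_file_extension, the case-insensitive comparison, and the choice of ValueError are assumptions made for illustration.

```python
from pathlib import Path

# Extensions that read_data_file is documented to accept.
ALLOWED_EXTENSIONS = {".xlsx", ".csv"}


def check_data_file_extension(path_data_file: str) -> str:
    """Return the lower-cased extension, or raise if it is not supported.

    Hypothetical helper: MAICA may validate paths differently.
    """
    ext = Path(path_data_file).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(
            f"Unsupported extension {ext!r}; expected one of {sorted(ALLOWED_EXTENSIONS)}"
        )
    return ext
```

Calling check_data_file_extension("data/train.csv") returns ".csv", while a path such as "data/train.json" raises ValueError in this sketch.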

Feature Vectors

This module provides a dataset class and a data loading function for vector-shaped data. It is useful for machine learning on datasets such as feature vectors of compounds and engineering conditions. To handle a vector-shaped dataset, call the load_dataset function in this module.

class VectorDataset(dataset: numpy.ndarray, idx_feat: object, idx_target: int, var_names: numpy.ndarray, idx_data: numpy.ndarray)

A dataset object for vector-shaped data. It can contain any vector-shaped data regardless of the domain of the data. This class is useful for handling numeric vectors, such as feature vectors of compounds and engineering conditions.

normalize()

Normalize the input data based on the z-score. For vector-shaped data \(\mathbf{x}\), the z-score is defined by:

\[\mathbf{z} = \frac{\mathbf{x} - \mathbf{\mu}}{\mathbf{\sigma}},\]

where \(\mathbf{\mu}\) is the mean of the data, and \(\mathbf{\sigma}\) is the standard deviation of the data. Note that all mathematical operations are applied dimension-wise.

denormalize(z: numpy.ndarray)

Restore the normalized input data. For vector-shaped normalized data \(\mathbf{z}\), the denormalized (original) data is calculated by:

\[\mathbf{x} = \mathbf{\sigma} \mathbf{z} + \mathbf{\mu},\]

where \(\mathbf{\mu}\) is the mean of the data, and \(\mathbf{\sigma}\) is the standard deviation of the data. Note that all mathematical operations are applied dimension-wise.

Parameters

z – (numpy.ndarray) The normalized data.

Returns

The original data corresponding to the given normalized data.
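
The two formulas for normalize and denormalize are inverses of each other, which a dependency-free sketch can make concrete. The function names below are hypothetical and operate on plain lists rather than numpy.ndarray objects. Using the population standard deviation is an assumption, since the documentation does not state which estimator MAICA uses.

```python
from statistics import fmean, pstdev


def zscore_fit(rows):
    """Compute the per-column mean and standard deviation of a list of rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation is an assumption of this sketch.
    stds = [pstdev(col) for col in columns]
    return means, stds


def zscore_normalize(rows, means, stds):
    """Apply z = (x - mu) / sigma to every column (sigma must be non-zero)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def zscore_denormalize(z_rows, means, stds):
    """Apply x = sigma * z + mu to every column."""
    return [[s * z + m for z, m, s in zip(row, means, stds)] for row in z_rows]


# Made-up data: three samples with two features each.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = zscore_fit(data)
z = zscore_normalize(data, means, stds)
restored = zscore_denormalize(z, means, stds)
```

Here restored matches data up to floating-point rounding, and both columns of z hold the same values because the second feature is simply ten times the first.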

to_tensor()

Convert the data stored as numpy.ndarray objects into torch.Tensor objects.

remove_outliers()

Remove outliers in the dataset using Local Outlier Factor (LOF).
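
For readers unfamiliar with LOF, the sketch below computes the scores from scratch with the standard library. It is not the code used by remove_outliers: the function name, the neighbor count k, and the threshold of 1.5 are all assumptions, and the documentation does not state which settings or which implementation MAICA relies on.

```python
import math


def local_outlier_factors(points, k=2):
    """Return the LOF score of every point (scores well above 1 suggest outliers).

    Assumes no duplicated points, which would give a zero reachability distance.
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to its neighbor j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]

    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four points on a unit square plus one distant point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = local_outlier_factors(points, k=2)
threshold = 1.5  # Assumed cut-off for this sketch.
inliers = [p for p, s in zip(points, scores) if s <= threshold]
```

The four corner points have the same density as their neighbors and score 1, while the distant point scores far above the threshold and is dropped from inliers.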

split(ratio: float)

Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.

Parameters

ratio – (float) Ratio between two sub-datasets. The sub-datasets are divided at a ratio of ratio to 1 - ratio.

Returns

Two sub-datasets and the original indices of the data in them.
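
The idea of a ratio-based split that also reports the original indices can be sketched as follows. The function name split_by_ratio, the random shuffling, and the seed argument are assumptions of this sketch. The documentation does not say whether split shuffles the data or in what form the indices are returned.

```python
import random


def split_by_ratio(data, ratio, seed=0):
    """Split data into two lists holding ratio and 1 - ratio of the items.

    Returns both subsets together with the original indices of their items.
    Random shuffling and the seed argument are assumptions of this sketch.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_first = int(ratio * len(data))
    idx_first, idx_second = indices[:n_first], indices[n_first:]
    first = [data[i] for i in idx_first]
    second = [data[i] for i in idx_second]
    return first, second, idx_first, idx_second


samples = [f"sample_{i}" for i in range(10)]
train, valid, idx_train, idx_valid = split_by_ratio(samples, ratio=0.8)
```

With ten samples and ratio=0.8, train holds eight items and valid holds two. The index lists allow predictions to be traced back to rows of the original data file.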

load_dataset(path_data_file: str, idx_feat: object, idx_target: int, impute_method: str = 'knn')

Load a vector-shaped dataset into a maica.data.vector.VectorDataset object.

Parameters
  • path_data_file – (str) The path of the data file.

  • idx_feat – (object) Indices of the input features in the data file.

  • idx_target – (int) An index of the target variable in the data file.

  • impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).

Returns

A VectorDataset object containing the dataset.
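
The roles of idx_feat and idx_target can be illustrated without MAICA by selecting columns from a small table. This sketch assumes that the indices are zero-based column positions, which the documentation does not state explicitly. The helper select_columns and the table values are made up for illustration, and imputation is omitted.

```python
import csv
import io

# A tiny in-memory table standing in for a csv data file (made-up values).
csv_text = """temperature,pressure,time,yield
300,1.0,10,0.52
350,1.5,20,0.61
400,2.0,30,0.74
"""


def select_columns(text, idx_feat, idx_target):
    """Split a csv table into feature rows and target values by column index."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    feats = [[float(row[i]) for i in idx_feat] for row in body]
    targets = [float(row[idx_target]) for row in body]
    feat_names = [header[i] for i in idx_feat]
    return feats, targets, feat_names


feats, targets, feat_names = select_columns(csv_text, idx_feat=[0, 1, 2], idx_target=3)
```

In this sketch the first three columns become the input features and the last column becomes the target variable.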

Graph Module

This module provides a base class of the mathematical graphs to store the molecular and crystal structures. The base class GraphDataset can be used for machine learning based on the molecular and crystal structures.

class GraphDataset(dataset: list, idx_struct: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, var_names: numpy.ndarray)

A base class to store the graph-structured data, such as the molecular and crystal structures.

n_data()

Return the number of data in the dataset.

Returns

The number of data.

n_node_feats()

Return the number of node features of the graphs in the dataset.

Returns

The number of node features.

n_edge_feats()

Return the number of edge features of the graphs in the dataset.

Returns

The number of edge features.

n_graphs()

Return the number of graphs in the input data. It is useful for configuring the machine learning algorithm for data containing multiple graphs, such as molecule-to-molecule interaction data.

Returns

The number of graphs in the input data.
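
The difference between the four counting methods is easiest to see on a toy container. The classes below are hypothetical stand-ins and do not reflect how GraphDataset stores its graphs internally. They assume that every sample holds the same number of graphs and that all graphs share the same feature sizes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Graph:
    node_feats: List[List[float]]  # one feature vector per node (atom)
    edges: List[Tuple[int, int]]   # (source, target) node index pairs
    edge_feats: List[List[float]]  # one feature vector per edge (bond)


@dataclass
class Sample:
    graphs: List[Graph]  # one graph, or several for interaction data
    target: float


class ToyGraphDataset:
    """Hypothetical stand-in that mirrors the four counting methods."""

    def __init__(self, samples: List[Sample]):
        self.samples = samples

    def n_data(self) -> int:
        return len(self.samples)

    def n_graphs(self) -> int:
        return len(self.samples[0].graphs)

    def n_node_feats(self) -> int:
        return len(self.samples[0].graphs[0].node_feats[0])

    def n_edge_feats(self) -> int:
        return len(self.samples[0].graphs[0].edge_feats[0])


# Two-node toy graphs with 3 node features and 1 edge feature (made-up numbers).
g_a = Graph([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [(0, 1)], [[1.0]])
g_b = Graph([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], [(0, 1)], [[2.0]])
dataset = ToyGraphDataset([Sample([g_a, g_b], 0.7), Sample([g_b, g_a], 0.2)])
```

For this toy interaction dataset, n_data() counts the two samples, while n_graphs() counts the two graphs inside each sample.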

split(ratio: float)

Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.

Parameters

ratio – (float) Ratio between two sub-datasets. The sub-datasets are divided at a ratio of ratio to 1 - ratio.

Returns

Two sub-datasets and the original indices of the data in them.

Chemical Formula

This module includes a basic class FormDataset to load a dataset containing the chemical formulas.

class FormDataset(dataset: numpy.ndarray, idx_form: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, var_names: numpy.ndarray, forms: numpy.ndarray, idx_data: Optional[numpy.ndarray] = None)

A dataset class for datasets based on chemical formulas. To create an instance of this class from your dataset, call load_dataset with the path of your dataset.

split(ratio: float)

Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.

Parameters

ratio – (float) Ratio between two sub-datasets. The sub-datasets are divided at a ratio of ratio to 1 - ratio.

Returns

Two sub-datasets and the original indices of the data in them.

load_dataset(path_data_file: str, idx_form: object, idx_feat: object, idx_target: int, impute_method: str = 'knn', path_elem_embs: Optional[str] = None)

Load a dataset containing the chemical formulas into a FormDataset object.

Parameters
  • path_data_file – (str) The path of the data file.

  • idx_form – (object) Indices of the chemical formulas in the data file.

  • idx_feat – (object) Indices of the input numerical features in the data file.

  • idx_target – (int) An index of the target variable in the data file.

  • impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).

  • path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).

Returns

A FormDataset object.
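
Before a chemical formula can be turned into a numerical representation, it has to be decomposed into its elements and amounts. The sketch below shows one simple way to do that for flat formulas. It is not MAICA's parser: the function names are hypothetical, parentheses and hydrates are not handled, and how MAICA combines the composition with elemental embeddings is not described in this document.

```python
import re
from collections import Counter

# An element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(form):
    """Parse a flat formula such as 'Fe2O3' into {element: amount}."""
    counts = Counter()
    for elem, amount in _TOKEN.findall(form):
        # A missing amount means a single atom of that element.
        counts[elem] += float(amount) if amount else 1.0
    return dict(counts)


def atomic_fractions(form):
    """Normalize the amounts so that they sum to one."""
    counts = parse_formula(form)
    total = sum(counts.values())
    return {elem: n / total for elem, n in counts.items()}


composition = parse_formula("Li0.5CoO2")
fractions = atomic_fractions("Fe2O3")
```

Under these assumptions, "Fe2O3" yields atomic fractions of 0.4 for Fe and 0.6 for O, which could then weight per-element embedding vectors.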

Molecular Structure

This module provides several utilities for handling datasets of molecular structures. Users can load a dataset of molecular structures into a GraphDataset object by calling the load_dataset function.

load_dataset(path_metadata_file: str, idx_smiles: object, idx_feat: object, idx_target: int, impute_method: str = 'knn', path_elem_embs: Optional[str] = None)

Generate the GraphDataset object for a dataset of the molecular structures. It can be used for machine learning to predict molecular properties from the molecular structures.

Parameters
  • path_metadata_file – (str) The path of the metadata file of the dataset.

  • idx_smiles – (object) Indices of the variables indicating the molecular structures in the dataset file.

  • idx_feat – (object) Indices of the input numerical features in the data file.

  • idx_target – (int) An index of the target variable in the data file.

  • impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).

  • path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).

Returns

A GraphDataset object.

Crystal Structure

This module provides several utilities for handling datasets of crystal structures. Users can load a dataset of crystal structures into a GraphDataset object by calling the load_dataset function.

load_dataset(path_dataset: str, path_metadata_file: str, idx_mid: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, impute_method: str = 'mean', path_elem_embs: Optional[str] = None)

Generate the GraphDataset object for a dataset of the crystal structures. It can be used for machine learning to predict materials properties from the crystal structures.

Parameters
  • path_dataset – (str) A path of the directory storing the CIF files.

  • path_metadata_file – (str) A path of the metadata file of the dataset.

  • idx_mid – (numpy.ndarray) Indices of the variables indicating the crystal structures in the dataset file.

  • idx_feat – (numpy.ndarray) Indices of the input numerical features in the data file.

  • idx_target – (int) An index of the target variable in the data file.

  • impute_method – (str, optional) An imputation method to fill empty data in the data file (default = 'mean').

  • path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).

Returns

A GraphDataset object.