maica.data¶
The maica.data module includes essential utilities and classes to load various chemical data from data files.
This module abstracts all chemical data into numerical vectors or mathematical graphs.
This module supports the following four essential data types.
Numerical vector.
Chemical formula.
Molecular structure.
Crystal structure.
In addition to the essential data types above, this module can handle composite data that combines them. This is useful for processing complex chemical data, such as molecule-to-molecule interactions and crystal structures under specific conditions.
Base Module¶
This module contains an abstract class for the dataset objects. All dataset objects in MAICA inherit from this abstract class.
- class Dataset¶
An abstract class for the Dataset classes in MAICA.
- read_data_file(path_data_file: str)¶
Read a data file from the given path_data_file. Only the xlsx and csv extensions are accepted.
- Parameters
path_data_file – (str) A path of the data file.
- Returns
Pandas and NumPy array objects of the data file.
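The extension check described above can be sketched as follows. This is a minimal illustration, not MAICA's implementation: the function name read_table is hypothetical, and returning a (DataFrame, ndarray) pair is an assumption based on the Returns description.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_table(path_data_file: str):
    """Hypothetical stand-in for Dataset.read_data_file (not MAICA's code)."""
    ext = Path(path_data_file).suffix.lower()
    if ext == ".csv":
        frame = pd.read_csv(path_data_file)
    elif ext == ".xlsx":
        # Reading xlsx files needs an Excel engine such as openpyxl.
        frame = pd.read_excel(path_data_file)
    else:
        raise ValueError(f"Unsupported extension {ext!r}; expected .xlsx or .csv")
    # Return both views of the data: the DataFrame and its NumPy array.
    return frame, frame.to_numpy()


# Usage: write a tiny CSV file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy.csv"
    csv_path.write_text("x1,x2,y\n1.0,2.0,3.0\n4.0,5.0,6.0\n")
    frame, values = read_table(str(csv_path))
```

Any other extension raises ValueError before a read is attempted.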
Feature Vectors¶
This module provides a dataset class and data load function for the vector-shaped data.
It is useful for machine learning on datasets such as feature vectors of compounds and engineering conditions.
To handle vector-shaped datasets, call the load_dataset function in this module.
- class VectorDataset(dataset: numpy.ndarray, idx_feat: object, idx_target: int, var_names: numpy.ndarray, idx_data: numpy.ndarray)¶
A dataset object for vector-shaped data. It can contain all vector-shaped data regardless of the domain of the data. This class is useful to handle the numeric vectors, such as feature vectors of compounds and engineering conditions.
- normalize()¶
Normalize the input data based on the z-score. For vector-shaped data \(\mathbf{x}\), the z-score is defined by:
\[\mathbf{z} = \frac{\mathbf{x} - \mathbf{\mu}}{\mathbf{\sigma}},\]
where \(\mathbf{\mu}\) is the mean of the data, and \(\mathbf{\sigma}\) is the standard deviation of the data. Note that all mathematical operations are applied dimension-wise.
- denormalize(z: numpy.ndarray)¶
Restore the normalized input data. For vector-shaped normalized data \(\mathbf{z}\), the denormalized (original) data is calculated by:
\[\mathbf{x} = \mathbf{\sigma} \mathbf{z} + \mathbf{\mu},\]
where \(\mathbf{\mu}\) is the mean of the data, and \(\mathbf{\sigma}\) is the standard deviation of the data. Note that all mathematical operations are applied dimension-wise.
- Parameters
z – (numpy.ndarray) The normalized data.
- Returns
The original data corresponding to the given normalized data.
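The two formulas above can be checked with a short NumPy round trip. This is a sketch of the arithmetic only, not the VectorDataset code. The use of the population standard deviation (NumPy's default, ddof=0) is an assumption, and a feature with zero variance would need special handling to avoid division by zero.

```python
import numpy as np

# Three samples with two features on very different scales.
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Per-feature (dimension-wise) statistics.
mu = x.mean(axis=0)
sigma = x.std(axis=0)  # population standard deviation (assumed)

# z-score normalization, then the inverse transformation.
z = (x - mu) / sigma
x_restored = sigma * z + mu
```

After normalization each column of z has zero mean and unit standard deviation, and x_restored matches x up to floating-point error.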
- to_tensor()¶
Convert the data objects of numpy.ndarray into the data objects of torch.Tensor.
- remove_outliers()¶
Remove outliers in the dataset using Local Outlier Factor (LOF).
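As an illustration of LOF-based filtering, the sketch below uses scikit-learn's LocalOutlierFactor. Whether MAICA relies on scikit-learn, and which settings it uses, is not stated here, so the n_neighbors value and the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A regular 5x5 grid of points plus one point far away from the grid.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
data = np.vstack([grid, [[50.0, 50.0]]])

# fit_predict labels each sample: 1 for inliers, -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(data)

# Keep only the samples labeled as inliers.
cleaned = data[labels == 1]
```

The isolated point has a much lower local density than its neighbors, so it is labeled -1 and dropped from cleaned.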
- split(ratio: float)¶
Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.
- Parameters
ratio – (float) Ratio between the two sub-datasets. The dataset is divided into two sub-datasets at a ratio of ratio to 1 - ratio.
- Returns
Two sub-datasets.
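A ratio-based split that also keeps track of the original indices can be sketched with NumPy as follows. The helper name split_by_ratio, the random shuffling, and the seed argument are assumptions made for illustration. The actual split method may select samples differently.

```python
import numpy as np


def split_by_ratio(data: np.ndarray, ratio: float, seed: int = 0):
    """Hypothetical helper: split rows at `ratio` to `1 - ratio`."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices so both parts are random samples (assumed).
    idx = rng.permutation(data.shape[0])
    n_first = int(ratio * data.shape[0])
    idx_first, idx_second = idx[:n_first], idx[n_first:]
    return data[idx_first], data[idx_second], idx_first, idx_second


# Usage: an 80/20 split of ten samples.
data = np.arange(20.0).reshape(10, 2)
train_part, test_part, idx_train, idx_test = split_by_ratio(data, ratio=0.8)
```

The returned index arrays map every row of each part back to its position in the original data.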
- load_dataset(path_data_file: str, idx_feat: object, idx_target: int, impute_method: str = 'knn')¶
Load a vector-shaped dataset into a maica.data.vector.VectorDataset object.
- Parameters
path_data_file – (str) The path of the data file.
idx_feat – (object) Indices of the input features in the data file.
idx_target – (int) An index of the target variable in the data file.
impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).
- Returns
A VectorDataset object containing the dataset.
Graph Module¶
This module provides a base class for the mathematical graphs that store molecular and crystal structures.
The base class GraphDataset can be used for machine learning based on molecular and crystal structures.
- class GraphDataset(dataset: list, idx_struct: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, var_names: numpy.ndarray)¶
A base class to store the graph-structured data, such as the molecular and crystal structures.
- n_data()¶
Return the number of data in the dataset.
- Returns
The number of data.
- n_node_feats()¶
Return the number of node features of the graphs in the dataset.
- Returns
The number of node features.
- n_edge_feats()¶
Return the number of edge features of the graphs in the dataset.
- Returns
The number of edge features.
- n_graphs()¶
Return the number of graphs in the input data. It is useful for configuring the machine learning algorithm when each data point contains multiple graphs, such as molecule-to-molecule interaction data.
- Returns
The number of graphs in the input data.
- split(ratio: float)¶
Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.
- Parameters
ratio – (float) Ratio between the two sub-datasets. The dataset is divided into two sub-datasets at a ratio of ratio to 1 - ratio.
- Returns
Two sub-datasets and the original indices of the data in them.
Chemical Formula¶
This module includes a basic class FormDataset to load a dataset containing chemical formulas.
- class FormDataset(dataset: numpy.ndarray, idx_form: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, var_names: numpy.ndarray, forms: numpy.ndarray, idx_data: Optional[numpy.ndarray] = None)¶
A dataset class for loading datasets based on chemical formulas. To create an instance of this class from your dataset, call load_dataset with the path of your dataset.
- split(ratio: float)¶
Split a dataset into two sub-datasets based on the given ratio. Two sub-datasets and the original indices of the data in them are returned.
- Parameters
ratio – (float) Ratio between the two sub-datasets. The dataset is divided into two sub-datasets at a ratio of ratio to 1 - ratio.
- Returns
Two sub-datasets and the original indices of the data in them.
- load_dataset(path_data_file: str, idx_form: object, idx_feat: object, idx_target: int, impute_method: str = 'knn', path_elem_embs: Optional[str] = None)¶
Load a dataset containing chemical formulas into a FormDataset object.
- Parameters
path_data_file – (str) The path of the data file.
idx_form – (object) Indices of the chemical formulas in the data file.
idx_feat – (object) Indices of the input numerical features in the data file.
idx_target – (int) An index of the target variable in the data file.
impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).
path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).
- Returns
A FormDataset object.
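To make the idea of turning a chemical formula into a numerical vector concrete, the sketch below parses a simple formula into element counts and then into atomic fractions over a fixed element list. It is an illustration only: parse_formula and to_fraction_vector are hypothetical names, formulas with parentheses are not handled, and the way MAICA actually featurizes formulas (for example with elemental embeddings) is not described here.

```python
import re
from collections import Counter

# An element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula: str) -> dict:
    """Return element counts for a flat formula such as 'Fe2O3'."""
    counts = Counter()
    for symbol, amount in _TOKEN.findall(formula):
        # A missing amount means a single atom of that element.
        counts[symbol] += float(amount) if amount else 1.0
    return dict(counts)


def to_fraction_vector(formula: str, elements: list) -> list:
    """Return atomic fractions ordered by the given element list."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return [counts.get(element, 0.0) / total for element in elements]


# Usage: Fe2O3 described over the elements Fe, O, and Ti.
counts = parse_formula("Fe2O3")
vector = to_fraction_vector("Fe2O3", ["Fe", "O", "Ti"])
```

Elements absent from the formula receive a fraction of zero, so every formula maps to a vector of the same length.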
Molecular Structure¶
This module provides several utilities to handle datasets of molecular structures.
Users can load a dataset of molecular structures into a GraphDataset object by calling the load_dataset function.
- load_dataset(path_metadata_file: str, idx_smiles: object, idx_feat: object, idx_target: int, impute_method: str = 'knn', path_elem_embs: Optional[str] = None)¶
Generate the GraphDataset object for a dataset of molecular structures. It can be used for machine learning to predict molecular properties from the molecular structures.
- Parameters
path_metadata_file – (str) The path of the metadata file of the dataset.
idx_smiles – (object) Indices of the variables indicating the molecular structures in the dataset file.
idx_feat – (object) Indices of the input numerical features in the data file.
idx_target – (int) An index of the target variable in the data file.
impute_method – (str, optional) An imputation method to fill empty data in the data file (default = maica.core.env.IMPUTE_KNN).
path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).
- Returns
A GraphDataset object.
Crystal Structure¶
This module provides several utilities to handle datasets of crystal structures.
Users can load a dataset of crystal structures into a GraphDataset object by calling the load_dataset function.
- load_dataset(path_dataset: str, path_metadata_file: str, idx_mid: numpy.ndarray, idx_feat: numpy.ndarray, idx_target: int, impute_method: str = 'mean', path_elem_embs: Optional[str] = None)¶
Generate the GraphDataset object for a dataset of crystal structures. It can be used for machine learning to predict materials properties from the crystal structures.
- Parameters
path_dataset – (str) A path of the directory storing the CIF files.
path_metadata_file – (str) A path of the metadata file of the dataset.
idx_mid – (numpy.ndarray) Indices of the variables indicating the crystal structures in the dataset file.
idx_feat – (numpy.ndarray) Indices of the input numerical features in the data file.
idx_target – (int) An index of the target variable in the data file.
impute_method – (str, optional) An imputation method to fill empty data in the data file (default = 'mean').
path_elem_embs – (str, optional) A path of the JSON file storing user-defined elemental embeddings (default = None).
- Returns
A GraphDataset object.
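The signature of this load_dataset uses 'mean' as the default impute_method. A common reading of mean imputation, assumed here rather than taken from the MAICA source, is to replace each missing value with the mean of the observed values in the same column:

```python
import numpy as np

# Toy feature matrix with missing entries marked as NaN.
data = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])

# Column means computed over the observed (non-NaN) values only.
col_means = np.nanmean(data, axis=0)

# Fill each missing entry with the mean of its column.
filled = np.where(np.isnan(data), col_means, data)
```

Observed values are left unchanged, and only the NaN entries are replaced.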