maica.chem¶
The maica.chem module includes essential utilities and algorithms for chemical machine learning.
It provides global variables containing chemical information and several functions for generating numerical representations of chemical data.
Global Variables
| Variable Name | Type | Description |
|---|---|---|
| atom_nums | dict | A dictionary of the atomic number for each atomic symbol. |
| atom_sysm | dict | A dictionary of the atomic symbol for each atomic number. |
| elem_feat_names | list | Basic elemental features from the Python Mendeleev package. |
| cat_hbd | list | Categories of hybridization of the atoms. |
| cat_fc | list | Categories of formal charge of the atoms. |
| cat_bond_types | list | Categories of bond types. |
Functions
| Function Name | Description |
|---|---|
| load_mendeleev_feats | Load basic elemental attributes from the Python Mendeleev package. |
| load_elem_feats | Load elemental attributes from user-defined elemental embeddings. If path_elem_embs is None, basic elemental features from the Python Mendeleev package are returned. |
| form_to_vec | Convert the chemical formula into a feature vector based on the given elemental features elem_feats. |
| parse_form | Convert the chemical formula string into an element-wise dictionary. |
| get_mol_graph | Convert the SMILES representation into the molecular graph. |
| rbf | Apply a radial basis function to the given data. |
| even_samples | Sample n_samples values evenly within [min_val, max_val]. |
| get_crystal_graph | Convert the CIF file of the crystal structure into the crystal graph. |
| get_atom_info | Extract atom-feature and atom-coordination matrices from the given pymatgen.core.Structure object. |
| get_bond_info | Extract bond information from the given atom coordinates atom_coord. |
Base Utilities¶
The maica.chem.base module includes basic information and utilities for processing chemical data.
It is designed to provide elemental attributes or embeddings for converting unstructured chemical data into formal data formats, such as numpy.ndarray and torch.Tensor.
- load_mendeleev_feats()¶
Load elemental features from the Python Mendeleev package. A total of eight elemental features are loaded, comprising the attributes in elem_feat_names and the first ionization energy.
- Returns
An array of the basic elemental features, arranged row-wise.
- load_elem_feats(path_elem_embs: Optional[str] = None)¶
Load elemental features from the elemental embeddings in the given path_elem_embs. Users can apply customized elemental features in machine learning through this function. The user-defined elemental features should be provided as a JSON file with the format {element: array of features}. If the argument path_elem_embs is None, basic elemental features from the Python Mendeleev package are returned.
- Parameters
path_elem_embs – (str, optional) Path of the JSON file including custom elemental features (default = None).
- Returns
An array of the elemental features, arranged row-wise.
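The {element: array of features} JSON format described above might look like the following. This is a minimal sketch using only the standard library: the file name, the placeholder feature values, and the way the rows are stacked are assumptions for illustration, not maica's actual loader.

```python
import json
import os
import tempfile

# Hypothetical custom embeddings: {element symbol: list of feature values}.
# The numbers are placeholders, not real elemental data.
custom_embs = {
    "H": [0.1, 0.2, 0.3],
    "He": [0.4, 0.5, 0.6],
    "Li": [0.7, 0.8, 0.9],
}

# Write the embeddings to a JSON file, as a user would prepare them.
path_elem_embs = os.path.join(tempfile.mkdtemp(), "elem_embs.json")
with open(path_elem_embs, "w") as f:
    json.dump(custom_embs, f)

# Read the file back and stack the features row-wise (one row per element).
with open(path_elem_embs) as f:
    loaded = json.load(f)
elem_rows = [loaded[symbol] for symbol in ("H", "He", "Li")]
print(len(elem_rows), len(elem_rows[0]))
```

A file prepared this way would then be passed to load_elem_feats through path_elem_embs.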
Chemical Formula¶
A collection of functions to convert chemical formula strings into machine-readable numerical representations.
- form_to_vec(form: str, elem_feats: numpy.ndarray)¶
Convert the given chemical formula into a feature vector. For a set of computed elemental features \(S = \{\mathbf{h} = f(e) \mid e \in c \}\), where \(c\) is the given chemical formula, the feature vector is calculated as the concatenation of a weighted sum with weights \(w_\mathbf{h}\), the standard deviation \(\sigma\), and the max operation:
\[\mathbf{x} = \left(\sum_{\mathbf{h} \in S} w_\mathbf{h} \mathbf{h}\right) \oplus \sigma(S) \oplus \max(S).\]
Note that the standard deviation \(\sigma\) and the max operation are applied feature-wise (not element-wise). This formula-to-vector conversion method is common in chemical machine learning [1, 2, 3].
- Parameters
form – (str) Chemical formula (e.g., ZnIn2S4).
elem_feats – (numpy.ndarray) The NumPy array of elemental features.
- Returns
A feature vector of the chemical formula.
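The concatenation of weighted sum, standard deviation, and maximum can be sketched in NumPy as follows. This is an illustration under stated assumptions, not maica's implementation: the weights are taken to be the normalized element amounts, row i of elem_feats is assumed to belong to atomic number i + 1, the population standard deviation is used, and the small atom_nums dictionary and the elem_feats values are stand-ins.

```python
import numpy as np


def formula_vector_sketch(composition, atom_nums, elem_feats):
    """Concatenate weighted sum, std, and max of the elemental features."""
    symbols = list(composition)
    amounts = np.array([composition[s] for s in symbols], dtype=float)
    # Assumption: weights are the normalized element amounts.
    weights = amounts / amounts.sum()
    # Assumption: row (Z - 1) of elem_feats holds the features of element Z.
    feats = np.stack([elem_feats[atom_nums[s] - 1] for s in symbols])
    weighted_sum = weights @ feats  # shape: (n_features,)
    std = feats.std(axis=0)         # per feature, across the elements
    fmax = feats.max(axis=0)        # per feature, across the elements
    return np.concatenate([weighted_sum, std, fmax])


# Stand-in data: a tiny symbol-to-number map and placeholder features.
atom_nums = {"S": 16, "Zn": 30, "In": 49}
elem_feats = np.arange(100 * 4, dtype=float).reshape(100, 4)

x = formula_vector_sketch({"Zn": 1.0, "In": 2.0, "S": 4.0}, atom_nums, elem_feats)
print(x.shape)  # (12,)
```

With four features per element, the resulting vector has 3 × 4 = 12 entries, one block for each of the three operations.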
- parse_form(form: str)¶
Parse the chemical formula into an element-wise dictionary. For example, ZnIn2S4 is converted into the dictionary {Zn: 1.0, In: 2.0, S: 4.0}.
- Parameters
form – (str) Chemical formula (e.g., ZnIn2S4).
- Returns
An element-wise dictionary of the chemical formula.
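One way such parsing could work is with a regular expression. The function below is a hypothetical sketch, not the parse_form implementation; it handles flat formulas only (no parentheses or hydrates) and sums repeated elements.

```python
import re
from collections import defaultdict


def parse_formula_sketch(form: str) -> dict:
    """Parse a flat formula such as 'ZnIn2S4' into {element: amount}."""
    counts = defaultdict(float)
    # An element symbol is one uppercase letter plus optional lowercase
    # letters, followed by an optional integer or decimal amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]*)(\d+(?:\.\d+)?)?", form):
        counts[symbol] += float(amount) if amount else 1.0
    return dict(counts)


print(parse_formula_sketch("ZnIn2S4"))  # {'Zn': 1.0, 'In': 2.0, 'S': 4.0}
```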
Molecular Structure¶
This module provides abstracted functions to convert molecular structures into mathematical graphs.
The converted molecular structure is stored as a torch_geometric.data.Data object.
- get_mol_graph(elem_feats: numpy.ndarray, smiles: str, numeric_feats: numpy.ndarray, target: Optional[float] = None, gid: int = -1)¶
Convert the SMILES representation of a molecular structure into the mathematical graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\), where \(\mathcal{V}\) is a set of atoms, \(\mathcal{E}\) is a set of bonds, \(X\) is an atom-feature matrix, and \(R\) is a bond-feature matrix.
- Parameters
elem_feats – (numpy.ndarray) The NumPy array of elemental features.
smiles – (str) A SMILES representation of the input molecular structure.
numeric_feats – (numpy.ndarray) Numerical features of the molecule (e.g., molecular weight).
target – (float, optional) The target property that you want to predict from the molecular structure (default = None).
gid – (int, optional) An integer identifier of the molecular structure (default = -1).
- Returns
A molecular graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\).
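To make the components of \(G = (\mathcal{V}, \mathcal{E}, X, R)\) concrete, the sketch below builds them by hand for the heavy atoms of ethanol (C-C-O) using plain NumPy. Everything here is an assumption for illustration: the bond-type categories, the placeholder elem_feats values, the row-per-atomic-number layout, and the use of two directed edges per bond. The actual function parses the SMILES string and returns a PyTorch Geometric object instead.

```python
import numpy as np

# Hand-written stand-in for a parsed molecule: ethanol heavy atoms C, C, O.
atoms = [6, 6, 8]  # atomic numbers, i.e. the node set V
bonds = [(0, 1, "SINGLE"), (1, 2, "SINGLE")]  # the bond set E

# Hypothetical bond-type categories and placeholder elemental features.
bond_types = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]
elem_feats = np.arange(100 * 4, dtype=float).reshape(100, 4)

# Atom-feature matrix X: one row of elemental features per atom.
X = np.stack([elem_feats[z - 1] for z in atoms])

# Each undirected bond becomes two directed edges with the same features.
edge_index, R = [], []
for i, j, btype in bonds:
    one_hot = np.eye(len(bond_types))[bond_types.index(btype)]
    edge_index += [(i, j), (j, i)]
    R += [one_hot, one_hot]
edge_index = np.array(edge_index).T  # shape: (2, n_edges)
R = np.stack(R)                      # bond-feature matrix

print(X.shape, edge_index.shape, R.shape)  # (3, 4) (2, 4) (4, 4)
```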
Crystal Structure¶
This module provides abstracted functions to convert crystal structures into mathematical graphs.
Like the molecular structure, the converted crystal structure is stored as a torch_geometric.data.Data object.
- rbf(x: numpy.ndarray, mu: numpy.ndarray, beta: float)¶
Apply a radial basis function to the given data x with the mean vector mu and a constant beta. A Gaussian kernel is used to implement this radial basis function.
- Parameters
x – (numpy.ndarray) Input data of the radial basis function.
mu – (numpy.ndarray) Feature-wise mean vector of the Gaussian kernel.
beta – (float) A shape parameter of the radial basis function.
- Returns
Converted data of the input x.
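A Gaussian radial basis expansion can be sketched as below. The exact way beta scales the kernel is not stated here, so the form \(\exp(-(x - \mu)^2 / \beta^2)\) is an assumption, as are the function name and the sample values.

```python
import numpy as np


def rbf_sketch(x, mu, beta):
    """Expand each value in x over Gaussian kernels centered at mu."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)    # shape: (n_values, 1)
    mu = np.asarray(mu, dtype=float).reshape(1, -1)  # shape: (1, n_centers)
    # Assumed kernel form: exp(-(x - mu)^2 / beta^2).
    return np.exp(-((x - mu) ** 2) / beta ** 2)


distances = np.array([1.0, 2.5])
centers = np.array([1.0, 2.0, 3.0])
expanded = rbf_sketch(distances, centers, beta=0.5)
print(expanded.shape)  # (2, 3)
```

Each input value becomes one row with as many entries as there are centers, and an entry equals 1 when the value coincides with a center.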
- even_samples(min_val: float, max_val: float, n_samples: int)¶
Sample n_samples values evenly within [min_val, max_val].
- Parameters
min_val – (float) The minimum value of the sample range.
max_val – (float) The maximum value of the sample range.
n_samples – (int) The number of samples.
- Returns
An array of the samples.
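Evenly spaced sampling of this kind is commonly done with numpy.linspace. The sketch below assumes both endpoints are included, which is not confirmed here; the function name is hypothetical.

```python
import numpy as np


def even_samples_sketch(min_val: float, max_val: float, n_samples: int):
    """Return n_samples evenly spaced values, endpoints included."""
    return np.linspace(min_val, max_val, n_samples)


samples = even_samples_sketch(0.0, 3.0, 4)
print(samples)  # [0. 1. 2. 3.]
```

Samples like these could plausibly serve as the Gaussian kernel means (rbf_means) used to build bond features.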
- get_crystal_graph(elem_feats: numpy.ndarray, path_cif: str, numeric_feats: numpy.ndarray, rbf_means: numpy.ndarray, target: Optional[float] = None, gid: int = -1, radius: float = 3.0)¶
Convert the Crystallographic Information File (CIF) into the mathematical graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\), where \(\mathcal{V}\) is a set of atoms, \(\mathcal{E}\) is a set of bonds, \(X\) is an atom-feature matrix, and \(R\) is a bond-feature matrix.
- Parameters
elem_feats – (numpy.ndarray) The NumPy array of elemental features.
path_cif – (str) The path of the CIF file of the crystal structure.
numeric_feats – (numpy.ndarray) Numerical features of the crystal (e.g., density).
rbf_means – (numpy.ndarray) A feature-wise mean vector of the Gaussian kernel for radial basis function to generate bond features.
target – (float, optional) The target property that you want to predict from the crystal structure (default = None).
gid – (int, optional) An integer identifier of the crystal structure (default = -1).
radius – (float, optional) The maximum value of the radius to define neighbor atoms (default = 3.0).
- Returns
A crystal graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\).
- get_atom_info(crystal: pymatgen.core.structure.Structure, elem_feats: numpy.ndarray, numeric_feats: numpy.ndarray, radius: float)¶
Calculate an atom-feature matrix and an atom-coordination matrix from the given pymatgen.core.Structure object.
- Parameters
crystal – (pymatgen.core.Structure) The Pymatgen object of the crystal structure.
elem_feats – (numpy.ndarray) The NumPy array of elemental features.
numeric_feats – (numpy.ndarray) Numerical features of the crystal (e.g., density).
radius – (float) The maximum value of the radius to define neighbor atoms.
- Returns
The atom-feature and atom-coordination matrices.
- get_bond_info(atom_coord: numpy.ndarray, rbf_means: numpy.ndarray, radius: float)¶
Calculate bond information of the crystal structure. If the crystal structure is converted into an isolated graph, the function returns (None, None).
- Parameters
atom_coord – (numpy.ndarray) The XYZ coordinates of the atoms in the crystal structure.
rbf_means – (numpy.ndarray) A feature-wise mean vector of the Gaussian kernel for radial basis function to generate bond features.
radius – (float) The maximum value of the radius to define neighbor atoms.
- Returns
Indices of the bonds in the crystal structure and the bond-feature matrix.
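The radius-based bond search and the Gaussian bond features can be sketched as follows. This is a simplified illustration, not maica's implementation: it ignores periodic images of the unit cell, treats every atom pair within the radius as a bond, interprets an isolated graph as one without any bonds, and uses an assumed kernel form with a hypothetical beta default.

```python
import numpy as np


def bond_info_sketch(atom_coord, rbf_means, radius, beta=0.5):
    """Find atom pairs within radius and expand their distances with RBFs."""
    # Pairwise Euclidean distances, shape: (n_atoms, n_atoms).
    diff = atom_coord[:, None, :] - atom_coord[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Bonds are distinct atom pairs within the cutoff radius.
    mask = (dist <= radius) & ~np.eye(len(atom_coord), dtype=bool)
    src, dst = np.nonzero(mask)
    if src.size == 0:
        # No neighbors found: the graph has no bonds.
        return None, None
    bond_idx = np.stack([src, dst])  # shape: (2, n_bonds)
    # Assumed Gaussian kernel: exp(-(d - mu)^2 / beta^2).
    d = dist[src, dst][:, None]
    bond_feats = np.exp(-((d - rbf_means[None, :]) ** 2) / beta ** 2)
    return bond_idx, bond_feats


coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
means = np.linspace(0.0, 3.0, 4)
idx, feats = bond_info_sketch(coords, means, radius=3.0)
print(idx.shape, feats.shape)  # (2, 2) (2, 4)
```

Only the first two atoms lie within 3.0 of each other, so the sketch yields two directed bonds, each described by four kernel responses.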