maica.chem

The maica.chem module includes essential utilities and algorithms for chemical machine learning. It provides global variables containing chemical information, along with several functions that generate numerical representations of chemical data.

Global Variables

| Variable Name | Type | Description |
|---|---|---|
| atom_nums | dict | A dictionary of the atomic number for each atomic symbol. |
| atom_sysm | dict | A dictionary of the atomic symbol for each atomic number. |
| elem_feat_names | list | Basic elemental features from the Python Mendeleev package. |
| cat_hbd | list | Categories of hybridization of the atoms. |
| cat_fc | list | Categories of formal charge of the atoms. |
| cat_bond_types | list | Categories of bond types. |

Functions

| Function Name | Description |
|---|---|
| load_mendeleev_feats | Load basic elemental attributes from the Python Mendeleev package. |
| load_elem_feats | Load elemental attributes from user-defined elemental embeddings. If the elemental embedding is None, the basic elemental attributes are loaded. |
| form_to_vec | Convert the chemical formula into a feature vector based on the given elemental features (elem_feats). |
| parse_form | Convert the string of the chemical formula into a dictionary object of {element: ratio}. |
| get_mol_graph | Convert the SMILES representation into the molecular graph. |
| rbf | Apply a radial basis function to the given data x. |
| even_samples | Sample n_samples values evenly within [min_val, max_val]. |
| get_crystal_graph | Convert the CIF file of the crystal structure into the crystal graph. |
| get_atom_info | Extract atom-feature and atom-coordinate matrices from the given pymatgen.core.Structure object. |
| get_bond_info | Extract bond information from the given atom coordinates (atom_coord). |

Base Utilities

The maica.chem.base module includes basic information and utilities to process chemical data. This module is designed to provide elemental attributes or embeddings for converting unstructured chemical data into formal data formats, such as numpy.ndarray and torch.Tensor.

load_mendeleev_feats()

Load elemental features from the Python Mendeleev package. A total of eight elemental features are loaded: the attributes in elem_feat_names and the first ionization energy.

Returns

An array of the basic elemental features, arranged row-wise.

load_elem_feats(path_elem_embs: Optional[str] = None)

Load elemental features from the elemental embeddings in the given path_elem_embs. This function lets users apply customized elemental features in machine learning. The user-defined elemental features should be provided as a JSON file with the format {element: array of features}. If the argument path_elem_embs is None, the basic elemental features from the Python Mendeleev package are returned.

Parameters

path_elem_embs – (str, optional) Path of the JSON file including custom elemental features (default = None).

Returns

An array of the elemental features, arranged row-wise.
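As a rough illustration of the {element: array of features} JSON format described above, the following sketch reads such a file into a row-wise NumPy array. The helper name read_elem_embs_sketch, the symbols ordering argument, and the example embedding values are assumptions made for this illustration; they are not part of maica, and the row ordering actually used by load_elem_feats may differ.

```python
import json
import os
import tempfile

import numpy as np


def read_elem_embs_sketch(path_elem_embs, symbols):
    """Read a {element: [features]} JSON file into a row-wise array.

    Hypothetical helper: rows follow the order of `symbols`, which is an
    assumption of this sketch rather than documented maica behavior.
    """
    with open(path_elem_embs, "r") as f:
        embs = json.load(f)
    return np.array([embs[s] for s in symbols], dtype=float)


# Write a small example file with made-up two-dimensional embeddings.
example = {"H": [0.1, 0.2], "He": [0.3, 0.4], "Li": [0.5, 0.6]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "elem_embs.json")
    with open(path, "w") as f:
        json.dump(example, f)
    elem_feats = read_elem_embs_sketch(path, ["H", "He", "Li"])

print(elem_feats.shape)  # (3, 2)
```

Each row of the resulting array holds the feature vector of one element, matching the row-wise layout described in the Returns entry.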

Chemical Formula

A collection of functions to convert chemical formula strings into machine-readable numerical representations.

form_to_vec(form: str, elem_feats: numpy.ndarray)

Convert the given chemical formula into a feature vector. For a set of computed elemental features \(S = \{\mathbf{h} = f(e) | e \in c \}\), where \(c\) is the given chemical formula, the feature vector is calculated as the concatenation of a weighted sum with weights \(w_\mathbf{h}\), the standard deviation \(\sigma\), and the max operation:

\[\mathbf{x} = \left(\sum_{\mathbf{h} \in S} w_\mathbf{h} \mathbf{h}\right) \oplus \sigma(S) \oplus \max(S).\]

Note that the standard deviation \(\sigma\) and the max operations are applied feature-wise (not element-wise). This formula-to-vector conversion method is common in chemical machine learning [1, 2, 3].

Parameters
  • form – (str) Chemical formula (e.g., ZnIn2S4).

  • elem_feats – (numpy.ndarray) The NumPy array of elemental features.

Returns

A feature vector of the chemical formula.
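The concatenation in the equation above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the maica implementation: the function name form_to_vec_sketch is hypothetical, the formula is passed as an already parsed {element: ratio} dictionary, row i of elem_feats is assumed to belong to atomic number i + 1, and the weights \(w_\mathbf{h}\) are assumed to be the ratios normalized to sum to one. The feature values are made up.

```python
import numpy as np


def form_to_vec_sketch(ratios, elem_feats, atom_nums):
    """Concatenate weighted sum, feature-wise std, and feature-wise max.

    ratios     : {element symbol: ratio}, e.g. the output of a formula parser
    elem_feats : array whose row i holds the features of atomic number i + 1
                 (an assumed layout)
    atom_nums  : {element symbol: atomic number}
    """
    symbols = list(ratios.keys())
    # S: one feature vector h per element in the formula.
    feats = np.stack([elem_feats[atom_nums[s] - 1] for s in symbols])
    # Assumed weights w_h: ratios normalized to sum to one.
    w = np.array([ratios[s] for s in symbols], dtype=float)
    w = w / w.sum()
    weighted_sum = (w[:, None] * feats).sum(axis=0)
    # axis=0 applies std and max per feature column, across the elements.
    return np.concatenate([weighted_sum, feats.std(axis=0), feats.max(axis=0)])


# Toy data: three elements with two made-up features each.
toy_atom_nums = {"H": 1, "He": 2, "Li": 3}
toy_feats = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 0.0]])
x = form_to_vec_sketch({"Li": 1.0, "H": 1.0}, toy_feats, toy_atom_nums)
print(x.shape)  # (6,)
```

With d elemental features, the output has 3d entries: d for the weighted sum, d for the standard deviation, and d for the maximum.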

parse_form(form: str)

Parse the chemical formula to an element-wise dictionary. For example, ZnIn2S4 is converted into a dictionary of {Zn: 1.0, In: 2.0, S: 4.0}.

Parameters

form – (str) Chemical formula (e.g., ZnIn2S4).

Returns

An element-wise dictionary of the chemical formula.
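The parsing step described above can be approximated with a regular expression. The function parse_form_sketch below is a hypothetical stand-in written for illustration, not the maica implementation; it assumes flat formulas without parentheses or hydrate notation.

```python
import re


def parse_form_sketch(form):
    """Parse a flat formula such as 'ZnIn2S4' into {element: ratio}.

    Limitation of this sketch: parentheses and hydrates are not handled.
    """
    ratios = {}
    # An element is an uppercase letter plus optional lowercase letters,
    # followed by an optional integer or decimal count.
    pattern = r"([A-Z][a-z]*)(\d+(?:\.\d+)?)?"
    for symbol, count in re.findall(pattern, form):
        # A missing count means the element appears once.
        ratios[symbol] = ratios.get(symbol, 0.0) + (float(count) if count else 1.0)
    return ratios


print(parse_form_sketch("ZnIn2S4"))  # {'Zn': 1.0, 'In': 2.0, 'S': 4.0}
```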

Molecular Structure

This module provides abstracted functions to convert molecular structures into mathematical graphs. The converted molecular structure is stored as a torch_geometric.data.Data object.

get_mol_graph(elem_feats: numpy.ndarray, smiles: str, numeric_feats: numpy.ndarray, target: Optional[float] = None, gid: int = - 1)

Convert the SMILES representation of a molecular structure into the mathematical graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\), where \(\mathcal{V}\) is a set of atoms, \(\mathcal{E}\) is a set of bonds, \(X\) is an atom-feature matrix, and \(R\) is a bond-feature matrix.

Parameters
  • elem_feats – (numpy.ndarray) The NumPy array of elemental features.

  • smiles – (str) A SMILES representation of the input molecular structure.

  • numeric_feats – (numpy.ndarray) Numerical features of the molecule (e.g., molecular weight).

  • target – (float, optional) The target property that you want to predict from the molecular structure (default = None).

  • gid – (int, optional) An integer identifier of the molecular structure (default = -1).

Returns

A molecular graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\).

Crystal Structure

This module provides abstracted functions to convert crystal structures into mathematical graphs. Like the molecular structure, the converted crystal structure is stored as a torch_geometric.data.Data object.

rbf(x: numpy.ndarray, mu: numpy.ndarray, beta: float)

Apply a radial basis function to the given data x with the mean vector mu and a constant beta. A Gaussian kernel is used to implement this radial basis function.

Parameters
  • x – (numpy.ndarray) Input data of the radial basis function.

  • mu – (numpy.ndarray) Feature-wise mean vector of the Gaussian kernel.

  • beta – (float) A shape parameter of the radial basis function.

Returns

Converted data of the input x.
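A Gaussian radial basis expansion of this kind can be sketched as below. The exact functional form used by maica is not stated here, so the expression \(\exp(-\beta (x - \mu)^2)\) is an assumption of this sketch (one common form of the Gaussian kernel), and the name rbf_sketch is hypothetical.

```python
import numpy as np


def rbf_sketch(x, mu, beta):
    """Gaussian radial basis expansion of x around the centers mu.

    x    : array of shape (n, 1), e.g. interatomic distances
    mu   : array of shape (k,), the centers of the Gaussian kernels
    beta : shape parameter; larger values give narrower kernels here
    Returns an array of shape (n, k) through NumPy broadcasting.
    """
    return np.exp(-beta * (x - mu) ** 2)


dists = np.array([[1.0], [2.5]])
centers = np.array([0.0, 1.0, 2.0, 3.0])
feats = rbf_sketch(dists, centers, beta=0.5)
print(feats.shape)  # (2, 4)
```

Each scalar input is expanded into one value per kernel center, which peaks at 1 when the input coincides with that center.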

even_samples(min_val: float, max_val: float, n_samples: int)

Sample n_samples values evenly within [min_val, max_val].

Parameters
  • min_val – (float) The minimum value of the sample range.

  • max_val – (float) The maximum value of the sample range.

  • n_samples – (int) The number of samples.

Returns

An array of the samples.
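Even sampling over a closed interval can be sketched with numpy.linspace. Whether maica includes both endpoints is not stated here, so the endpoint handling below is an assumption, and even_samples_sketch is a hypothetical name.

```python
import numpy as np


def even_samples_sketch(min_val, max_val, n_samples):
    """Return n_samples evenly spaced values in [min_val, max_val].

    Both endpoints are included here, which is an assumption of this sketch.
    """
    return np.linspace(min_val, max_val, n_samples)


rbf_means = even_samples_sketch(0.0, 3.0, 4)
print(rbf_means)  # [0. 1. 2. 3.]
```

Samples of this kind are a natural choice for the rbf_means argument of the crystal-graph functions, with max_val set to the neighbor radius.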

get_crystal_graph(elem_feats: numpy.ndarray, path_cif: str, numeric_feats: numpy.ndarray, rbf_means: numpy.ndarray, target: Optional[float] = None, gid: int = - 1, radius: float = 3.0)

Convert the crystallographic information file (CIF) into the mathematical graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\), where \(\mathcal{V}\) is a set of atoms, \(\mathcal{E}\) is a set of bonds, \(X\) is an atom-feature matrix, and \(R\) is a bond-feature matrix.

Parameters
  • elem_feats – (numpy.ndarray) The NumPy array of elemental features.

  • path_cif – (str) The path of the CIF file of the crystal structure.

  • numeric_feats – (numpy.ndarray) Numerical features of the crystal (e.g., density).

  • rbf_means – (numpy.ndarray) A feature-wise mean vector of the Gaussian kernel for the radial basis function that generates bond features.

  • target – (float, optional) The target property that you want to predict from the crystal structure (default = None).

  • gid – (int, optional) An integer identifier of the crystal structure (default = -1).

  • radius – (float, optional) The maximum value of the radius to define neighbor atoms (default = 3.0).

Returns

A crystal graph \(G = (\mathcal{V}, \mathcal{E}, X, R)\).

get_atom_info(crystal: pymatgen.core.structure.Structure, elem_feats: numpy.ndarray, numeric_feats: numpy.ndarray, radius: float)

Calculate an atom-feature matrix and an atom-coordinate matrix from the given pymatgen.core.Structure object.

Parameters
  • crystal – (pymatgen.core.Structure) The Pymatgen object of the crystal structure.

  • elem_feats – (numpy.ndarray) The NumPy array of elemental features.

  • numeric_feats – (numpy.ndarray) Numerical features of the crystal (e.g., density).

  • radius – (float) The maximum value of the radius to define neighbor atoms.

Returns

The atom-feature and atom-coordinate matrices.

get_bond_info(atom_coord: numpy.ndarray, rbf_means: numpy.ndarray, radius: float)

Calculate bond information of the crystal structure. If the crystal structure is converted into an isolated graph, (None, None) is returned.

Parameters
  • atom_coord – (numpy.ndarray) The XYZ coordinates of the atoms in the crystal structure.

  • rbf_means – (numpy.ndarray) A feature-wise mean vector of the Gaussian kernel for the radial basis function that generates bond features.

  • radius – (float) The maximum value of the radius to define neighbor atoms.

Returns

Indices of the bonds in the crystal structure and the bond-feature matrix.
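The idea of a radius-based bond search followed by an RBF expansion of the distances can be sketched as follows. This is a simplified illustration rather than the maica implementation: periodic images of the crystal are ignored, "isolated graph" is interpreted as a graph without any bonds, the Gaussian form and the extra beta argument are assumptions, and get_bond_info_sketch is a hypothetical name.

```python
import numpy as np


def get_bond_info_sketch(atom_coord, rbf_means, radius, beta=0.5):
    """Find atom pairs within `radius` and expand their distances with RBFs.

    Simplification: periodic images are ignored, so this is a plain
    cutoff search over the given coordinates. `beta` is a hypothetical
    extra argument of this sketch.
    """
    # Pairwise Euclidean distances, shape (n_atoms, n_atoms).
    diff = atom_coord[:, None, :] - atom_coord[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Keep pairs inside the cutoff, excluding self-pairs.
    mask = (dists <= radius) & ~np.eye(len(atom_coord), dtype=bool)
    src, dst = np.nonzero(mask)
    if src.size == 0:
        # No neighbors at all: treated here as the isolated-graph case.
        return None, None
    bonds = np.stack([src, dst], axis=1)
    # Assumed Gaussian expansion of each bond length around rbf_means.
    bond_feats = np.exp(-beta * (dists[src, dst][:, None] - rbf_means) ** 2)
    return bonds, bond_feats


coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
means = np.linspace(0.0, 3.0, 4)
bonds, bond_feats = get_bond_info_sketch(coords, means, radius=3.0)
print(bonds.tolist())    # [[0, 1], [1, 0]]
print(bond_feats.shape)  # (2, 4)
```

Only the first two atoms lie within the cutoff of each other, so the sketch returns one bond in each direction and one row of RBF features per bond.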