maica.util

The maica.util module provides utility functions for preprocessing and visualization. They can be used to convert a user dataset into a form suitable for machine learning and to analyze the prediction results of machine learning models. In addition to these supplementary functions, the module provides meta-heuristic algorithms for optimizing the inputs of mathematical functions and machine learning models.

Functions

Function Name	Description
impute	Fill empty values in the given dataset.
get_split_idx	Randomly generate two subsets of the original indices. This function is useful for splitting a dataset into training and test datasets.
get_one_hot_feat	Generate one-hot encoding of the categorical features.
get_target_dist	Get a histogram of the target values. Lists of the bins and the labels of the histogram are returned for the given target values.
get_error_dist	Compute mean of the prediction errors for each range of target values.
plot_target_dist	Plot a histogram of the target values for a given dataset.
plot_error_dist	Plot a distribution of prediction errors for each range of target values.
plot_pred_result	Draw a scatter plot of target and predicted values.
plot_embeddings	Draw a 2-dimensional scatter plot of data embeddings.
optimize	Find the optimal input of a given objective function.
run_ml_model	A wrapper function to run machine learning algorithms for meta-heuristic based input optimization.

Preprocessing

The maica.util.preprocessing module includes useful preprocessing methods for the datasets. This module allows you to convert a user dataset into a new dataset suitable for machine learning.

impute(data: numpy.ndarray, method: str)

Fill empty values in the given data. This is useful when training machine learning models on experimental data that contain missing values. Three imputation methods are provided:

Mean-based imputation method.
Zero-based imputation method.
KNN-based imputation method.

The valid values of method, which select the imputation method, are defined in maica.core.env.

Parameters

data – (numpy.ndarray) Data object.
method – (str) A key of the imputation method.

Returns

Data object.
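The mean-based and zero-based methods can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the helper name impute_columns, the method keys "mean" and "zero", the use of nested lists instead of numpy.ndarray, and the column-wise definition of the mean are all assumptions made for illustration. The actual keys are those defined in maica.core.env.

```python
import math


def impute_columns(rows, method="mean"):
    """Fill NaN entries column by column (hypothetical helper).

    method: "mean" replaces NaN by the mean of the observed column values,
            "zero" replaces NaN by 0.0. Both keys are illustrative only.
    """
    if method not in ("mean", "zero"):
        raise ValueError(f"unknown imputation method: {method!r}")

    n_cols = len(rows[0])
    fill_values = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if method == "mean" and observed:
            fill_values.append(sum(observed) / len(observed))
        else:
            # "zero" method, or a column without any observed value.
            fill_values.append(0.0)

    return [
        [fill_values[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
data = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]
filled_mean = impute_columns(data, "mean")  # NaN -> column means 2.0 and 6.0
filled_zero = impute_columns(data, "zero")  # NaN -> 0.0
```

KNN-based imputation is omitted here because it needs a distance computation over incomplete rows, which goes beyond a short sketch.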

get_split_idx(n_data: int, ratio: float)

Get indices that randomly split the set {0, 1, …, n_data} into two subsets. The sizes of the two subsets are in the proportion ratio : (1 - ratio).

Parameters

n_data – (int) The number of data points in your data object.
ratio – (float) The ratio for division.

Returns

Data indices of the two subsets.
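A random index split of this kind can be sketched as a shuffle followed by a cut. The helper split_indices and its seed parameter are hypothetical, and the sketch assumes zero-based indices 0 to n_data - 1 with the first subset receiving int(ratio * n_data) entries. The library may round or index differently.

```python
import random


def split_indices(n_data, ratio, seed=None):
    """Shuffle the indices 0..n_data-1 and cut them at int(ratio * n_data)."""
    indices = list(range(n_data))
    random.Random(seed).shuffle(indices)  # local generator, global state untouched
    n_first = int(ratio * n_data)
    return indices[:n_first], indices[n_first:]


# 80 % of the indices for training, the remaining 20 % for testing.
idx_train, idx_test = split_indices(n_data=10, ratio=0.8, seed=0)
```

The two index lists are disjoint and together cover every data point exactly once, so they can be used directly to slice a dataset into training and test parts.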

get_one_hot_feat(categories: list, hot_category: object)

Generate the one-hot encoding of hot_category with respect to the list of categories given in categories.

Parameters

categories – (list) A list of categories for the one-hot encoding.
hot_category – (object) The category that is set to one in the one-hot encoding.

Returns

The one-hot encoding feature.
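As a minimal sketch of what such an encoding looks like, the hypothetical helper one_hot below returns a list with a 1 at the position of hot_category and 0 elsewhere. The return type and the error raised for an unknown category are assumptions, not documented behavior of get_one_hot_feat.

```python
def one_hot(categories, hot_category):
    """Return a 0/1 list marking the position of hot_category in categories."""
    if hot_category not in categories:
        raise ValueError(f"unknown category: {hot_category!r}")
    return [1 if c == hot_category else 0 for c in categories]


crystal_systems = ["cubic", "hexagonal", "tetragonal"]
encoding = one_hot(crystal_systems, "hexagonal")  # [0, 1, 0]
```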

Analysis

The maica.util.analysis module provides useful functions to analyze prediction results of the machine learning algorithms.

get_target_dist(targets: numpy.ndarray, n_bins: int = 10)

Get a histogram of the target values. It returns bins and labels of the histogram. This function is useful to analyze the prediction results based on the number of targets for each range.

Parameters

targets – (numpy.ndarray) Target values to convert into the histogram.
n_bins – (int) The number of bins (default = 10).

Returns

Lists of the bins and the labels of the histogram.
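The following sketch shows one way to build such a histogram. It assumes that "bins" means the number of targets per bin, that the bins have equal width between the minimum and maximum target, and that labels are strings describing each interval. None of these details is confirmed by the description above, and the helper target_histogram is hypothetical.

```python
def target_histogram(targets, n_bins=10):
    """Count targets in n_bins equal-width bins; return (counts, labels)."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all targets are equal

    counts = [0] * n_bins
    for y in targets:
        # The maximum target is assigned to the last bin.
        k = min(int((y - lo) / width), n_bins - 1)
        counts[k] += 1

    # Each label shows the interval covered by the bin.
    labels = [
        f"[{lo + k * width:.2f}, {lo + (k + 1) * width:.2f})"
        for k in range(n_bins)
    ]
    return counts, labels


counts, labels = target_histogram([0.0, 1.0, 2.0, 3.0, 4.0, 10.0], n_bins=2)
```

With these six targets and two bins, five values fall into the lower half of the range and one into the upper half, which makes an imbalanced target distribution visible at a glance.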

get_error_dist(targets: numpy.ndarray, preds: numpy.ndarray, n_ranges: int = 10)

Compute the mean of the prediction errors for each range of the target values. The prediction error is measured by the mean absolute error (MAE). This function is useful for analyzing the prediction accuracy in each range of the target values. The error of the \(k^{th}\) range is calculated by:

\[error_k = \frac{1}{|S_k|} \sum_{i \in S_k} |y_i - y_i^{'}|,\]

where \(S_k\) is the set of indices of the data in the \(k^{th}\) range, \(y_i\) is the target value of the input data \(\mathbf{x}_i\), and \(y_i^{'}\) is the predicted value for \(\mathbf{x}_i\).

Parameters

targets – (numpy.ndarray) Target values of the prediction.
preds – (numpy.ndarray) Prediction results.
n_ranges – (int) The number of sections to split the range of the target values (default = 10).

Returns

Mean of errors and labels for each target range.
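The per-range error formula can be traced with a small sketch. It assumes equal-width ranges between the minimum and maximum target and reports None for empty ranges. Both choices, as well as the helper name error_by_range, are illustrative assumptions rather than the library's documented behavior.

```python
def error_by_range(targets, preds, n_ranges=10):
    """Return (mean absolute error per target range, range labels)."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_ranges or 1.0  # fall back to 1.0 if all targets are equal

    # abs_errors[k] collects |y_i - y_i'| for all i in S_k.
    abs_errors = [[] for _ in range(n_ranges)]
    for y, y_pred in zip(targets, preds):
        k = min(int((y - lo) / width), n_ranges - 1)
        abs_errors[k].append(abs(y - y_pred))

    # error_k = (1 / |S_k|) * sum over S_k; None marks an empty range.
    errors = [sum(e) / len(e) if e else None for e in abs_errors]
    labels = [
        f"[{lo + k * width:.2f}, {lo + (k + 1) * width:.2f})"
        for k in range(n_ranges)
    ]
    return errors, labels


targets = [0.0, 1.0, 2.0, 8.0, 10.0]
preds = [0.5, 1.5, 2.0, 6.0, 11.0]
errors, labels = error_by_range(targets, preds, n_ranges=2)
```

Here the lower range contains three points with absolute errors 0.5, 0.5, and 0, and the upper range contains two points with absolute errors 2 and 1, so the model in this toy example is less accurate for large target values.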

Visualization

The maica.util.visualization module provides visualization tools for dataset analysis and model evaluation. The module is built on matplotlib and provides a unified interface.

plot_target_dist(fig_name: str, dataset: maica.data.base.Dataset, n_bins: int = 10, font_size: int = 16)

Plot a histogram of the target values of dataset. The histogram is saved as an image file in fig_name.

Parameters

fig_name – (str) A name of the image file storing the histogram of the target values.
dataset – (maica.data.base.Dataset) The dataset object containing the target values.
n_bins – (int, optional) The number of bins in the histogram (default = 10).
font_size – (int, optional) The size of text in the generated figure (default = 16).

plot_error_dist(fig_name: str, dataset: maica.data.base.Dataset, preds: numpy.ndarray, font_size: int = 16)

Plot a distribution of prediction errors. The prediction errors for each target range are calculated based on mean absolute error as:

\[error_i = \frac{1}{N_i}\sum_{j=1}^{N_i}|y_j - y_j^{'}|,\]

where \(N_i\) is the number of data points in the \(i^{th}\) target range, \(y_j\) is the target value of the \(j^{th}\) data in the \(i^{th}\) target range, and \(y_{j}^{'}\) is the predicted value of the \(j^{th}\) data.

Parameters

fig_name – (str) A name of the image file storing the error distribution.
dataset – (maica.data.base.Dataset) The dataset object containing the target values.
preds – (numpy.ndarray) Prediction results.
font_size – (int, optional) The size of text in the generated figure (default = 16).

plot_pred_result(fig_name: str, dataset: object, preds: numpy.ndarray, font_size: int = 16, min_val: Optional[float] = None, max_val: Optional[float] = None)

Draw a scatter plot of the prediction results in preds. The X axis shows the true target values in dataset, and the Y axis shows the predicted values in preds.

Parameters

fig_name – (str) A name of the image file storing the scatter plot.
dataset – (object) The dataset object containing the target values.
preds – (numpy.ndarray) Prediction results.
font_size – (int, optional) The size of text in the generated figure (default = 16).
min_val – (float, optional) The minimum value of the X and Y axes (default = None).
max_val – (float, optional) The maximum value of the X and Y axes (default = None).

plot_embeddings(fig_name: str, data: numpy.ndarray, labels: numpy.ndarray, font_size: int = 16, emb_method: str = 'tsne')

Draw a scatter plot of the embeddings of data computed by emb_method. This function generates a 2-dimensional scatter plot from the first and second components of the embedding. The available embedding methods are given in maica.core.env.

Parameters

fig_name – (str) A name of the image file storing the scatter plot.
data – (numpy.ndarray) The original data to be embedded.
labels – (numpy.ndarray) Numerical labels for the data.
font_size – (int, optional) The size of text in the generated figure (default = 16).
emb_method – (str, optional) Embedding method (default = env.EMB_TSNE).

Optimization

The maica.util.optimization module supports function optimization. It includes various meta-heuristic algorithms and a wrapper function to execute them. This module can be used to optimize the inputs of user-defined functions or machine learning models.

optimize(obj_func: object, opt_alg: str, opt_type: str, **kwargs)

Optimize the given user-defined function obj_func using the meta-heuristic algorithm opt_alg in ARTCAI. The objective function can be any input-output mapping, regardless of its differentiability.

Parameters

obj_func – (object) The objective function to optimize.
opt_alg – (str) A name of the meta-heuristic algorithm to optimize the objective function.
opt_type – (str) Specifies whether the objective function is minimized or maximized.
kwargs – (optional) Hyperparameters of the meta-heuristic algorithm.

Returns

Optimal input of the objective function and its score in the meta-heuristic algorithm.
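To illustrate derivative-free input optimization, the sketch below uses plain random search as a simple stand-in. It is not one of the library's meta-heuristic algorithms, and the function random_search, its bounds, n_iter, and seed parameters, and the opt_type values "min" and "max" are all hypothetical. Like optimize, it returns the best input found together with its score and only needs to evaluate the objective function, never its gradient.

```python
import random


def random_search(obj_func, bounds, opt_type="min", n_iter=2000, seed=0):
    """Sample inputs uniformly within bounds and keep the best-scoring one."""
    if opt_type not in ("min", "max"):
        raise ValueError(f"unknown optimization type: {opt_type!r}")

    rng = random.Random(seed)
    sign = 1.0 if opt_type == "min" else -1.0  # maximizing f equals minimizing -f

    best_x, best_score = None, float("inf")
    for _ in range(n_iter):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        score = sign * obj_func(x)
        if score < best_score:
            best_x, best_score = x, score

    # Undo the sign flip so the returned score is the raw objective value.
    return best_x, sign * best_score


def sphere(x):
    """Toy objective with its minimum value 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_val = random_search(sphere, bounds=[(-5.0, 5.0)] * 2, opt_type="min")
```

Population-based meta-heuristics refine this idea by reusing information from earlier samples instead of drawing each candidate independently.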

run_ml_model(model: maica.ml.base.Model, input_data: numpy.ndarray)

Run the given machine learning model model to generate its outputs. This is a wrapper function that allows a machine learning model to serve as the objective function of a meta-heuristic algorithm.

Parameters

model – (maica.ml.base.Model) A machine learning model that makes up the objective function.
input_data – (numpy.ndarray) Input data of the given machine learning model.

Returns

Output data of the machine learning model.
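The wrapper pattern can be sketched without the library. A function of (model, input_data) becomes an objective function of input_data alone once the model is bound to it. The class LinearToyModel, its predict method, and the helper run_model are hypothetical stand-ins, and whether maica.ml.base.Model exposes a predict method is an assumption.

```python
from functools import partial


class LinearToyModel:
    """Hypothetical stand-in for a trained model with a predict method."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, inputs):
        # One scalar output per input row: w . x + b
        return [
            sum(w * v for w, v in zip(self.weights, row)) + self.bias
            for row in inputs
        ]


def run_model(model, input_data):
    """Evaluate the model on a single candidate input and return a scalar."""
    return model.predict([input_data])[0]


model = LinearToyModel(weights=[2.0, -1.0], bias=0.5)
objective = partial(run_model, model)  # now a function of input_data only
score = objective([1.0, 3.0])          # 2.0 * 1.0 - 1.0 * 3.0 + 0.5
```

An objective built this way can be passed to any optimizer that only requires function evaluations.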