maica.util

The maica.util module provides utility functions for preprocessing, analysis, and visualization. They can be used to convert a user dataset into a form suitable for machine learning and to analyze the prediction results of machine learning models. In addition, the module provides meta-heuristic algorithms for optimizing the input data of mathematical functions and machine learning models.

Functions

Function Name – Description

  • impute – Fill empty values in the given dataset.

  • get_split_idx – Randomly generate two subsets of the original indices. This function is useful for splitting a dataset into training and test datasets.

  • get_one_hot_feat – Generate a one-hot encoding of the categorical features.

  • get_target_dist – Get a histogram of the target values. Lists of the bins and the labels of the histogram are returned for the given target values.

  • get_error_dist – Compute the mean of the prediction errors for each range of target values.

  • plot_target_dist – Plot a histogram of the target values for a given dataset.

  • plot_error_dist – Plot a distribution of prediction errors for each range of target values.

  • plot_pred_result – Draw a scatter plot of target and predicted values.

  • plot_embeddings – Draw a 2-dimensional scatter plot of data embeddings.

  • optimize – Find the optimal input of a given objective function.

  • run_ml_model – A wrapper function to run machine learning algorithms for meta-heuristic based input optimization.

Preprocessing

The maica.util.preprocessing module includes useful preprocessing methods for the datasets. This module allows you to convert a user dataset into a new dataset suitable for machine learning.

impute(data: numpy.ndarray, method: str)

Fill empty values in the given data. This is useful when applying machine learning to experimental data that contains missing values. Three imputation methods are provided:

  • Mean-based imputation method.

  • Zero-based imputation method.

  • KNN-based imputation method.

The valid values of method, which select the imputation method, are defined in maica.core.env.

Parameters
  • data – (numpy.ndarray) Data object.

  • method – (str) A key of the imputation method.

Returns

Data object.
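
The mean-based and zero-based strategies can be illustrated with a minimal, dependency-free sketch. It is not the maica implementation: the function name impute_sketch and the method keys "mean" and "zero" are hypothetical (the real keys live in maica.core.env), missing values are assumed to be NaN, the mean is assumed to be taken per column, and plain nested lists stand in for numpy.ndarray.

```python
import math


def impute_sketch(data, method):
    """Fill NaN entries column by column.

    `method` is "mean" or "zero"; both keys are hypothetical placeholders.
    """
    n_cols = len(data[0])
    fill_values = []
    for j in range(n_cols):
        if method == "zero":
            fill_values.append(0.0)
        elif method == "mean":
            observed = [row[j] for row in data if not math.isnan(row[j])]
            # Fall back to 0.0 when a column has no observed value at all.
            fill_values.append(sum(observed) / len(observed) if observed else 0.0)
        else:
            raise ValueError(f"unknown imputation method: {method}")
    # Build a new table; the input is left unchanged.
    return [
        [fill_values[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in data
    ]


data = [[1.0, math.nan], [3.0, 4.0], [math.nan, 8.0]]
filled_mean = impute_sketch(data, "mean")
filled_zero = impute_sketch(data, "zero")
```

KNN-based imputation is omitted here because it needs a neighbor search over the rows, which is beyond a short sketch.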

get_split_idx(n_data: int, ratio: float)

Get indices that randomly split the index set {0, 1, …, n_data} into two subsets. The sizes of the two subsets are in the ratio ratio : (1 - ratio).

Parameters
  • n_data – (int) The number of data points in your data object.

  • ratio – (float) The ratio for division.

Returns

Data indices of the two subsets.
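
A random split of this kind can be sketched as a shuffle followed by a cut. The sketch below is illustrative only: split_idx_sketch and its seed parameter are hypothetical, and it assumes zero-based indices from 0 to n_data - 1 and a first subset of size int(ratio * n_data).

```python
import random


def split_idx_sketch(n_data, ratio, seed=None):
    """Return two disjoint index lists with sizes in the ratio ratio : (1 - ratio)."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    indices = list(range(n_data))
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_first = int(ratio * n_data)
    return indices[:n_first], indices[n_first:]


idx_train, idx_test = split_idx_sketch(n_data=10, ratio=0.8, seed=0)
```

The two index lists can then be used to select the training and test rows of a dataset.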

get_one_hot_feat(categories: list, hot_category: object)

Generate a one-hot encoding of hot_category over the categories listed in categories.

Parameters
  • categories – (list) A list of categories for the one-hot encoding.

  • hot_category – (object) The category that is marked as active (hot) in the one-hot encoding.

Returns

The one-hot encoding feature.
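
As a minimal illustration of one-hot encoding, the sketch below returns a vector with a single 1.0 at the position of hot_category. The function name one_hot_sketch, the float element type, and the error raised for an unknown category are assumptions, not documented maica behavior.

```python
def one_hot_sketch(categories, hot_category):
    """Return a list with 1.0 at the position of hot_category and 0.0 elsewhere."""
    if hot_category not in categories:
        raise ValueError(f"{hot_category!r} is not one of the known categories")
    return [1.0 if category == hot_category else 0.0 for category in categories]


encoding = one_hot_sketch(["Fe", "Co", "Ni"], "Co")
```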

Analysis

The maica.util.analysis module provides useful functions to analyze prediction results of the machine learning algorithms.

get_target_dist(targets: numpy.ndarray, n_bins: int = 10)

Get a histogram of the target values. It returns bins and labels of the histogram. This function is useful to analyze the prediction results based on the number of targets for each range.

Parameters
  • targets – (numpy.ndarray) Target values to convert into a histogram.

  • n_bins – (int) The number of bins (default = 10).

Returns

Lists of the bins and the labels of the histogram.
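
One way to build such a histogram is sketched below. The details are assumptions: equal-width bins between the minimum and maximum target, per-bin counts as the "bins", and text labels of the form "left-right". The name target_dist_sketch is hypothetical, and plain lists stand in for numpy.ndarray.

```python
def target_dist_sketch(targets, n_bins=10):
    """Count targets in n_bins equal-width bins and build a label per bin."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for target in targets:
        # The maximum value is clamped into the last bin.
        k = min(int((target - lo) / width), n_bins - 1) if width > 0 else 0
        counts[k] += 1
    labels = [
        f"{lo + k * width:.2f}-{lo + (k + 1) * width:.2f}" for k in range(n_bins)
    ]
    return counts, labels


targets = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
counts, labels = target_dist_sketch(targets, n_bins=4)
```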

get_error_dist(targets: numpy.ndarray, preds: numpy.ndarray, n_ranges: int = 10)

Compute the mean of the prediction errors for each range of the target values. The prediction error is measured by the mean absolute error (MAE). This function is useful for analyzing the prediction accuracy in each range of the target values. The error of the \(k^{th}\) range is calculated by:

\[error_k = \frac{1}{|S_k|} \sum_{i \in S_k} |y_i - y_i^{'}|,\]

where \(S_k\) is the set of indices of the data in the \(k^{th}\) range, \(y_i\) is the target value of the input data \(\mathbf{x}_i\), and \(y_i^{'}\) is its predicted value.

Parameters
  • targets – (numpy.ndarray) Target values of the prediction.

  • preds – (numpy.ndarray) Prediction results.

  • n_ranges – (int) The number of sections to split the range of the target values (default = 10).

Returns

Mean of errors and labels for each target range.
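
The per-range MAE defined above can be sketched as follows. This is an illustration, not the maica code: error_dist_sketch is a hypothetical name, the ranges are assumed to be equal-width intervals between the minimum and maximum target, and a range that contains no data is reported as None.

```python
def error_dist_sketch(targets, preds, n_ranges=10):
    """Return the mean absolute error and a text label for each target range."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_ranges
    abs_errors = [[] for _ in range(n_ranges)]
    for y, y_pred in zip(targets, preds):
        # Assign the data point to range k by its target value (the set S_k).
        k = min(int((y - lo) / width), n_ranges - 1) if width > 0 else 0
        abs_errors[k].append(abs(y - y_pred))
    # error_k = (1 / |S_k|) * sum of |y_i - y_i'| over S_k; None if S_k is empty.
    errors = [sum(e) / len(e) if e else None for e in abs_errors]
    labels = [
        f"{lo + k * width:.2f}-{lo + (k + 1) * width:.2f}" for k in range(n_ranges)
    ]
    return errors, labels


targets = [0.0, 1.0, 2.0, 3.0, 4.0]
preds = [0.5, 1.0, 2.5, 2.0, 6.0]
errors, labels = error_dist_sketch(targets, preds, n_ranges=2)
```

In this example the first range holds the targets 0.0 and 1.0, and the second range holds 2.0, 3.0, and 4.0.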

Visualization

The maica.util.visualization module provides visualization tools for dataset analysis and model evaluation. It is implemented on top of matplotlib to provide a unified interface.

plot_target_dist(fig_name: str, dataset: maica.data.base.Dataset, n_bins: int = 10, font_size: int = 16)

Plot a histogram of the target values of dataset. The histogram is saved to the image file fig_name.

Parameters
  • fig_name – (str) A name of the image file storing the histogram of the target values.

  • dataset – (maica.data.base.Dataset) The dataset object containing the target values.

  • n_bins – (int, optional) The number of bins in the histogram (default = 10).

  • font_size – (int, optional) The size of text in the generated figure (default = 16).

plot_error_dist(fig_name: str, dataset: maica.data.base.Dataset, preds: numpy.ndarray, font_size: int = 16)

Plot a distribution of prediction errors. The prediction error of the \(i^{th}\) target range is the mean absolute error:

\[error_i = \frac{1}{N_i}\sum_{j=1}^{N_i}|y_j - y_j^{'}|,\]

where \(N_i\) is the number of data points in the \(i^{th}\) target range, \(y_j\) is the target value of the \(j^{th}\) data point in the \(i^{th}\) target range, and \(y_{j}^{'}\) is the predicted value of that data point.

Parameters
  • fig_name – (str) A name of the image file storing the error distribution.

  • dataset – (maica.data.base.Dataset) The dataset object containing the target values.

  • preds – (numpy.ndarray) Prediction results.

  • font_size – (int, optional) The size of text in the generated figure (default = 16).

plot_pred_result(fig_name: str, dataset: object, preds: numpy.ndarray, font_size: int = 16, min_val: Optional[float] = None, max_val: Optional[float] = None)

Draw a scatter plot of the prediction results in preds. The X and Y axes are drawn by the true target values in dataset and the predicted values in preds, respectively.

Parameters
  • fig_name – (str) A name of the image file storing the scatter plot.

  • dataset – (object) The dataset object containing the target values.

  • preds – (numpy.ndarray) Prediction results.

  • font_size – (int, optional) The size of text in the generated figure (default = 16).

  • min_val – (float, optional) The minimum value of the X and Y axes (default = None).

  • max_val – (float, optional) The maximum value of the X and Y axes (default = None).

plot_embeddings(fig_name: str, data: numpy.ndarray, labels: numpy.ndarray, font_size: int = 16, emb_method: str = 'tsne')

Draw a scatter plot of the embeddings computed from data by emb_method. This function generates a 2-dimensional scatter plot from the first and second components of the embedding. The available embedding methods are given in maica.core.env.

Parameters
  • fig_name – (str) A name of the image file storing the scatter plot.

  • data – (numpy.ndarray) The original data to be embedded.

  • labels – (numpy.ndarray) Numerical labels for the data.

  • font_size – (int, optional) The size of text in the generated figure (default = 16).

  • emb_method – (str, optional) Embedding method (default = env.EMB_TSNE).

Optimization

The maica.util.optimization module supports function optimization. It includes various meta-heuristic algorithms and a wrapper function to execute them. This module can be used to optimize the input data of user-defined functions or machine learning models.

optimize(obj_func: object, opt_alg: str, opt_type: str, **kwargs)

Optimize the given user-defined function obj_func using the meta-heuristic algorithm opt_alg in ARTCAI. The objective function can be any input-output mapping, whether or not it is differentiable.

Parameters
  • obj_func – (object) The objective function to optimize.

  • opt_alg – (str) A name of the meta-heuristic algorithm to optimize the objective function.

  • opt_type – (str) Specifies whether the objective function is to be minimized or maximized.

  • kwargs – (optional) Hyperparameters of the meta-heuristic algorithm.

Returns

Optimal input of the objective function and its score in the meta-heuristic algorithm.
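
To show why a meta-heuristic does not need a differentiable objective, here is a minimal sketch of a (1+1) evolution strategy. It is a stand-in technique and not necessarily one of the algorithms shipped with maica. The name optimize_sketch, the opt_type keys "min" and "max", and the parameters x_init, step, n_iters, and seed are all hypothetical.

```python
import random


def optimize_sketch(obj_func, opt_type, x_init, step=0.5, n_iters=2000, seed=0):
    """(1+1) evolution strategy: mutate the current best, keep it if it improves."""
    if opt_type not in ("min", "max"):
        raise ValueError("opt_type must be 'min' or 'max'")
    # Multiplying by sign turns maximization into minimization.
    sign = 1.0 if opt_type == "min" else -1.0
    rng = random.Random(seed)
    best_x = list(x_init)
    best_score = obj_func(best_x)
    for _ in range(n_iters):
        # Gaussian mutation of every coordinate; only function values are used.
        candidate = [x + rng.gauss(0.0, step) for x in best_x]
        score = obj_func(candidate)
        if sign * score < sign * best_score:
            best_x, best_score = candidate, score
    return best_x, best_score


def objective(x):
    # Non-differentiable at its minimum (1, -2).
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)


best_x, best_score = optimize_sketch(objective, "min", x_init=[0.0, 0.0])
```

The sketch returns the best input found and its score, mirroring the return value described above. Because a candidate is accepted only when it improves the score, the result is never worse than the starting point.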

run_ml_model(model: maica.ml.base.Model, input_data: numpy.ndarray)

Run the given machine learning model model on input_data and return the model outputs. It is a wrapper function for using machine learning models as the objective function of meta-heuristics.

Parameters
  • model – (maica.ml.base.Model) A machine learning model that makes up the objective function.

  • input_data – (numpy.ndarray) Input data of the given machine learning model.

Returns

Output data of the machine learning model.
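
The wrapper idea can be sketched as binding a trained model to a function that takes only the input. Everything in the sketch is illustrative: ToyLinearModel and its predict method are hypothetical stand-ins and do not reflect the maica.ml.base.Model interface, and run_model_sketch is not the library function.

```python
from functools import partial


class ToyLinearModel:
    """Hypothetical stand-in for a trained model (not the maica.ml.base.Model API)."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, input_data):
        # One output per input row: dot(weights, row) + bias.
        return [
            sum(w * x for w, x in zip(self.weights, row)) + self.bias
            for row in input_data
        ]


def run_model_sketch(model, input_data):
    """Evaluate the model on a single input vector and return a scalar output."""
    return model.predict([input_data])[0]


model = ToyLinearModel(weights=[2.0, -1.0], bias=0.5)
# Binding the model yields a one-argument objective function of the input only.
obj_func = partial(run_model_sketch, model)
score = obj_func([1.0, 3.0])
```

An objective function built this way depends only on the input vector, which is the form a meta-heuristic optimizer expects.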