maica.util
The maica.util module includes useful functions for preprocessing and visualization.
It can be used to convert a user dataset into a new dataset suitable for machine learning,
and to analyze the prediction results of machine learning models.
In addition to these supplementary functions, the module provides meta-heuristic algorithms for optimizing
the input data of mathematical functions and machine learning models.
Functions
Function Name | Description |
---|---|
impute | Fill empty values in the given dataset. |
get_split_idx | Randomly generate two subsets of the original indices. This function is useful for splitting a dataset into training and test datasets. |
get_one_hot_feat | Generate a one-hot encoding of the categorical features. |
get_target_dist | Get a histogram of the target values. Lists of the bins and the labels of the histogram are returned for the given target values. |
get_error_dist | Compute the mean of the prediction errors for each range of target values. |
plot_target_dist | Plot a histogram of the target values for a given dataset. |
plot_error_dist | Plot a distribution of prediction errors for each range of target values. |
plot_pred_result | Draw a scatter plot of target and predicted values. |
plot_embeddings | Draw a 2-dimensional scatter plot of data embeddings. |
optimize | Find the optimal input of a given objective function. |
run_ml_model | A wrapper function to run machine learning algorithms for meta-heuristic based input optimization. |
Preprocessing
The maica.util.preprocessing module includes useful preprocessing methods for datasets.
This module allows you to convert a user dataset into a new dataset suitable for machine learning.
- impute(data: numpy.ndarray, method: str)
Fill empty values in the given `data`. It is useful for machine learning with experimental data containing missing values. Three imputation methods are provided: mean-based, zero-based, and KNN-based imputation.
Possible values of `method`, which selects the imputation method, are given in maica.core.env.
- Parameters
data – (numpy.ndarray) Data object.
method – (str) A key of the imputation method.
- Returns
Data object.
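The mean-based and zero-based strategies can be illustrated with a minimal sketch in plain Python. This is not the maica implementation: the helper name `impute_sketch`, the method keys `"mean"` and `"zero"`, and the use of `None` to mark a missing value are all assumptions made for this example (the actual keys are defined in maica.core.env).

```python
from statistics import fmean


def impute_sketch(rows, method="mean"):
    """Fill None entries column by column (illustrative helper, not part of maica)."""
    n_cols = len(rows[0])
    filled = [list(row) for row in rows]  # copy so the input is left untouched

    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        if method == "mean":
            # Mean of the observed values; 0.0 if the column is entirely empty.
            fill = fmean(observed) if observed else 0.0
        elif method == "zero":
            fill = 0.0
        else:
            raise ValueError(f"unknown method: {method}")

        for row in filled:
            if row[col] is None:
                row[col] = fill
    return filled


data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
mean_filled = impute_sketch(data, "mean")  # missing entries become column means
zero_filled = impute_sketch(data, "zero")  # missing entries become 0.0
```

Mean-based filling preserves each column's average, while zero-based filling is only sensible when zero is a meaningful default for the feature.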
- get_split_idx(n_data: int, ratio: float)
Get indices that randomly split a set of {0, 1, …, `n_data`}. The set is divided into two subsets at a ratio of `ratio` to `1 - ratio`.
- Parameters
n_data – (int) The number of data in your data object.
ratio – (float) The ratio for division.
- Returns
Data indices of the two subsets.
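A random index split of this kind can be sketched as follows. The helper `split_idx_sketch` and its `seed` argument are hypothetical, and the sketch assumes zero-based indices 0 to n_data - 1 with the first subset receiving `int(n_data * ratio)` elements; the rounding used by maica may differ.

```python
import random


def split_idx_sketch(n_data, ratio, seed=None):
    """Randomly split the indices 0..n_data-1 into two lists (illustrative, not maica code)."""
    rng = random.Random(seed)
    idx = list(range(n_data))
    rng.shuffle(idx)
    n_first = int(n_data * ratio)  # size of the first subset
    return idx[:n_first], idx[n_first:]


# An 80/20 split of ten data points into training and test indices.
idx_train, idx_test = split_idx_sketch(n_data=10, ratio=0.8, seed=0)
```

The two index lists are disjoint and together cover every data point, so they can be used directly to slice a dataset into training and test parts.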
- get_one_hot_feat(categories: list, hot_category: object)
Generate a one-hot encoding for the argument `hot_category` among the categories in `categories`.
- Parameters
categories – (list) A list of categories for the one-hot encoding.
hot_category – (object) The category to be marked as hot in the one-hot encoding.
- Returns
The one-hot encoding feature.
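As a minimal sketch of the idea, the encoding has one position per category and a single 1 at the position of the hot category. The helper name `one_hot_sketch`, the list return type, and the error raised for an unknown category are assumptions of this example, not documented maica behavior.

```python
def one_hot_sketch(categories, hot_category):
    """Return a list with 1 at the position of hot_category and 0 elsewhere."""
    if hot_category not in categories:
        raise ValueError(f"{hot_category!r} is not in categories")
    return [1 if c == hot_category else 0 for c in categories]


feat = one_hot_sketch(["Fe", "Co", "Ni"], "Co")
```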
Analysis
The maica.util.analysis module provides useful functions to analyze prediction results of the machine learning algorithms.
- get_target_dist(targets: numpy.ndarray, n_bins: int = 10)
Get a histogram of the target values. It returns the bins and labels of the histogram. This function is useful for analyzing prediction results based on the number of targets in each range.
- Parameters
targets – (numpy.ndarray) Target values to convert into a histogram.
n_bins – (int) The number of bins (default = 10).
- Returns
Lists of the bins and the labels of the histogram.
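One way to picture the returned bins and labels is the equal-width binning below. It is a sketch under stated assumptions: the helper `target_dist_sketch` is hypothetical, each bin is represented as a list of data indices, and labels are formatted as value ranges. The exact structure returned by maica is not specified in this document.

```python
def target_dist_sketch(targets, n_bins=10):
    """Group data indices into equal-width bins over the range of the target values."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]

    for i, y in enumerate(targets):
        # The maximum target value is assigned to the last bin.
        k = min(int((y - lo) / width), n_bins - 1) if width > 0 else 0
        bins[k].append(i)

    labels = [f"[{lo + k * width:.2f}, {lo + (k + 1) * width:.2f}]" for k in range(n_bins)]
    return bins, labels


bins, labels = target_dist_sketch([0.0, 1.0, 2.0, 3.0, 4.0], n_bins=2)
```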
- get_error_dist(targets: numpy.ndarray, preds: numpy.ndarray, n_ranges: int = 10)
Compute the mean of the prediction errors for each range of the target values. The prediction error is calculated as the mean absolute error (MAE). This function is useful for analyzing the prediction accuracy in each range of the target values. The error of the \(k^{th}\) range is calculated by:
\[error_k = \frac{1}{|S_k|} \sum_{i \in S_k} |y_i - y_i^{'}|,\]
where \(S_k\) is the set of indices of the data in the \(k^{th}\) range, and \(y_i^{'}\) is the predicted value for the input data \(\mathbf{x}_i\).
- Parameters
targets – (numpy.ndarray) Target values of the prediction.
preds – (numpy.ndarray) Prediction results.
n_ranges – (int) The number of sections to split the range of the target values (default = 10).
- Returns
Mean of errors and labels for each target range.
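The per-range MAE formula can be traced with a small pure-Python sketch. The helper `error_dist_sketch`, the equal-width ranges, and the use of `None` for an empty range are assumptions of this example rather than documented maica behavior.

```python
def error_dist_sketch(targets, preds, n_ranges=10):
    """Mean absolute error per equal-width range of the target values."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_ranges
    abs_errors = [[] for _ in range(n_ranges)]

    for y, y_pred in zip(targets, preds):
        # k is the index of the range that contains the target value y.
        k = min(int((y - lo) / width), n_ranges - 1) if width > 0 else 0
        abs_errors[k].append(abs(y - y_pred))

    # error_k = (1 / |S_k|) * sum of |y_i - y_i'| over S_k; None marks an empty range.
    return [sum(e) / len(e) if e else None for e in abs_errors]


targets = [0.0, 1.0, 2.0, 3.0, 4.0]
preds = [0.5, 1.0, 2.5, 3.0, 3.0]
errors = error_dist_sketch(targets, preds, n_ranges=2)
```

Here the first range holds the targets 0.0 and 1.0 (absolute errors 0.5 and 0.0), and the second holds 2.0, 3.0, and 4.0 (absolute errors 0.5, 0.0, and 1.0).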
Visualization
The maica.util.visualization module provides various visualization tools for dataset analysis and evaluation.
This module is implemented on top of matplotlib to provide a unified interface.
- plot_target_dist(fig_name: str, dataset: maica.data.base.Dataset, n_bins: int = 10, font_size: int = 16)
Plot a histogram of the target values of `dataset`. The histogram is saved as an image file in `fig_name`.
- Parameters
fig_name – (str) A name of the image file storing the histogram of the target values.
dataset – (maica.data.base.Dataset) The dataset object containing the target values.
n_bins – (int, optional) The number of bins in the histogram (default = 10).
font_size – (int, optional) The size of text in the generated figure (default = 16).
- plot_error_dist(fig_name: str, dataset: maica.data.base.Dataset, preds: numpy.ndarray, font_size: int = 16)
Plot a distribution of prediction errors. The prediction error for each target range is calculated as the mean absolute error:
\[error_i = \frac{1}{N_i}\sum_{j=1}^{N_i}|y_j - y_j^{'}|,\]
where \(N_i\) is the number of data points in the \(i^{th}\) target range, \(y_j\) is the target value of the \(j^{th}\) data point in the \(i^{th}\) target range, and \(y_{j}^{'}\) is the predicted value of the \(j^{th}\) data point.
- Parameters
fig_name – (str) A name of the image file storing the error distribution.
dataset – (maica.data.base.Dataset) The dataset object containing the target values.
preds – (numpy.ndarray) Prediction results.
font_size – (int, optional) The size of text in the generated figure (default = 16).
- plot_pred_result(fig_name: str, dataset: object, preds: numpy.ndarray, font_size: int = 16, min_val: Optional[float] = None, max_val: Optional[float] = None)
Draw a scatter plot of the prediction results in `preds`. The X and Y axes show the true target values in `dataset` and the predicted values in `preds`, respectively.
- Parameters
fig_name – (str) A name of the image file storing the scatter plot.
dataset – (object) The dataset object containing the target values.
preds – (numpy.ndarray) Prediction results.
font_size – (int, optional) The size of text in the generated figure (default = 16).
min_val – (float, optional) The minimum value of the X and Y axes (default = None).
max_val – (float, optional) The maximum value of the X and Y axes (default = None).
- plot_embeddings(fig_name: str, data: numpy.ndarray, labels: numpy.ndarray, font_size: int = 16, emb_method: str = 'tsne')
Draw a scatter plot of the embeddings computed from `data` by `emb_method`. This function generates a 2-dimensional scatter plot based on the first and second components of the embedding. Available embedding methods are given in maica.core.env.
- Parameters
fig_name – (str) A name of the image file storing the scatter plot.
data – (numpy.ndarray) The original data to be embedded.
labels – (numpy.ndarray) Numerical labels for the data.
font_size – (int, optional) The size of text in the generated figure (default = 16).
emb_method – (str, optional) Embedding method (default = env.EMB_TSNE).
Optimization
The maica.util.optimization module supports function optimization.
It includes various meta-heuristic algorithms and a wrapper function to execute them.
This module can be used to optimize the input data of user-defined functions or machine learning models.
- optimize(obj_func: object, opt_alg: str, opt_type: str, **kwargs)
Optimize the given user-defined function `obj_func` using the meta-heuristic algorithm `opt_alg` in ARTCAI. The objective function can be any input-output mapping, regardless of its differentiability.
- Parameters
obj_func – (object) The objective function to optimize.
opt_alg – (str) A name of the meta-heuristic algorithm to optimize the objective function.
opt_type – (str) Specifies whether the objective function is to be minimized or maximized.
kwargs – (optional) Hyperparameters of the meta-heuristic algorithm.
- Returns
Optimal input of the objective function and its score in the meta-heuristic algorithm.
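To show what derivative-free input optimization looks like in practice, the sketch below uses stochastic hill climbing, a deliberately simple search that stands in for the meta-heuristic algorithms in maica and is not one of them. Everything in it is an assumption made for illustration: the helper `optimize_sketch`, the `"min"`/`"max"` values of `opt_type`, and the hyperparameters `n_dims`, `bounds`, `n_iters`, `step`, and `seed`.

```python
import random


def optimize_sketch(obj_func, opt_type="min", n_dims=2, bounds=(-5.0, 5.0),
                    n_iters=2000, step=0.5, seed=0):
    """Stochastic hill climbing: keep a random perturbation only if it improves the score."""
    rng = random.Random(seed)
    sign = 1.0 if opt_type == "min" else -1.0  # maximize f by minimizing -f
    lo, hi = bounds

    best_x = [rng.uniform(lo, hi) for _ in range(n_dims)]
    best_score = sign * obj_func(best_x)

    for _ in range(n_iters):
        # Perturb the current best input with Gaussian noise and clip it to the bounds.
        cand = [min(max(x + rng.gauss(0.0, step), lo), hi) for x in best_x]
        score = sign * obj_func(cand)
        if score < best_score:
            best_x, best_score = cand, score

    return best_x, sign * best_score


def sphere(x):
    """A simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


opt_x, opt_score = optimize_sketch(sphere, opt_type="min")
```

The search only ever evaluates `obj_func`, never its gradient, which is why any input-output mapping can serve as the objective.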
- run_ml_model(model: maica.ml.base.Model, input_data: numpy.ndarray)
Run the given machine learning model `model` to generate model outputs. It is a wrapper function for using machine learning models as the objective function of meta-heuristics.
- Parameters
model – (maica.ml.base.Model) A machine learning model that makes up the objective function.
input_data – (numpy.ndarray) Input data of the given machine learning model.
- Returns
Output data of the machine learning model.
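The wrapper pattern can be sketched without maica as follows. The class `LinearModelSketch`, its `predict` method, and the helper `run_model_sketch` are hypothetical stand-ins; the interface of maica.ml.base.Model is not described in this document.

```python
from functools import partial


class LinearModelSketch:
    """Stand-in for a trained model; the predict method is an assumption of this example."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, input_data):
        return sum(w * x for w, x in zip(self.weights, input_data)) + self.bias


def run_model_sketch(model, input_data):
    """Forward input_data through model and return its output."""
    return model.predict(input_data)


model = LinearModelSketch(weights=[2.0, -1.0], bias=0.5)

# Fixing the model argument yields a function of the input only,
# which is the form an optimizer expects for its objective function.
obj_func = partial(run_model_sketch, model)
output = obj_func([1.0, 3.0])
```

Binding the model first leaves a one-argument callable, so the same optimizer code can treat a trained model and a hand-written mathematical function identically.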