Module: reconstruction#

The reconstruction module contains functions for reconstructing quantities of interest (QoIs) from the low-dimensional data representations and functionalities to assess the quality of that reconstruction.

Note

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]

The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.

The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
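For example, a minimal sketch of creating such a data matrix with numpy (the shape values here are arbitrary):

import numpy as np

# 100 observations (rows) of 5 variables (columns):
X = np.random.rand(100,5)

# The initial dimensionality of this data set is Q = 5:
(n_observations, n_variables) = np.shape(X)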


Tools for reconstructing quantities of interest (QoIs)#

Class ANN#

class PCAfold.reconstruction.ANN(input_data, output_data, optimizer, interior_architecture=(), activation_functions='tanh', kernel_initializer=None, bias_initializer=None, loss='MSE', batch_size=200, n_epochs=1000, validation_perc=10, random_seed=None, verbose=False)#

Enables reconstruction of quantities of interest (QoIs) using an artificial neural network (ANN).

Example:

from PCAfold import ANN
import numpy as np
from tensorflow import optimizers
from tensorflow.keras import initializers

# Generate dummy dataset:
input_data = np.random.rand(100,8)
output_data = np.random.rand(100,3)

# Instantiate ANN class object:
ann_model = ANN(input_data,
                output_data,
                optimizer=optimizers.legacy.Adam(learning_rate=0.001),
                interior_architecture=(5,4),
                activation_functions=('tanh', 'tanh', 'linear'),
                kernel_initializer=initializers.GlorotUniform(seed=100),
                bias_initializer=initializers.Ones(),
                loss='MSE',
                batch_size=100,
                n_epochs=1000,
                validation_perc=10,
                random_seed=100,
                verbose=True)

# Begin model training:
ann_model.train()

A summary of the current ANN model and its hyperparameter settings can be printed using the summary() function:

# Print the ANN model summary
ann_model.summary()
ANN model summary...
Parameters:
  • input_data – numpy.ndarray specifying the data set used as the input (regressors) to the ANN. It should be of size (n_observations,n_input_variables).

  • output_data – numpy.ndarray specifying the data set used as the output (predictors) to the ANN. It should be of size (n_observations,n_output_variables).

  • optimizer – tf.optimizers or tf.optimizers.legacy or str specifying the optimizer to use along with its parameters.

  • interior_architecture – (optional) tuple of int specifying the number of neurons in the interior network architecture. For example, if interior_architecture=(4,5), two interior layers will be created and the overall network architecture will be (Input)-(4)-(5)-(Output). If set to an empty tuple, interior_architecture=(), the overall network architecture will be (Input)-(Output). Keep in mind that if you’d like to create just one interior layer, you should use a comma after the integer: interior_architecture=(4,).

  • activation_functions – (optional) str or tuple specifying the activation functions used in all layers. If set to str, the same activation function is used in all layers. If set to a tuple of str, a different activation function can be set for each layer. The number of elements in the tuple has to match the number of layers. Each str (or each element of the tuple) can only be 'linear', 'sigmoid', or 'tanh'.

  • kernel_initializer – (optional) tf.keras.initializers specifying the initialization of weights in the network. If set to None, weights will be initialized using the Glorot uniform distribution.

  • bias_initializer – (optional) tf.keras.initializers specifying the initialization of biases in the network. If set to None, biases will be initialized as zeros.

  • loss – (optional) str specifying the loss function. It can be 'MAE' or 'MSE'.

  • batch_size – (optional) int specifying the batch size.

  • n_epochs – (optional) int specifying the number of epochs.

  • validation_perc – (optional) int specifying the percentage of the input data to be used as validation data during training. It should be a number larger than or equal to 0 and smaller than 100. Note that if it is set above 0, not all of the input data will be used as training data. Note also that validation data does not impact model training!

  • random_seed – (optional) int specifying the random seed to be used for any random operations. It is highly recommended to set a fixed random seed, as this allows for complete reproducibility of the results.

  • verbose – (optional) bool for printing verbose details.

Attributes:

  • input_data - (read only) numpy.ndarray specifying the data set used as the input to the ANN.

  • output_data - (read only) numpy.ndarray specifying the data set used as the output to the ANN.

  • architecture - (read only) str specifying the ANN architecture.

  • ann_model - (read only) object of Keras.models.Sequential class that stores the artificial neural network model.

  • weights_and_biases_init - (read only) list of numpy.ndarray specifying weights and biases with which the ANN was initialized.

  • weights_and_biases_trained - (read only) list of numpy.ndarray specifying weights and biases after training the ANN. Only available after calling ANN.train().

  • training_loss - (read only) list of losses computed on the training data. Only available after calling ANN.train().

  • validation_loss - (read only) list of losses computed on the validation data. Only available after calling ANN.train() and only when validation_perc is not equal to 0.

PCAfold.reconstruction.ANN.summary(self)#

Prints the ANN model summary.

PCAfold.reconstruction.ANN.train(self)#

Trains the artificial neural network (ANN) model.

PCAfold.reconstruction.ANN.predict(self, input_regressors)#

Predicts the quantities of interest (QoIs) from the trained artificial neural network (ANN) model.

Parameters:

input_regressors – numpy.ndarray specifying the input data (regressors) to be used for predicting the quantities of interest (QoIs) from the trained ANN model. It should be of size (n_observations,n_input_variables), where n_observations can be different from the number of observations in the training dataset.

Returns:

  • output_predictors - predicted quantities of interest (QoIs).
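A minimal usage sketch, assuming the trained ann_model object from the example above and a hypothetical query data set new_input with the same number of input variables:

import numpy as np

# Generate a hypothetical query data set with 8 input variables:
new_input = np.random.rand(50,8)

# Predict the QoIs from the trained ANN model:
output_predictors = ann_model.predict(new_input)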

PCAfold.reconstruction.ANN.print_weights_and_biases_init(self)#

Prints the initial weights and biases from all layers of the ANN.

PCAfold.reconstruction.ANN.print_weights_and_biases_trained(self)#

Prints the trained weights and biases from all layers of the ANN.

PCAfold.reconstruction.ANN.plot_losses(self, markevery=100, figure_size=(15, 5), save_filename=None)#

Plots training and validation losses.

Parameters:
  • markevery – (optional) int specifying how frequently the epoch number on the x-axis should be labelled.

  • figure_size – (optional) tuple specifying figure size.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.
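For instance, assuming the trained ann_model object from the example above, the loss history can be plotted and saved as follows (the filename is arbitrary):

# Plot the training and validation losses, labelling every 100th epoch:
plt = ann_model.plot_losses(markevery=100,
                            figure_size=(15,5),
                            save_filename='ANN-losses.pdf')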


Class PartitionOfUnityNetwork#

class PCAfold.reconstruction.PartitionOfUnityNetwork(partition_centers, partition_shapes, basis_type, ivar_center=None, ivar_scale=None, basis_coeffs=None, transform_power=1.0, transform_shift=0.0, transform_sign_shift=0.0, dtype='float64')#

A class for reconstruction (regression) of QoIs using POUnets.

The POUnets are constructed with a single-layer network of normalized radial basis functions (RBFs) whose neurons each own and weight a polynomial basis. For independent variable inputs \(\vec{x}\) of dimensionality \(d\), the \(i^{\text{th}}\) partition or neuron is computed as

\[\Phi_i(\vec{x};\vec{h}_i,K_i) = \phi^{{\rm RBF}}_i(\vec{x};\vec{h}_i,K_i)/\sum_j \phi^{{\rm RBF}}_j(\vec{x};\vec{h}_j,K_j)\]

where

\[\phi_i^{{\rm RBF}}(\vec{x};\vec{h}_i,K_i) = \exp\left(-(\vec{x}-\vec{h}_i)^\mathsf{T}K_i(\vec{x}-\vec{h}_i)\right)\]

with vector \(\vec{h}_i\) and diagonal matrix \(K_i\) defining the \(d\) center and \(d\) shape parameters, respectively, for training.

The final output of a POUnet is then obtained through

\[g(\vec{x};\vec{h},K,c) = \sum_{i=1}^{p}\left(\Phi_i(\vec{x};\vec{h}_i,K_i)\sum_{k=1}^{b}c_{i,k}m_k(\vec{x})\right)\]

where the polynomial basis is represented as a sum of \(b\) Taylor monomials, with the \(k^{\text{th}}\) monomial written as \(m_k(\vec{x})\), that are multiplied by trainable basis coefficients \(c\). The number of basis monomials is determined by the basis_type for the polynomial. For example, in two-dimensional space, a quadratic polynomial basis contains \(b=6\) monomial functions \(\{1, x_1, x_2, x_1^2, x_2^2, x_1x_2\}\). The combination of the partitions and polynomial basis functions creates localized polynomial fits for a QoI.
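The partition functions can be transcribed directly into numpy; the sketch below is not the library implementation, only a direct reading of the two equations above, taking the rows of shapes to be the diagonal entries of \(K_i\):

import numpy as np

def rbf_partitions(x, centers, shapes):
    # x has shape (n_points, d); centers and shapes have shape (n_partitions, d).
    # Unnormalized RBFs with diagonal K_i: exp(-(x - h_i)^T K_i (x - h_i)):
    diff = x[np.newaxis,:,:] - centers[:,np.newaxis,:]
    phi = np.exp(-np.sum(shapes[:,np.newaxis,:] * diff**2, axis=2))
    # Normalize so that the partitions sum to one at each query point:
    return phi / np.sum(phi, axis=0)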

More information can be found in [AHK+22].

The PartitionOfUnityNetwork class also provides a nonlinear transformation for the dependent variable(s) during training, which can be beneficial if the variable changes over orders of magnitude, for example. The equation for the transformation of variable \(f\) is

\[(|f + s_1|)^\alpha \text{sign}(f + s_1) + s_2 \text{sign}(f + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.
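A direct numpy transcription of this transformation might look as follows (a sketch, not the library's internal implementation):

import numpy as np

def transform_dvar(f, transform_power=1.0, transform_shift=0.0, transform_sign_shift=0.0):
    # (|f + s1|)^alpha * sign(f + s1) + s2 * sign(f + s1):
    sign = np.sign(f + transform_shift)
    return np.abs(f + transform_shift)**transform_power * sign + transform_sign_shift * sign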

Example:

from PCAfold import init_uniform_partitions, PartitionOfUnityNetwork
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)
dvars = 2.*ivars[:,0] + 3.*ivars[:,1]

# Initialize the POUnet parameters
net = PartitionOfUnityNetwork(**init_uniform_partitions([5,7], ivars), basis_type='linear')

# Build the training graph with provided training data
net.build_training_graph(ivars, dvars)

# (optional) update the learning rate (default is 1.e-3)
net.update_lr(1.e-4)

# (optional) update the least-squares regularization (default is 1.e-10)
net.update_l2reg(1.e-10)

# Train the POUnet
net.train(1000)

# Evaluate the POUnet
pred = net(ivars)

# Evaluate the POUnet derivatives
der = net.derivatives(ivars)

# Save the POUnet to a file
net.write_data_to_file('filename.pkl')

# Load a POUnet from file
net2 = PartitionOfUnityNetwork.load_from_file('filename.pkl')

# Evaluate the loaded POUnet (without needing to call build_training_graph)
pred2 = net2(ivars)
Parameters:
  • partition_centers – array size (number of partitions) x (number of ivar inputs) for partition locations

  • partition_shapes – array size (number of partitions) x (number of ivar inputs) for partition shapes influencing the RBF widths

  • basis_type – string ('constant', 'linear', or 'quadratic') for the degree of polynomial basis

  • ivar_center – (optional, default None) array for centering the ivar inputs before evaluating the POUnet, if None centers with zeros

  • ivar_scale – (optional, default None) array for scaling the ivar inputs before evaluating the POUnet, if None scales with ones

  • basis_coeffs – (optional, default None) if the array of polynomial basis coefficients is known, it may be provided here, otherwise it will be initialized with build_training_graph and trained with train

  • transform_power – (optional, default 1.) the power parameter used in the transformation equation during training

  • transform_shift – (optional, default 0.) the shift parameter used in the transformation equation during training

  • transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation during training

  • dtype – (optional, default 'float64') string specifying either float type 'float64' or 'float32'

Attributes:

  • partition_centers - (read only) array of the current partition centers

  • partition_shapes - (read only) array of the current partition shape parameters

  • basis_type - (read only) string relaying the basis degree

  • basis_coeffs - (read only) array of the current basis coefficients

  • ivar_center - (read only) array of the centering parameters for the ivar inputs

  • ivar_scale - (read only) array of the scaling parameters for the ivar inputs

  • dtype - (read only) string relaying the data type ('float64' or 'float32')

  • training_archive - (read only) dictionary of the errors and POUnet states archived during training

  • iterations - (read only) array of the iterations archived during training

PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_file(filename)#

Load data from a specified filename with pickle (following write_data_to_file)

Parameters:

filename – string

Returns:

dictionary of the POUnet data

PCAfold.reconstruction.PartitionOfUnityNetwork.load_from_file(filename)#

Load class from a specified filename with pickle (following write_data_to_file)

Parameters:

filename – string

Returns:

PartitionOfUnityNetwork

PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_txt(filename, verbose=False)#

Load data from a specified txt filename (following write_data_to_txt)

Parameters:
  • filename – string

  • verbose – (optional, default False) print out the data as it is read

Returns:

dictionary of the POUnet data

PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_file(self, filename)#

Save class data to a specified file using pickle. This does not include the archived data from training, which can be separately accessed with training_archive and saved outside of PartitionOfUnityNetwork.

Parameters:

filename – string

PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_txt(self, filename, nformat='%.14e')#

Save data to a specified txt file. This may be used to read POUnet parameters into other languages such as C++.

Parameters:

filename – string
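As a round-trip sketch, assuming the net object from the class example above (the filename is arbitrary), the saved parameters can be read back with load_data_from_txt:

# Save the POUnet parameters to a text file:
net.write_data_to_txt('pounet_params.txt')

# Read the parameters back into a dictionary:
data = PartitionOfUnityNetwork.load_data_from_txt('pounet_params.txt', verbose=True)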

PCAfold.reconstruction.PartitionOfUnityNetwork.build_training_graph(self, ivars, dvars, error_type='abs', constrain_positivity=False, istensor=False, verbose=False)#

Construct the graph used during training (including defining the training errors) with the provided training data

Parameters:
  • ivars – array of independent variables for training

  • dvars – array of dependent variable(s) for training

  • error_type – (optional, default 'abs') the type of training error: relative 'rel' or absolute 'abs'

  • constrain_positivity – (optional, default False) when True, it penalizes the training error with \(f - |f|\) for dependent variables \(f\). This can be useful when used in QoIAwareProjectionPOUnet

  • istensor – (optional, default False) whether to evaluate ivars and dvars as tensorflow Tensors or numpy arrays

  • verbose – (optional, default False) print out the number of the partition and basis parameters when True

PCAfold.reconstruction.PartitionOfUnityNetwork.update_lr(self, lr)#

update the learning rate for training

Parameters:

lr – float for the learning rate

PCAfold.reconstruction.PartitionOfUnityNetwork.update_l2reg(self, l2reg)#

update the least-squares regularization for training

Parameters:

l2reg – float for the least-squares regularization

PCAfold.reconstruction.PartitionOfUnityNetwork.lstsq(self, verbose=True)#

update the basis coefficients with least-squares regression

Parameters:

verbose – (optional, default True) prints when least-squares solve is performed when True

PCAfold.reconstruction.PartitionOfUnityNetwork.train(self, iterations, archive_rate=100, use_best_archive_sse=True, verbose=False)#

Performs training using a block coordinate descent strategy that alternates between updating the partition parameters with gradient descent and updating the basis coefficients with least-squares regression.

Parameters:
  • iterations – integer for number of training iterations to perform

  • archive_rate – (optional, default 100) the rate at which the errors and parameters are archived during training. These can be accessed with the training_archive attribute

  • use_best_archive_sse – (optional, default True) when True will set the POUnet parameters to those with the lowest error observed during training, otherwise the parameters from the last iteration are used

  • verbose – (optional, default False) when True will print progress
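A short usage sketch, assuming the net object from the class example above; the archived history can then be inspected through the training_archive and iterations attributes:

# Train for 1000 iterations, archiving errors and parameters every 100 iterations:
net.train(1000, archive_rate=100, use_best_archive_sse=True, verbose=False)

# Access the archived training history:
archive = net.training_archive
iterations = net.iterations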

PCAfold.reconstruction.PartitionOfUnityNetwork.__call__(self, xeval)#

evaluate the POUnet

Parameters:

xeval – array of independent variable query points

Returns:

array of POUnet predictions

PCAfold.reconstruction.PartitionOfUnityNetwork.derivatives(self, xeval, dvar_idx=0)#

evaluate the POUnet derivatives

Parameters:
  • xeval – array of independent variable query points

  • dvar_idx – (optional, default 0) index for the dependent variable whose derivatives are being evaluated

Returns:

array of POUnet derivative evaluations

PCAfold.reconstruction.PartitionOfUnityNetwork.partition_prenorm(self, xeval)#

evaluate the POUnet partitions prior to normalization

Parameters:

xeval – array of independent variable query points

Returns:

array of POUnet RBF partition evaluations before normalization

PCAfold.reconstruction.init_uniform_partitions(list_npartitions, ivars, width_factor=0.5, verbose=False)#

Computes parameters for initializing partition locations near training data with uniform spacing in each dimension.

Example:

from PCAfold import init_uniform_partitions
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)

# compute partition parameters for an initial 5x7 grid:
init_data = init_uniform_partitions([5, 7], ivars)
Parameters:
  • list_npartitions – list of integers specifying the number of partitions to try initializing in each dimension. Only partitions near the provided ivars are kept.

  • ivars – array of independent variables used for determining which partitions to keep

  • width_factor – (optional, default 0.5) the factor multiplying the spacing between partitions for initializing the partitions’ RBF widths

  • verbose – (optional, default False) when True, prints the number of partitions retained compared to the initial grid

Returns:

a dictionary of partition parameters to be used in initializing a PartitionOfUnityNetwork


Regression assessment#

Class RegressionAssessment#

class PCAfold.reconstruction.RegressionAssessment(observed, predicted, idx=None, variable_names=None, use_global_mean=False, norm='std', use_global_norm=False, tolerance=0.05)#

Wrapper class for storing all regression assessment metrics for a regression solution, given the observed dependent variables, \(\pmb{\phi}_o\), and the predicted dependent variables, \(\pmb{\phi}_p\).

Example:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Access mean absolute error values:
MAE = regression_metrics.mean_absolute_error

In addition, all stratified regression metrics can be computed on a single variable:

from PCAfold import variable_bins

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=5, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Access stratified mean absolute error values:
stratified_MAE = stratified_regression_metrics.stratified_mean_absolute_error
Parameters:
  • observed – numpy.ndarray specifying the observed values of dependent variables, \(\pmb{\phi}_o\). It should be of size (n_observations,) or (n_observations,n_variables).

  • predicted – numpy.ndarray specifying the predicted values of dependent variables, \(\pmb{\phi}_p\). It should be of size (n_observations,) or (n_observations,n_variables).

  • idx – (optional) numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • variable_names – (optional) list of str specifying variable names.

  • use_global_mean – (optional) bool specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.

  • norm – (optional) str specifying the normalization, \(d_{norm}\), for NRMSE computation. It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

  • use_global_norm – (optional) bool specifying if global norm of the observed variable should be used in NRMSE calculation.

  • tolerance – (optional) float specifying the tolerance for GDE computation.

Attributes:

  • coefficient_of_determination - (read only) numpy.ndarray specifying the coefficient of determination, \(R^2\), values. It has size (1,n_variables).

  • mean_absolute_error - (read only) numpy.ndarray specifying the mean absolute error (MAE) values. It has size (1,n_variables).

  • max_absolute_error - (read only) numpy.ndarray specifying the max absolute error (MaxAE) values. It has size (1,n_variables).

  • mean_squared_error - (read only) numpy.ndarray specifying the mean squared error (MSE) values. It has size (1,n_variables).

  • root_mean_squared_error - (read only) numpy.ndarray specifying the root mean squared error (RMSE) values. It has size (1,n_variables).

  • normalized_root_mean_squared_error - (read only) numpy.ndarray specifying the normalized root mean squared error (NRMSE) values. It has size (1,n_variables).

  • good_direction_estimate - (read only) float specifying the good direction estimate (GDE) value, treating the entire \(\pmb{\phi}_o\) and \(\pmb{\phi}_p\) as vectors. Note that if a single dependent variable is passed, GDE cannot be computed and is set to NaN.

If idx has been specified:

  • stratified_coefficient_of_determination - (read only) numpy.ndarray specifying the coefficient of determination, \(R^2\), values. It has size (1,n_variables).

  • stratified_mean_absolute_error - (read only) numpy.ndarray specifying the mean absolute error (MAE) values. It has size (1,n_variables).

  • stratified_max_absolute_error - (read only) numpy.ndarray specifying the max absolute error (MaxAE) values. It has size (1,n_variables).

  • stratified_mean_squared_error - (read only) numpy.ndarray specifying the mean squared error (MSE) values. It has size (1,n_variables).

  • stratified_root_mean_squared_error - (read only) numpy.ndarray specifying the root mean squared error (RMSE) values. It has size (1,n_variables).

  • stratified_normalized_root_mean_squared_error - (read only) numpy.ndarray specifying the normalized root mean squared error (NRMSE) values. It has size (1,n_variables).

PCAfold.reconstruction.RegressionAssessment.print_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#

Prints regression assessment metrics as raw text, in tex format and/or as pandas.DataFrame.

Example:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Print regression metrics:
regression_metrics.print_metrics(table_format=['raw', 'tex', 'pandas'],
                                 float_format='.4f',
                                 metrics=['R2', 'NRMSE', 'GDE'])

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
X1
R2:     0.9900
NRMSE:  0.0999
GDE:    70.0000
-------------------------
X2
R2:     0.6126
NRMSE:  0.6224
GDE:    70.0000
-------------------------
X3
R2:     0.6368
NRMSE:  0.6026
GDE:    70.0000

Adding 'tex' to the table_format list will result in printing:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llll} \toprule
 & \textit{X1} & \textit{X2} & \textit{X3} \\ \midrule
R2 & 0.9900 & 0.6126 & 0.6368 \\
NRMSE & 0.0999 & 0.6224 & 0.6026 \\
GDE & 70.0000 & 70.0000 & 70.0000 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

[image: generate-pandas-table.png]

Additionally, the current object of RegressionAssessment class can be compared with another object:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)
Y = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)
pca_Y = PCA(Y, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))
Y_rec = pca_Y.reconstruct(pca_Y.transform(Y))

# Instantiate RegressionAssessment class object:
regression_metrics_X = RegressionAssessment(X, X_rec)
regression_metrics_Y = RegressionAssessment(Y, Y_rec)

# Print regression metrics:
regression_metrics_X.print_metrics(table_format=['raw', 'pandas'],
                                   float_format='.4f',
                                   metrics=['R2', 'NRMSE', 'GDE'],
                                   comparison=regression_metrics_Y)

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
X1
R2:     0.9133  BETTER
NRMSE:  0.2944  BETTER
GDE:    67.0000 WORSE
-------------------------
X2
R2:     0.5969  WORSE
NRMSE:  0.6349  WORSE
GDE:    67.0000 WORSE
-------------------------
X3
R2:     0.6175  WORSE
NRMSE:  0.6185  WORSE
GDE:    67.0000 WORSE

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

[image: generate-pandas-table-comparison.png]
Parameters:
  • table_format – (optional) list of str specifying the format(s) in which the table should be printed. Strings can only be 'raw', 'tex' and/or 'pandas'.

  • float_format – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '.4f'.

  • metrics – (optional) list of str specifying which metrics should be printed. Strings can only be 'R2', 'MAE', 'MSE', 'MSLE', 'RMSE', 'NRMSE', 'GDE'. If metrics is set to None, all available metrics will be printed.

  • comparison – (optional) object of RegressionAssessment class specifying the metrics that should be compared with the current regression metrics.

PCAfold.reconstruction.RegressionAssessment.print_stratified_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#

Prints stratified regression assessment metrics as raw text, in tex format and/or as pandas.DataFrame. In each cluster, in addition to the regression metrics, the number of observations is printed, along with the minimum and maximum values of the observed variable in that cluster.

Example:

from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Print regression metrics:
stratified_regression_metrics.print_stratified_metrics(table_format=['raw', 'tex', 'pandas'],
                                                       float_format='.4f',
                                                       metrics=['R2', 'MAE', 'NRMSE'])

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
k1
Observations:   31
Min:    0.0120
Max:    0.3311
R2:     -3.3271
MAE:    0.1774
NRMSE:  2.0802
-------------------------
k2
Observations:   38
Min:    0.3425
Max:    0.6665
R2:     -1.4608
MAE:    0.1367
NRMSE:  1.5687
-------------------------
k3
Observations:   31
Min:    0.6853
Max:    0.9959
R2:     -3.7319
MAE:    0.1743
NRMSE:  2.1753

Adding 'tex' to the table_format list will result in printing:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llll} \toprule
 & \textit{k1} & \textit{k2} & \textit{k3} \\ \midrule
Observations & 31.0000 & 38.0000 & 31.0000 \\
Min & 0.0120 & 0.3425 & 0.6853 \\
Max & 0.3311 & 0.6665 & 0.9959 \\
R2 & -3.3271 & -1.4608 & -3.7319 \\
MAE & 0.1774 & 0.1367 & 0.1743 \\
NRMSE & 2.0802 & 1.5687 & 2.1753 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

[image: generate-pandas-table-stratified.png]

Additionally, the current object of RegressionAssessment class can be compared with another object:

from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics_0 = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)
stratified_regression_metrics_1 = RegressionAssessment(X[:,1], X_rec[:,1], idx=idx)

# Print regression metrics:
stratified_regression_metrics_0.print_stratified_metrics(table_format=['raw', 'pandas'],
                                                         float_format='.4f',
                                                         metrics=['R2', 'MAE', 'NRMSE'],
                                                         comparison=stratified_regression_metrics_1)

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
k1
Observations:   39
Min:    0.0013
Max:    0.3097
R2:     0.9236  BETTER
MAE:    0.0185  BETTER
NRMSE:  0.2764  BETTER
-------------------------
k2
Observations:   29
Min:    0.3519
Max:    0.6630
R2:     0.9380  BETTER
MAE:    0.0179  BETTER
NRMSE:  0.2491  BETTER
-------------------------
k3
Observations:   32
Min:    0.6663
Max:    0.9943
R2:     0.9343  BETTER
MAE:    0.0194  BETTER
NRMSE:  0.2563  BETTER

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

[image: generate-pandas-table-comparison-stratified.png]
Parameters:
  • table_format – (optional) list of str specifying the format(s) in which the table should be printed. Strings can only be 'raw', 'tex' and/or 'pandas'.

  • float_format – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '.4f'.

  • metrics – (optional) list of str specifying which metrics should be printed. Strings can only be 'R2', 'MAE', 'MSE', 'MSLE', 'RMSE', 'NRMSE'. If metrics is set to None, all available metrics will be printed.

  • comparison – (optional) object of RegressionAssessment class specifying the metrics that should be compared with the current regression metrics.

Regression quality metrics#

PCAfold.reconstruction.coefficient_of_determination(observed, predicted)#

Computes the coefficient of determination, \(R^2\), value:

\[R^2 = 1 - \frac{\sum_{i=1}^N (\phi_{o,i} - \phi_{p,i})^2}{\sum_{i=1}^N (\phi_{o,i} - \mathrm{mean}(\phi_{o,i}))^2}\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
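The formula can also be transcribed directly into numpy; the sketch below should agree with coefficient_of_determination up to floating-point round-off:

import numpy as np

def r2_from_formula(observed, predicted):
    # R2 = 1 - SSE/SST, with the mean of the observed variable as the reference:
    sse = np.sum((observed - predicted)**2)
    sst = np.sum((observed - np.mean(observed))**2)
    return 1.0 - sse/sst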

Example:

from PCAfold import PCA, coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the coefficient of determination for the first variable:
r2 = coefficient_of_determination(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • r2 - coefficient of determination, \(R^2\).

PCAfold.reconstruction.stratified_coefficient_of_determination(observed, predicted, idx, use_global_mean=True, verbose=False)#

Computes the stratified coefficient of determination, \(R^2\), values. Stratified \(R^2\) is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

\(R_j^2\) in the \(j^{th}\) bin can be computed in two ways:

  • If use_global_mean=True, the mean of the entire observed variable is used as a reference:

\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o))^2}\]
  • If use_global_mean=False, the mean of the considered \(j^{th}\) bin is used as a reference:

\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o^{j}))^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin and \(\phi_p\) is the predicted dependent variable.

Note

After running this function you can call analysis.plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders) on the function outputs and it will visualize how stratified \(R^2\) changes across bins.

Warning

The stratified \(R^2\) metric can be misleading if there are large variations in point density in an observed variable. For instance, below is a data set composed of lines of points that have uniform spacing on the \(x\) axis but become more and more sparse in the direction of increasing \(\phi\) due to an increasing gradient of \(\phi\). If bins are narrow enough (the number of bins is high enough), a single bin (like the bin bounded by the red dashed lines) can contain only one of those lines of points at high values of \(\phi\). \(R^2\) will then be computed for constant, or almost constant, observations, even though globally those observations lie in a region of a large gradient of the observed variable!

[image: stratified-r2.png]

Example:

from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0],
                                                     X_rec[:,0],
                                                     idx=idx,
                                                     use_global_mean=True,
                                                     verbose=True)

# Plot the stratified R2 values:
plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • use_global_mean – (optional) bool specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.

  • verbose – (optional) bool for printing sizes (number of observations) and \(R^2\) values in each bin.

Returns:

  • r2_in_bins - list specifying the coefficients of determination \(R^2\) in each bin. It has length k.

PCAfold.reconstruction.mean_absolute_error(observed, predicted)#

Computes the mean absolute error (MAE):

\[\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^N | \phi_{o,i} - \phi_{p,i} |\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean absolute error for the first variable:
mae = mean_absolute_error(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • mae - mean absolute error (MAE).

PCAfold.reconstruction.stratified_mean_absolute_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean absolute error (MAE) values. Stratified MAE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MAE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MAE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} | \phi_{o,i}^j - \phi_{p,i}^j |\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MAE in 10 bins of the first variable in a data set:
mae_in_bins = stratified_mean_absolute_error(X[:,0],
                                             X_rec[:,0],
                                             idx=idx,
                                             verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MAE values in each bin.

Returns:

  • mae_in_bins - list specifying the mean absolute error (MAE) in each bin. It has length k.

PCAfold.reconstruction.max_absolute_error(observed, predicted)#

Computes the maximum absolute error (MaxAE):

\[\mathrm{MaxAE} = \mathrm{max}( | \phi_{o,i} - \phi_{p,i} | )\]

where \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, max_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the maximum absolute error for the first variable:
maxae = max_absolute_error(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • maxae - max absolute error (MaxAE).

PCAfold.reconstruction.stratified_max_absolute_error(observed, predicted, idx, verbose=False)#

Computes the stratified max absolute error (MaxAE) values. Stratified MaxAE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MaxAE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MaxAE}_j = \mathrm{max}( | \phi_{o,i}^j - \phi_{p,i}^j | )\]

where \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_max_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MaxAE in 10 bins of the first variable in a data set:
maxae_in_bins = stratified_max_absolute_error(X[:,0],
                                              X_rec[:,0],
                                              idx=idx,
                                              verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MaxAE values in each bin.

Returns:

  • maxae_in_bins - list specifying the max absolute error (MaxAE) in each bin. It has length k.

PCAfold.reconstruction.mean_squared_error(observed, predicted)#

Computes the mean squared error (MSE):

\[\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared error for the first variable:
mse = mean_squared_error(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • mse - mean squared error (MSE).

PCAfold.reconstruction.stratified_mean_squared_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean squared error (MSE) values. Stratified MSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MSE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSE in 10 bins of the first variable in a data set:
mse_in_bins = stratified_mean_squared_error(X[:,0],
                                            X_rec[:,0],
                                            idx=idx,
                                            verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MSE values in each bin.

Returns:

  • mse_in_bins - list specifying the mean squared error (MSE) in each bin. It has length k.

PCAfold.reconstruction.mean_squared_logarithmic_error(observed, predicted)#

Computes the mean squared logarithmic error (MSLE):

\[\mathrm{MSLE} = \frac{1}{N} \sum_{i=1}^N (\log(\phi_{o,i} + 1) - \log(\phi_{p,i} + 1)) ^2\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Warning

The MSLE metric can only be used on non-negative samples.

Example:

from PCAfold import PCA, mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared logarithmic error for the first variable:
msle = mean_squared_logarithmic_error(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • msle - mean squared logarithmic error (MSLE).

PCAfold.reconstruction.stratified_mean_squared_logarithmic_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean squared logarithmic error (MSLE) values. Stratified MSLE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MSLE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MSLE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\log(\phi_{o,i}^j + 1) - \log(\phi_{p,i}^j + 1)) ^2\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Warning

The MSLE metric can only be used on non-negative samples.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSLE in 10 bins of the first variable in a data set:
msle_in_bins = stratified_mean_squared_logarithmic_error(X[:,0],
                                                         X_rec[:,0],
                                                         idx=idx,
                                                         verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MSLE values in each bin.

Returns:

  • msle_in_bins - list specifying the mean squared logarithmic error (MSLE) in each bin. It has length k.

PCAfold.reconstruction.root_mean_squared_error(observed, predicted)#

Computes the root mean squared error (RMSE):

\[\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the root mean squared error for the first variable:
rmse = root_mean_squared_error(X[:,0], X_rec[:,0])
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns:

  • rmse - root mean squared error (RMSE).

PCAfold.reconstruction.stratified_root_mean_squared_error(observed, predicted, idx, verbose=False)#

Computes the stratified root mean squared error (RMSE) values. Stratified RMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

RMSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{RMSE}_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified RMSE in 10 bins of the first variable in a data set:
rmse_in_bins = stratified_root_mean_squared_error(X[:,0],
                                                  X_rec[:,0],
                                                  idx=idx,
                                                  verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and RMSE values in each bin.

Returns:

  • rmse_in_bins - list specifying the root mean squared error (RMSE) in each bin. It has length k.

PCAfold.reconstruction.normalized_root_mean_squared_error(observed, predicted, norm='std')#

Computes the normalized root mean squared error (NRMSE):

\[\mathrm{NRMSE} = \frac{1}{d_{norm}} \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]

where \(d_{norm}\) is the normalization factor, \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Various normalizations are available:

  • Root square mean, norm='root_square_mean': \(d_{norm} = \sqrt{\mathrm{mean}(\phi_o^2)}\)

  • Std, norm='std': \(d_{norm} = \mathrm{std}(\phi_o)\)

  • Range, norm='range': \(d_{norm} = \mathrm{max}(\phi_o) - \mathrm{min}(\phi_o)\)

  • Root square range, norm='root_square_range': \(d_{norm} = \sqrt{\mathrm{max}(\phi_o^2) - \mathrm{min}(\phi_o^2)}\)

  • Root square std, norm='root_square_std': \(d_{norm} = \sqrt{\mathrm{std}(\phi_o^2)}\)

  • Absolute mean, norm='abs_mean': \(d_{norm} = | \mathrm{mean}(\phi_o) |\)
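The normalization factors above can be written out in numpy as follows (a sketch of the normalization options, not the library code):

import numpy as np

def normalization_factor(observed, norm='std'):
    # Each entry transcribes one row of the normalization options above:
    factors = {'root_square_mean': np.sqrt(np.mean(observed**2)),
               'std': np.std(observed),
               'range': np.max(observed) - np.min(observed),
               'root_square_range': np.sqrt(np.max(observed**2) - np.min(observed**2)),
               'root_square_std': np.sqrt(np.std(observed**2)),
               'abs_mean': np.abs(np.mean(observed))}
    return factors[norm]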

Example:

from PCAfold import PCA, normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the root mean squared error for the first variable:
nrmse = normalized_root_mean_squared_error(X[:,0], X_rec[:,0], norm='std')
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • norm – (optional) str specifying the normalization, \(d_{norm}\). It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

Returns:

  • nrmse - normalized root mean squared error (NRMSE).

PCAfold.reconstruction.stratified_normalized_root_mean_squared_error(observed, predicted, idx, norm='std', use_global_norm=False, verbose=False)#

Computes the stratified normalized root mean squared error (NRMSE) values. Stratified NRMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

NRMSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{NRMSE}_j = \frac{1}{d_{norm}} \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified NRMSE in 10 bins of the first variable in a data set:
nrmse_in_bins = stratified_normalized_root_mean_squared_error(X[:,0],
                                                              X_rec[:,0],
                                                              idx=idx,
                                                              norm='std',
                                                              use_global_norm=True,
                                                              verbose=True)
Parameters:
  • observed – numpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • norm – (optional) str specifying the normalization, \(d_{norm}\). It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

  • use_global_norm – (optional) bool specifying if global norm of the observed variable should be used in NRMSE calculation. If set to False, norms are computed on samples from the corresponding bin.

  • verbose – (optional) bool for printing sizes (number of observations) and NRMSE values in each bin.

Returns:

  • nrmse_in_bins - list specifying the normalized root mean squared error (NRMSE) in each bin. It has length k.

PCAfold.reconstruction.turning_points(observed, predicted)#

Computes the turning points percentage - the percentage of predicted outputs that have the opposite growth tendency to the corresponding observed growth tendency.

Warning

This function is under construction.

Returns:

  • turning_points - turning points percentage in %.

PCAfold.reconstruction.good_estimate(observed, predicted, tolerance=0.05)#

Computes the good estimate (GE) - the percentage of predicted values that are within the specified tolerance from the corresponding observed values.

Warning

This function is under construction.

Parameters:
  • observednumpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • tolerancefloat specifying the tolerance.

Returns:

  • good_estimate - good estimate (GE) in %.
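
As this function is also under construction, the following is a minimal sketch of the idea, under the assumption that the tolerance is relative to the observed values; the actual convention may differ:

import numpy as np

# Generate dummy data:
observed = np.random.rand(100) + 1.0
predicted = observed * (1.0 + 0.02 * np.random.randn(100))

# Fraction of predictions within a relative tolerance of the observations
# (assumed definition for illustration):
tolerance = 0.05
within = np.abs(observed - predicted) <= tolerance * np.abs(observed)
good_estimate = np.mean(within) * 100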

PCAfold.reconstruction.good_direction_estimate(observed, predicted, tolerance=0.05)#

Computes the good direction (GD) and the good direction estimate (GDE).

GD for observation \(i\), is computed as:

\[GD_i = \frac{\vec{\phi}_{o,i}}{|| \vec{\phi}_{o,i} ||} \cdot \frac{\vec{\phi}_{p,i}}{|| \vec{\phi}_{p,i} ||}\]

where \(\vec{\phi}_o\) is the observed vector quantity and \(\vec{\phi}_p\) is the predicted vector quantity.

GDE is computed as the percentage of predicted vector observations whose direction is within the specified tolerance from the direction of the corresponding observed vector. Note that \(GD_i\) is the cosine of the angle between the observed and the predicted vector, so \(GD_i = 1\) indicates perfect directional agreement.

Example:

from PCAfold import PCA, good_direction_estimate
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the vector of good direction and good direction estimate:
(good_direction, gde) = good_direction_estimate(X,
                                                X_rec,
                                                tolerance=0.01)
Parameters:
  • observednumpy.ndarray specifying the observed vector quantity, \(\vec{\phi}_o\). It should be of size (n_observations,n_dimensions).

  • predictednumpy.ndarray specifying the predicted vector quantity, \(\vec{\phi}_p\). It should be of size (n_observations,n_dimensions).

  • tolerancefloat specifying the tolerance.

Returns:

  • good_direction - numpy.ndarray specifying the vector of good direction (GD). It has size (n_observations,).

  • good_direction_estimate - good direction estimate (GDE) in %.
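
For reference, \(GD_i\) can be reproduced in plain numpy directly from the formula above: normalize each observed and predicted vector to unit length and take their dot product. This is an illustration of the formula, not the library's implementation:

import numpy as np

# Generate dummy vector quantities:
observed = np.random.rand(100,3)
predicted = np.random.rand(100,3)

# Normalize each observation (row) to a unit vector:
observed_unit = observed / np.linalg.norm(observed, axis=1, keepdims=True)
predicted_unit = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)

# GD is the dot product of the unit vectors, i.e., the cosine of the angle between them:
good_direction = np.sum(observed_unit * predicted_unit, axis=1)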

PCAfold.reconstruction.generate_tex_table(data_frame_table, float_format='.2f', caption='', label='')#

Generates tex code for a table stored in a pandas.DataFrame. This function can be useful, for instance, for reporting regression results.

Example:

from PCAfold import PCA, generate_tex_table
import numpy as np
import pandas as pd

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy variables names:
variable_names = ['A1', 'A2', 'A3', 'A4', 'A5']

# Instantiate PCA class object:
pca_q2 = PCA(X, scaling='auto', n_components=2, use_eigendec=True, nocenter=False)
pca_q3 = PCA(X, scaling='auto', n_components=3, use_eigendec=True, nocenter=False)

# Calculate the R2 values:
r2_q2 = pca_q2.calculate_r2(X)[None,:]
r2_q3 = pca_q3.calculate_r2(X)[None,:]

# Generate pandas.DataFrame from the R2 values:
r2_table = pd.DataFrame(np.vstack((r2_q2, r2_q3)),
                        columns=variable_names,
                        index=['PCA, $q=2$', 'PCA, $q=3$'])

# Generate tex code for the table:
generate_tex_table(r2_table,
                   float_format=".3f",
                   caption='$R^2$ values.',
                   label='r2-values')

Note

The code above will produce tex code:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llllll} \toprule
 & \textit{A1} & \textit{A2} & \textit{A3} & \textit{A4} & \textit{A5} \\ \midrule
PCA, $q=2$ & 0.507 & 0.461 & 0.485 & 0.437 & 0.611 \\
PCA, $q=3$ & 0.618 & 0.658 & 0.916 & 0.439 & 0.778 \\
\end{tabular}
\caption{$R^2$ values.}\label{r2-values}
\end{center}
\end{table}

Which, when compiled, will result in a rendered table (figure: generate-tex-table.png). Note that the \toprule and \midrule commands in the generated code require the booktabs LaTeX package.
Parameters:
  • data_frame_tablepandas.DataFrame specifying the table to convert to tex code. It can include column names and index names.

  • float_formatstr specifying the display format for the numerical entries inside the table. By default it is set to '.2f'.

  • captionstr specifying caption for the table.

  • labelstr specifying label for the table.


Plotting functions#

PCAfold.reconstruction.plot_2d_regression(x, observed, predicted, x_label=None, y_label=None, color_observed=None, color_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots the result of regression of a dependent variable on top of a one-dimensional manifold defined by a single independent variable x.

Example:

from PCAfold import PCA, plot_2d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_2d_regression(X[:,0],
                         X[:,0],
                         X_rec[:,0],
                         x_label='$x$',
                         y_label='$y$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='2D regression',
                         save_filename='2d-regression.pdf')
plt.close()
Parameters:
  • xnumpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • observednumpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • color_observed – (optional) str specifying the color of the plotted observed variable.

  • color_predicted – (optional) str specifying the color of the plotted predicted variable.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.

PCAfold.reconstruction.plot_2d_regression_scalar_field(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), x_label=None, y_label=None, s_field=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a 2D field of a regressed scalar dependent variable. A two-dimensional manifold can be additionally plotted on top of the field.

Example:

from PCAfold import PCA, KReg, plot_2d_regression_scalar_field
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,2)
Z = np.random.rand(100,1)

# Train the kernel regression model:
model = KReg(X, Z)

# Define the regression model:
def regression_model(query):

    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)[0,0]

    return predicted

# Define the bounds for the scalar field:
grid_bounds = ([np.min(X[:,0]),np.max(X[:,0])],[np.min(X[:,1]),np.max(X[:,1])])

# Plot the regressed scalar field:
plt = plot_2d_regression_scalar_field(grid_bounds,
                                      regression_model,
                                      x=X[:,0],
                                      y=X[:,1],
                                      resolution=(100,100),
                                      extension=(10,10),
                                      x_label='$X_1$',
                                      y_label='$X_2$',
                                      s_field=4,
                                      s_manifold=60,
                                      manifold_color=Z,
                                      colorbar_label='$Z_1$',
                                      color_map='inferno',
                                      colorbar_range=(0,1),
                                      manifold_alpha=1,
                                      grid_on=False,
                                      figure_size=(10,6),
                                      title='2D regressed scalar field',
                                      save_filename='2D-regressed-scalar-field.pdf')
plt.close()
Parameters:
  • grid_boundstuple of list specifying the bounds of the grid on the \(x\) and \(y\) axis over which the scalar field will be computed.

  • regression_modelfunction that outputs the predicted scalar using the regression model. It should take as input a numpy.ndarray of size (1,2), where the two elements specify the first and second independent variable values. It should output a float specifying the regressed scalar value at that input.

  • x – (optional) numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the scalar field.

  • y – (optional) numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the scalar field.

  • resolution – (optional) tuple of int specifying the resolution of the grid on the \(x\) and \(y\) axis.

  • extension – (optional) tuple of float or int specifying a percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond what has been specified by the grid_bounds parameter.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • s_field – (optional) int or float specifying the scatter point size for the scalar field.

  • s_manifold – (optional) int or float specifying the scatter point size for the manifold.

  • manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • colorbar_label – (optional) str specifying colorbar label annotation.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • manifold_alpha – (optional) float or int specifying the opacity of the plotted manifold.

  • grid_on – (optional) bool specifying whether the grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.

PCAfold.reconstruction.plot_2d_regression_streamplot(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), color='k', x_label=None, y_label=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a streamplot of a regressed vector field of a dependent variable. A two-dimensional manifold can be additionally plotted on top of the streamplot.

Example:

from PCAfold import PCA, KReg, plot_2d_regression_streamplot
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)
S_X = np.random.rand(100,5)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X, n_components=2)
PCs = pca_X.transform(X)
S_Z = pca_X.transform(S_X, nocenter=True)

# Train the kernel regression model:
model = KReg(PCs, S_Z)

# Define the regression model:
def regression_model(query):

    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)

    return predicted

# Define the bounds for the streamplot:
grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])

# Plot the regression streamplot:
plt = plot_2d_regression_streamplot(grid_bounds,
                                    regression_model,
                                    x=PCs[:,0],
                                    y=PCs[:,1],
                                    resolution=(15,15),
                                    extension=(20,20),
                                    color='r',
                                    x_label='$Z_1$',
                                    y_label='$Z_2$',
                                    manifold_color=X[:,0],
                                    colorbar_label='$X_1$',
                                    color_map='plasma',
                                    colorbar_range=(0,1),
                                    manifold_alpha=1,
                                    grid_on=False,
                                    figure_size=(10,6),
                                    title='Streamplot',
                                    save_filename='streamplot.pdf')
plt.close()
Parameters:
  • grid_boundstuple of list specifying the bounds of the grid on the \(x\) and \(y\) axis over which the vector field will be computed.

  • regression_modelfunction that outputs the predicted vector using the regression model. It should take as input a numpy.ndarray of size (1,2), where the two elements specify the first and second independent variable values. It should output a numpy.ndarray of size (1,2), where the two elements specify the first and second regressed vector elements.

  • x – (optional) numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • y – (optional) numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • resolution – (optional) tuple of int specifying the resolution of the streamplot grid on the \(x\) and \(y\) axis.

  • extension – (optional) tuple of float or int specifying a percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond what has been specified by the grid_bounds parameter.

  • color – (optional) str specifying the streamlines color.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • s_manifold – (optional) int or float specifying the scatter point size for the manifold.

  • manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • colorbar_label – (optional) str specifying colorbar label annotation.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • manifold_alpha – (optional) float or int specifying the opacity of the plotted manifold.

  • grid_on – (optional) bool specifying whether the grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.

PCAfold.reconstruction.plot_3d_regression(x, y, observed, predicted, elev=45, azim=-45, clean=False, x_label=None, y_label=None, z_label=None, color_observed=None, color_predicted=None, s_observed=None, s_predicted=None, alpha_observed=None, alpha_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots the result of regression of a dependent variable on top of a two-dimensional manifold defined by two independent variables x and y.

Example:

from PCAfold import PCA, plot_3d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain three-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_3d_regression(X[:,0],
                         X[:,1],
                         X[:,0],
                         X_rec[:,0],
                         elev=45,
                         azim=-45,
                         x_label='$x$',
                         y_label='$y$',
                         z_label='$z$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='3D regression',
                         save_filename='3d-regression.pdf')
plt.close()
Parameters:
  • xnumpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • ynumpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • observednumpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • elev – (optional) float or int specifying the elevation angle.

  • azim – (optional) float or int specifying the azimuth angle.

  • clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points and the 3D axes is plotted.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • z_label – (optional) str specifying \(z\)-axis label annotation. If set to None label will not be plotted.

  • color_observed – (optional) str specifying the color of the plotted observed variable.

  • color_predicted – (optional) str specifying the color of the plotted predicted variable.

  • s_observed – (optional) int or float specifying the scatter point size for the observed variable.

  • s_predicted – (optional) int or float specifying the scatter point size for the predicted variable.

  • alpha_observed – (optional) int or float specifying the point opacity for the observed variable.

  • alpha_predicted – (optional) int or float specifying the point opacity for the predicted variable.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.

PCAfold.reconstruction.plot_stratified_metric(metric_in_bins, bins_borders, variable_name=None, metric_name=None, yscale='linear', ylim=None, figure_size=(10, 5), title=None, save_filename=None)#

Plots a stratified metric across bins of a dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_metric
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0],
                                                     X_rec[:,0],
                                                     idx=idx,
                                                     use_global_mean=True,
                                                     verbose=True)

# Visualize how R2 changes across bins:
plt = plot_stratified_metric(r2_in_bins,
                             bins_borders,
                             variable_name='$X_1$',
                             metric_name='$R^2$',
                             yscale='log',
                             figure_size=(10,5),
                             title='Stratified $R^2$',
                             save_filename='r2.pdf')
plt.close()
Parameters:
  • metric_in_binslist of metric values in each bin.

  • bins_borderslist of bins borders that were created to stratify the dependent variable.

  • variable_name – (optional) str specifying the name of the variable for which the metric was computed. If set to None label on the x-axis will not be plotted.

  • metric_name – (optional) str specifying the name of the metric to be plotted on the y-axis. If set to None label on the y-axis will not be plotted.

  • yscale – (optional) str specifying the scale for the y-axis.

  • ylim – (optional) tuple specifying the limits for the y-axis.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns:

  • plt - matplotlib.pyplot plot handle.