_images/PCAfold-logo-rectangle.svg

Low-dimensional PCA-derived manifolds and everything in between!



Intro#

PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds. It incorporates a variety of data preprocessing tools (including data clustering and sampling), implements several dimensionality reduction strategies and utilizes novel approaches to assess the quality of the obtained low-dimensional manifolds. The latest software version introduces algorithms to optimize projection topologies based on quantities of interest (QoIs) and novel tools to reconstruct QoIs from the low-dimensional data representations using partition of unity networks (POUnets).

A general overview for using PCAfold modules is presented in the diagram below:

_images/PCAfold-diagram.svg

Each module’s functionalities can also be used as a standalone tool for performing a specific task and can easily be combined with techniques from outside of this software.

Refer to the Getting started section for more information on installing the software and for possible workflows that can be achieved with PCAfold. You can also download the poster below for a condensed overview of the available functionalities.

_images/PCAfold-poster.png

Citing PCAfold#

PCAfold is published in the SoftwareX journal. If you use PCAfold in a scientific publication, you can cite the software as:

Zdybał, K., Armstrong, E., Parente, A. and Sutherland, J.C., 2020. PCAfold: Python software to generate, analyze and improve PCA-derived low-dimensional manifolds. SoftwareX, 12, p.100630.

or using BibTeX:

@article{pcafold2020,
title = "PCAfold: Python software to generate, analyze and improve PCA-derived low-dimensional manifolds",
journal = "SoftwareX",
volume = "12",
pages = "100630",
year = "2020",
issn = "2352-7110",
doi = "https://doi.org/10.1016/j.softx.2020.100630",
url = "http://www.sciencedirect.com/science/article/pii/S2352711020303435",
author = "Kamila Zdybał and Elizabeth Armstrong and Alessandro Parente and James C. Sutherland"
}

Getting started#

Installation#

Dependencies#

PCAfold requires Python 3.7 and the following packages:

pip install Cython
pip install matplotlib
pip install numpy
pip install scipy
pip install termcolor
pip install tqdm
pip install scikit-learn
pip install tensorflow
Build from source#

Clone the PCAfold repository and move into the PCAfold directory created:

git clone http://gitlab.multiscale.utah.edu/common/PCAfold.git
cd PCAfold

Run the setup.py script as below to complete the installation:

python3.7 setup.py build_ext --inplace
python3.7 setup.py install

If the installation was successful, you are ready to import PCAfold!

Testing#

To run the regression tests, run the following from the base repository directory:

python3.7 -m unittest discover

To switch verbose on, use the -v flag.
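For example:

python3.7 -m unittest discover -v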

All tests should pass. If any of the tests fail and you can’t sort out why, please open an issue on GitLab.

Local documentation build#

To build the documentation locally, you need Sphinx installed on your machine, along with a few extensions:

pip install Sphinx
pip install sphinxcontrib-bibtex
pip install furo

Then, navigate to the docs/ directory and build the documentation using either of the following commands:

sphinx-build -b html . builddir

make html

The documentation main page, _build/html/index.html, can then be opened in a web browser.

On MacOS, you can open it directly from the terminal:

open _build/html/index.html

Plotting#

Some functions within PCAfold produce plots. Global styles for the plots, such as font types and sizes, are set using the PCAfold/styles.py file. This file can be updated with new settings that will be seen globally by all PCAfold modules. Re-build the project after changing the styles.py file:

python3.7 setup.py install

Note that all plotting functions return handles to the generated plots.

Workflows#

In this section, we present several popular workflows that can be achieved using functionalities of PCAfold. An overview for combining PCAfold modules into a complete workflow is presented in the diagram below:

_images/PCAfold-software-architecture.svg

Each module’s functionalities can also be used as a standalone tool for performing a specific task and can easily be combined with techniques from outside of this software.

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]
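For instance, a data set with \(N = 1000\) observations of \(Q = 10\) variables can be represented as a NumPy array of the corresponding shape:

import numpy as np

# Dummy data set with N=1000 observations (rows) and Q=10 variables (columns):
X = np.random.rand(1000,10)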

Below are brief descriptions of several workflows that utilize functionalities of PCAfold:

Data manipulation#

Basic data manipulation such as centering, scaling, outlier detection and removal or kernel density weighting of data sets can be achieved using the preprocess module.

Data clustering#

Data clustering can be achieved using the preprocess module. This functionality can be useful for data analysis or feature detection and can also be the first step for applying data reduction techniques locally (on local portions of the data). It is also worth pointing out that clustering algorithms from outside of PCAfold software can be brought into the workflow.

Data sampling#

Data sampling can be achieved using the preprocess module. A possible use case for sampling is splitting data sets into train and test samples for other machine learning algorithms. Another use case is sampling imbalanced data sets.

Global PCA#

Global PCA can be performed using the PCA class available in the reduction module.
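As a minimal sketch (following the PCA class usage shown in the examples later in this documentation, with an assumed choice of scaling and number of components):

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)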

Local PCA#

Local PCA can be performed using the LPCA class available in the reduction module.

PCA on sampled data sets#

PCA on sampled data sets can be performed by combining sampling techniques from the preprocess module with the PCA class available in the reduction module. The reduction module additionally contains a few functions specifically designed to help analyze the results of performing PCA on sampled data sets.
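A minimal sketch of such a combination, assuming a simple two-cluster classification and using the DataSampler and PCA classes documented elsewhere in this guide:

from PCAfold import DataSampler, PCA
import numpy as np

# Generate dummy data set and a dummy two-cluster classification:
X = np.random.rand(100,20)
idx = np.zeros((100,)).astype(int)
idx[50:] = 1

# Sample 20% of the data, taken equally from both clusters, as train data:
selection = DataSampler(idx, random_seed=100)
(idx_train, idx_test) = selection.number(20, test_selection_option=1)

# Perform PCA on the sampled (train) portion of the data set:
pca_train = PCA(X[idx_train,:], scaling='auto', n_components=2, use_eigendec=True, nocenter=False)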

Assessing manifold quality#

Once a low-dimensional manifold is obtained, the quality of the manifold can be assessed using functionalities available in the analysis module. It is worth noting that the manifold assessment metrics available can be equally applied to manifolds derived by means of techniques other than PCA.

Reconstructing quantities of interest (QoIs)#

Using the reconstruction module, quantities of interest (QoIs) can be reconstructed from the reduced data representations using kernel regression, artificial neural networks (ANN) and a novel approach called partition of unity networks (POUnets).

Improving projection topologies#

Two novel algorithms based on a quantitative cost function are introduced in the utilities module; they can help improve the topology of PCA projections through appropriate variable selection. An autoencoder-like strategy is also introduced that optimizes the projection topology directly, based on custom projection-independent and projection-dependent quantities of interest (QoIs).

Data preprocessing#

The preprocess module can be used for performing data preprocessing including centering and scaling, outlier detection and removal, kernel density weighting of data sets, data clustering and data sampling. It also includes functionalities that allow the user to perform initial data inspection such as computing conditional statistics, calculating statistically representative sample sizes, or ordering variables in a data set according to a criterion.

Note

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]

The general convention throughout this documentation is that \(i\) indexes observations and \(j\) indexes variables.

The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).


Data manipulation#

This section includes functions for performing basic data manipulation, such as centering and scaling, and outlier detection and removal.

center_scale#
PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#

Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).

  • Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):

\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]

Centers for each column are computed as:

\[c_j = mean(X_j)\]

with the only exceptions of '0to1' and '-1to1' scalings, which introduce a different quantity to center each column.

  • Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):

\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]

If both centering and scaling are applied:

\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]

Several scaling options are implemented here:

Scaling method | scaling | Scaling factor \(d_j\)
None | 'none' or '' | 1
Auto [PvdBHW+06] | 'auto' or 'std' | \(\sigma\)
Pareto [PNod08] | 'pareto' | \(\sqrt{\sigma}\)
VAST [PKEA+03] | 'vast' | \(\sigma^2 / mean(X_j)\)
Range [PvdBHW+06] | 'range' | \(max(X_j) - min(X_j)\)
0 to 1 | '0to1' | \(d_j = max(X_j) - min(X_j)\), \(c_j = min(X_j)\)
-1 to 1 | '-1to1' | \(d_j = 0.5 \cdot (max(X_j) - min(X_j))\), \(c_j = 0.5 \cdot (max(X_j) + min(X_j))\)
Level [PvdBHW+06] | 'level' | \(mean(X_j)\)
Max | 'max' | \(max(X_j)\)
Variance | 'variance' | \(var(X_j)\)
Median | 'median' | \(median(X_j)\)
Poisson [PKK04] | 'poisson' | \(\sqrt{mean(X_j)}\)
S1 | 'vast_2' | \(\sigma^2 k^2 / mean(X_j)\)
S2 | 'vast_3' | \(\sigma^2 k^2 / max(X_j)\)
S3 | 'vast_4' | \(\sigma^2 k^2 / (max(X_j) - min(X_j))\)
L2-norm | 'l2-norm' | \(\|X_j\|_2\)

where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).

The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PPS13].
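To make the table above concrete, a few of the column-wise scaling factors can be computed directly with NumPy (a sketch only; the standard-deviation convention used internally by PCAfold is not assumed here):

import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Illustrative scaling factors corresponding to a few entries of the table
# (np.std is shown with its default ddof=0; the library's convention may differ):
d_auto = np.std(X, axis=0)                       # 'auto' / 'std'
d_pareto = np.sqrt(np.std(X, axis=0))            # 'pareto'
d_range = np.max(X, axis=0) - np.min(X, axis=0)  # 'range'
d_level = np.mean(X, axis=0)                     # 'level'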

Example:

from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4', 'l2-norm'.

  • nocenter – (optional) bool specifying whether data should be centered by mean. If set to True data will not be centered.

Returns

  • X_cs - numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size (n_observations,n_variables).

  • X_center - numpy.ndarray specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).

  • X_scale - numpy.ndarray specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).

invert_center_scale#
PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#

Inverts whatever centering and scaling was done by the center_scale function:

\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]

Example:

from PCAfold import center_scale, invert_center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)

# Uncenter and unscale:
X = invert_center_scale(X_cs, X_center, X_scale)
Parameters
  • X_cs – numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size (n_observations,n_variables).

  • X_center – numpy.ndarray specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).

  • X_scale – numpy.ndarray specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).

Returns

  • X - numpy.ndarray specifying the original data set, \(\mathbf{X}\). It has size (n_observations,n_variables).

power_transform#
PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#

Performs a power transformation of the provided data. The equation for the transformation of variable \(X\) is

\[(|X + s_1|)^\alpha \text{sign}(X + s_1) + s_2 \text{sign}(X + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.

Example:

from PCAfold import power_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform power transformation:
X_pow = power_transform(X, 0.5)

# undo the transformation:
X_orig = power_transform(X_pow, 0.5, invert=True)
Parameters
  • X – array of the variable(s) to be transformed

  • transform_power – the power parameter used in the transformation equation

  • transform_shift – (optional, default 0.) the shift parameter used in the transformation equation

  • transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation

  • invert – (optional, default False) when True, will undo the transformation

Returns

array of the transformed variables

log_transform#
PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#

Performs log transformation of the original data set, \(\mathbf{X}\).

For an example original function:

_images/log_transform-original-function.svg

The symlog transformation can be obtained with method='symlog':

_images/log_transform-symlog.svg

The continuous symlog transformation can be obtained with method='continuous-symlog':

_images/log_transform-continuous-symlog.svg

Example:

from PCAfold import log_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform log transformation:
X_log = log_transform(X)

# Perform symlog transformation:
X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
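The continuous symlog transformation can be applied analogously:

# Perform continuous symlog transformation:
X_csymlog = log_transform(X, method='continuous-symlog', threshold=1.e-4)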
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • method – (optional) str specifying the log-transformation method. It can be one of the following: 'log', 'ln', 'symlog', 'continuous-symlog'.

  • threshold – (optional) float or int specifying the threshold for symlog transformation.

Returns

  • X_transformed - numpy.ndarray specifying the log-transformed data set. It has size (n_observations,n_variables).

remove_constant_vars#
PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#

Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:

  • The maximum of an absolute value of a column \(X_j\) is less than maxtol:

\[max(|X_j|) < \verb|maxtol|\]
  • The ratio of the range of values in a column \(X_j\) to \(max(|X_j|)\) is less than rangetol:

\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]

In particular, this function can be used as a preprocessing step for PCA so that the eigenvalue calculation does not break.

Example:

from PCAfold import remove_constant_vars
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Remove the constant column:
(X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • maxtol – (optional) float specifying the tolerance for \(max(|X_j|)\).

  • rangetol – (optional) float specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).

Returns

  • X_removed - numpy.ndarray specifying the original data set, \(\mathbf{X}\) with any constant columns removed. It has size (n_observations,n_variables).

  • idx_removed - list specifying the indices of columns removed from \(\mathbf{X}\).

  • idx_retained - list specifying the indices of columns retained in \(\mathbf{X}\).

order_variables#
PCAfold.preprocess.order_variables(X, method='mean', descending=True)#

Orders variables in the original data set, \(\mathbf{X}\), using a selected method.

Example:

from PCAfold import order_variables
import numpy as np

# Generate a dummy data set:
X = np.array([[100, 1, 10],
              [200, 2, 20],
              [300, 3, 30]])

# Order variables by the mean value in the descending order:
(X_ordered, idx) = order_variables(X, method='mean', descending=True)

The code above should return an ordered data set:

array([[100,  10,   1],
       [200,  20,   2],
       [300,  30,   3]])

and the list of ordered variable indices:

[1, 2, 0]
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • method – (optional) str or list of int specifying the ordering method. If str, it can be one of the following: 'mean', 'min', 'max', 'std' or 'var'. If list, it is a custom user-provided list of indices for how the variables should be ordered.

  • descending – (optional) bool specifying whether variables should be ordered in the descending order. If set to False, variables will be ordered in the ascending order.

Returns

  • X_ordered - numpy.ndarray specifying the original data set with ordered variables. It has size (n_observations,n_variables).

  • idx - list specifying the indices of the ordered variables. It has length n_variables.

Class PreProcessing#
class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#

Performs a composition of data manipulation done by remove_constant_vars and center_scale functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:

  • checks for constant columns in a data set and removes them,

  • centers and scales the data.

Example:

from PCAfold import PreProcessing
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Instantiate PreProcessing class object:
preprocessed = PreProcessing(X, 'range', nocenter=False)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • nocenter – (optional) bool specifying whether data should be centered by mean. If set to True data will not be centered.

Attributes:

  • X_removed - (read only) numpy.ndarray specifying the original data set with any constant columns removed. It has size (n_observations,n_variables).

  • idx_removed - (read only) list specifying the indices of columns removed from \(\mathbf{X}\).

  • idx_retained - (read only) list specifying the indices of columns retained in \(\mathbf{X}\).

  • X_cs - (read only) numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size (n_observations,n_variables).

  • X_center - (read only) numpy.ndarray specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It should be of size (n_variables,).

  • X_scale - (read only) numpy.ndarray specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It should be of size (n_variables,).
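Continuing the example above, the stored results can be accessed through the class attributes listed here:

# Access the centered and scaled data set, and the applied centers and scales:
X_cs = preprocessed.X_cs
X_center = preprocessed.X_center
X_scale = preprocessed.X_scale

# Access the indices of removed and retained columns:
idx_removed = preprocessed.idx_removed
idx_retained = preprocessed.idx_retained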

outlier_detection#
PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#

Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:

  • 'MULTIVARIATE TRIMMING'

Outliers are detected based on multivariate Mahalanobis distance, \(D_M\):

\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]

where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\) storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per PCA class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered as outliers. For more detailed information on Mahalanobis distance the user is referred to [PBis06] or [PDMJRM00].

The threshold above which observations will be classified as outliers can be specified using the trimming_threshold parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:

\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]
  • 'PC CLASSIFIER'

Outliers are detected based on major and minor principal components (PCs). The principal component classifier (PCC) method was first proposed in [PSCSC03]. The application of this technique to combustion data sets was studied in [PPS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier, based on the \(q\) first (major) PCs, satisfies:

\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]

or if the second PC classifier, based on the \((Q-k+1)\) last (minor) PCs, satisfies:

\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]

where \(z_{ij}\) is the \((i, j)^{th}\) element of the principal components matrix, \(\mathbf{Z}\), and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per the PCA class). Major PCs are selected such that the total variance they explain is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.

Coefficients \(c_1\) and \(c_2\) are found such that they represent the quantile_threshold quantile (98.99% by default) of the empirical distributions of the first and second PC classifiers, respectively.

Example:

from PCAfold import outlier_detection
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Find outliers:
(idx_outliers_removed, idx_outliers) = outlier_detection(X, scaling='auto', method='MULTIVARIATE TRIMMING', trimming_threshold=0.8, verbose=True)

# New data set without outliers can be obtained as:
X_outliers_removed = X[idx_outliers_removed,:]

# Observations that were classified as outliers can be obtained as:
X_outliers = X[idx_outliers,:]
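The 'PC CLASSIFIER' method can be used analogously, with the quantile_threshold parameter documented below:

# Find outliers with the principal component classifier:
(idx_outliers_removed, idx_outliers) = outlier_detection(X, scaling='auto', method='PC CLASSIFIER', quantile_threshold=0.9899, verbose=True)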
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • method – (optional) str specifying the outlier detection method to use. It should be 'MULTIVARIATE TRIMMING' or 'PC CLASSIFIER'.

  • trimming_threshold – (optional) float specifying the trimming threshold to use in combination with 'MULTIVARIATE TRIMMING' method.

  • quantile_threshold – (optional) float specifying the quantile threshold to use in combination with 'PC CLASSIFIER' method.

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx_outliers_removed - list specifying the indices of observations without outliers.

  • idx_outliers - list specifying the indices of observations that were classified as outliers.

representative_sample_size#
PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#

Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.

Two general approaches are available:

  • If method='kl-divergence', the representative sample size is computed based on Kullback-Leibler divergence.

  • If method='mean', method='median', method='variance', or method='std', the representative sample size is computed based on the convergence of a first-order statistic (mean or median) or a second-order statistic (variance or standard deviation).

Example:

from PCAfold import center_scale, representative_sample_size
import numpy as np

# Generate dummy data set and two dependent variables:
x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100))
xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None]))

phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2)))
phi_1 = phi_1.ravel()[:,None]

phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2)))
phi_2 = phi_2.ravel()[:,None]

depvars = np.column_stack((phi_1, phi_2))
depvars, _, _ = center_scale(depvars, scaling='0to1')

# Specify the list of percentages to explore:
percentages = list(np.linspace(1,99.9,200))

# Specify the list of thresholds for each dependent variable:
thresholds = [10**-4, 10**-4]

# Specify the names of the dependent variables:
variable_names = ['Phi-1', 'Phi-2']

# Compute representative sample size for each dependent variable:
(idx, sample_sizes, statistics) = representative_sample_size(depvars,
                                                             percentages,
                                                             thresholds=thresholds,
                                                             variable_names=variable_names,
                                                             method='kl-divergence',
                                                             statistics='median',
                                                             n_resamples=20,
                                                             random_seed=100,
                                                             verbose=True)

With verbose=True we will see some detailed information:

Dependent variable Phi-1 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data).


Dependent variable Phi-2 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
Parameters
  • depvars – numpy.ndarray specifying the dependent variables that should be well represented in a sampled dataset. It should be of size (n_observations,n_dependent_variables).

  • percentages – list of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.

  • thresholds – (optional) list of float specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.

  • variable_names – (optional) list of str specifying names for all dependent variables. If set to None, dependent variables are called with consecutive integers.

  • method – (optional) str specifying the method used to compute the sample size statistics. It can be 'mean', 'median', 'variance', 'std', or 'kl-divergence'.

  • statistics – (optional) str specifying the overall statistics that should be computed from a given method. It can be 'min', 'max', 'mean', or 'median'.

  • n_resamples – (optional) int specifying the number of resamples to perform for each percentage in the percentages vector. It is recommended to set this parameter above 1, since a single random sample might accidentally be statistically representative of the full dataset. Re-sampling helps average out the effect of such one-off “lucky” random samples.

  • random_seed – (optional) int specifying the random seed.

  • verbose – (optional) bool for printing verbose details.

Returns

  • threshold_idx - list of int specifying the highest indices from the percentages list where the representative number of samples condition was still met. It has length n_depvars. If the condition for a representative sample size was not met for a dependent variable, a value of -1 is returned in the list for that dependent variable.

  • representative_sample_sizes - numpy.ndarray of int specifying the representative number of samples. It has size (1,n_depvars). If the condition for a representative sample size was not met for a dependent variable, a value of -1 is returned in the array for that dependent variable.

  • sample_size_statistics - numpy.ndarray specifying the full vector of computed statistics corresponding to each entry in percentages and each dependent variable. It has size (n_percentages,n_depvars).

Class ConditionalStatistics#
class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#

Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:

  • conditional mean

  • conditional minimum

  • conditional maximum

  • conditional standard deviation

Other quantities can be added in the future at the user’s request.

Example:

from PCAfold import ConditionalStatistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Instantiate an object of the ConditionalStatistics class
# and compute conditional statistics in 10 bins of the conditioning variable:
cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10)

# Access conditional statistics:
conditional_mean = cond.conditional_mean
conditional_min = cond.conditional_minimum
conditional_max = cond.conditional_maximum
conditional_std = cond.conditional_standard_deviation

# Access the centroids of the created bins:
centroids = cond.centroids
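Alternatively, custom bin borders can be specified through the split_values parameter (a sketch; per the parameter description below, splits are then performed at the specified values rather than in \(k\) equal bins):

# Compute conditional statistics in bins defined by custom split values:
cond_custom = ConditionalStatistics(y[:,None], conditioning_variable, split_values=[-0.5, 0.0, 0.5])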
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. It should be of size (n_observations,1) or (n_observations,).

  • k – int specifying the number of bins to create in the conditioning variable. It has to be a positive number.

  • split_values – list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.

  • verbose – (optional) bool for printing verbose details.

Attributes:

  • idx - (read only) numpy.ndarray of cluster (bins) classifications. It has size (n_observations,).

  • borders - (read only) list of values that define borders for the clusters (bins). It has length k+1.

  • centroids - (read only) list of values that specify bins centers. It has length k.

  • conditional_mean - (read only) numpy.ndarray specifying the conditional means of all original variables in the \(k\) bins created. It has size (k,n_variables).

  • conditional_minimum - (read only) numpy.ndarray specifying the conditional minimums of all original variables in the \(k\) bins created. It has size (k,n_variables).

  • conditional_maximum - (read only) numpy.ndarray specifying the conditional maximums of all original variables in the \(k\) bins created. It has size (k,n_variables).

  • conditional_standard_deviation - (read only) numpy.ndarray specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size (k,n_variables).

Class KernelDensity#
class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#

Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on a single-variable or multi-variable case, as proposed in [PCGP12].

The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).

Note

The kernel density weighting technique is usually very expensive, even on data sets with a relatively small number of observations. Since the single-variable case is cheaper than the multi-variable case, it is recommended to try it first for larger data sets.

A Gaussian kernel is used in both approaches:

\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} exp(- \frac{d^2}{2 h^2})\]

\(h\) is the kernel bandwidth:

\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]

where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.

\(d\) is the distance between two observations \(c\) and \(c'\):

\[d = |x_c - x_{c'}|\]

Single-variable

If the conditioning_variable argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:

\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]

and weights are then computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Multi-variable

If the conditioning_variable argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for a \(k^{th}\) variable:

\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]

Global density taking into account all variables is then obtained as:

\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]

where \(Q\) is the total number of conditioning variables, and weights are computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Example:

from PCAfold import KernelDensity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Perform kernel density weighting based on the first variable:
kerneld = KernelDensity(X, X[:,0])

# Access the weighted data set:
X_weighted = kerneld.X_weighted

# Access the weights used to scale the data set:
weights = kerneld.weights
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • conditioning_variable – numpy.ndarray specifying either a single variable or multiple variables to be used as a conditioning variable for kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).

Attributes:

  • weights - numpy.ndarray specifying the computed weights, \(\mathbf{W_c}\). It has size (n_observations,1).

  • X_weighted - numpy.ndarray specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size (n_observations,n_variables).
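The multi-variable case follows the same pattern; a minimal sketch conditioning on the first three variables of the dummy data set:

# Perform kernel density weighting based on multiple variables:
kerneld_multi = KernelDensity(X, X[:,0:3])

# Access the weighted data set and the weights:
X_weighted_multi = kerneld_multi.X_weighted
weights_multi = kerneld_multi.weights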

Class DensityEstimation#
class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#

Enables density estimation on point-cloud data.

Example:

from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • n_neighbors – int specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.

DensityEstimation.average_knn_distance#
PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#

Computes the average Euclidean distance to the \(k\) nearest neighbors on a manifold defined by the independent variables.

Example:

from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute average distances on a manifold defined by the PCs:
average_distances = density_estimation.average_knn_distance(verbose=True)

With verbose=True, minimum, maximum and average distance will be printed:

Minimum distance:   0.1388300829487847
Maximum distance:   0.4689587542132183
Average distance:   0.20824964953425693
Median distance:    0.18333873029179215

Note

This function requires the scikit-learn module. You can install it through:

pip install scikit-learn

Parameters

verbose – (optional) bool for printing verbose details.

Returns

  • average_distances - numpy.ndarray specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size (n_observations,).

DensityEstimation.kth_nearest_neighbor_codensity#
PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#

Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables, as per [PCVJ21]. This value can be interpreted as a data codensity, defined as:

\[\delta_k(x) = d(x, v_k(x))\]

where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).

Example:

from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_codensity = density_estimation.kth_nearest_neighbor_codensity()

Note

This function requires the scikit-learn module. You can install it through:

pip install scikit-learn

Returns

  • data_codensity - numpy.ndarray specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size (n_observations,).

DensityEstimation.kth_nearest_neighbor_density#
PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#

Computes the inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables, as per [PCVJ21]. This value can be interpreted as a data density, defined as:

\[\rho_k(x) = \frac{1}{\delta_k(x)}\]

where \(\delta_k(x)\) is the codensity.

Example:

from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_density = density_estimation.kth_nearest_neighbor_density()

Note

This function requires the scikit-learn module. You can install it through:

pip install scikit-learn

Returns

  • data_density - numpy.ndarray specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size (n_observations,).


Data clustering#

This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [PELL09], [PKR09].

Clustering functions#

Each function that clusters the data set returns a vector of integers idx of type numpy.ndarray of size (n_observations,) that specifies classification of each observation from the original data set, \(\mathbf{X}\), to a local cluster.

_images/clustering-idx.svg

Note

The first cluster has index 0 within all idx vectors returned.

variable_bins#
PCAfold.preprocess.variable_bins(var, k, verbose=False)#

Clusters the data by dividing a variable vector var into bins of equal lengths.

An example of how a vector can be partitioned with this function is presented below:

_images/clustering-variable-bins.svg

Example:

from PCAfold import variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to bins of x:
(idx, borders) = variable_bins(x, 4, verbose=True)
Parameters
  • var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).

  • k – int specifying the number of clusters to create. It has to be a positive number.

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx - numpy.ndarray of cluster classifications. It has size (n_observations,).

  • borders - list of values that define borders for the clusters. It has length k+1.

predefined_variable_bins#
PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#

Clusters the data by dividing a variable vector var into bins such that splits are done at user-specified values. Split values can be specified in the split_values list. In general: split_values = [value_1, value_2, ..., value_n].

Note: When a split is performed at a given value_i, the observation in var that takes exactly that value is assigned to the newly created bin.

An example of how a vector can be partitioned with this function is presented below:

_images/clustering-predefined-variable-bins.svg

Example:

from PCAfold import predefined_variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to pre-defined bins of x:
(idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
Parameters
  • var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).

  • split_values – list specifying values at which splits should be performed.

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx - numpy.ndarray of cluster classifications. It has size (n_observations,).

  • borders - list of values that define borders for the clusters. It has length k+1.

mixture_fraction_bins#
PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#

Clusters the data by dividing a mixture fraction vector Z into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PPSTS09]. The vector is first split into the lean and rich sides (according to the stoichiometric mixture fraction Z_stoich) and then each side is divided further into clusters. When k is odd, there will always be one more cluster on the side with the larger range in mixture fraction space compared to the other side.

An example of how a vector can be partitioned with this function is presented below:

_images/clustering-mixture-fraction-bins.svg

Example:

from PCAfold import mixture_fraction_bins
import numpy as np

# Generate dummy mixture fraction variable:
Z = np.linspace(0,1,100)

# Create partitioning according to bins of mixture fraction:
(idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
Parameters
  • Z – numpy.ndarray specifying the mixture fraction values. It should be of size (n_observations,) or (n_observations,1).

  • k – int specifying the number of clusters to create. It has to be a positive number.

  • Z_stoich – float specifying the stoichiometric mixture fraction. It has to be between 0 and 1.

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx - numpy.ndarray of cluster classifications. It has size (n_observations,).

  • borders - list of values that define borders for the clusters. It has length k+1.

zero_neighborhood_bins#
PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#

Clusters the data by separating close-to-zero observations in a vector into one cluster (split_at_zero=False) or two clusters (split_at_zero=True). The offset from zero at which splits are performed is computed based on the input parameter zero_offset_percentage:

\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]

Further clusters are found by splitting positive and negative values in a vector alternatingly into bins of equal lengths.

This clustering technique can be useful for partitioning any variable that has many observations clustered around zero value and relatively few observations far away from zero on either side.

Two examples of how a vector can be partitioned with this function are presented below:

  • With split_at_zero=False:

_images/clustering-zero-neighborhood-bins.svg

If split_at_zero=False, the smallest allowed number of clusters is 3. This ensures that there are at least three clusters: one with negative values, one with close-to-zero values, and one with positive values.

When k is even, there will always be one more cluster on the side with larger range compared to the other side.

  • With split_at_zero=True:

_images/clustering-zero-neighborhood-bins-zero-split.svg

If split_at_zero=True, the smallest allowed number of clusters is 4. This ensures that there are at least four clusters: one with negative values, one with negative values close to zero, one with positive values close to zero, and one with positive values.

When k is odd, there will always be one more cluster on the side with larger range compared to the other side.

Note

This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\), (as per [TSP09]) since it relies on unbalanced vectors that have many observations numerically close to zero. Using split_at_zero=True it can further differentiate between negative and positive sources.

Example:

from PCAfold import zero_neighborhood_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-100,100,1000)

# Create partitioning according to bins of x:
(idx, borders) = zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
Parameters
  • var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).

  • k – int specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 if split_at_zero=False or smaller than 4 if split_at_zero=True.

  • zero_offset_percentage – (optional) percentage of \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero value. For instance, set zero_offset_percentage=10 if you want 10% as offset.

  • split_at_zero – (optional) bool specifying whether partitioning should be done at var=0.

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx - numpy.ndarray of cluster classifications. It has size (n_observations,).

  • borders - list of values that define borders for the clusters. It has length k+1.

Auxiliary functions#
degrade_clusters#
PCAfold.preprocess.degrade_clusters(idx, verbose=False)#

Renumbers clusters if either of the following is true:

  • idx is composed of non-consecutive integers, or

  • the smallest cluster index in idx is not equal to 0.

Example:

from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 2, 0, 5, 10])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)

The code above will produce:

>>> idx_degraded
array([0, 0, 1, 0, 2, 3])

Alternatively:

from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([1, 1, 2, 2, 3, 3])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)

will produce:

>>> idx_degraded
array([0, 0, 1, 1, 2, 2])
Parameters
  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing verbose details.

Returns

  • idx_degraded - numpy.ndarray of degraded cluster classifications. It has size (n_observations,).

  • k_update - int specifying the updated number of clusters.

flip_clusters#
PCAfold.preprocess.flip_clusters(idx, dictionary)#

Flips the cluster labelling according to instructions provided in a dictionary. For dictionary = {key : value}, a cluster numbered key will be assigned the number value.

Example:

from PCAfold import flip_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0,0,0,1,1,1,1,2,2])

# Swap cluster number 1 with cluster number 2:
flipped_idx = flip_clusters(idx, {1:2, 2:1})

The code above will produce:

>>> flipped_idx
array([0, 0, 0, 2, 2, 2, 2, 1, 1])

Note

This function can also be used to merge clusters. Using the idx from the example above, if we call:

flipped_idx = flip_clusters(idx, {2:1})

the result will be:

>>> flipped_idx
array([0,0,0,1,1,1,1,1,1])

where clusters 1 and 2 have been merged into one cluster numbered 1.

Parameters
  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • dictionary – dict specifying instructions for cluster label flipping.

Returns

  • flipped_idx - numpy.ndarray specifying the re-labelled cluster classifications. It has size (n_observations,).

get_centroids#
PCAfold.preprocess.get_centroids(X, idx)#

Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the idx vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster is computed as:

\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]

Centroids for all variables from all clusters are stored in the matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\) returned:

\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]

Example:

from PCAfold import get_centroids
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute the centroids of each cluster:
centroids = get_centroids(X, idx)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

Returns

  • centroids - numpy.ndarray specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size (k,n_variables).

get_partition#
PCAfold.preprocess.get_partition(X, idx)#

Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to idx provided.

Example:

from PCAfold import get_partition
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Generate partitioning of the data set according to idx:
(X_in_clusters, idx_in_clusters) = get_partition(X, idx)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

Returns

  • X_in_clusters - list of \(k\) numpy.ndarray that contains original data set observations partitioned to \(k\) clusters. It has length k.

  • idx_in_clusters - list of \(k\) numpy.ndarray that contains indices of the original data set observations partitioned to \(k\) clusters. It has length k.

get_populations#
PCAfold.preprocess.get_populations(idx)#

Computes populations (number of observations) in clusters specified in the idx vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster this function will return a list: [100, 500].

Example:

from PCAfold import variable_bins, get_populations
import numpy as np

# Generate dummy partitioning:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=True)

# Compute cluster populations:
populations = get_populations(idx)

The code above will produce:

>>> populations
[25, 25, 25, 25]
Parameters

idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

Returns

  • populations - list of cluster populations. Each entry refers to one cluster ordered according to idx. It has length k.

get_average_centroid_distance#
PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#

Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.

The average can be computed as an arithmetic average from all clusters (weighted=False) or as a weighted average (weighted=True). In the latter, the distances are weighted by the number of observations in a cluster so that the average centroid distance will approach the average distance in the largest cluster.

Example:

from PCAfold import get_average_centroid_distance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute average distance from cluster centroids:
average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • weighted – (optional) bool specifying whether distances from centroid should be weighted by the number of observations in a cluster. If set to False, arithmetic average will be computed.

Returns

  • average_centroid_distance - float specifying the average distance from centroids, averaged over all observations and all clusters.


Data sampling#

This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting, which can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [PNey92], such as drawing a certain number of samples from local clusters [PMMD10], [PGSB04]. These functionalities can be specifically used to tackle imbalanced data sets [PHG09], [PRLM+16].

The general idea is to divide the entire data set X (or its portion) into train and test samples as presented below:

_images/tts-train-test-select.svg

Train data is always sampled in the same way for a given sampling function. Depending on the option selected, test data will be sampled differently: either as all remaining samples that were not included in the train data, or as a subset of those. You can select the option by setting the test_selection_option parameter for each sampling function. Refer to the documentation of a specific sampling function to see which options are available.

All splitting functions in this module return a tuple of two variables: (idx_train, idx_test). Both idx_train and idx_test are vectors of integers of type numpy.ndarray and of size (_,). These variables contain indices of observations that went into train data and test data respectively.

In your model learning algorithm you can then get the train and test observations, for instance in the following way:

X_train = X[idx_train,:]
X_test = X[idx_test,:]

All functions are equipped with a verbose parameter. If it is set to True, some additional information on the train and test selection is printed.

Note

It is assumed that the first cluster has index 0 within all input idx vectors.

Class DataSampler#
class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#

Enables selecting train and test data samples.

Example:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
Parameters
  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • idx_test – (optional) numpy.ndarray specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices in idx_test and the test data returned will be the same as the user-provided idx_test. If not specified, test samples will be selected according to the test_selection_option parameter (see documentation for each sampling function). Setting a fixed idx_test parameter may be useful if training a machine learning model on specific test samples is desired. It should be of size (n_test_samples,) or (n_test_samples,1).

  • random_seed – (optional) int specifying random seed for random sample selection.

  • verbose – (optional) bool for printing verbose details.

DataSampler.number#
PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#

Uses classifications into \(k\) clusters and samples a fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.

Example:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.number(20, test_selection_option=1)

Train data:

The number of train samples is estimated based on the percentage perc provided. First, the total number of training samples is estimated as the percentage perc of the total number of observations n_observations in the data set. Next, this number is divided equally among the \(k\) clusters. The result, n_of_samples, is the number of samples that will be selected from each cluster:

\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]
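
For example, a quick check of this formula with hypothetical numbers:

# Hypothetical numbers, for illustration only:
perc = 20
n_observations = 1000
k = 4

n_of_samples = int(perc * n_observations / (k * 100))

print(n_of_samples)
# 50 samples will be selected from each cluster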

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the smallest cluster is found and its remaining \(m\) observations (those not taken as train data) become the test data in that cluster. Next, the same number of samples, \(m\), is taken from all remaining, larger clusters.

The scheme below presents graphically how train and test data can be selected using test_selection_option parameter:

_images/sampling-test-selection-option-number.svg

Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).

Parameters
  • perc – percentage of data to be selected as training data from the entire data set. For instance, set perc=20 if you want to select 20%.

  • test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.

Returns

  • idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).

  • idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).

DataSampler.percentage#
PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#

Uses classifications into \(k\) clusters and samples a certain percentage perc from every cluster as the training data.

Example:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.percentage(20, test_selection_option=1)

Note: If the cluster sizes are comparable, this function will give a train sample distribution similar to random sampling (DataSampler.random). This sampling can be useful in cases where one cluster is significantly smaller than the others and there is a chance that this cluster would not be covered in the train data if random sampling were used.

Train data:

The number of train samples is estimated based on the percentage perc provided. First, the size of the \(i^{th}\) cluster, cluster_size_i, is computed and then the percentage perc of that number is selected.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1 all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2 the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using test_selection_option parameter:

_images/sampling-test-selection-option-percentage.svg

Here \(p\) is the percentage perc provided.

Parameters
  • perc – percentage of data to be selected as training data from each cluster. For instance, set perc=20 if you want to select 20%.

  • test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.

Returns

  • idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).

  • idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).

DataSampler.manual#
PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#

Uses classifications into \(k\) clusters and a dictionary, sampling_dictionary, in which you manually specify what 'percentage' (or what 'number') of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as per idx and the dictionary values are either the percentage or the number of train samples to be selected. By default, the dictionary values are interpreted as percentages, but you can select sampling_type='number' in order to interpret the values as a number of samples.

Example:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)

Train data:

The number of train samples selected from each cluster is determined by the sampling_dictionary. For each key : value pair, value percent (or value number) of samples will be selected from cluster key.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1 all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2 the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using test_selection_option parameter:

_images/sampling-test-selection-option-manual.svg

Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as number or as percentage. This can be achieved by setting:

sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
Parameters
  • sampling_dictionary – dict specifying manual sampling. Keys are cluster classifications and values are either the percentage or the number of samples to be taken from that cluster. Keys should match the cluster classifications as per idx.

  • sampling_type – (optional) str specifying whether percentage or number is given in the sampling_dictionary. Available options: percentage or number. The default is percentage.

  • test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.

Returns

  • idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).

  • idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).

DataSampler.random#
PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#

Samples train data at random from the entire data set.

Example:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)

Due to the nature of this sampling technique, it is not necessary to have idx classifications since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummy idx vector that has the same number of observations n_observations as your data set. For instance:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
n_observations = 100
idx = np.zeros(n_observations)

# Instantiate DataSampler class object:
selection = DataSampler(idx)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)

Train data:

The total number of train samples is computed as a percentage perc from the total number of observations in a data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1 all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2 the same procedure is used to select test data as was used to select train data (only allowed if perc is less than 50%).

The scheme below presents graphically how train and test data can be selected using test_selection_option parameter:

_images/sampling-test-selection-option-random.svg

Here \(p\) is the percentage perc provided.

Parameters
  • perc – percentage of data to be selected as training data from the entire data set. Set perc=20 if you want to select 20%.

  • test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.

Returns

  • idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).

  • idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).


Plotting functions#

This section includes functions for data preprocessing related plotting such as visualizing the formed clusters, visualizing the selected train and test samples or plotting the conditional statistics.

plot_2d_clustering#
PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a two-dimensional manifold divided into clusters. The number of observations in each cluster will be shown in the legend.

Example:

from PCAfold import variable_bins, plot_2d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_2d_clustering(x,
                     y,
                     idx,
                     x_label='$x$',
                     y_label='$y$',
                     color_map='viridis',
                     first_cluster_index_zero=False,
                     grid_on=True,
                     figure_size=(10,6),
                     title='x-y data set',
                     save_filename='clustering.pdf')
plt.close()
Parameters
  • x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points is plotted.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • alphas – (optional) list specifying the opacity of each cluster.

  • first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False the first cluster will be indexed 1.

  • grid_on – (optional) bool specifying whether a grid should be plotted.

  • s – (optional) int or float specifying the scatter point size.

  • markerscale – (optional) int or float specifying the scale for the legend marker.

  • legend – (optional) bool specifying whether the legend should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_3d_clustering#
PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a three-dimensional manifold divided into clusters. The number of observations in each cluster will be shown in the legend.

Example:

from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_3d_clustering(x,
                         y,
                         z,
                         idx,
                         x_label='$x$',
                         y_label='$y$',
                         z_label='$z$',
                         color_map='viridis',
                         first_cluster_index_zero=False,
                         figure_size=(10,6),
                         title='x-y-z data set',
                         save_filename='clustering.pdf')
plt.close()
Parameters
  • x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • z – numpy.ndarray specifying the variable on the \(z\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • elev – (optional) elevation angle.

  • azim – (optional) azimuth angle.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • z_label – (optional) str specifying \(z\)-axis label annotation. If set to None label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • alphas – (optional) list specifying the opacity of each cluster.

  • first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False the first cluster will be indexed 1.

  • s – (optional) int or float specifying the scatter point size.

  • markerscale – (optional) int or float specifying the scale for the legend marker.

  • legend – (optional) bool specifying whether the legend should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_2d_train_test_samples#
PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#

Plots a two-dimensional manifold divided into train and test samples. The number of observations in the train and test data, respectively, will be shown in the legend.

Example:

from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, borders) = variable_bins(x, 4, verbose=False)

# Generate dummy sampling of the data set:
sample = DataSampler(idx, random_seed=None, verbose=True)
(idx_train, idx_test) = sample.number(40, test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x,
                                 y,
                                 idx,
                                 idx_train,
                                 idx_test,
                                 x_label='$x$',
                                 y_label='$y$',
                                 color_map='viridis',
                                 first_cluster_index_zero=False,
                                 grid_on=True,
                                 figure_size=(12,6),
                                 title='x-y data set',
                                 save_filename='sampling.pdf')
plt.close()
Parameters
  • x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • idx_train – numpy.ndarray specifying the indices of the train data. It should be of size (n_train,) or (n_train,1).

  • idx_test – numpy.ndarray specifying the indices of the test data. It should be of size (n_test,) or (n_test,1).

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False the first cluster will be indexed 1.

  • grid_on – (optional) bool specifying whether a grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_conditional_statistics#
PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#

Plots a two-dimensional manifold given by variable and conditioning_variable and the selected conditional statistics (as per preprocess.ConditionalStatistics).

Example:

from PCAfold import PCA, plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional statistics:
plt = plot_conditional_statistics(y,
                                  conditioning_variable,
                                  k=10,
                                  x_label='$x$',
                                  y_label='$y$',
                                  figure_size=(10,3),
                                  title='Conditional mean',
                                  save_filename='conditional-mean.pdf')
plt.close()
Parameters
  • variable – numpy.ndarray specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • k – int specifying the number of bins to create in the conditioning variable. It has to be a positive number.

  • split_values – list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.

  • statistics_to_plot – list of str specifying the conditional statistics to plot. The strings can be mean, min, max or std.

  • color – (optional) vector or string specifying the color for the manifold. If it is a vector, it has to have length consistent with the number of observations. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, data will be plotted in black.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • colorbar_label – (optional) string specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • color_map – (optional) colormap to use as per matplotlib.cm. Default is viridis.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.


Bibliography#

PBis06

Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

PCVJ21(1,2)

Gunnar Carlsson and Mikael Vejdemo-Johansson. Topological Data Analysis with Applications. Cambridge University Press, 2021.

PCGP12

Axel Coussement, Olivier Gicquel, and Alessandro Parente. Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9):2844–2855, 2012.

PDMJRM00

Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.

PELL09

Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009. ISBN 0340761199.

PGSB04

Abdul A. Gill, George D. Smith, and Anthony J. Bagnall. Improving decision tree performance through induction-and cluster-based stratified sampling. In International Conference on Intelligent Data Engineering and Automated Learning, 339–344. Springer, 2004.

PHG09

Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

PKR09

Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Volume 344. John Wiley & Sons, 2009.

PKK04

Michael R Keenan and Paul G Kotula. Accounting for poisson noise in the multivariate analysis of tof-sims spectrum images. Surface and Interface Analysis: An International Journal devoted to the development and application of techniques for the analysis of surfaces, interfaces and thin films, 36(3):203–212, 2004.

PKEA+03

Hector C Keun, Timothy MD Ebbels, Henrik Antti, Mary E Bollard, Olaf Beckonert, Elaine Holmes, John C Lindon, and Jeremy K Nicholson. Improved analysis of multivariate data by variable stability scaling: application to nmr-based metabolic profiling. Analytica chimica acta, 490(1-2):265–276, 2003.

PMMD10

Robert J. May, Holger R. Maier, and Graeme C. Dandy. Data splitting for artificial neural networks using som-based stratified sampling. Neural Networks, 23(2):283–294, 2010.

PNey92

Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in Statistics, pages 123–150. Springer, 1992.

PNod08

Isao Noda. Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure, 883:216–227, 2008.

PPS13(1,2)

Alessandro Parente and James C. Sutherland. Principal component analysis of turbulent combustion data: data pre-processing and manifold sensitivity. Combustion and flame, 160(2):340–350, 2013.

PPSTS09

Alessandro Parente, James C. Sutherland, Leonardo Tognotti, and Philip J. Smith. Identification of low-dimensional manifolds in turbulent flames. Proceedings of the Combustion Institute, 32(1):1579–1586, 2009.

PRLM+16

Mojdeh Rastgoo, Guillaume Lemaitre, Joan Massich, Olivier Morel, Franck Marzani, Rafael Garcia, and Fabrice Meriaudeau. Tackling the problem of data imbalancing for melanoma classification. BIOSTEC - 3rd International Conference on BIOIMAGING, 2016.

PSCSC03

Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, 2003.

PvdBHW+06(1,2,3)

Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7(1):1–15, 2006.

Data reduction#

The reduction module contains functions for performing Principal Component Analysis (PCA).

Note

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]

The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.

The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
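
As a minimal illustration of this format, a dummy data set with 100 observations of 4 variables can be generated as follows:

import numpy as np

# 100 observations (rows) of 4 variables (columns):
X = np.random.rand(100,4)

print(X.shape)
# (100, 4), i.e. (n_observations, n_variables)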


Principal Component Analysis#

Class PCA#
class PCAfold.reduction.PCA(X, scaling='std', n_components=0, use_eigendec=True, nocenter=False)#

Enables performing Principal Component Analysis (PCA) of the original data set, \(\mathbf{X}\). For more detailed information on the theory of PCA the user is referred to [RJolliffe02].

Two options for performing PCA are implemented:

  • Eigendecomposition of the covariance matrix - set use_eigendec=True (default).

  • Singular Value Decomposition (SVD) - set use_eigendec=False.

In both cases, the data set is first centered and scaled (as per preprocess.center_scale function):

  • If nocenter=False: \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)

  • If nocenter=True: \(\mathbf{X_{cs}} = \mathbf{X} \cdot \mathbf{D}^{-1}\)

With use_eigendec=True, eigendecomposition is performed on the covariance matrix \(\mathbf{S}\); with use_eigendec=False, SVD is performed on the centered and scaled data set: \(\mathbf{X_{cs}} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{\mathbf{T}}\).

Modes:

  • Eigendecomposition: eigenvectors \(\mathbf{A}\)

  • SVD: \(\mathbf{A} = \mathbf{V}\)

Amplitudes:

  • Eigendecomposition: eigenvalues \(\mathbf{L}\)

  • SVD: \(\mathbf{L} = diag(\mathbf{\Sigma})\)

Note: For simplicity, we will from now on refer to \(\mathbf{A}\) as the matrix of eigenvectors and to \(\mathbf{L}\) as the vector of eigenvalues, irrespective of the method used to perform PCA.

Covariance matrix is computed at the class initialization as:

\[\mathbf{S} = \frac{1}{N-1} \mathbf{X_{cs}}^{\mathbf{T}} \mathbf{X_{cs}}\]

where \(N\) is the number of observations in the original data set, \(\mathbf{X}\).
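
This definition can be checked directly against the class attributes; a minimal sketch, assuming the X_cs and S attributes listed below behave as documented:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto')

# Recompute the covariance matrix from the centered and scaled data set:
N = X.shape[0]
S_manual = np.dot(pca_X.X_cs.T, pca_X.X_cs) / (N - 1)

print(np.allclose(S_manual, pca_X.S))
# Expected: True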

Loadings matrix, \(\mathbf{l}\), is computed at the class initialization such that the element \(\mathbf{l}_{ij}\) is the corresponding scaled element of the eigenvectors matrix, \(\mathbf{A}_{ij}\):

\[\mathbf{l}_{ij} = \frac{\mathbf{A}_{ij} \sqrt{\mathbf{L}_j}}{\sqrt{\mathbf{S}_{ii}}}\]

where \(\mathbf{L}_j\) is the \(j^{th}\) eigenvalue and \(\mathbf{S}_{ii}\) is the \(i^{th}\) element on the diagonal of the covariance matrix, \(\mathbf{S}\).
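
Similarly, the loadings definition can be reproduced element-wise with NumPy; a minimal sketch, assuming the A, L, S and loadings attributes listed below behave as documented:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='range')

# Recompute the loadings from the eigenvectors, eigenvalues and covariance matrix:
l_manual = pca_X.A * np.sqrt(pca_X.L)[None,:] / np.sqrt(np.diag(pca_X.S))[:,None]

print(np.allclose(l_manual, pca_X.loadings))
# Expected: True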

The variance accounted for in each individual variable by the first \(q\) PCs, \(\mathbf{t_q}\), is computed at the class initialization:

\[\mathbf{t}_{\mathbf{q}i} = \sum_{j=1}^{q} \Bigg( \frac{\mathbf{A}_{ij} \sqrt{\mathbf{L}_j}}{ s_i } \Bigg)^2\]

where \(q\) is the number of retained principal components and \(s_i\) is the standard deviation of the \(i^{th}\) variable in the data set.

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Access the eigenvectors:
A = pca_X.A

# Access the eigenvalues:
L = pca_X.L

# Access the loadings:
l = pca_X.loadings

# Access the variance accounted for each variable:
tq = pca_X.tq
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • n_components – (optional) int specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.

  • use_eigendec

    (optional) bool specifying the method for obtaining eigenvalues and eigenvectors:

    • use_eigendec=True uses eigendecomposition of the covariance matrix (from numpy.linalg.eigh)

    • use_eigendec=False uses Singular Value Decomposition (SVD) (from scipy.linalg.svd)

  • nocenter – (optional) bool specifying whether the original data set should be centered by the mean.

Attributes:

  • n_components - (can be re-set) number of retained principal components, \(q\).

  • n_components_init - (read only) number of retained principal components, \(q\), with which PCA class object was initialized.

  • scaling - (read only) scaling criteria with which PCA class object was initialized.

  • n_variables - (read only) number of variables of the original data set on which PCA class object was initialized.

  • X_cs - (read only) centered and scaled data set \(\mathbf{X_{cs}}\).

  • X_center - (read only) vector of centers, \(\mathbf{C}\), applied on the original data set \(\mathbf{X}\).

  • X_scale - (read only) vector of scales, \(\mathbf{D}\), applied on the original data set \(\mathbf{X}\).

  • S - (read only) covariance matrix, \(\mathbf{S}\).

  • L - (read only) vector of eigenvalues, \(\mathbf{L}\).

  • A - (read only) matrix of eigenvectors, \(\mathbf{A}\), (vectors are stored in columns, rows correspond to weights).

  • loadings - (read only) loadings, \(\mathbf{l}\), (vectors are stored in columns, rows correspond to weights).

  • tq - (read only) variance accounted for in each individual variable, \(\mathbf{t_q}\).

  • tqj - (read only) variance accounted for in each individual variable in each PC, \(\mathbf{t_{q,j}}\).

PCA.transform#
PCAfold.reduction.PCA.transform(self, X, nocenter=False)#

Transforms any original data set, \(\mathbf{X}\), to a new truncated basis, \(\mathbf{A}_q\), identified by PCA. It computes the \(q\) first principal components, \(\mathbf{Z}_q\), given the original data.

If nocenter=False:

\[\mathbf{Z}_q = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]

If nocenter=True:

\[\mathbf{Z}_q = \mathbf{X} \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]

Here \(\mathbf{C}\) and \(\mathbf{D}\) are centers and scales computed during PCA class initialization and \(\mathbf{A}_q\) is the matrix of \(q\) first eigenvectors extracted from \(\mathbf{A}\).

Warning

Set nocenter=True only if you know what you are doing.

One example when nocenter should be set to True is when transforming chemical source terms, \(\mathbf{S_X}\), to principal components space (as per [TSP09]) to obtain sources of principal components, \(\mathbf{S_Z}\). In that case \(\mathbf{X} = \mathbf{S_X}\) and the transformation should be performed without centering:

\[\mathbf{S}_{\mathbf{Z}, q} = \mathbf{S_X} \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]
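
A minimal sketch of this use case is shown below; S_X is a dummy stand-in for an actual array of chemical source terms corresponding to the same variables as \(\mathbf{X}\):

from PCAfold import PCA
import numpy as np

# Generate dummy data set and dummy source terms:
X = np.random.rand(100,20)
S_X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Transform the source terms without centering:
S_Z_q = pca_X.transform(S_X, nocenter=True)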

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)
Parameters
  • X – numpy.ndarray specifying the data set, \(\mathbf{X}\), to transform. It should be of size (n_observations,n_variables). Note that it does not need to be the same data set that was used to construct the PCA class object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.

  • nocenter – (optional) bool specifying whether PCA.X_center centers should be applied to center the data set before transformation. If nocenter=True centers will not be applied on the data set.

Returns

  • principal_components - numpy.ndarray specifying the \(q\) first principal components \(\mathbf{Z}_q\). It has size (n_observations,n_components).

PCA.reconstruct#
PCAfold.reduction.PCA.reconstruct(self, principal_components, nocenter=False)#

Calculates rank-\(q\) reconstruction of the data set from the \(q\) first principal components, \(\mathbf{Z}_q\).

If nocenter=False:

\[\mathbf{X_{rec}} = \mathbf{Z}_q \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D} + \mathbf{C}\]

If nocenter=True:

\[\mathbf{X_{rec}} = \mathbf{Z}_q \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D}\]

Here \(\mathbf{C}\) and \(\mathbf{D}\) are centers and scales computed during PCA class initialization and \(\mathbf{A}_q\) is the matrix of \(q\) first eigenvectors extracted from \(\mathbf{A}\).

Warning

Set nocenter=True only if you know what you are doing.

One example when nocenter should be set to True is when reconstructing chemical source terms, \(\mathbf{S_X}\), (as per [TSP09]) from the \(q\) first sources of principal components, \(\mathbf{S}_{\mathbf{Z}, q}\). In that case \(\mathbf{Z}_q = \mathbf{S}_{\mathbf{Z}, q}\) and the reconstruction should be performed without uncentering:

\[\mathbf{S_{X, rec}} = \mathbf{S}_{\mathbf{Z}, q} \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D}\]

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Calculate the reconstructed variables:
X_rec = pca_X.reconstruct(principal_components)
Parameters
  • principal_components – numpy.ndarray of the \(q\) first principal components, \(\mathbf{Z}_q\). It should be of size (n_observations,n_components).

  • nocenter – (optional) bool specifying whether PCA.X_center centers should be applied to un-center the reconstructed data set. If nocenter=True centers will not be applied on the reconstructed data set.

Returns

  • X_rec - rank-\(q\) reconstruction of the original data set.

PCA.get_weights_dictionary#
PCAfold.reduction.PCA.get_weights_dictionary(self, variable_names, pc_index, n_digits=10)#

Creates a dictionary where keys are the names of the variables in the original data set \(\mathbf{X}\) and values are the eigenvector weights corresponding to the principal component selected by pc_index. This function helps in accessing weight value for a specific variable and for a specific PC.

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy variables names:
variable_names = ['A1', 'A2', 'A3', 'A4', 'A5']

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=0, use_eigendec=True, nocenter=False)

# Create a dictionary for PC-1 weights:
PC1_weights_dictionary = pca_X.get_weights_dictionary(variable_names, 0, n_digits=8)

The code above will create a dictionary:

{'A1': 0.63544443,
 'A2': -0.39500424,
 'A3': -0.28819465,
 'A4': 0.57000796,
 'A5': 0.17949037}

Eigenvector weight for a specific variable can then be accessed by:

PC1_weights_dictionary['A3']
Parameters
  • variable_names – list of str specifying names for all variables in the original data set, \(\mathbf{X}\).

  • pc_index – non-negative int specifying the index of the PC to create the dictionary for. Set pc_index=0 if you want to look at the first PC.

  • n_digits – (optional) non-negative int specifying how many digits should be kept in rounding the eigenvector weights.

Returns

  • weights_dictionary - dict of variable names as keys and selected eigenvector weights as values.

PCA.u_scores#
PCAfold.reduction.PCA.u_scores(self, X)#

Calculates the U-scores (principal components):

\[\mathbf{U_{scores}} = \mathbf{X_{cs}} \mathbf{A}_q\]

This function is equivalent to PCA.transform.

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the U-scores:
u_scores = pca_X.u_scores(X)
Parameters

X – data set to transform. Note that it does not need to be the same data set that was used to construct the PCA object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.

Returns

  • u_scores - U-scores (principal components).

PCA.w_scores#
PCAfold.reduction.PCA.w_scores(self, X)#

Calculates the W-scores which are the principal components scaled by the inverse square root of the corresponding eigenvalue:

\[\mathbf{W_{scores}} = \frac{\mathbf{Z}_q}{\sqrt{\mathbf{L_q}}}\]

where \(\mathbf{L_q}\) are the \(q\) first eigenvalues extracted from \(\mathbf{L}\). The W-scores are still uncorrelated and have variances equal to unity.
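
This relation can be checked with a minimal sketch, assuming the L and n_components attributes and PCA.transform behave as documented:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# Scale the principal components by the inverse square roots of the eigenvalues:
w_manual = pca_X.transform(X) / np.sqrt(pca_X.L[:pca_X.n_components])

print(np.allclose(w_manual, pca_X.w_scores(X)))
# Expected: True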

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the W-scores:
w_scores = pca_X.w_scores(X)
Parameters

X – data set to transform. Note that it does not need to be the same data set that was used to construct the PCA object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.

Returns

  • w_scores - W-scores (scaled principal components).

PCA.calculate_r2#
PCAfold.reduction.PCA.calculate_r2(self, X)#

Calculates coefficient of determination, \(R^2\), values for the rank-\(q\) reconstruction, \(\mathbf{X_{rec}}\), of the original data set, \(\mathbf{X}\):

\[R^2_j = 1 - \frac{\sum_{i=1}^N (\mathbf{X}_{ij} - \mathbf{X_{rec}}_{ij})^2}{\sum_{i=1}^N (\mathbf{X}_{ij} - mean(\mathbf{X}_j))^2}\]

where \(\mathbf{X}_j\) is the \(j^{th}\) column (variable) of \(\mathbf{X}\), \(\mathbf{X_{rec}}_j\) is the \(j^{th}\) column of \(\mathbf{X_{rec}}\) and \(N\) is the number of observations in \(\mathbf{X}\). A separate \(R^2_j\) value is computed for each variable.

If all of the eigenvalues are retained, \(R^2\) will be equal to unity.
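
The definition above can be reproduced per variable with NumPy; a minimal sketch, assuming PCA.transform, PCA.reconstruct and PCA.calculate_r2 behave as documented:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# Rank-10 reconstruction of the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Coefficient of determination computed per variable (column):
r2_manual = 1.0 - np.sum((X - X_rec)**2, axis=0) / np.sum((X - np.mean(X, axis=0))**2, axis=0)

print(np.allclose(r2_manual, pca_X.calculate_r2(X)))
# Expected: True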

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the R2 values:
r2 = pca_X.calculate_r2(X)
Parameters

X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

Returns

  • r2 - numpy.ndarray specifying the coefficient of determination values \(R^2\) for the rank-\(q\) reconstruction of the original data set. It has size (n_variables,).

PCA.r2_convergence#
PCAfold.reduction.PCA.r2_convergence(self, X, n_pcs, variable_names=[], print_width=10, verbose=False, save_filename=None)#

Returns, and optionally prints and/or saves to a .txt file, the \(R^2\) values (as per the PCA.calculate_r2 function) for the reconstruction of the original data set, \(\mathbf{X}\), as a function of the number of retained principal components (PCs).

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=3)

# Compute and print convergence of R2 values:
r2 = pca_X.r2_convergence(X, n_pcs=3, variable_names=['X1', 'X2', 'X3'], print_width=10, verbose=True)

The code above will print \(R^2\) values retaining 1-3 PCs:

| n PCs      | X1         | X2         | X3         | Mean       |
| 1          | 0.17857365 | 0.53258736 | 0.49905763 | 0.40340621 |
| 2          | 0.99220888 | 0.57167479 | 0.61150487 | 0.72512951 |
| 3          | 1.0        | 1.0        | 1.0        | 1.0        |
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • n_pcs – the maximum number of PCs to consider.

  • variable_names – (optional) list of str specifying variable names. If not specified, variables will be numbered.

  • print_width – (optional) width of columns printed out.

  • verbose – (optional) bool for printing out the table with \(R^2\) values.

  • save_filename – (optional) str specifying .txt save location/filename.

Returns

  • r2 - matrix of size (n_pcs, n_variables) containing the \(R^2\) values for each variable as a function of the number of retained PCs.

PCA.set_retained_eigenvalues#
PCAfold.reduction.PCA.set_retained_eigenvalues(self, method='SCREE GRAPH', option=None)#

Helps determine how many principal components (PCs) should be retained. The following methods are available:

  • 'TOTAL VARIANCE' - retain the PCs whose eigenvalues account for a specific percentage of the total variance. The required number of PCs is then the smallest value of \(q\) for which this chosen percentage is exceeded. The fraction of variance can be supplied using the option parameter. For instance, set option=0.6 if you want to account for 60% of the variance. If the variance is not supplied in the option parameter, the user will be asked for input during function execution.

  • 'INDIVIDUAL VARIANCE' - retain the PCs whose eigenvalues are greater than the average of the eigenvalues [RKai60] or than 0.7 times the average of the eigenvalues [RJol72]. For a correlation matrix this average equals 1. The fraction of variance can be supplied using the option parameter. For instance, set option=0.6 if you want to account for 60% of the variance. If the variance is not supplied in the option parameter, the user will be asked for input during function execution.

  • 'BROKEN STICK' - retain the PCs according to the Broken Stick Model [RFro76].

  • 'SCREE GRAPH' - retain the PCs using the scree graph, a plot of the eigenvalues against their indices, and look for a natural break between the large and small eigenvalues.

For more detailed information on the options implemented here the user is referred to [RJolliffe02].

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto')

# Compute a new ``PCA`` class object with the new number of retained components:
pca_X_new = pca_X.set_retained_eigenvalues(method='TOTAL VARIANCE', option=0.6)

# The new number of principal components that has been set:
print(pca_X_new.n_components)

This function provides a few methods to select the number of eigenvalues to be retained in the PCA reduction.

Parameters
  • method – (optional) str specifying the method to use in selecting retained eigenvalues.

  • option – (optional) additional parameter used for the 'TOTAL VARIANCE' and 'INDIVIDUAL VARIANCE' methods. If not supplied, information will be obtained interactively.

Returns

  • pca - the PCA object with the number of retained eigenvalues set on it.

PCA.principal_variables#
PCAfold.reduction.PCA.principal_variables(self, method='B2', x=[])#

Extracts Principal Variables (PVs) from a PCA.

The following methods are currently supported:

  • 'B4' - selects Principal Variables based on the variables contained in the eigenvectors corresponding to the largest eigenvalues [RJol72].

  • 'B2' - selects Principal Variables based on the variables contained in the eigenvectors corresponding to the smallest eigenvalues. These are discarded and the remaining variables are used as the PVs [RJol72].

  • 'M2' - at each iteration, each remaining variable is analyzed via PCA [RKrz87]. Note: this is a very expensive method.

For more detailed information on the options implemented here the user is referred to [RJolliffe02].

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto')

# Select Principal Variables (PVs) using M2 method:
principal_variables_indices = pca_X.principal_variables(method='M2', x=X)
Parameters
  • method – (optional) str specifying the method for determining the Principal Variables (PVs).

  • x – (optional) data set to accompany 'M2' method. Note that this is only required for the 'M2' method.

Returns

  • principal_variables_indices - a vector of indices of retained Principal Variables (PVs).

PCA.data_consistency_check#
PCAfold.reduction.PCA.data_consistency_check(self, X, errors_are_fatal=False)#

Checks if the supplied data matrix X is consistent with the current PCA class object.

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# This data set will be consistent:
X_1 = np.random.rand(50,20)
is_consistent = pca_X.data_consistency_check(X_1)

# This data set will not be consistent but will not throw ValueError:
X_2 = np.random.rand(100,10)
is_consistent = pca_X.data_consistency_check(X_2)

# This data set will not be consistent and will throw ValueError:
X_3 = np.random.rand(100,10)
is_consistent = pca_X.data_consistency_check(X_3, errors_are_fatal=True)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • errors_are_fatal – (optional) bool indicating if ValueError should be raised if an incompatibility is detected.

Returns

  • is_consistent - bool specifying whether or not the supplied data matrix \(\mathbf{X}\) is consistent with the PCA class object.

PCA.save_to_txt#
PCAfold.reduction.PCA.save_to_txt(self, save_filename)#

Writes the eigenvector matrix, \(\mathbf{A}\), the loadings, \(\mathbf{l}\), the centering vector, \(\mathbf{C}\), and the scaling vector, \(\mathbf{D}\), to a .txt file.

Example:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=5)

# Save the PCA results to .txt:
pca_X.save_to_txt('pca_X_Data.txt')
Parameters

save_filename – str specifying the .txt save location/filename.


Local Principal Component Analysis#

Class LPCA#
class PCAfold.reduction.LPCA(X, idx, scaling='std', n_components=0, use_eigendec=True, nocenter=False, verbose=False)#

Enables performing local Principal Component Analysis (LPCA) of the original data set, \(\mathbf{X}\), partitioned into clusters.

Example:

from PCAfold import LPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy vector of cluster classifications:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Instantiate LPCA class object:
lpca_X = LPCA(X, idx, scaling='none', n_components=2)

# Access the local covariance matrix in the first cluster:
S_k1 = lpca_X.S[0]

# Access the local eigenvectors in the first cluster:
A_k1 = lpca_X.A[0]

# Access the local eigenvalues in the first cluster:
L_k1 = lpca_X.L[0]

# Access the local principal components in the first cluster:
Z_k1 = lpca_X.principal_components[0]

# Access the local loadings in the first cluster:
l_k1 = lpca_X.loadings[0]

# Access the local variance accounted for in each individual variable in the first cluster:
tq_k1 = lpca_X.tq[0]
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • scaling – (optional) str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • n_components – (optional) int specifying the number of returned eigenvectors, eigenvalues and principal components, \(q\). If set to 0 all are returned.

  • use_eigendec

    (optional) bool specifying the method for obtaining eigenvalues and eigenvectors:

    • use_eigendec=True uses eigendecomposition of the covariance matrix (from numpy.linalg.eigh)

    • use_eigendec=False uses Singular Value Decomposition (SVD) (from scipy.linalg.svd)

  • nocenter – (optional) bool specifying whether data should be centered by mean.

  • verbose – (optional) bool for printing verbose details.

Attributes:

  • S - (read only) list of numpy.ndarray specifying the local covariance matrix, \(\mathbf{S}\). Each list element corresponds to the covariance matrix in a single cluster.

  • A - (read only) list of numpy.ndarray specifying the local eigenvectors, \(\mathbf{A}\). Each list element corresponds to eigenvectors in a single cluster.

  • L - (read only) list of numpy.ndarray specifying the local eigenvalues, \(\mathbf{L}\). Each list element corresponds to eigenvalues in a single cluster.

  • principal_components - (read only) list of numpy.ndarray specifying the local principal components, \(\mathbf{Z}\). Each list element corresponds to principal components in a single cluster.

  • loadings - (read only) list of numpy.ndarray specifying the local loadings, \(\mathbf{l}\). Each list element corresponds to loadings in a single cluster.

  • tq - (read only) list of numpy.ndarray specifying the local variance accounted for in each individual variable by the first \(q\) PCs, \(\mathbf{t_q}\). Each list element corresponds to variance metric in a single cluster.

  • X_reconstructed - (read only) numpy.ndarray specifying the dataset reconstructed from local PCA using the first \(q\) PCs. It has size (n_observations,n_variables).

  • R2 - (read only) list specifying the average coefficient of determination for each cluster reconstructed using the first \(q\) PCs. Each list element corresponds to each reconstructed cluster and is averaged over all non-constant state variables in that cluster.

  • idx_retained_in_clusters - (read only) list of list specifying the variables retained in each cluster. If a variable within a particular cluster becomes constant, it will be removed from this list.

LPCA.local_correlation#
PCAfold.reduction.LPCA.local_correlation(self, variable, index=0, metric='pearson', display=None, verbose=False)#

Computes a correlation in each cluster, and a globally-averaged correlation, between the local principal component, PC, and a specified variable, \(\phi\). The average is taken over all \(k\) clusters. The correlation in the \(n^{th}\) cluster is referred to as \(r_n(\mathrm{PC}, \phi)\).

Available correlation functions are:

  • Pearson correlation coefficient (PCC), set metric='pearson':

\[r_n(\mathrm{PC}, \phi) = \mathrm{abs} \Bigg( \frac{\sum_{i=1}^{N_n} (\mathrm{PC}_i - \overline{\mathrm{PC}}) (\phi_i - \bar{\phi})}{\sqrt{\sum_{i=1}^{N_n} (\mathrm{PC}_i - \overline{\mathrm{PC}})^2} \sqrt{\sum_{i=1}^{N_n} (\phi_i - \bar{\phi})^2}} \Bigg)\]

where \(N_n\) is the number of observations in the \(n^{th}\) cluster.

  • Spearman correlation coefficient (SCC), set metric='spearman'.

  • Distance correlation (dCor), set metric='dcor':

\[r_n(\mathrm{PC}, \phi) = \sqrt{ \frac{\mathrm{dCov}(\mathrm{PC}_n, \phi_n)}{\mathrm{dCov}(\mathrm{PC}_n, \mathrm{PC}_n) \mathrm{dCov}(\phi_n, \phi_n)} }\]

where \(\mathrm{dCov}\) is the distance covariance computed for any two variables, \(X\) and \(Y\), as:

\[\mathrm{dCov}(X,Y) = \sqrt{ \frac{1}{N^2} \sum_{i,j=1}^N x_{i,j} y_{i,j}}\]

where \(x_{i,j}\) and \(y_{i,j}\) are the elements of the double-centered Euclidean distance matrices for the \(X\) and \(Y\) observations, respectively, and \(N\) is the total number of observations. Note that the distance correlation computation allows \(X\) and \(Y\) to have different dimensions.

Note

The distance correlation computation requires the dcor module. You can install it through:

pip install dcor

Globally-averaged correlation metric is computed in two variants:

  • Weighted, where each local correlation is weighted by the size of each cluster:

\[\bar{r} = \frac{1}{N} \sum_{n=1}^k N_n r_n(\mathrm{PC}, \phi)\]
  • Unweighted, which computes an arithmetic average of local correlations from all clusters:

\[r = \frac{1}{k} \sum_{n=1}^k r_n(\mathrm{PC}, \phi)\]
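
The two averages can be illustrated with hypothetical numbers:

import numpy as np

# Hypothetical local correlations in k=3 clusters and the corresponding cluster sizes:
r_n = np.array([0.95, 0.80, 0.99])
N_n = np.array([500, 100, 400])

# Weighted by the number of observations in each cluster:
weighted = np.sum(N_n * r_n) / np.sum(N_n)

# Arithmetic average over all clusters:
unweighted = np.mean(r_n)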

Example:

from PCAfold import predefined_variable_bins, LPCA
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,1000)
y = -x**2 + 1
X = np.hstack((x[:,None], y[:,None]))

# Generate dummy vector of cluster classifications:
(idx, _) = predefined_variable_bins(x, [-0.9, 0, 0.6])

# Instantiate LPCA class object:
lpca = LPCA(X, idx, scaling='none')

# Compute local Pearson correlation coefficient between PC-1 and y:
(local_correlations, weighted, unweighted) = lpca.local_correlation(y, index=0, metric='pearson', verbose=True)

With verbose=True we will see some detailed information:

PCC in cluster 1:   0.999996
PCC in cluster 2:   -0.990817
PCC in cluster 3:   0.983221
PCC in cluster 4:   0.999838

Globally-averaged weighted correlation: 0.990801
Globally-averaged unweighted correlation: 0.993468
Parameters
  • variable – numpy.ndarray specifying the variable, \(\phi\), for correlation computation. It should be of size (n_observations,) or (n_observations,1), or (n_observations,n_variables) when metric='dcor'.

  • index – (optional) int specifying the index of the local principal component for correlation computation. Set index=0 if you want to look at the first PC.

  • metric – (optional) str specifying the correlation metric to use. It can be 'pearson', 'spearman' or 'dcor'.

  • display – (optional) str specifying the display format for the correlations. It can be 'abs', 'percent', 'abs-percent'.

  • verbose – (optional) bool for printing verbose details.

Returns

  • local_correlations - numpy.ndarray specifying the computed correlation in each cluster. It has size (k,).

  • weighted - float specifying the globally-averaged weighted correlation.

  • unweighted - float specifying the globally-averaged unweighted correlation.


Vector Quantization Principal Component Analysis#

Class VQPCA#
class PCAfold.reduction.VQPCA(X, n_clusters, n_components, scaling='std', idx_init='random', max_iter=300, tolerance=None, random_state=None, verbose=False)#

Enables performing Vector Quantization Principal Component Analysis (VQPCA).

The VQPCA algorithm was first proposed in [RKL97] and the modified version presented here was developed by [PPSTS09]. VQPCA assigns observations to local clusters based on the minimum reconstruction error of the local PCA approximation with n_components principal components. This is an iterative procedure in which the reconstruction error is evaluated for every observation as if that observation belonged to cluster \(j\); each observation is then assigned to the cluster for which its reconstruction error is the smallest (sketched below).

Note

The VQPCA algorithm centers the global data set \(\mathbf{X}\) by the mean value and scales it by the scaling specified with the scaling parameter. Data in local clusters is centered by the mean value but is not scaled.
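
For intuition, the assignment rule described above can be sketched as follows. This is an illustrative sketch only, with hypothetical variable names, not the library's implementation; X_cs is assumed to be the globally centered and scaled data set:

import numpy as np

def assign_to_clusters(X_cs, centroids, local_eigenvectors):
    # centroids: list of k local cluster centers, each of size (n_variables,).
    # local_eigenvectors: list of k truncated eigenvector matrices, each of size (n_variables, n_components).
    errors = []
    for c_j, A_j in zip(centroids, local_eigenvectors):
        X_local = X_cs - c_j                      # center locally (no local scaling)
        X_rec = X_local @ A_j @ A_j.T + c_j       # reconstruct from the local PCA basis
        errors.append(np.sum((X_cs - X_rec)**2, axis=1))
    # Each observation is assigned to the cluster with the smallest reconstruction error:
    return np.argmin(np.vstack(errors), axis=0)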

Example:

from PCAfold import VQPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(400,10)

# Instantiate VQPCA class object:
vqpca = VQPCA(X,
              n_clusters=3,
              n_components=2,
              scaling='std',
              idx_init='random',
              max_iter=100,
              tolerance=1.0e-08,
              random_state=42,
              verbose=True)

# Access the VQPCA clustering solution:
idx = vqpca.idx

With verbose=True, the code above will print detailed information on each iteration:

| It.   | Rec. error      | Error conv.? | Cent. conv.? | Cluster 1  | Cluster 2  | Cluster 3  | Time [min]   |
| 1     | 10.20073505     | False        | False        | 165        | 58         | 177        | 0.00042      |
| 2     | 6.02108074      | False        | False        | 155        | 84         | 161        | 0.00073      |
| 3     | 5.79390739      | False        | False        | 148        | 97         | 155        | 0.0011       |
| 4     | 5.69141601      | False        | False        | 148        | 110        | 142        | 0.00134      |
| 5     | 5.63347972      | False        | False        | 148        | 117        | 135        | 0.00156      |
| 6     | 5.61523762      | False        | False        | 148        | 117        | 135        | 0.00175      |
| 7     | 5.61010989      | False        | False        | 147        | 117        | 136        | 0.00199      |
| 8     | 5.60402719      | False        | False        | 144        | 119        | 137        | 0.00224      |
| 9     | 5.59803052      | False        | False        | 144        | 121        | 135        | 0.00246      |
| 10    | 5.59072799      | False        | False        | 142        | 123        | 135        | 0.00268      |
| 11    | 5.5783608       | False        | False        | 139        | 123        | 138        | 0.00291      |
| 12    | 5.57368963      | False        | False        | 138        | 123        | 139        | 0.00316      |
| 13    | 5.56762599      | False        | False        | 140        | 122        | 138        | 0.0034       |
| 14    | 5.55839038      | False        | False        | 138        | 120        | 142        | 0.00368      |
| 15    | 5.55167405      | False        | False        | 137        | 120        | 143        | 0.00394      |
| 16    | 5.54661554      | False        | False        | 136        | 120        | 144        | 0.0042       |
| 17    | 5.5453694       | False        | True         | 136        | 120        | 144        | 0.00444      |
| 18    | 5.5453694       | True         | True         | 136        | 120        | 144        | 0.00444      |
Convergence reached in iteration: 18

Total time: 0.004471 minutes.
--------------------------------------------------------------------------------------------------------------
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • n_clusters – int specifying the number of clusters to partition the data.

  • n_components – (optional) int specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • idx_init – (optional) str or numpy.ndarray specifying the method for centroids initialization. If str, it can be 'uniform' or 'random'. By default, random initialization is performed. An arbitrary, user-supplied initial idx for initializing the centroids can be passed using a numpy.ndarray. It should be of size (n_observations,) or (n_observations,1).

  • max_iter – (optional) the maximum number of iterations that the algorithm will loop through.

  • tolerance – (optional) float specifying the tolerance for the global mean squared reconstruction error and for the cluster centroids. This parameter is important for judging the convergence of the VQPCA algorithm. If set to None, the default value 1.0e-08 is used.

  • random_state – (optional) int specifying the random seed.

  • verbose – (optional) boolean for printing clustering details.

Attributes:

  • idx - vector of cluster classifications.

  • collected_idx - vector of cluster classifications from all iterations.

  • converged - boolean specifying whether the algorithm has converged.

  • A - local eigenvectors from the last iteration.

  • principal_components - local Principal Components from the last iteration.

  • reconstruction_errors_in_clusters - mean reconstruction errors in each cluster from the last iteration.


Subset Principal Component Analysis#

Class SubsetPCA#
class PCAfold.reduction.SubsetPCA(X, X_source=None, full_sequence=True, subset_indices=None, variable_names=None, scaling='std', n_components=2, use_eigendec=True, nocenter=False, verbose=False)#

Enables performing Principal Component Analysis (PCA) of a subset of the original data set, \(\mathbf{X}\).

Example:

from PCAfold import SubsetPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate SubsetPCA class object:
subset_pca_X = SubsetPCA(X, full_sequence=True, scaling='std', n_components=2)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • X_source – (optional) numpy.ndarray specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].

  • full_sequence – (optional) bool specifying if a full sequence of subset PCAs should be performed. If set to True, it is assumed that variables in \(\mathbf{X}\) have been ordered according to some criterion. A sequence of subset PCAs will then be performed starting from the first n_components+1 variables and gradually adding the next variable in \(\mathbf{X}\). When full_sequence=True, parameter subset_indices will be ignored and the class attributes will be of type list of numpy.ndarray. Each element in those lists corresponds to one subset PCA in the sequence (see the access sketch after the attributes list below).

  • subset_indices – (optional) list specifying the indices of columns to be taken from the original data set to form a subset of a data set.

  • variable_names – (optional) list of str specifying the names of variables in \(\mathbf{X}\). It should have length n_variables and each element should correspond to a column in \(\mathbf{X}\).

  • scaling – (optional) str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • n_components – (optional) int specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.

  • use_eigendec

    (optional) bool specifying the method for obtaining eigenvalues and eigenvectors:

    • use_eigendec=True uses eigendecomposition of the covariance matrix (from numpy.linalg.eigh)

    • use_eigendec=False uses Singular Value Decomposition (SVD) (from scipy.linalg.svd)

  • nocenter – (optional) bool specifying whether the original data set should be centered by mean.

Attributes:

  • S - (read only) numpy.ndarray or list of numpy.ndarray specifying the covariance matrix, \(\mathbf{S}\).

  • L - (read only) numpy.ndarray or list of numpy.ndarray specifying the vector of eigenvalues, \(\mathbf{L}\).

  • A - (read only) numpy.ndarray or list of numpy.ndarray specifying the matrix of eigenvectors, \(\mathbf{A}\).

  • principal_components - (read only) numpy.ndarray or list of numpy.ndarray specifying the principal components, \(\mathbf{Z}\).

  • PC_source_terms - (read only) numpy.ndarray or list of numpy.ndarray specifying the PC source terms, \(\mathbf{S_Z}\).

  • variable_sequence - (read only) list or list of list specifying the names of variables that were used in each subset PCA.
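
Since full_sequence=True in the example above, each attribute is a list with one element per subset PCA in the sequence. A hypothetical access pattern, assuming the subset_pca_X object created in the example:

# Inspect the variables used in each subset PCA and the shape of its principal components:
for names, Z in zip(subset_pca_X.variable_sequence, subset_pca_X.principal_components):
    print(names, Z.shape)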


Sample Principal Component Analysis#

Class SamplePCA#
class PCAfold.reduction.SamplePCA(X, idx_X_r, scaling, n_components, biasing_option, X_source=None)#

Enables performing Principal Component Analysis (PCA) on a sample, \(\mathbf{X_r}\), of the original data set, \(\mathbf{X}\), with one of the four implemented options. Reach out to the Biasing options section of the documentation for more information on the available options.

Example:

from PCAfold import DataSampler, SamplePCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Instantiate SamplePCA class object:
sample_pca = SamplePCA(X,
                       idx_X_r,
                       scaling='auto',
                       n_components=2,
                       biasing_option=1,
                       X_source=None)

# Access the re-sampled PCs:
PCs_resampled = sample_pca.pc_scores
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx_X_r – numpy.ndarray specifying the vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\). It should be of size (n_samples,) or (n_samples,1).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • n_components – int specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.

  • biasing_option – int specifying the biasing option. It can only attain values 1, 2, 3 or 4.

  • X_source – (optional) numpy.ndarray specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].

Attributes:

  • eigenvalues - (read only) numpy.ndarray specifying the biased eigenvalues, \(\mathbf{L_r}\).

  • eigenvectors - (read only) numpy.ndarray specifying the biased eigenvectors, \(\mathbf{A_r}\).

  • pc_scores - (read only) numpy.ndarray specifying the \(q\) first biased principal components, \(\mathbf{Z_r}\).

  • pc_sources - (read only) numpy.ndarray specifying the \(q\) first biased sources of principal components, \(\mathbf{S_{Z_r}}\). More information can be found in [TSP09]. This parameter is only computed if X_source input parameter is specified.

  • C - (read only) numpy.ndarray specifying a vector of centers, \(\mathbf{C}\), that were used to preprocess the original full data set \(\mathbf{X}\).

  • D - (read only) numpy.ndarray specifying a vector of scales, \(\mathbf{D}\), that were used to preprocess the original full data set \(\mathbf{X}\).

  • C_r - (read only) numpy.ndarray specifying a vector of centers, \(\mathbf{C_r}\), that were used to preprocess the sampled data set \(\mathbf{X_r}\).

  • D_r - (read only) numpy.ndarray specifying a vector of scales, \(\mathbf{D_r}\), that were used to preprocess the sampled data set \(\mathbf{X_r}\).

Class EquilibratedSamplePCA#
class PCAfold.reduction.EquilibratedSamplePCA(X, idx, scaling, n_components, biasing_option, X_source=None, n_iterations=10, stop_iter=0, random_seed=None, verbose=False)#

Enables performing Principal Component Analysis (PCA) on a sample, \(\mathbf{X_r}\), of the original data set, \(\mathbf{X}\), with one of the four implemented options. Reach out to the Biasing options section of the documentation for more information on the available options.

This implementation gradually (over n_iterations iterations) equilibrates cluster populations, reducing each cluster towards the population of the smallest cluster.

At each iteration, it generates a reduced data set \(\mathbf{X_r}^{(i)}\) made up of the new cluster populations and performs PCA on that data set to find the \(i^{th}\) version of the eigenvectors. Depending on the biasing option selected, it then projects the data set (and optionally also its sources) onto the found eigenvectors.

Equilibration:

At the moment, only one equilibration scheme is implemented. The smallest cluster is found and, at each iteration, the number of observations in any larger \(j^{th}\) cluster is diminished by:

\[\frac{N_j - N_s}{\verb|n_iterations|}\]

where \(N_j\) is the number of observations in the \(j^{th}\) cluster and \(N_s\) is the number of observations in the smallest cluster. This is further illustrated on a synthetic data set below:

_images/cluster-equilibration-scheme.svg
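
As a worked example of the formula above, with hypothetical cluster sizes:

# Hypothetical sizes: N_j = 100 observations in cluster j, N_s = 40 in the smallest cluster:
N_j, N_s, n_iterations = 100, 40, 3

# Observations removed from cluster j at each iteration:
removed_per_iteration = (N_j - N_s) / n_iterations   # = 20.0

# Cluster j population after each iteration:
sizes = [N_j - int(i * removed_per_iteration) for i in range(n_iterations + 1)]   # [100, 80, 60, 40]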

A future implementation will include an equilibration scheme that slows down close to equilibrium.

Interpretation for the outputs:

This function returns 3D arrays eigenvectors, pc_scores and pc_sources that have the following structure:

_images/cbpca-equilibrate-outputs.svg

Example:

from PCAfold import EquilibratedSamplePCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Instantiate EquilibratedSamplePCA class object:
equilibrated_pca = EquilibratedSamplePCA(X,
                                         idx,
                                         'auto',
                                         n_components=2,
                                         biasing_option=1,
                                         n_iterations=1,
                                         random_seed=100,
                                         verbose=True)

# Access the re-sampled PCs from the last (equilibrated) iteration:
PCs_resampled = equilibrated_pca.pc_scores[:,:,-1]
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • X_source – numpy.ndarray specifying the source terms \(\mathbf{S_X}\) corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].

  • n_components – int specifying the number of first principal components, \(q\), that will be saved.

  • biasing_option – int specifying the biasing option. It can only attain values 1, 2, 3 or 4.

  • n_iterations – (optional) int specifying the number of iterations to loop over.

  • stop_iter – (optional) int specifying the index of iteration to stop.

  • random_seed – (optional) int specifying random seed for random sample selection.

  • verbose – (optional) bool for printing verbose details.

Returns

  • eigenvalues - numpy.ndarray specifying the collected eigenvalues from each iteration.

  • eigenvectors - numpy.ndarray specifying the collected eigenvectors from each iteration. This is a 3D array of size (n_variables, n_components, n_iterations+1).

  • pc_scores - numpy.ndarray specifying the collected principal components from each iteration. This is a 3D array of size (n_observations, n_components, n_iterations+1).

  • pc_sources - numpy.ndarray specifying the collected sources of principal components from each iteration. This is a 3D array of size (n_observations, n_components, n_iterations+1).

  • idx_train - numpy.ndarray specifying the final training indices from the equilibrated iteration.

  • C_r - numpy.ndarray specifying a vector of final centers that were used to center the data set at the last (equilibration) iteration.

  • D_r - numpy.ndarray specifying a vector of final scales that were used to scale the data set at the last (equilibration) iteration.

analyze_centers_change#
PCAfold.reduction.analyze_centers_change(X, idx_X_r, variable_names=[], plot_variables=[], legend_label=[], figure_size=None, title=None, save_filename=None)#

Analyzes the change in normalized centers computed on the sampled subset of the original data set \(\mathbf{X_r}\) with respect to the full original data set \(\mathbf{X}\).

The original data set \(\mathbf{X}\) is first normalized so that each variable ranges from 0 to 1:

\[||\mathbf{X}|| = \frac{\mathbf{X} - min(\mathbf{X})}{max(\mathbf{X} - min(\mathbf{X}))}\]

This normalization is done so that centers can be compared across variables on one plot. Samples are then extracted from \(||\mathbf{X}||\), according to idx_X_r, to form \(||\mathbf{X_r}||\).

Normalized centers are computed as:

\[||\mathbf{C}|| = mean(||\mathbf{X}||)\]
\[||\mathbf{C_r}|| = mean(||\mathbf{X_r}||)\]

Percentage measuring the relative change in normalized centers is computed as:

\[p = \frac{||\mathbf{C_r}|| - ||\mathbf{C}||}{||\mathbf{C}||} \cdot 100\%\]
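
The normalization and the percentage change can also be computed directly with numpy. This is a minimal sketch of the formulas above, using a hypothetical X and idx_X_r:

import numpy as np

# Hypothetical data set and sampled indices:
X = np.random.rand(100,10)
idx_X_r = np.arange(50,80)

# Normalize each variable of X to the [0,1] range and extract the sampled subset:
X_normalized = (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))
X_r_normalized = X_normalized[idx_X_r,:]

# Normalized centers and the relative change in percent:
normalized_C = np.mean(X_normalized, axis=0)
normalized_C_r = np.mean(X_r_normalized, axis=0)
center_movement_percentage = (normalized_C_r - normalized_C) / normalized_C * 100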

Example:

from PCAfold import analyze_centers_change, DataSampler
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Analyze the change in normalized centers:
(normalized_C, normalized_C_r, center_movement_percentage, plt) = analyze_centers_change(X, idx_X_r)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx_X_r – vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\).

  • variable_names – (optional) list of str specifying variable names.

  • plot_variables – (optional) list of int specifying indices of variables to be plotted. By default, all variables are plotted.

  • legend_label – (optional) list of str specifying labels for the legend. First entry will refer to \(||\mathbf{C}||\) and second entry to \(||\mathbf{C_r}||\). If the list is empty, legend will not be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • normalized_C - normalized centers \(||\mathbf{C}||\).

  • normalized_C_r - normalized centers \(||\mathbf{C_r}||\).

  • center_movement_percentage - percentage \(p\) measuring the relative change in normalized centers.

  • plt - matplotlib.pyplot plot handle.

analyze_eigenvector_weights_change#
PCAfold.reduction.analyze_eigenvector_weights_change(eigenvectors, variable_names=[], plot_variables=[], normalize=False, zero_norm=False, legend_label=[], color_map='viridis', figure_size=None, title=None, save_filename=None)#

Analyzes the change of weights on an eigenvector obtained from a reduced data set as specified by the eigenvectors matrix. This matrix can contain many versions of eigenvectors, for instance coming from each iteration from the equilibrate_cluster_populations function.

If the number of versions is larger than two, the weights are plotted on a color scale that marks each version. If there is a consistent trend, the coloring should form a clear trajectory.

In a special case, when there are only two versions within eigenvectors matrix, it is understood that the first version corresponds to the original data set and the last version to the equilibrated data set.

Note: This function plots absolute (and optionally normalized) values of weights on each variable. Columns are normalized by dividing by the maximum value. This is done in order to compare the movement of weights equally, with the highest normalized weight being equal to 1. You can additionally set zero_norm=True in order to normalize weights such that they are between 0 and 1 (this is not done by default).

Example:

from PCAfold import equilibrate_cluster_populations, analyze_eigenvector_weights_change
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,))
idx[50:80] = 1

# Run cluster equilibration:
(eigenvalues, eigenvectors_matrix, pc_scores_matrix, pc_sources_matrix, idx_train, C_r, D_r) = equilibrate_cluster_populations(X, idx, 'auto', n_components=2, biasing_option=1, n_iterations=1, random_seed=100, verbose=True)

# Analyze weights change on the first eigenvector:
plt = analyze_eigenvector_weights_change(eigenvectors_matrix[:,0,:])

# Analyze weights change on the second eigenvector:
plt = analyze_eigenvector_weights_change(eigenvectors_matrix[:,1,:])
Parameters
  • eigenvectors

    matrix of concatenated eigenvectors coming from different data sets or from different iterations. It should be of size (n_variables, n_versions). This parameter can be directly extracted from the eigenvectors_matrix output of the function equilibrate_cluster_populations. For instance, if the first and second eigenvectors should be plotted:

    eigenvectors_1 = eigenvectors_matrix[:,0,:]
    eigenvectors_2 = eigenvectors_matrix[:,1,:]
    

  • variable_names – (optional) list of str specifying variable names.

  • plot_variables – (optional) list of integers specifying indices of variables to be plotted. By default, all variables are plotted.

  • normalize – (optional) bool specifying whether weights should be normalized at all. If set to False, the absolute values are plotted.

  • zero_norm – (optional) bool specifying whether weights should be normalized between 0 and 1. By default they are not normalized to start at 0. Only has effect if normalize=True.

  • legend_label – (optional) list of str specifying labels for the legend. If the list is empty, legend will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

analyze_eigenvalue_distribution#
PCAfold.reduction.analyze_eigenvalue_distribution(X, idx_X_r, scaling, biasing_option, legend_label=[], figure_size=None, title=None, save_filename=None)#

Analyzes the normalized eigenvalue distribution when PCA is performed on the original data set \(\mathbf{X}\) and on the sampled data set \(\mathbf{X_r}\).

Reach out to the Biasing options section of the documentation for more information on the available options.

Example:

from PCAfold import analyze_eigenvalue_distribution, DataSampler
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Analyze the change in eigenvalue distribution:
plt = analyze_eigenvalue_distribution(X, idx_X_r, 'auto', biasing_option=1)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • idx_X_r – vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\).

  • scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • biasing_option – int specifying the biasing option. It can only attain values 1, 2, 3 or 4.

  • legend_label – (optional) list of str specifying labels for the legend. First entry will refer to \(\mathbf{X}\) and second entry to \(\mathbf{X_r}\). If the list is empty, legend will not be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.


Biasing options#

This section explains the choice of the biasing_option input parameter in some of the functions in this module. The general goal of PCA on sampled data sets is to bias PCA with some information about the sampled data set \(\mathbf{X_r}\). The biasing_option parameter controls how PCA is performed on, or informed by, the data set \(\mathbf{X_r}\) sampled from \(\mathbf{X}\).

It is assumed that centers and scales computed on \(\mathbf{X_r}\) are denoted \(\mathbf{C_r}\) and \(\mathbf{D_r}\) and centers and scales computed on \(\mathbf{X}\) are denoted \(\mathbf{C}\) and \(\mathbf{D}\). \(N\) is the number of observations in \(\mathbf{X}\).

Biasing option 1#

The steps of PCA in this option:

  • S1, sampling: \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)

  • S2, centering and scaling: \(\mathbf{X_{cs, r}} = (\mathbf{X_r} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\) and \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)

  • S3, PCA eigenvectors: \(\frac{1}{N-1} \mathbf{X_{cs, r}}^{\mathbf{T}} \mathbf{X_{cs, r}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)

  • S4, PCA transformation: \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)

These steps are presented graphically below:

_images/biasing-option-1.svg
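
A minimal numpy sketch of these four steps, assuming auto (std) scaling and a hypothetical sampling vector idx_X_r; the SamplePCA class with biasing_option=1 is the library's implementation of this option:

import numpy as np

# Hypothetical data set and sampled indices:
X = np.random.rand(100,10)
idx_X_r = np.arange(50,80)
N = X.shape[0]

# S1: Sampling:
X_r = X[idx_X_r,:]

# S2: Centering and scaling:
C, D = np.mean(X, axis=0), np.std(X, axis=0)
C_r, D_r = np.mean(X_r, axis=0), np.std(X_r, axis=0)
X_cs = (X - C) / D
X_cs_r = (X_r - C_r) / D_r

# S3: Eigendecomposition of the covariance matrix of the sampled data:
S_r = X_cs_r.T @ X_cs_r / (N - 1)
eigenvalues, A_r = np.linalg.eigh(S_r)
A_r = A_r[:, ::-1]   # order eigenvectors by decreasing eigenvalue

# S4: Project the full centered and scaled data set onto the biased eigenvectors:
Z_r = X_cs @ A_r
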
Biasing option 2#

The steps of PCA in this option:

  • S1, sampling: \(\mathbf{X_{cs}} \xrightarrow{\text{sampling}} \mathbf{X_r}\)

  • S2, centering and scaling: \(\mathbf{X_r}\) is not further pre-processed; \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)

  • S3, PCA eigenvectors: \(\frac{1}{N-1} \mathbf{X_r}^{\mathbf{T}} \mathbf{X_r} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)

  • S4, PCA transformation: \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)

These steps are presented graphically below:

_images/biasing-option-2.svg
Biasing option 3#

The steps of PCA in this option:

  • S1, sampling: \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)

  • S2, centering and scaling: \(\mathbf{X_{cs, r}} = (\mathbf{X_r} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\) and \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\)

  • S3, PCA eigenvectors: \(\frac{1}{N-1} \mathbf{X_{cs, r}}^{\mathbf{T}} \mathbf{X_{cs, r}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)

  • S4, PCA transformation: \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)

These steps are presented graphically below:

_images/biasing-option-3.svg
Biasing option 4#

The steps of PCA in this option:

  • S1, sampling: \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)

  • S2, centering and scaling: \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\)

  • S3, PCA eigenvectors: \(\frac{1}{N-1} \mathbf{X_{cs}}^{\mathbf{T}} \mathbf{X_{cs}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)

  • S4, PCA transformation: \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)

These steps are presented graphically below:

_images/biasing-option-4.svg

Plotting functions#

plot_2d_manifold#
PCAfold.reduction.plot_2d_manifold(x, y, color=None, clean=False, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', colorbar_range=None, norm=None, grid_on=True, s=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots a two-dimensional manifold given two vectors defining the manifold.

Example:

from PCAfold import PCA, plot_2d_manifold
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain 2-dimensional manifold from PCA:
pca_X = PCA(X)
principal_components = pca_X.transform(X)

# Plot the manifold:
plt = plot_2d_manifold(principal_components[:,0],
                       principal_components[:,1],
                       color=X[:,0],
                       clean=False,
                       x_label='PC-1',
                       y_label='PC-2',
                       colorbar_label='$X_1$',
                       colorbar_range=(0,1),
                       figure_size=(5,5),
                       title='2D manifold',
                       save_filename='2d-manifold.pdf')
plt.close()
Parameters
  • xnumpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • ynumpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points is plotted.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • norm – (optional) matplotlib.colors specifying the colormap normalization to use. Example can be matplotlib.colors.LogNorm().

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • grid_on – bool specifying whether grid should be plotted.

  • s – (optional) int or float specifying the scatter point size.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_3d_manifold#
PCAfold.reduction.plot_3d_manifold(x, y, z, color=None, elev=45, azim=-45, clean=False, x_label=None, y_label=None, z_label=None, colorbar_label=None, color_map='viridis', colorbar_range=None, s=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots a three-dimensional manifold given three vectors defining the manifold.

Example:

from PCAfold import PCA, plot_3d_manifold
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain 3-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)

# Plot the manifold:
plt = plot_3d_manifold(PCs[:,0],
                       PCs[:,1],
                       PCs[:,2],
                       color=X[:,0],
                       elev=30,
                       azim=-60,
                       clean=False,
                       x_label='PC-1',
                       y_label='PC-2',
                       z_label='PC-3',
                       colorbar_label='$X_1$',
                       colorbar_range=(0,1),
                       figure_size=(15,7),
                       title='3D manifold',
                       save_filename='3d-manifold.png')
plt.close()
Parameters
  • x – variable on the \(x\)-axis. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1).

  • y – variable on the \(y\)-axis. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1).

  • z – variable on the \(z\)-axis. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1).

  • color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x, y and z vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • elev – (optional) elevation angle.

  • azim – (optional) azimuth angle.

  • clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points and the 3D axes is plotted.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • z_label – (optional) str specifying \(z\)-axis label annotation. If set to None label will not be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • s – (optional) int or float specifying the scatter point size.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_2d_manifold_sequence#
PCAfold.reduction.plot_2d_manifold_sequence(xy, color=None, x_label=None, y_label=None, cbar=False, nrows=1, colorbar_label=None, color_map='viridis', grid_on=True, figure_size=(7, 3), title=None, save_filename=None)#

Plots a sequence of two-dimensional manifolds given a list of two vectors defining the manifold.

Example:

from PCAfold import SubsetPCA, plot_2d_manifold_sequence
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Obtain two-dimensional manifolds from subset PCA:
subset_PCA = SubsetPCA(X)
principal_components = subset_PCA.principal_components

# Plot the manifold:
plt = plot_2d_manifold_sequence(principal_components,
                                color=X[:,0],
                                x_label='PC-1',
                                y_label='PC-2',
                                nrows=2,
                                colorbar_label='$X_1$',
                                figure_size=(7,3),
                                title=['First', 'Second', 'Third'],
                                save_filename='2d-manifold-sequence.pdf')
plt.close()
Parameters
  • xy – list of numpy.ndarray specifying the manifolds (variables on the \(x\)- and \(y\)-axis). Each element of the list should be of size (n_observations,2).

  • color – (optional) numpy.ndarray or str, or list of numpy.ndarray or str specifying colors for the manifolds. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. Each numpy.ndarray should be of size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifolds will be plotted in black.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • cbar – (optional) bool specifying if the colorbar should be plotted.

  • nrows – (optional) int specifying in how many rows the manifold sequence should be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • grid_on – bool specifying whether grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) list of str specifying title for each subplot. If set to None titles will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_parity#
PCAfold.reduction.plot_parity(variable, variable_rec, color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a parity plot between a variable and its reconstruction.

Example:

from PCAfold import PCA, plot_parity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain PCA reconstruction of the data set:
pca_X = PCA(X, n_components=8)
principal_components = pca_X.transform(X)
X_rec = pca_X.reconstruct(principal_components)

# Parity plot for the reconstruction of the first variable:
plt = plot_parity(X[:,0],
                  X_rec[:,0],
                  color=X[:,0],
                  x_label='Observed $X_1$',
                  y_label='Reconstructed $X_1$',
                  colorbar_label='X_1',
                  color_map='inferno',
                  figure_size=(5,5),
                  title='Parity plot',
                  save_filename='parity-plot.pdf')
plt.close()
Parameters
  • variable – vector specifying the original variable. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1).

  • variable_rec – vector specifying the reconstruction of the original variable. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1).

  • color – (optional) vector or string specifying color for the parity plot. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, parity plot will be plotted in black.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • grid_on – bool specifying whether grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_mode#
PCAfold.reduction.plot_mode(mode, mode_name=None, variable_names=None, plot_absolute=False, rotate_label=False, bar_color=None, ylim=None, figure_size=None, title=None, save_filename=None)#

Plots weights on a generic mode.

Example:

from PCAfold import PCA, plot_mode
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot the first eigenvector:
plt = plot_mode(eigenvectors[:,0],
                 variable_names=['$a_1$', '$a_2$', '$a_3$'],
                 plot_absolute=False,
                 rotate_label=True,
                 bar_color=None,
                 figure_size=(5,3),
                 title='PCA on X',
                 save_filename='PCA-X.pdf')
plt.close()
Parameters
  • modenumpy.ndarray specifying the mode to plot. It should be of size (n_variables,) or (n_variables,1).

  • mode_name – str specifying the mode name.

  • variable_names – (optional) list of str specifying variable names.

  • plot_absolute – (optional) bool specifying whether absolute values of eigenvectors should be plotted.

  • rotate_label – (optional) bool specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it to True for data sets with many variables for viewing clarity.

  • bar_color – (optional) str specifying color of bars.

  • ylim – (optional) list specifying limits on the y-axis.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_eigenvectors#
PCAfold.reduction.plot_eigenvectors(eigenvectors, eigenvectors_indices=[], variable_names=None, plot_absolute=False, rotate_label=False, bar_color=None, figure_size=None, title=None, save_filename=None)#

Plots weights on eigenvectors. It will generate as many plots as there are eigenvectors present in the eigenvectors matrix.

Example:

from PCAfold import PCA, plot_eigenvectors
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot second and third eigenvector:
plts = plot_eigenvectors(eigenvectors[:,[1,2]],
                         eigenvectors_indices=[1,2],
                         variable_names=['$a_1$', '$a_2$', '$a_3$'],
                         plot_absolute=False,
                         rotate_label=True,
                         bar_color=None,
                         title='PCA on X',
                         save_filename='PCA-X.pdf')
plts[0].close()
plts[1].close()
Parameters
  • eigenvectors – matrix of eigenvectors to plot. It can be supplied as an attribute of the PCA class: PCA.A.

  • eigenvectors_indices – list of int specifying indexing of eigenvectors inside eigenvectors supplied. If it is not supplied, it is assumed that eigenvectors are numbered \([0, 1, 2, \dots, n]\), where \(n\) is the number of eigenvectors provided.

  • variable_names – (optional) list of str specifying variable names.

  • plot_absolute – (optional) bool specifying whether absolute values of eigenvectors should be plotted.

  • rotate_label – (optional) bool specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it to True for data sets with many variables for viewing clarity.

  • bar_color – (optional) str specifying color of bars.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png. Note that a prefix eigenvector-# will be added in front of the filename, where # is the number of the currently plotted eigenvector.

Returns

  • plot_handles - list of plot handles.

plot_eigenvectors_comparison#
PCAfold.reduction.plot_eigenvectors_comparison(eigenvectors_tuple, legend_labels=[], variable_names=[], plot_absolute=False, rotate_label=False, ylim=None, color_map='coolwarm', figure_size=None, title=None, save_filename=None)#

Plots a comparison of weights on eigenvectors.

Example:

from PCAfold import PCA, plot_eigenvectors_comparison
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot a comparison of the first and second eigenvectors:
plt = plot_eigenvectors_comparison((eigenvectors[:,0],
                                    eigenvectors[:,1]),
                                   legend_labels=['PC-1', 'PC-2'],
                                   variable_names=['$a_1$', '$a_2$', '$a_3$'],
                                   plot_absolute=False,
                                   color_map='coolwarm',
                                   title='PCA on X',
                                   save_filename='PCA-X.pdf')
plt.close()
Parameters
  • eigenvectors_tuple – tuple specifying the eigenvectors to plot. Each eigenvector inside the tuple should be a 1D array of size (n_variables,). It can be supplied as an attribute of the PCA class, for instance: (PCA.A[:,0], PCA.A[:,1]).

  • legend_labels – list of str specifying labels for each element in the eigenvectors_tuple.

  • variable_names – (optional) list of str specifying variable names.

  • plot_absolute – bool specifying whether absolute values of eigenvectors should be plotted.

  • rotate_label – (optional) bool specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it to True for data sets with many variables for viewing clarity.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'coolwarm'.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_eigenvalue_distribution#
PCAfold.reduction.plot_eigenvalue_distribution(eigenvalues, normalized=False, figure_size=None, title=None, save_filename=None)#

Plots eigenvalue distribution.

Example:

from PCAfold import PCA, plot_eigenvalue_distribution
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues = pca_X.L

# Plot eigenvalue distribution:
plt = plot_eigenvalue_distribution(eigenvalues,
                                   normalized=True,
                                   title='PCA on X',
                                   save_filename='PCA-X.pdf')
plt.close()
Parameters
  • eigenvalues – a 1D vector of eigenvalues to plot. It can be supplied as an attribute of the PCA class: PCA.L.

  • normalized – (optional) bool specifying whether eigenvalues should be normalized to 1.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_eigenvalue_distribution_comparison#
PCAfold.reduction.plot_eigenvalue_distribution_comparison(eigenvalues_tuple, legend_labels=[], normalized=False, color_map='coolwarm', figure_size=None, title=None, save_filename=None)#

Plots a comparison of eigenvalue distributions.

Example:

from PCAfold import PCA, plot_eigenvalue_distribution_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,10)
Y = np.random.rand(100,10)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues_X = pca_X.L
pca_Y = PCA(Y)
eigenvalues_Y = pca_Y.L

# Plot eigenvalue distribution comparison:
plt = plot_eigenvalue_distribution_comparison((eigenvalues_X, eigenvalues_Y),
                                              legend_labels=['PCA on X', 'PCA on Y'],
                                              normalized=True,
                                              title='PCA on X and Y',
                                              save_filename='PCA-X-Y.pdf')
plt.close()
Parameters
  • eigenvalues_tuple – tuple specifying the eigenvalues to plot. Each vector of eigenvalues inside the tuple should be a 1D array. It can be supplied as an attribute of the PCA class, for instance: (PCA_1.L, PCA_2.L).

  • legend_labels – list of str specifying the labels for each element in the eigenvalues_tuple.

  • normalized – (optional) bool specifying whether eigenvalues should be normalized to 1.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'coolwarm'.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_cumulative_variance#
PCAfold.reduction.plot_cumulative_variance(eigenvalues, n_components=0, figure_size=None, title=None, save_filename=None)#

Plots the eigenvalues as bars and their cumulative sum to visualize the percent variance in the data explained by each principal component individually and by each principal component cumulatively.
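
The quantities visualized can be computed directly from the eigenvalues. A minimal sketch with hypothetical values:

import numpy as np

# Hypothetical eigenvalues (in practice taken from PCA.L):
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

# Fraction of variance explained by each PC individually and cumulatively:
individual_variance = eigenvalues / np.sum(eigenvalues)
cumulative_variance = np.cumsum(individual_variance)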

Example:

from PCAfold import PCA, plot_cumulative_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues = pca_X.L

# Plot the cumulative variance from eigenvalues:
plt = plot_cumulative_variance(eigenvalues,
                               n_components=0,
                               title='PCA on X',
                               save_filename='PCA-X.pdf')
plt.close()
Parameters
  • eigenvalues – a 1D vector of eigenvalues to analyze. It can be supplied as an attribute of the PCA class: PCA.L.

  • n_components – (optional) how many principal components you want to visualize (default is all).

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_heatmap#
PCAfold.reduction.plot_heatmap(M, annotate=False, text_color='w', format_displayed='%.2f', x_ticks=None, y_ticks=None, color_map='viridis', cbar=False, colorbar_label=None, figure_size=(5, 5), title=None, save_filename=None)#

Plots a heatmap for any matrix \(\mathbf{M}\).

Example:

from PCAfold import PCA, plot_heatmap
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain the covariance matrix:
pca_X = PCA(X)
covariance_matrix = pca_X.S

# Define ticks:
ticks = ['A', 'B', 'C', 'D', 'E']

# Plot a heatmap of the covariance matrix:
plt = plot_heatmap(covariance_matrix,
                   annotate=True,
                   text_color='w',
                   format_displayed='%.1f',
                   x_ticks=ticks,
                   y_ticks=ticks,
                   title='Covariance',
                   save_filename='covariance.pdf')
plt.close()
Parameters
  • M – numpy.ndarray specifying the matrix \(\mathbf{M}\).

  • annotate – (optional) bool specifying whether numerical values of matrix elements should be plotted on top of the heatmap.

  • text_color – (optional) str specifying the color of the annotation text.

  • format_displayed – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '%.2f'.

  • x_ticks – (optional) bool specifying whether ticks on the \(x\)-axis should be plotted, or list specifying the ticks on the \(x\)-axis.

  • y_ticks – (optional) bool specifying whether ticks on the \(y\)-axis should be plotted, or list specifying the ticks on the \(y\)-axis.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • cbar – (optional) bool specifying whether colorbar should be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_heatmap_sequence#
PCAfold.reduction.plot_heatmap_sequence(M, annotate=False, text_color='w', format_displayed='%.2f', x_ticks=None, y_ticks=None, color_map='viridis', cbar=False, colorbar_label=None, figure_size=(5, 5), title=None, save_filename=None)#

Plots a sequence of heatmaps for matrices \(\mathbf{M}\) stored in a list.

Example:

from PCAfold import PCA, plot_heatmap_sequence
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain the covariance matrices:
pca_X_auto = PCA(X, scaling='auto')
pca_X_range = PCA(X, scaling='range')
pca_X_vast = PCA(X, scaling='vast')
covariance_matrices = [pca_X_auto.S, pca_X_range.S, pca_X_vast.S]
titles = ['Auto', 'Range', 'VAST']

# Plot a sequence of heatmaps of the covariance matrices:
plt = plot_heatmap_sequence(covariance_matrices,
                            annotate=True,
                            text_color='w',
                            format_displayed='%.1f',
                            color_map='viridis',
                            cbar=True,
                            title=titles,
                            figure_size=(12,3),
                            save_filename='covariance-matrices.pdf')
plt.close()
Parameters
  • M – list of numpy.ndarray specifying the matrices \(\mathbf{M}\).

  • annotate – (optional) bool specifying whether numerical values of matrix elements should be plotted on top of the heatmap.

  • text_color – (optional) str specifying the color of the annotation text.

  • format_displayed – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '%.2f'.

  • x_ticks – (optional) bool specifying whether ticks on the \(x\) -axis should be plotted or list of list specifying the ticks on the \(x\) -axis.

  • y_ticks – (optional) bool specifying whether ticks on the \(y\) -axis should be plotted or list of list specifying the ticks on the \(y\) -axis.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • cbar – (optional) bool specifying whether colorbar should be plotted.

  • colorbar_label – (optional) str specifying colorbar label annotation. If set to None, colorbar label will not be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.



Manifold analysis#

The analysis module contains functions for assessing the intrinsic dimensionality and quality of manifolds.

Note

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]

The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.

The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
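
For illustration, a minimal sketch of a data matrix in this format, built from random numbers with NumPy:

import numpy as np

# 1000 observations (rows) of Q = 10 variables (columns):
X = np.random.rand(1000,10)

# The initial dimensionality of this data set is the number of columns:
(n_observations, n_variables) = np.shape(X)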


Manifold assessment#

This section includes functions for quantitative assessments of manifold dimensionality and for comparing manifold parameterizations according to scales of variation and uniqueness of dependent variable values as introduced in [AAS21] and [AZASP22].

compute_normalized_variance#
PCAfold.analysis.compute_normalized_variance(indepvars, depvars, depvar_names, npts_bandwidth=25, min_bandwidth=None, max_bandwidth=None, bandwidth_values=None, scale_unit_box=True, n_threads=None, compute_sample_norm_var=False, compute_sample_norm_range=False)#

Compute a normalized variance (and related quantities) for analyzing manifold dimensionality. The normalized variance is computed as

\[\mathcal{N}(\sigma) = \frac{\sum_{i=1}^n (y_i - \mathcal{K}(\hat{x}_i; \sigma))^2}{\sum_{i=1}^n (y_i - \bar{y} )^2}\]

where \(\bar{y}\) is the average quantity over the whole manifold and \(\mathcal{K}(\hat{x}_i; \sigma)\) is the weighted average quantity calculated using kernel regression with a Gaussian kernel of bandwidth \(\sigma\) centered around the \(i^{th}\) observation. \(n\) is the number of observations. \(\mathcal{N}(\sigma)\) is computed for each bandwidth in an array of bandwidth values. By default, the indepvars (\(x\)) are centered and scaled to reside inside a unit box (resulting in \(\hat{x}\)) so that the bandwidths have the same meaning in each dimension. Therefore, the bandwidth and its involved calculations are applied in the normalized independent variable space. This may be turned off by setting scale_unit_box to False. The bandwidth values may be specified directly through bandwidth_values or default values will be calculated as a logspace from min_bandwidth to max_bandwidth with npts_bandwidth number of values. If left unspecified, min_bandwidth and max_bandwidth will be calculated as the minimum and maximum nonzero distance between points, respectively.

More information can be found in [AAS21].

Example:

from PCAfold import PCA, compute_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Access bandwidth values:
variance_data.bandwidth_values

# Access normalized variance values:
variance_data.normalized_variance

# Access normalized variance values for a specific variable:
variance_data.normalized_variance['B']
Parameters
  • indepvars – numpy.ndarray specifying the independent variable values. It should be of size (n_observations,n_independent_variables).

  • depvars – numpy.ndarray specifying the dependent variable values. It should be of size (n_observations,n_dependent_variables).

  • depvar_names – list of str corresponding to the names of the dependent variables (for saving values in a dictionary)

  • npts_bandwidth – (optional, default 25) number of points to build a logspace of bandwidth values

  • min_bandwidth – (optional, default to minimum nonzero interpoint distance) minimum bandwidth

  • max_bandwidth – (optional, default to estimated maximum interpoint distance) maximum bandwidth

  • bandwidth_values – (optional) array of bandwidth values, i.e. filter widths for a Gaussian filter, to loop over

  • scale_unit_box – (optional, default True) center/scale the independent variables between [0,1] for computing a normalized variance so the bandwidth values have the same meaning in each dimension

  • n_threads – (optional, default None) number of threads to run this computation. If None, default behavior of multiprocessing.Pool is used, which is to use all available cores on the current system.

  • compute_sample_norm_var – (optional, default False) bool specifying if sample normalized variance should be computed.

  • compute_sample_norm_range – (optional, default False) bool specifying if sample normalized range should be computed.

Returns

  • variance_data - an object of the VarianceData class.

Class VarianceData#
class PCAfold.analysis.VarianceData(bandwidth_values, norm_var, global_var, bandwidth_10pct_rise, keys, norm_var_limit, sample_norm_var, sample_norm_range)#

A class for storing helpful quantities in analyzing dimensionality of manifolds through normalized variance measures. This class will be returned by compute_normalized_variance.

Parameters
  • bandwidth_values – the array of bandwidth values (Gaussian filter widths) used in computing the normalized variance for each variable

  • normalized_variance – dictionary of the normalized variance computed at each of the bandwidth values for each variable

  • global_variance – dictionary of the global variance for each variable

  • bandwidth_10pct_rise – dictionary of the bandwidth value corresponding to a 10% rise in the normalized variance for each variable

  • variable_names – list of the variable names

  • normalized_variance_limit – dictionary of the normalized variance computed as the bandwidth approaches zero (numerically at \(10^{-16}\)) for each variable

  • sample_normalized_variance – dictionary of the sample normalized variance for every observation, for each bandwidth and for each variable
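
For illustration, a minimal sketch of inspecting these stored quantities on an object returned by compute_normalized_variance (assuming the quantities listed above are exposed as attributes under the names used in the descriptions):

from PCAfold import PCA, compute_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20))

# Access the stored quantities for variable 'B':
variance_data.global_variance['B']
variance_data.bandwidth_10pct_rise['B']
variance_data.normalized_variance_limit['B']

# Access the list of variable names:
variance_data.variable_names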

normalized_variance_derivative#
PCAfold.analysis.normalized_variance_derivative(variance_data)#

Compute a scaled normalized variance derivative on a logarithmic scale, \(\hat{\mathcal{D}}(\sigma)\), from

\[\mathcal{D}(\sigma) = \frac{\mathrm{d}\mathcal{N}(\sigma)}{\mathrm{d}\log_{10}(\sigma)} + \lim_{\sigma \to 0} \mathcal{N}(\sigma)\]

and

\[\hat{\mathcal{D}}(\sigma) = \frac{\mathcal{D}(\sigma)}{\max(\mathcal{D}(\sigma))}\]

This value relays how fast the variance is changing as the bandwidth changes and captures non-uniqueness from nonzero values of \(\lim_{\sigma \to 0} \mathcal{N}(\sigma)\). The derivative is approximated with central finite differencing and the limit is approximated by \(\mathcal{N}(\sigma=10^{-16})\) using the normalized_variance_limit attribute of the VarianceData object.

More information can be found in [AAS21].

Example:

from PCAfold import PCA, compute_normalized_variance, normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Compute normalized variance derivative:
(derivative, bandwidth_values, max_derivative) = normalized_variance_derivative(variance_data)

# Access normalized variance derivative values for a specific variable:
derivative['B']
Parameters

variance_data – an object of the VarianceData class returned from compute_normalized_variance

Returns

  • derivative_dict - a dictionary of \(\hat{\mathcal{D}}(\sigma)\) for each variable in the provided VarianceData object

  • x - the \(\sigma\) values where \(\hat{\mathcal{D}}(\sigma)\) was computed

  • max_derivatives_dicts - a dictionary of \(\max(\mathcal{D}(\sigma))\) values for each variable in the provided VarianceData object.

find_local_maxima#
PCAfold.analysis.find_local_maxima(dependent_values, independent_values, logscaling=True, threshold=0.01, show_plot=False)#

Finds and returns locations and values of local maxima in a dependent variable given a set of observations. The functional form of the dependent variable is approximated with a cubic spline for smoother approximations to local maxima.

Parameters
  • dependent_values – observations of a single dependent variable such as \(\hat{\mathcal{D}}\) from normalized_variance_derivative (for a single variable).

  • independent_values – observations of a single independent variable such as \(\sigma\) returned by normalized_variance_derivative

  • logscaling – (optional, default True) this logarithmically scales independent_values before finding local maxima. This is needed for scaling \(\sigma\) appropriately before finding peaks in \(\hat{\mathcal{D}}\).

  • threshold – (optional, default \(10^{-2}\)) local maxima found below this threshold will be ignored.

  • show_plot – (optional, default False) when True, a plot of the dependent_values over independent_values (logarithmically scaled if logscaling is True) with the local maxima highlighted will be shown.

Returns

  • the locations of local maxima in dependent_values

  • the local maxima values
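
A minimal sketch of locating peaks in \(\hat{\mathcal{D}}(\sigma)\) for a single variable, building on the normalized_variance_derivative example above (find_local_maxima is imported here through its full module path):

from PCAfold import PCA, compute_normalized_variance, normalized_variance_derivative
from PCAfold.analysis import find_local_maxima
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities and their derivatives:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20))
(derivative, sigma, _) = normalized_variance_derivative(variance_data)

# Find local maxima in the derivative profile of variable 'B':
(peak_locations, peak_values) = find_local_maxima(derivative['B'], sigma, logscaling=True, threshold=0.01)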

random_sampling_normalized_variance#
PCAfold.analysis.random_sampling_normalized_variance(sampling_percentages, indepvars, depvars, depvar_names, n_sample_iterations=1, verbose=True, npts_bandwidth=25, min_bandwidth=None, max_bandwidth=None, bandwidth_values=None, scale_unit_box=True, n_threads=None)#

Compute the normalized variance derivatives \(\hat{\mathcal{D}}(\sigma)\) for random samples of the provided data specified using sampling_percentages. These will be averaged over n_sample_iterations iterations. Analyzing the shift in peaks of \(\hat{\mathcal{D}}(\sigma)\) due to sampling can distinguish between characteristic features and non-uniqueness due to a transformation/reduction of manifold coordinates. True features should not show significant sensitivity to sampling while non-uniqueness/folds in the manifold will.

More information can be found in [AAS21].

Parameters
  • sampling_percentages – list or 1D array of fractions (between 0 and 1) of the provided data to sample for computing the normalized variance

  • indepvars – independent variable values (size: n_observations x n_independent variables)

  • depvars – dependent variable values (size: n_observations x n_dependent variables)

  • depvar_names – list of strings corresponding to the names of the dependent variables (for saving values in a dictionary)

  • n_sample_iterations – (optional, default 1) how many iterations to average the normalized variance derivative over for each sampling percentage

  • verbose – (optional, default True) when True, progress statements are printed

  • npts_bandwidth – (optional, default 25) number of points to build a logspace of bandwidth values

  • min_bandwidth – (optional, default to minimum nonzero interpoint distance) minimum bandwidth

  • max_bandwidth – (optional, default to estimated maximum interpoint distance) maximum bandwidth

  • bandwidth_values – (optional) array of bandwidth values, i.e. filter widths for a Gaussian filter, to loop over

  • scale_unit_box – (optional, default True) center/scale the independent variables between [0,1] for computing a normalized variance so the bandwidth values have the same meaning in each dimension

  • n_threads – (optional, default None) number of threads to run this computation. If None, default behavior of multiprocessing.Pool is used, which is to use all available cores on the current system.

Returns

  • a dictionary of the normalized variance derivative (\(\hat{\mathcal{D}}(\sigma)\)) for each sampling percentage in sampling_percentages averaged over n_sample_iterations iterations

  • the \(\sigma\) values used for computing \(\hat{\mathcal{D}}(\sigma)\)

  • a dictionary of the VarianceData objects for each sampling percentage and iteration in sampling_percentages and n_sample_iterations
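
A minimal usage sketch, comparing the normalized variance derivative computed at a few sampling fractions (the function is imported here through its full module path):

from PCAfold import PCA
from PCAfold.analysis import random_sampling_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance derivatives at 50%, 75% and 100% of the data:
(derivatives, sigma, variance_data_dict) = random_sampling_normalized_variance([0.5, 0.75, 1.0],
                                                                               principal_components,
                                                                               X,
                                                                               depvar_names=['A', 'B', 'C', 'D', 'E'],
                                                                               n_sample_iterations=3,
                                                                               verbose=False,
                                                                               bandwidth_values=np.logspace(-3, 1, 20))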

feature_size_map#
PCAfold.analysis.feature_size_map(variance_data, variable_name, cutoff=1, starting_bandwidth_idx='peak', use_variance=False, verbose=False)#

Computes a map of local feature sizes on a manifold.

Example:

from PCAfold import PCA, compute_normalized_variance, feature_size_map
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components,
                                            X,
                                            depvar_names=variable_names,
                                            bandwidth_values=bandwidth_values)

# Compute the feature size map:
feature_size_map = feature_size_map(variance_data,
                                    variable_name='X_1',
                                    cutoff=1,
                                    starting_bandwidth_idx='peak',
                                    verbose=True)
Parameters
  • variance_data – an object of VarianceData class.

  • variable_name – str specifying the name of the dependent variable for which the feature size map should be computed. It should be as per name specified when computing variance_data.

  • cutoff – (optional) float or int specifying the cutoff percentage, \(p\). It should be a number between 0 and 100.

  • starting_bandwidth_idx – (optional) int or str specifying the index of the starting bandwidth to compute the local feature sizes from. Local feature sizes computed will never be smaller than the starting bandwidth. If set to 'peak', the starting bandwidth will be automatically calculated as the rightmost peak, \(\sigma_{peak}\).

  • verbose – (optional) bool for printing verbose details.

Returns

  • feature_size_map - numpy.ndarray specifying the local feature sizes on a manifold, \(\mathbf{B}\). It has size (n_observations,).

feature_size_map_smooth#
PCAfold.analysis.feature_size_map_smooth(indepvars, feature_size_map, method='median', n_neighbors=10)#

Smooths out a map of local feature sizes on a manifold.

Note

This function requires the scikit-learn module. You can install it through:

pip install scikit-learn

Example:

from PCAfold import PCA, compute_normalized_variance, feature_size_map, feature_size_map_smooth
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components,
                                            X,
                                            depvar_names=variable_names,
                                            bandwidth_values=bandwidth_values)

# Compute the feature size map:
feature_size_map = feature_size_map(variance_data,
                                    variable_name='X_1',
                                    cutoff=1,
                                    starting_bandwidth_idx='peak',
                                    verbose=True)

# Smooth out the feature size map:
updated_feature_size_map = feature_size_map_smooth(principal_components,
                                                   feature_size_map,
                                                   method='median',
                                                   n_neighbors=4)
Parameters
  • indepvars – numpy.ndarray specifying the independent variable values. It should be of size (n_observations,n_independent_variables).

  • feature_size_map – numpy.ndarray specifying the local feature sizes on a manifold, \(\mathbf{B}\). It should be of size (n_observations,) or (n_observations,1).

  • method – (optional) str specifying the smoothing method. It can be 'median', 'mean', 'max' or 'min'.

  • n_neighbors – (optional) int specifying the number of nearest neighbors to smooth over.

Returns

  • updated_feature_size_map - numpy.ndarray specifying the smoothed local feature sizes on a manifold, \(\mathbf{B}\). It has size (n_observations,).

cost_function_normalized_variance_derivative#
PCAfold.analysis.cost_function_normalized_variance_derivative(variance_data, penalty_function=None, power=1, vertical_shift=1, norm=None, integrate_to_peak=False, rightmost_peak_shift=None)#

Defines a cost function for manifold topology assessment based on the areas, or weighted (penalized) areas, under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables.

More information on the theory and application of the cost function can be found in [AZASP22].

An individual area, \(A_i\), for the \(i^{th}\) dependent variable, is computed by directly integrating the function \(\hat{\mathcal{D}}_i(\sigma)\) in the \(\log_{10}\) space of bandwidths \(\sigma\). Integration is performed using the composite trapezoid rule.

When integrate_to_peak=False, the bounds of integration go from the minimum bandwidth, \(\sigma_{min, i}\), to the maximum bandwidth, \(\sigma_{max, i}\):

\[A_i = \int_{\sigma_{min, i}}^{\sigma_{max, i}} \hat{\mathcal{D}}_i(\sigma) d \log_{10} \sigma\]
_images/cost-function-D-hat.svg

When integrate_to_peak=True, the bounds of integration go from the minimum bandwidth, \(\sigma_{min, i}\), to the bandwidth for which the rightmost peak happens in \(\hat{\mathcal{D}}_i(\sigma)\), \(\sigma_{peak, i}\):

\[A_i = \int_{\sigma_{min, i}}^{\sigma_{peak, i}} \hat{\mathcal{D}}_i(\sigma) d \log_{10} \sigma\]
_images/cost-function-D-hat-to-peak.svg

In addition, each individual area, \(A_i\), can be weighted. The following weighting options are available:

  • If penalty_function='peak', \(A_i\) is weighted by the inverse of the rightmost peak location:

\[A_i = \frac{1}{\sigma_{peak, i}} \cdot \int \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]

This creates a constant penalty:

_images/cost-function-peak.svg
  • If penalty_function='sigma', \(A_i\) is weighted continuously by the bandwidth:

\[A_i = \int \frac{1}{\sigma^r} \cdot \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]

where \(r\) is a hyper-parameter that can be controlled by the user. This type of weighting strongly penalizes the area happening at lower bandwidth values.

For instance, when \(r=0.2\):

_images/cost-function-sigma-penalty-r02.svg

When \(r=1\) (with the penalty corresponding to \(r=0.2\) plotted in gray in the background):

_images/cost-function-sigma-penalty-r1.svg
  • If penalty_function='log-sigma-over-peak', \(A_i\) is weighted continuously by the \(\log_{10}\) -transformed bandwidth and takes into account information about the rightmost peak location.

\[A_i = \int \Big( \big| \log_{10} \Big( \frac{\sigma}{\sigma_{peak, i}} \Big) \big|^r + b \cdot \frac{\log_{10} \sigma_{max, i} - \log_{10} \sigma_{min, i}}{\log_{10} \sigma_{peak, i} - \log_{10} \sigma_{min, i}} \Big) \cdot \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]

This type of weighting creates a more gentle penalty for the area happening further from the rightmost peak location. By increasing \(b\), the user can increase the amount of penalty applied to smaller feature sizes over larger ones. By increasing \(r\), the user can penalize non-uniqueness more strongly.

For instance, when \(r=1\):

_images/cost-function-log-sigma-over-peak-penalty-r1.svg

When \(r=2\) (with the penalty corresponding to \(r=1\) plotted in gray in the background):

_images/cost-function-log-sigma-over-peak-penalty-r2.svg

If norm=None, a list of costs for all dependent variables is returned. Otherwise, the final cost, \(\mathcal{L}\), can be computed from all \(A_i\) in a few ways, where \(n_{dep}\) is the number of dependent variables stored in the variance_data object:

  • If norm='average', \(\mathcal{L} = \frac{1}{n_{dep}} \sum_{i = 1}^{n_{dep}} A_i\).

  • If norm='cumulative', \(\mathcal{L} = \sum_{i = 1}^{n_{dep}} A_i\).

  • If norm='max', \(\mathcal{L} = \text{max} (A_i)\).

  • If norm='median', \(\mathcal{L} = \text{median} (A_i)\).

  • If norm='min', \(\mathcal{L} = \text{min} (A_i)\).

Example:

from PCAfold import PCA, compute_normalized_variance, cost_function_normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components,
                                            X,
                                            depvar_names=variable_names,
                                            bandwidth_values=bandwidth_values)

# Compute the cost for the current manifold:
cost = cost_function_normalized_variance_derivative(variance_data,
                                                    penalty_function='sigma',
                                                    power=0.5,
                                                    vertical_shift=1,
                                                    norm='max',
                                                    integrate_to_peak=True)
Parameters
  • variance_data – an object of VarianceData class.

  • penalty_function – (optional) str specifying the weighting (penalty) applied to each area. Set penalty_function='peak' to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Set penalty_function='sigma' to weight each area continuously by the bandwidth. Set penalty_function='log-sigma' to weight each area continuously by the \(\log_{10}\) -transformed bandwidth. Set penalty_function='log-sigma-over-peak' to weight each area continuously by the \(\log_{10}\) -transformed bandwidth, normalized by the rightmost peak location, \(\sigma_{peak, i}\). If penalty_function=None, the area is not weighted.

  • power – (optional) float or int specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.

  • vertical_shift – (optional) float or int specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.

  • norm – (optional) str specifying the norm to apply for all areas \(A_i\). norm='average' uses an arithmetic average, norm='max' uses the \(L_{\infty}\) norm, norm='median' uses a median area, norm='cumulative' uses a cumulative area and norm='min' uses a minimum area. If norm=None, a list of costs for all dependent variables is returned.

  • integrate_to_peak – (optional) bool specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.

  • rightmost_peak_shift – (optional) float or int specifying the percentage, \(p\), of shift in the rightmost peak location. If set to a number between 0 and 100, a quantity \(p/100 (\sigma_{max} - \sigma_{peak, i})\) is added to the rightmost peak location. It can be used to move the rightmost peak location further right, for instance if there is a blending of scales in the \(\hat{\mathcal{D}}(\sigma)\) profile.

Returns

  • cost - float specifying the normalized cost, \(\mathcal{L}\), or, if norm=None, a list of costs, \(A_i\), for each dependent variable.


Kernel Regression#

This section includes details on the Nadaraya-Watson kernel regression [AHardle90] used in assessing manifolds. The KReg class may be used for non-parametric regression in general.

Class KReg#
class PCAfold.kernel_regression.KReg(indepvars, depvars, internal_dtype=<class 'float'>, supress_warning=False)#

A class for building and evaluating Nadaraya-Watson kernel regression models using a Gaussian kernel. The regression estimator \(\mathcal{K}(u; \sigma)\) evaluated at independent variables \(u\) can be expressed using a set of \(n\) observations of independent variables (\(x\)) and dependent variables (\(y\)) as follows

\[\mathcal{K}(u; \sigma) = \frac{\sum_{i=1}^{n} \mathcal{W}_i(u; \sigma) y_i}{\sum_{i=1}^{n} \mathcal{W}_i(u; \sigma)}\]

where a Gaussian kernel of bandwidth \(\sigma\) is used as

\[\mathcal{W}_i(u; \sigma) = \exp \left( \frac{-|| x_i - u ||_2^2}{\sigma^2} \right)\]

Both constant and variable bandwidths are supported. Kernels with anisotropic bandwidths are calculated as

\[\mathcal{W}_i(u; \sigma) = \exp \left( -|| \text{diag}(\sigma)^{-1} (x_i - u) ||_2^2 \right)\]

where \(\sigma\) is a vector of bandwidths per independent variable.

Example:

from PCAfold import KReg
import numpy as np

indepvars = np.expand_dims(np.linspace(0,np.pi,11),axis=1)
depvars = np.cos(indepvars)
query = np.expand_dims(np.linspace(0,np.pi,21),axis=1)

model = KReg(indepvars, depvars)
predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)
Parameters
  • indepvars – numpy.ndarray specifying the independent variable training data, \(x\) in equations above. It should be of size (n_observations,n_independent_variables).

  • depvars – numpy.ndarray specifying the dependent variable training data, \(y\) in equations above. It should be of size (n_observations,n_dependent_variables).

  • internal_dtype – (optional, default float) data type to enforce in training and evaluating

  • supress_warning – (optional, default False) if True, turns off printed warnings

KReg.predict#
PCAfold.kernel_regression.KReg.predict(self, query_points, bandwidth, n_neighbors=None)#

Calculate dependent variable predictions at query_points.

Parameters
  • query_points – numpy.ndarray specifying the independent variable points to query the model. It should be of size (n_points,n_independent_variables).

  • bandwidth

    value(s) to use for the bandwidth in the Gaussian kernel. Supported formats include:

    • single value: constant bandwidth applied to each query point and independent variable dimension.

    • 2D array shape (n_points x n_independent_variables): an array of bandwidths for each independent variable dimension of each query point.

    • string “nearest_neighbors_isotropic”: This option requires the argument n_neighbors to be specified; a bandwidth is then calculated for each query point based on the Euclidean distance to the n_neighbors-th nearest indepvars point.

    • string “nearest_neighbors_anisotropic”: This option requires the argument n_neighbors to be specified; a bandwidth is then calculated for each query point based on the distance in each (separate) independent variable dimension to the n_neighbors-th nearest indepvars point.

Returns

dependent variable predictions for the query_points
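
A minimal sketch of the supported bandwidth formats, continuing the KReg example above (the bandwidth values are arbitrary):

from PCAfold import KReg
import numpy as np

indepvars = np.expand_dims(np.linspace(0,np.pi,11),axis=1)
depvars = np.cos(indepvars)
query = np.expand_dims(np.linspace(0,np.pi,21),axis=1)

model = KReg(indepvars, depvars)

# Constant bandwidth for all query points and dimensions:
predicted_constant = model.predict(query, 0.2)

# One bandwidth per query point and independent variable dimension:
bandwidth_array = 0.2*np.ones_like(query)
predicted_array = model.predict(query, bandwidth_array)

# Bandwidths computed from nearest-neighbor distances:
predicted_nn = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=2)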

KReg.compute_constant_bandwidth#
PCAfold.kernel_regression.KReg.compute_constant_bandwidth(self, query_points, bandwidth)#

Format a single bandwidth value into a 2D array matching the shape of query_points

Parameters
  • query_points – array of independent variable points to query the model (n_points x n_independent_variables)

  • bandwidth – single value for the bandwidth used in a Gaussian kernel

Returns

an array of bandwidth values matching the shape of query_points

KReg.compute_bandwidth_isotropic#
PCAfold.kernel_regression.KReg.compute_bandwidth_isotropic(self, query_points, bandwidth)#

Format a 1D array of bandwidth values for each point in query_points into a 2D array matching the shape of query_points

Parameters
  • query_points – array of independent variable points to query the model (n_points x n_independent_variables)

  • bandwidth – 1D array of bandwidth values length n_points

Returns

an array of bandwidth values matching the shape of query_points (repeats the bandwidth array for each independent variable)

KReg.compute_bandwidth_anisotropic#
PCAfold.kernel_regression.KReg.compute_bandwidth_anisotropic(self, query_points, bandwidth)#

Format a 1D array of bandwidth values for each independent variable into the 2D array matching the shape of query_points

Parameters
  • query_points – array of independent variable points to query the model (n_points x n_independent_variables)

  • bandwidth – 1D array of bandwidth values length n_independent_variables

Returns

an array of bandwidth values matching the shape of query_points (repeats the bandwidth array for each point in query_points)

KReg.compute_nearest_neighbors_bandwidth_isotropic#
PCAfold.kernel_regression.KReg.compute_nearest_neighbors_bandwidth_isotropic(self, query_points, n_neighbors)#

Compute a variable bandwidth for each point in query_points based on the Euclidean distance to the n_neighbors-th nearest neighbor

Parameters
  • query_points – array of independent variable points to query the model (n_points x n_independent_variables)

  • n_neighbors – integer value for the number of nearest neighbors to consider in computing a bandwidth (distance)

Returns

an array of bandwidth values matching the shape of query_points (varies for each point, constant across independent variables)
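
A minimal sketch showing how the computed bandwidth array can be passed back to KReg.predict, continuing the KReg example above:

from PCAfold import KReg
import numpy as np

indepvars = np.expand_dims(np.linspace(0,np.pi,11),axis=1)
depvars = np.cos(indepvars)
query = np.expand_dims(np.linspace(0,np.pi,21),axis=1)

model = KReg(indepvars, depvars)

# Compute variable bandwidths from the distance to the second nearest neighbor:
bandwidth_array = model.compute_nearest_neighbors_bandwidth_isotropic(query, 2)

# The array matches the shape of query_points and can be used directly in predict:
predicted = model.predict(query, bandwidth_array)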

KReg.compute_nearest_neighbors_bandwidth_anisotropic#
PCAfold.kernel_regression.KReg.compute_nearest_neighbors_bandwidth_anisotropic(self, query_points, n_neighbors)#

Compute a variable bandwidth for each point in query_points and each independent variable separately, based on the distance to the n_neighbors-th nearest neighbor in each independent variable dimension

Parameters
  • query_points – array of independent variable points to query the model (n_points x n_independent_variables)

  • n_neighbors – integer value for the number of nearest neighbors to consider in computing a bandwidth (distance)

Returns

an array of bandwidth values matching the shape of query_points (varies for each point and independent variable)


Plotting functions#

plot_normalized_variance#
PCAfold.analysis.plot_normalized_variance(variance_data, plot_variables=[], color_map='Blues', figure_size=(10, 5), title=None, save_filename=None)#

This function plots normalized variance \(\mathcal{N}(\sigma)\) over bandwidth values \(\sigma\) from an object of the VarianceData class.

Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the plot_variables list.

Example:

from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Plot normalized variance quantities:
plt = plot_normalized_variance(variance_data,
                               plot_variables=[0,1,2],
                               color_map='Blues',
                               figure_size=(10,5),
                               title='Normalized variance',
                               save_filename='N.pdf')
plt.close()
Parameters
  • variance_data – an object of the VarianceData class whose normalized variance quantities should be plotted.

  • plot_variables – (optional) list of int specifying indices of variables to be plotted. By default, all variables are plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'Blues'.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_normalized_variance_comparison#
PCAfold.analysis.plot_normalized_variance_comparison(variance_data_tuple, plot_variables_tuple, color_map_tuple, figure_size=(10, 5), title=None, save_filename=None)#

This function plots a comparison of normalized variance \(\mathcal{N}(\sigma)\) over bandwidth values \(\sigma\) from several objects of the VarianceData class.

Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the plot_variables list.

Example:

from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,5)
Y = np.random.rand(100,5)

# Perform PCA to obtain low-dimensional manifolds:
pca_X = PCA(X, n_components=2)
pca_Y = PCA(Y, n_components=2)
principal_components_X = pca_X.transform(X)
principal_components_Y = pca_Y.transform(Y)

# Compute normalized variance quantities:
variance_data_X = compute_normalized_variance(principal_components_X, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)
variance_data_Y = compute_normalized_variance(principal_components_Y, Y, depvar_names=['F', 'G', 'H', 'I', 'J'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)

# Plot a comparison of normalized variance quantities:
plt = plot_normalized_variance_comparison((variance_data_X, variance_data_Y),
                                          ([0,1,2], [0,1,2]),
                                          ('Blues', 'Reds'),
                                          figure_size=(10,5),
                                          title='Normalized variance comparison',
                                          save_filename='N.pdf')
plt.close()
Parameters
  • variance_data_tuple – tuple of VarianceData class objects whose normalized variance quantities should be compared on one plot. For instance: (variance_data_1, variance_data_2).

  • plot_variables_tuple – tuple of list of int specifying indices of variables to be plotted. It should have as many elements as there are VarianceData class objects supplied. For instance: ([], []) will plot all variables.

  • color_map_tuple – tuple of str or matplotlib.colors.ListedColormap specifying the colormaps to use as per matplotlib.cm. It should have as many elements as there are VarianceData class objects supplied. For instance: ('Blues', 'Reds').

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_normalized_variance_derivative#
PCAfold.analysis.plot_normalized_variance_derivative(variance_data, plot_variables=[], color_map='Blues', figure_size=(10, 5), title=None, save_filename=None)#

This function plots a scaled normalized variance derivative (computed over logarithmically scaled bandwidths), \(\hat{\mathcal{D}}(\sigma)\), over bandwidth values \(\sigma\) from an object of the VarianceData class.

Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the plot_variables list.

Example:

from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Plot normalized variance derivative:
plt = plot_normalized_variance_derivative(variance_data,
                                          plot_variables=[0,1,2],
                                          color_map='Blues',
                                          figure_size=(10,5),
                                          title='Normalized variance derivative',
                                          save_filename='D-hat.pdf')
plt.close()
Parameters
  • variance_data – an object of the VarianceData class whose normalized variance derivative quantities should be plotted.

  • plot_variables – (optional) list of int specifying indices of variables to be plotted. By default, all variables are plotted.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'Blues'.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_normalized_variance_derivative_comparison#
PCAfold.analysis.plot_normalized_variance_derivative_comparison(variance_data_tuple, plot_variables_tuple, color_map_tuple, figure_size=(10, 5), title=None, save_filename=None)#

This function plots a comparison of scaled normalized variance derivatives (computed over logarithmically scaled bandwidths), \(\hat{\mathcal{D}}(\sigma)\), over bandwidth values \(\sigma\) from several objects of the VarianceData class.

Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the plot_variables list.

Example:

from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_derivative_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,5)
Y = np.random.rand(100,5)

# Perform PCA to obtain low-dimensional manifolds:
pca_X = PCA(X, n_components=2)
pca_Y = PCA(Y, n_components=2)
principal_components_X = pca_X.transform(X)
principal_components_Y = pca_Y.transform(Y)

# Compute normalized variance quantities:
variance_data_X = compute_normalized_variance(principal_components_X, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)
variance_data_Y = compute_normalized_variance(principal_components_Y, Y, depvar_names=['F', 'G', 'H', 'I', 'J'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)

# Plot a comparison of normalized variance derivatives:
plt = plot_normalized_variance_derivative_comparison((variance_data_X, variance_data_Y),
                                                     ([0,1,2], [0,1,2]),
                                                     ('Blues', 'Reds'),
                                                     figure_size=(10,5),
                                                     title='Normalized variance derivative comparison',
                                                     save_filename='D-hat.pdf')
plt.close()
Parameters
  • variance_data_tuple – tuple of VarianceData class objects whose normalized variance derivative quantities should be compared on one plot. For instance: (variance_data_1, variance_data_2).

  • plot_variables_tuple – tuple of list of int specifying indices of variables to be plotted. It should have as many elements as there are VarianceData class objects supplied. For instance: ([], []) will plot all variables.

  • color_map_tuple – tuple of str or matplotlib.colors.ListedColormap specifying the colormaps to use as per matplotlib.cm. It should have as many elements as there are VarianceData class objects supplied. For instance: ('Blues', 'Reds').

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.


Bibliography#

AAS21

Elizabeth Armstrong and James C. Sutherland. A technique for characterising feature size and quality of manifolds. Combustion Theory and Modelling, 0(0):1–23, 2021. doi:10.1080/13647830.2021.1931715.

AHardle90

Wolfgang Härdle. Applied Nonparametric Regression. Econometric Society Monographs. Cambridge University Press, 1990. doi:10.1017/CCOL0521382483.

AZASP22

Kamila Zdybał, Elizabeth Armstrong, James C. Sutherland, and Alessandro Parente. Cost function for low-dimensional manifold topology assessment. Scientific Reports, 12:14496, 2022. URL: https://www.nature.com/articles/s41598-022-18655-1, doi:10.1038/s41598-022-18655-1.

Reconstruction#

Tools for reconstructing quantities of interest (QoIs)#

Class ANN#
class PCAfold.reconstruction.ANN(input_data, output_data, interior_architecture=(), activation_functions='tanh', weights_init='glorot_uniform', biases_init='zeros', loss='MSE', optimizer='Adam', batch_size=200, n_epochs=1000, learning_rate=0.001, validation_perc=10, random_seed=None, verbose=False)#

Enables reconstruction of quantities of interest (QoIs) using an artificial neural network (ANN).

Example:

from PCAfold import ANN
import numpy as np

# Generate dummy dataset:
input_data = np.random.rand(100,8)
output_data = np.random.rand(100,3)

# Instantiate ANN class object:
ann_model = ANN(input_data,
                output_data,
                interior_architecture=(5,4),
                activation_functions=('tanh', 'tanh', 'linear'),
                weights_init='glorot_uniform',
                biases_init='zeros',
                loss='MSE',
                optimizer='Adam',
                batch_size=100,
                n_epochs=1000,
                learning_rate=0.001,
                validation_perc=10,
                random_seed=100,
                verbose=True)

# Begin model training:
ann_model.train()

A summary of the current ANN model and its hyperparameter settings can be printed using the summary() function:

# Print the ANN model summary
ann_model.summary()
ANN model summary...
Parameters
  • input_data – numpy.ndarray specifying the data set used as the input (regressors) to the ANN. It should be of size (n_observations,n_input_variables).

  • output_data – numpy.ndarray specifying the data set used as the output (predictors) to the ANN. It should be of size (n_observations,n_output_variables).

  • interior_architecture – (optional) tuple of int specifying the number of neurons in the interior network architecture. For example, if interior_architecture=(4,5), two interior layers will be created and the overall network architecture will be (Input)-(4)-(5)-(Output). If set to an empty tuple, interior_architecture=(), the overall network architecture will be (Input)-(Output). Keep in mind that if you’d like to create just one interior layer, you should use a comma after the integer: interior_architecture=(4,).

  • activation_functions – (optional) str or tuple specifying activation functions in all layers. If set to str, the same activation function is used in all layers. If set to a tuple of str, a different activation function can be set for different layers. The number of elements in the tuple should match the number of layers! The str (or each str element of the tuple) can only be 'linear', 'sigmoid', or 'tanh'.

  • weights_init – (optional) str specifying the initialization of weights in the network. If set to None, weights will be initialized using the Glorot uniform distribution.

  • biases_init – (optional) str specifying the initialization of biases in the network. If set to None, biases will be initialized as zeros.

  • loss – (optional) str specifying the loss function. It can be 'MAE' or 'MSE'.

  • optimizer – (optional) str specifying the optimizer used during training. It can be 'Adam' or 'Nadam'.

  • batch_size – (optional) int specifying the batch size.

  • n_epochs – (optional) int specifying the number of epochs.

  • learning_rate – (optional) float specifying the learning rate passed to the optimizer.

  • validation_perc – (optional) int specifying the percentage of the input data to be used as validation data during training. It should be a number larger than or equal to 0 and smaller than 100. Note that if it is set above 0, not all of the input data will be used as training data. Note that validation data does not impact model training!

  • random_seed – (optional) int specifying the random seed to be used for any random operations. It is highly recommended to set a fixed random seed, as this allows for complete reproducibility of the results.

  • verbose – (optional) bool for printing verbose details.

Attributes:

  • input_data - (read only) numpy.ndarray specifying the data set used as the input to the ANN.

  • output_data - (read only) numpy.ndarray specifying the data set used as the output to the ANN.

  • architecture - (read only) str specifying the ANN architecture.

  • ann_model - (read only) object of Keras.models.Sequential class that stores the artificial neural network model.

  • weights_and_biases_init - (read only) list of numpy.ndarray specifying weights and biases with which the ANN was initialized.

  • weights_and_biases_trained - (read only) list of numpy.ndarray specifying weights and biases after training the ANN. Only available after calling ANN.train().

  • training_loss - (read only) list of losses computed on the training data. Only available after calling ANN.train().

  • validation_loss - (read only) list of losses computed on the validation data. Only available after calling ANN.train() and only when validation_perc is not equal to 0.

ANN.summary#
PCAfold.reconstruction.ANN.summary(self)#

Prints the ANN model summary.

ANN.train#
PCAfold.reconstruction.ANN.train(self)#

Trains the artificial neural network (ANN) model.

ANN.predict#
PCAfold.reconstruction.ANN.predict(self, input_regressors)#

Predicts the quantities of interest (QoIs) from the trained artificial neural network (ANN) model.

Parameters

input_regressors – numpy.ndarray specifying the input data (regressors) to be used for predicting the quantities of interest (QoIs) from the trained ANN model. It should be of size (n_observations,n_input_variables), where n_observations can be different from the number of observations in the training dataset.

Returns

  • output_predictors - predicted quantities of interest (QoIs).
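
A minimal sketch of reconstructing QoIs at new input points with a trained model, continuing the ANN example above (the new input array is dummy data for illustration):

from PCAfold import ANN
import numpy as np

# Generate dummy dataset:
input_data = np.random.rand(100,8)
output_data = np.random.rand(100,3)

# Instantiate and train the ANN model:
ann_model = ANN(input_data, output_data, interior_architecture=(5,4), n_epochs=100, random_seed=100)
ann_model.train()

# Predict the QoIs at new (dummy) input points:
new_input = np.random.rand(50,8)
predicted_qois = ann_model.predict(new_input)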

ANN.print_weights_and_biases_init#
PCAfold.reconstruction.ANN.print_weights_and_biases_init(self)#

Prints initial weights and biases from all layers of the ANN.

ANN.print_weights_and_biases_trained#
PCAfold.reconstruction.ANN.print_weights_and_biases_trained(self)#

Prints trained weights and biases from all layers of the ANN.

ANN.plot_losses#
PCAfold.reconstruction.ANN.plot_losses(self, markevery=100, figure_size=(15, 5), save_filename=None)#

Plots training and validation losses.

Parameters
  • markevery – (optional) int specifying how frequently the epoch number on the x-axis should be labelled.

  • figure_size – (optional) tuple specifying figure size.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.
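
A minimal sketch, assuming a trained ann_model as in the ANN class example above:

# Plot the training and validation loss history:
plt = ann_model.plot_losses(markevery=100,
                            figure_size=(15,5),
                            save_filename='ANN-losses.pdf')
plt.close()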


Class PartitionOfUnityNetwork#
class PCAfold.reconstruction.PartitionOfUnityNetwork(partition_centers, partition_shapes, basis_type, ivar_center=None, ivar_scale=None, basis_coeffs=None, transform_power=1.0, transform_shift=0.0, transform_sign_shift=0.0, dtype='float64')#

A class for reconstruction (regression) of QoIs using POUnets.

The POUnets are constructed with a single-layer network of normalized radial basis functions (RBFs) whose neurons each own and weight a polynomial basis. For independent variable inputs \(\vec{x}\) of dimensionality \(d\), the \(i^{\text{th}}\) partition or neuron is computed as

\[\Phi_i(\vec{x};\vec{h}_i,K_i) = \phi^{{\rm RBF}}_i(\vec{x};\vec{h}_i,K_i)/\sum_j \phi^{{\rm RBF}}_j(\vec{x};\vec{h}_j,K_j)\]

where

\[\phi_i^{{\rm RBF}}(\vec{x};\vec{h}_i,K_i) = \exp\left(-(\vec{x}-\vec{h}_i)^\mathsf{T}K_i(\vec{x}-\vec{h}_i)\right)\]

with vector \(\vec{h}_i\) and diagonal matrix \(K_i\) defining the \(d\) center and \(d\) shape parameters, respectively, for training.

The final output of a POUnet is then obtained through

\[g(\vec{x};\vec{h},K,c) = \sum_{i=1}^{p}\left(\Phi_i(\vec{x};\vec{h}_i,K_i)\sum_{k=1}^{b}c_{i,k}m_k(\vec{x})\right)\]

where the polynomial basis is represented as a sum of \(b\) Taylor monomials, with the \(k^{\text{th}}\) monomial written as \(m_k(\vec{x})\), that are multiplied by trainable basis coefficients \(c\). The number of basis monomials is determined by the basis_type for the polynomial. For example, in two-dimensional space, a quadratic polynomial basis contains \(b=6\) monomial functions \(\{1, x_1, x_2, x_1^2, x_2^2, x_1x_2\}\). The combination of the partitions and polynomial basis functions creates localized polynomial fits for a QoI.

More information can be found in [UAHK+22].

The PartitionOfUnityNetwork class also provides a nonlinear transformation for the dependent variable(s) during training, which can be beneficial if the variable changes over orders of magnitude, for example. The equation for the transformation of variable \(f\) is

\[(|f + s_1|)^\alpha \text{sign}(f + s_1) + s_2 \text{sign}(f + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.
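
As a quick numerical illustration of this transformation (a standalone NumPy sketch, not part of the PCAfold API; the parameter values are arbitrary):

import numpy as np

# Hypothetical dependent variable spanning several orders of magnitude:
f = np.array([1.e-6, 1.e-3, 1.e0, 1.e3])

# Transformation parameters: transform_power, transform_shift, transform_sign_shift:
(alpha, s1, s2) = (0.5, 0.0, 0.0)

# Apply the transformation from the equation above:
f_transformed = np.abs(f + s1)**alpha*np.sign(f + s1) + s2*np.sign(f + s1)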

Example:

from PCAfold import init_uniform_partitions, PartitionOfUnityNetwork
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)
dvars = 2.*ivars[:,0] + 3.*ivars[:,1]

# Initialize the POUnet parameters
net = PartitionOfUnityNetwork(**init_uniform_partitions([5,7], ivars), basis_type='linear')

# Build the training graph with provided training data
net.build_training_graph(ivars, dvars)

# (optional) update the learning rate (default is 1.e-3)
net.update_lr(1.e-4)

# (optional) update the least-squares regularization (default is 1.e-10)
net.update_l2reg(1.e-10)

# Train the POUnet
net.train(1000)

# Evaluate the POUnet
pred = net(ivars)

# Evaluate the POUnet derivatives
der = net.derivatives(ivars)

# Save the POUnet to a file
net.write_data_to_file('filename.pkl')

# Load a POUnet from file
net2 = PartitionOfUnityNetwork.load_from_file('filename.pkl')

# Evaluate the loaded POUnet (without needing to call build_training_graph)
pred2 = net2(ivars)
Parameters
  • partition_centers – array size (number of partitions) x (number of ivar inputs) for partition locations

  • partition_shapes – array size (number of partitions) x (number of ivar inputs) for partition shapes influencing the RBF widths

  • basis_type – string ('constant', 'linear', or 'quadratic') for the degree of polynomial basis

  • ivar_center – (optional, default None) array for centering the ivar inputs before evaluating the POUnet, if None centers with zeros

  • ivar_scale – (optional, default None) array for scaling the ivar inputs before evaluating the POUnet, if None scales with ones

  • basis_coeffs – (optional, default None) if the array of polynomial basis coefficients is known, it may be provided here, otherwise it will be initialized with build_training_graph and trained with train

  • transform_power – (optional, default 1.) the power parameter used in the transformation equation during training

  • transform_shift – (optional, default 0.) the shift parameter used in the transformation equation during training

  • transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation during training

  • dtype – (optional, default 'float64') string specifying either float type 'float64' or 'float32'

Attributes:

  • partition_centers - (read only) array of the current partition centers

  • partition_shapes - (read only) array of the current partition shape parameters

  • basis_type - (read only) string relaying the basis degree

  • basis_coeffs - (read only) array of the current basis coefficients

  • ivar_center - (read only) array of the centering parameters for the ivar inputs

  • ivar_scale - (read only) array of the scaling parameters for the ivar inputs

  • dtype - (read only) string relaying the data type ('float64' or 'float32')

  • training_archive - (read only) dictionary of the errors and POUnet states archived during training

  • iterations - (read only) array of the iterations archived during training

PartitionOfUnityNetwork.load_data_from_file#
PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_file(filename)#

Load data from a specified filename with pickle (following write_data_to_file)

Parameters

filename – string

Returns

dictionary of the POUnet data

PartitionOfUnityNetwork.load_from_file#
PCAfold.reconstruction.PartitionOfUnityNetwork.load_from_file(filename)#

Load class from a specified filename with pickle (following write_data_to_file)

Parameters

filename – string

Returns

PartitionOfUnityNetwork

PartitionOfUnityNetwork.load_data_from_txt#
PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_txt(filename, verbose=False)#

Load data from a specified txt filename (following write_data_to_txt)

Parameters
  • filename – string

  • verbose – (optional, default False) print out the data as it is read

Returns

dictionary of the POUnet data

PartitionOfUnityNetwork.write_data_to_file#
PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_file(self, filename)#

Save class data to a specified file using pickle. This does not include the archived data from training, which can be separately accessed with training_archive and saved outside of PartitionOfUnityNetwork.

Parameters

filename – string

PartitionOfUnityNetwork.write_data_to_txt#
PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_txt(self, filename, nformat='%.14e')#

Save data to a specified txt file. This may be used to read POUnet parameters into other languages such as C++.

Parameters
  • filename – string

  • nformat – (optional, default '%.14e') string specifying the format used for the numerical entries written to the file
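
As a usage sketch, reusing the net object from the class example above, the POUnet parameters can be written to a text file and read back into a dictionary:

# Write the POUnet parameters to a txt file (e.g., for a C++ reader):
net.write_data_to_txt('filename.txt', nformat='%.14e')

# Read the parameters back into a dictionary:
data = PartitionOfUnityNetwork.load_data_from_txt('filename.txt', verbose=True)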

PartitionOfUnityNetwork.build_training_graph#
PCAfold.reconstruction.PartitionOfUnityNetwork.build_training_graph(self, ivars, dvars, error_type='abs', constrain_positivity=False, istensor=False, verbose=False)#

Construct the graph used during training (including defining the training errors) with the provided training data

Parameters
  • ivars – array of independent variables for training

  • dvars – array of dependent variable(s) for training

  • error_type – (optional, default 'abs') the type of training error: relative 'rel' or absolute 'abs'

  • constrain_positivity – (optional, default False) when True, it penalizes the training error with \(f - |f|\) for dependent variables \(f\). This can be useful when used in QoIAwareProjectionPOUnet

  • istensor – (optional, default False) whether to evaluate ivars and dvars as tensorflow Tensors or numpy arrays

  • verbose – (optional, default False) when True, prints out the number of partition and basis parameters
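
For example, reusing the net, ivars and dvars objects from the class example above, the graph can be built with relative training errors and a positivity penalty:

# Build the training graph with relative errors and a positivity penalty:
net.build_training_graph(ivars, dvars, error_type='rel', constrain_positivity=True, verbose=True)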

PartitionOfUnityNetwork.update_lr#
PCAfold.reconstruction.PartitionOfUnityNetwork.update_lr(self, lr)#

Update the learning rate for training

Parameters

lr – float for the learning rate

PartitionOfUnityNetwork.update_l2reg#
PCAfold.reconstruction.PartitionOfUnityNetwork.update_l2reg(self, l2reg)#

Update the least-squares regularization for training

Parameters

l2reg – float for the least-squares regularization

PartitionOfUnityNetwork.lstsq#
PCAfold.reconstruction.PartitionOfUnityNetwork.lstsq(self, verbose=True)#

Update the basis coefficients with least-squares regression

Parameters

verbose – (optional, default True) when True, prints a message when the least-squares solve is performed

PartitionOfUnityNetwork.train#
PCAfold.reconstruction.PartitionOfUnityNetwork.train(self, iterations, archive_rate=100, use_best_archive_sse=True, verbose=False)#

Performs training using a block coordinate descent strategy that alternates between updating the partition parameters with gradient descent and updating the basis coefficients with least-squares regression.

Parameters
  • iterations – integer for number of training iterations to perform

  • archive_rate – (optional, default 100) the rate at which the errors and parameters are archived during training. These can be accessed with the training_archive attribute

  • use_best_archive_sse – (optional, default True) when True will set the POUnet parameters to those with the lowest error observed during training, otherwise the parameters from the last iteration are used

  • verbose – (optional, default False) when True will print progress
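
A short sketch of training with a custom archiving rate and then inspecting the archive, reusing the net object from the class example above (the exact contents of training_archive are not listed here, so only its keys are printed):

# Train while archiving errors and parameters every 200 iterations:
net.train(1000, archive_rate=200, use_best_archive_sse=True, verbose=True)

# Inspect what was archived during training:
print(net.iterations)
print(net.training_archive.keys())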

PartitionOfUnityNetwork.__call__#
PCAfold.reconstruction.PartitionOfUnityNetwork.__call__(self, xeval)#

Evaluate the POUnet

Parameters

xeval – array of independent variable query points

Returns

array of POUnet predictions

PartitionOfUnityNetwork.derivatives#
PCAfold.reconstruction.PartitionOfUnityNetwork.derivatives(self, xeval, dvar_idx=0)#

Evaluate the POUnet derivatives

Parameters
  • xeval – array of independent variable query points

  • dvar_idx – (optional, default 0) index for the dependent variable whose derivatives are being evaluated

Returns

array of POUnet derivative evaluations

PartitionOfUnityNetwork.partition_prenorm#
PCAfold.reconstruction.PartitionOfUnityNetwork.partition_prenorm(self, xeval)#

Evaluate the POUnet partitions prior to normalization

Parameters

xeval – array of independent variable query points

Returns

array of POUnet RBF partition evaluations before normalization

init_uniform_partitions#
PCAfold.reconstruction.init_uniform_partitions(list_npartitions, ivars, width_factor=0.5, verbose=False)#

Computes parameters for initializing partition locations near training data with uniform spacing in each dimension.

Example:

from PCAfold import init_uniform_partitions
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)

# compute partition parameters for an initial 5x7 grid:
init_data = init_uniform_partitions([5, 7], ivars)
Parameters
  • list_npartitions – list of integers specifying the number of partitions to try initializing in each dimension. Only partitions near the provided ivars are kept.

  • ivars – array of independent variables used for determining which partitions to keep

  • width_factor – (optional, default 0.5) the factor multiplying the spacing between partitions for initializing the partitions’ RBF widths

  • verbose – (optional, default False) when True, prints the number of partitions retained compared to the initial grid

Returns

a dictionary of partition parameters to be used in initializing a PartitionOfUnityNetwork


Regression assessment#

Class RegressionAssessment#
class PCAfold.reconstruction.RegressionAssessment(observed, predicted, idx=None, variable_names=None, use_global_mean=False, norm='std', use_global_norm=False, tolerance=0.05)#

Wrapper class for storing all regression assessment metrics for a regression solution specified by the observed dependent variables, \(\pmb{\phi}_o\), and the predicted dependent variables, \(\pmb{\phi}_p\).

Example:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Access mean absolute error values:
MAE = regression_metrics.mean_absolute_error

In addition, all stratified regression metrics can be computed on a single variable:

from PCAfold import variable_bins

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=5, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Access stratified mean absolute error values:
stratified_MAE = stratified_regression_metrics.stratified_mean_absolute_error
Parameters
  • observednumpy.ndarray specifying the observed values of dependent variables, \(\pmb{\phi}_o\). It should be of size (n_observations,) or (n_observations,n_variables).

  • predictednumpy.ndarray specifying the predicted values of dependent variables, \(\pmb{\phi}_p\). It should be of size (n_observations,) or (n_observations,n_variables).

  • idx – (optional) numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • variable_names – (optional) list of str specifying variable names.

  • use_global_mean – (optional) bool specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.

  • norm – (optional) str specifying the normalization, \(d_{norm}\), for NRMSE computation. It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

  • use_global_norm – (optional) bool specifying if global norm of the observed variable should be used in NRMSE calculation.

  • tolerance – (optional) float specifying the tolerance for GDE computation.

Attributes:

  • coefficient_of_determination - (read only) numpy.ndarray specifying the coefficient of determination, \(R^2\), values. It has size (1,n_variables).

  • mean_absolute_error - (read only) numpy.ndarray specifying the mean absolute error (MAE) values. It has size (1,n_variables).

  • mean_squared_error - (read only) numpy.ndarray specifying the mean squared error (MSE) values. It has size (1,n_variables).

  • root_mean_squared_error - (read only) numpy.ndarray specifying the root mean squared error (RMSE) values. It has size (1,n_variables).

  • normalized_root_mean_squared_error - (read only) numpy.ndarray specifying the normalized root mean squared error (NRMSE) values. It has size (1,n_variables).

  • good_direction_estimate - (read only) float specifying the good direction estimate (GDE) value, treating the entire \(\pmb{\phi}_o\) and \(\pmb{\phi}_p\) as vectors. Note that if a single dependent variable is passed, GDE cannot be computed and is set to NaN.

If idx has been specified:

  • stratified_coefficient_of_determination - (read only) numpy.ndarray specifying the coefficient of determination, \(R^2\), values. It has size (1,n_variables).

  • stratified_mean_absolute_error - (read only) numpy.ndarray specifying the mean absolute error (MAE) values. It has size (1,n_variables).

  • stratified_mean_squared_error - (read only) numpy.ndarray specifying the mean squared error (MSE) values. It has size (1,n_variables).

  • stratified_root_mean_squared_error - (read only) numpy.ndarray specifying the root mean squared error (RMSE) values. It has size (1,n_variables).

  • stratified_normalized_root_mean_squared_error - (read only) numpy.ndarray specifying the normalized root mean squared error (NRMSE) values. It has size (1,n_variables).

RegressionAssessment.print_metrics#
PCAfold.reconstruction.RegressionAssessment.print_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#

Prints regression assessment metrics as raw text, in tex format and/or as pandas.DataFrame.

Example:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Print regression metrics:
regression_metrics.print_metrics(table_format=['raw', 'tex', 'pandas'],
                                 float_format='.4f',
                                 metrics=['R2', 'NRMSE', 'GDE'])

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
X1
R2:     0.9900
NRMSE:  0.0999
GDE:    70.0000
-------------------------
X2
R2:     0.6126
NRMSE:  0.6224
GDE:    70.0000
-------------------------
X3
R2:     0.6368
NRMSE:  0.6026
GDE:    70.0000

Adding 'tex' to the table_format list will result in printing:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llll} \toprule
 & \textit{X1} & \textit{X2} & \textit{X3} \\ \midrule
R2 & 0.9900 & 0.6126 & 0.6368 \\
NRMSE & 0.0999 & 0.6224 & 0.6026 \\
GDE & 70.0000 & 70.0000 & 70.0000 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

_images/generate-pandas-table.png

Additionally, the current object of RegressionAssessment class can be compared with another object:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)
Y = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)
pca_Y = PCA(Y, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))
Y_rec = pca_Y.reconstruct(pca_Y.transform(Y))

# Instantiate RegressionAssessment class object:
regression_metrics_X = RegressionAssessment(X, X_rec)
regression_metrics_Y = RegressionAssessment(Y, Y_rec)

# Print regression metrics:
regression_metrics_X.print_metrics(table_format=['raw', 'pandas'],
                                   float_format='.4f',
                                   metrics=['R2', 'NRMSE', 'GDE'],
                                   comparison=regression_metrics_Y)

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
X1
R2:     0.9133  BETTER
NRMSE:  0.2944  BETTER
GDE:    67.0000 WORSE
-------------------------
X2
R2:     0.5969  WORSE
NRMSE:  0.6349  WORSE
GDE:    67.0000 WORSE
-------------------------
X3
R2:     0.6175  WORSE
NRMSE:  0.6185  WORSE
GDE:    67.0000 WORSE

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

_images/generate-pandas-table-comparison.png
Parameters
  • table_format – (optional) list of str specifying the format(s) in which the table should be printed. Strings can only be 'raw', 'tex' and/or 'pandas'.

  • float_format – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '.4f'.

  • metrics – (optional) list of str specifying which metrics should be printed. Strings can only be 'R2', 'MAE', 'MSE', 'MSLE', 'RMSE', 'NRMSE', 'GDE'. If metrics is set to None, all available metrics will be printed.

  • comparison – (optional) object of RegressionAssessment class specifying the metrics that should be compared with the current regression metrics.

RegressionAssessment.print_stratified_metrics#
PCAfold.reconstruction.RegressionAssessment.print_stratified_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#

Prints stratified regression assessment metrics as raw text, in tex format and/or as pandas.DataFrame. In each cluster, in addition to the regression metrics, the number of observations is printed, along with the minimum and maximum values of the observed variable in that cluster.

Example:

from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Print regression metrics:
stratified_regression_metrics.print_stratified_metrics(table_format=['raw', 'tex', 'pandas'],
                                                       float_format='.4f',
                                                       metrics=['R2', 'MAE', 'NRMSE'])

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
k1
Observations:   31
Min:    0.0120
Max:    0.3311
R2:     -3.3271
MAE:    0.1774
NRMSE:  2.0802
-------------------------
k2
Observations:   38
Min:    0.3425
Max:    0.6665
R2:     -1.4608
MAE:    0.1367
NRMSE:  1.5687
-------------------------
k3
Observations:   31
Min:    0.6853
Max:    0.9959
R2:     -3.7319
MAE:    0.1743
NRMSE:  2.1753

Adding 'tex' to the table_format list will result in printing:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llll} \toprule
 & \textit{k1} & \textit{k2} & \textit{k3} \\ \midrule
Observations & 31.0000 & 38.0000 & 31.0000 \\
Min & 0.0120 & 0.3425 & 0.6853 \\
Max & 0.3311 & 0.6665 & 0.9959 \\
R2 & -3.3271 & -1.4608 & -3.7319 \\
MAE & 0.1774 & 0.1367 & 0.1743 \\
NRMSE & 2.0802 & 1.5687 & 2.1753 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

_images/generate-pandas-table-stratified.png

Additionally, the current object of RegressionAssessment class can be compared with another object:

from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics_0 = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)
stratified_regression_metrics_1 = RegressionAssessment(X[:,1], X_rec[:,1], idx=idx)

# Print regression metrics:
stratified_regression_metrics_0.print_stratified_metrics(table_format=['raw', 'pandas'],
                                                         float_format='.4f',
                                                         metrics=['R2', 'MAE', 'NRMSE'],
                                                         comparison=stratified_regression_metrics_1)

Note

Adding 'raw' to the table_format list will result in printing:

-------------------------
k1
Observations:   39
Min:    0.0013
Max:    0.3097
R2:     0.9236  BETTER
MAE:    0.0185  BETTER
NRMSE:  0.2764  BETTER
-------------------------
k2
Observations:   29
Min:    0.3519
Max:    0.6630
R2:     0.9380  BETTER
MAE:    0.0179  BETTER
NRMSE:  0.2491  BETTER
-------------------------
k3
Observations:   32
Min:    0.6663
Max:    0.9943
R2:     0.9343  BETTER
MAE:    0.0194  BETTER
NRMSE:  0.2563  BETTER

Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing:

_images/generate-pandas-table-comparison-stratified.png
Parameters
  • table_format – (optional) list of str specifying the format(s) in which the table should be printed. Strings can only be 'raw', 'tex' and/or 'pandas'.

  • float_format – (optional) str specifying the display format for the numerical entries inside the table. By default it is set to '.4f'.

  • metrics – (optional) list of str specifying which metrics should be printed. Strings can only be 'R2', 'MAE', 'MSE', 'MSLE', 'RMSE', 'NRMSE'. If metrics is set to None, all available metrics will be printed.

  • comparison – (optional) object of RegressionAssessment class specifying the metrics that should be compared with the current regression metrics.

coefficient_of_determination#
PCAfold.reconstruction.coefficient_of_determination(observed, predicted)#

Computes the coefficient of determination, \(R^2\), value:

\[R^2 = 1 - \frac{\sum_{i=1}^N (\phi_{o,i} - \phi_{p,i})^2}{\sum_{i=1}^N (\phi_{o,i} - \mathrm{mean}(\phi_o))^2}\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the coefficient of determination for the first variable:
r2 = coefficient_of_determination(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • r2 - coefficient of determination, \(R^2\).

stratified_coefficient_of_determination#
PCAfold.reconstruction.stratified_coefficient_of_determination(observed, predicted, idx, use_global_mean=True, verbose=False)#

Computes the stratified coefficient of determination, \(R^2\), values. Stratified \(R^2\) is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

\(R_j^2\) in the \(j^{th}\) bin can be computed in two ways:

  • If use_global_mean=True, the mean of the entire observed variable is used as a reference:

\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o))^2}\]
  • If use_global_mean=False, the mean of the considered \(j^{th}\) bin is used as a reference:

\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o^{j}))^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin and \(\phi_p\) is the predicted dependent variable.

Note

After running this function you can call analysis.plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders) on the function outputs and it will visualize how stratified \(R^2\) changes across bins.

Warning

The stratified \(R^2\) metric can be misleading if there are large variations in point density of an observed variable. For instance, below is a data set composed of lines of points that have uniform spacing on the \(x\) axis but become sparser in the direction of increasing \(\phi\) due to an increasing gradient of \(\phi\). If the bins are narrow enough (i.e., the number of bins is high enough), a single bin (like the bin bounded by the red dashed lines) can contain only one of those lines of points at a high value of \(\phi\). \(R^2\) will then be computed for constant, or almost constant, observations, even though globally those observations lie in a region of a large gradient of the observed variable!

_images/stratified-r2.png

Example:

from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=True, verbose=True)

# Plot the stratified R2 values:
plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • use_global_mean – (optional) bool specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.

  • verbose – (optional) bool for printing sizes (number of observations) and \(R^2\) values in each bin.

Returns

  • r2_in_bins - list specifying the coefficients of determination \(R^2\) in each bin. It has length k.

mean_absolute_error#
PCAfold.reconstruction.mean_absolute_error(observed, predicted)#

Computes the mean absolute error (MAE):

\[\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^N | \phi_{o,i} - \phi_{p,i} |\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean absolute error for the first variable:
mae = mean_absolute_error(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • mae - mean absolute error (MAE).

stratified_mean_absolute_error#
PCAfold.reconstruction.stratified_mean_absolute_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean absolute error (MAE) values. Stratified MAE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MAE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MAE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} | \phi_{o,i}^j - \phi_{p,i}^j |\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MAE in 10 bins of the first variable in a data set:
mae_in_bins = stratified_mean_absolute_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MAE values in each bin.

Returns

  • mae_in_bins - list specifying the mean absolute error (MAE) in each bin. It has length k.

max_absolute_error#
PCAfold.reconstruction.max_absolute_error(observed, predicted)#

Computes the maximum absolute error (MaxAE):

\[\mathrm{MaxAE} = \mathrm{max}( | \phi_{o,i} - \phi_{p,i} | )\]

where \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, max_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the maximum absolute error for the first variable:
maxae = max_absolute_error(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • maxae - maximum absolute error (MaxAE).

mean_squared_error#
PCAfold.reconstruction.mean_squared_error(observed, predicted)#

Computes the mean squared error (MSE):

\[\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared error for the first variable:
mse = mean_squared_error(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • mse - mean squared error (MSE).

stratified_mean_squared_error#
PCAfold.reconstruction.stratified_mean_squared_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean squared error (MSE) values. Stratified MSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MSE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSE in 10 bins of the first variable in a data set:
mse_in_bins = stratified_mean_squared_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MSE values in each bin.

Returns

  • mse_in_bins - list specifying the mean squared error (MSE) in each bin. It has length k.

mean_squared_logarithmic_error#
PCAfold.reconstruction.mean_squared_logarithmic_error(observed, predicted)#

Computes the mean squared logarithmic error (MSLE):

\[\mathrm{MSLE} = \frac{1}{N} \sum_{i=1}^N (\log(\phi_{o,i} + 1) - \log(\phi_{p,i} + 1)) ^2\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Warning

The MSLE metric can only be used on non-negative samples.

Example:

from PCAfold import PCA, mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared logarithmic error for the first variable:
msle = mean_squared_logarithmic_error(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • msle - mean squared logarithmic error (MSLE).

stratified_mean_squared_logarithmic_error#
PCAfold.reconstruction.stratified_mean_squared_logarithmic_error(observed, predicted, idx, verbose=False)#

Computes the stratified mean squared logarithmic error (MSLE) values. Stratified MSLE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

MSLE in the \(j^{th}\) bin can be computed as:

\[\mathrm{MSLE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\log(\phi_{o,i}^j + 1) - \log(\phi_{p,i}^j + 1)) ^2\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Warning

The MSLE metric can only be used on non-negative samples.

Example:

from PCAfold import PCA, variable_bins, stratified_mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSLE in 10 bins of the first variable in a data set:
msle_in_bins = stratified_mean_squared_logarithmic_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and MSLE values in each bin.

Returns

  • msle_in_bins - list specifying the mean squared logarithmic error (MSLE) in each bin. It has length k.

root_mean_squared_error#
PCAfold.reconstruction.root_mean_squared_error(observed, predicted)#

Computes the root mean squared error (RMSE):

\[\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]

where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the root mean squared error for the first variable:
rmse = root_mean_squared_error(X[:,0], X_rec[:,0])
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

Returns

  • rmse - root mean squared error (RMSE).

stratified_root_mean_squared_error#
PCAfold.reconstruction.stratified_root_mean_squared_error(observed, predicted, idx, verbose=False)#

Computes the stratified root mean squared error (RMSE) values. Stratified RMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

RMSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{RMSE}_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified RMSE in 10 bins of the first variable in a data set:
rmse_in_bins = stratified_root_mean_squared_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • verbose – (optional) bool for printing sizes (number of observations) and RMSE values in each bin.

Returns

  • rmse_in_bins - list specifying the root mean squared error (RMSE) in each bin. It has length k.

normalized_root_mean_squared_error#
PCAfold.reconstruction.normalized_root_mean_squared_error(observed, predicted, norm='std')#

Computes the normalized root mean squared error (NRMSE):

\[\mathrm{NRMSE} = \frac{1}{d_{norm}} \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]

where \(d_{norm}\) is the normalization factor, \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Various normalizations are available:

  • Root square mean ('root_square_mean'): \(d_{norm} = \sqrt{\mathrm{mean}(\phi_o^2)}\)

  • Std ('std'): \(d_{norm} = \mathrm{std}(\phi_o)\)

  • Range ('range'): \(d_{norm} = \mathrm{max}(\phi_o) - \mathrm{min}(\phi_o)\)

  • Root square range ('root_square_range'): \(d_{norm} = \sqrt{\mathrm{max}(\phi_o^2) - \mathrm{min}(\phi_o^2)}\)

  • Root square std ('root_square_std'): \(d_{norm} = \sqrt{\mathrm{std}(\phi_o^2)}\)

  • Absolute mean ('abs_mean'): \(d_{norm} = | \mathrm{mean}(\phi_o) |\)
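
For reference, a minimal standalone numpy sketch of these normalization factors (not a PCAfold function) could be:

import numpy as np

def normalization_factor(observed, norm='std'):

    # Normalization factors d_norm as listed above:
    factors = {'root_square_mean': np.sqrt(np.mean(observed**2)),
               'std': np.std(observed),
               'range': np.max(observed) - np.min(observed),
               'root_square_range': np.sqrt(np.max(observed**2) - np.min(observed**2)),
               'root_square_std': np.sqrt(np.std(observed**2)),
               'abs_mean': np.abs(np.mean(observed))}

    return factors[norm]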

Example:

from PCAfold import PCA, normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the normalized root mean squared error for the first variable:
nrmse = normalized_root_mean_squared_error(X[:,0], X_rec[:,0], norm='std')
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • normstr specifying the normalization, \(d_{norm}\). It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

Returns

  • nrmse - normalized root mean squared error (NRMSE).

stratified_normalized_root_mean_squared_error#
PCAfold.reconstruction.stratified_normalized_root_mean_squared_error(observed, predicted, idx, norm='std', use_global_norm=False, verbose=False)#

Computes the stratified normalized root mean squared error (NRMSE) values. Stratified NRMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).

NRMSE in the \(j^{th}\) bin can be computed as:

\[\mathrm{NRMSE}_j = \frac{1}{d_{norm}} \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]

where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified NRMSE in 10 bins of the first variable in a data set:
nrmse_in_bins = stratified_normalized_root_mean_squared_error(X[:,0],
                                                              X_rec[:,0],
                                                              idx=idx,
                                                              norm='std',
                                                              use_global_norm=True,
                                                              verbose=True)
Parameters
  • observednumpy.ndarray specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size (n_observations,) or (n_observations, 1).

  • idxnumpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).

  • normstr specifying the normalization, \(d_{norm}\). It can be one of the following: std, range, root_square_mean, root_square_range, root_square_std, abs_mean.

  • use_global_norm – (optional) bool specifying if global norm of the observed variable should be used in NRMSE calculation. If set to False, norms are computed on samples from the corresponding bin.

  • verbose – (optional) bool for printing sizes (number of observations) and NRMSE values in each bin.

Returns

  • nrmse_in_bins - list specifying the normalized root mean squared error (NRMSE) in each bin. It has length k.

turning_points#
PCAfold.reconstruction.turning_points(observed, predicted)#

Computes the turning points percentage - the percentage of predicted outputs that have the opposite growth tendency to the corresponding observed growth tendency.

Warning

This function is under construction.

Returns

  • turning_points - turning points percentage in %.
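
Since this function is under construction, the snippet below is only a conceptual numpy sketch of the idea (comparing signs of consecutive differences), not the PCAfold implementation:

import numpy as np

# Dummy observed and predicted values:
observed = np.array([0.0, 0.2, 0.5, 0.4, 0.6])
predicted = np.array([0.1, 0.3, 0.2, 0.5, 0.7])

# Percentage of consecutive differences with opposite growth tendency:
opposite = np.sign(np.diff(observed)) != np.sign(np.diff(predicted))
turning_points_percentage = 100.0 * np.sum(opposite) / opposite.size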

good_estimate#
PCAfold.reconstruction.good_estimate(observed, predicted, tolerance=0.05)#

Computes the good estimate (GE) - the percentage of predicted values that are within the specified tolerance from the corresponding observed values.

Warning

This function is under construction.

Parameters
  • observed – numpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • tolerance – (optional, default 0.05) float specifying the tolerance.

Returns

  • good_estimate - good estimate (GE) in %.
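
Since this function is also under construction, the snippet below is only a conceptual numpy sketch, assuming the tolerance is interpreted as a relative deviation from the observed values (the actual definition may differ):

import numpy as np

# Dummy observed and predicted values:
observed = np.array([1.0, 2.0, 4.0, 8.0])
predicted = np.array([1.02, 2.2, 3.9, 8.1])

# Percentage of predictions within a 5% relative deviation of the observations:
within_tolerance = np.abs(observed - predicted) <= 0.05 * np.abs(observed)
good_estimate_percentage = 100.0 * np.sum(within_tolerance) / observed.size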

good_direction_estimate#
PCAfold.reconstruction.good_direction_estimate(observed, predicted, tolerance=0.05)#

Computes the good direction (GD) and the good direction estimate (GDE).

GD for observation \(i\), is computed as:

\[GD_i = \frac{\vec{\phi}_{o,i}}{|| \vec{\phi}_{o,i} ||} \cdot \frac{\vec{\phi}_{p,i}}{|| \vec{\phi}_{p,i} ||}\]

where \(\vec{\phi}_o\) is the observed vector quantity and \(\vec{\phi}_p\) is the predicted vector quantity.

GDE is computed as the percentage of predicted vector observations whose direction is within the specified tolerance from the direction of the corresponding observed vector.

Example:

from PCAfold import PCA, good_direction_estimate
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the vector of good direction and the good direction estimate:
(good_direction, gde) = good_direction_estimate(X, X_rec, tolerance=0.01)
Parameters
  • observednumpy.ndarray specifying the observed vector quantity, \(\vec{\phi}_o\). It should be of size (n_observations,n_dimensions).

  • predictednumpy.ndarray specifying the predicted vector quantity, \(\vec{\phi}_p\). It should be of size (n_observations,n_dimensions).

  • tolerancefloat specifying the tolerance.

Returns

  • good_direction - numpy.ndarray specifying the vector of good direction (GD). It has size (n_observations,).

  • good_direction_estimate - good direction estimate (GDE) in %.

generate_tex_table#
PCAfold.reconstruction.generate_tex_table(data_frame_table, float_format='.2f', caption='', label='')#

Generates tex code for a table stored in a pandas.DataFrame. This function can be useful e.g. for printing regression results.

Example:

from PCAfold import PCA, generate_tex_table
import numpy as np
import pandas as pd

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy variables names:
variable_names = ['A1', 'A2', 'A3', 'A4', 'A5']

# Instantiate PCA class object:
pca_q2 = PCA(X, scaling='auto', n_components=2, use_eigendec=True, nocenter=False)
pca_q3 = PCA(X, scaling='auto', n_components=3, use_eigendec=True, nocenter=False)

# Calculate the R2 values:
r2_q2 = pca_q2.calculate_r2(X)[None,:]
r2_q3 = pca_q3.calculate_r2(X)[None,:]

# Generate pandas.DataFrame from the R2 values:
r2_table = pd.DataFrame(np.vstack((r2_q2, r2_q3)), columns=variable_names, index=['PCA, $q=2$', 'PCA, $q=3$'])

# Generate tex code for the table:
generate_tex_table(r2_table, float_format=".3f", caption='$R^2$ values.', label='r2-values')

Note

The code above will produce tex code:

\begin{table}[h!]
\begin{center}
\begin{tabular}{llllll} \toprule
 & \textit{A1} & \textit{A2} & \textit{A3} & \textit{A4} & \textit{A5} \\ \midrule
PCA, $q=2$ & 0.507 & 0.461 & 0.485 & 0.437 & 0.611 \\
PCA, $q=3$ & 0.618 & 0.658 & 0.916 & 0.439 & 0.778 \\
\end{tabular}
\caption{$R^2$ values.}\label{r2-values}
\end{center}
\end{table}

Which, when compiled, will result in a table:

_images/generate-tex-table.png
Parameters
  • data_frame_tablepandas.DataFrame specifying the table to convert to tex code. It can include column names and index names.

  • float_formatstr specifying the display format for the numerical entries inside the table. By default it is set to '.2f'.

  • captionstr specifying caption for the table.

  • labelstr specifying label for the table.


Plotting functions#

plot_2d_regression#
PCAfold.reconstruction.plot_2d_regression(x, observed, predicted, x_label=None, y_label=None, color_observed=None, color_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots the result of regression of a dependent variable on top of a one-dimensional manifold defined by a single independent variable x.

Example:

from PCAfold import PCA, plot_2d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_2d_regression(X[:,0],
                         X[:,0],
                         X_rec[:,0],
                         x_label='$x$',
                         y_label='$y$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='2D regression',
                         save_filename='2d-regression.pdf')
plt.close()
Parameters
  • xnumpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • observednumpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predictednumpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • color_observed – (optional) str specifying the color of the plotted observed variable.

  • color_predicted – (optional) str specifying the color of the plotted predicted variable.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_2d_regression_scalar_field#
PCAfold.reconstruction.plot_2d_regression_scalar_field(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), x_label=None, y_label=None, s_field=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a 2D field of a regressed scalar dependent variable. A two-dimensional manifold can be additionally plotted on top of the field.

Example:

from PCAfold import PCA, KReg, plot_2d_regression_scalar_field
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,2)
Z = np.random.rand(100,1)

# Train the kernel regression model:
model = KReg(X, Z)

# Define the regression model:
def regression_model(query):

    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)[:,0]

    return predicted

# Define the bounds for the scalar field:
grid_bounds = ([np.min(X[:,0]),np.max(X[:,0])],[np.min(X[:,1]),np.max(X[:,1])])

# Plot the regressed scalar field:
plt = plot_2d_regression_scalar_field(grid_bounds,
                                    regression_model,
                                    x=X[:,0],
                                    y=X[:,1],
                                    resolution=(100,100),
                                    extension=(10,10),
                                    x_label='$X_1$',
                                    y_label='$X_2$',
                                    s_field=4,
                                    s_manifold=60,
                                    manifold_color=Z,
                                    colorbar_label='$Z_1$',
                                    color_map='inferno',
                                    colorbar_range=(0,1),
                                    manifold_alpha=1,
                                    grid_on=False,
                                    figure_size=(10,6),
                                    title='2D regressed scalar field',
                                    save_filename='2D-regressed-scalar-field.pdf')
plt.close()
Parameters
  • grid_bounds – tuple of list specifying the bounds of the independent variables on the \(x\) and \(y\) axis.

  • regression_model – function that outputs the predicted scalar using the regression model. It should take as input a numpy.ndarray of size (1,2), where the two elements specify the first and second independent variable values. It should output a float specifying the regressed scalar value at that input.

  • x – (optional) numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • y – (optional) numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • resolution – (optional) tuple of int specifying the resolution of the streamplot grid on the \(x\) and \(y\) axis.

  • extension – (optional) tuple of float or int specifying the percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond the bounds specified by the grid_bounds parameter.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • s_field – (optional) int or float specifying the scatter point size for the scalar field.

  • s_manifold – (optional) int or float specifying the scatter point size for the manifold.

  • manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • colorbar_label – (optional) str specifying colorbar label annotation.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • manifold_alpha – (optional) float or int specifying the opacity of the plotted manifold.

  • grid_onbool specifying whether grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_2d_regression_streamplot#
PCAfold.reconstruction.plot_2d_regression_streamplot(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), color='k', x_label=None, y_label=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#

Plots a streamplot of a regressed vector field of a dependent variable. A two-dimensional manifold can be additionally plotted on top of the streamplot.

Example:

from PCAfold import PCA, KReg, plot_2d_regression_streamplot
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)
S_X = np.random.rand(100,5)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X, n_components=2)
PCs = pca_X.transform(X)
S_Z = pca_X.transform(S_X, nocenter=True)

# Train the kernel regression model:
model = KReg(PCs, S_Z)

# Define the regression model:
def regression_model(query):

    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)

    return predicted

# Define the bounds for the streamplot:
grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])

# Plot the regression streamplot:
plt = plot_2d_regression_streamplot(grid_bounds,
                                    regression_model,
                                    x=PCs[:,0],
                                    y=PCs[:,1],
                                    resolution=(15,15),
                                    extension=(20,20),
                                    color='r',
                                    x_label='$Z_1$',
                                    y_label='$Z_2$',
                                    manifold_color=X[:,0],
                                    colorbar_label='$X_1$',
                                    color_map='plasma',
                                    colorbar_range=(0,1),
                                    manifold_alpha=1,
                                    grid_on=False,
                                    figure_size=(10,6),
                                    title='Streamplot',
                                    save_filename='streamplot.pdf')
plt.close()
Parameters
  • grid_bounds – tuple of list specifying the bounds of the independent variables on the \(x\) and \(y\) axis.

  • regression_modelfunction that outputs the predicted vector using the regression model. It should take as input a numpy.ndarray of size (1,2), where the two elements specify the first and second independent variable values. It should output a numpy.ndarray of size (1,2), where the two elements specify the first and second regressed vector elements.

  • x – (optional) numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • y – (optional) numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1). It can be used to plot a 2D manifold on top of the streamplot.

  • resolution – (optional) tuple of int specifying the resolution of the streamplot grid on the \(x\) and \(y\) axis.

  • extension – (optional) tuple of float or int specifying the percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond what has been specified by the grid_bounds parameter.

  • color – (optional) str specifying the streamlines color.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • s_manifold – (optional) int or float specifying the scatter point size for the manifold.

  • manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in x and y vectors. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, manifold will be plotted in black.

  • colorbar_label – (optional) str specifying colorbar label annotation.

  • color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.

  • colorbar_range – (optional) tuple specifying the lower and the upper bound for the colorbar range.

  • manifold_alpha – (optional) float or int specifying the opacity of the plotted manifold.

  • grid_on – bool specifying whether the grid should be plotted.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_3d_regression#
PCAfold.reconstruction.plot_3d_regression(x, y, observed, predicted, elev=45, azim=-45, clean=False, x_label=None, y_label=None, z_label=None, color_observed=None, color_predicted=None, s_observed=None, s_predicted=None, alpha_observed=None, alpha_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#

Plots the result of regression of a dependent variable on top of a two-dimensional manifold defined by two independent variables x and y.

Example:

from PCAfold import PCA, plot_3d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain three-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_3d_regression(X[:,0],
                         X[:,1],
                         X[:,0],
                         X_rec[:,0],
                         elev=45,
                         azim=-45,
                         x_label='$x$',
                         y_label='$y$',
                         z_label='$z$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='3D regression',
                         save_filename='3d-regression.pdf')
plt.close()
Parameters
  • x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).

  • observed – numpy.ndarray specifying the observed values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • predicted – numpy.ndarray specifying the predicted values of a single dependent variable. It should be of size (n_observations,) or (n_observations, 1).

  • elev – (optional) float or int specifying the elevation angle.

  • azim – (optional) float or int specifying the azimuth angle.

  • clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points and the 3D axes is plotted.

  • x_label – (optional) str specifying \(x\)-axis label annotation. If set to None label will not be plotted.

  • y_label – (optional) str specifying \(y\)-axis label annotation. If set to None label will not be plotted.

  • z_label – (optional) str specifying \(z\)-axis label annotation. If set to None label will not be plotted.

  • color_observed – (optional) str specifying the color of the plotted observed variable.

  • color_predicted – (optional) str specifying the color of the plotted predicted variable.

  • s_observed – (optional) int or float specifying the scatter point size for the observed variable.

  • s_predicted – (optional) int or float specifying the scatter point size for the predicted variable.

  • alpha_observed – (optional) int or float specifying the point opacity for the observed variable.

  • alpha_predicted – (optional) int or float specifying the point opacity for the predicted variable.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.

plot_stratified_metric#
PCAfold.reconstruction.plot_stratified_metric(metric_in_bins, bins_borders, variable_name=None, metric_name=None, yscale='linear', ylim=None, figure_size=(10, 5), title=None, save_filename=None)#

This function plots a stratified metric across bins of a dependent variable.

Example:

from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_metric
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=True, verbose=True)

# Visualize how R2 changes across bins:
plt = plot_stratified_metric(r2_in_bins,
                              bins_borders,
                              variable_name='$X_1$',
                              metric_name='$R^2$',
                              yscale='log',
                              figure_size=(10,5),
                              title='Stratified $R^2$',
                              save_filename='r2.pdf')
plt.close()
Parameters
  • metric_in_bins – list of metric values in each bin.

  • bins_borders – list of bin borders that were created to stratify the dependent variable.

  • variable_name – (optional) str specifying the name of the variable for which the metric was computed. If set to None label on the x-axis will not be plotted.

  • metric_name – (optional) str specifying the name of the metric to be plotted on the y-axis. If set to None label on the y-axis will not be plotted.

  • yscale – (optional) str specifying the scale for the y-axis.

  • figure_size – (optional) tuple specifying figure size.

  • title – (optional) str specifying plot title. If set to None title will not be plotted.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.



Utilities#

Tools for optimizing manifold topology#

Class QoIAwareProjection#
class PCAfold.utilities.QoIAwareProjection(input_data, n_components, projection_independent_outputs=None, projection_dependent_outputs=None, activation_decoder='tanh', decoder_interior_architecture=(), encoder_weights_init=None, decoder_weights_init=None, hold_initialization=None, hold_weights=None, transformed_projection_dependent_outputs=None, transform_power=0.5, transform_shift=0.0001, transform_sign_shift=0.0, loss='MSE', optimizer='Adam', batch_size=200, n_epochs=1000, learning_rate=0.001, validation_perc=10, random_seed=None, verbose=False)#

Enables computing QoI-aware encoder-decoder projections.

The QoI-aware encoder-decoder is an autoencoder-like neural network that reconstructs important quantities of interest (QoIs) at the output of a decoder. The QoIs can be set to projection-independent variables (such as the original state variables) or projection-dependent variables, whose definition changes during neural network training.

We introduce an intrusive modification to the neural network training process such that at each epoch, a low-dimensional basis matrix is computed from the current weights in the encoder. Any projection-dependent variables at the output get re-projected onto that basis.

The rationale for performing dimensionality reduction with the QoI-aware strategy is that any poor topological behaviors on a low-dimensional projection will immediately increase the loss during training. These behaviors could be non-uniqueness in representing QoIs due to overlaps on a projection, or large gradients in QoIs caused by data compression in certain regions of a projection. Thus, the QoI-aware strategy naturally promotes improved projection topologies and can be useful in reduced-order modeling.

An illustrative explanation of how the QoI-aware encoder-decoder works is presented in the figure below:

_images/tutorial-qoi-aware-encoder-decoder.png

More information can be found in [UZPS23].

Example:

from PCAfold import center_scale, QoIAwareProjection
import numpy as np

# Generate dummy dataset:
X = np.random.rand(100,8)
S = np.random.rand(100,8)

# Request 2D QoI-aware encoder-decoder projection of the dataset:
n_components = 2

# Preprocess the dataset before passing it to the encoder-decoder:
(input_data, centers, scales) = center_scale(X, scaling='0to1')
projection_dependent_outputs = S / scales

# Instantiate QoIAwareProjection class object:
qoi_aware = QoIAwareProjection(input_data,
                               n_components,
                               projection_independent_outputs=input_data[:,0:3],
                               projection_dependent_outputs=projection_dependent_outputs,
                               activation_decoder=('tanh', 'tanh', 'linear'),
                               decoder_interior_architecture=(5,8),
                               encoder_weights_init=None,
                               decoder_weights_init=None,
                               hold_initialization=10,
                               hold_weights=2,
                               transformed_projection_dependent_outputs='signed-square-root',
                               loss='MSE',
                               optimizer='Adam',
                               batch_size=100,
                               n_epochs=200,
                               learning_rate=0.001,
                               validation_perc=10,
                               random_seed=100,
                               verbose=True)

# Begin model training:
qoi_aware.train()

A summary of the current QoI-aware encoder-decoder model and its hyperparameter settings can be printed using the summary() function:

# Print the QoI-aware encoder-decoder model summary
qoi_aware.summary()
QoI-aware encoder-decoder model summary...

(Model has been trained)


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Projection dimensionality:

        - 2D projection

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Encoder-decoder architecture:

        8-2-5-8-7

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Activation functions:

        (8)--linear--(2)--tanh--(5)--tanh--(8)--linear--(7)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variables at the decoder output:

        - 3 projection independent variables
        - 2 projection dependent variables
        - 2 transformed projection dependent variables using signed-square-root

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model validation:

        - Using 10% of input data as validation data
        - Model will be trained on 90% of input data

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hyperparameters:

        - Batch size:           100
        - # of epochs:          200
        - Optimizer:            Adam
        - Learning rate:        0.001
        - Loss function:        MSE

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the encoder:

        - Glorot uniform

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the decoder:

        - Glorot uniform

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights updates in the encoder:

        - Initial weights in the encoder will be kept for 10 first epochs
        - Weights in the encoder will change once every 2 epochs

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Results reproducibility:

        - Reproducible neural network training will be assured using random seed: 100

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Training results:

        - Minimum training loss:                0.0852246955037117
        - Minimum training loss at epoch:       199

        - Minimum validation loss:              0.06681100279092789
        - Minimum validation loss at epoch:     182

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Parameters
  • input_data – numpy.ndarray specifying the data set used as the input to the encoder-decoder. It should be of size (n_observations,n_variables).

  • n_components – int specifying the dimensionality of the QoI-aware encoder-decoder projection. This is equal to the number of neurons in the bottleneck layer.

  • projection_independent_outputs – (optional) numpy.ndarray specifying any projection-independent outputs at the decoder. It should be of size (n_observations,n_projection_independent_outputs).

  • projection_dependent_outputs – (optional) numpy.ndarray specifying any projection-dependent outputs at the decoder. During training, projection_dependent_outputs is projected onto the current basis matrix and the decoder outputs are updated accordingly. It should be of size (n_observations,n_projection_dependent_outputs).

  • activation_decoder – (optional) str or tuple specifying activation functions in all the decoding layers. If set to str, the same activation function is used in all decoding layers. If set to a tuple of str, a different activation function can be set at different decoding layers. The number of elements in the tuple should match the number of decoding layers. The str, or each str element of the tuple, can only be 'linear', 'sigmoid', or 'tanh'. Note that the activation function in the encoder is hardcoded to 'linear'.

  • decoder_interior_architecture – (optional) tuple of int specifying the number of neurons in the interior architecture of a decoder. For example, if decoder_interior_architecture=(4,5), two interior decoding layers will be created and the overall network architecture will be (Input)-(Bottleneck)-(4)-(5)-(Output). If set to an empty tuple, decoder_interior_architecture=(), the overall network architecture will be (Input)-(Bottleneck)-(Output). Keep in mind that if you’d like to create just one interior layer, you should use a comma after the integer: decoder_interior_architecture=(4,).

  • encoder_weights_init – (optional) numpy.ndarray specifying the custom initialization of the weights in the encoder. It should be of size (n_variables, n_components). If set to None, weights in the encoder will be initialized using the Glorot uniform distribution.

  • decoder_weights_init – (optional) tuple of numpy.ndarray specifying the custom initialization of the weights in the decoder. Each element in the tuple should have a shape that matches the architecture. If set to None, weights in the decoder will be initialized using the Glorot uniform distribution.

  • hold_initialization – (optional) int specifying the number of first epochs during which the initial weights in the encoder are held constant. If set to None, weights in the encoder will change at the first epoch. This parameter can be used in conjunction with hold_weights.

  • hold_weights – (optional) int specifying how frequently the weights should be changed in the encoder. For example, if set to hold_weights=2, the weights in the encoder will only be updated once every two epochs throughout the whole training process. If set to None, weights in the encoder will change at every epoch. This parameter can be used in conjunction with hold_initialization.

  • transformed_projection_dependent_outputs – (optional) str specifying if any nonlinear transformation of the projection-dependent outputs should be added at the decoder output. It can be 'symlog' or 'signed-square-root'.

  • transform_power – (optional) int or float as per preprocess.power_transform().

  • transform_shift – (optional) int or float as per preprocess.power_transform().

  • transform_sign_shift – (optional) int or float as per preprocess.power_transform().

  • loss – (optional) str specifying the loss function. It can be 'MAE' or 'MSE'.

  • optimizer – (optional) str specifying the optimizer used during training. It can be 'Adam' or 'Nadam'.

  • batch_size – (optional) int specifying the batch size.

  • n_epochs – (optional) int specifying the number of epochs.

  • learning_rate – (optional) float specifying the learning rate passed to the optimizer.

  • validation_perc – (optional) int specifying the percentage of the input data to be used as validation data during training. It should be a number larger than or equal to 0 and smaller than 100. Note that if it is set above 0, not all of the input data will be used as training data, and that validation data does not impact model training.

  • random_seed – (optional) int specifying the random seed to be used for any random operations. It is highly recommended to set a fixed random seed, as this allows for complete reproducibility of the results.

  • verbose – (optional) bool for printing verbose details.

Attributes:

  • input_data - (read only) numpy.ndarray specifying the data set used as the input to the encoder-decoder.

  • n_components - (read only) int specifying the dimensionality of the QoI-aware encoder-decoder projection.

  • projection_independent_outputs - (read only) numpy.ndarray specifying any projection-independent outputs at the decoder.

  • projection_dependent_outputs - (read only) numpy.ndarray specifying any projection-dependent outputs at the decoder.

  • architecture - (read only) str specifying the QoI-aware encoder-decoder architecture.

  • n_total_outputs - (read only) int counting the total number of outputs at the decoder.

  • qoi_aware_encoder_decoder - (read only) object of Keras.models.Sequential class that stores the QoI-aware encoder-decoder neural network.

  • weights_and_biases_init - (read only) list of numpy.ndarray specifying weights and biases with which the QoI-aware encoder-decoder was initialized.

  • weights_and_biases_trained - (read only) list of numpy.ndarray specifying weights and biases after training the QoI-aware encoder-decoder. Only available after calling QoIAwareProjection.train().

  • training_loss - (read only) list of losses computed on the training data. Only available after calling QoIAwareProjection.train().

  • validation_loss - (read only) list of losses computed on the validation data. Only available after calling QoIAwareProjection.train() and only when validation_perc was not equal to 0.

  • bases_across_epochs - (read only) list of numpy.ndarray specifying all basis matrices from all epochs. Only available after calling QoIAwareProjection.train().

QoIAwareProjection.summary#
PCAfold.utilities.QoIAwareProjection.summary(self)#

Prints the QoI-aware encoder-decoder model summary.

QoIAwareProjection.train#
PCAfold.utilities.QoIAwareProjection.train(self)#

Trains the QoI-aware encoder-decoder neural network model.

After training, the optimized basis matrix for low-dimensional data projection can be obtained.

QoIAwareProjection.print_weights_and_biases_init#
PCAfold.utilities.QoIAwareProjection.print_weights_and_biases_init(self)#

Prints initial weights and biases from all layers of the QoI-aware encoder-decoder.

QoIAwareProjection.print_weights_and_biases_trained#
PCAfold.utilities.QoIAwareProjection.print_weights_and_biases_trained(self)#

Prints trained weights and biases from all layers of the QoI-aware encoder-decoder.

QoIAwareProjection.get_best_basis#
PCAfold.utilities.QoIAwareProjection.get_best_basis(self, method='min-training-loss')#

Returns the best low-dimensional basis according to the selected method.

Parameters

method – (optional) str specifying the method used to select the best basis. It should be 'min-training-loss', 'min-validation-loss', or 'last-epoch'.

Returns

  • best_basis - numpy.ndarray specifying the best basis extracted from the bases_across_epochs attribute.
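
For illustration, a minimal usage sketch is shown below. It assumes the qoi_aware object and the preprocessed input_data from the example above, and it applies the returned basis with a plain matrix product (the exact projection convention used here is an assumption made for illustration):

# Retrieve the best basis identified during training:
best_basis = qoi_aware.get_best_basis(method='min-validation-loss')

# Project the preprocessed input data onto the optimized basis (illustrative):
optimized_projection = np.dot(input_data, best_basis)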

QoIAwareProjection.plot_losses#
PCAfold.utilities.QoIAwareProjection.plot_losses(self, markevery=100, figure_size=(15, 5), save_filename=None)#

Plots training and validation losses.

Parameters
  • markevery – (optional) int specifying how frequently the epoch number on the x-axis should be labelled.

  • figure_size – (optional) tuple specifying figure size.

  • save_filename – (optional) str specifying plot save location/filename. If set to None plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.

Returns

  • plt - matplotlib.pyplot plot handle.
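
A minimal usage sketch for the trained qoi_aware object from the example above:

# Plot the training and validation loss history:
plt = qoi_aware.plot_losses(markevery=10,
                            figure_size=(15,5),
                            save_filename='losses.pdf')
plt.close()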

manifold_informed_forward_variable_addition#
PCAfold.utilities.manifold_informed_forward_variable_addition(X, X_source, variable_names, scaling, bandwidth_values, target_variables=None, add_transformed_source=True, target_manifold_dimensionality=3, bootstrap_variables=None, penalty_function=None, power=1, vertical_shift=1, norm='max', integrate_to_peak=False, verbose=False)#

Manifold-informed feature selection algorithm based on forward variable addition introduced in [UZSP22]. The goal of the algorithm is to select a meaningful subset of the original variables such that undesired behaviors on a PCA-derived manifold of a given dimensionality are minimized. The algorithm uses the cost function, \(\mathcal{L}\), based on minimizing the area under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables (as per cost_function_normalized_variance_derivative function). The algorithm can be bootstrapped in two ways:

  • Automatic bootstrap when bootstrap_variables=None: the first best variable is selected automatically as the one that gives the lowest cost.

  • User-defined bootstrap when bootstrap_variables is set to a user-defined list of the bootstrap variables.

At each iteration, the algorithm adds the variable that yields the lowest cost. In this way, the original variables in a data set get ordered according to their effect on the manifold topology. Assuming that the original data set is composed of \(Q\) variables, the first output is a list of indices of the ordered original variables, \(\mathbf{X} = [X_1, X_2, \dots, X_Q]\). The second output is a list of indices of the selected subset of the original variables, \(\mathbf{X}_S = [X_1, X_2, \dots, X_n]\), that correspond to the minimum cost, \(\mathcal{L}\).

More information can be found in [UZSP22].

Note

The algorithm can be very expensive (for large data sets) due to multiple computations of the normalized variance derivative. Try running it on multiple cores or on a sampled data set.

If the algorithm fails because it cannot determine the peak location, try increasing the range of values in the bandwidth_values parameter.

Example:

from PCAfold import manifold_informed_forward_variable_addition as FVA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)
X_source = np.random.rand(100,10)

# Define the target variables for the optimization:
target_variables = X[:,0:3]

# Specify the variable names:
variable_names = ['X_' + str(i) for i in range(0,10)]

# Specify the bandwidth values to compute the optimization on:
bandwidth_values = np.logspace(-4, 2, 50)

# Run the subset selection algorithm:
(ordered, selected, min_cost, costs) = FVA(X,
                                           X_source,
                                           variable_names,
                                           scaling='auto',
                                           bandwidth_values=bandwidth_values,
                                           target_variables=target_variables,
                                           add_transformed_source=True,
                                           target_manifold_dimensionality=2,
                                           bootstrap_variables=None,
                                           penalty_function='peak',
                                           norm='max',
                                           integrate_to_peak=True,
                                           verbose=True)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • X_source – numpy.ndarray specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09]. It should be of size (n_observations,n_variables).

  • variable_names – list of str specifying variable names.

  • scaling – (optional) str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • bandwidth_values – numpy.ndarray specifying the bandwidth values, \(\sigma\), for \(\hat{\mathcal{D}}(\sigma)\) computation.

  • target_variables – (optional) numpy.ndarray specifying the dependent variables that should be used in \(\hat{\mathcal{D}}(\sigma)\) computation. It should be of size (n_observations,n_target_variables).

  • add_transformed_source – (optional) bool specifying if the PCA-transformed source terms of the state-space variables should be added in \(\hat{\mathcal{D}}(\sigma)\) computation, alongside the user-defined dependent variables.

  • target_manifold_dimensionality – (optional) int specifying the target dimensionality of the PCA manifold.

  • bootstrap_variables – (optional) list specifying the user-selected variables to bootstrap the algorithm with. If set to None, automatic bootstrapping is performed.

  • penalty_function – (optional) str specifying the weighting applied to each area. Set penalty_function='peak' to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Set penalty_function='sigma' to weight each area continuously by the bandwidth. Set penalty_function='log-sigma-over-peak' to weight each area continuously by the \(\log_{10}\)-transformed bandwidth, normalized by the rightmost peak location, \(\sigma_{peak, i}\). If penalty_function=None, the area is not weighted.

  • power – (optional) float or int specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.

  • vertical_shift – (optional) float or int specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.

  • norm – (optional) str specifying the norm to apply for all areas \(A_i\). norm='average' uses an arithmetic average, norm='max' uses the \(L_{\infty}\) norm, norm='median' uses a median area, norm='cumulative' uses a cumulative area and norm='min' uses a minimum area.

  • integrate_to_peak – (optional) bool specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.

  • verbose – (optional) bool for printing verbose details.

Returns

  • ordered_variables - list specifying the indices of the ordered variables.

  • selected_variables - list specifying the indices of the selected variables that correspond to the minimum cost \(\mathcal{L}\).

  • optimized_cost - float specifying the cost corresponding to the optimized subset.

  • costs - list specifying the costs, \(\mathcal{L}\), from each iteration.
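
As an illustration of how the returned indices can be used, a minimal sketch based on the example above extracts the selected variable names and the corresponding columns of the original data set:

# Map the selected indices back to variable names:
selected_names = [variable_names[i] for i in selected]

# Extract the selected subset of the original data set:
X_subset = X[:,selected]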

manifold_informed_backward_variable_elimination#
PCAfold.utilities.manifold_informed_backward_variable_elimination(X, X_source, variable_names, scaling, bandwidth_values, target_variables=None, add_transformed_source=True, source_space=None, target_manifold_dimensionality=3, penalty_function=None, power=1, vertical_shift=1, norm='max', integrate_to_peak=False, verbose=False)#

Manifold-informed feature selection algorithm based on backward variable elimination introduced in [UZSP22]. The goal of the algorithm is to select a meaningful subset of the original variables such that undesired behaviors on a PCA-derived manifold of a given dimensionality are minimized. The algorithm uses the cost function, \(\mathcal{L}\), based on minimizing the area under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables (as per cost_function_normalized_variance_derivative function).

At each iteration, the algorithm removes the variable whose removal decreases the cost the most. In this way, the original variables in a data set get ordered according to their effect on the manifold topology. Assuming that the original data set is composed of \(Q\) variables, the first output is a list of indices of the ordered original variables, \(\mathbf{X} = [X_1, X_2, \dots, X_Q]\). The second output is a list of indices of the selected subset of the original variables, \(\mathbf{X}_S = [X_1, X_2, \dots, X_n]\), that correspond to the minimum cost, \(\mathcal{L}\).

More information can be found in [UZSP22].

Note

The algorithm can be very expensive (for large data sets) due to multiple computations of the normalized variance derivative. Try running it on multiple cores or on a sampled data set.

If the algorithm fails because it cannot determine the peak location, try increasing the range of values in the bandwidth_values parameter.

Example:

from PCAfold import manifold_informed_backward_variable_elimination as BVE
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)
X_source = np.random.rand(100,10)

# Define the target variables for the optimization:
target_variables = X[:,0:3]

# Specify the variable names:
variable_names = ['X_' + str(i) for i in range(0,10)]

# Specify the bandwidth values to compute the optimization on:
bandwidth_values = np.logspace(-4, 2, 50)

# Run the subset selection algorithm:
(ordered, selected, min_cost, costs) = BVE(X,
                                           X_source,
                                           variable_names,
                                           scaling='auto',
                                           bandwidth_values=bandwidth_values,
                                           target_variables=target_variables,
                                           add_transformed_source=True,
                                           target_manifold_dimensionality=2,
                                           penalty_function='peak',
                                           norm='max',
                                           integrate_to_peak=True,
                                           verbose=True)
Parameters
  • X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

  • X_source – numpy.ndarray specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09]. It should be of size (n_observations,n_variables).

  • variable_names – list of str specifying variable names. The order of names in the variable_names list should match the order of variables (columns) in X.

  • scaling – (optional) str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.

  • bandwidth_values – numpy.ndarray specifying the bandwidth values, \(\sigma\), for \(\hat{\mathcal{D}}(\sigma)\) computation.

  • target_variables – (optional) numpy.ndarray specifying the dependent variables that should be used in \(\hat{\mathcal{D}}(\sigma)\) computation. It should be of size (n_observations,n_target_variables).

  • add_transformed_source – (optional) bool specifying if the PCA-transformed source terms of the state-space variables should be added in \(\hat{\mathcal{D}}(\sigma)\) computation, alongside the user-defined dependent variables.

  • source_space – (optional) str specifying the space to which the PC source terms should be transformed before computing the cost. It can be one of the following: 'symlog', 'continuous-symlog', 'original-and-symlog', 'original-and-continuous-symlog'. If set to None, PC source terms are kept in their original PCA-space.

  • target_manifold_dimensionality – (optional) int specifying the target dimensionality of the PCA manifold.

  • penalty_function – (optional) str specifying the weighting applied to each area. Set penalty_function='peak' to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Set penalty_function='sigma' to weight each area continuously by the bandwidth. Set penalty_function='log-sigma-over-peak' to weight each area continuously by the \(\log_{10}\)-transformed bandwidth, normalized by the rightmost peak location, \(\sigma_{peak, i}\). If penalty_function=None, the area is not weighted.

  • power – (optional) float or int specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.

  • vertical_shift – (optional) float or int specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.

  • norm – (optional) str specifying the norm to apply for all areas \(A_i\). norm='average' uses an arithmetic average, norm='max' uses the \(L_{\infty}\) norm, norm='median' uses a median area, norm='cumulative' uses a cumulative area and norm='min' uses a minimum area.

  • integrate_to_peak – (optional) bool specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.

  • verbose – (optional) bool for printing verbose details.

Returns

  • ordered_variables - list specifying the indices of the ordered variables.

  • selected_variables - list specifying the indices of the selected variables that correspond to the minimum cost \(\mathcal{L}\).

  • optimized_cost - float specifying the cost corresponding to the optimized subset.

  • costs - list specifying the costs, \(\mathcal{L}\), from each iteration.
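
Analogously to the forward variable addition example, a minimal sketch based on the example above inspects the ranking of the original variables and extracts the selected subset:

# Print the original variables in the order established by the algorithm:
ordered_names = [variable_names[i] for i in ordered]
print(ordered_names)

# Extract the selected subset of the original data set:
X_subset = X[:,selected]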

Class QoIAwareProjectionPOUnet#
class PCAfold.utilities.QoIAwareProjectionPOUnet(projection_weights, partition_centers, partition_shapes, basis_type, projection_biases=None, basis_coeffs=None, dtype='float64', **kwargs)#

This is analogous to QoIAwareProjection but uses PartitionOfUnityNetwork as the decoder.

Example:

from PCAfold import init_uniform_partitions, PCA, QoIAwareProjectionPOUnet
import numpy as np
import tensorflow as tf

# generate dummy data set:
ivars = np.random.rand(100,3)

# initialize a projection (e.g., using PCA)
pca = PCA(ivars, scaling='none', n_components=2)
ivar_proj = pca.transform(ivars)

# initialize the QoIAwareProjectionPOUnet parameters
net = QoIAwareProjectionPOUnet(pca.A[:,:2], **init_uniform_partitions([5,7], ivar_proj), basis_type='linear')

# function for defining the training dependent variables (can include a projection)
dvar = np.vstack((ivars[:,0] + ivars[:,1], 2.*ivars[:,0] + 3.*ivars[:,1], 3.*ivars[:,0] + 5.*ivars[:,1])).T
def dvar_func(proj_weights):
    temp = tf.Variable(np.expand_dims(dvar, axis=2), name='eval_qoi', dtype=net._reconstruction._dtype)
    temp = net.tf_projection(temp, nobias=True)
    return temp

# build the training graph with provided training data
net.build_training_graph(ivars, dvar_func)

# train the projection
net.train(1000)

# compute new projected variables
net.projection(ivars)

# evaluate the encoder-decoder
net(ivars)

# Save the data to a file
net.write_data_to_file('filename.pkl')

# reload projection data from file
net2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')
Parameters
  • projection_weights – array of the projection matrix weights

  • partition_centers – array size (number of partitions) x (number of ivar inputs) for partition locations

  • partition_shapes – array size (number of partitions) x (number of ivar inputs) for partition shapes influencing the RBF widths

  • basis_type – string ('constant', 'linear', or 'quadratic') for the degree of polynomial basis

  • projection_biases – (optional, default None) array of the biases (offsets) corresponding to the projection weights, if None the projections are offset by zeros

  • basis_coeffs – (optional, default None) if the array of polynomial basis coefficients is known, it may be provided here, otherwise it will be initialized with build_training_graph and trained with train

  • dtype – (optional, default 'float64') string specifying either float type 'float64' or 'float32'

Attributes:

  • projection_weights - (read only) array of the current projection weights

  • projection_biases - (read only) array of the projection biases

  • reconstruction_model - (read only) the current POUnet decoder

  • partition_centers - (read only) array of the current partition centers

  • partition_shapes - (read only) array of the current partition shape parameters

  • basis_type - (read only) string relaying the basis degree

  • basis_coeffs - (read only) array of the current basis coefficients

  • proj_ivar_center - (read only) array of the centering parameters used in the POUnet for the projected ivar inputs

  • proj_ivar_scale - (read only) array of the scaling parameters used in the POUnet for the projected ivar inputs

  • dtype - (read only) string relaying the data type ('float64' or 'float32')

  • training_archive - (read only) dictionary of the errors and POUnet states archived during training

  • iterations - (read only) array of the iterations archived during training

QoIAwareProjectionPOUnet.projection#
PCAfold.utilities.QoIAwareProjectionPOUnet.projection(self, ivars, nobias=False)#

Projects the independent variable inputs using the current projection weights and biases

Parameters
  • ivars – array of independent variable query points

  • nobias – (optional, default False) whether or not to apply the projection bias. Analogous to nocenter in the PCA transform function.

Returns

array of the projected independent variable query points

QoIAwareProjectionPOUnet.tf_projection#
PCAfold.utilities.QoIAwareProjectionPOUnet.tf_projection(self, y, nobias=False)#

Version of projection using TensorFlow operations and Tensors.

QoIAwareProjectionPOUnet.update_lr#
PCAfold.utilities.QoIAwareProjectionPOUnet.update_lr(self, lr)#

Updates the learning rate used during training.

Parameters

lr – float for the learning rate

QoIAwareProjectionPOUnet.update_l2reg#
PCAfold.utilities.QoIAwareProjectionPOUnet.update_l2reg(self, l2reg)#

Updates the least-squares regularization used during training.

Parameters

l2reg – float for the least-squares regularization

QoIAwareProjectionPOUnet.build_training_graph#
PCAfold.utilities.QoIAwareProjectionPOUnet.build_training_graph(self, ivars, dvars_function, error_type='abs', constrain_positivity=False, first_trainable_idx=0)#

Construct the graph used during training (including defining the training errors) with the provided training data

Parameters
  • ivars – array of independent variables for training

  • dvars_function – function (using tensorflow operations) for defining the dependent variable(s) for training. This must take a single argument of the projection weights which, if used, will be evaluated with the weights as they are updated

  • error_type – (optional, default 'abs') the type of training error: relative 'rel' or absolute 'abs'

  • constrain_positivity – (optional, default False) when True, it penalizes the training error with \(f - |f|\) for dependent variables \(f\). This can be useful for defining projected source term dependent variables, for example.

  • first_trainable_idx – (optional, default 0) This separates the trainable projection weights (with index greater than or equal to first_trainable_idx) from the nontrainable projection weights.

QoIAwareProjectionPOUnet.train#
PCAfold.utilities.QoIAwareProjectionPOUnet.train(self, iterations, archive_rate=100, use_best_archive_sse=True, verbose=False)#

Performs training using a block coordinate descent strategy. This alternates between updating the partition and projection parameters with gradient descent and updating the basis coefficients with least-squares.

Parameters
  • iterations – integer for number of training iterations to perform

  • archive_rate – (optional, default 100) the rate at which the errors and parameters are archived during training. These can be accessed with the training_archive attribute

  • use_best_archive_sse – (optional, default True) when True will set the POUnet parameters to those with the lowest error observed during training, otherwise the parameters from the last iteration are used

  • verbose – (optional, default False) when True will print progress

QoIAwareProjectionPOUnet.__call__#
PCAfold.utilities.QoIAwareProjectionPOUnet.__call__(self, xeval)#

Evaluates the encoder-decoder.

Parameters

xeval – array of independent variable query points

Returns

array of predictions

QoIAwareProjectionPOUnet.write_data_to_file#
PCAfold.utilities.QoIAwareProjectionPOUnet.write_data_to_file(self, filename)#

Save class data to a specified file using pickle. This does not include the archived data from training, which can be separately accessed with training_archive and saved outside of QoIAwareProjectionPOUnet.

Parameters

filename – string

QoIAwareProjectionPOUnet.load_data_from_file#
PCAfold.utilities.QoIAwareProjectionPOUnet.load_data_from_file(filename)#

Load data from a specified filename with pickle (following write_data_to_file)

Parameters

filename – string

Returns

dictionary of the encoder-decoder data

QoIAwareProjectionPOUnet.load_from_file#
PCAfold.utilities.QoIAwareProjectionPOUnet.load_from_file(filename)#

Load class from a specified filename with pickle (following write_data_to_file)

Parameters

filename – string

Returns

QoIAwareProjectionPOUnet


Bibliography#

AAS21

Elizabeth Armstrong and James C. Sutherland. A technique for characterising feature size and quality of manifolds. Combustion Theory and Modelling, 0(0):1–23, 2021. doi:10.1080/13647830.2021.1931715.

AZASP22

Kamila Zdybał, Elizabeth Armstrong, James C. Sutherland, and Alessandro Parente. Cost function for low-dimensional manifold topology assessment. Scientific Reports, 12:14496, 2022. URL: https://www.nature.com/articles/s41598-022-18655-1, doi:10.1038/s41598-022-18655-1.

UZPS23

Kamila Zdybał, Alessandro Parente, and James C. Sutherland. Improving reduced-order models through nonlinear decoding of projection-dependent model outputs. Article in preparation for PNAS, 2023.

UZSP22

Kamila Zdybał, James C. Sutherland, and Alessandro Parente. Manifold-informed state vector subset for reduced-order modeling. Proceedings of the Combustion Institute, 2022. URL: https://www.sciencedirect.com/science/article/pii/S1540748922000153, doi:10.1016/j.proci.2022.06.019.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Preprocessing#

In this tutorial, we present data manipulation functionalities of the preprocess module. To import the module:

from PCAfold import preprocess

Centering, scaling and constant variable removal#

We begin by generating a dummy data set:

import numpy as np

X = np.random.rand(100,20)

Several popular scaling options have been implemented, such as Auto (std), Range, VAST, or Pareto. Centering and scaling of data sets can be performed using the preprocess.center_scale function:

(X_cs, X_center, X_scale) = preprocess.center_scale(X, 'range', nocenter=False)

To invert the centering and scaling using the current centers and scales, the preprocess.invert_center_scale function can be used:

X = preprocess.invert_center_scale(X_cs, X_center, X_scale)
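
As a quick sanity check, the round trip through centering and scaling and its inverse should recover the original data to within numerical precision. A minimal sketch, reusing the imports above and keeping the original data under a separate name:

X_original = np.random.rand(100,20)
(X_cs, X_center, X_scale) = preprocess.center_scale(X_original, 'range', nocenter=False)
X_recovered = preprocess.invert_center_scale(X_cs, X_center, X_scale)

# Should print True:
print(np.allclose(X_original, X_recovered))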

If constant variables are present in the data set, they can be removed using the preprocess.remove_constant_vars function, which can be a useful preprocessing step before PCA is applied to a data set. If an artificial constant column is injected:

X[:,5] = np.ones((100,))

it can be removed by:

(X_removed, idx_removed, idx_retained) = preprocess.remove_constant_vars(X)
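
The returned quantities can be inspected to confirm which column was removed. A minimal sketch follows; the expected values in the comments follow from the constant column injected above, and the exact printed formatting may differ:

print(X_removed.shape)    # expected: (100, 19)
print(idx_removed)        # expected: the index of the injected constant column, 5
print(len(idx_retained))  # expected: 19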

In addition, an object of the PreProcessing class can be created to store the combination of the above preprocessing steps:

preprocessed = preprocess.PreProcessing(X, 'range', nocenter=False)

The centered and scaled data set can then be accessed as a class attribute:

preprocessed.X_cs

as well as centers and scales:

preprocessed.X_center
preprocessed.X_scale

Conditional statistics#

In this section, we demonstrate how conditional statistics can be computed and plotted for the original data set. A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and the chemical mechanism by Hawkes et al. [CHSSC07], is used as a demo data set. We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding mixture fraction observations, \(Z\), that will serve as the conditioning variable:

X = np.genfromtxt('data-state-space.csv', delimiter=',')
Z = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')

First, we create an object of the ConditionalStatistics class. We condition the entire data set, \(\mathbf{X}\), using the mixture fraction as the conditioning variable. We compute the conditional statistics in 20 bins of the conditioning variable:

cond = preprocess.ConditionalStatistics(X, Z, k=20)

We can then retrieve the centroids of the bins for which the conditional statistics have been computed:

cond.centroids

and retrieve different conditional statistics. For instance, the conditional mean can be accessed through:

conditional_mean = cond.conditional_mean
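
The retrieved statistics can also be visualized directly with matplotlib. The minimal sketch below conditions only the first state variable (temperature) and plots its conditional mean against the bin centroids; it assumes that ConditionalStatistics accepts a single-column data set:

import matplotlib.pyplot as plt

# Condition only the first state variable (temperature):
cond_T = preprocess.ConditionalStatistics(X[:,0:1], Z, k=20)

plt.figure(figsize=(10,4))
plt.scatter(Z, X[:,0], c='#c0c0c0', s=2)
plt.plot(cond_T.centroids, cond_T.conditional_mean.ravel(), 'k')
plt.xlabel('Mixture fraction [-]')
plt.ylabel('$T$ [K]')
plt.show()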

The conditional statistics can also be plotted using a dedicated function:

plt = preprocess.plot_conditional_statistics(X[:,0], Z, k=20, x_label='Mixture fraction [-]', y_label='$T$ [K]', color='#c0c0c0', statistics_to_plot=['mean', 'max', 'min'], figure_size=(10,4), save_filename=save_filename)
_images/conditional-statistics.svg

Note that the original data set plotted in the background can be colored using any vector variable:

plt = preprocess.plot_conditional_statistics(X[:,0], Z, k=20, statistics_to_plot=['mean', 'max', 'min'], x_label='Mixture fraction [-]', y_label='$T$ [K]', color=X[:,2], color_map='inferno', colorbar_label='$Y_{O_2}$ [-]', figure_size=(12.5,4), save_filename=save_filename)
_images/conditional-statistics-colored.svg

Multivariate outlier detection#

We first generate a synthetic data set with artificially appended outliers. This data set, with outliers visible as a cloud in the top right corner, can be seen below:

_images/data-manipulation-initial-data.svg
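
The code used to generate this data set is not shown here; a hypothetical sketch for producing a similar two-dimensional set with an appended outlier cloud could look as follows (the variable names and parameter values are illustrative assumptions and do not reproduce the figures in this section exactly):

np.random.seed(100)

n_main = 980
n_outliers = 20

# Main cloud of correlated observations:
x_main = np.random.normal(0, 1, n_main)
y_main = 2*x_main + np.random.normal(0, 0.5, n_main)

# Small cloud of outliers in the top right corner:
x_out = np.random.normal(8, 0.3, n_outliers)
y_out = np.random.normal(20, 0.5, n_outliers)

X = np.column_stack((np.concatenate((x_main, x_out)),
                     np.concatenate((y_main, y_out))))
n_observations = X.shape[0]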

We will first detect outliers with the 'MULTIVARIATE TRIMMING' method and demonstrate the effect of setting two levels of trimming_threshold.

We first set trimming_threshold=0.6:

(idx_outliers_removed, idx_outliers) = preprocess.outlier_detection(X, scaling='auto', detection_method='MULTIVARIATE TRIMMING', trimming_threshold=0.6, n_iterations=0, verbose=True)

With verbose=True we will see some more information on outliers detected:

Number of observations classified as outliers: 20

We can visualize the observations that were classified as outliers using the preprocess.plot_2d_clustering function, treating cluster \(k_0\) (blue) as the observations with outliers removed and cluster \(k_1\) (red) as the detected outliers.

We first create a dummy idx_new vector of cluster classifications based on the obtained idx_outliers. This can, for instance, be done in the following way:

idx_new = np.zeros((n_observations,))
for i in range(0, n_observations):
  if i in idx_outliers:
      idx_new[i] = 1

where n_observations is the total number of observations in the data set.
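
The visualization call itself might then look as follows (a minimal sketch; the two columns of X are assumed to be the plotted variables, and the labels are illustrative):

plt = preprocess.plot_2d_clustering(X[:,0],
                                    X[:,1],
                                    idx_new,
                                    x_label='$x$',
                                    y_label='$y$',
                                    first_cluster_index_zero=True,
                                    figure_size=(6,6),
                                    save_filename=None)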

The result of this detection can be seen below:

_images/data-manipulation-outliers-multivariate-trimming-60.svg

We then set the trimming_threshold=0.3 which will capture outliers earlier (at smaller Mahalanobis distances from the variables’ centroids).

(idx_outliers_removed, idx_outliers) = preprocess.outlier_detection(X, scaling='auto', detection_method='MULTIVARIATE TRIMMING', trimming_threshold=0.3, n_iterations=0, verbose=True)

With verbose=True we will see some more information on outliers detected:

Number of observations classified as outliers: 180

The result of this detection can be seen below:

_images/data-manipulation-outliers-multivariate-trimming-30.svg

It can be seen that the algorithm started to pick up outlier observations at the perimeter of the original data set.


Kernel density weighting#

In this tutorial we reproduce results on a synthetic data set from the following paper:

Coussement, A., Gicquel, O., & Parente, A. (2012). Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9), 2844-2855.

We begin by generating the synthetic data set that has two distinct clouds with many observations and an intermediate region with few observations:

from PCAfold import KernelDensity
from PCAfold import PCA
from PCAfold import reduction
import numpy as np

n_observations = 2021
x1 = np.zeros((n_observations,1))
x2 = np.zeros((n_observations,1))

for i in range(0,n_observations):

  R = np.random.rand()

  if i <= 999:

      x1[i] = -1 + 20*R
      x2[i] = 5*x1[i] + 100*R

  if i >= 1000 and i <= 1020:

      x1[i] = 420 + 8*(i+1 - 1001)
      x2[i] = 5000/200 * (x1[i] - 400) + 500*R

  if i >= 1021 and i <= 2020:

      x1[i] = 1000 + 20*R
      x2[i] = 5*x1[i] + 100*R

X = np.hstack((x1, x2))

This data set can be seen below:

_images/kernel-density-original-data.svg

We perform PCA on the data set and approximate it with a single principal component:

pca = PCA(X, scaling='auto', n_components=1)
PCs = pca.transform(X)
X_rec = pca.reconstruct(PCs)

Using the reduction.plot_parity function we can visualize how each variable is reconstructed:
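
A minimal sketch of such calls is shown below; it assumes that reduction.plot_parity takes the observed and the reconstructed variable as its first two arguments, and the keyword names follow the library's plotting conventions and are assumptions here:

plt = reduction.plot_parity(X[:,0], X_rec[:,0], x_label='Observed $x_1$', y_label='Reconstructed $x_1$', save_filename=None)
plt = reduction.plot_parity(X[:,1], X_rec[:,1], x_label='Observed $x_2$', y_label='Reconstructed $x_2$', save_filename=None)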

_images/kernel-density-original-x1.svg _images/kernel-density-original-x2.svg

We thus note that PCA adjusts to reconstruct the two densely populated regions well, while the intermediate region is not reconstructed well.

Single-variable case#

We will weight the data set using the kernel density weighting method in order to give more importance to the intermediate region. Kernel density weighting can be performed by instantiating an object of the KernelDensity class. As the first argument we pass the entire centered and scaled data set, and as the second argument we specify the conditioning variable based on which the weighting will be computed:

kernd_single = KernelDensity(pca.X_cs, pca.X_cs[:,0], verbose=True)

With verbose=True we will see which case is being run:

Single-variable case will be applied.

In general, whenever the conditioning variable is a single vector, the single-variable case will be used.

We then obtain the weighted data set:

X_weighted_single = kernd_single.X_weighted

Weights \(\mathbf{W_c}\) used to scale the data set can be accessed as well:

weights_single = kernd_single.weights

We perform PCA on the weighted data set and we project the centered and scaled original data set onto the basis identified on X_weighted_single:

pca_single = PCA(X_weighted_single, 'none', n_components=1, nocenter=True)
PCs_single = pca_single.transform(pca.X_cs)

Reconstruction of that data set can be obtained:

X_rec_single = pca_single.reconstruct(PCs_single)
X_rec_single = (X_rec_single * pca.X_scale) + pca.X_center

We can now use the reduction.plot_parity function to visualize the new reconstruction:

_images/kernel-density-single-x1.svg _images/kernel-density-single-x2.svg

We note that this time the intermediate region is better represented in the PCA reconstruction.

Multi-variable case#

In a similar way, the multi-variable case can be used by passing the entire two-dimensional data set as the conditioning variable:

kernd_multi = KernelDensity(pca.X_cs, pca.X_cs, verbose=True)

We then perform analogous steps to obtain the new reconstruction:

X_weighted_multi = kernd_multi.X_weighted
weights_multi = kernd_multi.weights

pca_multi = PCA(X_weighted_multi, 'none', n_components=1)
PCs_multi = pca_multi.transform(pca.X_cs)
X_rec_multi = pca_multi.reconstruct(PCs_multi)
X_rec_multi = (X_rec_multi * pca.X_scale) + pca.X_center

The result of this reconstruction can be seen below:

_images/kernel-density-multi-x1.svg _images/kernel-density-multi-x2.svg

Bibliography#

CHan20

Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.

CHSSC07

Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Data clustering#

In this tutorial, we present the clustering functionalities from the preprocess module.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.cluster import KMeans

and we set some initial parameters:

x_label = '$x$'
y_label = '$y$'
z_label = '$z$'
figure_size = (6,3)
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
save_filename = None
random_seed = 200

Visualize the clustering result in 2D#

We begin by demonstrating how the result of clustering can be visualized using the plotting functionalities from the preprocess module.

We generate a synthetic 2D data set composed of two distinct clouds:

np.random.seed(seed=random_seed)

n_observations = 1000

mean_1 = [0,1]
mean_2 = [6,4]
covariance_1 = [[2, 0.5], [0.5, 0.5]]
covariance_2 = [[3, 0.3], [0.3, 0.5]]

x_1, y_1 = np.random.multivariate_normal(mean_1, covariance_1, n_observations).T
x_2, y_2 = np.random.multivariate_normal(mean_2, covariance_2, n_observations).T
x = np.concatenate([x_1, x_2])
y = np.concatenate([y_1, y_2])

The original data set can be visualized using the function from the reduction module:

plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)
_images/tutorial-clustering-cloud-2d-data-set.svg

We divide the data into two clusters using the K-Means algorithm:

idx_kmeans = KMeans(n_clusters=2).fit(np.column_stack((x, y))).labels_

As soon as the idx vector of cluster classifications is known for the data set, the result of clustering can be visualized using the plot_2d_clustering function.

We plot the result of K-Means clustering on the 2D data set:

plt = preprocess.plot_2d_clustering(x, y, idx_kmeans, x_label=x_label, y_label=y_label, color_map=color_map, first_cluster_index_zero=False, figure_size=figure_size, save_filename=None)
_images/tutorial-clustering-cloud-2d-data-set-kmeans.svg

Note that the numbers in the legend, next to each cluster number, represent the number of samples in that cluster. The populations of each cluster can also be computed and printed, for instance through:

print(preprocess.get_populations(idx_kmeans))

which in this case will print:

[991, 1009]

Visualize the clustering result in 3D#

The clustering result can also be visualized in three-dimensional space. In this example, we generate a synthetic 3D data set composed of three connected planes:

n_observations = 50

x = np.tile(np.linspace(0,50,n_observations), n_observations)
y = np.zeros((n_observations,1))
z = np.zeros((n_observations*n_observations,1))

for i in range(1,n_observations):
    y = np.vstack((y, np.ones((n_observations,1))*i))
y = y.ravel()

for observation, x_value in enumerate(x):

    y_value = y[observation]

    if x_value <= 10:
        z[observation] = 2 * x_value + y_value
    elif x_value > 10 and x_value <= 35:
        z[observation] = 10 * x_value + y_value - 80
    elif x_value > 35:
        z[observation] = 5 * x_value + y_value + 95

(x, _, _) = preprocess.center_scale(x[:,None], scaling='0to1')
(y, _, _) = preprocess.center_scale(y[:,None], scaling='0to1')
(z, _, _) = preprocess.center_scale(z, scaling='0to1')

The original data set can be visualized using the function from the reduction module:

plt = reduction.plot_3d_manifold(x, y, z, elev=30, azim=-100, x_label=x_label, y_label=y_label, z_label=z_label, figure_size=(12,8), save_filename=None)
_images/tutorial-clustering-3d-data-set.svg

We divide the data into four clusters using the K-Means algorithm:

idx_kmeans = KMeans(n_clusters=4).fit(np.hstack((x, y, z))).labels_

The result of K-Means clustering can then be plotted in 3D:

plt = preprocess.plot_3d_clustering(x, y, z, idx_kmeans, elev=30, azim=-100, x_label=x_label, y_label=y_label, z_label=z_label, color_map=color_map, first_cluster_index_zero=False, figure_size=(12,8), save_filename=None)
_images/tutorial-clustering-3d-data-set-kmeans.svg

Clustering based on binning a single variable#

In this section, we demonstrate a few clustering functions that are implemented in PCAfold. All of them cluster data sets based on binning a single variable.

First, we generate a synthetic two-dimensional data set:

x = np.linspace(-1,1,100)
y = -x**2 + 1

The data set can be visualized using the function from the reduction module:

plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)
_images/tutorial-clustering-original-data-set.svg

We will now cluster the 2D data set according to bins of a single variable, \(x\).

Cluster into equal variable bins#
_images/clustering-variable-bins.svg

This clustering will divide the data set based on equal bins of a variable vector.

(idx_variable_bins, borders_variable_bins) = preprocess.variable_bins(x, 4, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[-1.0, -0.5, 0.0, 0.5, 1.0]

Bounds for cluster 0:
      -1.0, -0.5152
Bounds for cluster 1:
      -0.4949, -0.0101
Bounds for cluster 2:
      0.0101, 0.4949
Bounds for cluster 3:
      0.5152, 1.0

The result of clustering can be plotted in 2D:

plt = preprocess.plot_2d_clustering(x, y, idx_variable_bins, x_label=x_label, y_label=y_label, color_map=color_map, first_cluster_index_zero=False, grid_on=True, figure_size=figure_size, save_filename=None)

The visual result of this clustering can be seen below:

_images/tutorial-clustering-variable-bins-k4.svg

Note that this clustering function created four equal bins in the space of \(x\). In this case, since \(x\) ranges from -1 to 1, the bins are created as intervals of length 0.5 in the \(x\)-space.

Cluster into pre-defined variable bins#
_images/clustering-predefined-variable-bins.svg

This clustering will divide the data set into bins of a one-dimensional variable vector whose borders are specified by the user. Let’s specify the split values as split_values = [-0.6, 0.4, 0.8]:

split_values = [-0.6, 0.4, 0.8]
(idx_predefined_variable_bins, borders_predefined_variable_bins) = preprocess.predefined_variable_bins(x, split_values, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[-1.0, -0.6, 0.4, 0.8, 1.0]

Bounds for cluster 0:
      -1.0, -0.6162
Bounds for cluster 1:
      -0.596, 0.3939
Bounds for cluster 2:
      0.4141, 0.798
Bounds for cluster 3:
      0.8182, 1.0

The visual result of this clustering can be seen below:

_images/tutorial-clustering-predefined-variable-bins-k4.svg

This clustering function created four bins in the space of \(x\), where the splits in the \(x\)-space are located at \(x=-0.6\), \(x=0.4\) and \(x=0.8\).

Cluster into zero-neighborhood variable bins#

This partitioning relies on an unbalanced variable vector which, in principle, is assumed to have many observations whose values are close to zero and relatively few observations with values away from zero. This function can be used to separate close-to-zero observations into one cluster (split_at_zero=False) or two clusters (split_at_zero=True).

Without splitting at zero, split_at_zero=False#
_images/clustering-zero-neighborhood-bins.svg
(idx_zero_neighborhood_bins, borders_zero_neighborhood_bins) = preprocess.zero_neighborhood_bins(x, 3, zero_offset_percentage=10, split_at_zero=False, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[-1.  -0.2  0.2  1. ]

Bounds for cluster 0:
      -1.0, -0.2121
Bounds for cluster 1:
      -0.1919, 0.1919
Bounds for cluster 2:
      0.2121, 1.0

The visual result of this clustering can be seen below:

_images/tutorial-clustering-zero-neighborhood-bins-k3.svg

We note that the observations corresponding to \(x \approx 0\) have been classified into one cluster (\(k_2\)).

With splitting at zero, split_at_zero=True#
_images/clustering-zero-neighborhood-bins-zero-split.svg
(idx_zero_neighborhood_bins_split_at_zero, borders_zero_neighborhood_bins_split_at_zero) = preprocess.zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[-1.  -0.2  0.   0.2  1. ]

Bounds for cluster 0:
      -1.0, -0.2121
Bounds for cluster 1:
      -0.1919, -0.0101
Bounds for cluster 2:
      0.0101, 0.1919
Bounds for cluster 3:
      0.2121, 1.0

The visual result of this clustering can be seen below:

_images/tutorial-clustering-zero-neighborhood-bins-split-at-zero-k4.svg

We note that the observations corresponding to \(x \approx 0^{-}\) have been classified into one cluster (\(k_2\)) and the observations corresponding to \(x \approx 0^{+}\) have been classified into another cluster (\(k_3\)).


Clustering combustion data sets#

In this section, we present functions that are specifically aimed at clustering reactive flow data sets. We will use a data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and a chemical mechanism by Hawkes et al. [CHSSC07].

We import the flamelet data set:

X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
mixture_fraction = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')
Cluster into bins of the mixture fraction vector#
_images/clustering-mixture-fraction-bins.svg

In this example, we partition the data set into five bins of the mixture fraction vector. This is a feasible clustering strategy for non-premixed flames which takes advantage of the physics-based (supervised) partitioning of the data set based on local stoichiometry. The partitioning function requires specifying the value for the stoichiometric mixture fraction, \(Z_{st}\) (Z_stoich). Note that the first split in the data set is performed at \(Z_{st}\) and further splits are performed automatically on the fuel-lean and the fuel-rich branch.

Z_stoich = 0.273
(idx_mixture_fraction_bins, borders_mixture_fraction_bins) = preprocess.mixture_fraction_bins(mixture_fraction, 5, Z_stoich, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[0.         0.1365     0.273      0.51533333 0.75766667 1.        ]

Bounds for cluster 0:
      0.0, 0.1313
Bounds for cluster 1:
      0.1414, 0.2727
Bounds for cluster 2:
      0.2828, 0.5152
Bounds for cluster 3:
      0.5253, 0.7576
Bounds for cluster 4:
      0.7677, 1.0

The visual result of this clustering can be seen below:

_images/tutorial-clustering-mixture-fraction-bins-k4.svg

It can be seen that the data set is divided at the stoichiometric value of mixture fraction, in this case \(Z_{st} \approx 0.273\). The fuel-lean branch (the part of the flamelet to the left of \(Z_{st}\)) is divided into two clusters (\(k_1\) and \(k_2\)) and the fuel-rich branch (the part of the flamelet to the right of \(Z_{st}\)) is divided into three clusters (\(k_3\), \(k_4\) and \(k_5\)), since this branch has a longer range in the mixture fraction space.

Separating close-to-zero principal component source terms#

The function zero_neighborhood_bins can be used to separate close-to-zero source terms of the original variables (or close-to-zero source terms of the principal components (PCs)). The zero source terms physically correspond to the steady-state.

We first compute the source terms of the principal components by transforming the source terms of the original variables to the new PC-basis:

pca_X = reduction.PCA(X, scaling='auto', n_components=2)
S_Z = pca_X.transform(S_X, nocenter=True)

and we use the first PC source term, \(S_{Z,1}\), as the conditioning variable for the clustering function:

(idx_close_to_zero_source_terms, borders_close_to_zero_source_terms) = preprocess.zero_neighborhood_bins(S_Z[:,0], 4, zero_offset_percentage=5, split_at_zero=True, verbose=True)

With verbose=True we will see some detailed information on clustering:

Border values for bins:
[-87229.83051401  -5718.91469641      0.           5718.91469641
  27148.46341416]

Bounds for cluster 0:
      -87229.8305, -5722.1432
Bounds for cluster 1:
      -5717.5228, -0.0
Bounds for cluster 2:
      0.0, 5705.7159
Bounds for cluster 3:
      5719.0347, 27148.4634

The visual result of this clustering can be seen below:

_images/tutorial-clustering-close-to-zero-source-terms-k4.svg

From the verbose information, we can see that the first cluster (\(k_1\)) contains observations corresponding to the highly negative values of \(S_{Z,1}\), the second cluster (\(k_2\)) to the close-to-zero but negative values of \(S_{Z,1}\), the third cluster (\(k_3\)) to the close-to-zero but positive values of \(S_{Z,1}\) and the fourth cluster (\(k_4\)) to the highly positive values of \(S_{Z,1}\).

We can further merge the two clusters that contain observations corresponding to the high magnitudes of \(S_{Z, 1}\) into one cluster. This can be achieved using the function flip_clusters. We change the label of the fourth cluster to 0 and thus all observations from the fourth cluster are now assigned to the first cluster.

idx_merged = preprocess.flip_clusters(idx_close_to_zero_source_terms, {3:0})

The visual result of this merged clustering can be seen below:

_images/tutorial-clustering-close-to-zero-source-terms-merged-k4.svg

If we further plot the two-dimensional flamelet manifold, colored by \(S_{Z, 1}\), we can check that the clustering technique correctly identified the regions on the manifold where \(S_{Z, 1} \approx 0\) as well as the regions where \(S_{Z, 1}\) has high positive or high negative magnitudes.
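A sketch of how such a plot could be generated, assuming the two-dimensional manifold is given by the principal components of the pca_X object defined above (labels and figure size are illustrative):

# Project the original variables onto the two-dimensional PC-basis:
PCs = pca_X.transform(X)

# Color the manifold by the first PC source term:
plt = reduction.plot_2d_manifold(PCs[:,0], PCs[:,1], color=S_Z[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$S_{Z,1}$', figure_size=(8,4), save_filename=None)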

_images/tutorial-clustering-close-to-zero-source-terms-manifold.svg

Bibliography#

CHan20

Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.

CHSSC07

Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Data sampling#

In this tutorial, we present how train and test samples can be selected using the sampling functionalities of the preprocess module. In general, train and test samples will always be some subset of the entire data set X:

_images/tts-train-test-select.svg

We import the necessary modules:

from PCAfold import DataSampler
from PCAfold import preprocess
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

and we set some initial parameters:

save_filename = None
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
first_cluster = False
figure_size = (5,5)
random_seed = 200
np.random.seed(seed=random_seed)

We generate a synthetic data set composed of four distinct clusters with an imbalanced number of observations (100, 250, 400 and 500 observations respectively, giving 1250 observations in total):

N_1, N_2, N_3, N_4 = 100, 250, 400, 500
n_observations = N_1 + N_2 + N_3 + N_4
mean_k1, mean_k2, mean_k3, mean_k4 = [-3, 3], [3, 3], [-3, -3], [3, -3]
covariance = [[1, 0.2], [0.2, 1]]
x_k1, y_k1 = np.random.multivariate_normal(mean_k1, covariance, N_1).T
x_k2, y_k2 = np.random.multivariate_normal(mean_k2, covariance, N_2).T
x_k3, y_k3 = np.random.multivariate_normal(mean_k3, covariance, N_3).T
x_k4, y_k4 = np.random.multivariate_normal(mean_k4, covariance, N_4).T
x = np.vstack((x_k1[:,np.newaxis], x_k2[:,np.newaxis], x_k3[:,np.newaxis], x_k4[:,np.newaxis]))
y = np.vstack((y_k1[:,np.newaxis], y_k2[:,np.newaxis], y_k3[:,np.newaxis], y_k4[:,np.newaxis]))
idx = np.vstack((np.zeros((N_1, 1)), np.ones((N_2, 1)), 2*np.ones((N_3, 1)), 3*np.ones((N_4, 1)))).astype(int).ravel()
populations = preprocess.get_populations(idx)

We visualize the original data set:
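A minimal sketch of such a plot using matplotlib directly, coloring the observations by the idx vector (the axis labels are illustrative):

plt.figure(figsize=figure_size)
plt.scatter(x.ravel(), y.ravel(), c=idx, cmap=color_map, s=5)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()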

_images/tutorial-train-test-select-original-data-set.svg

The only information about the original data set that will be needed is the vector idx of cluster classifications.

Note

Note that idx_train and idx_test, that are the outputs of the sampling functions in this module, have a different interpretation than idx. They are vectors containing observation indices, not cluster classifications. For instance, if train samples are composed of the first, second and tenth observation then idx_train=[0,1,9].

You can find which cluster each observation in idx_train (or idx_test) belongs to, for instance through:

idx[idx_train,]
idx[idx_test,]

You can also extract a subset of idx_train that are only the indices belonging to a particular cluster. For instance, for the first cluster you can extract them by:

train_indices_in_cluster_1 = [i for i in idx_train if idx[i,]==0]

for the second cluster:

train_indices_in_cluster_2 = [i for i in idx_train if idx[i,]==1]

and so on.
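More generally, the train indices belonging to every cluster can be collected at once, for example with a dictionary comprehension (a sketch using the idx and idx_train arrays discussed above):

# Dictionary mapping each cluster label to the train indices that fall in that cluster:
train_indices_per_cluster = {k: [i for i in idx_train if idx[i] == k] for k in np.unique(idx)}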

We start by initializing an object of the DataSampler class. For the moment, we will leave the idx_test parameter set to None, but we will demonstrate an example of setting it to something else later. Note that we can set a fixed random seed if we want the sampling results to be reproducible. With verbose=True, we will additionally see some detailed information about the current sampling.

sample = DataSampler(idx, idx_test=None, random_seed=random_seed, verbose=True)

Sample a fixed number#

We first select a fixed number of samples using the DataSampler.number function. Let’s request 15% of the total data to be the train data. The function calculates that it needs to select 46 samples from each cluster, which amounts to 14.7% of the total number of samples in the data set. Whenever the exact percentage requested by the user cannot be achieved, the function always under-samples.
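To see where these numbers come from, here is a quick back-of-the-envelope check (a sketch of the arithmetic only, not the library's internal code; idx is the classification vector defined above):

n_total = len(idx)                               # 1250 observations in total
n_from_each_cluster = int(0.15 * n_total / 4)    # 46 samples from each of the 4 clusters
print(n_from_each_cluster * 4 / n_total * 100)   # 14.72, i.e. the 14.7% quoted above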

_images/sampling-test-selection-option-number.svg
Select test data with test_selection_option=1#

There are always two ways in which the complementary test data can be selected. They can be selected using the test_selection_option parameter. We start with test_selection_option=1, which selects all remaining observations as the test data:

(idx_train, idx_test) = sample.number(15, test_selection_option=1)

Setting verbose=True lets us see some detailed information on sampling:

Cluster 0: taking 46 train samples out of 100 observations (46.0%).
Cluster 1: taking 46 train samples out of 250 observations (18.4%).
Cluster 2: taking 46 train samples out of 400 observations (11.5%).
Cluster 3: taking 46 train samples out of 500 observations (9.2%).

Cluster 0: taking 54 test samples out of 54 remaining observations (100.0%).
Cluster 1: taking 204 test samples out of 204 remaining observations (100.0%).
Cluster 2: taking 354 test samples out of 354 remaining observations (100.0%).
Cluster 3: taking 454 test samples out of 454 remaining observations (100.0%).

Selected 184 train samples (14.7%) and 1066 test samples (85.3%).

A dedicated plotting function from the preprocess module can be used to visualize the train and test samples. This function takes as inputs the obtained idx_train and idx_test vectors. Note that a custom colormap can be specified by the user.

plt = preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, color_map=color_map, first_cluster_index_zero=False, figure_size=(10,5), save_filename=None)

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-fixed-number-1.svg
Select test data with test_selection_option=2#

We then set test_selection_option=2 which selects a fixed number of test samples from each cluster, calculated based on the smallest cluster. This amounts to 54 test samples from each cluster.

(idx_train, idx_test) = sample.number(15, test_selection_option=2)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 46 train samples out of 100 observations (46.0%).
Cluster 1: taking 46 train samples out of 250 observations (18.4%).
Cluster 2: taking 46 train samples out of 400 observations (11.5%).
Cluster 3: taking 46 train samples out of 500 observations (9.2%).

Cluster 0: taking 54 test samples out of 54 remaining observations (100.0%).
Cluster 1: taking 54 test samples out of 204 remaining observations (26.5%).
Cluster 2: taking 54 test samples out of 354 remaining observations (15.3%).
Cluster 3: taking 54 test samples out of 454 remaining observations (11.9%).

Selected 184 train samples (14.7%) and 216 test samples (17.3%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-fixed-number-2.svg

Sample a fixed percentage#

Next, we select a percentage of samples from each cluster using the DataSampler.percentage function. Let’s request 10% of the total data to be the train data - the function selects 10% of samples from each cluster.

_images/sampling-test-selection-option-percentage.svg
Select test data with test_selection_option=1#

We start with test_selection_option=1, which selects all remaining observations as the test data:

(idx_train, idx_test) = sample.percentage(10, test_selection_option=1)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 10 train samples out of 100 observations (10.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 50 train samples out of 500 observations (10.0%).

Cluster 0: taking 90 test samples out of 90 remaining observations (100.0%).
Cluster 1: taking 225 test samples out of 225 remaining observations (100.0%).
Cluster 2: taking 360 test samples out of 360 remaining observations (100.0%).
Cluster 3: taking 450 test samples out of 450 remaining observations (100.0%).

Selected 125 train samples (10.0%) and 1125 test samples (90.0%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-fixed-percentage-1.svg
Select test data with test_selection_option=2#

We then set test_selection_option=2 which uses the same procedure to select the test data as was used to select the train data. In this case, it also selects 10% of samples from each cluster as the test samples.

(idx_train, idx_test) = sample.percentage(10, test_selection_option=2)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 10 train samples out of 100 observations (10.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 50 train samples out of 500 observations (10.0%).

Cluster 0: taking 10 test samples out of 90 remaining observations (11.1%).
Cluster 1: taking 25 test samples out of 225 remaining observations (11.1%).
Cluster 2: taking 40 test samples out of 360 remaining observations (11.1%).
Cluster 3: taking 50 test samples out of 450 remaining observations (11.1%).

Selected 125 train samples (10.0%) and 125 test samples (10.0%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-fixed-percentage-2.svg

Sample manually#

We select samples manually from each cluster using the DataSampler.manual function.

_images/sampling-test-selection-option-manual.svg
Select test data with test_selection_option=1#

We start with test_selection_option=1 which selects all remaining observations as the test data. Let’s request 4, 5, 10 and 2 samples from the first, second, third and fourth cluster respectively. The sampling dictionary will thus have to be: sampling_dictionary={0:4, 1:5, 2:10, 3:2}. Note that the function still selects those samples randomly from each cluster. We should also change sampling_type to 'number' so that samples are selected on a number and not a percentage basis:

(idx_train, idx_test) = sample.manual({0:4, 1:5, 2:10, 3:2}, sampling_type='number', test_selection_option=1)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 4 train samples out of 100 observations (4.0%).
Cluster 1: taking 5 train samples out of 250 observations (2.0%).
Cluster 2: taking 10 train samples out of 400 observations (2.5%).
Cluster 3: taking 2 train samples out of 500 observations (0.4%).

Cluster 0: taking 96 test samples out of 96 remaining observations (100.0%).
Cluster 1: taking 245 test samples out of 245 remaining observations (100.0%).
Cluster 2: taking 390 test samples out of 390 remaining observations (100.0%).
Cluster 3: taking 498 test samples out of 498 remaining observations (100.0%).

Selected 21 train samples (1.7%) and 1229 test samples (98.3%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-manually-1.svg
Select test data with test_selection_option=2#

We then set test_selection_option=2 which uses the same procedure to select the test data as was used to select the train data. This time, let’s request 50%, 10%, 10% and 20% from the first, second, third and fourth cluster respectively. The sampling dictionary will thus have to be: sampling_dictionary={0:50, 1:10, 2:10, 3:20} and we should change the sampling_type to 'percentage':

(idx_train, idx_test) = sample.manual({0:50, 1:10, 2:10, 3:20}, sampling_type='percentage', test_selection_option=2)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 50 train samples out of 100 observations (50.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 100 train samples out of 500 observations (20.0%).

Cluster 0: taking 50 test samples out of 50 remaining observations (100.0%).
Cluster 1: taking 25 test samples out of 225 remaining observations (11.1%).
Cluster 2: taking 40 test samples out of 360 remaining observations (11.1%).
Cluster 3: taking 100 test samples out of 400 remaining observations (25.0%).

Selected 215 train samples (17.2%) and 215 test samples (17.2%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-manually-2.svg

Sample at random#

Finally, we select random samples using the DataSampler.random function. Let’s request 10% of the total data to be the train data.

_images/sampling-test-selection-option-random.svg

Note

Random sampling will typically give a very similar sample distribution to percentage sampling. The only difference is that percentage sampling will maintain the percentage perc exactly within each cluster, while random sampling will typically result in small variations from perc in each cluster, since it samples independently of the cluster definitions. A quick way to check this is shown below.
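For instance, one could compare the per-cluster sample percentages obtained with the two approaches directly (a sketch using the sample, idx and populations objects defined earlier; the exact numbers will vary with the random seed):

(idx_train_percentage, _) = sample.percentage(10, test_selection_option=1)
(idx_train_random, _) = sample.random(10, test_selection_option=1)

for k in range(4):
    # Percentage of each cluster's observations selected by the two sampling strategies:
    perc_in_k = 100 * np.sum(idx[idx_train_percentage] == k) / populations[k]
    rand_in_k = 100 * np.sum(idx[idx_train_random] == k) / populations[k]
    print('Cluster %i: %.1f%% (percentage sampling) vs. %.1f%% (random sampling)' % (k, perc_in_k, rand_in_k))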

Select test data with test_selection_option=1#

We start with test_selection_option=1 which selects all remaining observations as test data.

(idx_train, idx_test) = sample.random(10, test_selection_option=1)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 14 train samples out of 100 observations (14.0%).
Cluster 1: taking 28 train samples out of 250 observations (11.2%).
Cluster 2: taking 42 train samples out of 400 observations (10.5%).
Cluster 3: taking 41 train samples out of 500 observations (8.2%).

Cluster 0: taking 86 test samples out of 86 remaining observations (100.0%).
Cluster 1: taking 222 test samples out of 222 remaining observations (100.0%).
Cluster 2: taking 358 test samples out of 358 remaining observations (100.0%).
Cluster 3: taking 459 test samples out of 459 remaining observations (100.0%).

Selected 125 train samples (10.0%) and 1125 test samples (90.0%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-random-doc-1.svg
Select test data with test_selection_option=2#

We then set test_selection_option=2 which uses the same procedure to select the test data as was used to select the train data. In this case, it will also sample 10% of the total data set as the test data.

(idx_train, idx_test) = sample.random(10, test_selection_option=2)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 14 train samples out of 100 observations (14.0%).
Cluster 1: taking 28 train samples out of 250 observations (11.2%).
Cluster 2: taking 42 train samples out of 400 observations (10.5%).
Cluster 3: taking 41 train samples out of 500 observations (8.2%).

Cluster 0: taking 8 test samples out of 86 remaining observations (9.3%).
Cluster 1: taking 25 test samples out of 222 remaining observations (11.3%).
Cluster 2: taking 29 test samples out of 358 remaining observations (8.1%).
Cluster 3: taking 63 test samples out of 459 remaining observations (13.7%).

Selected 125 train samples (10.0%) and 125 test samples (10.0%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-random-doc-2.svg

Maintaining a fixed test data set#

In this example, we further illustrate how the functionality of maintaining a fixed test data set can be utilized. Suppose that in every cluster you have a very distinct set of observations on which you always want to test your model. You can point out those observations when initializing a DataSampler object through the idx_test parameter.

We simulate this situation by appending additional samples to the previously defined data set. We add 20 samples in each cluster; those samples can be seen in the figure below as smaller clouds next to each cluster:

_images/tutorial-train-test-select-original-data-set-appended-doc.svg
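A minimal sketch of how such an appended data set could be constructed (the locations and spread of the small clouds are chosen arbitrarily here, purely for illustration):

n_extra = 20
extra_means = [[-6, 6], [6, 6], [-6, -6], [6, -6]]
x_extra, y_extra, idx_extra = [], [], []
for k, mean_k in enumerate(extra_means):
    # Generate a small, tight cloud next to cluster k:
    x_e, y_e = np.random.multivariate_normal(mean_k, [[0.1, 0], [0, 0.1]], n_extra).T
    x_extra.append(x_e[:,np.newaxis])
    y_extra.append(y_e[:,np.newaxis])
    idx_extra.append(k * np.ones(n_extra))

x = np.vstack([x] + x_extra)
y = np.vstack([y] + y_extra)
idx = np.concatenate([idx] + idx_extra).astype(int)

# The appended observations sit at the end of the data set, so their indices form the fixed test set:
idx_test = np.arange(n_observations, n_observations + 4*n_extra)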

Assuming that we know the indices of points that represent the appended clouds, stored in idx_test, we can use that array of indices as an input parameter:

sample = DataSampler(idx, idx_test=idx_test, random_seed=random_seed, verbose=True)

Any sampling function now called will maintain those samples as the test data and the train data will be sampled ignoring the indices in idx_test. Note also that if idx_test is specified, the test_selection_option parameter is ignored.

We will demonstrate this sampling using the DataSampler.random function, but any other sampling function that we demonstrated earlier can be used as well.

(idx_train, idx_test) = sample.random(80, test_selection_option=2)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 86 train samples out of 120 observations (71.7%).
Cluster 1: taking 211 train samples out of 270 observations (78.1%).
Cluster 2: taking 347 train samples out of 420 observations (82.6%).
Cluster 3: taking 420 train samples out of 520 observations (80.8%).

Cluster 0: taking 20 test samples out of 34 remaining observations (58.8%).
Cluster 1: taking 20 test samples out of 59 remaining observations (33.9%).
Cluster 2: taking 20 test samples out of 73 remaining observations (27.4%).
Cluster 3: taking 20 test samples out of 100 remaining observations (20.0%).

Selected 1064 train samples (80.0%) and 80 test samples (6.0%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-random-with-idx-test-doc.svg

Chaining sampling functions#

Finally, we discuss an interesting use case for chaining two sampling functions, where the train samples obtained from one sampling become the fixed test data for another sampling.

Suppose that our target is to have a fixed test data set composed of:

  • 10 samples from the first cluster

  • 20 samples from the second cluster

  • 10 samples from the third cluster

  • 50 samples from the fourth cluster

and, at the same time, select a fixed number of train samples from each cluster.

We can start by generating the desired test samples using the DataSampler.manual function. The train data that it returns will serve as the fixed test data:

sample = DataSampler(idx, random_seed=random_seed, verbose=True)
(idx_test, _) = sample.manual({0:10, 1:20, 2:10, 3:50}, sampling_type='number', test_selection_option=1)

Now we feed the obtained test set as a fixed test set for the target sampling:

sample.idx_test = idx_test
(idx_train, idx_test) = sample.number(19.5, test_selection_option=1)

With verbose=True we will see some detailed information on sampling:

Cluster 0: taking 60 train samples out of 100 observations (60.0%).
Cluster 1: taking 60 train samples out of 250 observations (24.0%).
Cluster 2: taking 60 train samples out of 400 observations (15.0%).
Cluster 3: taking 60 train samples out of 500 observations (12.0%).

Cluster 0: taking 10 test samples out of 40 remaining observations (25.0%).
Cluster 1: taking 20 test samples out of 190 remaining observations (10.5%).
Cluster 2: taking 10 test samples out of 340 remaining observations (2.9%).
Cluster 3: taking 50 test samples out of 440 remaining observations (11.4%).

Selected 240 train samples (19.2%) and 90 test samples (7.2%).

The visual result of this sampling can be seen below:

_images/tutorial-train-test-select-chaining-functions.svg

Notice that we have achieved what we wanted: we generated the desired test data set with 10, 20, 10 and 50 samples, and we also have an equal number of train samples selected from each cluster, in this case 60 samples.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Global and local PCA#

In this tutorial, we present how global and local PCA can be performed on a synthetic data set using the reduction module.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import PCA, LPCA
import matplotlib.pyplot as plt
from matplotlib import gridspec
from matplotlib.colors import ListedColormap
import numpy as np

and we set some initial parameters:

n_points = 1000
save_filename = None
global_color = '#454545'
k1_color = '#0e7da7'
k2_color = '#ceca70'
color_map = ListedColormap([k1_color, k2_color])

Generate a synthetic data set for global PCA#

We generate a synthetic data set on which the global PCA will be performed. This data set is composed of a single cloud of points.

mean_global = [0,1]
covariance_global = [[3.4, 1.1], [1.1, 2.1]]

x_noise, y_noise = np.random.multivariate_normal(mean_global, covariance_global, n_points).T
y_global = np.linspace(0,4,n_points)
x_global = -(y_global**2) + 7*y_global + 4
y_global = y_global + y_noise
x_global = x_global + x_noise

Dataset_global = np.hstack((x_global[:,np.newaxis], y_global[:,np.newaxis]))

This data set can be seen below:

_images/tutorial-pca-data-set-for-global-pca.svg
Global PCA#

We perform global PCA to obtain global principal components, global eigenvectors and global eigenvalues:

pca = PCA(Dataset_global, 'none', n_components=2)
principal_components_global = pca.transform(Dataset_global, nocenter=False)
eigenvectors_global = pca.A
eigenvalues_global = pca.L

We also retrieve the centered and scaled data set:

Dataset_global_pp = pca.X_cs

Generate a synthetic data set for local PCA#

Similarly, we generate another synthetic data set that is composed of two distinct clouds of points.

mean_local_1 = [0,1]
mean_local_2 = [6,4]
covariance_local_1 = [[2, 0.5], [0.5, 0.5]]
covariance_local_2 = [[3, 0.3], [0.3, 0.5]]

x_noise_1, y_noise_1 = np.random.multivariate_normal(mean_local_1, covariance_local_1, n_points).T
x_noise_2, y_noise_2 = np.random.multivariate_normal(mean_local_2, covariance_local_2, n_points).T
x_local = np.concatenate([x_noise_1, x_noise_2])
y_local = np.concatenate([y_noise_1, y_noise_2])

Dataset_local = np.hstack((x_local[:,np.newaxis], y_local[:,np.newaxis]))

This data set can be seen below:

_images/tutorial-pca-data-set-for-local-pca.svg
Cluster the data set for local PCA#

We perform clustering of this data set based on pre-defined bins using the available preprocess.predefined_variable_bins function. We obtain cluster classifications and centroids for each cluster:

(idx, borders) = preprocess.predefined_variable_bins(Dataset_local[:,0], [2.5], verbose=False)
centroids = preprocess.get_centroids(Dataset_local, idx)

The result of this clustering can be seen below:

_images/tutorial-local-pca-clustering.svg

In local PCA, PCA is applied in each cluster separately.

Local PCA#

We perform local PCA to obtain local principal components, local eigenvectors and local eigenvalues:

lpca = LPCA(Dataset_local, idx, scaling='none')
principal_components_local = lpca.principal_components
eigenvectors_local = lpca.A
eigenvalues_local = lpca.L

Plotting global versus local PCA#

Finally, for demonstration purposes, we plot the identified global and local eigenvectors on top of both synthetic data sets. The visual result of performing PCA globally and locally can be seen below:

_images/tutorial-pca-global-local-pca.svg
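A minimal sketch of how such an overlay could be produced with matplotlib for the global case (the arrow scaling and colors are illustrative; for the local case the same overlay is repeated per cluster, anchored at each cluster's centroid):

# Anchor the global eigenvectors at the mean of the data:
origin = np.mean(Dataset_global, axis=0)

plt.figure(figsize=(6,6))
plt.scatter(Dataset_global[:,0], Dataset_global[:,1], color=global_color, s=2)
for i in range(2):
    # Each column of pca.A is one global eigenvector:
    plt.quiver(origin[0], origin[1], eigenvectors_global[0,i], eigenvectors_global[1,i], scale=5, color='r')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()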

Note that in local PCA, a separate set of eigenvectors is found in each cluster. The same goes for the principal components and eigenvalues.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Plotting PCA results#

In this tutorial, we present plotting functionalities from the reduction module that aid in visualizing PCA results.

We import the necessary modules:

from PCAfold import PCA
from PCAfold import reduction
import numpy as np

and we set some initial parameters:

title = None
save_filename = None

As an example, we will use a data set representing combustion of syngas (CO/H2 mixture) in air generated from the steady laminar flamelet model. This data set has 11 variables and 50,000 observations. The data set was generated using Spitfire software [CHan20] and a chemical mechanism by Hawkes et al. [CHSSC07]. To load the data set from the tutorials directory:

X = np.genfromtxt('data-state-space.csv', delimiter=',')
X_names = ['$T$', '$H_2$', '$O_2$', '$O$', '$OH$', '$H_2O$', '$H$', '$HO_2$', '$CO$', '$CO_2$', '$HCO$']

We generate four PCA objects corresponding to four scaling criteria:

pca_X_Auto = PCA(X, scaling='auto', n_components=3)
pca_X_Range = PCA(X, scaling='range', n_components=3)
pca_X_Vast = PCA(X, scaling='vast', n_components=3)
pca_X_Pareto = PCA(X, scaling='pareto', n_components=3)

and we will plot PCA results from the generated objects.


Eigenvectors#

Weights of a single eigenvector can be plotted using the reduction.plot_eigenvectors function. Note that multiple eigenvectors can be passed as an input, in which case this function will generate as many plots as there are eigenvectors supplied.

Below is an example of plotting just the first eigenvector:

plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], variable_names=X_names)

To plot all eigenvectors resulting from a single PCA class object:

plts = reduction.plot_eigenvectors(pca_X_Auto.A, variable_names=X_names)

Two weight normalizations are available:

  • No normalization. To use this variant set plot_absolute=False. Example can be seen below:

plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], eigenvectors_indices=[], variable_names=X_names, plot_absolute=False, save_filename=save_filename)
_images/eigenvector-1-plotting-pca.svg
  • Absolute values. To use this variant set plot_absolute=True. Example can be seen below:

plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], eigenvectors_indices=[], variable_names=X_names, plot_absolute=True, save_filename=save_filename)
_images/eigenvector-1-plotting-pca-absolute.svg

Eigenvectors comparison#

Eigenvectors resulting from, for instance, different PCA class objects can be compared on a single plot using the reduction.plot_eigenvectors_comparison function.

Two weight normalizations are available:

  • No normalization. To use this variant set plot_absolute=False. Example can be seen below:

plt = reduction.plot_eigenvectors_comparison((pca_X_Auto.A[:,0], pca_X_Range.A[:,0], pca_X_Vast.A[:,0], pca_X_Pareto.A[:,0]), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], variable_names=X_names, plot_absolute=False, color_map='coolwarm', save_filename=save_filename)
_images/plotting-pca-eigenvectors-comparison.svg
  • Absolute values. To use this variant set plot_absolute=True. Example can be seen below:

plt = reduction.plot_eigenvectors_comparison((pca_X_Auto.A[:,0], pca_X_Range.A[:,0], pca_X_Vast.A[:,0], pca_X_Pareto.A[:,0]), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], variable_names=X_names, plot_absolute=True, color_map='coolwarm', save_filename=save_filename)
_images/plotting-pca-eigenvectors-comparison-absolute.svg

Eigenvalue distribution#

Eigenvalue distribution can be plotted using the reduction.plot_eigenvalue_distribution function.

Two eigenvalue normalizations are available:

  • No normalization. To use this variant set normalized=False. Example can be seen below:

plt = reduction.plot_eigenvalue_distribution(pca_X_Auto.L, normalized=False, save_filename=save_filename)
_images/plotting-pca-eigenvalue-distribution.svg
  • Normalized to 1. To use this variant set normalized=True. Example can be seen below:

plt = reduction.plot_eigenvalue_distribution(pca_X_Auto.L, normalized=True, save_filename=save_filename)
_images/plotting-pca-eigenvalue-distribution-normalized.svg

Eigenvalue distribution comparison#

Eigenvalues resulting from, for instance, different PCA class objects can be compared on a single plot using the reduction.plot_eigenvalues_comparison function.

Two eigenvalue normalizations are available:

  • No normalization. To use this variant set normalized=False. Example can be seen below:

plt = reduction.plot_eigenvalue_distribution_comparison((pca_X_Auto.L, pca_X_Range.L, pca_X_Vast.L, pca_X_Pareto.L), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], normalized=False, color_map='coolwarm', save_filename=save_filename)
_images/plotting-pca-eigenvalue-distribution-comparison.svg
  • Normalized to 1. To use this variant set normalized=True. Example can be seen below:

plt = reduction.plot_eigenvalue_distribution_comparison((pca_X_Auto.L, pca_X_Range.L, pca_X_Vast.L, pca_X_Pareto.L), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], normalized=True, color_map='coolwarm', save_filename=save_filename)
_images/plotting-pca-eigenvalue-distribution-comparison-normalized.svg

Cumulative variance#

Cumulative variance computed from eigenvalues can be plotted using the reduction.plot_cumulative_variance function. Example of a plot:

plt = reduction.plot_cumulative_variance(pca_X_Auto.L, n_components=0, save_filename=save_filename)
_images/cumulative-variance.svg

The number of eigenvalues to look at can also be truncated by setting the n_components input parameter accordingly. Below is an example of a plot with n_components=5:

plt = reduction.plot_cumulative_variance(pca_X_Auto.L, n_components=5, save_filename=save_filename)
_images/cumulative-variance-truncated.svg

Two-dimensional manifold#

Two-dimensional manifold resulting from performing PCA transformation can be plotted using the reduction.plot_2d_manifold function. We first calculate the principal components by transforming the original data set to the new basis:

principal_components = pca_X_Vast.transform(X)

By setting color=X[:,0] parameter, the manifold can be additionally colored by the first variable in the data set (in this case, the temperature). Note that you can select the colormap to use through the color_map parameter. Example of using color_map='inferno' and coloring by the first variable in the data set:

plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(10,4), save_filename=save_filename)
_images/plotting-pca-2d-manifold-inferno.svg

Example of an uncolored plot:

plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], x_label='$Z_1$', y_label='$Z_2$', figure_size=(10,4), save_filename=save_filename)
_images/plotting-pca-2d-manifold-black.svg

Example of using color_map='Blues' and coloring by the first variable in the data set:

plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$T$ [K]', color_map='Blues', figure_size=(10,4), save_filename=save_filename)
_images/plotting-pca-2d-manifold-blues.svg

Three-dimensional manifold#

Similarly, a three-dimensional manifold can be visualized:

plt = reduction.plot_3d_manifold(principal_components[:,0], principal_components[:,1], principal_components[:,2], elev=30, azim=-20, color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', z_label='$Z_3$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(15,8), save_filename=save_filename)
_images/plotting-pca-3d-manifold.svg

Parity plot#

Parity plots of reconstructed variables can be visualized using the reduction.plot_parity function. We first approximate the data set using the previously obtained principal components:

X_rec = pca_X_Vast.reconstruct(principal_components)

and we generate a parity plot which visualizes the reconstruction of the first variable:

plt = reduction.plot_parity(X[:,0], X_rec[:,0], color=X[:,0], x_label='Observed $T$', y_label='Reconstructed $T$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(7,7), save_filename=None)
_images/plotting-pca-parity.svg

As in the plot_2d_manifold function, you can select the colormap to use through the color_map parameter.
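For instance, the same parity plot with color_map='Blues' could look as follows (a sketch; only the colormap differs from the call above):

plt = reduction.plot_parity(X[:,0], X_rec[:,0], color=X[:,0], x_label='Observed $T$', y_label='Reconstructed $T$', colorbar_label='$T$ [K]', color_map='Blues', figure_size=(7,7), save_filename=None)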


Bibliography#

CHan20

Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.

CHSSC07

Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

PCA on sampled data sets#

In this tutorial, we present how PCA can be performed on sampled data sets using various helpful functions from the preprocess and reduction modules. Those functions essentially allow one to compare PCA performed on the original full data set, \(\mathbf{X}\), with PCA performed on a sampled data set, \(\mathbf{X_r}\). We first present the major functionalities for performing and analyzing PCA on a sampled data set using a special case of sampling: taking an equal number of samples from each cluster. Next, we show a more general way to perform PCA on data sets that are sampled in any way of choice. A general overview for performing PCA on a sampled data set is presented below:

_images/pca-on-sampled-data-set.svg

The main goal is to inform the PCA transformation with some of the characteristics of the sampled data set, \(\mathbf{X_r}\). There are several ways in which that information can be incorporated; they can be controlled by selecting a biasing option and setting the biasing_option input parameter whenever needed. The user is referred to the documentation for more information on the available options (under User guide \(\rightarrow\) Data reduction \(\rightarrow\) Biasing options). It is understood that PCA performed on a sampled data set is biased in some way, since that data set contains different proportions of features, in terms of sample density, compared to their original contribution within the full original data set, \(\mathbf{X}\). Those features can be identified using any clustering technique of choice.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import DataSampler
from PCAfold import PCA
import numpy as np
from matplotlib.colors import ListedColormap

and we set some initial parameters:

scaling = 'auto'
biasing_option = 2
n_clusters = 4
n_components = 2
random_seed = 100
legend_label = ['$\mathbf{X}$', '$\mathbf{X_r}$']
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
save_filename = None

Load and cluster the data set#

As an example, we will use a data set representing combustion of syngas in air generated from the steady laminar flamelet model using chemical mechanism by Hawkes et al. [CHSSC07]. This data set has 11 variables and 50,000 observations. The data set was generated using Spitfire software [CHan20]. To load the data set from the tutorials directory:

X = np.genfromtxt('data-state-space.csv', delimiter=',')
X_names = ['$T$', '$H_2$', '$O_2$', '$O$', '$OH$', '$H_2O$', '$H$', '$HO_2$', '$CO$', '$CO_2$', '$HCO$']
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
Z = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')

We start by clustering the data set, which will result in an idx vector of cluster classifications. Clustering can be performed with any technique of choice. Here we will use one of the available functions from the preprocess module, preprocess.zero_neighborhood_bins, with the first principal component source term as the conditioning variable.

Perform global PCA on the data set and transform source terms of the original variables:

pca_X = PCA(X, scaling=scaling, n_components=n_components)
S_Z = pca_X.transform(S_X, nocenter=True)

Cluster the data set:

(idx, borders) = preprocess.zero_neighborhood_bins(S_Z[:,0], k=4, zero_offset_percentage=2, split_at_zero=True, verbose=True)

Visualize the result of clustering:

plt = preprocess.plot_2d_clustering(Z, X[:,0], idx, x_label='Mixture fraction [-]', y_label='$T$ [K]', color_map=color_map, first_cluster_index_zero=False, grid_on=True, figure_size=(8, 3), save_filename=save_filename)
_images/tutorial-sampled-pca-clustering.svg

Special case of PCA on sampled data sets#

In this section, we present the special case for performing PCA on data sets formed by taking equal number of samples from local clusters.

The reduction.EquilibratedSamplePCA class enables a special case of performing PCA on a sampled data set. It uses an equal number of samples from each cluster and allows one to analyze what happens when the data set is sampled gradually. It begins by performing PCA on the original data set and then, over n_iterations, it gradually decreases the population of every cluster that is larger than the smallest cluster, heading towards the population of the smallest cluster. At each iteration, we obtain a new sampled data set on which PCA is performed. At the last iteration, the cluster populations are all equal and, finally, PCA is performed on this equilibrated data set.

A schematic representation of this procedure is presented in the figure below:

_images/cluster-equilibration-scheme.svg
Run cluster equilibration#
equilibrated_pca = reduction.EquilibratedSamplePCA(X,
                                                   idx,
                                                   scaling=scaling,
                                                   X_source=S_X,
                                                   n_components=n_components,
                                                   biasing_option=biasing_option,
                                                   n_iterations=10,
                                                   stop_iter=0,
                                                   random_seed=random_seed,
                                                   verbose=True)

With verbose=True we will see some detailed information on the number of samples in each cluster at each iteration:

Biasing is performed with option 2.

At iteration 1 taking samples:
{0: 4144, 1: 14719, 2: 24689, 3: 2416}

At iteration 2 taking samples:
{0: 3953, 1: 13352, 2: 22215, 3: 2416}

At iteration 3 taking samples:
{0: 3762, 1: 11985, 2: 19741, 3: 2416}

At iteration 4 taking samples:
{0: 3571, 1: 10618, 2: 17267, 3: 2416}

At iteration 5 taking samples:
{0: 3380, 1: 9251, 2: 14793, 3: 2416}

At iteration 6 taking samples:
{0: 3189, 1: 7884, 2: 12319, 3: 2416}

At iteration 7 taking samples:
{0: 2998, 1: 6517, 2: 9845, 3: 2416}

At iteration 8 taking samples:
{0: 2807, 1: 5150, 2: 7371, 3: 2416}

At iteration 9 taking samples:
{0: 2616, 1: 3783, 2: 4897, 3: 2416}

At iteration 10 taking samples:
{0: 2416, 1: 2416, 2: 2416, 3: 2416}

The results collected over all iterations can then be accessed through the attributes of the equilibrated_pca object:

eigenvalues = equilibrated_pca.eigenvalues
eigenvectors = equilibrated_pca.eigenvectors
PCs = equilibrated_pca.pc_scores
PC_sources = equilibrated_pca.pc_sources
idx_train = equilibrated_pca.idx_train
Analyze centers change#

The reduction.analyze_centers_change function compares centers computed on the original data set, \(\mathbf{X}\), with centers computed on the sampled data set, \(\mathbf{X_r}\). The idx_train input parameter could, for instance, be obtained from reduction.EquilibratedSamplePCA and will thus represent the equilibrated data set sampled from the original data set. It could also be obtained as sampled indices using any of the sampling functions from the preprocess.DataSampler class.

This function will produce a plot that shows the normalized centers and a percentage by which the new centers have moved with respect to the original ones. Example of a plot:

(centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, legend_label=legend_label, save_filename=save_filename)
_images/centers-change.svg

If you do not wish to plot all variables present in a data set, use the plot_variables list as an input parameter to select indices of variables to plot:

(centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, plot_variables=[1,3,4,6,8], legend_label=legend_label, save_filename=save_filename)
_images/centers-change-selected-variables.svg
Analyze eigenvector weights change#

The eigenvectors 3D array obtained from reduction.EquilibratedSamplePCA can now be used as an input parameter for plotting the eigenvector weights change as we were gradually equilibrating cluster populations.

We are going to plot the first eigenvector (corresponding to PC-1) weights change with three variants of normalization. To access the first eigenvector one can simply do:

eigenvectors[:,0,:]

similarly, to access the second eigenvector:

eigenvectors[:,1,:]

and so on.

Three weight normalization variants are available:

  • No normalization, the absolute values of the eigenvector weights are plotted. To use this variant set normalize=False. Example can be seen below:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=False, zero_norm=False, save_filename=save_filename)
_images/eigenvector-weights-movement-non-normalized.svg
  • Normalizing so that the highest weight is equal to 1 and the smallest weight is between 0 and 1. This is useful for judging the severity of the weight change. To use this variant set normalize=True and zero_norm=False. Example can be seen below:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=False, save_filename=save_filename)
_images/eigenvector-weights-movement-normalized.svg
  • Normalizing so that weights are between 0 and 1. This is useful for judging the change trends since it will blow up even the smallest changes to the entire range 0-1. To use this variant set normalize=True and zero_norm=True. Example can be seen below:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=True, save_filename=save_filename)
_images/eigenvector-weights-movement-normalized-to-zero.svg

Note that in the above example, the color bar marks the iteration number, so the \(0^{th}\) iteration represents eigenvectors from the original data set, \(\mathbf{X}\). The last iteration, in this example the \(10^{th}\) iteration, represents eigenvectors computed on the equilibrated, sampled data set.

If you do not wish to plot all variables present in a data set, use the plot_variables list as an input parameter to select indices of variables to plot:
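For example, a call analogous to the ones above, restricted to a few selected variables (the indices below are illustrative), could look like this:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[1,3,4,6,8], normalize=False, zero_norm=False, save_filename=save_filename)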

_images/eigenvector-weights-movement-selected-variables.svg

If you are only interested in plotting a comparison in the eigenvector weights change between the original data set, \(\mathbf{X}\), and one target sampled data set, \(\mathbf{X_r}\), (for instance the equilibrated data set) you can set the eigenvectors input parameter to only contain these two sets of weights. The function will then understand that only these two should be compared:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename)
_images/eigenvector-weights-movement-X-Xr.svg

Such plot can be done for the pre-selected variables as well using the plot_variables list:

plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, plot_variables=[1,3,4,6,8], normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename)
_images/eigenvector-weights-movement-X-Xr-selected-variables.svg
Analyze eigenvalue distribution#

The reduction.analyze_eigenvalue_distribution function will produce a plot that shows the normalized eigenvalues distribution for the original data set, \(\mathbf{X}\), and for the sampled data set, \(\mathbf{X_r}\). Example of a plot:

plt = reduction.analyze_eigenvalue_distribution(X, idx_train, scaling, biasing_option, legend_label=legend_label, save_filename=save_filename)
_images/eigenvalue-distribution.svg
Visualize the re-sampled manifold#

Using the function reduction.plot_2d_manifold you can visualize any two-dimensional manifold and additionally color it with a variable of choice. Here we are going to plot the re-sampled manifold resulting from performing PCA on the sampled data set. Example of a plot:

plt = reduction.plot_2d_manifold(PCs[:,0,-1], PCs[:,1,-1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename)
_images/re-sampled-manifold.svg

Generalization of PCA on sampled data sets#

A more general approach to performing PCA on sampled data sets (instead of using the reduction.EquilibratedSamplePCA class) is to use the reduction.SamplePCA class. This class allows you to perform PCA on a data set that has been sampled in any way (in contrast to equilibrated sampling, which always draws an equal number of samples from each cluster).

Note

It is worth noting that the reduction.EquilibratedSamplePCA class uses reduction.SamplePCA internally.

We first inspect how many samples each cluster has (in the clusters we identified earlier by binning the first principal component source term):

print(preprocess.get_populations(idx))

which shows us populations of each cluster to be:

[4335, 16086, 27163, 2416]

We begin by performing manual sampling. Suppose that we would like to severely under-represent the two largest clusters and over-represent the features of the two smallest clusters. Let’s select 4000 samples from \(k_0\), 1000 samples from \(k_1\), 1000 samples from \(k_2\) and 2400 samples from \(k_3\). In this example we are not interested in generating test samples, so we can suppress returning those.

sample = DataSampler(idx, idx_test=None, random_seed=random_seed, verbose=True)

(idx_manual, _) = sample.manual({0:4000, 1:1000, 2:1000, 3:2400}, sampling_type='number', test_selection_option=1)

The verbose information will tell us how sample densities compare in terms of percentage of samples in each cluster:

Cluster 0: taking 4000 train samples out of 4335 observations (92.3%).
Cluster 1: taking 1000 train samples out of 16086 observations (6.2%).
Cluster 2: taking 1000 train samples out of 27163 observations (3.7%).
Cluster 3: taking 2400 train samples out of 2416 observations (99.3%).

Cluster 0: taking 335 test samples out of 335 remaining observations (100.0%).
Cluster 1: taking 15086 test samples out of 15086 remaining observations (100.0%).
Cluster 2: taking 26163 test samples out of 26163 remaining observations (100.0%).
Cluster 3: taking 16 test samples out of 16 remaining observations (100.0%).

Selected 8400 train samples (16.8%) and 41600 test samples (83.2%).

We now perform PCA on a data set that has been sampled according to idx_manual using the reduction.SamplePCA class:

sample_pca = reduction.SamplePCA(X,
                                 idx_manual,
                                 scaling,
                                 n_components,
                                 biasing_option)
eigenvalues_manual = sample_pca.eigenvalues
eigenvectors_manual = sample_pca.eigenvectors
PCs_manual = sample_pca.pc_scores

Finally, we can generate all the same plots that were shown before. Here, we only present the new re-sampled manifold resulting from the current manual sampling:

plt = reduction.plot_2d_manifold(PCs_manual[:,0], PCs_manual[:,1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename)
_images/generalize-sampling-re-sampled-manifold.svg


Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Handling source terms#

This tutorial can be of interest to researchers working with reactive flow data sets. We present how source terms of the original state variables can be handled using the PCAfold software. Specifically, PCAfold accommodates treatment of the sources of principal components (PCs), which can be valuable for implementing PC-transport approaches such as the one proposed in [TSP09].

Theory#

The methodology for the standard PC-transport approach was first proposed in [TSP09]. As an illustrative example, the PC-transport equations appropriate for a 0D chemical reactor are presented below. The reader is referred to [TBS15], [TEM15] for treatment of the full PC-transport equations including diffusion.

We assume that the data set containing original state-space variables is:

\[\mathbf{X} = [T, Y_1, Y_2, \dots, Y_{N_s-1}]\]

where \(T\) is temperature and \(Y_i\) is a mass fraction of species \(i\). \(N_s\) is the total number of chemical species. \(\mathbf{X}\) is also referred to as the state vector, see [THS18] for various definitions of the state vector. The corresponding source terms of the original state-space variables are:

\[\mathbf{S_X} = \Big[-\frac{1}{\rho c_p} \sum_{i=1}^{N_s} ( \omega_i h_i ), \frac{\omega_1}{\rho}, \frac{\omega_2}{\rho}, \dots, \frac{\omega_{N_s-1}}{\rho} \Big]\]

where \(\rho\) is the density of the mixture, \(c_p\) is its specific heat capacity, \(\omega_i\) is the net mass production rate of species \(i\), and \(h_i\) is the enthalpy of species \(i\).

For a 0D-system, we can write the evolution equation as:

\[\frac{d \mathbf{X}}{dt} = \mathbf{S_X}\]

This equation can be instead written in the space of principal components by applying a linear operator, \(\mathbf{A}\), identified by PCA. We can also account for centering and scaling the original data set, \(\mathbf{X}\), using centers \(\mathbf{C}\) and scales \(\mathbf{D}\):

\[\frac{d \Big( \frac{\mathbf{X} - \mathbf{C}}{\mathbf{D}} \Big) \mathbf{A}}{dt} = \frac{\mathbf{S_X}}{\mathbf{D}}\mathbf{A}\]

It is worth noting that when the original data set is centered and scaled, the corresponding source terms should only be scaled and not centered, since:

\[\frac{d \frac{\mathbf{C}}{\mathbf{D}} \mathbf{A}}{dt} = 0\]

for constant \(\mathbf{C}\), \(\mathbf{D}\) and \(\mathbf{A}\).

We finally obtain the 0D PC-transport equation where the evolved variables are principal components instead of the original state-space variables:

\[\frac{d \mathbf{Z}}{dt} = \mathbf{S_{Z}}\]

where \(\mathbf{Z} = \Big( \frac{\mathbf{X} - \mathbf{C}}{\mathbf{D}} \Big) \mathbf{A}\) and \(\mathbf{S_{Z}} = \frac{\mathbf{S_X}}{\mathbf{D}}\mathbf{A}\).

Code implementation#

We import the necessary modules:

from PCAfold import PCA
import numpy as np

A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and the chemical mechanism by Hawkes et al. [CHSSC07], is used as a demo data set.

We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):

X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')

We perform PCA on the original data:

pca_X = PCA(X, scaling='auto', n_components=2)

We transform the original data set to the newly identified basis and compute the principal components (PCs), \(\mathbf{Z}\):

Z = pca_X.transform(X, nocenter=False)

We transform the source terms to the newly identified basis and compute the sources of principal components, \(\mathbf{S_Z}\):

S_Z = pca_X.transform(S_X, nocenter=True)

Note that we set the flag nocenter=True, which is the setting that should be applied when transforming source terms. With this setting, only the scales \(\mathbf{D}\) are applied when transforming \(\mathbf{S_X}\) to the new basis defined by \(\mathbf{A}\), making the transformation consistent with the discussion in the previous section.
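
For intuition, this transformation can also be reproduced with plain NumPy. The sketch below is illustrative only and is not how PCAfold implements the transform; the names C, D and A are introduced here for the centers, scales and basis, and the recovered basis may differ in sign (and in the standard-deviation convention) from the one identified by PCAfold:

# Illustrative NumPy-only sketch of the scaled transform of the source terms.
# C, D and A are hypothetical names for the centers, scales and PCA basis.
C = np.mean(X, axis=0)                               # centers
D = np.std(X, axis=0)                                # scales ('auto'-like scaling)
X_cs = (X - C) / D                                   # centered and scaled data
_, _, Vt = np.linalg.svd(X_cs, full_matrices=False)  # PCA basis (up to sign)
A = Vt.T[:, :2]                                      # retain two modes
Z_manual = X_cs @ A                                  # principal components
S_Z_manual = (S_X / D) @ A                           # PC sources: scaled only, not centered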


Bibliography#

TBS15

Amir Biglari and James C. Sutherland. An a-posteriori evaluation of principal component analysis-based models for turbulent combustion simulations. Combustion and Flame, 162(10):4025–4035, 2015.

TEM15

Tarek Echekki and Hessam Mirgolbabaei. Principal component transport in turbulent combustion: a posteriori analysis. Combustion and Flame, 162(5):1919–1933, 2015.

THS18

Michael A. Hansen and James C. Sutherland. On the consistency of state vectors and Jacobian matrices. Combustion and Flame, 193:257–271, 2018.

CHan20

Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.

CHSSC07

Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.

TSP09(1,2)

James C. Sutherland and Alessandro Parente. Combustion modeling using principal component analysis. Proceedings of the Combustion Institute, 32(1):1563–1570, 2009.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Manifold Assessment#

In this tutorial, we demonstrate tools that may be used for assessing manifold quality and dimensionality as well as comparing manifolds (parameterizations) in terms of representing dependent variables of interest.

import numpy as np
import matplotlib.pyplot as plt
from PCAfold import compute_normalized_variance, PCA, normalized_variance_derivative,\
find_local_maxima, plot_normalized_variance, plot_normalized_variance_comparison,\
plot_normalized_variance_derivative, plot_normalized_variance_derivative_comparison, random_sampling_normalized_variance

Here we are creating a two-dimensional manifold to assess with a dependent variable. Independent variables \(x\) and \(y\) and dependent variable \(f\) will be defined as

\[x = e^{g} \cos^2(g)\]
\[y = \cos^2(g)\]
\[f = g^3+g\]

for a grid \(g\) between [-0.5,1].

npts = 1001
grid = np.linspace(-0.5,1.,npts)

x = np.exp(grid)*np.cos(grid)**2
y = np.cos(grid)**2

f = grid**3+grid
depvar_name = 'f' # dependent variable name

plt.scatter(x, y, c=f, s=5, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.title('colored by f')
plt.show()
_images/output_3_0.png

We now want to assess the manifold in one and two dimensions using compute_normalized_variance. In order to use this function, the independent and dependent variables must be arranged into two-dimensional arrays of size npts by the number of variables. This is done in the following code.

indepvars = np.vstack((x, y)).T
depvars = np.expand_dims(f, axis=1)
print('indepvars shape:', indepvars.shape, '\n  depvars shape:', depvars.shape)
indepvars shape: (1001, 2)
  depvars shape: (1001, 1)

We can now call compute_normalized_variance on both the two-dimensional manifold and one-dimensional slices of it in order to assess the true dimensionality of the manifold (which should be two in this case). A normalized variance is computed at various bandwidths (Gaussian kernel filter widths), which can indicate overlapping states in the manifold (non-uniqueness) as well as how spread out the dependent variables are. A unique manifold with large spread in the data should better facilitate building models for accurate representations of the dependent variables of interest. Details on the normalized variance equations may be found in the documentation.

The bandwidths are applied to the independent variables after they are centered and scaled inside a unit box (by default). The bandwidth values may be computed by default according to interpoint distances or may be specified directly by the user.

Below is a demonstration of using default bandwidth values and plotting the resulting normalized variance.

orig2D_default  = compute_normalized_variance(indepvars, depvars, [depvar_name])

plt = plot_normalized_variance(orig2D_default)
plt.show()
_images/output_7_0.png

Now we will define an array for the bandwidths in order for the same values to be applied to our manifolds of interest.

bandwidth = np.logspace(-6,1,100) # array of bandwidth values

# one-dimensional manifold represented by x
orig1Dx = compute_normalized_variance(indepvars[:,:1], depvars, [depvar_name], bandwidth_values=bandwidth)
# one-dimensional manifold represented by y
orig1Dy = compute_normalized_variance(indepvars[:,1:], depvars, [depvar_name], bandwidth_values=bandwidth)
# original two-dimensional manifold
orig2D  = compute_normalized_variance(indepvars,       depvars, [depvar_name], bandwidth_values=bandwidth)

The following plot shows the normalized variance calculated for the dependent variable on each of the three manifolds. A single smooth rise in the normalized variance over bandwidth values indicates a unique manifold. Multiple rises, as can be seen in the one-dimensional manifolds, indicate multiple scales of variation. In this example, those smaller scales can be attributed to non-uniqueness introduced through the projection into one dimension. A curve that rises at larger bandwidth values also indicates more spread in the dependent variable over the manifold. Therefore the desired curve for an optimal manifold is one that has a single smooth rise that occurs at larger bandwidth values.

plt = plot_normalized_variance_comparison((orig1Dx, orig1Dy, orig2D), ([], [], []), ('Blues', 'Reds', 'Greens'), title='Normalized variance for '+depvar_name)
plt.legend(['orig,1D_x', 'orig,1D_y', 'orig,2D'])
plt.show()
_images/output_11_0.png

In order to better highlight the fastest changes in the normalized variance, we look at a scaled derivative over the logarithmically scaled bandwidths, which conveys how quickly the variance changes as the bandwidth changes. Specifically, we compute \(\hat{\mathcal{D}}(\sigma)\), whose equation can be found in the documentation. Below we show this quantity for the original two-dimensional manifold.

We see a single peak in \(\hat{\mathcal{D}}(\sigma)\) corresponding to the single rise in \(\mathcal{N}(\sigma)\) pointed out above. The location of this peak gives an idea of the feature sizes or length scales associated with variation in the dependent variable over the manifold.

plt = plot_normalized_variance_derivative(orig2D)
plt.show()
_images/output_13_0.png

We can also plot a comparison of these peaks using plot_normalized_variance_derivative_comparison for the three manifold representations discussed thus far. In the plot below, we can see that the two one-dimensional projections have two peaks in \(\hat{\mathcal{D}}(\sigma)\) corresponding to the two humps in the normalized variance. This clearly shows that the projections are introducing a significant scale of variation not present on the original two-dimensional manifold. The locations of these peaks indicate the feature sizes or scales of variation present in the dependent variable on the manifolds.

plt = plot_normalized_variance_derivative_comparison((orig1Dx, orig1Dy, orig2D), ([],[],[]), ('Blues', 'Reds','Greens'))
plt.legend(['orig,1D_x', 'orig,1D_y', 'orig,2D'])
plt.show()
_images/output_15_0.png

We can also break down the analysis of these peaks to determine the \(\sigma\) where they occur. The normalized_variance_derivative function will return a dictionary of \(\hat{\mathcal{D}}(\sigma)\) for each dependent variable along with the corresponding \(\sigma\) values. The find_local_maxima function can then be used to report the locations of the peaks in \(\hat{\mathcal{D}}(\sigma)\) along with the peak values themselves. In order to properly analyze these peaks, we leave the logscaling parameter at its default value of True. We can also set show_plot to True to display the peaks found. This is demonstrated for the one-dimensional projection onto x below.

orig1Dx_derivative, orig1Dx_sigma, _ = normalized_variance_derivative(orig1Dx)
orig1Dx_peak_locs, orig1Dx_peak_values = find_local_maxima(orig1Dx_derivative[depvar_name], orig1Dx_sigma, show_plot=True)
print('peak locations:', orig1Dx_peak_locs)
print('peak values:', orig1Dx_peak_values)
_images/output_17_0.png
peak locations: [0.00086033 0.5070298 ]
peak values: [1.01351778 0.60217727]

In this example, we know in the case of the one-dimensional projections that non-uniqueness or overlap is introduced in the dependent variable representation. This shows up as an additional peak in \(\hat{\mathcal{D}}(\sigma)\) compared to the original two-dimensional manifold. In general, though, we may not know whether that additional scale of variation is due to non-uniqueness or is a new characteristic feature from sharpening gradients. We can analyze sensitivity to data sampling in order to distinguish between the two.

As an example, we will analyze the projection onto x. We can use the random_sampling_normalized_variance function to compute the normalized variance for various random samplings based on the provided sampling_percentages argument. We can also specify multiple realizations through the n_sample_iterations argument, which will be averaged for returning \(\hat{\mathcal{D}}(\sigma)\). We will test 100%, 50%, and 25%, specified as [1., 0.5, 0.25]. Note that specifying 100% returns the same result as calling compute_normalized_variance on the full dataset as we did above.

pctdict, pctsig, _ = random_sampling_normalized_variance([1., 0.5, 0.25],
                                                             indepvars[:,:1],
                                                             depvars,
                                                             [depvar_name],
                                                             bandwidth_values=bandwidth,
                                                             n_sample_iterations=5)
sampling 100.0 % of the data
  iteration 1 of 5
  iteration 2 of 5
  iteration 3 of 5
  iteration 4 of 5
  iteration 5 of 5
sampling 50.0 % of the data
  iteration 1 of 5
  iteration 2 of 5
  iteration 3 of 5
  iteration 4 of 5
  iteration 5 of 5
sampling 25.0 % of the data
  iteration 1 of 5
  iteration 2 of 5
  iteration 3 of 5
  iteration 4 of 5
  iteration 5 of 5

We then plot the result below and report the peak locations for the two dominant peaks. We can see that the peak at the larger \(\sigma\) isn't very sensitive to data sampling; it remains around 0.5. The peak at smaller \(\sigma\), however, shifts to larger \(\sigma\) as less data is included (lower percent sampling). This is because variation from non-uniqueness is much more sensitive to data spacing than characteristic feature variation. We would therefore conclude that the second scale of variation introduced by the projection onto x is due to non-uniqueness, not a characteristic feature size, and therefore the projection is unacceptable. This confirms what we already knew from the visual analysis.

peakthreshold = 0.4

for pct in pctdict.keys():
    plt.semilogx(pctsig, pctdict[pct][depvar_name], '--', linewidth=2, label=pct)
    peak_locs, peak_vals = find_local_maxima(pctdict[pct][depvar_name], pctsig, threshold=peakthreshold)
    print(f'{pct*100:3.0f}% sampling peak locations: {peak_locs[0]:.2e}, {peak_locs[1]:.2e}')

plt.grid()
plt.xlabel('$\sigma$')
plt.ylabel('$\hat{\mathcal{D}}$')
plt.legend()
plt.xlim([np.min(pctsig), np.max(pctsig)])
plt.ylim([0,1.02])
plt.title('Detecting non-uniqueness through sensitivity to sampling')
plt.show()
100% sampling peak locations: 8.60e-04, 5.07e-01
 50% sampling peak locations: 1.15e-03, 5.06e-01
 25% sampling peak locations: 3.68e-03, 4.98e-01
_images/output_21_1.png

As an example of comparing multiple representations of a manifold in the same dimensional space, we will use PCA. Below, two PCA objects are created with different scalings. The first uses the default scaling std while the second uses the scaling pareto. The plots of the resulting manifolds are shown below for comparison to the original. The dimensions for the PCA manifolds are referred to as PC1 and PC2.

# PCA using std scaling
pca_std = PCA(indepvars)
eta_std = pca_std.transform(indepvars)

plt.scatter(eta_std[:,0], eta_std[:,1], c=f, s=2, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('std scaling')
plt.show()

# PCA using pareto scaling
pca_pareto = PCA(indepvars,'pareto')
eta_pareto = pca_pareto.transform(indepvars)

plt.scatter(eta_pareto[:,0], eta_pareto[:,1], c=f, s=2, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('pareto scaling')
plt.show()
_images/output_23_0.png _images/output_23_1.png

We call compute_normalized_variance in order to assess these manifolds in one- and two-dimensional space. Since PCA orders the PCs according to the amount of variance explained, we will use PC1 for representing a one-dimensional manifold.

pca1D_std = compute_normalized_variance(eta_std[:,:1], depvars, [depvar_name],bandwidth_values=bandwidth)
pca2D_std = compute_normalized_variance(eta_std,       depvars, [depvar_name],bandwidth_values=bandwidth)

pca1D_pareto = compute_normalized_variance(eta_pareto[:,:1], depvars, [depvar_name],bandwidth_values=bandwidth)
pca2D_pareto = compute_normalized_variance(eta_pareto,       depvars, [depvar_name],bandwidth_values=bandwidth)

We then go straight to plotting \(\hat{\mathcal{D}}\) to see if new peaks are introduced compared to the original two-dimensional manifold, indicating new scales of variation. We again find that the one-dimensional projections introduce a new scale. We could perform a similar analysis as shown above on the projection onto x to conclude that these new scales are also from non-uniqueness introduced in the projection. We therefore continue the analysis considering only two-dimensional parameterizations to determine which one may be best at representing f.

plt = plot_normalized_variance_derivative_comparison((pca1D_std, pca2D_std, pca1D_pareto, pca2D_pareto, orig2D),
                                                     ([],[],[],[],[]),
                                                     ('Blues', 'Reds', 'Purples', 'Oranges', 'Greens'))
plt.legend(['pca1D_std', 'pca2D_std', 'pca1D_pareto', 'pca2D_pareto', 'orig,2D'])
plt.show()
_images/output_27_0.png

We compute the locations of the peaks in \(\hat{\mathcal{D}}\) over \(\sigma\) below.

pca2D_std_derivative, pca2D_std_sigma, _  = normalized_variance_derivative(pca2D_std)
pca2D_pareto_derivative, pca2D_pareto_sigma, _ = normalized_variance_derivative(pca2D_pareto)
orig2D_derivative, orig2D_sigma, _ = normalized_variance_derivative(orig2D)

pca2D_std_peak_locs, _ = find_local_maxima(pca2D_std_derivative[depvar_name], pca2D_std_sigma)
pca2D_pareto_peak_locs, _ = find_local_maxima(pca2D_pareto_derivative[depvar_name], pca2D_pareto_sigma)
orig2D_peak_locs, _ = find_local_maxima(orig2D_derivative[depvar_name], orig2D_sigma)

print('peak locations:')
print('orig2D',orig2D_peak_locs)
print('pca2D_std',pca2D_std_peak_locs)
print('pca2D_pareto',pca2D_pareto_peak_locs)
peak locations:
orig2D [0.66762295]
pca2D_std [0.78185085]
pca2D_pareto [0.67063695]

The results show that PCA with std scaling results in the largest feature size (largest \(\sigma\)) and is therefore the best for parameterizing f. This representation should better facilitate modeling of f as the features are more spread out.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Local feature size estimation#

In this tutorial, we present the local feature size estimation tool from the analysis module.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
import numpy as np
import pandas as pd
import time
import matplotlib
import matplotlib.pyplot as plt

and we set some initial parameters:

save_filename = None
bandwidth_values = np.logspace(-5, 1, 40)

We upload the dataset, which comes from solving the Brusselator PDE. The dataset has two independent variables, \(x\) and \(y\), and one dependent variable, \(\phi\). The dataset is generated on a uniform \(x\)-\(y\) grid.

data = pd.read_csv('brusselator-PDE.csv', header=None).to_numpy()
indepvars = data[:,0:2]
depvar = data[:,2:3]
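
For reference, the dataset can be visualized with a simple scatter plot colored by \(\phi\) (an illustrative snippet):

plt.scatter(indepvars[:,0], indepvars[:,1], c=depvar.ravel(), s=2)
plt.colorbar()
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Brusselator dataset colored by $\phi$')
plt.show()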
_images/demo-feature-size-map-Brusselator-LDM.png

Compute the feature sizes map on a synthetic dataset#

We start by computing the normalized variance, \(\mathcal{N}(\sigma)\). In order to compute the quantities necessary for drawing the feature size map, we need to set either compute_sample_norm_var=True or compute_sample_norm_range=True.

tic = time.perf_counter()

variance_data = analysis.compute_normalized_variance(indepvars,
                                                     depvars=depvar,
                                                     depvar_names=['phi'],
                                                     bandwidth_values=bandwidth_values,
                                                     compute_sample_norm_range=True)

toc = time.perf_counter()
print(f'\tTime it took: {(toc - tic)/60:0.1f} minutes.\n' + '-'*40)

We compute the normalized variance derivative, \(\hat{\mathcal{D}}(\sigma)\):

derivative, sigmas, _ = analysis.normalized_variance_derivative(variance_data)
derivatives = derivative['phi']

The local feature size estimation algorithm iteratively updates the size of the local features by running a “bandwidth descent” algorithm. The goal is to compute the bandwidth vector \(\mathbf{B}\), which contains an estimate of the local feature size tied to every data point. The vector \(\mathbf{B}\) is first initialized with the largest feature size, indicated by the starting_bandwidth_idx parameter. Entries in \(\mathbf{B}\) are then iteratively updated based on the cutoff value.

starting_bandwidth_idx = 29
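
The normalized variance derivative, \(\hat{\mathcal{D}}(\sigma)\), can be plotted, for instance, with the plot_normalized_variance_derivative function used in the earlier manifold assessment tutorial (an illustrative call):

from PCAfold import plot_normalized_variance_derivative

plt = plot_normalized_variance_derivative(variance_data)
plt.show()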
_images/demo-feature-size-map-D-hat.png

We run the bandwidth descent algorithm. This will update the bandwidth vector at each location where the sample normalized variance is above the specified cutoff of its maximum value.

cutoff = 15
B = analysis.feature_size_map(variance_data,
                              variable_name='phi',
                              cutoff=cutoff,
                              starting_bandwidth_idx='peak',
                              use_variance=False,
                              verbose=True)
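
The resulting feature size map can be inspected, for instance, by coloring the dataset with the entries of \(\mathbf{B}\) (an illustrative snippet):

# Scatter the data colored by the local feature size estimate (log scale).
plt.scatter(indepvars[:,0], indepvars[:,1], c=np.log10(B), s=2)
plt.colorbar(label='$\log_{10}(B)$')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()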
_images/demo-feature-size-map-Brusselator-LDM-with-local-features.png

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Cost function for manifold topology assessment and optimization#

In this tutorial, we present the cost function from the analysis module which distills information from the normalized variance derivative into a single number. The cost function can be used for low-dimensional manifold topology assessment and manifold optimization.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import utilities
from PCAfold import manifold_informed_backward_variable_elimination as BVE
import numpy as np
import time

and we set some initial parameters:

save_filename = None
random_seed = 100

Upload a combustion data set#

A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using Spitfire and the chemical mechanism by Hawkes et al., is used as a demo data set.

We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):

X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
X_names = ['T', 'H2', 'O2', 'O', 'OH', 'H2O', 'H', 'HO2', 'CO', 'CO2', 'HCO']

(n_observations, n_variables) = np.shape(X)

Generate low-dimensional manifolds using PCA#

Below, we generate two- and three-dimensional projections of the original data set from PCA for further assessment.

pca_X_2D = reduction.PCA(X, scaling='auto', n_components=2)
Z_2D = pca_X_2D.transform(X)
S_Z_2D = pca_X_2D.transform(S_X, nocenter=True)
pca_X_3D = reduction.PCA(X, scaling='auto', n_components=3)
Z_3D = pca_X_3D.transform(X)
S_Z_3D = pca_X_3D.transform(S_X, nocenter=True)

We visualize the generated manifolds:
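
For example, the two-dimensional projection colored by the first PC source term can be plotted with reduction.plot_2d_manifold (an illustrative call):

plt = reduction.plot_2d_manifold(Z_2D[:,0], Z_2D[:,1], color=S_Z_2D[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$S_{Z, 1}$', save_filename=save_filename)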

_images/tutorial-cost-function-2D-manifold-SZ1.png _images/tutorial-cost-function-2D-manifold-SZ2.png _images/tutorial-cost-function-3D-manifold-SZ1.png _images/tutorial-cost-function-3D-manifold-SZ2.png _images/tutorial-cost-function-3D-manifold-SZ3.png

Manifold assessment using the cost function#

We are going to compute the cost function for the PC source terms as the target dependent variables.

We specify the penalty function to use:

penalty_function = 'log-sigma-over-peak'

and the bandwidth values, \(\sigma\), for normalized variance derivative computation:

bandwidth_values = np.logspace(-7, 3, 50)

We specify the cost function’s hyper-parameters, the power \(r\) and the vertical shift \(b\). Increasing the power parameter imposes a stronger penalty on non-uniqueness, and increasing the vertical shift parameter imposes a stronger penalty on small feature sizes.

power = 1
vertical_shift = 1

We sample the dataset to decrease the computational time of this tutorial:

sample_random = preprocess.DataSampler(np.zeros((n_observations,)).astype(int), random_seed=random_seed, verbose=False)
(idx_sample, _) = sample_random.random(50)

We create lists of the target dependent variables names:

depvar_names_2D = ['SZ' + str(i) for i in range(1,3)]
depvar_names_3D = ['SZ' + str(i) for i in range(1,4)]

and we begin with computing the normalized variance derivative for the two-dimensional PCA projection:

variance_data_2D = analysis.compute_normalized_variance(Z_2D[idx_sample,:],
                                                        S_Z_2D[idx_sample,:],
                                                        depvar_names=depvar_names_2D,
                                                        bandwidth_values=bandwidth_values)

The associated costs are computed from the generated object of the VarianceData class. With norm=None, the costs are not aggregated over all target variables (in this case the PC source terms); instead, the output gives the individual cost for each target variable.

costs_2D = analysis.cost_function_normalized_variance_derivative(variance_data_2D,
                                                                 penalty_function=penalty_function,
                                                                 power=power,
                                                                 vertical_shift=vertical_shift,
                                                                 norm=None)

We can print the individual costs:

for i, variable in enumerate(depvar_names_2D):
    print(variable + ':\t' + str(round(costs_2D[i],3)))
SZ1:        4.238
SZ2:        1.567

Finally, we repeat the cost function computation for the three-dimensional PCA projection:

variance_data_3D = analysis.compute_normalized_variance(Z_3D[idx_sample,:],
                                                        S_Z_3D[idx_sample,:],
                                                        depvar_names=depvar_names_3D,
                                                        bandwidth_values=bandwidth_values)
costs_3D = analysis.cost_function_normalized_variance_derivative(variance_data_3D,
                                                                 penalty_function=penalty_function,
                                                                 power=power,
                                                                 vertical_shift=vertical_shift,
                                                                 norm=None)

and we print the individual costs:

for i, variable in enumerate(depvar_names_3D):
    print(variable + ':\t' + str(round(costs_3D[i],3)))
SZ1:        1.157
SZ2:        1.23
SZ3:        1.422

The cost function provides information about the quality of the low-dimensional data projection with respect to the target dependent variables, which in this case were the PC source terms. A higher cost indicates a worse manifold topology. The two topological aspects that the cost function takes into account are non-uniqueness and feature sizes.

We observe that individual costs are higher for the two-dimensional than for the three-dimensional PCA projection. This can be understood from our visualization of the manifolds, where we have seen a significant overlap affecting the first PC source term in particular. With the third manifold parameter added in the three-dimensional projection, the projection quality improves and the costs drop.

Moreover, for the two-dimensional PCA projection, the cost associated with the first PC source term is higher than the cost associated with the second PC source term. This can also be understood by comparing the two-dimensional projections colored by \(S_{Z, 1}\) and by \(S_{Z, 2}\). The high magnitudes of \(S_{Z, 1}\) values occur at the location where the manifold exhibits overlap, while the same overlap does not affect the \(S_{Z, 2}\) values to the same extent.

Manifold optimization using the cost function#

The utilities.manifold_informed_backward_variable_elimination function implements an iterative feature selection algorithm that uses the cost function as its objective function. The algorithm selects an optimal subset of the original state variables that results in an optimized PCA manifold topology. Below, we demonstrate the algorithm on a 10% sample of the original data. The data is sampled to speed up the calculations for the purpose of this demonstration. In real applications it is recommended to use the full data set.

Sample the original data:

sample_random = preprocess.DataSampler(np.zeros((n_observations,)).astype(int), random_seed=100, verbose=False)
(idx_sample, _) = sample_random.random(10)

sampled_X = X[idx_sample,:]
sampled_S_X = S_X[idx_sample,:]

Specify the target variables to assess on the manifold (we will also add the PC source terms to the target variables by setting add_transformed_source=True). In this case we take the temperature, \(T\), and several important chemical species mass fractions: \(H_2\), \(O_2\), \(H_2O\), \(CO\) and \(CO_2\):

target_variables = sampled_X[:,[0,1,2,5,8,9]]

Set the norm to take over all target dependent variables:

norm = 'cumulative'

Set the target manifold dimensionality:

q = 2

Run the algorithm:

_, selected_variables, _, _ = BVE(sampled_X,
                                  sampled_S_X,
                                  X_names,
                                  scaling='auto',
                                  bandwidth_values=bandwidth_values,
                                  target_variables=target_variables,
                                  add_transformed_source=True,
                                  target_manifold_dimensionality=q,
                                  penalty_function=penalty_function,
                                  power=power,
                                  vertical_shift=vertical_shift,
                                  norm=norm,
                                  verbose=True)

With verbose=True we will see additional information on costs at each iteration:

Iteration No.4
Currently eliminating variable from the following list:
['T', 'H2', 'O2', 'O', 'OH', 'H2O', 'H', 'CO2']
    Currently eliminated variable: T
    Running PCA for a subset:
    H2, O2, O, OH, H2O, H, CO2
    Cost:   11.4539
    WORSE
    Currently eliminated variable: H2
    Running PCA for a subset:
    T, O2, O, OH, H2O, H, CO2
    Cost:   13.4908
    WORSE
    Currently eliminated variable: O2
    Running PCA for a subset:
    T, H2, O, OH, H2O, H, CO2
    Cost:   14.8488
    WORSE
    Currently eliminated variable: O
    Running PCA for a subset:
    T, H2, O2, OH, H2O, H, CO2
    Cost:   12.6549
    WORSE
    Currently eliminated variable: OH
    Running PCA for a subset:
    T, H2, O2, O, H2O, H, CO2
    Cost:   10.0785
    SAME OR BETTER
    Currently eliminated variable: H2O
    Running PCA for a subset:
    T, H2, O2, O, OH, H, CO2
    Cost:   10.7182
    WORSE
    Currently eliminated variable: H
    Running PCA for a subset:
    T, H2, O2, O, OH, H2O, CO2
    Cost:   11.8644
    WORSE
    Currently eliminated variable: CO2
    Running PCA for a subset:
    T, H2, O2, O, OH, H2O, H
    Cost:   10.9898
    WORSE

    Variable OH is removed.
    Cost:   10.0785

    Iteration time: 0.8 minutes.

Finally, we generate the PCA projection of the optimized subset of the original data set:

pca_X_optimized = reduction.PCA(X[:,selected_variables], scaling='auto', n_components=2)
Z_optimized = pca_X_optimized.transform(X[:,selected_variables])
S_Z_optimized = pca_X_optimized.transform(S_X[:,selected_variables], nocenter=True)
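
The optimized projection can then be visualized as before, for instance colored by the first PC source term (an illustrative call):

plt = reduction.plot_2d_manifold(Z_optimized[:,0], Z_optimized[:,1], color=S_Z_optimized[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$S_{Z, 1}$', save_filename=save_filename)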
_images/tutorial-cost-function-2D-optimized-manifold-SZ1.png _images/tutorial-cost-function-2D-optimized-manifold-SZ2.png

From the plots above, we observe that the optimized two-dimensional PCA projection exhibits much less overlap compared to the two-dimensional PCA projection that we computed earlier using the full data set.

Below, we compute the costs for the two PC source terms again for this optimized projection:

variance_data_optimized = analysis.compute_normalized_variance(Z_optimized,
                                                               S_Z_optimized,
                                                               depvar_names=depvar_names_2D,
                                                               bandwidth_values=bandwidth_values)
costs_optimized = analysis.cost_function_normalized_variance_derivative(variance_data_optimized,
                                                                        penalty_function=penalty_function,
                                                                        power=power,
                                                                        vertical_shift=vertical_shift,
                                                                        norm=None)
for i, variable in enumerate(depvar_names_2D):
    print(variable + ':\t' + str(round(costs_optimized[i],3)))
SZ1:        1.653
SZ2:        1.179

We note that the costs for the two PC source terms are lower than the costs that we computed earlier using the full data set to generate the PCA projection.

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Nonlinear regression#

In this tutorial, we present the nonlinear regression utilities from the analysis module.

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import reconstruction
import numpy as np

and we set some initial parameters:

save_filename = None

Generating a synthetic data set#

We begin by generating a synthetic data set with two independent variables, \(x\) and \(y\), and one dependent variable, \(\phi\), that we will nonlinearly regress using kernel regression.

Generate independent variables \(x\) and \(y\) from a uniform grid:

n_points = 100
grid = np.linspace(0,100,n_points)
x, y = np.meshgrid(grid, grid)
x = x.flatten()
y = y.flatten()
xy = np.hstack((x[:,None],y[:,None]))
(n_observations, _) = np.shape(xy)

Generate a dependent variable \(\phi\) as a quadratic function of \(x\):

phi = xy[:,0:1]**2

Visualize the generated data set:

plt = reduction.plot_2d_manifold(x,
                                 y,
                                 color=phi,
                                 x_label='x',
                                 y_label='y',
                                 colorbar_label='$\phi$',
                                 color_map='inferno',
                                 figure_size=(8,4),
                                 save_filename=save_filename)
_images/tutorial-regression-data-set.svg

Kernel regression#

We first generate train and test samples using the DataSampler class:

train_perc = 80
random_seed = 100

idx = np.zeros((n_observations,)).astype(int)
sample_random = preprocess.DataSampler(idx, random_seed=random_seed, verbose=False)
(idx_train, idx_test) = sample_random.random(train_perc, test_selection_option=1)

xy_train = xy[idx_train,:]
xy_test = xy[idx_test,:]

phi_train = phi[idx_train]
phi_test = phi[idx_test]

Specify the bandwidth for the Nadaraya-Watson kernel:

bandwidth = 10

Fit the kernel regression model with train data:

model = analysis.KReg(xy_train, phi_train)

Predict the test data:

phi_test_predicted = model.predict(xy_test, bandwidth=bandwidth)

Predict all data:

phi_predicted = model.predict(xy, bandwidth=bandwidth)

Nonlinear regression assessment#

In this section we will perform a few assessments of the quality of the nonlinear regression.

Visual assessment#

We begin by visualizing the regressed (predicted) dependent variable \(\phi\). This can be done either in 2D:

plt = reconstruction.plot_2d_regression(x,
                                  phi,
                                  phi_predicted,
                                  x_label='$x$',
                                  y_label='$\phi$',
                                  figure_size=(10,4),
                                  save_filename=save_filename)
_images/tutorial-regression-result-2d.svg

or in 3D:

plt = reconstruction.plot_3d_regression(x,
                                  y,
                                  phi,
                                  phi_predicted,
                                  elev=20,
                                  azim=-100,
                                  x_label='$x$',
                                  y_label='$y$',
                                  z_label='$\phi$',
                                  figure_size=(10,7),
                                  save_filename=save_filename)
_images/tutorial-regression-result.svg
Predicted 2D field for scalar quantities#

When the predicted variable is a scalar quantity, a scatter plot of the regressed scalar field can be produced using the function plot_2d_regression_scalar_field. Regression of the scalar field can be tested on any user-defined grid, including outside the bounds of the training data. This can be particularly important when generating reduced-order models, where the behavior of the regression should be tested outside of the training manifold.

Below, we show an example on a combustion data set.

X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
pca_X = reduction.PCA(X, scaling='vast', n_components=2)
PCs = pca_X.transform(X)
PC_sources = pca_X.transform(S_X, nocenter=True)
(PCs_pp, centers_PCs, scales_PCs) = preprocess.center_scale(PCs, '-1to1')

Fit the kernel regression model with the train data:

KReg_model = analysis.KReg(PCs_pp, PC_sources)

We define the regression model function that will make predictions for any query point:

def regression_model(regression_input):

    regression_input_CS = (regression_input - centers_PCs)/scales_PCs

    regressed_value = KReg_model.predict(regression_input_CS, 'nearest_neighbors_isotropic', n_neighbors=10)[0,1]

    return regressed_value

We first visualize the training manifold, colored by the dependent variable being predicted:

reduction.plot_2d_manifold(PCs[:,0],
                           PCs[:,1],
                           x_label='$Z_1$',
                           y_label='$Z_2$',
                           color=PC_sources[:,1],
                           color_map='viridis',
                           colorbar_label='$S_{Z_2}$',
                           figure_size=(8,6),
                           save_filename=save_filename)
_images/tutorial-regression-scalar-field-training-manifold.png

Define the bounds for the scalar field:

grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])

Plot the regressed scalar field:

plt = reconstruction.plot_2d_regression_scalar_field(grid_bounds,
                                               regression_model,
                                               x=PCs[:,0],
                                               y=PCs[:,1],
                                               resolution=(200,200),
                                               extension=(10,10),
                                               s_field=10,
                                               s_manifold=1,
                                               x_label='$Z_1$ [$-$]',
                                               y_label='$Z_2$ [$-$]',
                                               manifold_color='r',
                                               colorbar_label='$S_{Z, 2}$',
                                               color_map='viridis',
                                               colorbar_range=(np.min(PC_sources[:,1]), np.max(PC_sources[:,1])),
                                               manifold_alpha=1,
                                               grid_on=False,
                                               figure_size=(10,6),
                                               save_filename=save_filename);
_images/tutorial-regression-scalar-field.png
Streamplots for predicted vector quantities#

In a special case, when the predicted variable is a vector, a streamplot of the regressed vector field can be plotted using the function plot_2d_regression_streamplot. Regression of a vector field can be tested on any user-defined grid, including outside the bounds of the training data. This can be particularly important when generating reduced-order models, where the behavior of the regression should be tested outside of the training manifold.

Below, we show an example on a synthetic data set:

X = np.random.rand(100,5)
S_X = np.random.rand(100,5)

pca_X = reduction.PCA(X, n_components=2)
PCs = pca_X.transform(X)
S_Z = pca_X.transform(S_X, nocenter=True)

vector_model = analysis.KReg(PCs, S_Z)

We define the regression model function that will make predictions for any query point:

def regression_model(query):

    predicted = vector_model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)

    return predicted

Define the bounds for the streamplot:

grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])

Plot the regression streamplot:

plt = reconstruction.plot_2d_regression_streamplot(grid_bounds,
                                    regression_model,
                                    x=PCs[:,0],
                                    y=PCs[:,1],
                                    resolution=(15,15),
                                    extension=(20,20),
                                    color='k',
                                    x_label='$Z_1$',
                                    y_label='$Z_2$',
                                    manifold_color=X[:,0],
                                    colorbar_label='$X_1$',
                                    color_map='plasma',
                                    colorbar_range=(0,1),
                                    manifold_alpha=1,
                                    grid_on=False,
                                    figure_size=(10,6),
                                    title='Streamplot',
                                    save_filename=None)
_images/tutorial-regression-streamplot.svg
Error metrics#

Several error metrics are available that will measure how well the dependent variable(s) were predicted. Metrics can be accessed individually and collectively. Below, we will show examples of both. The available metrics are:

  • Mean absolute error

  • Mean squared error

  • Root mean squared error

  • Normalized root mean squared error

  • Turning points

  • Good estimate

  • Good direction estimate

An example of computing mean absolute error is shown below:

MAE = reconstruction.mean_absolute_error(phi, phi_predicted)

We also compute the coefficient of determination, \(R^2\), values for the test data and entire data:

r2_test = reconstruction.coefficient_of_determination(phi_test, phi_test_predicted)
r2_all = reconstruction.coefficient_of_determination(phi, phi_predicted)

print('All R2:\t\t' + str(round(r2_all, 6)) + '\nTest R2:\t' + str(round(r2_test, 6)))

The code above will print:

All R2:       0.997378
Test R2:      0.997366

By instantiating an object of the RegressionAssessment class, one can compute all available metrics at once:

regression_metrics = reconstruction.RegressionAssessment(phi, phi_predicted, variable_names=['$\phi$'], norm='std')

As an example, mean absolute error can be accessed by:

regression_metrics.mean_absolute_error

All computed metrics can be printed using the RegressionAssessment.print_metrics function. A few output formats are available.

Raw text format:

regression_metrics.print_metrics(table_format=['raw'], float_format='.4f')
--------------------
$\phi$
R2: 0.9958
MAE:        98.4007
MSE:        37762.8664
RMSE:       194.3267
NRMSE:      0.0645
GDE:        nan

tex format:

regression_metrics.print_metrics(table_format=['tex'], float_format='.4f')
\begin{table}[h!]
\begin{center}
\begin{tabular}{ll} \toprule
 & \textit{$\phi$} \\ \midrule
$R^2$ & 0.9958 \\
MAE & 98.4007 \\
MSE & 37762.8664 \\
RMSE & 194.3267 \\
NRMSE & 0.0645 \\
GDE & nan \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

pandas.DataFrame format (most recommended for Jupyter notebooks):

regression_metrics.print_metrics(table_format=['pandas'], float_format='.4f')
_images/tutorial-regression-metrics-4f.png

Note that with the float_format parameter you can change the number of digits displayed:

regression_metrics.print_metrics(table_format=['pandas'], float_format='.2f')
_images/tutorial-regression-metrics-2f.png
Stratified error metrics#

In addition to a single value of \(R^2\) for the entire data set, we can also compute stratified \(R^2\) values. This allows us to observe how kernel regression performed in each stratum (bin) of the dependent variable \(\phi\). We will compute the stratified \(R^2\) in 20 bins of \(\phi\):

n_bins = 20
use_global_mean = False
verbose = True

(idx, bins_borders) = preprocess.variable_bins(phi, k=n_bins, verbose=False)

r2_in_bins = reconstruction.stratified_coefficient_of_determination(phi, phi_predicted, idx=idx, use_global_mean=use_global_mean, verbose=verbose)

The code above will print:

Bin 1       | size   2300   | R2    0.868336
Bin 2       | size   900    | R2    0.870357
Bin 3       | size   700    | R2    0.863821
Bin 4       | size   600    | R2    0.880655
Bin 5       | size   500    | R2    0.875764
Bin 6       | size   500    | R2    0.889148
Bin 7       | size   400    | R2    0.797888
Bin 8       | size   400    | R2    0.773907
Bin 9       | size   400    | R2    0.79479
Bin 10      | size   400    | R2    0.862069
Bin 11      | size   300    | R2    0.864022
Bin 12      | size   300    | R2    0.93599
Bin 13      | size   300    | R2    0.972185
Bin 14      | size   300    | R2    0.988894
Bin 15      | size   300    | R2    0.979975
Bin 16      | size   300    | R2    0.766598
Bin 17      | size   300    | R2    -0.46525
Bin 18      | size   200    | R2    -11.158072
Bin 19      | size   300    | R2    -10.94865
Bin 20      | size   300    | R2    -28.00655

We can plot the stratified \(R^2\) values across the bin centers:

plt = reconstruction.plot_stratified_metric(r2_in_bins,
                                      bins_borders,
                                      variable_name='$\phi$',
                                      metric_name='$R^2$',
                                      yscale='linear',
                                      figure_size=(10,2),
                                      save_filename=save_filename)
_images/tutorial-regression-stratified-r2.svg

This last plot lets us see that kernel regression performed very well in the middle range of the dependent variable values but very poorly at both edges of that range. This is consistent with what we saw in the 3D plot that visualized the regression result.

All other regression metrics can also be computed in the data bins, similarly to the example shown for the stratified \(R^2\) values.

We will create five bins:

(idx, bins_borders) = preprocess.variable_bins(phi, k=5, verbose=False)

stratified_regression_metrics = reconstruction.RegressionAssessment(phi, phi_predicted, idx=idx, variable_names=['$\phi$'], norm='std')

All computed stratified metrics can be printed using the RegressionAssessment.print_stratified_metrics function. A few output formats are available.

Raw text format:

stratified_regression_metrics.print_stratified_metrics(table_format=['raw'], float_format='.4f')
-------------------------
k1
N. samples: 4500
R2: 0.9920
MAE:        53.2295
MSE:        2890.8754
RMSE:       53.7669
NRMSE:      0.0892
-------------------------
k2
N. samples: 1800
R2: 0.9906
MAE:        53.8869
MSE:        3032.0995
RMSE:       55.0645
NRMSE:      0.0971
-------------------------
k3
N. samples: 1400
R2: 0.9912
MAE:        50.4640
MSE:        2865.7682
RMSE:       53.5329
NRMSE:      0.0936
-------------------------
k4
N. samples: 1200
R2: 0.9956
MAE:        28.4107
MSE:        1492.1498
RMSE:       38.6284
NRMSE:      0.0665
-------------------------
k5
N. samples: 1100
R2: 0.1271
MAE:        493.3956
MSE:        321235.7188
RMSE:       566.7766
NRMSE:      0.9343

tex format:

stratified_regression_metrics.print_stratified_metrics(table_format=['tex'], float_format='.4f')
\begin{table}[h!]
\begin{center}
\begin{tabular}{llllll} \toprule
 & \textit{k1} & \textit{k2} & \textit{k3} & \textit{k4} & \textit{k5} \\ \midrule
N. samples & 4500.0000 & 1800.0000 & 1400.0000 & 1200.0000 & 1100.0000 \\
$R^2$ & 0.9920 & 0.9906 & 0.9912 & 0.9956 & 0.1271 \\
MAE & 53.2295 & 53.8869 & 50.4640 & 28.4107 & 493.3956 \\
MSE & 2890.8754 & 3032.0995 & 2865.7682 & 1492.1498 & 321235.7188 \\
RMSE & 53.7669 & 55.0645 & 53.5329 & 38.6284 & 566.7766 \\
NRMSE & 0.0892 & 0.0971 & 0.0936 & 0.0665 & 0.9343 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}

pandas.DataFrame format (most recommended for Jupyter notebooks):

stratified_regression_metrics.print_stratified_metrics(table_format=['pandas'], float_format='.4f')
_images/tutorial-regression-metrics-stratified.png
Comparison of two regression solutions#

Two objects of the RegressionAssessment class can be compared when printing the metrics. This produces a color-coded comparison in which worse results are colored red and better results are colored green.

Below, we generate a new regression solution that will be compared with the one obtained above. We will increase the bandwidth to get different regression metrics:

phi_predicted_comparison = model.predict(xy, bandwidth=bandwidth+2)

Comparison can be done for the global metrics, where each variable will be compared separately:

regression_metrics_comparison = reconstruction.RegressionAssessment(phi, phi_predicted_comparison, variable_names=['$\phi$'], norm='std')

regression_metrics.print_metrics(table_format=['pandas'], float_format='.4f', comparison=regression_metrics_comparison)
_images/tutorial-regression-metrics-comparison.png

and for the stratified metrics, where each bin will be compared separately:

stratified_regression_metrics_comparison = reconstruction.RegressionAssessment(phi, phi_predicted_comparison, idx=idx)

stratified_regression_metrics.print_stratified_metrics(table_format=['raw'], float_format='.2f', comparison=stratified_regression_metrics_comparison)
-------------------------
k1
N. samples: 4500
R2: 0.99    BETTER
MAE:        53.23   BETTER
MSE:        2890.88 BETTER
RMSE:       53.77   BETTER
NRMSE:      0.09    BETTER
-------------------------
k2
N. samples: 1800
R2: 0.99    BETTER
MAE:        53.89   BETTER
MSE:        3032.10 BETTER
RMSE:       55.06   BETTER
NRMSE:      0.10    BETTER
-------------------------
k3
N. samples: 1400
R2: 0.99    BETTER
MAE:        50.46   BETTER
MSE:        2865.77 BETTER
RMSE:       53.53   BETTER
NRMSE:      0.09    BETTER
-------------------------
k4
N. samples: 1200
R2: 1.00    BETTER
MAE:        28.41   BETTER
MSE:        1492.15 BETTER
RMSE:       38.63   BETTER
NRMSE:      0.07    BETTER
-------------------------
k5
N. samples: 1100
R2: 0.13    BETTER
MAE:        493.40  BETTER
MSE:        321235.72       BETTER
RMSE:       566.78  BETTER
NRMSE:      0.93    BETTER

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

Partition of Unity Networks (POUnets)#

In this tutorial, we demonstrate how POUnets may be initialized and trained to reconstruct quantities of interest (QoIs).

from PCAfold import PartitionOfUnityNetwork, init_uniform_partitions
import numpy as np
import matplotlib.pyplot as plt

First, we create a two-dimensional manifold with vacant patches. This is shown in the first plot, colored by a dependent variable or QoI. We then initialize partitions over a 5x2 grid. We find that only 8 of the 10 partitions are retained, as those initialized in the vacant spaces are discarded. We then visualize the locations of these partition centers, which exist in the normalized manifold space, along with the normalized data.

ivar1 = np.linspace(1,2,20)
ivar1 = ivar1[np.argwhere((ivar1<1.4)|(ivar1>1.6))[:,0]] # create hole
ivars = np.meshgrid(ivar1, ivar1) # make 2D
ivars = np.vstack([b.ravel() for b in ivars]).T # reshape (nobs x ndim)

dvar = 2.*ivars[:,0] + 0.1*ivars[:,1]**2

plt.scatter(ivars[:,0],ivars[:,1], s=3, c=dvar)
plt.colorbar()
plt.grid()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

init_data = init_uniform_partitions([5,2], ivars, verbose=True) # initialize partitions
ivars_cs = (ivars - init_data['ivar_center'])/init_data['ivar_scale'] # center/scale ivars

plt.plot(ivars_cs[:,0],ivars_cs[:,1], 'b.', label='normalized training data')
plt.plot(init_data['partition_centers'][:,0], init_data['partition_centers'][:,1], 'r*', label='partition centers')
plt.grid()
plt.xlabel('normalized x1')
plt.ylabel('normalized x2')
plt.legend()
plt.show()
_images/tutorial-pounet-domain.png
kept 8 partitions out of 10
_images/tutorial-pounet-partitions.png

We can now initialize a POUnet with a linear basis, build the graph with absolute training errors, and train for 1000 iterations.

There are also options, as outlined in the documentation, to set transformation parameters for training on a transformed dvar.

net = PartitionOfUnityNetwork(**init_data,
                              basis_type='linear',
#                               transform_power=1.,
#                               transform_shift=0.,
#                               transform_sign_shift=0.
                             )
net.build_training_graph(ivars, dvar, error_type='abs')
net.train(1000, archive_rate=100, verbose=True)
------------------------------------------------------------
   iteration |   mean sqr |      % max  |    sum sqr
------------------------------------------------------------
         100 |   1.93e-06 |       0.22% |   4.93e-04
resetting best error
         200 |   1.75e-06 |       0.21% |   4.49e-04
resetting best error
         300 |   1.69e-06 |       0.20% |   4.33e-04
resetting best error
         400 |   1.66e-06 |       0.20% |   4.25e-04
resetting best error
         500 |   1.64e-06 |       0.20% |   4.21e-04
resetting best error
         600 |   1.05e-06 |       0.20% |   2.68e-04
resetting best error
         700 |   5.25e-07 |       0.21% |   1.34e-04
resetting best error
         800 |   2.07e-07 |       0.22% |   5.29e-05
resetting best error
         900 |   2.57e-10 |       0.01% |   6.58e-08
resetting best error
        1000 |   1.06e-10 |       0.01% |   2.72e-08
resetting best error

The learning rate (default 1e-3) and least squares l2 regularization (default 1e-10) can also be updated at any time.

net.update_lr(1.e-4)
net.update_l2reg(1.e-12)
net.train(200, archive_rate=100, verbose=True)
updating lr: 0.0001
updating l2reg: 1e-12
------------------------------------------------------------
   iteration |   mean sqr |      % max  |    sum sqr
------------------------------------------------------------
         100 |   1.01e-10 |       0.01% |   2.58e-08
resetting best error
         200 |   9.61e-11 |       0.01% |   2.46e-08
resetting best error

Here we visualize the error during training at every 100th iteration, which is the default archive rate.

err_dict = net.training_archive

for k in ['mse', 'sse', 'inf']:
    plt.loglog(net.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()
_images/tutorial-pounet-error1.png

We can evaluate the POUnet and its derivatives.

pred = net(ivars)

plt.plot(dvar,dvar,'k-')
plt.plot(dvar,pred,'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('QoI')
plt.show()
_images/tutorial-pounet-parity1.png
der = net.derivatives(ivars) # predicted

der1 = 2.*np.ones_like(dvar) # observed
der2 = 0.2*ivars[:,1] # observed

plt.plot(der1,der1,'k-')
plt.plot(der1,der[:,0],'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('d/dx1')
plt.show()

plt.plot(der2,der2,'k-')
plt.plot(der2,der[:,1],'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('d/dx2')
plt.show()
_images/tutorial-pounet-dx1.png _images/tutorial-pounet-dx2.png

We can then save and load the POUnet parameters to/from file. The training history needs to be saved separately if desired.

# Save the POUnet to a file
net.write_data_to_file('filename.pkl')

# Load a POUnet from file
net2 = PartitionOfUnityNetwork.load_from_file('filename.pkl')

# Evaluate the loaded POUnet (without needing to build the graph)
pred2 = net2(ivars)
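
The training history is not stored in that file; if it is needed later, a minimal sketch for saving it separately (assuming the training_archive dictionary and iterations attribute are plain NumPy data, as they are used for plotting above, and using a hypothetical file name) could be:

import pickle

# Save the archived error history and the corresponding iteration numbers
history = {'iterations': net.iterations, 'archive': net.training_archive}
with open('filename-history.pkl', 'wb') as f:
    pickle.dump(history, f)

# Reload the history later
with open('filename-history.pkl', 'rb') as f:
    history = pickle.load(f)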

It is also possible to continue training a POUnet after loading it from file…

net2.build_training_graph(ivars, dvar, error_type='abs')
net2.train(1000, archive_rate=100, verbose=False)

Notice how the error history for the loaded POUnet only includes the recent training.

err_dict = net2.training_archive

for k in ['mse', 'sse', 'inf']:
    plt.loglog(net2.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()
_images/tutorial-pounet-error2.png

More training may be beneficial if new training data, perhaps with more resolution, become available…

ivars2 = np.meshgrid(np.linspace(1,2,20), np.linspace(1,2,20))
ivars2 = np.vstack([b.ravel() for b in ivars2]).T

dvar2 = 2.*ivars2[:,0] + 0.1*ivars2[:,1]**2

net2.build_training_graph(ivars2, dvar2, error_type='abs')
net2.train(1000, archive_rate=100, verbose=False)

If we have a different QoI for which we want to use the same partitions, we may also create a new POUnet from the trained parameters and redo the least-squares regression to update the basis coefficients appropriately…

dvar_new = ivars[:,0]*2 + 0.5*ivars[:,1]

net_new = PartitionOfUnityNetwork.load_from_file('filename.pkl')

net_new.build_training_graph(ivars, dvar_new)
net_new.lstsq()

pred_new = net_new(ivars)

plt.plot(dvar_new,dvar_new,'k-')
plt.plot(dvar_new,pred_new,'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('QoI new')
plt.show()
performing least-squares solve
_images/tutorial-pounet-parity2.png

There is also flexibility in adding/removing partitions or changing the basis degree, but the parameters must be appropriately resized for such changes.

Below, we remove the 4th partition from the originally trained POUnet. Partition parameters are shaped as n_partition x n_dim while the basis coefficients can easily be reshaped into n_basis x n_partition as shown below. Since we had a linear basis, the number of terms in each partition’s basis function is 3: a constant, linear in x1, and linear in x2.

pou_data = PartitionOfUnityNetwork.load_data_from_file('filename.pkl')

i_partition_remove = 3 # index to remove the 4th partition

old_coeffs = pou_data['basis_coeffs'].reshape(3,pou_data['partition_centers'].shape[0]) # reshape basis coeffs into n_basis x n_partition

pou_data['partition_centers'] = np.delete(pou_data['partition_centers'], i_partition_remove, axis=0) # remove the 4th row
pou_data['partition_shapes'] = np.delete(pou_data['partition_shapes'], i_partition_remove, axis=0) # remove the 4th row
pou_data['basis_coeffs'] = np.expand_dims(np.delete(old_coeffs, i_partition_remove, axis=1).ravel(), axis=0) # remove the 4th column

We then simply initialize a new POUnet with the modified data and continue training.

net_modified = PartitionOfUnityNetwork(**pou_data)
net_modified.build_training_graph(ivars, dvar, error_type='abs')
net_modified.train(1000, archive_rate=100, verbose=False)

We could also change the basis type and modify the basis coefficient size accordingly. Below, we change the basis from linear to quadratic, which adds 3 additional terms: x1^2, x2^2, and x1x2. We initialize these coefficients to zero and perform the least squares to update them appropriately. Further training could be performed if desired.

pou_data = PartitionOfUnityNetwork.load_data_from_file('filename.pkl')

old_coeffs = pou_data['basis_coeffs'].reshape(3,pou_data['partition_centers'].shape[0]) # reshape basis coeffs into n_basis x n_partition
old_coeffs = np.vstack((old_coeffs, np.zeros((3,old_coeffs.shape[1])))) # add basis terms for x1^2, x2^2, and x1x2
pou_data['basis_coeffs'] = np.expand_dims(old_coeffs.ravel(), axis=0)
pou_data['basis_type'] = 'quadratic'

net_modified = PartitionOfUnityNetwork(**pou_data)
net_modified.build_training_graph(ivars, dvar, error_type='abs')
net_modified.lstsq()
performing least-squares solve
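
As a quick sanity check on the coefficient counts used in these reshapes (a small illustrative snippet, not part of the PCAfold API), the number of polynomial terms per partition for a \(d\)-dimensional input is \(d+1\) for a linear basis and \((d+1)(d+2)/2\) for a quadratic basis:

n_dim = 2
n_linear_terms = n_dim + 1                          # constant + one linear term per dimension
n_quadratic_terms = (n_dim + 1) * (n_dim + 2) // 2  # adds the squares and the cross term
print(n_linear_terms, n_quadratic_terms)            # 3 and 6, matching the reshapes above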

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

QoI-aware encoder-decoder#

In this tutorial, we present the QoI-aware encoder-decoder dimensionality reduction strategy from the utilities module.

The QoI-aware encoder-decoder is an autoencoder-like neural network that reconstructs important quantities of interest (QoIs) at the output of a decoder. The QoIs can be set to projection-independent variables (such as the original state variables) or projection-dependent variables, whose definition changes during neural network training.

We introduce an intrusive modification to the neural network training process such that at each epoch, a low-dimensional basis matrix is computed from the current weights in the encoder. Any projection-dependent variables at the output get re-projected onto that basis.

The rationale for performing dimensionality reduction with the QoI-aware strategy is that any poor topological behaviors on a low-dimensional projection will immediately increase the loss during training. These behaviors could be non-uniqueness in representing QoIs due to overlaps on a projection, or large gradients in QoIs caused by data compression in certain regions of a projection. Thus, the QoI-aware strategy naturally promotes improved projection topologies and can be useful in reduced-order modeling.
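
In other words (a brief sketch of the idea that omits normalization and any nonlinear transformations): if \(\mathbf{A}^{(k)}\) denotes the basis matrix computed from the encoder weights at epoch \(k\), the encoder outputs the projection \(\mathbf{X} \mathbf{A}^{(k)}\), and any projection-dependent QoIs at the decoder output are rebuilt as \(\mathbf{S_X} \mathbf{A}^{(k)}\) before that epoch's weight update.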

An illustrative explanation of how the QoI-aware encoder-decoder works is presented in the figure below:

_images/tutorial-qoi-aware-encoder-decoder.png

We import the necessary modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import utilities
import numpy as np

and we set some initial parameters:

save_filename = None

Load a combustion data set#

A demo data set representing combustion of hydrogen in air, generated with the steady laminar flamelet model using Spitfire, is used in this tutorial.

We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):

X = np.genfromtxt('H2-air-state-space.csv', delimiter=',')[:,0:-2]
S_X = np.genfromtxt('H2-air-state-space-sources.csv', delimiter=',')[:,0:-2]
X_names = np.genfromtxt('H2-air-state-space-names.csv', delimiter='\n', dtype=str)[0:-2]

(n_observations, n_variables) = np.shape(X)

Train the QoI-aware encoder-decoder#

We are going to generate 2D projections of the state-space:

n_components = 2

First, we are going to scale the state-space variables to a \(\langle 0, 1 \rangle\) range. This is done to help the neural network training process.

We are also going to apply an adequate scaling to the source terms. This is done for consistency in reduced-order modeling (see: Handling source terms). The scaled source terms will serve as projection-dependent variables.

(input_data, centers, scales) = preprocess.center_scale(X, scaling='0to1')
projection_dependent_outputs = S_X / scales
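
The source terms are divided by the scales, without subtracting the centers, because centering and scaling is a linear transformation with constant parameters (a brief sketch of the reasoning, consistent with the Handling source terms section): if \(\mathbf{X}_{cs} = (\mathbf{X} - \mathbf{c}) \, \mathrm{diag}(1/\mathbf{d})\), then the corresponding source terms transform as \(\mathbf{S}_{\mathbf{X}_{cs}} = \mathbf{S_X} \, \mathrm{diag}(1/\mathbf{d})\), since the constant centers \(\mathbf{c}\) do not contribute to the time derivative.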

We create a PCA-initialization of the encoder:

pca = reduction.PCA(X, n_components=n_components, scaling='auto')
encoder_weights_init = pca.A[:,0:n_components]

We visualize the initial projection:

X_projected = np.dot(input_data, encoder_weights_init)
S_X_projected = np.dot(projection_dependent_outputs, encoder_weights_init)
_images/tutorial-qoi-aware-encoder-decoder-initial-2D-projection.png
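
The plotting code is not shown above; a minimal matplotlib sketch that produces a comparable scatter plot of the initial projection, colored here by the first projected source term (the color choice and axis labels are illustrative), could be:

import matplotlib.pyplot as plt

plt.scatter(X_projected[:,0], X_projected[:,1], s=3, c=S_X_projected[:,0], cmap='viridis')
plt.colorbar()
plt.grid()
plt.xlabel('$q_1$')
plt.ylabel('$q_2$')
plt.show()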

We select a couple of important state variables to be used as the projection-independent variables:

selected_state_variables = [0, 2, 4, 5, 6]

We fix the random seed for reproducibility of the results:

random_seed = 100

We set several important hyper-parameters:

activation_decoder = 'tanh'
decoder_interior_architecture = (6,9)
optimizer = 'Adam'
learning_rate = 0.001
loss = 'MSE'
batch_size = n_observations
validation_perc = 10

We are not going to hold initial weights constant, and we are going to allow the encoder to update weights at each epoch:

hold_initialization = None
hold_weights = None

We are going to train the model for 5000 epochs:

n_epochs = 5000

We instantiate an object of the QoIAwareProjection class with various parameters:

projection = utilities.QoIAwareProjection(input_data,
                                          n_components=2,
                                          projection_independent_outputs=input_data[:,selected_state_variables],
                                          projection_dependent_outputs=projection_dependent_outputs,
                                          activation_decoder=activation_decoder,
                                          decoder_interior_architecture=decoder_interior_architecture,
                                          encoder_weights_init=encoder_weights_init,
                                          decoder_weights_init=None,
                                          hold_initialization=hold_initialization,
                                          hold_weights=hold_weights,
                                          transformed_projection_dependent_outputs='signed-square-root',
                                          loss=loss,
                                          optimizer=optimizer,
                                          batch_size=batch_size,
                                          n_epochs=n_epochs,
                                          learning_rate=learning_rate,
                                          validation_perc=validation_perc,
                                          random_seed=random_seed,
                                          verbose=True)
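
The transformed_projection_dependent_outputs='signed-square-root' option above adds decoder outputs that reconstruct a signed square root of the projected source terms, broadly of the form \(\mathrm{sign}(s)\sqrt{|s|}\) (the implementation may include small offsets); this compresses the large dynamic range of the source terms while preserving their sign.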

Before we begin neural network training, we can print the summary of the current Keras model:

projection.summary()
QoI-aware encoder-decoder model summary...

(Model has not been trained yet)


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Projection dimensionality:

      - 2D projection

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Encoder-decoder architecture:

      9-2-6-9-9

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Activation functions:

      (9)--linear--(2)--tanh--(6)--tanh--(9)--tanh--(9)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variables at the decoder output:

      - 5 projection independent variables
      - 2 projection dependent variables
      - 2 transformed projection dependent variables using signed-square-root

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model validation:

      - Using 10% of input data as validation data
      - Model will be trained on 90% of input data

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hyperparameters:

      - Batch size:           58101
      - # of epochs:          5000
      - Optimizer:            Adam
      - Learning rate:        0.001
      - Loss function:        MSE

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the encoder:

      - User-provided custom initialization of the encoder

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the decoder:

      - Glorot uniform

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights updates in the encoder:

      - Initial weights in the encoder will change after first epoch
      - Weights in the encoder will change at every epoch

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Results reproducibility:

      - Reproducible neural network training will be assured using random seed: 100

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

We train the current Keras model:

projection.train()

We can visualize the MSE loss computed on training and validation data during training:

projection.plot_losses(markevery=100,
                       figure_size=(15, 4),
                       save_filename=save_filename)
_images/tutorial-qoi-aware-encoder-decoder-losses.png

After training, additional information is available in the model summary:

projection.summary()
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Training results:

      - Minimum training loss:                0.0018488304922357202
      - Minimum training loss at epoch:       5000

      - Minimum validation loss:              0.0019012088887393475
      - Minimum validation loss at epoch:     5000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

We extract the best lower-dimensional basis that corresponds to the epoch with the smallest training loss:

basis = projection.get_best_basis(method='min-training-loss')

We project the original dataset onto that basis:

X_projected = np.dot(input_data, basis)
S_X_projected = np.dot(projection_dependent_outputs, basis)

We visualize the current manifold topology:

_images/tutorial-qoi-aware-encoder-decoder-2D-projection.png

Note

This tutorial was generated from a Jupyter notebook that can be accessed here.

QoI-aware encoder-decoders employing Partition of Unity Networks (POUnets)#

This demo takes the general formulation of QoI-aware encoder-decoders available in PCAfold and uses a POUnet as the decoder.

from PCAfold import QoIAwareProjectionPOUnet, init_uniform_partitions, PCA, center_scale, PartitionOfUnityNetwork
import numpy as np
import matplotlib.pyplot as plt
import tensorflow.compat.v1 as tf

We will load the combustion dataset and remove temperature from the state variable list.

X = np.genfromtxt('H2-air-state-space.csv', delimiter=',')[:,1:-2]
S_X = np.genfromtxt('H2-air-state-space-sources.csv', delimiter=',')[:,1:-2]
X_names = np.genfromtxt('H2-air-state-space-names.csv', delimiter='\n', dtype=str)[1:-2]
X_names
array(['H', 'H2', 'O', 'OH', 'H2O', 'O2', 'HO2', 'H2O2'], dtype='<U4')

We then initialize the encoder weights using PCA. Notice how the 2D manifold is squeezed tightly and has overlapping states in some regions.

n_components = 2

pca = PCA(X, n_components=n_components, scaling='auto')
encoder_weights_init = pca.A[:,:n_components]

X_projected = X.dot(encoder_weights_init)
S_X_projected = S_X.dot(encoder_weights_init)

plt.scatter(X_projected[:,0], X_projected[:,1],s=3, c=S_X_projected[:,0], cmap='viridis')
plt.colorbar()
plt.grid()
plt.show()
_images/tutorial-qoi-pounet-init.png

Next, we finish initializing the encoder-decoder with the POUnet parameters. The helper function init_uniform_partitions is used as done in the POUnet demo, but note the independent variable space for the POUnet is the projected state variables X_projected. We have chosen a linear basis below.

When building the graph for the encoder-decoder, a function is required for computing the dependent training variables (QoIs). This allows the dependent variable definitions to use the projection parameters, which are themselves updated during training. The projection training can therefore be informed by, for example, how well the projected source terms are represented. The function must take the encoder weights as an argument, but they do not have to be used. We also perform a nonlinear transformation on the source terms, which can help penalize projections that introduce overlap in values. Below, we build a function that computes the projected source terms and concatenates these values with the OH and water mass fractions. These four variables provide the QoIs for which the loss function is computed during training. Note that the QoI function must be written using TensorFlow operations.

The graph is then built. Below we have turned on the optional constrain_positivity flag. As mass fractions are naturally positive, this only penalizes projections that create negative projected source terms. This can have advantages in simplifying regression and reducing the impact of regression errors with the wrong sign during simulation.
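
As a generic illustration of what such a soft positivity constraint can look like (a conceptual sketch only, not PCAfold's exact formulation), a hinge-style penalty is nonzero only where the constrained quantities become negative:

import numpy as np

# Hypothetical values of a quantity that should remain positive
values = np.array([0.2, -0.1, 0.5, -0.3])

# Hinge-style penalty: zero for positive values, proportional to the magnitude of negative values
positivity_penalty = np.mean(np.maximum(-values, 0.0))
print(positivity_penalty)  # 0.1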

Finally, we train the QoIAwareProjectionPOUnet for 1000 iterations, archiving every 100th iteration, and save the parameters with the lowest overall errors. This is for demonstration, but more iterations are generally needed to converge to an optimal solution.

ednet = QoIAwareProjectionPOUnet(encoder_weights_init,
                                 **init_uniform_partitions([8,8], X_projected),
                                 basis_type='linear'
                                )

# define the function to produce the dependent variables for training
def define_dvar(proj_weights):
    dvar_y = tf.Variable(X[:,3:5], name='non_transform', dtype=tf.float64) # mass fractions for OH and H2O

    dvar_s = tf.Variable(np.expand_dims(S_X, axis=2), name='non_transform', dtype=tf.float64)
    dvar_s = ednet.tf_projection(dvar_s, nobias=True) # projected source terms
    dvar_st = tf.math.sqrt(tf.cast(tf.abs(dvar_s+1.e-4), dtype=tf.float64)) * tf.math.sign(dvar_s+1.e-4)+1.e-2 * tf.math.sign(dvar_s+1.e-4) # power transform source terms
    dvar_st_norm = dvar_st/tf.reduce_max(tf.cast(tf.abs(dvar_st), dtype=tf.float64), axis=0, keepdims=True) # normalize

    dvar = tf.concat([dvar_y, dvar_st_norm], axis=1) # train on combination
    return dvar

ednet.build_training_graph(X, define_dvar, error_type='abs', constrain_positivity=True)

ednet.train(1000, archive_rate=100, verbose=True)
------------------------------------------------------------
   iteration |   mean sqr |      % max  |    sum sqr
------------------------------------------------------------
         100 |   7.85e-04 |      79.79% |   4.56e+01
resetting best error
         200 |   6.72e-04 |      79.53% |   3.91e+01
resetting best error
         300 |   6.04e-04 |      78.90% |   3.51e+01
resetting best error
         400 |   5.29e-04 |      78.87% |   3.07e+01
resetting best error
         500 |   4.49e-04 |      76.84% |   2.61e+01
resetting best error
         600 |   3.53e-04 |      73.46% |   2.05e+01
resetting best error
         700 |   2.20e-04 |      67.99% |   1.28e+01
resetting best error
         800 |   1.18e-04 |      59.24% |   6.86e+00
resetting best error
         900 |   9.86e-05 |      59.00% |   5.73e+00
resetting best error
        1000 |   9.25e-05 |      59.07% |   5.37e+00
resetting best error

The learning rate (default 1e-3) and least squares l2 regularization (default 1e-10) can also be updated at any time.

ednet.update_lr(1.e-4)
ednet.update_l2reg(1.e-11)

ednet.train(200, archive_rate=100, verbose=True)
updating lr: 0.0001
updating l2reg: 1e-11
------------------------------------------------------------
   iteration |   mean sqr |      % max  |    sum sqr
------------------------------------------------------------
         100 |   9.03e-05 |      59.40% |   5.25e+00
resetting best error
         200 |   8.96e-05 |      59.41% |   5.21e+00
resetting best error

We can look at the trained projection weights:

print(ednet.projection_weights)
[[-0.35640105  0.05729153]
 [ 0.42022997  0.05765012]
 [-0.48311619 -0.24169431]
 [-0.24244533 -0.20839019]
 [-0.11743472  0.78807212]
 [-0.24317541 -0.00543714]
 [-1.16916608 -0.3213446 ]
 [-1.52308699  0.10786119]]

We can look at the projection after this initial training. We see how the projected source term values are closer to being positive than before and the overlap has been removed. We would expect further training to create more separation between observations. Training on other QoIs may also lead to better separation more quickly.

X_projected = ednet.projection(X)
S_X_projected = ednet.projection(X, nobias=True)

plt.scatter(X_projected[:,0], X_projected[:,1],s=3, c=S_X_projected[:,0], cmap='viridis')
plt.colorbar()
plt.grid()
plt.show()
_images/tutorial-qoi-pounet-final.png

Below we grab the archived states during training and visualize the errors.

err_dict = ednet.training_archive

for k in ['mse', 'sse', 'inf']:
    plt.loglog(ednet.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()
_images/tutorial-qoi-pounet-error.png

We may also save and load a QoIAwareProjectionPOUnet to/from file. Rebuilding the graph is not necessary to evaluate the projection from a loaded QoIAwareProjectionPOUnet.

# Save the QoIAwareProjectionPOUnet data to a file
ednet.write_data_to_file('filename.pkl')

# Reload the projection data from file
ednet2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')

# Compute the projection without needing to rebuild the graph
X_projected = ednet2.projection(X)

It can then be useful to create multiple POUnets for separate variables using the same trained projection and partitions from the QoIAwareProjectionPOUnet. Below we demonstrate this for the water mass fraction.

net = PartitionOfUnityNetwork(
                             partition_centers=ednet.partition_centers,
                             partition_shapes=ednet.partition_shapes,
                             basis_type=ednet.basis_type,
                             ivar_center=ednet.proj_ivar_center,
                             ivar_scale=ednet.proj_ivar_scale
                            )
i_dvar = 4
dvar1 = X[:,i_dvar]
net.build_training_graph(ednet.projection(X), dvar1)
net.lstsq()

pred = net(ednet.projection(X))
plt.plot(dvar1, dvar1, 'k-')
plt.plot(dvar1, pred.ravel(), 'r.')
plt.title(X_names[i_dvar])
plt.grid()
plt.show()
performing least-squares solve
_images/tutorial-qoi-pounet-parity.png

There is also an option, when building the QoIAwareProjectionPOUnet graph, to separate trainable from non-trainable projection weights. This can be useful if certain dimensions of the projection are predefined, such as the mixture fraction commonly used in combustion. To keep certain columns of the projection weight matrix constant, specify the first index for which the weights are trainable (first_trainable_idx).

Below is an example of holding the first column of projection weights constant. We see how the second column changes after training, but the first does not.

ednet2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')
ednet2.build_training_graph(X, define_dvar, first_trainable_idx=1)
old_weights = ednet2.projection_weights
ednet2.train(10, archive_rate=1)
print('difference in weights before and after training:\n', old_weights-ednet2.projection_weights)
difference in weights before and after training:
 [[ 0.        -0.001    ]
 [ 0.        -0.001    ]
 [ 0.        -0.001    ]
 [ 0.        -0.001    ]
 [ 0.         0.001    ]
 [ 0.         0.001    ]
 [ 0.         0.001    ]
 [ 0.         0.0009999]]