Low-dimensional PCA-derived manifolds and everything in between!
Intro#
PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds. It incorporates a variety of data preprocessing tools (including data clustering and sampling), implements several dimensionality reduction strategies and utilizes novel approaches to assess the quality of the obtained low-dimensional manifolds. The latest software version introduces algorithms to optimize projection topologies based on quantities of interest (QoIs) and novel tools to reconstruct QoIs from the low-dimensional data representations using partition of unity networks (POUnets).
A general overview for using PCAfold modules is presented in the diagram below:
Each module’s functionalities can also be used as a standalone tool for performing a specific task and can easily be combined with techniques from outside of this software.
Refer to the Getting started section for more information on installing the software and for possible workflows that can be achieved with PCAfold. You can also download the poster below for a condensed overview of the available functionalities.

Citing PCAfold#
PCAfold is published in the SoftwareX journal. If you use PCAfold in a scientific publication, you can cite the software as:
Zdybał, K., Armstrong, E., Parente, A. and Sutherland, J.C., 2020. PCAfold: Python software to generate, analyze and improve PCA-derived low-dimensional manifolds. SoftwareX, 12, p.100630.
or using BibTeX:
@article{pcafold2020,
title = "PCAfold: Python software to generate, analyze and improve PCA-derived low-dimensional manifolds",
journal = "SoftwareX",
volume = "12",
pages = "100630",
year = "2020",
issn = "2352-7110",
doi = "https://doi.org/10.1016/j.softx.2020.100630",
url = "http://www.sciencedirect.com/science/article/pii/S2352711020303435",
author = "Kamila Zdybał and Elizabeth Armstrong and Alessandro Parente and James C. Sutherland"
}
Getting started#
Installation#
Dependencies#
PCAfold requires python3.7 and the following packages:
pip install Cython
pip install matplotlib
pip install numpy
pip install scipy
pip install termcolor
pip install tqdm
pip install scikit-learn
pip install tensorflow
Build from source#
Clone the PCAfold repository and move into the newly created PCAfold directory:
git clone http://gitlab.multiscale.utah.edu/common/PCAfold.git
cd PCAfold
Run the setup.py script as below to complete the installation:
python3.7 setup.py build_ext --inplace
python3.7 setup.py install
If the installation was successful, you are ready to import PCAfold!
Testing#
To run regression tests, run the following from the base repository directory:
python3.7 -m unittest discover
To switch verbose output on, use the -v flag.
All tests should pass. If any of the tests are failing and you can’t sort out why, please open an issue on GitLab.
Local documentation build#
To build the documentation locally, you need sphinx installed on your machine, along with a few extensions:
pip install Sphinx
pip install sphinxcontrib-bibtex
pip install furo
Then, navigate to the docs/ directory and build the documentation:
sphinx-build -b html . builddir
make html
The documentation main page, _build/html/index.html, can then be opened in a web browser.
On macOS, you can open it directly from the terminal:
open _build/html/index.html
Plotting#
Some functions within PCAfold result in plot outputs. Global styles for the
plots, such as font types and sizes, are set using the PCAfold/styles.py
file.
This file can be updated with new settings that will be seen globally by all
PCAfold modules. Re-build the project after changing the styles.py file:
python3.7 setup.py install
Note that all plotting functions return handles to the generated plots.
Workflows#
In this section, we present several popular workflows that can be achieved using functionalities of PCAfold. An overview for combining PCAfold modules into a complete workflow is presented in the diagram below:
Each module’s functionalities can also be used as a standalone tool for performing a specific task and can easily be combined with techniques from outside of this software.
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
Below are brief descriptions of several workflows that utilize functionalities of PCAfold:
Data manipulation#
Basic data manipulation, such as centering, scaling, outlier detection and removal,
or kernel density weighting of data sets, can be achieved using the preprocess
module.
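A minimal sketch of this workflow, using the center_scale and invert_center_scale functions documented later in this section; the data set here is randomly generated for illustration:

from PCAfold import center_scale, invert_center_scale
import numpy as np

# Generate a dummy data set:
X = np.random.rand(1000,10)

# Center and scale with range scaling:
(X_cs, X_center, X_scale) = center_scale(X, 'range')

# Invert the centering and scaling when the original variables are needed again:
X_orig = invert_center_scale(X_cs, X_center, X_scale)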
Data clustering#
Data clustering can be achieved using the preprocess
module. This functionality can be
useful for data analysis or feature detection and can also be the first
step for applying data reduction techniques locally (on local portions of the data).
It is also worth pointing out that clustering algorithms from outside of
the PCAfold software can be brought into the workflow.
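As a minimal sketch, a one-dimensional clustering can be obtained with the variable_bins function documented later in this module; the variable below is synthetic and the number of bins is arbitrary:

from PCAfold import variable_bins
import numpy as np

# Generate a dummy variable to cluster on:
x = np.linspace(-1,1,1000)

# Partition the observations into 4 bins of equal length in x:
(idx, borders) = variable_bins(x, 4, verbose=False)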
Data sampling#
Data sampling can be achieved using the preprocess module. A possible
use case for sampling is splitting a data set into train and test
samples for other machine learning algorithms. Another use case is sampling
imbalanced data sets.
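A minimal sketch, using the DataSampler class documented later in this module; the idx clustering vector below is a toy example:

from PCAfold import DataSampler
import numpy as np

# Assume a clustering of ten observations into two clusters:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Sample 50% of observations from each cluster as train data;
# the remaining observations become test data:
sampler = DataSampler(idx, random_seed=100)
(idx_train, idx_test) = sampler.percentage(50, test_selection_option=1)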
Global PCA#
Global PCA can be performed using the PCA class available in the reduction module.
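A minimal sketch, following the PCA class usage that appears in the examples throughout this documentation; the data set is randomly generated and the number of components is arbitrary:

from PCAfold import PCA
import numpy as np

# Generate a dummy data set:
X = np.random.rand(1000,10)

# Instantiate PCA class object with auto scaling and two components:
pca_X = PCA(X, scaling='auto', n_components=2)

# Calculate the principal components:
principal_components = pca_X.transform(X)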
Local PCA#
Local PCA can be performed using the LPCA class available in the reduction module.
PCA on sampled data sets#
PCA on sampled data sets can be performed by combining sampling techniques from
the preprocess module with the PCA class available in the reduction module.
The reduction module additionally contains a few more functions specifically
designed to help analyze the results of performing PCA on sampled data sets.
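A minimal sketch of that combination, using the sampling and clustering functions documented in this section together with the PCA class as it appears in the examples below; the data set is randomly generated and the clustering variable is chosen arbitrarily:

from PCAfold import variable_bins, DataSampler, PCA
import numpy as np

# Generate a dummy data set:
X = np.random.rand(1000,10)

# Cluster the observations based on bins of the first variable:
(idx, borders) = variable_bins(X[:,0], 4)

# Sample 20% of observations from each cluster as train data:
sampler = DataSampler(idx, random_seed=100)
(idx_train, idx_test) = sampler.percentage(20, test_selection_option=1)

# Perform PCA on the sampled (train) subset only:
pca_sampled = PCA(X[idx_train,:], scaling='auto', n_components=2)
principal_components = pca_sampled.transform(X[idx_train,:])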
Assessing manifold quality#
Once a low-dimensional manifold is obtained, the quality of the manifold can be
assessed using functionalities available in the analysis
module.
It is worth noting that the manifold assessment metrics available can be
equally applied to manifolds derived by means of techniques other than PCA.
Reconstructing quantities of interest (QoIs)#
Using the reconstruction
module, quantities of interest (QoIs) can be reconstructed from the reduced
data representations using kernel regression, artificial neural networks (ANN) and a novel
approach called partition of unity networks (POUnets).
Improving projection topologies#
Two novel algorithms based on a quantitative cost function are introduced in the utilities
module that can help improve topologies of PCA projections through appropriate variable selection.
We also introduce an autoencoder-like strategy that optimizes the projection topology directly
based on custom projection-independent and projection-dependent quantities of interest (QoIs).
Data preprocessing#
The preprocess
module can be used for performing data preprocessing
including centering and scaling, outlier detection and removal, kernel density
weighting of data sets, data clustering and data sampling. It also includes
functionalities that allow the user to perform initial data inspection such
as computing conditional statistics, calculating statistically representative sample sizes,
or ordering variables in a data set according to a criterion.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
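For illustration, a conforming data matrix with \(N = 1000\) observations of \(Q = 10\) variables can be created as follows (the sizes are arbitrary):

import numpy as np

# 1000 observations (rows) of 10 variables (columns):
X = np.random.rand(1000,10)

# The initial dimensionality of the data set is the number of columns:
(n_observations, n_variables) = np.shape(X)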
Data manipulation#
This section includes functions for performing basic data manipulation such as centering and scaling and outlier detection and removal.
center_scale
#
- PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#
Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).
Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):
\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]

Centers for each column are computed as:

\[c_j = mean(X_j)\]

with the only exceptions of the '0to1' and '-1to1' scalings, which introduce a different quantity to center each column.

Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):

\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]

If both centering and scaling are applied:

\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]

Several scaling options are implemented here:
Scaling method          scaling             Scaling factor, \(d_j\)
None                    'none' or ''        1
Auto [PvdBHW+06]        'auto' or 'std'     \(\sigma\)
Pareto [PNod08]         'pareto'            \(\sqrt{\sigma}\)
VAST [PKEA+03]          'vast'              \(\sigma^2 / mean(X_j)\)
Range [PvdBHW+06]       'range'             \(max(X_j) - min(X_j)\)
0 to 1                  '0to1'              \(d_j = max(X_j) - min(X_j)\), \(c_j = min(X_j)\)
-1 to 1                 '-1to1'             \(d_j = 0.5 \cdot (max(X_j) - min(X_j))\), \(c_j = 0.5 \cdot (max(X_j) + min(X_j))\)
Level [PvdBHW+06]       'level'             \(mean(X_j)\)
Max                     'max'               \(max(X_j)\)
Variance                'variance'          \(var(X_j)\)
Median                  'median'            \(median(X_j)\)
Poisson [PKK04]         'poisson'           \(\sqrt{mean(X_j)}\)
S1                      'vast_2'            \(\sigma^2 k^2 / mean(X_j)\)
S2                      'vast_3'            \(\sigma^2 k^2 / max(X_j)\)
S3                      'vast_4'            \(\sigma^2 k^2 / (max(X_j) - min(X_j))\)
L2-norm                 'l2-norm'           \(\|X_j\|_2\)
where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).
The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PPS13].
Example:
from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
- Parameters

X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4', 'l2-norm'.

nocenter – (optional) bool specifying whether data should be centered by mean. If set to True, data will not be centered.

- Returns

X_cs - numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size (n_observations,n_variables).

X_center - numpy.ndarray specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It has size (n_variables,).

X_scale - numpy.ndarray specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It has size (n_variables,).
invert_center_scale
#
- PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#
Inverts whatever centering and scaling was done by the center_scale function:

\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]

Example:
from PCAfold import center_scale, invert_center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)

# Uncenter and unscale:
X = invert_center_scale(X_cs, X_center, X_scale)
- Parameters

X_cs – numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size (n_observations,n_variables).

X_center – numpy.ndarray specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).

X_scale – numpy.ndarray specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).

- Returns

X - numpy.ndarray specifying the original data set, \(\mathbf{X}\). It has size (n_observations,n_variables).
power_transform
#
- PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#
Performs a power transformation of the provided data. The equation for the transformation of variable \(X\) is
\[(|X + s_1|)^\alpha \text{sign}(X + s_1) + s_2 \text{sign}(X + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.

Example:
from PCAfold import power_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform power transformation:
X_pow = power_transform(X, 0.5)

# Undo the transformation:
X_orig = power_transform(X_pow, 0.5, invert=True)
- Parameters
X – array of the variable(s) to be transformed
transform_power – the power parameter used in the transformation equation
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation
invert – (optional, default False) when True, will undo the transformation
- Returns
array of the transformed variables
log_transform
#
- PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#
Performs log transformation of the original data set, \(\mathbf{X}\).
For an example original function, the symlog transformation can be obtained with method='symlog', and the continuous symlog transformation can be obtained with method='continuous-symlog'.

Example:
from PCAfold import log_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform log transformation:
X_log = log_transform(X)

# Perform symlog transformation:
X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
- Parameters

X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).

method – (optional) str specifying the log-transformation method. It can be one of the following: log, ln, symlog, continuous-symlog.

threshold – (optional) float or int specifying the threshold for the symlog transformation.

- Returns

X_transformed - numpy.ndarray specifying the log-transformed data set. It has size (n_observations,n_variables).
remove_constant_vars
#
- PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#
Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:
The maximum of an absolute value of a column \(X_j\) is less than
maxtol
:
\[max(|X_j|) < \verb|maxtol|\]The ratio of the range of values in a column \(X_j\) to \(max(|X_j|)\) is less than
rangetol
:
\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]

Specifically, this function can be used as preprocessing for PCA so that the eigenvalue calculation doesn’t break.
Example:
from PCAfold import remove_constant_vars
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Remove the constant column:
(X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.maxtol – (optional)
float
specifying the tolerance for \(max(|X_j|)\).rangetol – (optional)
float
specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).
- Returns
X_removed -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\) with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed -
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained -
list
specifying the indices of columns retained in \(\mathbf{X}\).
order_variables
#
- PCAfold.preprocess.order_variables(X, method='mean', descending=True)#
Orders variables in the original data set, \(\mathbf{X}\), using a selected method.
Example:
from PCAfold import order_variables
import numpy as np

# Generate a dummy data set:
X = np.array([[100, 1, 10], [200, 2, 20], [300, 3, 30]])

# Order variables by the mean value in the descending order:
(X_ordered, idx) = order_variables(X, method='mean', descending=True)
The code above should return an ordered data set:
array([[100, 10, 1],
       [200, 20, 2],
       [300, 30, 3]])
and the list of ordered variable indices:
[1, 2, 0]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
orlist
ofint
specifying the ordering method. Ifstr
, it can be one of the following:'mean'
,'min'
,'max'
,'std'
or'var'
. Iflist
, it is a custom user-provided list of indices for how the variables should be ordered.descending – (optional)
bool
specifying whether variables should be ordered in the descending order. If set toFalse
, variables will be ordered in the ascending order.
- Returns
X_ordered -
numpy.ndarray
specifying the original data set with ordered variables. It has size(n_observations,n_variables)
.idx -
list
specifying the indices of the ordered variables. It has lengthn_variables
.
Class PreProcessing
#
- class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#
Performs a composition of data manipulation done by
remove_constant_vars
andcenter_scale
functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:checks for constant columns in a data set and removes them,
centers and scales the data.
Example:
from PCAfold import PreProcessing
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Instantiate PreProcessing class object:
preprocessed = PreProcessing(X, 'range', nocenter=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
Attributes:
X_removed - (read only)
numpy.ndarray
specifying the original data set with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed - (read only)
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained - (read only)
list
specifying the indices of columns retained in \(\mathbf{X}\).X_cs - (read only)
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center - (read only)
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale - (read only)
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.
outlier_detection
#
- PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#
Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:
'MULTIVARIATE TRIMMING'
Outliers are detected based on multivariate Mahalanobis distance, \(D_M\):
\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\) storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per
PCA
class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered as outliers. For more detailed information on Mahalanobis distance the user is referred to [PBis06] or [PDMJRM00].The threshold above which observations will be classified as outliers can be specified using
trimming_threshold
parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]'PC CLASSIFIER'
Outliers are detected based on major and minor principal components (PCs). The method of principal component classifier (PCC) was first proposed in [PSCSC03]. The application of this technique to combustion data sets was studied in [PPS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier based on \(q\)-first (major) PCs:
\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]or if the second PC classifier based on \((Q-k+1)\)-last (minor) PCs:
\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]where \(z_{ij}\) is the \(i^{th}, j^{th}\) element from the principal components matrix \(\mathbf{Z}\) and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per
PCA
class). Major PCs are selected such that the total variance explained is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.Coefficients \(c_1\) and \(c_2\) are found such that they represent the
quantile_threshold
(by default 98.99%) quantile of the empirical distributions of the first and second PC classifier respectively.Example:
from PCAfold import outlier_detection
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Find outliers:
(idx_outliers_removed, idx_outliers) = outlier_detection(X, scaling='auto', method='MULTIVARIATE TRIMMING', trimming_threshold=0.8, verbose=True)

# New data set without outliers can be obtained as:
X_outliers_removed = X[idx_outliers_removed,:]

# Observations that were classified as outliers can be obtained as:
X_outliers = X[idx_outliers,:]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.method – (optional)
str
specifying the outlier detection method to use. It should be'MULTIVARIATE TRIMMING'
or'PC CLASSIFIER'
.trimming_threshold – (optional)
float
specifying the trimming threshold to use in combination with'MULTIVARIATE TRIMMING'
method.quantile_threshold – (optional)
float
specifying the quantile threshold to use in combination with'PC CLASSIFIER'
method.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_outliers_removed -
list
specifying the indices of observations without outliers.idx_outliers -
list
specifying the indices of observations that were classified as outliers.
representative_sample_size
#
- PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#
Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.
Two general approaches are available:
If
method='kl-divergence'
, the representative sample size is computed based on Kullback-Leibler divergence.If
method='mean'
,method='median'
,method='variance'
, ormethod='std'
, the representative sample size is computed based on convergence of a first order (mean or median) or of second order (variance, standard deviation) statistics.
Example:
from PCAfold import center_scale, representative_sample_size
import numpy as np

# Generate dummy data set and two dependent variables:
x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100))
xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None]))

phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2)))
phi_1 = phi_1.ravel()[:,None]

phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2)))
phi_2 = phi_2.ravel()[:,None]

depvars = np.column_stack((phi_1, phi_2))
depvars, _, _ = center_scale(depvars, scaling='0to1')

# Specify the list of percentages to explore:
percentages = list(np.linspace(1,99.9,200))

# Specify the list of thresholds for each dependent variable:
thresholds = [10**-4, 10**-4]

# Specify the names of the dependent variables:
variable_names = ['Phi-1', 'Phi-2']

# Compute representative sample size for each dependent variable:
(idx, sample_sizes, statistics) = representative_sample_size(depvars, percentages, thresholds=thresholds, variable_names=variable_names, method='kl-divergence', statistics='median', n_resamples=20, random_seed=100, verbose=True)
With verbose=True we will see some detailed information:

Dependent variable Phi-1 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data).

Dependent variable Phi-2 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
- Parameters
depvars –
numpy.ndarray
specifying the dependent variables that should be well represented in a sampled dataset. . It should be of size(n_observations,n_dependent_variables)
.percentages –
list
of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.thresholds – (optional)
list
offloat
specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.variable_names – (optional)
list
ofstr
specifying names for all dependent variables. If set toNone
, dependent variables are called with consecutive integers.method – (optional)
str
specifying the method used to compute the sample size statistics. It can bemean
,median
,variance
,std
, or'kl-divergence'
.statistics – (optional)
str
specifying the overall statistics that should be computed from a given method. It can bemin
,max
,mean
, ormedian
.n_resamples – (optional)
int
specifying the number of resamples to perform for each percentage in thepercentages
vector. It is recommended to set this parameters to above 1, since it might accidentally happen that a random sample is statistically representative of the full dataset. Re-sampling helps to average-out the effect of such one-off “lucky” random samples.random_seed – (optional)
int
specifying the random seed.verbose – (optional)
bool
for printing verbose details.
- Returns
threshold_idx -
list
ofint
specifying the highest indices from thepercentages
list where the representative number of samples condition was still met. It has lengthn_depvars
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the list for that dependent variable.representatitive_sample_sizes -
numpy.ndarray
ofint
specifying the representative number of samples. It has size(1,n_depvars)
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the array for that dependent variable.sample_size_statistics -
numpy.ndarray
specifying the full vector of computed statistics correponding to each entry inpercentages
and each dependent variable. It has size(n_percentages,n_depvars)
.
Class ConditionalStatistics
#
- class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#
Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:
conditional mean
conditional minimum
conditional maximum
conditional standard deviation
Other quantities can be added in the future at the user’s request.
Example:
from PCAfold import ConditionalStatistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Instantiate an object of the ConditionalStatistics class
# and compute conditional statistics in 10 bins of the conditioning variable:
cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10)

# Access conditional statistics:
conditional_mean = cond.conditional_mean
conditional_min = cond.conditional_minimum
conditional_max = cond.conditional_maximum
conditional_std = cond.conditional_standard_deviation

# Access the centroids of the created bins:
centroids = cond.centroids
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. It should be of size(n_observations,1)
or(n_observations,)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.verbose – (optional)
bool
for printing verbose details.
Attributes:
idx - (read only)
numpy.ndarray
of cluster (bins) classifications. It has size(n_observations,)
.borders - (read only)
list
of values that define borders for the clusters (bins). It has lengthk+1
.centroids - (read only)
list
of values that specify bins centers. It has lengthk
.conditional_mean - (read only)
numpy.ndarray
specifying the conditional means of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_minimum - (read only)
numpy.ndarray
specifying the conditional minimums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_maximum - (read only)
numpy.ndarray
specifying the conditional maximums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_standard_deviation - (read only)
numpy.ndarray
specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size(k,n_variables)
.
Class KernelDensity
#
- class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#
Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on single-variable or multi-variable case as proposed in [PCGP12].
The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).
Note
Kernel density weighting technique is usually very expensive, even on data sets with relatively small number of observations. Since the single-variable case is a cheaper option than the multi-variable case, it is recommended that this technique is tried first for larger data sets.
Gaussian kernel is used in both approaches:
\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} exp(- \frac{d^2}{2 h^2})\]\(h\) is the kernel bandwidth:
\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.
\(d\) is the distance between two observations \(c\) and \(c'\):
\[d = |x_c - x_{c'}|\]Single-variable
If the
conditioning_variable
argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]and weights are then computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]Multi-variable
If the
conditioning_variable
argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for a \(k^{th}\) variable:\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]Global density taking into account all variables is then obtained as:
\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]where \(Q\) is the total number of conditioning variables, and weights are computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]Example:
from PCAfold import KernelDensity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Perform kernel density weighting based on the first variable:
kerneld = KernelDensity(X, X[:,0])

# Access the weighted data set:
X_weighted = kerneld.X_weighted

# Access the weights used to scale the data set:
weights = kerneld.weights
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying either a single variable or multiple variables to be used as a conditioning variable for kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).
Attributes:
weights -
numpy.ndarray
specifying the computed weights, \(\mathbf{W_c}\). It has size(n_observations,1)
.X_weighted -
numpy.ndarray
specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size(n_observations,n_variables)
.
Class DensityEstimation
#
- class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#
Enables density estimation on point-cloud data.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_neighbors –
int
specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.
DensityEstimation.average_knn_distance
#
- PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#
Computes an average Euclidean distances to \(k\) nearest neighbors on a manifold defined by the independent variables.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute average distances on a manifold defined by the PCs:
average_distances = density_estimation.average_knn_distance(verbose=True)
With verbose=True, the minimum, maximum, average and median distances will be printed:

Minimum distance: 0.1388300829487847
Maximum distance: 0.4689587542132183
Average distance: 0.20824964953425693
Median distance: 0.18333873029179215
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Parameters
verbose – (optional)
bool
for printing verbose details.- Returns
average_distances -
numpy.ndarray
specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_codensity
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#
Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data codensity defined as:
\[\delta_k(x) = d(x, v_k(x))\]where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_codensity = density_estimation.kth_nearest_neighbor_codensity()
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Returns
data_codensity -
numpy.ndarray
specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_density
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#
Computes an inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data density defined as:
\[\rho_k(x) = \frac{1}{\delta_k(x)}\]where \(\delta_k(x)\) is the codensity.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the inverse of the distance to the kth nearest neighbor:
data_density = density_estimation.kth_nearest_neighbor_density()
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Returns
data_density -
numpy.ndarray
specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
Data clustering#
This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [PELL09], [PKR09].
Clustering functions#
Each function that clusters the data set returns a vector of integers idx
of type numpy.ndarray
of size (n_observations,)
that specifies
classification of each observation from the original data set,
\(\mathbf{X}\), to a local cluster.
Note
The first cluster has index 0
within all idx
vectors returned.
variable_bins
#
- PCAfold.preprocess.variable_bins(var, k, verbose=False)#
Clusters the data by dividing a variable vector
var
into bins of equal lengths.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to bins of x:
(idx, borders) = variable_bins(x, 4, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
predefined_variable_bins
#
- PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#
Clusters the data by dividing a variable vector
var
into bins such that splits are done at user-specified values. Split values can be specified in thesplit_values
list. In general:split_values = [value_1, value_2, ..., value_n]
.Note: When a split is performed at a given
value_i
, the observation invar
that takes exactly that value is assigned to the newly created bin.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import predefined_variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to pre-defined bins of x:
(idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.split_values –
list
specifying values at which splits should be performed.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
mixture_fraction_bins
#
- PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#
Clusters the data by dividing a mixture fraction vector
Z
into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PPSTS09]. The vector is first split to lean and rich side (according to the stoichiometric mixture fractionZ_stoich
) and then the sides get divided further into clusters. Whenk
is odd, there will always be one more cluster on the side with larger range in mixture fraction space compared to the other side.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import mixture_fraction_bins
import numpy as np

# Generate dummy mixture fraction variable:
Z = np.linspace(0,1,100)

# Create partitioning according to bins of mixture fraction:
(idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
- Parameters
Z –
numpy.ndarray
specifying the mixture fraction values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.Z_stoich –
float
specifying the stoichiometric mixture fraction. It has to be between 0 and 1.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
zero_neighborhood_bins
#
- PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#
Clusters the data by separating close-to-zero observations in a vector into one cluster (
split_at_zero=False
) or two clusters (split_at_zero=True
). The offset from zero at which splits are performed is computed based on the input parameterzero_offset_percentage
:\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]Further clusters are found by splitting positive and negative values in a vector alternatingly into bins of equal lengths.
This clustering technique can be useful for partitioning any variable that has many observations clustered around zero value and relatively few observations far away from zero on either side.
Two examples of how a vector can be partitioned with this function are presented below:
With
split_at_zero=False
:
If
split_at_zero=False
the smallest allowed number of clusters is 3. This is to assure that there are at least three clusters: with negative values, with close to zero values, with positive values.When
k
is even, there will always be one more cluster on the side with larger range compared to the other side.With
split_at_zero=True
:
If
split_at_zero=True
the smallest allowed number of clusters is 4. This is to assure that there are at least four clusters: with negative values, with negative values close to zero, with positive values close to zero and with positive values.When
k
is odd, there will always be one more cluster on the side with larger range compared to the other side.Note
This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\), (as per [TSP09]) since it relies on unbalanced vectors that have many observations numerically close to zero. Using
split_at_zero=True
it can further differentiate between negative and positive sources.Example:
from PCAfold import zero_neighborhood_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-100,100,1000)

# Create partitioning according to bins of x:
(idx, borders) = zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 ifsplit_at_zero=False
or smaller than 4 ifsplit_at_zero=True
.zero_offset_percentage – (optional) percentage of \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero value. For instance, set
zero_offset_percentage=10
if you want 10% as offset.split_at_zero – (optional)
bool
specifying whether partitioning should be done atvar=0
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
Auxiliary functions#
degrade_clusters
#
- PCAfold.preprocess.degrade_clusters(idx, verbose=False)#
Re-numerates clusters if either of these two cases is true:
idx
is composed of non-consecutive integers, orthe smallest cluster index in
idx
is not equal to0
.
Example:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 2, 0, 5, 10])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
The code above will produce:
>>> idx_degraded
array([0, 0, 1, 0, 2, 3])
Alternatively:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([1, 1, 2, 2, 3, 3])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
will produce:
>>> idx_degraded
array([0, 0, 1, 1, 2, 2])
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_degraded -
numpy.ndarray
of degraded cluster classifications. It has size(n_observations,)
.k_update -
int
specifying the updated number of clusters.
flip_clusters
#
- PCAfold.preprocess.flip_clusters(idx, dictionary)#
Flips cluster labelling according to instructions provided in a dictionary. For a
dictionary = {key : value}
, a cluster with a numberkey
will get a numbervalue
.Example:
from PCAfold import flip_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0,0,0,1,1,1,1,2,2])

# Swap cluster number 1 with cluster number 2:
flipped_idx = flip_clusters(idx, {1:2, 2:1})
The code above will produce:
>>> flipped_idx
array([0, 0, 0, 2, 2, 2, 2, 1, 1])
Note
This function can also be used to merge clusters. Using the
idx
from the example above, if we call:flipped_idx = flip_clusters(idx, {2:1})
the result will be:
>>> flipped_idx
array([0,0,0,1,1,1,1,1,1])
where clusters
1
and2
have been merged into one cluster numbered1
.- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.dictionary –
dict
specifying instructions for cluster label flipping.
- Returns
flipped_idx -
numpy.ndarray
specifying the re-labelled cluster classifications. It has size(n_observations,)
.
get_centroids
#
- PCAfold.preprocess.get_centroids(X, idx)#
Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the
idx
vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster, is computed as:\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]Centroids for all variables from all clusters are stored in the matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\) returned:
\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]Example:
from PCAfold import get_centroids
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute the centroids of each cluster:
centroids = get_centroids(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
centroids -
numpy.ndarray
specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size(k,n_variables)
.
get_partition
#
- PCAfold.preprocess.get_partition(X, idx)#
Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to
idx
provided.Example:
from PCAfold import get_partition
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Generate partitioning of the data set according to idx:
(X_in_clusters, idx_in_clusters) = get_partition(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
X_in_clusters -
list
of \(k\)numpy.ndarray
that contains original data set observations partitioned to \(k\) clusters. It has lengthk
.idx_in_clusters -
list
of \(k\)numpy.ndarray
that contains indices of the original data set observations partitioned to \(k\) clusters. It has lengthk
.
get_populations
#
- PCAfold.preprocess.get_populations(idx)#
Computes populations (number of observations) in clusters specified in the idx vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster, this function will return the list [100, 500].

Example:
from PCAfold import variable_bins, get_populations
import numpy as np

# Generate dummy partitioning:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=True)

# Compute cluster populations:
populations = get_populations(idx)
The code above will produce:
>>> populations
[25, 25, 25, 25]
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.- Returns
populations -
list
of cluster populations. Each entry referes to one cluster ordered according toidx
. It has lengthk
.
get_average_centroid_distance
#
- PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#
Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.
The average can be computed as an arithmetic average from all clusters (
weighted=False
) or as a weighted average (weighted=True
). In the latter, the distances are weighted by the number of observations in a cluster so that the average centroid distance will approach the average distance in the largest cluster.Example:
from PCAfold import get_average_centroid_distance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute average distance from cluster centroids:
average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.weighted – (optional)
bool
specifying whether distances from centroid should be weighted by the number of observations in a cluster. If set toFalse
, arithmetic average will be computed.
- Returns
average_centroid_distance -
float
specifying the average distance from centroids, averaged over all observations and all clusters.
Data sampling#
This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting that can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [PNey92], such as drawing samples at certain amount from local clusters [PMMD10], [PGSB04]. These functionalities can be specifically used to tackle imbalanced data sets [PHG09], [PRLM+16].
The general idea is to divide the entire data set X
(or its portion) into train and test samples as presented below:
Train data is always sampled in the same way for a given sampling function.
Depending on the option selected, test data will be sampled differently, either as all
remaining samples that were not included in train data or as a subset of those.
You can select the option by setting the test_selection_option
parameter for each sampling function.
Refer to the documentation for a specific sampling function to see what options are available.
All splitting functions in this module return a tuple of two variables: (idx_train, idx_test)
.
Both idx_train
and idx_test
are vectors of integers of type numpy.ndarray
and of size (_,)
.
These variables contain indices of observations that went into train data and test data respectively.
In your model learning algorithm you can then get the train and test observations, for instance in the following way:
X_train = X[idx_train,:]
X_test = X[idx_test,:]
All functions are equipped with verbose
parameter. If it is set to True
some additional information on train and test selection is printed.
Note
It is assumed that the first cluster has index 0
within all input idx
vectors.
Class DataSampler
#
- class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#
Enables selecting train and test data samples.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_test – (optional)
numpy.ndarray
specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices inidx_test
and the test data will be returned the same as the user-providedidx_test
. If not specified, test samples will be selected according to thetest_selection_option
parameter (see documentation for each sampling function). Setting fixedidx_test
parameter may be useful if training a machine learning model on specific test samples is desired. It should be of size(n_test_samples,)
or(n_test_samples,1)
.random_seed – (optional)
int
specifying random seed for random sample selection.verbose – (optional)
bool
for printing verbose details.
DataSampler.number
#
- PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.number(20, test_selection_option=1)
Train data:
The number of train samples is estimated based on the percentage
perc
provided. First, the total number of samples for training is estimated as a percentageperc
from the total number of observationsn_observations
in a data set. Next, this number is divided equally into \(k\) clusters. The resultn_of_samples
is the number of samples that will be selected from each cluster:\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
, the smallest cluster is found and the remaining number of observations \(m\) are taken as test data in that cluster. Next, the same number of samples \(m\) is taken from all remaining larger clusters.The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).
- Parameters
perc – percentage of data to be selected as training data from the entire data set. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.percentage
#
- PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a certain percentage
perc
from every cluster as the training data.Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.percentage(20, test_selection_option=1)
Note: If the cluster sizes are comparable, this function will give a similar train sample distribution as random sampling (
DataSampler.random
). This sampling can be useful in cases where one cluster is significantly smaller than others and there is a chance that this cluster will not get covered in the train data if random sampling was used.Train data:
The number of train samples is estimated based on the percentage
perc
provided. First, the size of the \(i^{th}\) cluster is estimatedcluster_size_i
and then a percentageperc
of that number is selected.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(p\) is the percentage
perc
provided.- Parameters
perc – percentage of data to be selected as training data from each cluster. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.manual
#
- PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#
Uses classifications into \(k\) clusters and a dictionary
sampling_dictionary
in which you manually specify what'percentage'
(or what'number'
) of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as peridx
and the dictionary values are either percentage or number of train samples to be selected. The default dictionary values are percentage but you can selectsampling_type='number'
in order to interpret the values as a number of samples.Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)
Train data:
The number of train samples selected from each cluster is estimated based on the
sampling_dictionary
. Forkey : value
, percentagevalue
(or numbervalue
) of samples will be selected from clusterkey
.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as number or as percentage. This can be achieved by setting:
sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
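A complementary illustration with sampling_type='percentage'; this is an illustrative sketch using the same dummy clustering as in the example above:

from PCAfold import DataSampler
import numpy as np

# Dummy clustering with three clusters of sizes 6, 4 and 6:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
selection = DataSampler(idx, random_seed=100)

# Request 50% of samples from cluster 0, 25% from cluster 1 and 50% from cluster 2:
(idx_train, idx_test) = selection.manual({0:50, 1:25, 2:50}, sampling_type='percentage', test_selection_option=1)

# Train data: 3 + 1 + 3 = 7 samples; test data: the remaining 9 samples.
print(len(idx_train), len(idx_test))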
- Parameters
sampling_dictionary –
dict
specifying manual sampling. Keys are cluster classifications and values are eitherpercentage
ornumber
of samples to be taken from that cluster. Keys should match the cluster classifications as peridx
.sampling_type – (optional)
str
specifying whether percentage or number is given in thesampling_dictionary
. Available options:percentage
ornumber
. The default ispercentage
.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.random
#
- PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#
Samples train data at random from the entire data set.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Due to the nature of this sampling technique, it is not necessary to have
idx
classifications since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummyidx
vector that has the same number of observationsn_observations
as your data set. For instance:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
n_observations = 100
idx = np.zeros(n_observations)

# Instantiate DataSampler class object:
selection = DataSampler(idx)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Train data:
The total number of train samples is computed as a percentage
perc
from the total number of observations in a data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure is used to select test data as was used to select train data (only allowed ifperc
is less than 50%).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(p\) is the percentage
perc
provided.- Parameters
perc – percentage of data to be selected as training data from each cluster. Set
perc=20
if you want 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
Plotting functions#
This section includes functions for data preprocessing related plotting such as visualizing the formed clusters, visualizing the selected train and test samples or plotting the conditional statistics.
plot_2d_clustering
#
- PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into clusters. Number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_2d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(10,6), title='x-y data set', save_filename='clustering.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.clean – (optional)
bool
specifying if a clean plot should be made. If set toTrue
, nothing else but the data points is plotted.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.alphas – (optional)
list
specifying the opacity of each cluster.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.grid_on –
bool
specifying whether grid should be plotted.s – (optional)
int
orfloat
specifying the scatter point size.markerscale – (optional)
int
orfloat
specifying the scale for the legend marker.legend – (optional)
bool
specifying whether the legend should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_3d_clustering
#
- PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold divided into clusters. Number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_3d_clustering(x, y, z, idx, x_label='$x$', y_label='$y$', z_label='$z$', color_map='viridis', first_cluster_index_zero=False, figure_size=(10,6), title='x-y-z data set', save_filename='clustering.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.z –
numpy.ndarray
specifying the variable on the \(z\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.elev – (optional) elevation angle.
azim – (optional) azimuth angle.
x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.z_label – (optional)
str
specifying \(z\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.alphas – (optional)
list
specifying the opacity of each cluster.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.s – (optional)
int
orfloat
specifying the scatter point size.markerscale – (optional)
int
orfloat
specifying the scale for the legend marker.legend – (optional)
bool
specifying whether the legend should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_2d_train_test_samples
#
- PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into train and test samples. Number of observations in train and test data respectively will be plotted in the legend.
Example:
from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, borders) = variable_bins(x, 4, verbose=False)

# Generate dummy sampling of the data set:
sample = DataSampler(idx, random_seed=None, verbose=True)
(idx_train, idx_test) = sample.number(40, test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(12,6), title='x-y data set', save_filename='sampling.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_train –
numpy.ndarray
specifying the indices of the train data. It should be of size(n_train,)
or(n_train,1)
.idx_test –
numpy.ndarray
specifying the indices of the test data. It should be of size(n_test,)
or(n_test,1)
.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_conditional_statistics
#
- PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given by
variable
andconditioning_variable
and the selected conditional statistics (as perpreprocess.ConditionalStatistics
).Example:
from PCAfold import PCA, plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional statistics:
plt = plot_conditional_statistics(y, conditioning_variable, k=10, x_label='$x$', y_label='$y$', figure_size=(10,3), title='Conditional mean', save_filename='conditional-mean.pdf')
plt.close()
- Parameters
variable –
numpy.ndarray
specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.statistics_to_plot –
list
ofstr
specifying conditional statistics to plot. The strings can bemean
,min
,max
orstd
color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in the variable and conditioning_variable vectors. It should be of type numpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, data will be plotted in black.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.colorbar_label – (optional) string specifying colorbar label annotation. If set to
None
, colorbar label will not be plotted.color_map – (optional) colormap to use as per
matplotlib.cm
. Default is viridis.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Bibliography#
- PBis06
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- PCVJ21(1,2)
Gunnar Carlsson and Mikael Vejdemo-Johansson. Topological Data Analysis with Applications. Cambridge University Press, 2021.
- PCGP12
Axel Coussement, Olivier Gicquel, and Alessandro Parente. Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9):2844–2855, 2012.
- PDMJRM00
Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.
- PELL09
Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009. ISBN 0340761199.
- PGSB04
Abdul A. Gill, George D. Smith, and Anthony J. Bagnall. Improving decision tree performance through induction-and cluster-based stratified sampling. In International Conference on Intelligent Data Engineering and Automated Learning, 339–344. Springer, 2004.
- PHG09
Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
- PKR09
Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Volume 344. John Wiley & Sons, 2009.
- PKK04
Michael R Keenan and Paul G Kotula. Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images. Surface and Interface Analysis: An International Journal devoted to the development and application of techniques for the analysis of surfaces, interfaces and thin films, 36(3):203–212, 2004.
- PKEA+03
Hector C Keun, Timothy MD Ebbels, Henrik Antti, Mary E Bollard, Olaf Beckonert, Elaine Holmes, John C Lindon, and Jeremy K Nicholson. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica chimica acta, 490(1-2):265–276, 2003.
- PMMD10
Robert J. May, Holger R. Maier, and Graeme C. Dandy. Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks, 23(2):283–294, 2010.
- PNey92
Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in Statistics, pages 123–150. Springer, 1992.
- PNod08
Isao Noda. Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure, 883:216–227, 2008.
- PPS13(1,2)
Alessandro Parente and James C. Sutherland. Principal component analysis of turbulent combustion data: data pre-processing and manifold sensitivity. Combustion and flame, 160(2):340–350, 2013.
- PPSTS09
Alessandro Parente, James C. Sutherland, Leonardo Tognotti, and Philip J. Smith. Identification of low-dimensional manifolds in turbulent flames. Proceedings of the Combustion Institute, 32(1):1579–1586, 2009.
- PRLM+16
Mojdeh Rastgoo, Guillaume Lemaitre, Joan Massich, Olivier Morel, Franck Marzani, Rafael Garcia, and Fabrice Meriaudeau. Tackling the problem of data imbalancing for melanoma classification. BIOSTEC - 3rd International Conference on BIOIMAGING, 2016.
- PSCSC03
Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING, 2003.
- PvdBHW+06(1,2,3)
Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7(1):1–15, 2006.
Data reduction#
The reduction
module contains functions for performing Principal Component
Analysis (PCA).
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold
is the input parameter X
, which should be of type numpy.ndarray
and of size (n_observations,n_variables)
.
Principal Component Analysis#
Class PCA
#
- class PCAfold.reduction.PCA(X, scaling='std', n_components=0, use_eigendec=True, nocenter=False)#
Enables performing Principal Component Analysis (PCA) of the original data set, \(\mathbf{X}\). For more detailed information on the theory of PCA the user is referred to [RJolliffe02].
Two options for performing PCA are implemented:
- Eigendecomposition of the covariance matrix - set use_eigendec=True (default)
- Singular Value Decomposition (SVD) - set use_eigendec=False
Centering and scaling (as per preprocess.center_scale function):
- If nocenter=False: \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)
- If nocenter=True: \(\mathbf{X_{cs}} = \mathbf{X} \cdot \mathbf{D}^{-1}\)
The two approaches are summarized below:
- Eigendecomposition of the covariance matrix \(\mathbf{S}\): modes are the eigenvectors, \(\mathbf{A}\), and amplitudes are the eigenvalues, \(\mathbf{L}\).
- SVD, \(\mathbf{X_{cs}} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{\mathbf{T}}\): modes are \(\mathbf{A} = \mathbf{V}\) and amplitudes are \(\mathbf{L} = diag(\mathbf{\Sigma})\).
Note: For simplicity, we will from now on refer to \(\mathbf{A}\) as the matrix of eigenvectors and to \(\mathbf{L}\) as the vector of eigenvalues, irrespective of the method used to perform PCA.
Covariance matrix is computed at the class initialization as:
\[\mathbf{S} = \frac{1}{N-1} \mathbf{X_{cs}}^{\mathbf{T}} \mathbf{X_{cs}}\]where \(N\) is the number of observations in the original data set, \(\mathbf{X}\).
Loadings matrix, \(\mathbf{l}\), is computed at the class initialization such that the element \(\mathbf{l}_{ij}\) is the corresponding scaled element of the eigenvectors matrix, \(\mathbf{A}_{ij}\):
\[\mathbf{l}_{ij} = \frac{\mathbf{A}_{ij} \sqrt{\mathbf{L}_j}}{\sqrt{\mathbf{S}_{ii}}}\]where \(\mathbf{L}_j\) is the \(j^{th}\) eigenvalue and \(\mathbf{S}_{ii}\) is the \(i^{th}\) element on the diagonal of the covariance matrix, \(\mathbf{S}\).
The variance accounted for in each individual variable by the first \(q\) PCs, \(\mathbf{t_q}\), is computed at the class initialization:
\[\mathbf{t}_{\mathbf{q}i} = \sum_{j=1}^{q} \Bigg( \frac{\mathbf{A}_{ij} \sqrt{\mathbf{L}_j}}{ s_i } \Bigg)^2\]where \(q\) is the number of retained principal components and \(s_i\) is the standard deviation of the \(i^{th}\) variable in the data set.
Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Access the eigenvectors:
A = pca_X.A

# Access the eigenvalues:
L = pca_X.L

# Access the loadings:
l = pca_X.loadings

# Access the variance accounted for in each variable:
tq = pca_X.tq
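The quantities defined above can be cross-checked directly from the class attributes. The following is a minimal illustrative sketch (not part of the PCAfold API) that recomputes the covariance matrix and the loadings from X_cs, A, L and S, assuming all principal components are retained:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Retain all PCs so that A, L and the loadings are full-size:
pca_X = PCA(X, scaling='auto', n_components=0)

# Recompute the covariance matrix from the centered and scaled data:
N = X.shape[0]
S_check = (pca_X.X_cs.T @ pca_X.X_cs) / (N - 1)
print(np.allclose(S_check, pca_X.S))

# Recompute the loadings from the eigenvectors, eigenvalues and covariance diagonal
# (expected to reproduce pca_X.loadings per the formula above):
loadings_check = (pca_X.A * np.sqrt(pca_X.L)) / np.sqrt(np.diag(pca_X.S))[:, None]
print(np.allclose(loadings_check, pca_X.loadings))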
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.n_components – (optional)
int
specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.use_eigendec –
(optional)
bool
specifying the method for obtaining eigenvalues and eigenvectors:use_eigendec=True
uses eigendecomposition of the covariance matrix (fromnumpy.linalg.eigh
)use_eigendec=False
uses Singular Value Decomposition (SVD) (fromscipy.linalg.svd
)
nocenter – (optional)
bool
specifying whether the original data set should be centered by the mean.
Attributes:
n_components - (can be re-set) number of retained principal components, \(q\).
n_components_init - (read only) number of retained principal components, \(q\), with which
PCA
class object was initialized.scaling - (read only) scaling criteria with which
PCA
class object was initialized.n_variables - (read only) number of variables of the original data set on which
PCA
class object was initialized.X_cs - (read only) centered and scaled data set \(\mathbf{X_{cs}}\).
X_center - (read only) vector of centers, \(\mathbf{C}\), applied on the original data set \(\mathbf{X}\).
X_scale - (read only) vector of scales, \(\mathbf{D}\), applied on the original data set \(\mathbf{X}\).
S - (read only) covariance matrix, \(\mathbf{S}\).
L - (read only) vector of eigenvalues, \(\mathbf{L}\).
A - (read only) matrix of eigenvectors, \(\mathbf{A}\), (vectors are stored in columns, rows correspond to weights).
loadings - (read only) loadings, \(\mathbf{l}\), (vectors are stored in columns, rows correspond to weights).
tq - (read only) variance accounted for in each individual variable, \(\mathbf{t_q}\).
tqj - (read only) variance accounted for in each individual variable in each individual PC, \(\mathbf{t_{q,j}}\).
PCA.transform
#
- PCAfold.reduction.PCA.transform(self, X, nocenter=False)#
Transforms any original data set, \(\mathbf{X}\), to a new truncated basis, \(\mathbf{A}_q\), identified by PCA. It computes the \(q\) first principal components, \(\mathbf{Z}_q\), given the original data.
If
nocenter=False
:\[\mathbf{Z}_q = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]If
nocenter=True
:\[\mathbf{Z}_q = \mathbf{X} \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]Here \(\mathbf{C}\) and \(\mathbf{D}\) are centers and scales computed during
PCA
class initialization and \(\mathbf{A}_q\) is the matrix of \(q\) first eigenvectors extracted from \(\mathbf{A}\).Warning
Set
nocenter=True
only if you know what you are doing.One example when
nocenter
should be set toTrue
is when transforming chemical source terms, \(\mathbf{S_X}\), to principal components space (as per [TSP09]) to obtain sources of principal components, \(\mathbf{S_Z}\). In that case \(\mathbf{X} = \mathbf{S_X}\) and the transformation should be performed without centering:\[\mathbf{S}_{\mathbf{Z}, q} = \mathbf{S_X} \cdot \mathbf{D}^{-1} \cdot \mathbf{A}_q\]Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)
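The same projection can be reproduced manually from the attributes documented above; a minimal illustrative sketch (assuming A stores the eigenvectors in columns):

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Principal components from the class method:
Z_q = pca_X.transform(X)

# Reproduce the projection manually from the documented formula:
A_q = pca_X.A[:, 0:2]
Z_q_manual = ((X - pca_X.X_center) / pca_X.X_scale) @ A_q
print(np.allclose(Z_q, Z_q_manual))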
- Parameters
X –
numpy.ndarray
specifying the data set \(\mathbf{X}\) to transform. It should be of size(n_observations,n_variables)
. Note that it does not need to be the same data set that was used to construct thePCA
class object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.nocenter – (optional)
bool
specifying whetherPCA.X_center
centers should be applied to center the data set before transformation. Ifnocenter=True
centers will not be applied on the data set.
- Returns
principal_components -
numpy.ndarray
specifying the \(q\) first principal components \(\mathbf{Z}_q\). It has size(n_observations,n_components)
.
PCA.reconstruct
#
- PCAfold.reduction.PCA.reconstruct(self, principal_components, nocenter=False)#
Calculates rank-\(q\) reconstruction of the data set from the \(q\) first principal components, \(\mathbf{Z}_q\).
If
nocenter=False
:\[\mathbf{X_{rec}} = \mathbf{Z}_q \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D} + \mathbf{C}\]If
nocenter=True
:\[\mathbf{X_{rec}} = \mathbf{Z}_q \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D}\]Here \(\mathbf{C}\) and \(\mathbf{D}\) are centers and scales computed during
PCA
class initialization and \(\mathbf{A}_q\) is the matrix of \(q\) first eigenvectors extracted from \(\mathbf{A}\).Warning
Set
nocenter=True
only if you know what you are doing.One example when
nocenter
should be set toTrue
is when reconstructing chemical source terms, \(\mathbf{S_X}\), (as per [TSP09]) from the \(q\) first sources of principal components, \(\mathbf{S}_{\mathbf{Z}, q}\). In that case \(\mathbf{Z}_q = \mathbf{S}_{\mathbf{Z}, q}\) and the reconstruction should be performed without uncentering:\[\mathbf{S_{X, rec}} = \mathbf{S}_{\mathbf{Z}, q} \mathbf{A}_q^{\mathbf{T}} \cdot \mathbf{D}\]Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Calculate the reconstructed variables:
X_rec = pca_X.reconstruct(principal_components)
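A quick sanity check of the round trip between PCA.transform and PCA.reconstruct; this is only an illustrative sketch:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# With all PCs retained, the reconstruction recovers the original data:
pca_X = PCA(X, scaling='auto', n_components=0)
X_rec = pca_X.reconstruct(pca_X.transform(X))
print(np.allclose(X, X_rec))

# With a truncated basis, the reconstruction is only approximate:
pca_X_2 = PCA(X, scaling='auto', n_components=2)
X_rec_2 = pca_X_2.reconstruct(pca_X_2.transform(X))
print(np.mean((X - X_rec_2)**2))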
- Parameters
principal_components –
numpy.ndarray
of \(q\) first principal components, \(\mathbf{Z}_q\). It should be of size(n_observations,n_components)
.nocenter – (optional)
bool
specifying whetherPCA.X_center
centers should be applied to un-center the reconstructed data set. Ifnocenter=True
centers will not be applied on the reconstructed data set.
- Returns
X_rec - rank-\(q\) reconstruction of the original data set.
PCA.get_weights_dictionary
#
- PCAfold.reduction.PCA.get_weights_dictionary(self, variable_names, pc_index, n_digits=10)#
Creates a dictionary where keys are the names of the variables in the original data set \(\mathbf{X}\) and values are the eigenvector weights corresponding to the principal component selected by
pc_index
. This function helps in accessing weight value for a specific variable and for a specific PC.Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy variables names:
variable_names = ['A1', 'A2', 'A3', 'A4', 'A5']

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=0, use_eigendec=True, nocenter=False)

# Create a dictionary for PC-1 weights:
PC1_weights_dictionary = pca_X.get_weights_dictionary(variable_names, 0, n_digits=8)
The code above will create a dictionary:
{'A1': 0.63544443, 'A2': -0.39500424, 'A3': -0.28819465, 'A4': 0.57000796, 'A5': 0.17949037}
Eigenvector weight for a specific variable can then be accessed by:
PC1_weights_dictionary['A3']
- Parameters
variable_names –
list
ofstr
specifying names for all variables in the original data set, \(\mathbf{X}\).pc_index – non-negative
int
specifying the index of the PC to create the dictionary for. Setpc_index=0
if you want to look at the first PC.n_digits – (optional) non-negative
int
specifying how many digits should be kept in rounding the eigenvector weights.
- Returns
weights_dictionary -
dict
of variable names as keys and selected eigenvector weights as values.
PCA.u_scores
#
- PCAfold.reduction.PCA.u_scores(self, X)#
Calculates the U-scores (principal components):
\[\mathbf{U_{scores}} = \mathbf{X_{cs}} \mathbf{A}_q\]This function is equivalent to
PCA.transform
.Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the U-scores:
u_scores = pca_X.u_scores(X)
- Parameters
X – data set to transform. Note that it does not need to be the same data set that was used to construct the PCA object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.
- Returns
u_scores - U-scores (principal components).
PCA.w_scores
#
- PCAfold.reduction.PCA.w_scores(self, X)#
Calculates the W-scores which are the principal components scaled by the inverse square root of the corresponding eigenvalue:
\[\mathbf{W_{scores}} = \frac{\mathbf{Z}_q}{\sqrt{\mathbf{L_q}}}\]
where \(\mathbf{L_q}\) are the \(q\) first eigenvalues extracted from \(\mathbf{L}\). The W-scores are still uncorrelated and have variances equal to unity.
Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the W-scores:
w_scores = pca_X.w_scores(X)
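The relation to the principal components can be checked directly; a minimal illustrative sketch (assuming L stores the eigenvalues in descending order):

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# W-scores from the class method:
w_scores = pca_X.w_scores(X)

# Equivalent computation from the principal components and eigenvalues:
Z_q = pca_X.transform(X)
w_scores_check = Z_q / np.sqrt(pca_X.L[0:10])
print(np.allclose(w_scores, w_scores_check))

# The W-scores should have (approximately) unit sample variance:
print(np.var(w_scores, axis=0, ddof=1))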
- Parameters
X – data set to transform. Note that it does not need to be the same data set that was used to construct the PCA object. It could for instance be a function of that data set. By default, this data set will be pre-processed with the centers and scales computed on the data set used when constructing the PCA object.
- Returns
w_scores - W-scores (scaled principal components).
PCA.calculate_r2
#
- PCAfold.reduction.PCA.calculate_r2(self, X)#
Calculates coefficient of determination, \(R^2\), values for the rank-\(q\) reconstruction, \(\mathbf{X_{rec}}\), of the original data set, \(\mathbf{X}\):
\[R^2 = 1 - \frac{\sum_{i=1}^N (\mathbf{X}_i - \mathbf{X_{rec}}_i)^2}{\sum_{i=1}^N (\mathbf{X}_i - mean(\mathbf{X}_i))^2}\]where \(\mathbf{X}_i\) is the \(i^{th}\) column of \(\mathbf{X}\), \(\mathbf{X_{rec}}_i\) is the \(i^{th}\) column of \(\mathbf{X_{rec}}\) and \(N\) is the number of observations in \(\mathbf{X}\).
If all of the eigenvalues are retained, \(R^2\) will be equal to unity.
Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10, use_eigendec=True, nocenter=False)

# Calculate the R2 values:
r2 = pca_X.calculate_r2(X)
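The returned values can be reproduced from the rank-\(q\) reconstruction using the formula above; a minimal illustrative sketch:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# R2 values from the class method:
r2 = pca_X.calculate_r2(X)

# Recompute R2 for each variable from the rank-q reconstruction:
X_rec = pca_X.reconstruct(pca_X.transform(X))
numerator = np.sum((X - X_rec)**2, axis=0)
denominator = np.sum((X - np.mean(X, axis=0))**2, axis=0)
r2_check = 1.0 - numerator / denominator
print(np.allclose(r2, r2_check))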
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.- Returns
r2 -
numpy.ndarray
specifying the coefficient of determination values \(R^2\) for the rank-\(q\) reconstruction of the original data set. It has size(n_variables,)
.
PCA.r2_convergence
#
- PCAfold.reduction.PCA.r2_convergence(self, X, n_pcs, variable_names=[], print_width=10, verbose=False, save_filename=None)#
Returns and optionally prints and/or saves to
.txt
file \(R^2\) values (as perPCA.calculate_r2
function) for reconstruction of the original data set \(\mathbf{X}\) as a function of number of retained principal components (PCs).Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=3)

# Compute and print convergence of R2 values:
r2 = pca_X.r2_convergence(X, n_pcs=3, variable_names=['X1', 'X2', 'X3'], print_width=10, verbose=True)
The code above will print \(R^2\) values retaining 1-3 PCs:
| n PCs | X1         | X2         | X3         | Mean       |
| 1     | 0.17857365 | 0.53258736 | 0.49905763 | 0.40340621 |
| 2     | 0.99220888 | 0.57167479 | 0.61150487 | 0.72512951 |
| 3     | 1.0        | 1.0        | 1.0        | 1.0        |
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_pcs – the maximum number of PCs to consider.
variable_names – (optional)
list
of ‘str’ specifying variable names. If not specified variables will be numbered.print_width – (optional) width of columns printed out.
verbose – (optional)
bool
for printing out the table with \(R^2\) values.save_filename – (optional)
str
specifying.txt
save location/filename.
- Returns
r2 - matrix of size
(n_pcs, n_variables)
containing the \(R^2\) values for each variable as a function of the number of retained PCs.
PCA.set_retained_eigenvalues
#
- PCAfold.reduction.PCA.set_retained_eigenvalues(self, method='SCREE GRAPH', option=None)#
Helps determine how many principal components (PCs) should be retained. The following methods are available:
- 'TOTAL VARIANCE' - retain the PCs whose eigenvalues account for a specific percentage of the total variance. The required number of PCs is then the smallest value of \(q\) for which this chosen percentage is exceeded. The fraction of variance can be supplied using the option parameter. For instance, set option=0.6 if you want to account for 60% of the variance. If the variance is not supplied in the option parameter, the user will be asked for input during function execution.
- 'INDIVIDUAL VARIANCE' - retain the PCs whose eigenvalues are greater than the average of the eigenvalues [RKai60] or than 0.7 times the average of the eigenvalues [RJol72]. For a correlation matrix this average equals 1. The fraction of variance can be supplied using the option parameter. For instance, set option=0.6 if you want to account for 60% of the variance. If the variance is not supplied in the option parameter, the user will be asked for input during function execution.
- 'BROKEN STICK' - retain the PCs according to the Broken Stick Model [RFro76].
- 'SCREE GRAPH' - retain the PCs using the scree graph, a plot of the eigenvalues against their indexes, and look for a natural break between the large and small eigenvalues.
For more detailed information on the options implemented here the user is referred to [RJolliffe02].
Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto')

# Compute a new PCA class object with the new number of retained components:
pca_X_new = pca_X.set_retained_eigenvalues(method='TOTAL VARIANCE', option=0.6)

# The new number of principal components that has been set:
print(pca_X_new.n_components)
This function provides a few methods to select the number of eigenvalues to be retained in the PCA reduction.
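For intuition, the idea behind the 'TOTAL VARIANCE' criterion can be sketched directly from the eigenvalues. The snippet below is illustrative only and does not reproduce the interactive behaviour of this method:

from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)
pca_X = PCA(X, scaling='auto')

# Fraction of the total variance captured by the first q PCs:
cumulative_variance = np.cumsum(pca_X.L) / np.sum(pca_X.L)

# Smallest q for which 60% of the total variance is reached:
q = int(np.argmax(cumulative_variance >= 0.6)) + 1
print(q)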
- Parameters
method – (optional)
str
specifying the method to use in selecting retained eigenvalues.option – (optional) additional parameter used for the
'TOTAL VARIANCE'
and'INDIVIDUAL VARIANCE'
methods. If not supplied, information will be obtained interactively.
- Returns
pca - the PCA object with the number of retained eigenvalues set on it.
PCA.principal_variables
#
- PCAfold.reduction.PCA.principal_variables(self, method='B2', x=[])#
Extracts Principal Variables (PVs) from a PCA.
The following methods are currently supported:
'B4'
- selects Principal Variables based on the variables contained in the eigenvectors corresponding to the largest eigenvalues [RJol72].'B2'
- selects Principal Variables based on the variables contained in the smallest eigenvalues. These are discarded and the remaining variables are used as the PVs [RJol72].'M2'
- at each iteration, each remaining variable is analyzed via PCA [RKrz87]. Note: this is a very expensive method.
For more detailed information on the options implemented here the user is referred to [RJolliffe02].
Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto')

# Select Principal Variables (PVs) using the M2 method:
principal_variables_indices = pca_X.principal_variables(method='M2', x=X)
- Parameters
method – (optional)
str
specifying the method for determining the Principal Variables (PVs).x – (optional) data set to accompany
'M2'
method. Note that this is only required for the'M2'
method.
- Returns
principal_variables_indices - a vector of indices of retained Principal Variables (PVs).
PCA.data_consistency_check
#
- PCAfold.reduction.PCA.data_consistency_check(self, X, errors_are_fatal=False)#
Checks if the supplied data matrix
X
is consistent with the currentPCA
class object.Example:
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=10)

# This data set will be consistent:
X_1 = np.random.rand(50,20)
is_consistent = pca_X.data_consistency_check(X_1)

# This data set will not be consistent but will not throw ValueError:
X_2 = np.random.rand(100,10)
is_consistent = pca_X.data_consistency_check(X_2)

# This data set will not be consistent and will throw ValueError:
X_3 = np.random.rand(100,10)
is_consistent = pca_X.data_consistency_check(X_3, errors_are_fatal=True)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.errors_are_fatal – (optional)
bool
indicating ifValueError
should be raised if an incompatibility is detected.
- Returns
is_consistent -
bool
specifying whether or not the supplied data matrix \(\mathbf{X}\) is consistent with thePCA
class object.
PCA.save_to_txt
#
- PCAfold.reduction.PCA.save_to_txt(self, save_filename)#
Writes the eigenvector matrix, \(\mathbf{A}\), loadings, \(\mathbf{l}\), centering, \(\mathbf{C}\), and scaling, \(\mathbf{D}\), to a .txt file.
from PCAfold import PCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=5)

# Save the PCA results to .txt:
pca_X.save_to_txt('pca_X_Data.txt')
- Parameters
save_filename –
str
specifying.txt
save location/filename.
Local Principal Component Analysis#
Class LPCA
#
- class PCAfold.reduction.LPCA(X, idx, scaling='std', n_components=0, use_eigendec=True, nocenter=False, verbose=False)#
Enables performing local Principal Component Analysis (LPCA) of the original data set, \(\mathbf{X}\), partitioned into clusters.
Example:
from PCAfold import LPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy vector of cluster classifications:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Instantiate LPCA class object:
lpca_X = LPCA(X, idx, scaling='none', n_components=2)

# Access the local covariance matrix in the first cluster:
S_k1 = lpca_X.S[0]

# Access the local eigenvectors in the first cluster:
A_k1 = lpca_X.A[0]

# Access the local eigenvalues in the first cluster:
L_k1 = lpca_X.L[0]

# Access the local principal components in the first cluster:
Z_k1 = lpca_X.principal_components[0]

# Access the local loadings in the first cluster:
l_k1 = lpca_X.loadings[0]

# Access the local variance accounted for in each individual variable in the first cluster:
tq_k1 = lpca_X.tq[0]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.scaling – (optional)
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.n_components – (optional)
int
specifying the number of returned eigenvectors, eigenvalues and principal components, \(q\). If set to 0 all are returned.use_eigendec –
(optional)
bool
specifying the method for obtaining eigenvalues and eigenvectors:use_eigendec=True
uses eigendecomposition of the covariance matrix (fromnumpy.linalg.eigh
)use_eigendec=False
uses Singular Value Decomposition (SVD) (fromscipy.linalg.svd
)
nocenter – (optional)
bool
specifying whether data should be centered by mean.verbose – (optional)
bool
for printing verbose details.
Attributes:
S - (read only)
list
ofnumpy.ndarray
specifying the local covariance matrix, \(\mathbf{S}\). Each list element corresponds to the covariance matrix in a single cluster.A - (read only)
list
ofnumpy.ndarray
specifying the local eigenvectors, \(\mathbf{A}\). Each list element corresponds to eigenvectors in a single cluster.L - (read only)
list
ofnumpy.ndarray
specifying the local eigenvalues, \(\mathbf{L}\). Each list element corresponds to eigenvalues in a single cluster.principal_components - (read only)
list
ofnumpy.ndarray
specifying the local principal components, \(\mathbf{Z}\). Each list element corresponds to principal components in a single cluster.loadings - (read only)
list
ofnumpy.ndarray
specifying the local loadings, \(\mathbf{l}\). Each list element corresponds to loadings in a single cluster.tq - (read only)
list
ofnumpy.ndarray
specifying the local variance accounted for in each individual variable by the first \(q\) PCs, \(\mathbf{t_q}\). Each list element corresponds to variance metric in a single cluster.X_reconstructed - (read only)
numpy.ndarray
specifying the dataset reconstructed from local PCA using the first \(q\) PCs. It has size(n_observations,n_variables)
.R2 - (read only)
list
specifying the average coefficient of determination for each cluster reconstructed using the first \(q\) PCs. Each list element corresponds to each reconstructed cluster and is averaged over all non-constant state variables in that cluster.idx_retained_in_clusters - (read only)
list
oflist
specifying the variables retained in each cluster. If a variable within a particular cluster becomes constant, it will be removed from this list.
LPCA.local_correlation
#
- PCAfold.reduction.LPCA.local_correlation(self, variable, index=0, metric='pearson', display=None, verbose=False)#
Computes a correlation in each cluster and a globally-averaged correlation between the local principal component, PC, and some specified variable, \(\phi\). The average is taken from each of the \(k\) clusters. Correlation in the \(n^{th}\) cluster is referred to as \(r_n(\mathrm{PC}, \phi)\).
Available correlation functions are:
Pearson correlation coefficient (PCC), set
metric='pearson'
:
\[r_n(\mathrm{PC}, \phi) = \mathrm{abs} \Bigg( \frac{\sum_{i=1}^{N_n} (\mathrm{PC}_i - \overline{\mathrm{PC}}) (\phi_i - \bar{\phi})}{\sqrt{\sum_{i=1}^{N_n} (\mathrm{PC}_i - \overline{\mathrm{PC}})^2} \sqrt{\sum_{i=1}^{N_n} (\phi_i - \bar{\phi})^2}} \Bigg)\]where \(N_n\) is the number of observations in the \(n^{th}\) cluster.
Spearman correlation coefficient (SCC), set
metric='spearman'
.Distance correlation (dCor), set
metric='dcor'
:
\[r_n(\mathrm{PC}, \phi) = \sqrt{ \frac{\mathrm{dCov}(\mathrm{PC}_n, \phi_n)}{\mathrm{dCov}(\mathrm{PC}_n, \mathrm{PC}_n) \mathrm{dCov}(\phi_n, \phi_n)} }\]where \(\mathrm{dCov}\) is the distance covariance computed for any two variables, \(X\) and \(Y\), as:
\[\mathrm{dCov}(X,Y) = \sqrt{ \frac{1}{N^2} \sum_{i,j=1}^N x_{i,j} y_{i,j}}\]where \(x_{i,j}\) and \(y_{i,j}\) are the elements of the double-centered Euclidean distances matrices for \(X\) and \(Y\) observations respectively. \(N\) is the total number of observations. Note, that the distance correlation computation allows \(X\) and \(Y\) to have different dimensions.
Note
The distance correlation computation requires the
dcor
module. You can install it through:pip install dcor
Globally-averaged correlation metric is computed in two variants:
Weighted, where each local correlation is weighted by the size of each cluster:
\[\bar{r} = \frac{1}{N} \sum_{n=1}^k N_n r_n(\mathrm{PC}, \phi)\]Unweighted, which computes an arithmetic average of local correlations from all clusters:
\[r = \frac{1}{k} \sum_{n=1}^k r_n(\mathrm{PC}, \phi)\]Example:
from PCAfold import predefined_variable_bins, LPCA
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,1000)
y = -x**2 + 1
X = np.hstack((x[:,None], y[:,None]))

# Generate dummy vector of cluster classifications:
(idx, _) = predefined_variable_bins(x, [-0.9, 0, 0.6])

# Instantiate LPCA class object:
lpca = LPCA(X, idx, scaling='none')

# Compute local Pearson correlation coefficient between PC-1 and y:
(local_correlations, weighted, unweighted) = lpca.local_correlation(y, index=0, metric='pearson', verbose=True)
With
verbose=True
we will see some detailed information:
PCC in cluster 1: 0.999996
PCC in cluster 2: -0.990817
PCC in cluster 3: 0.983221
PCC in cluster 4: 0.999838
Globally-averaged weighted correlation: 0.990801
Globally-averaged unweighted correlation: 0.993468
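The two globally-averaged metrics can be reproduced from the per-cluster correlations and the cluster sizes; a minimal illustrative sketch using the same dummy data set as in the example above (absolute values are used, consistent with the correlation definitions above):

from PCAfold import predefined_variable_bins, LPCA
import numpy as np

# Same dummy data set as in the example above:
x = np.linspace(-1,1,1000)
y = -x**2 + 1
X = np.hstack((x[:,None], y[:,None]))
(idx, _) = predefined_variable_bins(x, [-0.9, 0, 0.6])
lpca = LPCA(X, idx, scaling='none')
(local_correlations, weighted, unweighted) = lpca.local_correlation(y, index=0, metric='pearson')

# Reproduce the weighted and unweighted averages from the per-cluster values:
k = len(local_correlations)
cluster_sizes = np.array([np.sum(idx == i) for i in range(k)])
N = np.sum(cluster_sizes)
weighted_check = np.sum(cluster_sizes * np.abs(local_correlations)) / N
unweighted_check = np.mean(np.abs(local_correlations))
print(weighted_check, unweighted_check)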
- Parameters
variable –
numpy.ndarray
specifying the variable, \(\phi\), for correlation computation. It should be of size(n_observations,)
or(n_observations,1)
or(n_observations,n_variables)
whenmetric='dcor'
.index – (optional)
int
specifying the index of the local principal component for correlation computation. Setindex=0
if you want to look at the first PC.metric – (optional)
str
specifying the correlation metric to use. It can be'pearson'
,'spearman'
or'dcor'
.display – (optional)
str
specifying the display format for the correlations. It can be'abs'
,'percent'
,'abs-percent'
.verbose – (optional)
bool
for printing verbose details.
- Returns
local_correlations -
numpy.ndarray
specifying the computed correlation in each cluster. It has size(k,)
.weighted -
float
specifying the globally-averaged weighted correlation.unweighted -
float
specifying the globally-averaged unweighted correlation.
Vector Quantization Principal Component Analysis#
Class VQPCA
#
- class PCAfold.reduction.VQPCA(X, n_clusters, n_components, scaling='std', idx_init='random', max_iter=300, tolerance=None, random_state=None, verbose=False)#
Enables performing Vector Quantization Principal Component Analysis (VQPCA).
The VQPCA algorithm was first proposed in [RKL97] and its modified version that we present here was developed by [PPSTS09]. VQPCA assigns observations to local clusters based on the minimum reconstruction error from PCA approximation with
n_components
number of Principal Components. This is an iterative procedure: at each iteration, the reconstruction error is evaluated for every observation as if that observation belonged to cluster \(j\), and the observation is then assigned to the cluster for which the reconstruction error is the smallest.
Note
The VQPCA algorithm centers the global data set \(\mathbf{X}\) by the mean value and scales it by the scaling specified with the
scaling
parameter. Data in local clusters is centered by the mean value but is not scaled.Example:
from PCAfold import VQPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(400,10)

# Instantiate VQPCA class object:
vqpca = VQPCA(X, n_clusters=3, n_components=2, scaling='std', idx_init='random', max_iter=100, tolerance=1.0e-08, random_state=42, verbose=True)

# Access the VQPCA clustering solution:
idx = vqpca.idx
With
verbose=True
, the code above will print detailed information on each iteration:
| It. | Rec. error | Error conv.? | Cent. conv.? | Cluster 1 | Cluster 2 | Cluster 3 | Time [min] |
| 1 | 10.20073505 | False | False | 165 | 58 | 177 | 0.00042 |
| 2 | 6.02108074 | False | False | 155 | 84 | 161 | 0.00073 |
| 3 | 5.79390739 | False | False | 148 | 97 | 155 | 0.0011 |
| 4 | 5.69141601 | False | False | 148 | 110 | 142 | 0.00134 |
| 5 | 5.63347972 | False | False | 148 | 117 | 135 | 0.00156 |
| 6 | 5.61523762 | False | False | 148 | 117 | 135 | 0.00175 |
| 7 | 5.61010989 | False | False | 147 | 117 | 136 | 0.00199 |
| 8 | 5.60402719 | False | False | 144 | 119 | 137 | 0.00224 |
| 9 | 5.59803052 | False | False | 144 | 121 | 135 | 0.00246 |
| 10 | 5.59072799 | False | False | 142 | 123 | 135 | 0.00268 |
| 11 | 5.5783608 | False | False | 139 | 123 | 138 | 0.00291 |
| 12 | 5.57368963 | False | False | 138 | 123 | 139 | 0.00316 |
| 13 | 5.56762599 | False | False | 140 | 122 | 138 | 0.0034 |
| 14 | 5.55839038 | False | False | 138 | 120 | 142 | 0.00368 |
| 15 | 5.55167405 | False | False | 137 | 120 | 143 | 0.00394 |
| 16 | 5.54661554 | False | False | 136 | 120 | 144 | 0.0042 |
| 17 | 5.5453694 | False | True | 136 | 120 | 144 | 0.00444 |
| 18 | 5.5453694 | True | True | 136 | 120 | 144 | 0.00444 |
Convergence reached in iteration: 18
Total time: 0.004471 minutes.
--------------------------------------------------------------------------------------------------------------
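For intuition, the assignment step described above can be sketched in a few lines of NumPy. This is a conceptual sketch only, built around a hypothetical helper assign_observations; it is not PCAfold's VQPCA implementation:

import numpy as np

def assign_observations(X_cs, centers, bases):
    # Conceptual sketch: each observation is assigned to the cluster whose local
    # PCA basis reconstructs it with the smallest squared error. centers[j] is the
    # local mean of cluster j and bases[j] is its (n_variables, q) matrix of local
    # eigenvectors.
    errors = np.zeros((X_cs.shape[0], len(bases)))
    for j, (center, A_q) in enumerate(zip(centers, bases)):
        X_local = X_cs - center
        X_rec = (X_local @ A_q) @ A_q.T
        errors[:, j] = np.sum((X_local - X_rec)**2, axis=1)
    return np.argmin(errors, axis=1)

# Dummy usage with two random rank-1 local bases for a 5-variable data set:
X_cs = np.random.rand(100, 5)
centers = [np.zeros(5), np.ones(5)]
bases = [np.linalg.qr(np.random.rand(5, 1))[0], np.linalg.qr(np.random.rand(5, 1))[0]]
idx_sketch = assign_observations(X_cs, centers, bases)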
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_clusters –
int
specifying the number of clusters to partition the data.n_components – (optional)
int
specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.idx_init – (optional)
str
ornumpy.ndarray
specifying the method for centroids initialization. Ifstr
, it can be'uniform'
or'random'
. By default, random initialization is performed. An arbitrary user-supplied initial idx
for initializing the centroids can be passed using anumpy.ndarray
. It should be of size(n_observations,)
or(n_observations,1)
.max_iter – (optional) the maximum number of iterations that the algorithm will loop through.
tolerance – (optional)
float
specifying the tolerance for the global mean squared reconstruction error and for the cluster centroids. This parameter is important for judging the convergence of the VQPCA algorithm. If set toNone
, the default value1.0e-08
is used.random_state – (optional)
int
specifying the random seed.verbose – (optional) boolean for printing clustering details.
Attributes:
idx - vector of cluster classifications.
collected_idx - vector of cluster classifications from all iterations.
converged - boolean specifying whether the algorithm has converged.
A - local eigenvectors from the last iteration.
principal_components - local Principal Components from the last iteration.
reconstruction_errors_in_clusters - mean reconstruction errors in each cluster from the last iteration.
Subset Principal Component Analysis#
Class SubsetPCA
#
- class PCAfold.reduction.SubsetPCA(X, X_source=None, full_sequence=True, subset_indices=None, variable_names=None, scaling='std', n_components=2, use_eigendec=True, nocenter=False, verbose=False)#
Enables performing Principal Component Analysis (PCA) of a subset of the original data set, \(\mathbf{X}\).
Example:
from PCAfold import SubsetPCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate SubsetPCA class object:
subset_pca_X = SubsetPCA(X, full_sequence=True, scaling='std', n_components=2)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.X_source – (optional)
numpy.ndarray
specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].full_sequence – (optional)
bool
specifying if a full sequence of subset PCAs should be performed. If set toTrue
, it is assumed that variables in \(\mathbf{X}\) have been ordered according to some criterion. A sequence of subset PCAs will then be performed starting from the firstn_components+1
variables and gradually adding the next variable in \(\mathbf{X}\). Whenfull_sequence=True
, parametersubset_indices
will be ignored and the class attributes will be of typelist
ofnumpy.ndarray
. Each element in those lists corresponds to one subset PCA in a sequence.subset_indices – (optional)
list
specifying the indices of columns to be taken from the original data set to form a subset of a data set.variable_names – (optional)
list
ofstr
specifying the names of variables in \(\mathbf{X}\). It should have lengthn_variables
and each element should correspond to a column in \(\mathbf{X}\).scaling – (optional)
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.n_components – (optional)
int
specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.use_eigendec –
(optional)
bool
specifying the method for obtaining eigenvalues and eigenvectors:use_eigendec=True
uses eigendecomposition of the covariance matrix (fromnumpy.linalg.eigh
)use_eigendec=False
uses Singular Value Decomposition (SVD) (fromscipy.linalg.svd
)
nocenter – (optional)
bool
specifying whether the original data set should be centered by the mean.
Attributes:
S - (read only)
numpy.ndarray
orlist
ofnumpy.ndarray
specifying the covariance matrix, \(\mathbf{S}\).L - (read only)
numpy.ndarray
orlist
ofnumpy.ndarray
specifying the vector of eigenvalues, \(\mathbf{L}\).A - (read only)
numpy.ndarray
orlist
ofnumpy.ndarray
specifying the matrix of eigenvectors, \(\mathbf{A}\).principal_components - (read only)
numpy.ndarray
orlist
ofnumpy.ndarray
specifying the principal components, \(\mathbf{Z}\).PC_source_terms - (read only)
numpy.ndarray
orlist
ofnumpy.ndarray
specifying the PC source terms, \(\mathbf{S_Z}\).variable_sequence - (read only)
list
orlist
oflist
specifying the names of variables that were used in each subset PCA.
Sample Principal Component Analysis#
Class SamplePCA
#
- PCAfold.reduction.SamplePCA(X, idx_X_r, scaling, n_components, biasing_option, X_source=None)#
Enables performing Principal Component Analysis (PCA) on a sample, \(\mathbf{X_r}\), of the original data set, \(\mathbf{X}\), with one of the four implemented options. Reach out to the Biasing options section of the documentation for more information on the available options.
Example:
from PCAfold import DataSampler, SamplePCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Instantiate SamplePCA class object:
sample_pca = SamplePCA(X, idx_X_r, scaling='auto', n_components=2, biasing_option=1, X_source=None)

# Access the re-sampled PCs:
PCs_resampled = sample_pca.pc_scores
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx_X_r –
numpy.ndarray
specifying the vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\). It should be of size(n_samples,)
or(n_samples,1)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.n_components –
int
specifying the number of retained principal components, \(q\). If set to 0 all PCs are retained. It should be a non-negative number.biasing_option –
int
specifying the biasing option. It can only attain values 1, 2, 3 or 4.X_source – (optional)
numpy.ndarray
specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].
Attributes:
eigenvalues - (read only)
numpy.ndarray
specifying the biased eigenvalues, \(\mathbf{L_r}\).eigenvectors - (read only)
numpy.ndarray
specifying the biased eigenvectors, \(\mathbf{A_r}\).pc_scores - (read only)
numpy.ndarray
specifying the \(q\) first biased principal components, \(\mathbf{Z_r}\).pc_sources - (read only)
numpy.ndarray
specifying the \(q\) first biased sources of principal components, \(\mathbf{S_{Z_r}}\). More information can be found in [TSP09]. This parameter is only computed ifX_source
input parameter is specified.C - (read only)
numpy.ndarray
specifying a vector of centers, \(\mathbf{C}\), that were used to preprocess the original full data set \(\mathbf{X}\).D - (read only)
numpy.ndarray
specifying a vector of scales, \(\mathbf{D}\), that were used to preprocess the original full data set \(\mathbf{X}\).C_r - (read only)
numpy.ndarray
specifying a vector of centers, \(\mathbf{C_r}\), that were used to preprocess the sampled data set \(\mathbf{X_r}\).D_r - (read only)
numpy.ndarray
specifying a vector of scales, \(\mathbf{D_r}\), that were used to preprocess the sampled data set \(\mathbf{X_r}\).
Class EquilibratedSamplePCA
#
- PCAfold.reduction.EquilibratedSamplePCA(X, idx, scaling, n_components, biasing_option, X_source=None, n_iterations=10, stop_iter=0, random_seed=None, verbose=False)#
Enables performing Principal Component Analysis (PCA) on a sample, \(\mathbf{X_r}\), of the original data set, \(\mathbf{X}\), with one of the four implemented options. Reach out to the Biasing options section of the documentation for more information on the available options.
This implementation gradually (in n_iterations) equilibrates cluster populations, moving each cluster's population towards that of the smallest cluster.

At each iteration, it generates a reduced data set \(\mathbf{X_r}^{(i)}\) made up of the new populations and performs PCA on that data set to find the \(i^{th}\) version of the eigenvectors. Depending on the option selected, it then projects a data set (and optionally also its sources) onto the found eigenvectors.
Equilibration:
Currently, only one equilibration scheme is implemented. The smallest cluster is found, and at each iteration the number of observations in any larger \(j^{th}\) cluster is diminished by:

\[\frac{N_j - N_s}{\verb|n_iterations|}\]

where \(N_j\) is the number of observations in the \(j^{th}\) cluster and \(N_s\) is the number of observations in the smallest cluster. This is further illustrated on a synthetic data set below:
Future implementation will include equilibration that slows down close to equilibrium.
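As a small numerical illustration of this decrement (the cluster sizes below are hypothetical, not taken from the library):

# Hypothetical cluster populations and number of iterations (illustrative only):
N = [500, 350, 200]
N_s = min(N)
n_iterations = 10

# Number of observations removed from each j-th cluster at every iteration:
decrement = [(N_j - N_s) / n_iterations for N_j in N]
# -> [30.0, 15.0, 0.0]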
Interpretation for the outputs:
This function returns 3D arrays eigenvectors, pc_scores and pc_sources, in which the third dimension indexes the subsequent equilibration iterations (see the Returns section below for the exact array sizes).

Example:
from PCAfold import EquilibratedSamplePCA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Instantiate EquilibratedSamplePCA class object:
equilibrated_pca = EquilibratedSamplePCA(X, idx, 'auto', n_components=2, biasing_option=1, n_iterations=1, random_seed=100, verbose=True)

# Access the re-sampled PCs from the last (equilibrated) iteration:
PCs_resampled = equilibrated_pca.pc_scores[:,:,-1]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.X_source –
numpy.ndarray
specifying the source terms \(\mathbf{S_X}\) corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09].n_components –
int
specifying number of \(q\) first principal components that will be saved.biasing_option –
int
specifying the biasing option. It can only attain values 1, 2, 3 or 4.n_iterations – (optional)
int
specifying the number of iterations to loop over.stop_iter – (optional)
int
specifying the index of iteration to stop.random_seed – (optional)
int
specifying random seed for random sample selection.verbose – (optional)
bool
for printing verbose details.
- Returns
eigenvalues -
numpy.ndarray
specifying the collected eigenvalues from each iteration.eigenvectors -
numpy.ndarray
specifying the collected eigenvectors from each iteration. This is a 3D array of size(n_variables, n_components, n_iterations+1)
.pc_scores -
numpy.ndarray
specifying the collected principal components from each iteration. This is a 3D array of size(n_observations, n_components, n_iterations+1)
.pc_sources -
numpy.ndarray
specifying the collected sources of principal components from each iteration. This is a 3D array of size(n_observations, n_components, n_iterations+1)
.idx_train -
numpy.ndarray
specifying the final training indices from the equilibrated iteration.C_r -
numpy.ndarray
specifying a vector of final centers that were used to center the data set at the last (equilibration) iteration.D_r -
numpy.ndarray
specifying a vector of final scales that were used to scale the data set at the last (equilibration) iteration.
analyze_centers_change
#
- PCAfold.reduction.analyze_centers_change(X, idx_X_r, variable_names=[], plot_variables=[], legend_label=[], figure_size=None, title=None, save_filename=None)#
Analyzes the change in normalized centers computed on the sampled subset of the original data set \(\mathbf{X_r}\) with respect to the full original data set \(\mathbf{X}\).
The original data set \(\mathbf{X}\) is first normalized so that each variable ranges from 0 to 1:
\[||\mathbf{X}|| = \frac{\mathbf{X} - min(\mathbf{X})}{max(\mathbf{X} - min(\mathbf{X}))}\]

This normalization is done so that centers can be compared across variables on one plot. Samples are then extracted from \(||\mathbf{X}||\), according to idx_X_r, to form \(||\mathbf{X_r}||\).

Normalized centers are computed as:
\[||\mathbf{C}|| = mean(||\mathbf{X}||)\]
\[||\mathbf{C_r}|| = mean(||\mathbf{X_r}||)\]

The percentage measuring the relative change in normalized centers is computed as:

\[p = \frac{||\mathbf{C_r}|| - ||\mathbf{C}||}{||\mathbf{C}||} \cdot 100\%\]

Example:
from PCAfold import analyze_centers_change, DataSampler
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Analyze the change in normalized centers:
(normalized_C, normalized_C_r, center_movement_percentage, plt) = analyze_centers_change(X, idx_X_r)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx_X_r – vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\).
variable_names – (optional)
list
ofstr
specifying variable names.plot_variables – (optional)
list
ofint
specifying indices of variables to be plotted. By default, all variables are plotted.legend_label – (optional)
list
ofstr
specifying labels for the legend. First entry will refer to \(||\mathbf{C}||\) and second entry to \(||\mathbf{C_r}||\). If the list is empty, legend will not be plotted.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
normalized_C - normalized centers \(||\mathbf{C}||\).
normalized_C_r - normalized centers \(||\mathbf{C_r}||\).
center_movement_percentage - percentage \(p\) measuring the relative change in normalized centers.
plt -
matplotlib.pyplot
plot handle.
analyze_eigenvector_weights_change
#
- PCAfold.reduction.analyze_eigenvector_weights_change(eigenvectors, variable_names=[], plot_variables=[], normalize=False, zero_norm=False, legend_label=[], color_map='viridis', figure_size=None, title=None, save_filename=None)#
Analyzes the change of weights on an eigenvector obtained from a reduced data set, as specified by the eigenvectors matrix. This matrix can contain many versions of eigenvectors, for instance coming from each iteration of the equilibrate_cluster_populations function.

If the number of versions is larger than two, the weights are plotted on a color scale that marks each version. If there is a consistent trend, the coloring should form a clear trajectory.
In a special case, when there are only two versions within the eigenvectors matrix, it is understood that the first version corresponds to the original data set and the last version to the equilibrated data set.

Note: This function plots absolute (and optionally normalized) values of weights on each variable. Columns are normalized by dividing by the maximum value. This is done in order to compare the movement of weights equally, with the highest normalized weight being equal to 1. You can additionally set zero_norm=True in order to normalize weights such that they are between 0 and 1 (this is not done by default).

Example:
from PCAfold import equilibrate_cluster_populations, analyze_eigenvector_weights_change
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,))
idx[50:80] = 1

# Run cluster equilibration:
(eigenvalues, eigenvectors_matrix, pc_scores_matrix, pc_sources_matrix, idx_train, C_r, D_r) = equilibrate_cluster_populations(X, idx, 'auto', n_components=2, biasing_option=1, n_iterations=1, random_seed=100, verbose=True)

# Analyze weights change on the first eigenvector:
plt = analyze_eigenvector_weights_change(eigenvectors_matrix[:,0,:])

# Analyze weights change on the second eigenvector:
plt = analyze_eigenvector_weights_change(eigenvectors_matrix[:,1,:])
- Parameters
eigenvectors –
matrix of concatenated eigenvectors coming from different data sets or from different iterations. It should be of size
(n_variables, n_versions)
. This parameter can be directly extracted fromeigenvectors_matrix
output from functionequilibrate_cluster_populations
. For instance, if the first and second eigenvectors should be plotted:

eigenvectors_1 = eigenvectors_matrix[:,0,:]
eigenvectors_2 = eigenvectors_matrix[:,1,:]
variable_names – (optional)
list
ofstr
specifying variable names.plot_variables – (optional) list of integers specifying indices of variables to be plotted. By default, all variables are plotted.
normalize – (optional)
bool
specifying whether weights should be normalized at all. If set to False, the absolute values are plotted.zero_norm – (optional)
bool
specifying whether weights should be normalized between 0 and 1. By default they are not normalized to start at 0. Only has effect ifnormalize=True
.legend_label – (optional)
list
ofstr
specifying labels for the legend. If the list is empty, legend will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
analyze_eigenvalue_distribution
#
- PCAfold.reduction.analyze_eigenvalue_distribution(X, idx_X_r, scaling, biasing_option, legend_label=[], figure_size=None, title=None, save_filename=None)#
Analyzes the normalized eigenvalue distribution when PCA is performed on the original data set \(\mathbf{X}\) and on the sampled data set \(\mathbf{X_r}\).
Reach out to the Biasing options section of the documentation for more information on the available options.
Example:
from PCAfold import analyze_eigenvalue_distribution, DataSampler
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Generate dummy sampling indices:
idx = np.zeros((100,)).astype(int)
idx[50:80] = 1
selection = DataSampler(idx)
(idx_X_r, _) = selection.number(20, test_selection_option=1)

# Analyze the change in eigenvalue distribution:
plt = analyze_eigenvalue_distribution(X, idx_X_r, 'auto', biasing_option=1)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx_X_r – vector of indices that should be extracted from \(\mathbf{X}\) to form \(\mathbf{X_r}\).
scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.biasing_option –
int
specifying biasing option. Can only attain values 1, 2, 3 or 4.legend_label – (optional)
list
ofstr
specifying labels for the legend. First entry will refer to \(\mathbf{X}\) and second entry to \(\mathbf{X_r}\). If the list is empty, legend will not be plotted.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Biasing options#
This section explains the choice of the biasing_option input parameter in some of the functions in this module.

The general goal for PCA on sampled data sets is to bias PCA with some information about the sampled data set \(\mathbf{X_r}\). The biasing_option parameter controls how PCA is performed on, or informed by, the data set \(\mathbf{X_r}\) sampled from \(\mathbf{X}\).
It is assumed that centers and scales computed on \(\mathbf{X_r}\) are denoted \(\mathbf{C_r}\) and \(\mathbf{D_r}\) and centers and scales computed on \(\mathbf{X}\) are denoted \(\mathbf{C}\) and \(\mathbf{D}\). \(N\) is the number of observations in \(\mathbf{X}\).
Biasing option 1#
The steps of PCA in this option:
Step | Option 1
---|---
S1: Sampling | \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)
S2: Centering and scaling | \(\mathbf{X_{cs, r}} = (\mathbf{X_r} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\); \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)
S3: PCA: Eigenvectors | \(\frac{1}{N-1} \mathbf{X_{cs, r}}^{\mathbf{T}} \mathbf{X_{cs, r}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)
S4: PCA: Transformation | \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)
These steps are presented graphically below:
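To make the table concrete, here is a minimal NumPy sketch of the four steps of option 1. This is an illustration only, not the library implementation; the dummy idx_X_r indices and the auto (standard-deviation) scaling are assumptions made for the example:

import numpy as np

# Dummy data set and dummy sample indices (assumed for illustration):
X = np.random.rand(100,10)
idx_X_r = np.arange(0, 100, 5)

# S1: Sampling:
X_r = X[idx_X_r,:]

# S2: Centering and scaling (auto scaling assumed here):
C, D = X.mean(axis=0), X.std(axis=0)
C_r, D_r = X_r.mean(axis=0), X_r.std(axis=0)
X_cs = (X - C) / D
X_cs_r = (X_r - C_r) / D_r

# S3: Eigendecomposition of the covariance matrix of the sampled, preprocessed data:
S_r = X_cs_r.T @ X_cs_r / (X.shape[0] - 1)
L_r, A_r = np.linalg.eigh(S_r)   # eigenvalues returned in ascending order

# S4: Project the globally preprocessed data onto the biased eigenvectors:
Z_r = X_cs @ A_r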
Biasing option 2#
The steps of PCA in this option:
Step | Option 2
---|---
S1: Sampling | \(\mathbf{X_{cs}} \xrightarrow{\text{sampling}} \mathbf{X_r}\)
S2: Centering and scaling | \(\mathbf{X_r}\) is not further pre-processed; \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\)
S3: PCA: Eigenvectors | \(\frac{1}{N-1} \mathbf{X_r}^{\mathbf{T}} \mathbf{X_r} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)
S4: PCA: Transformation | \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)
These steps are presented graphically below:
Biasing option 3#
The steps of PCA in this option:
Step | Option 3
---|---
S1: Sampling | \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)
S2: Centering and scaling | \(\mathbf{X_{cs, r}} = (\mathbf{X_r} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\); \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\)
S3: PCA: Eigenvectors | \(\frac{1}{N-1} \mathbf{X_{cs, r}}^{\mathbf{T}} \mathbf{X_{cs, r}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)
S4: PCA: Transformation | \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)
These steps are presented graphically below:
Biasing option 4#
The steps of PCA in this option:
Step | Option 4
---|---
S1: Sampling | \(\mathbf{X} \xrightarrow{\text{sampling}} \mathbf{X_r}\)
S2: Centering and scaling | \(\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C_r}) \cdot \mathbf{D_r}^{-1}\)
S3: PCA: Eigenvectors | \(\frac{1}{N-1} \mathbf{X_{cs}}^{\mathbf{T}} \mathbf{X_{cs}} \xrightarrow{\text{eigendec.}} \mathbf{A_r}\)
S4: PCA: Transformation | \(\mathbf{Z_r} = \mathbf{X_{cs}} \mathbf{A_r}\)
These steps are presented graphically below:
Plotting functions#
plot_2d_manifold
#
- PCAfold.reduction.plot_2d_manifold(x, y, color=None, clean=False, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', colorbar_range=None, norm=None, grid_on=True, s=None, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given two vectors defining the manifold.
Example:
from PCAfold import PCA, plot_2d_manifold
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain 2-dimensional manifold from PCA:
pca_X = PCA(X)
principal_components = pca_X.transform(X)

# Plot the manifold:
plt = plot_2d_manifold(principal_components[:,0], principal_components[:,1], color=X[:,0], clean=False, x_label='PC-1', y_label='PC-2', colorbar_label='$X_1$', colorbar_range=(0,1), figure_size=(5,5), title='2D manifold', save_filename='2d-manifold.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in
x
andy
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, manifold will be plotted in black.clean – (optional)
bool
specifying if a clean plot should be made. If set toTrue
, nothing else but the data points is plotted.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.norm – (optional)
matplotlib.colors
specifying the colormap normalization to use. Example can bematplotlib.colors.LogNorm()
.colorbar_range – (optional)
tuple
specifying the lower and the upper bound for the colorbar range.grid_on –
bool
specifying whether grid should be plotted.s – (optional)
int
orfloat
specifying the scatter point size.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_3d_manifold
#
- PCAfold.reduction.plot_3d_manifold(x, y, z, color=None, elev=45, azim=-45, clean=False, x_label=None, y_label=None, z_label=None, colorbar_label=None, color_map='viridis', colorbar_range=None, s=None, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold given three vectors defining the manifold.
Example:
from PCAfold import PCA, plot_3d_manifold
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain 3-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)

# Plot the manifold:
plt = plot_3d_manifold(PCs[:,0], PCs[:,1], PCs[:,2], color=X[:,0], elev=30, azim=-60, clean=False, x_label='PC-1', y_label='PC-2', z_label='PC-3', colorbar_label='$X_1$', colorbar_range=(0,1), figure_size=(15,7), title='3D manifold', save_filename='3d-manifold.png')
plt.close()
- Parameters
x – variable on the \(x\)-axis. It should be of type
numpy.ndarray
and size(n_observations,)
or(n_observations,1)
.y – variable on the \(y\)-axis. It should be of type
numpy.ndarray
and size(n_observations,)
or(n_observations,1)
.z – variable on the \(z\)-axis. It should be of type
numpy.ndarray
and size(n_observations,)
or(n_observations,1)
.color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in
x
,y
andz
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, manifold will be plotted in black.elev – (optional) elevation angle.
azim – (optional) azimuth angle.
clean – (optional)
bool
specifying if a clean plot should be made. If set toTrue
, nothing else but the data points and the 3D axes is plotted.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.z_label – (optional)
str
specifying \(z\)-axis label annotation. If set toNone
label will not be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.colorbar_range – (optional)
tuple
specifying the lower and the upper bound for the colorbar range.s – (optional)
int
orfloat
specifying the scatter point size.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_2d_manifold_sequence
#
- PCAfold.reduction.plot_2d_manifold_sequence(xy, color=None, x_label=None, y_label=None, cbar=False, nrows=1, colorbar_label=None, color_map='viridis', grid_on=True, figure_size=(7, 3), title=None, save_filename=None)#
Plots a sequence of two-dimensional manifolds given a list of two vectors defining the manifold.
Example:
from PCAfold import SubsetPCA, plot_2d_manifold_sequence
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Obtain two-dimensional manifolds from subset PCA:
subset_PCA = SubsetPCA(X)
principal_components = subset_PCA.principal_components

# Plot the manifold sequence:
plt = plot_2d_manifold_sequence(principal_components, color=X[:,0], x_label='PC-1', y_label='PC-2', nrows=2, colorbar_label='$X_1$', figure_size=(7,3), title=['First', 'Second', 'Third'], save_filename='2d-manifold-sequence.pdf')
plt.close()
- Parameters
xy –
list
ofnumpy.ndarray
specifying the manifolds (variables on the \(x\) and \(y\) -axis). Each element of the list should be of size(n_observations,2)
.color – (optional)
numpy.ndarray
orstr
, orlist
ofnumpy.ndarray
orstr
specifying colors for the manifolds. If it is a vector, it has to have length consistent with the number of observations inx
andy
vectors. Eachnumpy.ndarray
should be of size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, manifolds will be plotted in black.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.cbar – (optional)
bool
specifying if the colorbar should be plotted.nrows – (optional)
int
specifying in how many rows the manifold sequence should be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional) tuple specifying figure size.
title – (optional)
list
ofstr
specifying title for each subplot. If set toNone
titles will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_parity
#
- PCAfold.reduction.plot_parity(variable, variable_rec, color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a parity plot between a variable and its reconstruction.
Example:
from PCAfold import PCA, plot_parity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain PCA reconstruction of the data set:
pca_X = PCA(X, n_components=8)
principal_components = pca_X.transform(X)
X_rec = pca_X.reconstruct(principal_components)

# Parity plot for the reconstruction of the first variable:
plt = plot_parity(X[:,0], X_rec[:,0], color=X[:,0], x_label='Observed $X_1$', y_label='Reconstructed $X_1$', colorbar_label='X_1', color_map='inferno', figure_size=(5,5), title='Parity plot', save_filename='parity-plot.pdf')
plt.close()
- Parameters
variable – vector specifying the original variable. It should be of type
numpy.ndarray
and size(n_observations,)
or(n_observations,1)
.variable_rec – vector specifying the reconstruction of the original variable. It should be of type
numpy.ndarray
and size(n_observations,)
or(n_observations,1)
.color – (optional) vector or string specifying color for the parity plot. If it is a vector, it has to have length consistent with the number of observations in
x
andy
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, parity plot will be plotted in black.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_mode
#
- PCAfold.reduction.plot_mode(mode, mode_name=None, variable_names=None, plot_absolute=False, rotate_label=False, bar_color=None, ylim=None, figure_size=None, title=None, save_filename=None)#
Plots weights on a generic mode.
Example:
from PCAfold import PCA, plot_mode
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot the first eigenvector:
plt = plot_mode(eigenvectors[:,0], variable_names=['$a_1$', '$a_2$', '$a_3$'], plot_absolute=False, rotate_label=True, bar_color=None, figure_size=(5,3), title='PCA on X', save_filename='PCA-X.pdf')
plt.close()
- Parameters
mode –
numpy.ndarray
specifying the mode to plot. It should be of size(n_variables,)
or(n_variables,1)
.mode_name –
str
specifying the mode name.variable_names – (optional)
list
ofstr
specifying variable names.plot_absolute – (optional)
bool
specifying whether absolute values of eigenvectors should be plotted.rotate_label – (optional)
bool
specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it toTrue
for data sets with many variables for viewing clarity.bar_color – (optional)
str
specifying color of bars.ylim – (optional)
list
specifying limits on the y-axis.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_eigenvectors
#
- PCAfold.reduction.plot_eigenvectors(eigenvectors, eigenvectors_indices=[], variable_names=None, plot_absolute=False, rotate_label=False, bar_color=None, figure_size=None, title=None, save_filename=None)#
Plots weights on eigenvectors. It will generate as many plots as there are eigenvectors present in the eigenvectors matrix.

Example:
from PCAfold import PCA, plot_eigenvectors
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot second and third eigenvector:
plts = plot_eigenvectors(eigenvectors[:,[1,2]], eigenvectors_indices=[1,2], variable_names=['$a_1$', '$a_2$', '$a_3$'], plot_absolute=False, rotate_label=True, bar_color=None, title='PCA on X', save_filename='PCA-X.pdf')
plts[0].close()
plts[1].close()
- Parameters
eigenvectors – matrix of eigenvectors to plot. It can be supplied as an attribute of the
PCA
class:PCA.A
.eigenvectors_indices –
list
ofint
specifying indexing of eigenvectors insideeigenvectors
supplied. If it is not supplied, it is assumed that eigenvectors are numbered \([0, 1, 2, \dots, n]\), where \(n\) is the number of eigenvectors provided.variable_names – (optional)
list
ofstr
specifying variable names.plot_absolute – (optional)
bool
specifying whether absolute values of eigenvectors should be plotted.rotate_label – (optional)
bool
specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it toTrue
for data sets with many variables for viewing clarity.bar_color – (optional)
str
specifying color of bars.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
. Note that a prefixeigenvector-#
will be added out front the filename, where#
is the number of the currently plotted eigenvector.
- Returns
plot_handles - list of plot handles.
plot_eigenvectors_comparison
#
- PCAfold.reduction.plot_eigenvectors_comparison(eigenvectors_tuple, legend_labels=[], variable_names=[], plot_absolute=False, rotate_label=False, ylim=None, color_map='coolwarm', figure_size=None, title=None, save_filename=None)#
Plots a comparison of weights on eigenvectors.
Example:
from PCAfold import PCA, plot_eigenvectors_comparison
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Perform PCA and obtain eigenvectors:
pca_X = PCA(X, n_components=2)
eigenvectors = pca_X.A

# Plot comparison of first and second eigenvector:
plt = plot_eigenvectors_comparison((eigenvectors[:,0], eigenvectors[:,1]), legend_labels=['PC-1', 'PC-2'], variable_names=['$a_1$', '$a_2$', '$a_3$'], plot_absolute=False, color_map='coolwarm', title='PCA on X', save_filename='PCA-X.pdf')
plt.close()
- Parameters
eigenvectors_tuple –
tuple
specifying the eigenvectors to plot. Each eigenvector inside the tuple should be a 1D array. It can be supplied as an attribute of the PCA
class, for instance:(PCA.A[:,0], PCA.A[:,1])
.legend_labels –
list
ofstr
specifying labels for each element in theeigenvectors_tuple
.variable_names – (optional)
list
ofstr
specifying variable names.plot_absolute –
bool
specifying whether absolute values of eigenvectors should be plotted.rotate_label – (optional)
bool
specifying whether the labels on the x-axis should be rotated by 90 degrees. It is recommended to set it toTrue
for data sets with many variables for viewing clarity.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'coolwarm'
.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_eigenvalue_distribution
#
- PCAfold.reduction.plot_eigenvalue_distribution(eigenvalues, normalized=False, figure_size=None, title=None, save_filename=None)#
Plots eigenvalue distribution.
Example:
from PCAfold import PCA, plot_eigenvalue_distribution
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues = pca_X.L

# Plot eigenvalue distribution:
plt = plot_eigenvalue_distribution(eigenvalues, normalized=True, title='PCA on X', save_filename='PCA-X.pdf')
plt.close()
- Parameters
eigenvalues – a 1D vector of eigenvalues to plot. It can be supplied as an attribute of the
PCA
class:PCA.L
.normalized – (optional)
bool
specifying whether eigenvalues should be normalized to 1.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_eigenvalue_distribution_comparison
#
- PCAfold.reduction.plot_eigenvalue_distribution_comparison(eigenvalues_tuple, legend_labels=[], normalized=False, color_map='coolwarm', figure_size=None, title=None, save_filename=None)#
Plots a comparison of eigenvalue distributions.
Example:
from PCAfold import PCA, plot_eigenvalue_distribution_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,10)
Y = np.random.rand(100,10)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues_X = pca_X.L
pca_Y = PCA(Y)
eigenvalues_Y = pca_Y.L

# Plot eigenvalue distribution comparison:
plt = plot_eigenvalue_distribution_comparison((eigenvalues_X, eigenvalues_Y), legend_labels=['PCA on X', 'PCA on Y'], normalized=True, title='PCA on X and Y', save_filename='PCA-X-Y.pdf')
plt.close()
- Parameters
eigenvalues_tuple –
tuple
specifying the eigenvalues to plot. Each vector of eigenvalues inside the tuple should be a 1D array. It can be supplied as an attribute of the PCA
class, for instance:(PCA_1.L, PCA_2.L)
.legend_labels –
list
ofstr
specifying the labels for each element in theeigenvalues_tuple
.normalized – (optional)
bool
specifying whether eigenvalues should be normalized to 1.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'coolwarm'
.figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_cumulative_variance
#
- PCAfold.reduction.plot_cumulative_variance(eigenvalues, n_components=0, figure_size=None, title=None, save_filename=None)#
Plots the eigenvalues as bars and their cumulative sum to visualize the percent variance in the data explained by each principal component individually and by each principal component cumulatively.
Example:
from PCAfold import PCA, plot_cumulative_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain eigenvalues:
pca_X = PCA(X)
eigenvalues = pca_X.L

# Plot the cumulative variance from eigenvalues:
plt = plot_cumulative_variance(eigenvalues, n_components=0, title='PCA on X', save_filename='PCA-X.pdf')
plt.close()
- Parameters
eigenvalues – a 1D vector of eigenvalues to analyze. It can be supplied as an attribute of the
PCA
class:PCA.L
.n_components – (optional) how many principal components you want to visualize (default is all).
figure_size – (optional) tuple specifying figure size.
title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_heatmap
#
- PCAfold.reduction.plot_heatmap(M, annotate=False, text_color='w', format_displayed='%.2f', x_ticks=None, y_ticks=None, color_map='viridis', cbar=False, colorbar_label=None, figure_size=(5, 5), title=None, save_filename=None)#
Plots a heatmap for any matrix \(\mathbf{M}\).
Example:
from PCAfold import PCA, plot_heatmap
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain the covariance matrix:
pca_X = PCA(X)
covariance_matrix = pca_X.S

# Define ticks:
ticks = ['A', 'B', 'C', 'D', 'E']

# Plot a heatmap of the covariance matrix:
plt = plot_heatmap(covariance_matrix, annotate=True, text_color='w', format_displayed='%.1f', x_ticks=ticks, y_ticks=ticks, title='Covariance', save_filename='covariance.pdf')
plt.close()
- Parameters
M –
numpy.ndarray
specifying the matrix \(\mathbf{M}\).annotate – (optional)
bool
specifying whether numerical values of matrix elements should be plotted on top of the heatmap.text_color – (optional)
str
specifying the color of the annotation text.format_displayed – (optional)
str
specifying the display format for the numerical entries inside the table. By default it is set to'%.2f'
.x_ticks – (optional)
bool
specifying whether ticks on the \(x\) -axis should be plotted orlist
specifying the ticks on the \(x\) -axis.y_ticks – (optional)
bool
specifying whether ticks on the \(y\) -axis should be plotted orlist
specifying the ticks on the \(y\) -axis.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.cbar – (optional)
bool
specifying whether colorbar should be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_heatmap_sequence
#
- PCAfold.reduction.plot_heatmap_sequence(M, annotate=False, text_color='w', format_displayed='%.2f', x_ticks=None, y_ticks=None, color_map='viridis', cbar=False, colorbar_label=None, figure_size=(5, 5), title=None, save_filename=None)#
Plots a sequence of heatmaps for matrices \(\mathbf{M}\) stored in a list.
Example:
from PCAfold import PCA, plot_heatmap_sequence
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA and obtain the covariance matrices:
pca_X_auto = PCA(X, scaling='auto')
pca_X_range = PCA(X, scaling='range')
pca_X_vast = PCA(X, scaling='vast')
covariance_matrices = [pca_X_auto.S, pca_X_range.S, pca_X_vast.S]
titles = ['Auto', 'Range', 'VAST']

# Plot a sequence of heatmaps of the covariance matrices:
plt = plot_heatmap_sequence(covariance_matrices, annotate=True, text_color='w', format_displayed='%.1f', color_map='viridis', cbar=True, title=titles, figure_size=(12,3), save_filename='covariance-matrices.pdf')
plt.close()
- Parameters
M –
list
ofnumpy.ndarray
specifying the matrices \(\mathbf{M}\).annotate – (optional)
bool
specifying whether numerical values of matrix elements should be plotted on top of the heatmap.text_color – (optional)
str
specifying the color of the annotation text.format_displayed – (optional)
str
specifying the display format for the numerical entries inside the table. By default it is set to'%.2f'
.x_ticks – (optional)
bool
specifying whether ticks on the \(x\) -axis should be plotted orlist
oflist
specifying the ticks on the \(x\) -axis.y_ticks – (optional)
bool
specifying whether ticks on the \(y\) -axis should be plotted orlist
oflist
specifying the ticks on the \(y\) -axis.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.cbar – (optional)
bool
specifying whether colorbar should be plotted.colorbar_label – (optional)
str
specifying colorbar label annotation. If set toNone
, colorbar label will not be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Bibliography#
- RFro76
Serge Frontier. Étude de la décroissance des valeurs propres dans une analyse en composantes principales: comparaison avec le modèle du bâton brisé. Journal of Experimental Marine Biology and Ecology, 25(1):67–75, 1976.
- RJol72(1,2,3)
Ian T Jolliffe. Discarding variables in a principal component analysis. i: artificial data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 21(2):160–173, 1972.
- RKai60
Henry F. Kaiser. The application of electronic computers to factor analysis. Educational and psychological measurement, 20(1):141–151, 1960.
- RKL97
Nandakishore Kambhatla and Todd K. Leen. Dimension reduction by local principal component analysis. Neural computation, 9(7):1493–1516, 1997.
- RKrz87
Wojtek J Krzanowski. Selection of variables to preserve multivariate data structure, using principal components. Journal of the Royal Statistical Society: Series C (Applied Statistics), 36(1):22–33, 1987.
- TSP09(1,2,3,4,5,6)
James C. Sutherland and Alessandro Parente. Combustion modeling using principal component analysis. Proceedings of the Combustion Institute, 32(1):1563–1570, 2009.
- RJolliffe02(1,2,3)
Ian Jolliffe. Principal component analysis. Springer Verlag, New York, 2002.
Manifold analysis#
The analysis
module contains functions for assessing the intrinsic
dimensionality and quality of manifolds.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
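For example, a minimal illustration of the expected format (the shape used here is arbitrary):

import numpy as np

# 1000 observations (rows) of 8 variables (columns):
X = np.random.rand(1000,8)
(n_observations, n_variables) = X.shape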
Manifold assessment#
This section includes functions for quantitative assessments of manifold dimensionality and for comparing manifold parameterizations according to scales of variation and uniqueness of dependent variable values as introduced in [AAS21] and [AZASP22].
compute_normalized_variance
#
- PCAfold.analysis.compute_normalized_variance(indepvars, depvars, depvar_names, npts_bandwidth=25, min_bandwidth=None, max_bandwidth=None, bandwidth_values=None, scale_unit_box=True, n_threads=None, compute_sample_norm_var=False, compute_sample_norm_range=False)#
Compute a normalized variance (and related quantities) for analyzing manifold dimensionality. The normalized variance is computed as
\[\mathcal{N}(\sigma) = \frac{\sum_{i=1}^n (y_i - \mathcal{K}(\hat{x}_i; \sigma))^2}{\sum_{i=1}^n (y_i - \bar{y} )^2}\]

where \(\bar{y}\) is the average quantity over the whole manifold and \(\mathcal{K}(\hat{x}_i; \sigma)\) is the weighted average quantity calculated using kernel regression with a Gaussian kernel of bandwidth \(\sigma\) centered around the \(i^{th}\) observation. \(n\) is the number of observations. \(\mathcal{N}(\sigma)\) is computed for each bandwidth in an array of bandwidth values.

By default, the indepvars (\(x\)) are centered and scaled to reside inside a unit box (resulting in \(\hat{x}\)) so that the bandwidths have the same meaning in each dimension. Therefore, the bandwidth and its involved calculations are applied in the normalized independent variable space. This may be turned off by setting scale_unit_box to False. The bandwidth values may be specified directly through bandwidth_values, or default values will be calculated as a logspace from min_bandwidth to max_bandwidth with npts_bandwidth number of values. If left unspecified, min_bandwidth and max_bandwidth will be calculated as the minimum and maximum nonzero distance between points, respectively.

More information can be found in [AAS21].
Example:
from PCAfold import PCA, compute_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Access bandwidth values:
variance_data.bandwidth_values

# Access normalized variance values:
variance_data.normalized_variance

# Access normalized variance values for a specific variable:
variance_data.normalized_variance['B']
- Parameters
indepvars –
numpy.ndarray
specifying the independent variable values. It should be of size(n_observations,n_independent_variables)
.depvars –
numpy.ndarray
specifying the dependent variable values. It should be of size(n_observations,n_dependent_variables)
.depvar_names –
list
ofstr
corresponding to the names of the dependent variables (for saving values in a dictionary)npts_bandwidth – (optional, default 25) number of points to build a logspace of bandwidth values
min_bandwidth – (optional, default to minimum nonzero interpoint distance) minimum bandwidth
max_bandwidth – (optional, default to estimated maximum interpoint distance) maximum bandwidth
bandwidth_values – (optional) array of bandwidth values, i.e. filter widths for a Gaussian filter, to loop over
scale_unit_box – (optional, default True) center/scale the independent variables between [0,1] for computing a normalized variance so the bandwidth values have the same meaning in each dimension
n_threads – (optional, default None) number of threads to run this computation. If None, default behavior of multiprocessing.Pool is used, which is to use all available cores on the current system.
compute_sample_norm_var – (optional, default False)
bool
specifying if sample normalized variance should be computed.compute_sample_norm_range – (optional, default False)
bool
specifying if sample normalized range should be computed.
- Returns
variance_data - an object of the
VarianceData
class.
Class VarianceData
#
- class PCAfold.analysis.VarianceData(bandwidth_values, norm_var, global_var, bandwidth_10pct_rise, keys, norm_var_limit, sample_norm_var, sample_norm_range)#
A class for storing helpful quantities in analyzing dimensionality of manifolds through normalized variance measures. This class will be returned by
compute_normalized_variance
.- Parameters
bandwidth_values – the array of bandwidth values (Gaussian filter widths) used in computing the normalized variance for each variable
normalized_variance – dictionary of the normalized variance computed at each of the bandwidth values for each variable
global_variance – dictionary of the global variance for each variable
bandwidth_10pct_rise – dictionary of the bandwidth value corresponding to a 10% rise in the normalized variance for each variable
variable_names – list of the variable names
normalized_variance_limit – dictionary of the normalized variance computed as the bandwidth approaches zero (numerically at \(10^{-16}\)) for each variable
sample_normalized_variance – dictionary of the sample normalized variance for every observation, for each bandwidth and for each variable
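A short access sketch, assuming the variance_data object returned by compute_normalized_variance in the example above (the attribute names follow the parameter list above):

# Bandwidth values (Gaussian filter widths) used in the computation:
sigma = variance_data.bandwidth_values

# Normalized variance at each bandwidth for dependent variable 'B':
N_of_sigma = variance_data.normalized_variance['B']

# Bandwidth corresponding to a 10% rise in the normalized variance of 'B':
sigma_10pct = variance_data.bandwidth_10pct_rise['B']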
normalized_variance_derivative
#
- PCAfold.analysis.normalized_variance_derivative(variance_data)#
Compute a scaled normalized variance derivative on a logarithmic scale, \(\hat{\mathcal{D}}(\sigma)\), from
\[\mathcal{D}(\sigma) = \frac{\mathrm{d}\mathcal{N}(\sigma)}{\mathrm{d}\log_{10}(\sigma)} + \lim_{\sigma \to 0} \mathcal{N}(\sigma)\]

and

\[\hat{\mathcal{D}}(\sigma) = \frac{\mathcal{D}(\sigma)}{\max(\mathcal{D}(\sigma))}\]

This value relays how fast the variance is changing as the bandwidth changes and captures non-uniqueness from nonzero values of \(\lim_{\sigma \to 0} \mathcal{N}(\sigma)\). The derivative is approximated with central finite differencing and the limit is approximated by \(\mathcal{N}(\sigma=10^{-16})\) using the normalized_variance_limit attribute of the VarianceData object.

More information can be found in [AAS21].
Example:
from PCAfold import PCA, compute_normalized_variance, normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Compute normalized variance derivative:
(derivative, bandwidth_values, max_derivative) = normalized_variance_derivative(variance_data)

# Access normalized variance derivative values for a specific variable:
derivative['B']
- Parameters
variance_data – a
VarianceData
class returned fromcompute_normalized_variance
- Returns
derivative_dict - a dictionary of \(\hat{\mathcal{D}}(\sigma)\) for each variable in the provided
VarianceData
objectx - the \(\sigma\) values where \(\hat{\mathcal{D}}(\sigma)\) was computed
max_derivatives_dicts - a dictionary of \(\max(\mathcal{D}(\sigma))\) values for each variable in the provided
VarianceData
object.
find_local_maxima
#
- PCAfold.analysis.find_local_maxima(dependent_values, independent_values, logscaling=True, threshold=0.01, show_plot=False)#
Finds and returns locations and values of local maxima in a dependent variable given a set of observations. The functional form of the dependent variable is approximated with a cubic spline for smoother approximations to local maxima.
- Parameters
dependent_values – observations of a single dependent variable such as \(\hat{\mathcal{D}}\) from
normalized_variance_derivative
(for a single variable).independent_values – observations of a single independent variable such as \(\sigma\) returned by
normalized_variance_derivative
logscaling – (optional, default True) this logarithmically scales
independent_values
before finding local maxima. This is needed for scaling \(\sigma\) appropriately before finding peaks in \(\hat{\mathcal{D}}\).threshold – (optional, default \(10^{-2}\)) local maxima found below this threshold will be ignored.
show_plot – (optional, default False) when True, a plot of the
dependent_values
overindependent_values
(logarithmically scaled iflogscaling
is True) with the local maxima highlighted will be shown.
- Returns
the locations of local maxima in
dependent_values
the local maxima values
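Example (a minimal sketch of locating peaks in \(\hat{\mathcal{D}}(\sigma)\); it assumes find_local_maxima is importable from the top-level PCAfold namespace like the other analysis functions):
from PCAfold import PCA, compute_normalized_variance, normalized_variance_derivative, find_local_maxima
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20))

# Compute the normalized variance derivative:
(derivative, sigma, _) = normalized_variance_derivative(variance_data)

# Locate peaks in the derivative profile for variable 'A':
(peak_locations, peak_values) = find_local_maxima(derivative['A'], sigma)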
random_sampling_normalized_variance
#
- PCAfold.analysis.random_sampling_normalized_variance(sampling_percentages, indepvars, depvars, depvar_names, n_sample_iterations=1, verbose=True, npts_bandwidth=25, min_bandwidth=None, max_bandwidth=None, bandwidth_values=None, scale_unit_box=True, n_threads=None)#
Compute the normalized variance derivatives \(\hat{\mathcal{D}}(\sigma)\) for random samples of the provided data specified using
sampling_percentages
. These will be averaged overn_sample_iterations
iterations. Analyzing the shift in peaks of \(\hat{\mathcal{D}}(\sigma)\) due to sampling can distinguish between characteristic features and non-uniqueness due to a transformation/reduction of manifold coordinates. True features should not show significant sensitivity to sampling while non-uniqueness/folds in the manifold will.More information can be found in [AAS21].
- Parameters
sampling_percentages – list or 1D array of fractions (between 0 and 1) of the provided data to sample for computing the normalized variance
indepvars – independent variable values (size: n_observations x n_independent variables)
depvars – dependent variable values (size: n_observations x n_dependent variables)
depvar_names – list of strings corresponding to the names of the dependent variables (for saving values in a dictionary)
n_sample_iterations – (optional, default 1) how many iterations for each
sampling_percentages
to average the normalized variance derivative oververbose – (optional, default True) when True, progress statements are printed
npts_bandwidth – (optional, default 25) number of points to build a logspace of bandwidth values
min_bandwidth – (optional, default to minimum nonzero interpoint distance) minimum bandwidth
max_bandwidth – (optional, default to estimated maximum interpoint distance) maximum bandwidth
bandwidth_values – (optional) array of bandwidth values, i.e. filter widths for a Gaussian filter, to loop over
scale_unit_box – (optional, default True) center/scale the independent variables between [0,1] for computing a normalized variance so the bandwidth values have the same meaning in each dimension
n_threads – (optional, default None) number of threads to run this computation. If None, default behavior of multiprocessing.Pool is used, which is to use all available cores on the current system.
- Returns
a dictionary of the normalized variance derivative (\(\hat{\mathcal{D}}(\sigma)\)) for each sampling percentage in
sampling_percentages
averaged overn_sample_iterations
iterationsthe \(\sigma\) values used for computing \(\hat{\mathcal{D}}(\sigma)\)
a dictionary of the
VarianceData
objects for each sampling percentage and iteration insampling_percentages
andn_sample_iterations
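Example (a minimal sketch of comparing \(\hat{\mathcal{D}}(\sigma)\) computed on random subsets of the data; it assumes random_sampling_normalized_variance is importable from the top-level PCAfold namespace, the sampling fractions and iteration count are illustrative, and the returned values are unpacked according to the Returns description above):
from PCAfold import PCA, random_sampling_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute averaged normalized variance derivatives on 50% and 100% random samples:
(derivatives, sigma_values, variance_data_dict) = random_sampling_normalized_variance([0.5, 1.0], principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], n_sample_iterations=2, verbose=False, bandwidth_values=np.logspace(-3, 1, 20))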
feature_size_map
#
- PCAfold.analysis.feature_size_map(variance_data, variable_name, cutoff=1, starting_bandwidth_idx='peak', use_variance=False, verbose=False)#
Computes a map of local feature sizes on a manifold.
Example:
from PCAfold import PCA, compute_normalized_variance, feature_size_map
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=variable_names, bandwidth_values=bandwidth_values)

# Compute the feature size map:
feature_size_map = feature_size_map(variance_data, variable_name='X_1', cutoff=1, starting_bandwidth_idx='peak', verbose=True)
- Parameters
variance_data – an object of
VarianceData
class.variable_name –
str
specifying the name of the dependent variable for which the feature size map should be computed. It should be as per name specified when computingvariance_data
.cutoff – (optional)
float
orint
specifying the cutoff percentage, \(p\). It should be a number between 0 and 100.starting_bandwidth_idx – (optional)
int
orstr
specifying the index of the starting bandwidth to compute the local feature sizes from. Local feature sizes computed will never be smaller than the starting bandwidth. If set to'peak'
, the starting bandwidth will be automatically calculated as the rightmost peak, \(\sigma_{peak}\).verbose – (optional)
bool
for printing verbose details.
- Returns
feature_size_map -
numpy.ndarray
specifying the local feature sizes on a manifold, \(\mathbf{B}\). It has size(n_observations,)
.
feature_size_map_smooth
#
- PCAfold.analysis.feature_size_map_smooth(indepvars, feature_size_map, method='median', n_neighbors=10)#
Smooths out a map of local feature sizes on a manifold.
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
Example:
from PCAfold import PCA, compute_normalized_variance, feature_size_map, feature_size_map_smooth
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=variable_names, bandwidth_values=bandwidth_values)

# Compute the feature size map:
feature_size_map = feature_size_map(variance_data, variable_name='X_1', cutoff=1, starting_bandwidth_idx='peak', verbose=True)

# Smooth out the feature size map:
updated_feature_size_map = feature_size_map_smooth(principal_components, feature_size_map, method='median', n_neighbors=4)
- Parameters
indepvars –
numpy.ndarray
specifying the independent variable values. It should be of size(n_observations,n_independent_variables)
.feature_size_map –
numpy.ndarray
specifying the local feature sizes on a manifold, \(\mathbf{B}\). It should be of size(n_observations,)
or(n_observations,1)
.method – (optional)
str
specifying the smoothing method. It can be'median'
,'mean'
,'max'
or'min'
.n_neighbors – (optional)
int
specifying the number of nearest neighbors to smooth over.
- Returns
updated_feature_size_map -
numpy.ndarray
specifying the smoothed local feature sizes on a manifold, \(\mathbf{B}\). It has size(n_observations,)
.
cost_function_normalized_variance_derivative
#
- PCAfold.analysis.cost_function_normalized_variance_derivative(variance_data, penalty_function=None, power=1, vertical_shift=1, norm=None, integrate_to_peak=False, rightmost_peak_shift=None)#
Defines a cost function for manifold topology assessment based on the areas, or weighted (penalized) areas, under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables.
More information on the theory and application of the cost function can be found in [AZASP22].
An individual area, \(A_i\), for the \(i^{th}\) dependent variable, is computed by directly integrating the function \(\hat{\mathcal{D}}_i(\sigma)\) in the \(\log_{10}\) space of bandwidths \(\sigma\). Integration is performed using the composite trapezoid rule.
When
integrate_to_peak=False
, the bounds of integration go from the minimum bandwidth, \(\sigma_{min, i}\), to the maximum bandwidth, \(\sigma_{max, i}\):\[A_i = \int_{\sigma_{min, i}}^{\sigma_{max, i}} \hat{\mathcal{D}}_i(\sigma) d \log_{10} \sigma\]When
integrate_to_peak=True
, the bounds of integration go from the minimum bandwidth, \(\sigma_{min, i}\), to the bandwidth for which the rightmost peak happens in \(\hat{\mathcal{D}}_i(\sigma)\), \(\sigma_{peak, i}\):\[A_i = \int_{\sigma_{min, i}}^{\sigma_{peak, i}} \hat{\mathcal{D}}_i(\sigma) d \log_{10} \sigma\]In addition, each individual area, \(A_i\), can be weighted. The following weighting options are available:
If
penalty_function='peak'
, \(A_i\) is weighted by the inverse of the rightmost peak location:
\[A_i = \frac{1}{\sigma_{peak, i}} \cdot \int \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]This creates a constant penalty:
If
penalty_function='sigma'
, \(A_i\) is weighted continuously by the bandwidth:
\[A_i = \int \frac{1}{\sigma^r} \cdot \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]where \(r\) is a hyper-parameter that can be controlled by the user. This type of weighting strongly penalizes the area happening at lower bandwidth values.
For instance, when \(r=0.2\):
When \(r=1\) (with the penalty corresponding to \(r=0.2\) plotted in gray in the background):
If
penalty_function='log-sigma-over-peak'
, \(A_i\) is weighted continuously by the \(\log_{10}\) -transformed bandwidth and takes into account information about the rightmost peak location.
\[A_i = \int \Big( \big| \log_{10} \Big( \frac{\sigma}{\sigma_{peak, i}} \Big) \big|^r + b \cdot \frac{\log_{10} \sigma_{max, i} - \log_{10} \sigma_{min, i}}{\log_{10} \sigma_{peak, i} - \log_{10} \sigma_{min, i}} \Big) \cdot \hat{\mathcal{D}}_i(\sigma) d(\log_{10} \sigma)\]This type of weighting creates a more gentle penalty for the area happening further from the rightmost peak location. By increasing \(b\), the user can increase the amount of penalty applied to smaller feature sizes over larger ones. By increasing \(r\), the user can penalize non-uniqueness more strongly.
For instance, when \(r=1\):
When \(r=2\) (with the penalty corresponding to \(r=1\) plotted in gray in the background):
If
norm=None
, a list of costs for all dependent variables is returned. Otherwise, the final cost, \(\mathcal{L}\), can be computed from all \(A_i\) in a few ways, where \(n_{dep}\) is the number of dependent variables stored in thevariance_data
object:If
norm='average'
, \(\mathcal{L} = \frac{1}{n_{dep}} \sum_{i = 1}^{n_{dep}} A_i\).If
norm='cumulative'
, \(\mathcal{L} = \sum_{i = 1}^{n_{dep}} A_i\).If
norm='max'
, \(\mathcal{L} = \text{max} (A_i)\).If
norm='median'
, \(\mathcal{L} = \text{median} (A_i)\).If
norm='min'
, \(\mathcal{L} = \text{min} (A_i)\).
Example:
from PCAfold import PCA, compute_normalized_variance, cost_function_normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Specify the bandwidth values:
bandwidth_values = np.logspace(-4, 2, 50)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=variable_names, bandwidth_values=bandwidth_values)

# Compute the cost for the current manifold:
cost = cost_function_normalized_variance_derivative(variance_data, penalty_function='sigma', power=0.5, vertical_shift=1, norm='max', integrate_to_peak=True)
- Parameters
variance_data – an object of
VarianceData
class.penalty_function – (optional)
str
specifying the weighting (penalty) applied to each area. Setpenalty_function='peak'
to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Setpenalty_function='sigma'
to weight each area continuously by the bandwidth. Setpenalty_function='log-sigma'
to weight each area continuously by the \(\log_{10}\) -transformed bandwidth. Setpenalty_function='log-sigma-over-peak'
to weight each area continuously by the \(\log_{10}\) -transformed bandwidth, normalized by the right most peak location, \(\sigma_{peak, i}\). Ifpenalty_function=None
, the area is not weighted.power – (optional)
float
orint
specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.vertical_shift – (optional)
float
orint
specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.norm – (optional)
str
specifying the norm to apply for all areas \(A_i\).norm='average'
uses an arithmetic average,norm='max'
uses the \(L_{\infty}\) norm,norm='median'
uses a median area,norm='cumulative'
uses a cumulative area andnorm='min'
uses a minimum area. Ifnorm=None
, a list of costs for all dependent variables is returned.integrate_to_peak – (optional)
bool
specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.rightmost_peak_shift – (optional)
float
orint
specifying the percentage, \(p\), of shift in the rightmost peak location. If set to a number between 0 and 100, a quantity \(p/100 (\sigma_{max} - \sigma_{peak, i})\) is added to the rightmost peak location. It can be used to move the rightmost peak location further right, for instance if there is a blending of scales in the \(\hat{\mathcal{D}}(\sigma)\) profile.
- Returns
cost -
float
specifying the normalized cost, \(\mathcal{L}\), or, ifnorm=None
, a list of costs, \(A_i\), for each dependent variable.
Kernel Regression#
This section includes details on the Nadaraya-Watson kernel regression
[AHardle90] used in assessing manifolds. The KReg
class may be used
for non-parametric regression in general.
Class KReg
#
- class PCAfold.kernel_regression.KReg(indepvars, depvars, internal_dtype=<class 'float'>, supress_warning=False)#
A class for building and evaluating Nadaraya-Watson kernel regression models using a Gaussian kernel. The regression estimator \(\mathcal{K}(u; \sigma)\) evaluated at independent variables \(u\) can be expressed using a set of \(n\) observations of independent variables (\(x\)) and dependent variables (\(y\)) as follows
\[\mathcal{K}(u; \sigma) = \frac{\sum_{i=1}^{n} \mathcal{W}_i(u; \sigma) y_i}{\sum_{i=1}^{n} \mathcal{W}_i(u; \sigma)}\]where a Gaussian kernel of bandwidth \(\sigma\) is used as
\[\mathcal{W}_i(u; \sigma) = \exp \left( \frac{-|| x_i - u ||_2^2}{\sigma^2} \right)\]Both constant and variable bandwidths are supported. Kernels with anisotropic bandwidths are calculated as
\[\mathcal{W}_i(u; \sigma) = \exp \left( -|| \text{diag}(\sigma)^{-1} (x_i - u) ||_2^2 \right)\]where \(\sigma\) is a vector of bandwidths per independent variable.
Example:
from PCAfold import KReg
import numpy as np

indepvars = np.expand_dims(np.linspace(0,np.pi,11),axis=1)
depvars = np.cos(indepvars)
query = np.expand_dims(np.linspace(0,np.pi,21),axis=1)

model = KReg(indepvars, depvars)
predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)
- Parameters
indepvars –
numpy.ndarray
specifying the independent variable training data, \(x\) in equations above. It should be of size(n_observations,n_independent_variables)
.depvars –
numpy.ndarray
specifying the dependent variable training data, \(y\) in equations above. It should be of size(n_observations,n_dependent_variables)
.internal_dtype – (optional, default float) data type to enforce in training and evaluating
supress_warning – (optional, default False) if True, turns off printed warnings
KReg.predict
#
- PCAfold.kernel_regression.KReg.predict(self, query_points, bandwidth, n_neighbors=None)#
Calculate dependent variable predictions at
query_points
.- Parameters
query_points –
numpy.ndarray
specifying the independent variable points to query the model. It should be of size(n_points,n_independent_variables)
.bandwidth –
value(s) to use for the bandwidth in the Gaussian kernel. Supported formats include:
single value: constant bandwidth applied to each query point and independent variable dimension.
2D array shape (n_points x n_independent_variables): an array of bandwidths for each independent variable dimension of each query point.
string “nearest_neighbors_isotropic”: This option requires the argument
n_neighbors
to be specified for which a bandwidth will be calculated for each query point based on the Euclidean distance to then_neighbors
nearestindepvars
point.string “nearest_neighbors_anisotropic”: This option requires the argument
n_neighbors
to be specified for which a bandwidth will be calculated for each query point based on the distance in each (separate) independent variable dimension to then_neighbors
nearestindepvars
point.
- Returns
dependent variable predictions for the
query_points
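Example (a minimal sketch of the bandwidth formats listed above; the numerical bandwidth values are illustrative):
from PCAfold import KReg
import numpy as np

# One-dimensional training data:
indepvars = np.expand_dims(np.linspace(0, np.pi, 11), axis=1)
depvars = np.cos(indepvars)
query = np.expand_dims(np.linspace(0, np.pi, 21), axis=1)

model = KReg(indepvars, depvars)

# Single value: constant bandwidth for every query point and dimension:
predicted_constant = model.predict(query, 0.2)

# 2D array (n_points x n_independent_variables): one bandwidth per query point and dimension:
predicted_array = model.predict(query, 0.2*np.ones_like(query))

# Nearest-neighbor based bandwidths (isotropic):
predicted_knn = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=2)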
KReg.compute_constant_bandwidth
#
- PCAfold.kernel_regression.KReg.compute_constant_bandwidth(self, query_points, bandwidth)#
Format a single bandwidth value into a 2D array matching the shape of
query_points
- Parameters
query_points – array of independent variable points to query the model (n_points x n_independent_variables)
bandwidth – single value for the bandwidth used in a Gaussian kernel
- Returns
an array of bandwidth values matching the shape of
query_points
KReg.compute_bandwidth_isotropic
#
- PCAfold.kernel_regression.KReg.compute_bandwidth_isotropic(self, query_points, bandwidth)#
Format a 1D array of bandwidth values for each point in
query_points
into a 2D array matching the shape ofquery_points
- Parameters
query_points – array of independent variable points to query the model (n_points x n_independent_variables)
bandwidth – 1D array of bandwidth values length n_points
- Returns
an array of bandwidth values matching the shape of
query_points
(repeats the bandwidth array for each independent variable)
KReg.compute_bandwidth_anisotropic
#
- PCAfold.kernel_regression.KReg.compute_bandwidth_anisotropic(self, query_points, bandwidth)#
Format a 1D array of bandwidth values for each independent variable into the 2D array matching the shape of
query_points
- Parameters
query_points – array of independent variable points to query the model (n_points x n_independent_variables)
bandwidth – 1D array of bandwidth values length n_independent_variables
- Returns
an array of bandwidth values matching the shape of
query_points
(repeats the bandwidth array for each point inquery_points
)
KReg.compute_nearest_neighbors_bandwidth_isotropic
#
- PCAfold.kernel_regression.KReg.compute_nearest_neighbors_bandwidth_isotropic(self, query_points, n_neighbors)#
Compute a variable bandwidth for each point in
query_points
based on the Euclidean distance to then_neighbors
nearest neighbor- Parameters
query_points – array of independent variable points to query the model (n_points x n_independent_variables)
n_neighbors – integer value for the number of nearest neighbors to consider in computing a bandwidth (distance)
- Returns
an array of bandwidth values matching the shape of
query_points
(varies for each point, constant across independent variables)
KReg.compute_nearest_neighbors_bandwidth_anisotropic
#
- PCAfold.kernel_regression.KReg.compute_nearest_neighbors_bandwidth_anisotropic(self, query_points, n_neighbors)#
Compute a variable bandwidth for each point in
query_points
and each independent variable separately based on the distance to then_neighbors
nearest neighbor in each independent variable dimension- Parameters
query_points – array of independent variable points to query the model (n_points x n_independent_variables)
n_neighbors – integer value for the number of nearest neighbors to consider in computing a bandwidth (distance)
- Returns
an array of bandwidth values matching the shape of
query_points
(varies for each point and independent variable)
Plotting functions#
plot_normalized_variance
#
- PCAfold.analysis.plot_normalized_variance(variance_data, plot_variables=[], color_map='Blues', figure_size=(10, 5), title=None, save_filename=None)#
This function plots normalized variance \(\mathcal{N}(\sigma)\) over bandwidth values \(\sigma\) from an object of a
VarianceData
class.Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the
plot_variables
list.Example:
from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Plot normalized variance quantities:
plt = plot_normalized_variance(variance_data, plot_variables=[0,1,2], color_map='Blues', figure_size=(10,5), title='Normalized variance', save_filename='N.pdf')
plt.close()
- Parameters
variance_data – an object of
VarianceData
class objects whose normalized variance quantities should be plotted.plot_variables – (optional)
list
ofint
specifying indices of variables to be plotted. By default, all variables are plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'Blues'
.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_normalized_variance_comparison
#
- PCAfold.analysis.plot_normalized_variance_comparison(variance_data_tuple, plot_variables_tuple, color_map_tuple, figure_size=(10, 5), title=None, save_filename=None)#
This function plots a comparison of normalized variance \(\mathcal{N}(\sigma)\) over bandwidth values \(\sigma\) from several objects of a
VarianceData
class.Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the
plot_variables
list.Example:
from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,5)
Y = np.random.rand(100,5)

# Perform PCA to obtain low-dimensional manifolds:
pca_X = PCA(X, n_components=2)
pca_Y = PCA(Y, n_components=2)
principal_components_X = pca_X.transform(X)
principal_components_Y = pca_Y.transform(Y)

# Compute normalized variance quantities:
variance_data_X = compute_normalized_variance(principal_components_X, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)
variance_data_Y = compute_normalized_variance(principal_components_Y, Y, depvar_names=['F', 'G', 'H', 'I', 'J'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)

# Plot a comparison of normalized variance quantities:
plt = plot_normalized_variance_comparison((variance_data_X, variance_data_Y), ([0,1,2], [0,1,2]), ('Blues', 'Reds'), figure_size=(10,5), title='Normalized variance comparison', save_filename='N.pdf')
plt.close()
- Parameters
variance_data_tuple –
tuple
ofVarianceData
class objects whose normalized variance quantities should be compared on one plot. For instance:(variance_data_1, variance_data_2)
.plot_variables_tuple –
list
ofint
specifying indices of variables to be plotted. It should have as many elements as there areVarianceData
class objects supplied. For instance:([], [])
will plot all variables.color_map – (optional)
tuple
ofstr
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. It should have as many elements as there areVarianceData
class objects supplied. For instance:('Blues', 'Reds')
.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_normalized_variance_derivative
#
- PCAfold.analysis.plot_normalized_variance_derivative(variance_data, plot_variables=[], color_map='Blues', figure_size=(10, 5), title=None, save_filename=None)#
This function plots a scaled normalized variance derivative (computed over logarithmically scaled bandwidths), \(\hat{\mathcal{D}}(\sigma)\), over bandwidth values \(\sigma\) from an object of a
VarianceData
class.Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the
plot_variables
list.Example:
from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_derivative
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Perform PCA to obtain the low-dimensional manifold:
pca_X = PCA(X, n_components=2)
principal_components = pca_X.transform(X)

# Compute normalized variance quantities:
variance_data = compute_normalized_variance(principal_components, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 1, 20), scale_unit_box=True)

# Plot normalized variance derivative:
plt = plot_normalized_variance_derivative(variance_data, plot_variables=[0,1,2], color_map='Blues', figure_size=(10,5), title='Normalized variance derivative', save_filename='D-hat.pdf')
plt.close()
- Parameters
variance_data – an object of
VarianceData
class objects whose normalized variance derivative quantities should be plotted.plot_variables – (optional)
list
ofint
specifying indices of variables to be plotted. By default, all variables are plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'Blues'
.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_normalized_variance_derivative_comparison
#
- PCAfold.analysis.plot_normalized_variance_derivative_comparison(variance_data_tuple, plot_variables_tuple, color_map_tuple, figure_size=(10, 5), title=None, save_filename=None)#
This function plots a comparison of scaled normalized variance derivatives (computed over logarithmically scaled bandwidths), \(\hat{\mathcal{D}}(\sigma)\), over bandwidth values \(\sigma\) from several objects of a
VarianceData
class.Note: this function can accommodate plotting up to 18 variables at once. You can specify which variables should be plotted using the
plot_variables
list.Example:
from PCAfold import PCA, compute_normalized_variance, plot_normalized_variance_derivative_comparison
import numpy as np

# Generate dummy data sets:
X = np.random.rand(100,5)
Y = np.random.rand(100,5)

# Perform PCA to obtain low-dimensional manifolds:
pca_X = PCA(X, n_components=2)
pca_Y = PCA(Y, n_components=2)
principal_components_X = pca_X.transform(X)
principal_components_Y = pca_Y.transform(Y)

# Compute normalized variance quantities:
variance_data_X = compute_normalized_variance(principal_components_X, X, depvar_names=['A', 'B', 'C', 'D', 'E'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)
variance_data_Y = compute_normalized_variance(principal_components_Y, Y, depvar_names=['F', 'G', 'H', 'I', 'J'], bandwidth_values=np.logspace(-3, 2, 20), scale_unit_box=True)

# Plot a comparison of normalized variance derivatives:
plt = plot_normalized_variance_derivative_comparison((variance_data_X, variance_data_Y), ([0,1,2], [0,1,2]), ('Blues', 'Reds'), figure_size=(10,5), title='Normalized variance derivative comparison', save_filename='D-hat.pdf')
plt.close()
- Parameters
variance_data_tuple –
tuple
ofVarianceData
class objects whose normalized variance derivative quantities should be compared on one plot. For instance:(variance_data_1, variance_data_2)
.plot_variables_tuple –
list
ofint
specifying indices of variables to be plotted. It should have as many elements as there areVarianceData
class objects supplied. For instance:([], [])
will plot all variables.color_map – (optional)
tuple
ofstr
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. It should have as many elements as there areVarianceData
class objects supplied. For instance:('Blues', 'Reds')
.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Bibliography#
- AAS21
Elizabeth Armstrong and James C. Sutherland. A technique for characterising feature size and quality of manifolds. Combustion Theory and Modelling, 0(0):1–23, 2021. doi:10.1080/13647830.2021.1931715.
- AHardle90
Wolfgang Härdle. Applied Nonparametric Regression. Econometric Society Monographs. Cambridge University Press, 1990. doi:10.1017/CCOL0521382483.
- AZASP22
Kamila Zdybał, Elizabeth Armstrong, James C. Sutherland, and Alessandro Parente. Cost function for low-dimensional manifold topology assessment. Scientific Reports, 12:14496, 2022. URL: https://www.nature.com/articles/s41598-022-18655-1, doi:https://doi.org/10.1038/s41598-022-18655-1.
Reconstruction#
Tools for reconstructing quantities of interest (QoIs)#
Class ANN
#
- class PCAfold.reconstruction.ANN(input_data, output_data, interior_architecture=(), activation_functions='tanh', weights_init='glorot_uniform', biases_init='zeros', loss='MSE', optimizer='Adam', batch_size=200, n_epochs=1000, learning_rate=0.001, validation_perc=10, random_seed=None, verbose=False)#
Enables reconstruction of quantities of interest (QoIs) using an artificial neural network (ANN).
Example:
from PCAfold import ANN
import numpy as np

# Generate dummy dataset:
input_data = np.random.rand(100,8)
output_data = np.random.rand(100,3)

# Instantiate ANN class object:
ann_model = ANN(input_data, output_data, interior_architecture=(5,4), activation_functions=('tanh', 'tanh', 'linear'), weights_init='glorot_uniform', biases_init='zeros', loss='MSE', optimizer='Adam', batch_size=100, n_epochs=1000, learning_rate=0.001, validation_perc=10, random_seed=100, verbose=True)

# Begin model training:
ann_model.train()
A summary of the current ANN model and its hyperparameter settings can be printed using the
summary()
function:

# Print the ANN model summary
ann_model.summary()
ANN model summary...
- Parameters
input_data –
numpy.ndarray
specifying the data set used as the input (regressors) to the ANN. It should be of size(n_observations,n_input_variables)
.output_data –
numpy.ndarray
specifying the data set used as the output (predictors) to the ANN. It should be of size(n_observations,n_output_variables)
.interior_architecture – (optional)
tuple
ofint
specifying the number of neurons in the interior network architecture. For example, ifinterior_architecture=(4,5)
, two interior layers will be created and the overall network architecture will be(Input)-(4)-(5)-(Output)
. If set to an empty tuple,interior_architecture=()
, the overall network architecture will be(Input)-(Output)
. Keep in mind that if you’d like to create just one interior layer, you should use a comma after the integer:interior_architecture=(4,)
.activation_functions – (optional)
str
ortuple
specifying activation functions in all layers. If set tostr
, the same activation function is used in all layers. If set to atuple
ofstr
, a different activation function can be set at different layers. The number of elements in thetuple
should match the number of layers!str
andstr
elements of thetuple
can only be'linear'
,'sigmoid'
, or'tanh'
.weights_init – (optional)
str
specifying the initialization of weights in the network. If set toNone
, weights will be initialized using the Glorot uniform distribution.biases_init – (optional)
str
specifying the initialization of biases in the network. If set toNone
, biases will be initialized as zeros.loss – (optional)
str
specifying the loss function. It can be'MAE'
or'MSE'
.optimizer – (optional)
str
specifying the optimizer used during training. It can be'Adam'
or'Nadam'
.batch_size – (optional)
int
specifying the batch size.n_epochs – (optional)
int
specifying the number of epochs.learning_rate – (optional)
float
specifying the learning rate passed to the optimizer.validation_perc – (optional)
int
specifying the percentage of the input data to be used as validation data during training. It should be a number larger than or equal to 0 and smaller than 100. Note that if it is set above 0, not all of the input data will be used as training data. Note that validation data does not impact model training!random_seed – (optional)
int
specifying the random seed to be used for any random operations. It is highly recommended to set a fixed random seed, as this allows for complete reproducibility of the results.verbose – (optional)
bool
for printing verbose details.
Attributes:
input_data - (read only)
numpy.ndarray
specifying the data set used as the input to the ANN.output_data - (read only)
numpy.ndarray
specifying the data set used as the output to the ANN.architecture - (read only)
str
specifying the ANN architecture.ann_model - (read only) object of
Keras.models.Sequential
class that stores the artificial neural network model.weights_and_biases_init - (read only)
list
ofnumpy.ndarray
specifying weights and biases with which the ANN was initialized.weights_and_biases_trained - (read only)
list
ofnumpy.ndarray
specifying weights and biases after training the ANN. Only available after callingANN.train()
.training_loss - (read only)
list
of losses computed on the training data. Only available after callingANN.train()
.validation_loss - (read only)
list
of losses computed on the validation data. Only available after callingANN.train()
and only whenvalidation_perc
is not equal to 0.
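Example (a minimal sketch of inspecting these attributes; it assumes the ann_model object instantiated and trained in the class example above, and that the attribute names match the list in this section):
# Assumes ann_model was created and trained as in the ANN class example above.
print(ann_model.architecture)                              # architecture string (exact format assumed)
training_loss = ann_model.training_loss                    # training losses, available after ANN.train()
validation_loss = ann_model.validation_loss                # validation losses (only if validation_perc > 0)
trained_parameters = ann_model.weights_and_biases_trained  # weights and biases after training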
ANN.summary
#
- PCAfold.reconstruction.ANN.summary(self)#
Prints the ANN model summary.
ANN.train
#
- PCAfold.reconstruction.ANN.train(self)#
Trains the artificial neural network (ANN) model.
ANN.predict
#
- PCAfold.reconstruction.ANN.predict(self, input_regressors)#
Predicts the quantities of interest (QoIs) from the trained artificial neural network (ANN) model.
- Parameters
input_regressors –
numpy.ndarray
specifying the input data (regressors) to be used for predicting the quantities of interest (QoIs) from the trained ANN model. It should be of size(n_observations,n_input_variables)
, wheren_observations
can be different from the number of observations in the training dataset.- Returns
output_predictors - predicted quantities of interest (QoIs).
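Example (a minimal sketch of reconstructing QoIs at new input points with a trained ANN; the hyperparameter values below are illustrative):
from PCAfold import ANN
import numpy as np

# Generate dummy dataset:
input_data = np.random.rand(100,8)
output_data = np.random.rand(100,3)

# Instantiate and train a small ANN (illustrative hyperparameters):
ann_model = ANN(input_data, output_data, interior_architecture=(5,4), n_epochs=100, random_seed=100)
ann_model.train()

# Predict the QoIs at new input points; the number of observations may differ from training:
new_inputs = np.random.rand(20,8)
predicted_qois = ann_model.predict(new_inputs)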
ANN.print_weights_and_biases_init
#
- PCAfold.reconstruction.ANN.print_weights_and_biases_init(self)#
Prints initial weights and biases from all layers of the ANN model.
ANN.print_weights_and_biases_trained
#
- PCAfold.reconstruction.ANN.print_weights_and_biases_trained(self)#
Prints trained weights and biases from all layers of the ANN model.
ANN.plot_losses
#
- PCAfold.reconstruction.ANN.plot_losses(self, markevery=100, figure_size=(15, 5), save_filename=None)#
Plots training and validation losses.
- Parameters
markevery – (optional)
int
specifying how frequently the epoch number on the x-axis should be labelled.figure_size – (optional)
tuple
specifying figure size.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Class PartitionOfUnityNetwork
#
- class PCAfold.reconstruction.PartitionOfUnityNetwork(partition_centers, partition_shapes, basis_type, ivar_center=None, ivar_scale=None, basis_coeffs=None, transform_power=1.0, transform_shift=0.0, transform_sign_shift=0.0, dtype='float64')#
A class for reconstruction (regression) of QoIs using POUnets.
The POUnets are constructed with a single-layer network of normalized radial basis functions (RBFs) whose neurons each own and weight a polynomial basis. For independent variable inputs \(\vec{x}\) of dimensionality \(d\), the \(i^{\text{th}}\) partition or neuron is computed as
\[\Phi_i(\vec{x};\vec{h}_i,K_i) = \phi^{{\rm RBF}}_i(\vec{x};\vec{h}_i,K_i)/\sum_j \phi^{{\rm RBF}}_j(\vec{x};\vec{h}_i,K_i)\]where
\[\phi_i^{{\rm RBF}}(\vec{x};\vec{h}_i,K_i) = \exp\left(-(\vec{x}-\vec{h}_i)^\mathsf{T}K_i(\vec{x}-\vec{h}_i)\right) \nonumber\]with vector \(\vec{h}_i\) and diagonal matrix \(K_i\) defining the \(d\) center and \(d\) shape parameters, respectively, for training.
The final output of a POUnet is then obtained through
\[g(\vec{x};\vec{h},K,c) = \sum_{i=1}^{p}\left(\Phi_i(\vec{x};\vec{h}_i,K_i)\sum_{k=1}^{b}c_{i,k}m_k(\vec{x})\right)\]where the polynomial basis is represented as a sum of \(b\) Taylor monomials, with the \(k^{\text{th}}\) monomial written as \(m_k(\vec{x})\), that are multiplied by trainable basis coefficients \(c\). The number of basis monomials is determined by the
basis_type
for the polynomial. For example, in two-dimensional space, a quadratic polynomial basis contains \(b=6\) monomial functions \(\{1, x_1, x_2, x_1^2, x_2^2, x_1x_2\}\). The combination of the partitions and polynomial basis functions creates localized polynomial fits for a QoI.More information can be found in [UAHK+22].
The
PartitionOfUnityNetwork
class also provides a nonlinear transformation for the dependent variable(s) during training, which can be beneficial if the variable changes over orders of magnitude, for example. The equation for the transformation of variable \(f\) is\[(|f + s_1|)^\alpha \text{sign}(f + s_1) + s_2 \text{sign}(f + s_1)\]where \(\alpha\) is the
transform_power
, \(s_1\) is thetransform_shift
, and \(s_2\) is thetransform_sign_shift
.Example:
from PCAfold import init_uniform_partitions, PartitionOfUnityNetwork
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)
dvars = 2.*ivars[:,0] + 3.*ivars[:,1]

# Initialize the POUnet parameters
net = PartitionOfUnityNetwork(**init_uniform_partitions([5,7], ivars), basis_type='linear')

# Build the training graph with provided training data
net.build_training_graph(ivars, dvars)

# (optional) update the learning rate (default is 1.e-3)
net.update_lr(1.e-4)

# (optional) update the least-squares regularization (default is 1.e-10)
net.update_l2reg(1.e-10)

# Train the POUnet
net.train(1000)

# Evaluate the POUnet
pred = net(ivars)

# Evaluate the POUnet derivatives
der = net.derivatives(ivars)

# Save the POUnet to a file
net.write_data_to_file('filename.pkl')

# Load a POUnet from file
net2 = PartitionOfUnityNetwork.load_from_file('filename.pkl')

# Evaluate the loaded POUnet (without needing to call build_training_graph)
pred2 = net2(ivars)
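As a side note on the dependent-variable transformation defined above, the snippet below is a stand-alone numerical sketch of that equation; it is illustrative only and not a PCAfold function (the helper name transform_dvar is hypothetical):
import numpy as np

# Illustrative sketch of the training transformation:
# f -> |f + s1|^alpha * sign(f + s1) + s2 * sign(f + s1)
def transform_dvar(f, transform_power=1.0, transform_shift=0.0, transform_sign_shift=0.0):
    shifted = f + transform_shift
    return np.abs(shifted)**transform_power * np.sign(shifted) + transform_sign_shift * np.sign(shifted)

# A variable spanning several orders of magnitude is compressed by a fractional power:
dvars = np.array([-1.0e-4, 0.0, 1.0e2])
transformed = transform_dvar(dvars, transform_power=0.5)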
- Parameters
partition_centers – array size (number of partitions) x (number of ivar inputs) for partition locations
partition_shapes – array size (number of partitions) x (number of ivar inputs) for partition shapes influencing the RBF widths
basis_type – string (
'constant'
,'linear'
, or'quadratic'
) for the degree of polynomial basisivar_center – (optional, default
None
) array for centering the ivar inputs before evaluating the POUnet, ifNone
centers with zerosivar_scale – (optional, default
None
) array for scaling the ivar inputs before evaluating the POUnet, ifNone
scales with onesbasis_coeffs – (optional, default
None
) if the array of polynomial basis coefficients is known, it may be provided here, otherwise it will be initialized withbuild_training_graph
and trained withtrain
transform_power – (optional, default 1.) the power parameter used in the transformation equation during training
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation during training
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation during training
dtype – (optional, default
'float64'
) string specifying either float type'float64'
or'float32'
Attributes:
partition_centers - (read only) array of the current partition centers
partition_shapes - (read only) array of the current partition shape parameters
basis_type - (read only) string relaying the basis degree
basis_coeffs - (read only) array of the current basis coefficients
ivar_center - (read only) array of the centering parameters for the ivar inputs
ivar_scale - (read only) array of the scaling parameters for the ivar inputs
dtype - (read only) string relaying the data type (
'float64'
or'float32'
)training_archive - (read only) dictionary of the errors and POUnet states archived during training
iterations - (read only) array of the iterations archived during training
PartitionOfUnityNetwork.load_data_from_file
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_file(filename)#
Load data from a specified
filename
with pickle (followingwrite_data_to_file
)- Parameters
filename – string
- Returns
dictionary of the POUnet data
PartitionOfUnityNetwork.load_from_file
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.load_from_file(filename)#
Load class from a specified
filename
with pickle (followingwrite_data_to_file
)- Parameters
filename – string
- Returns
PartitionOfUnityNetwork
PartitionOfUnityNetwork.load_data_from_txt
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.load_data_from_txt(filename, verbose=False)#
Load data from a specified txt
filename
(followingwrite_data_to_txt
)- Parameters
filename – string
verbose – (optional, default False) print out the data as it is read
- Returns
dictionary of the POUnet data
PartitionOfUnityNetwork.write_data_to_file
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_file(self, filename)#
Save class data to a specified file using pickle. This does not include the archived data from training, which can be separately accessed with training_archive and saved outside of
PartitionOfUnityNetwork
.- Parameters
filename – string
PartitionOfUnityNetwork.write_data_to_txt
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.write_data_to_txt(self, filename, nformat='%.14e')#
Save data to a specified txt file. This may be used to read POUnet parameters into other languages such as C++
- Parameters
filename – string
PartitionOfUnityNetwork.build_training_graph
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.build_training_graph(self, ivars, dvars, error_type='abs', constrain_positivity=False, istensor=False, verbose=False)#
Construct the graph used during training (including defining the training errors) with the provided training data
- Parameters
ivars – array of independent variables for training
dvars – array of dependent variable(s) for training
error_type – (optional, default
'abs'
) the type of training error: relative'rel'
or absolute'abs'
constrain_positivity – (optional, default False) when True, it penalizes the training error with \(f - |f|\) for dependent variables \(f\). This can be useful when used in
QoIAwareProjectionPOUnet
istensor – (optional, default False) whether to evaluate ivars and dvars as tensorflow Tensors or numpy arrays
verbose – (optional, default False) when True, prints out the number of partition and basis parameters
PartitionOfUnityNetwork.update_lr
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.update_lr(self, lr)#
update the learning rate for training
- Parameters
lr – float for the learning rate
PartitionOfUnityNetwork.update_l2reg
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.update_l2reg(self, l2reg)#
update the least-squares regularization for training
- Parameters
l2reg – float for the least-squares regularization
PartitionOfUnityNetwork.lstsq
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.lstsq(self, verbose=True)#
update the basis coefficients with least-squares regression
- Parameters
verbose – (optional, default True) prints when least-squares solve is performed when True
PartitionOfUnityNetwork.train
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.train(self, iterations, archive_rate=100, use_best_archive_sse=True, verbose=False)#
Performs training using a block coordinate descent strategy that combines gradient descent and least-squares: it alternates between updating the partition parameters with gradient descent and updating the basis coefficients with least-squares.
- Parameters
iterations – integer for number of training iterations to perform
archive_rate – (optional, default 100) the rate at which the errors and parameters are archived during training. These can be accessed with the
training_archive
attributeuse_best_archive_sse – (optional, default True) when True will set the POUnet parameters to those with the lowest error observed during training, otherwise the parameters from the last iteration are used
verbose – (optional, default False) when True will print progress
PartitionOfUnityNetwork.__call__
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.__call__(self, xeval)#
evaluate the POUnet
- Parameters
xeval – array of independent variable query points
- Returns
array of POUnet predictions
PartitionOfUnityNetwork.derivatives
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.derivatives(self, xeval, dvar_idx=0)#
evaluate the POUnet derivatives
- Parameters
xeval – array of independent variable query points
dvar_idx – (optional, default 0) index for the dependent variable whose derivatives are being evaluated
- Returns
array of POUnet derivative evaluations
PartitionOfUnityNetwork.partition_prenorm
#
- PCAfold.reconstruction.PartitionOfUnityNetwork.partition_prenorm(self, xeval)#
evaluate the POUnet partitions prior to normalization
- Parameters
xeval – array of independent variable query points
- Returns
array of POUnet RBF partition evaluations before normalization
init_uniform_partitions
#
- PCAfold.reconstruction.init_uniform_partitions(list_npartitions, ivars, width_factor=0.5, verbose=False)#
Computes parameters for initializing partition locations near training data with uniform spacing in each dimension.
Example:
from PCAfold import init_uniform_partitions
import numpy as np

# Generate dummy data set:
ivars = np.random.rand(100,2)

# compute partition parameters for an initial 5x7 grid:
init_data = init_uniform_partitions([5, 7], ivars)
- Parameters
list_npartitions – list of integers specifying the number of partitions to try initializing in each dimension. Only partitions near the provided ivars are kept.
ivars – array of independent variables used for determining which partitions to keep
width_factor – (optional, default 0.5) the factor multiplying the spacing between partitions for initializing the partitions’ RBF widths
verbose – (optional, default False) when True, prints the number of partitions retained compared to the initial grid
- Returns
a dictionary of partition parameters to be used in initializing a
PartitionOfUnityNetwork
Regression assessment#
Class RegressionAssessment
#
- class PCAfold.reconstruction.RegressionAssessment(observed, predicted, idx=None, variable_names=None, use_global_mean=False, norm='std', use_global_norm=False, tolerance=0.05)#
Wrapper class for storing all regression assessment metrics for a regression solution, given the observed dependent variables, \(\pmb{\phi}_o\), and the predicted dependent variables, \(\pmb{\phi}_p\).
Example:
from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Access mean absolute error values:
MAE = regression_metrics.mean_absolute_error
In addition, all stratified regression metrics can be computed on a single variable:
from PCAfold import variable_bins

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=5, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Access stratified mean absolute error values:
stratified_MAE = stratified_regression_metrics.stratified_mean_absolute_error
- Parameters
observed –
numpy.ndarray
specifying the observed values of dependent variables, \(\pmb{\phi}_o\). It should be of size(n_observations,)
or(n_observations,n_variables)
.predicted –
numpy.ndarray
specifying the predicted values of dependent variables, \(\pmb{\phi}_p\). It should be of size(n_observations,)
or(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.variable_names – (optional)
list
ofstr
specifying variable names.use_global_mean – (optional)
bool
specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.norm –
str
specifying the normalization, \(d_{norm}\), for NRMSE computation. It can be one of the following:std
,range
,root_square_mean
,root_square_range
,root_square_std
,abs_mean
.use_global_norm – (optional)
bool
specifying if global norm of the observed variable should be used in NRMSE calculation.tolerance –
float
specifying the tolerance for GDE computation.
Attributes:
coefficient_of_determination - (read only)
numpy.ndarray
specifying the coefficient of determination, \(R^2\), values. It has size(1,n_variables)
.mean_absolute_error - (read only)
numpy.ndarray
specifying the mean absolute error (MAE) values. It has size(1,n_variables)
.mean_squared_error - (read only)
numpy.ndarray
specifying the mean squared error (MSE) values. It has size(1,n_variables)
.root_mean_squared_error - (read only)
numpy.ndarray
specifying the root mean squared error (RMSE) values. It has size(1,n_variables)
.normalized_root_mean_squared_error - (read only)
numpy.ndarray
specifying the normalized root mean squared error (NRMSE) values. It has size(1,n_variables)
.good_direction_estimate - (read only)
float
specifying the good direction estimate (GDE) value, treating the entire \(\pmb{\phi}_o\) and \(\pmb{\phi}_p\) as vectors. Note that if a single dependent variable is passed, GDE cannot be computed and is set toNaN
.
If
idx
has been specified:stratified_coefficient_of_determination - (read only)
numpy.ndarray
specifying the coefficient of determination, \(R^2\), values. It has size(1,n_variables)
.stratified_mean_absolute_error - (read only)
numpy.ndarray
specifying the mean absolute error (MAE) values. It has size(1,n_variables)
.stratified_mean_squared_error - (read only)
numpy.ndarray
specifying the mean squared error (MSE) values. It has size(1,n_variables)
.stratified_root_mean_squared_error - (read only)
numpy.ndarray
specifying the root mean squared error (RMSE) values. It has size(1,n_variables)
.stratified_normalized_root_mean_squared_error - (read only)
numpy.ndarray
specifying the normalized root mean squared error (NRMSE) values. It has size(1,n_variables)
.
RegressionAssessment.print_metrics
#
- PCAfold.reconstruction.RegressionAssessment.print_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#
Prints regression assessment metrics as raw text, in
tex
format and/or aspandas.DataFrame
.Example:
from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Instantiate RegressionAssessment class object:
regression_metrics = RegressionAssessment(X, X_rec)

# Print regression metrics:
regression_metrics.print_metrics(table_format=['raw', 'tex', 'pandas'], float_format='.4f', metrics=['R2', 'NRMSE', 'GDE'])
Note
Adding
'raw'
to thetable_format
list will result in printing:------------------------- X1 R2: 0.9900 NRMSE: 0.0999 GDE: 70.0000 ------------------------- X2 R2: 0.6126 NRMSE: 0.6224 GDE: 70.0000 ------------------------- X3 R2: 0.6368 NRMSE: 0.6026 GDE: 70.0000
Adding
'tex'
to thetable_format
list will result in printing:\begin{table}[h!] \begin{center} \begin{tabular}{llll} \toprule & \textit{X1} & \textit{X2} & \textit{X3} \\ \midrule R2 & 0.9900 & 0.6126 & 0.6368 \\ NRMSE & 0.0999 & 0.6224 & 0.6026 \\ GDE & 70.0000 & 70.0000 & 70.0000 \\ \end{tabular} \caption{}\label{} \end{center} \end{table}
Adding
'pandas'
to thetable_format
list (works well in Jupyter notebooks) will result in printing:Additionally, the current object of
RegressionAssessment
class can be compared with another object:

from PCAfold import PCA, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)
Y = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)
pca_Y = PCA(Y, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))
Y_rec = pca_Y.reconstruct(pca_Y.transform(Y))

# Instantiate RegressionAssessment class object:
regression_metrics_X = RegressionAssessment(X, X_rec)
regression_metrics_Y = RegressionAssessment(Y, Y_rec)

# Print regression metrics:
regression_metrics_X.print_metrics(table_format=['raw', 'pandas'], float_format='.4f', metrics=['R2', 'NRMSE', 'GDE'], comparison=regression_metrics_Y)
Note
Adding
'raw'
to thetable_format
list will result in printing:------------------------- X1 R2: 0.9133 BETTER NRMSE: 0.2944 BETTER GDE: 67.0000 WORSE ------------------------- X2 R2: 0.5969 WORSE NRMSE: 0.6349 WORSE GDE: 67.0000 WORSE ------------------------- X3 R2: 0.6175 WORSE NRMSE: 0.6185 WORSE GDE: 67.0000 WORSE
Adding
'pandas'
to thetable_format
list (works well in Jupyter notebooks) will result in printing:- Parameters
table_format – (optional)
list
ofstr
specifying the format(s) in which the table should be printed. Strings can only be'raw'
,'tex'
and/or'pandas'
.float_format – (optional)
str
specifying the display format for the numerical entries inside the table. By default it is set to'.4f'
.metrics – (optional)
list
ofstr
specifying which metrics should be printed. Strings can only be'R2'
,'MAE'
,'MSE'
,'MSLE'
,'RMSE'
,'NRMSE'
,'GDE'
. If metrics is set toNone
, all available metrics will be printed.comparison – (optional) object of
RegressionAssessment
class specifying the metrics that should be compared with the current regression metrics.
RegressionAssessment.print_stratified_metrics
#
- PCAfold.reconstruction.RegressionAssessment.print_stratified_metrics(self, table_format=['raw'], float_format='.4f', metrics=None, comparison=None)#
Prints stratified regression assessment metrics as raw text, in
tex
format and/or aspandas.DataFrame
. In each cluster, in addition to the regression metrics, number of observations is printed, along with the minimum and maximum values of the observed variable in that cluster.Example:
from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)

# Print regression metrics:
stratified_regression_metrics.print_stratified_metrics(table_format=['raw', 'tex', 'pandas'], float_format='.4f', metrics=['R2', 'MAE', 'NRMSE'])
Note
Adding 'raw' to the table_format list will result in printing:
-------------------------
k1
Observations:  31
Min:    0.0120
Max:    0.3311
R2:    -3.3271
MAE:    0.1774
NRMSE:  2.0802
-------------------------
k2
Observations:  38
Min:    0.3425
Max:    0.6665
R2:    -1.4608
MAE:    0.1367
NRMSE:  1.5687
-------------------------
k3
Observations:  31
Min:    0.6853
Max:    0.9959
R2:    -3.7319
MAE:    0.1743
NRMSE:  2.1753
Adding 'tex' to the table_format list will result in printing:
\begin{table}[h!]
\begin{center}
\begin{tabular}{llll}
\toprule
 & \textit{k1} & \textit{k2} & \textit{k3} \\
\midrule
Observations & 31.0000 & 38.0000 & 31.0000 \\
Min & 0.0120 & 0.3425 & 0.6853 \\
Max & 0.3311 & 0.6665 & 0.9959 \\
R2 & -3.3271 & -1.4608 & -3.7319 \\
MAE & 0.1774 & 0.1367 & 0.1743 \\
NRMSE & 2.0802 & 1.5687 & 2.1753 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}
Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing a pandas.DataFrame.
Additionally, the current object of the RegressionAssessment class can be compared with another object:
from PCAfold import PCA, variable_bins, RegressionAssessment
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=3, verbose=False)

# Instantiate RegressionAssessment class object:
stratified_regression_metrics_0 = RegressionAssessment(X[:,0], X_rec[:,0], idx=idx)
stratified_regression_metrics_1 = RegressionAssessment(X[:,1], X_rec[:,1], idx=idx)

# Print regression metrics:
stratified_regression_metrics_0.print_stratified_metrics(table_format=['raw', 'pandas'],
                                                          float_format='.4f',
                                                          metrics=['R2', 'MAE', 'NRMSE'],
                                                          comparison=stratified_regression_metrics_1)
Note
Adding 'raw' to the table_format list will result in printing:
-------------------------
k1
Observations:  39
Min:    0.0013
Max:    0.3097
R2:     0.9236  BETTER
MAE:    0.0185  BETTER
NRMSE:  0.2764  BETTER
-------------------------
k2
Observations:  29
Min:    0.3519
Max:    0.6630
R2:     0.9380  BETTER
MAE:    0.0179  BETTER
NRMSE:  0.2491  BETTER
-------------------------
k3
Observations:  32
Min:    0.6663
Max:    0.9943
R2:     0.9343  BETTER
MAE:    0.0194  BETTER
NRMSE:  0.2563  BETTER
Adding 'pandas' to the table_format list (works well in Jupyter notebooks) will result in printing a pandas.DataFrame.
- Parameters
table_format – (optional)
list
ofstr
specifying the format(s) in which the table should be printed. Strings can only be'raw'
,'tex'
and/or'pandas'
.float_format – (optional)
str
specifying the display format for the numerical entries inside the table. By default it is set to'.4f'
.metrics – (optional)
list
ofstr
specifying which metrics should be printed. Strings can only be'R2'
,'MAE'
,'MSE'
,'MSLE'
,'RMSE'
,'NRMSE'
. If metrics is set toNone
, all available metrics will be printed.comparison – (optional) object of
RegressionAssessment
class specifying the metrics that should be compared with the current regression metrics.
coefficient_of_determination
#
- PCAfold.reconstruction.coefficient_of_determination(observed, predicted)#
Computes the coefficient of determination, \(R^2\), value:
\[R^2 = 1 - \frac{\sum_{i=1}^N (\phi_{o,i} - \phi_{p,i})^2}{\sum_{i=1}^N (\phi_{o,i} - \mathrm{mean}(\phi_{o,i}))^2}\]where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the coefficient of determination for the first variable:
r2 = coefficient_of_determination(X[:,0], X_rec[:,0])
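The value returned by coefficient_of_determination can be cross-checked against the formula above with plain NumPy. The following is a minimal sketch (not part of PCAfold); only coefficient_of_determination is the library call:

from PCAfold import coefficient_of_determination
import numpy as np

# Generate dummy observed and predicted values:
observed = np.random.rand(100)
predicted = observed + 0.05*np.random.rand(100)

# Library computation:
r2 = coefficient_of_determination(observed, predicted)

# Manual computation following the formula above:
ss_residual = np.sum((observed - predicted)**2)
ss_total = np.sum((observed - np.mean(observed))**2)
r2_manual = 1.0 - ss_residual/ss_total

# r2 and r2_manual should agree up to floating-point error.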
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
r2 - coefficient of determination, \(R^2\).
stratified_coefficient_of_determination
#
- PCAfold.reconstruction.stratified_coefficient_of_determination(observed, predicted, idx, use_global_mean=True, verbose=False)#
Computes the stratified coefficient of determination, \(R^2\), values. Stratified \(R^2\) is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
\(R_j^2\) in the \(j^{th}\) bin can be computed in two ways:
If
use_global_mean=True
, the mean of the entire observed variable is used as a reference:
\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o))^2}\]If
use_global_mean=False
, the mean of the considered \(j^{th}\) bin is used as a reference:
\[R_j^2 = 1 - \frac{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \phi_{p,i}^{j})^2}{\sum_{i=1}^{N_j} (\phi_{o,i}^{j} - \mathrm{mean}(\phi_o^{j}))^2}\]where \(N_j\) is the number of observations in the \(j^{th}\) bin and \(\phi_p\) is the predicted dependent variable.
Note
After running this function you can call
analysis.plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders)
on the function outputs and it will visualize how stratified \(R^2\) changes across bins.Warning
The stratified \(R^2\) metric can be misleading if there are large variations in point density in an observed variable. For instance, consider a data set composed of lines of points that have uniform spacing on the \(x\) axis but become more and more sparse in the direction of increasing \(\phi\) due to an increasing gradient of \(\phi\). If bins are narrow enough (the number of bins is high enough), a single bin can contain only one of those lines of points for high values of \(\phi\). \(R^2\) will then be computed for constant, or almost constant, observations, even though globally those observations lie in a region with a large gradient of the observed variable!
Example:
from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_coefficient_of_determination
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=True, verbose=True)

# Plot the stratified R2 values:
plot_stratified_coefficient_of_determination(r2_in_bins, bins_borders)
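To see the effect of the use_global_mean parameter directly, the two variants can be computed side by side. This is a minimal sketch that reuses the same dummy data pattern as the example above; the remark in the comment about per-bin means is a general tendency, not a guarantee:

from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination
import numpy as np

# Generate dummy data set and its PCA approximation:
X = np.random.rand(100,10)
pca_X = PCA(X, scaling='auto', n_components=2)
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Stratified R2 with the global mean of the observed variable as reference:
r2_global = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=True)

# Stratified R2 with the local (per-bin) mean as reference; values tend to be
# lower since the per-bin variance in the denominator is smaller:
r2_local = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=False)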
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.use_global_mean – (optional)
bool
specifying if global mean of the observed variable should be used as a reference in \(R^2\) calculation.verbose – (optional)
bool
for printing sizes (number of observations) and \(R^2\) values in each bin.
- Returns
r2_in_bins -
list
specifying the coefficients of determination \(R^2\) in each bin. It has lengthk
.
mean_absolute_error
#
- PCAfold.reconstruction.mean_absolute_error(observed, predicted)#
Computes the mean absolute error (MAE):
\[\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^N | \phi_{o,i} - \phi_{p,i} |\]where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean absolute error for the first variable:
mae = mean_absolute_error(X[:,0], X_rec[:,0])
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
mae - mean absolute error (MAE).
stratified_mean_absolute_error
#
- PCAfold.reconstruction.stratified_mean_absolute_error(observed, predicted, idx, verbose=False)#
Computes the stratified mean absolute error (MAE) values. Stratified MAE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
MAE in the \(j^{th}\) bin can be computed as:
\[\mathrm{MAE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} | \phi_{o,i}^j - \phi_{p,i}^j |\]where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, variable_bins, stratified_mean_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MAE in 10 bins of the first variable in a data set:
mae_in_bins = stratified_mean_absolute_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing sizes (number of observations) and MAE values in each bin.
- Returns
mae_in_bins -
list
specifying the mean absolute error (MAE) in each bin. It has lengthk
.
max_absolute_error
#
- PCAfold.reconstruction.max_absolute_error(observed, predicted)#
Computes the maximum absolute error (MaxAE):
\[\mathrm{MaxAE} = \mathrm{max}( | \phi_{o,i} - \phi_{p,i} | )\]where \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, max_absolute_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the maximum absolute error for the first variable:
maxae = max_absolute_error(X[:,0], X_rec[:,0])
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
maxae - max absolute error (MaxAE).
mean_squared_error
#
- PCAfold.reconstruction.mean_squared_error(observed, predicted)#
Computes the mean squared error (MSE):
\[\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2\]where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared error for the first variable:
mse = mean_squared_error(X[:,0], X_rec[:,0])
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
mse - mean squared error (MSE).
stratified_mean_squared_error
#
- PCAfold.reconstruction.stratified_mean_squared_error(observed, predicted, idx, verbose=False)#
Computes the stratified mean squared error (MSE) values. Stratified MSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
MSE in the \(j^{th}\) bin can be computed as:
\[\mathrm{MSE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2\]where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, variable_bins, stratified_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSE in 10 bins of the first variable in a data set:
mse_in_bins = stratified_mean_squared_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing sizes (number of observations) and MSE values in each bin.
- Returns
mse_in_bins -
list
specifying the mean squared error (MSE) in each bin. It has lengthk
.
mean_squared_logarithmic_error
#
- PCAfold.reconstruction.mean_squared_logarithmic_error(observed, predicted)#
Computes the mean squared logarithmic error (MSLE):
\[\mathrm{MSLE} = \frac{1}{N} \sum_{i=1}^N (\log(\phi_{o,i} + 1) - \log(\phi_{p,i} + 1)) ^2\]where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Warning
The MSLE metric can only be used on non-negative samples.
Example:
from PCAfold import PCA, mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the mean squared logarithmic error for the first variable:
msle = mean_squared_logarithmic_error(X[:,0], X_rec[:,0])
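The role of the \(\log(\phi + 1)\) transformation, and the reason for the non-negativity requirement above, can be illustrated with a minimal NumPy sketch (illustration of the formula only, not the PCAfold implementation):

import numpy as np

# Dummy non-negative observed and predicted values:
observed = np.random.rand(100)
predicted = np.random.rand(100)

# MSLE compares log(phi + 1) of the observed and predicted values:
msle_manual = np.mean((np.log1p(observed) - np.log1p(predicted))**2)

# Per the warning above, the metric should only be applied to non-negative samples.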
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
msle - mean squared logarithmic error (MSLE).
stratified_mean_squared_logarithmic_error
#
- PCAfold.reconstruction.stratified_mean_squared_logarithmic_error(observed, predicted, idx, verbose=False)#
Computes the stratified mean squared logarithmic error (MSLE) values. Stratified MSLE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
MSLE in the \(j^{th}\) bin can be computed as:
\[\mathrm{MSLE}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} (\log(\phi_{o,i}^j + 1) - \log(\phi_{p,i}^j + 1)) ^2\]where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Warning
The MSLE metric can only be used on non-negative samples.
Example:
from PCAfold import PCA, variable_bins, stratified_mean_squared_logarithmic_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified MSLE in 10 bins of the first variable in a data set:
msle_in_bins = stratified_mean_squared_logarithmic_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing sizes (number of observations) and MSLE values in each bin.
- Returns
msle_in_bins -
list
specifying the mean squared logarithmic error (MSLE) in each bin. It has lengthk
.
root_mean_squared_error
#
- PCAfold.reconstruction.root_mean_squared_error(observed, predicted)#
Computes the root mean squared error (RMSE):
\[\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]where \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the root mean squared error for the first variable:
rmse = root_mean_squared_error(X[:,0], X_rec[:,0])
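Since RMSE is the square root of MSE, the two library functions can be cross-checked against each other. A minimal sketch:

from PCAfold import mean_squared_error, root_mean_squared_error
import numpy as np

# Dummy observed and predicted values:
observed = np.random.rand(100)
predicted = observed + 0.1*np.random.rand(100)

mse = mean_squared_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)

# rmse should equal np.sqrt(mse) up to floating-point error.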
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.
- Returns
rmse - root mean squared error (RMSE).
stratified_root_mean_squared_error
#
- PCAfold.reconstruction.stratified_root_mean_squared_error(observed, predicted, idx, verbose=False)#
Computes the stratified root mean squared error (RMSE) values. Stratified RMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
RMSE in the \(j^{th}\) bin can be computed as:
\[\mathrm{RMSE}_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, variable_bins, stratified_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified RMSE in 10 bins of the first variable in a data set:
rmse_in_bins = stratified_root_mean_squared_error(X[:,0], X_rec[:,0], idx=idx, verbose=True)
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing sizes (number of observations) and RMSE values in each bin.
- Returns
rmse_in_bins -
list
specifying the root mean squared error (RMSE) in each bin. It has length k
.
normalized_root_mean_squared_error
#
- PCAfold.reconstruction.normalized_root_mean_squared_error(observed, predicted, norm='std')#
Computes the normalized root mean squared error (NRMSE):
\[\mathrm{NRMSE} = \frac{1}{d_{norm}} \sqrt{\frac{1}{N} \sum_{i=1}^N (\phi_{o,i} - \phi_{p,i}) ^2}\]where \(d_{norm}\) is the normalization factor, \(N\) is the number of observations, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Various normalizations are available. The norm value and the corresponding normalization factor \(d_{norm}\) are:
'root_square_mean' (Root square mean): \(d_{norm} = \sqrt{\mathrm{mean}(\phi_o^2)}\)
'std' (Std): \(d_{norm} = \mathrm{std}(\phi_o)\)
'range' (Range): \(d_{norm} = \mathrm{max}(\phi_o) - \mathrm{min}(\phi_o)\)
'root_square_range' (Root square range): \(d_{norm} = \sqrt{\mathrm{max}(\phi_o^2) - \mathrm{min}(\phi_o^2)}\)
'root_square_std' (Root square std): \(d_{norm} = \sqrt{\mathrm{std}(\phi_o^2)}\)
'abs_mean' (Absolute mean): \(d_{norm} = | \mathrm{mean}(\phi_o) |\)
Example:
from PCAfold import PCA, normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the normalized root mean squared error for the first variable:
nrmse = normalized_root_mean_squared_error(X[:,0], X_rec[:,0], norm='std')
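With norm='std', the formula above reduces to RMSE divided by the standard deviation of the observed variable. The following is a minimal sketch of this cross-check; exact agreement may depend on the standard-deviation convention used internally, which is an assumption here:

from PCAfold import root_mean_squared_error, normalized_root_mean_squared_error
import numpy as np

# Dummy observed and predicted values:
observed = np.random.rand(100)
predicted = observed + 0.1*np.random.rand(100)

rmse = root_mean_squared_error(observed, predicted)
nrmse = normalized_root_mean_squared_error(observed, predicted, norm='std')

# With norm='std', nrmse should be close to rmse/np.std(observed).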
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.norm –
str
specifying the normalization, \(d_{norm}\). It can be one of the following:std
,range
,root_square_mean
,root_square_range
,root_square_std
,abs_mean
.
- Returns
nrmse - normalized root mean squared error (NRMSE).
stratified_normalized_root_mean_squared_error
#
- PCAfold.reconstruction.stratified_normalized_root_mean_squared_error(observed, predicted, idx, norm='std', use_global_norm=False, verbose=False)#
Computes the stratified normalized root mean squared error (NRMSE) values. Stratified NRMSE is computed separately in each bin (cluster) of an observed dependent variable, \(\phi_o\).
NRMSE in the \(j^{th}\) bin can be computed as:
\[\mathrm{NRMSE}_j = \frac{1}{d_{norm}} \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (\phi_{o,i}^j - \phi_{p,i}^j) ^2}\]where \(N_j\) is the number of observations in the \(j^{th}\) bin, \(\phi_o\) is the observed and \(\phi_p\) is the predicted dependent variable.
Example:
from PCAfold import PCA, variable_bins, stratified_normalized_root_mean_squared_error
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified NRMSE in 10 bins of the first variable in a data set:
nrmse_in_bins = stratified_normalized_root_mean_squared_error(X[:,0], X_rec[:,0], idx=idx, norm='std', use_global_norm=True, verbose=True)
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable, \(\phi_o\). It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable, \(\phi_p\). It should be of size(n_observations,)
or(n_observations, 1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.norm –
str
specifying the normalization, \(d_{norm}\). It can be one of the following:std
,range
,root_square_mean
,root_square_range
,root_square_std
,abs_mean
.use_global_norm – (optional)
bool
specifying if global norm of the observed variable should be used in NRMSE calculation. If set toFalse
, norms are computed on samples from the corresponding bin.verbose – (optional)
bool
for printing sizes (number of observations) and NRMSE values in each bin.
- Returns
nrmse_in_bins -
list
specifying the normalized root mean squared error (NRMSE) in each bin. It has length k
.
turning_points
#
- PCAfold.reconstruction.turning_points(observed, predicted)#
Computes the turning points percentage - the percentage of predicted outputs that have the opposite growth tendency to the corresponding observed growth tendency.
Warning
This function is under construction.
- Returns
turning_points - turning points percentage in %.
good_estimate
#
- PCAfold.reconstruction.good_estimate(observed, predicted, tolerance=0.05)#
Computes the good estimate (GE) - the percentage of predicted values that are within the specified tolerance from the corresponding observed values.
Warning
This function is under construction.
- Parameters
observed –
numpy.ndarray
specifying the observed values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.
tolerance –
float
specifying the tolerance.
- Returns
good_estimate - good estimate (GE) in %.
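Since good_estimate is still under construction, the sketch below only illustrates the idea described above using plain NumPy; the absolute-difference interpretation of the tolerance is an assumption, not the PCAfold implementation:

import numpy as np

# Dummy observed and predicted values:
observed = np.random.rand(100)
predicted = observed + 0.02*np.random.rand(100)
tolerance = 0.05

# Assumed interpretation: count predictions within an absolute tolerance of the observations:
within_tolerance = np.abs(observed - predicted) <= tolerance
ge_percentage = 100.0*np.sum(within_tolerance)/len(observed)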
good_direction_estimate
#
- PCAfold.reconstruction.good_direction_estimate(observed, predicted, tolerance=0.05)#
Computes the good direction (GD) and the good direction estimate (GDE).
GD for observation \(i\), is computed as:
\[GD_i = \frac{\vec{\phi}_{o,i}}{|| \vec{\phi}_{o,i} ||} \cdot \frac{\vec{\phi}_{p,i}}{|| \vec{\phi}_{p,i} ||}\]where \(\vec{\phi}_o\) is the observed vector quantity and \(\vec{\phi}_p\) is the predicted vector quantity.
GDE is computed as the percentage of predicted vector observations whose direction is within the specified tolerance from the direction of the corresponding observed vector.
Example:
from PCAfold import PCA, good_direction_estimate
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,3)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Compute the vector of good direction and good direction estimate:
(good_direction, good_direction_estimate) = good_direction_estimate(X, X_rec, tolerance=0.01)
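The GD formula above can be reproduced with plain NumPy by normalizing each observation vector and taking row-wise dot products. A minimal sketch (illustration of the formula, not the PCAfold implementation):

import numpy as np

# Dummy observed and predicted vector quantities:
observed = np.random.rand(100,3)
predicted = observed + 0.1*np.random.rand(100,3)

# Normalize each observation vector and compute the row-wise dot product:
observed_unit = observed/np.linalg.norm(observed, axis=1, keepdims=True)
predicted_unit = predicted/np.linalg.norm(predicted, axis=1, keepdims=True)
good_direction_manual = np.sum(observed_unit*predicted_unit, axis=1)

# Values close to 1 indicate predicted vectors pointing in nearly the same
# direction as the corresponding observed vectors.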
- Parameters
observed –
numpy.ndarray
specifying the observed vector quantity, \(\vec{\phi}_o\). It should be of size(n_observations,n_dimensions)
.predicted –
numpy.ndarray
specifying the predicted vector quantity, \(\vec{\phi}_p\). It should be of size(n_observations,n_dimensions)
.tolerance –
float
specifying the tolerance.
- Returns
good_direction -
numpy.ndarray
specifying the vector of good direction (GD). It has size(n_observations,)
.good_direction_estimate - good direction estimate (GDE) in %.
generate_tex_table
#
- PCAfold.reconstruction.generate_tex_table(data_frame_table, float_format='.2f', caption='', label='')#
Generates
tex
code for a table stored in apandas.DataFrame
. This function can be useful e.g. for printing regression results.Example:
from PCAfold import PCA, generate_tex_table
import numpy as np
import pandas as pd

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy variables names:
variable_names = ['A1', 'A2', 'A3', 'A4', 'A5']

# Instantiate PCA class object:
pca_q2 = PCA(X, scaling='auto', n_components=2, use_eigendec=True, nocenter=False)
pca_q3 = PCA(X, scaling='auto', n_components=3, use_eigendec=True, nocenter=False)

# Calculate the R2 values:
r2_q2 = pca_q2.calculate_r2(X)[None,:]
r2_q3 = pca_q3.calculate_r2(X)[None,:]

# Generate pandas.DataFrame from the R2 values:
r2_table = pd.DataFrame(np.vstack((r2_q2, r2_q3)), columns=variable_names, index=['PCA, $q=2$', 'PCA, $q=3$'])

# Generate tex code for the table:
generate_tex_table(r2_table, float_format=".3f", caption='$R^2$ values.', label='r2-values')
Note
The code above will produce tex code:
\begin{table}[h!]
\begin{center}
\begin{tabular}{llllll}
\toprule
 & \textit{A1} & \textit{A2} & \textit{A3} & \textit{A4} & \textit{A5} \\
\midrule
PCA, $q=2$ & 0.507 & 0.461 & 0.485 & 0.437 & 0.611 \\
PCA, $q=3$ & 0.618 & 0.658 & 0.916 & 0.439 & 0.778 \\
\end{tabular}
\caption{$R^2$ values.}\label{r2-values}
\end{center}
\end{table}
This, when compiled, will result in a formatted table.
- Parameters
data_frame_table –
pandas.DataFrame
specifying the table to convert totex
code. It can include column names and index names.float_format –
str
specifying the display format for the numerical entries inside the table. By default it is set to'.2f'
.caption –
str
specifying caption for the table.label –
str
specifying label for the table.
Plotting functions#
plot_2d_regression
#
- PCAfold.reconstruction.plot_2d_regression(x, observed, predicted, x_label=None, y_label=None, color_observed=None, color_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#
Plots the result of regression of a dependent variable on top of a one-dimensional manifold defined by a single independent variable
x
.Example:
from PCAfold import PCA, plot_2d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_2d_regression(X[:,0],
                         X[:,0],
                         X_rec[:,0],
                         x_label='$x$',
                         y_label='$y$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='2D regression',
                         save_filename='2d-regression.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.observed –
numpy.ndarray
specifying the observed values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.color_observed – (optional)
str
specifying the color of the plotted observed variable.color_predicted – (optional)
str
specifying the color of the plotted predicted variable.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_2d_regression_scalar_field
#
- PCAfold.reconstruction.plot_2d_regression_scalar_field(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), x_label=None, y_label=None, s_field=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a 2D field of a regressed scalar dependent variable. A two-dimensional manifold can be additionally plotted on top of the field.
Example:
from PCAfold import PCA, KReg, plot_2d_regression_scalar_field
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,2)
Z = np.random.rand(100,1)

# Train the kernel regression model:
model = KReg(X, Z)

# Define the regression model:
def regression_model(query):
    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)[:,0]
    return predicted

# Define the bounds for the scalar field:
grid_bounds = ([np.min(X[:,0]),np.max(X[:,0])],[np.min(X[:,1]),np.max(X[:,1])])

# Plot the regressed scalar field:
plt = plot_2d_regression_scalar_field(grid_bounds,
                                      regression_model,
                                      x=X[:,0],
                                      y=X[:,1],
                                      resolution=(100,100),
                                      extension=(10,10),
                                      x_label='$X_1$',
                                      y_label='$X_2$',
                                      s_field=4,
                                      s_manifold=60,
                                      manifold_color=Z,
                                      colorbar_label='$Z_1$',
                                      color_map='inferno',
                                      colorbar_range=(0,1),
                                      manifold_alpha=1,
                                      grid_on=False,
                                      figure_size=(10,6),
                                      title='2D regressed scalar field',
                                      save_filename='2D-regressed-scalar-field.pdf')
plt.close()
- Parameters
grid_bounds –
tuple
oflist
specifying the bounds of the dependent variable on the \(x\) and \(y\) axis.regression_model –
function
that outputs the predicted vector using the regression model. It should take as input anumpy.ndarray
of size(1,2)
, where the two elements specify the first and second independent variable values. It should output afloat
specifying the regressed scalar value at that input.x – (optional)
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
. It can be used to plot a 2D manifold on top of the streamplot.y – (optional)
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
. It can be used to plot a 2D manifold on top of the streamplot.resolution – (optional)
tuple
ofint
specifying the resolution of the streamplot grid on the \(x\) and \(y\) axis.extension – (optional)
tuple
offloat
orint
specifying a percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond what has been specified by the grid_bounds
parameter.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.s_field – (optional)
int
orfloat
specifying the scatter point size for the scalar field.s_manifold – (optional)
int
orfloat
specifying the scatter point size for the manifold.manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in
x
andy
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, manifold will be plotted in black.colorbar_label – (optional)
str
specifying colorbar label annotation.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.colorbar_range – (optional)
tuple
specifying the lower and the upper bound for the colorbar range.manifold_alpha – (optional)
float
orint
specifying the opacity of the plotted manifold.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_2d_regression_streamplot
#
- PCAfold.reconstruction.plot_2d_regression_streamplot(grid_bounds, regression_model, x=None, y=None, resolution=(10, 10), extension=(0, 0), color='k', x_label=None, y_label=None, s_manifold=None, manifold_color=None, colorbar_label=None, color_map='viridis', colorbar_range=None, manifold_alpha=1, grid_on=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a streamplot of a regressed vector field of a dependent variable. A two-dimensional manifold can be additionally plotted on top of the streamplot.
Example:
from PCAfold import PCA, KReg, plot_2d_regression_streamplot
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)
S_X = np.random.rand(100,5)

# Obtain two-dimensional manifold from PCA:
pca_X = PCA(X, n_components=2)
PCs = pca_X.transform(X)
S_Z = pca_X.transform(S_X, nocenter=True)

# Train the kernel regression model:
model = KReg(PCs, S_Z)

# Define the regression model:
def regression_model(query):
    predicted = model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)
    return predicted

# Define the bounds for the streamplot:
grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])

# Plot the regression streamplot:
plt = plot_2d_regression_streamplot(grid_bounds,
                                    regression_model,
                                    x=PCs[:,0],
                                    y=PCs[:,1],
                                    resolution=(15,15),
                                    extension=(20,20),
                                    color='r',
                                    x_label='$Z_1$',
                                    y_label='$Z_2$',
                                    manifold_color=X[:,0],
                                    colorbar_label='$X_1$',
                                    color_map='plasma',
                                    colorbar_range=(0,1),
                                    manifold_alpha=1,
                                    grid_on=False,
                                    figure_size=(10,6),
                                    title='Streamplot',
                                    save_filename='streamplot.pdf')
plt.close()
- Parameters
grid_bounds –
tuple
oflist
specifying the bounds of the dependent variable on the \(x\) and \(y\) axis.regression_model –
function
that outputs the predicted vector using the regression model. It should take as input anumpy.ndarray
of size(1,2)
, where the two elements specify the first and second independent variable values. It should output anumpy.ndarray
of size(1,2)
, where the two elements specify the first and second regressed vector elements.x – (optional)
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
. It can be used to plot a 2D manifold on top of the streamplot.y – (optional)
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
. It can be used to plot a 2D manifold on top of the streamplot.resolution – (optional)
tuple
ofint
specifying the resolution of the streamplot grid on the \(x\) and \(y\) axis.extension – (optional)
tuple
offloat
orint
specifying a percentage by which the grid should be extended on the \(x\) and \(y\) axis beyond what has been specified by the grid_bounds
parameter.color – (optional)
str
specifying the streamlines color.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.s_manifold – (optional)
int
orfloat
specifying the scatter point size for the manifold.manifold_color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in
x
andy
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, manifold will be plotted in black.colorbar_label – (optional)
str
specifying colorbar label annotation.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.colorbar_range – (optional)
tuple
specifying the lower and the upper bound for the colorbar range.manifold_alpha – (optional)
float
orint
specifying the opacity of the plotted manifold.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_3d_regression
#
- PCAfold.reconstruction.plot_3d_regression(x, y, observed, predicted, elev=45, azim=-45, clean=False, x_label=None, y_label=None, z_label=None, color_observed=None, color_predicted=None, s_observed=None, s_predicted=None, alpha_observed=None, alpha_predicted=None, figure_size=(7, 7), title=None, save_filename=None)#
Plots the result of regression of a dependent variable on top of a two-dimensional manifold defined by two independent variables
x
andy
.Example:
from PCAfold import PCA, plot_3d_regression
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Obtain three-dimensional manifold from PCA:
pca_X = PCA(X)
PCs = pca_X.transform(X)
X_rec = pca_X.reconstruct(PCs)

# Plot the manifold:
plt = plot_3d_regression(X[:,0],
                         X[:,1],
                         X[:,0],
                         X_rec[:,0],
                         elev=45,
                         azim=-45,
                         x_label='$x$',
                         y_label='$y$',
                         z_label='$z$',
                         color_observed='k',
                         color_predicted='r',
                         figure_size=(10,10),
                         title='3D regression',
                         save_filename='3d-regression.pdf')
plt.close()
- Parameters
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.observed –
numpy.ndarray
specifying the observed values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.predicted –
numpy.ndarray
specifying the predicted values of a single dependent variable. It should be of size(n_observations,)
or(n_observations, 1)
.elev – (optional)
float
orint
specifying the elevation angle.azim – (optional)
float
orint
specifying the azimuth angle.clean – (optional)
bool
specifying if a clean plot should be made. If set toTrue
, nothing else but the data points and the 3D axes is plotted.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.z_label – (optional)
str
specifying \(z\)-axis label annotation. If set toNone
label will not be plotted.color_observed – (optional)
str
specifying the color of the plotted observed variable.color_predicted – (optional)
str
specifying the color of the plotted predicted variable.s_observed – (optional)
int
orfloat
specifying the scatter point size for the observed variable.s_predicted – (optional)
int
orfloat
specifying the scatter point size for the predicted variable.alpha_observed – (optional)
int
orfloat
specifying the point opacity for the observed variable.alpha_predicted – (optional)
int
orfloat
specifying the point opacity for the predicted variable.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
plot_stratified_metric
#
- PCAfold.reconstruction.plot_stratified_metric(metric_in_bins, bins_borders, variable_name=None, metric_name=None, yscale='linear', ylim=None, figure_size=(10, 5), title=None, save_filename=None)#
This function plots a stratified metric across bins of a dependent variable.
Example:
from PCAfold import PCA, variable_bins, stratified_coefficient_of_determination, plot_stratified_metric
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='auto', n_components=2)

# Approximate the data set:
X_rec = pca_X.reconstruct(pca_X.transform(X))

# Generate bins:
(idx, bins_borders) = variable_bins(X[:,0], k=10, verbose=False)

# Compute stratified R2 in 10 bins of the first variable in a data set:
r2_in_bins = stratified_coefficient_of_determination(X[:,0], X_rec[:,0], idx=idx, use_global_mean=True, verbose=True)

# Visualize how R2 changes across bins:
plt = plot_stratified_metric(r2_in_bins,
                             bins_borders,
                             variable_name='$X_1$',
                             metric_name='$R^2$',
                             yscale='log',
                             figure_size=(10,5),
                             title='Stratified $R^2$',
                             save_filename='r2.pdf')
plt.close()
- Parameters
metric_in_bins –
list
of metric values in each bin.bins_borders –
list
of bins borders that were created to stratify the dependent variable.variable_name – (optional)
str
specifying the name of the variable for which the metric was computed. If set toNone
label on the x-axis will not be plotted.metric_name – (optional)
str
specifying the name of the metric to be plotted on the y-axis. If set toNone
label on the y-axis will not be plotted.yscale – (optional)
str
specifying the scale for the y-axis.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
Bibliography#
- UAHK+22
Elizabeth Armstrong, Michael A. Hansen, Robert C. Knaus, Nathaniel A. Trask, John C. Hewson, and James C. Sutherland. Accurate compression of tabulated chemistry models with partition of unity networks. Combustion Science and Technology, 0(0):1–18, 2022. doi:10.1080/00102202.2022.2102908.
Utilities#
Tools for optimizing manifold topology#
Class QoIAwareProjection
#
- class PCAfold.utilities.QoIAwareProjection(input_data, n_components, projection_independent_outputs=None, projection_dependent_outputs=None, activation_decoder='tanh', decoder_interior_architecture=(), encoder_weights_init=None, decoder_weights_init=None, hold_initialization=None, hold_weights=None, transformed_projection_dependent_outputs=None, transform_power=0.5, transform_shift=0.0001, transform_sign_shift=0.0, loss='MSE', optimizer='Adam', batch_size=200, n_epochs=1000, learning_rate=0.001, validation_perc=10, random_seed=None, verbose=False)#
Enables computing QoI-aware encoder-decoder projections.
The QoI-aware encoder-decoder is an autoencoder-like neural network that reconstructs important quantities of interest (QoIs) at the output of a decoder. The QoIs can be set to projection-independent variables (such as the original state variables) or projection-dependent variables, whose definition changes during neural network training.
We introduce an intrusive modification to the neural network training process such that at each epoch, a low-dimensional basis matrix is computed from the current weights in the encoder. Any projection-dependent variables at the output get re-projected onto that basis.
The rationale for performing dimensionality reduction with the QoI-aware strategy is that any poor topological behaviors on a low-dimensional projection will immediately increase the loss during training. These behaviors could be non-uniqueness in representing QoIs due to overlaps on a projection, or large gradients in QoIs caused by data compression in certain regions of a projection. Thus, the QoI-aware strategy naturally promotes improved projection topologies and can be useful in reduced-order modeling.
An illustrative explanation of how the QoI-aware encoder-decoder works is presented in the figure below:
More information can be found in [UZPS23].
Example:
from PCAfold import center_scale, QoIAwareProjection
import numpy as np

# Generate dummy dataset:
X = np.random.rand(100,8)
S = np.random.rand(100,8)

# Request 2D QoI-aware encoder-decoder projection of the dataset:
n_components = 2

# Preprocess the dataset before passing it to the encoder-decoder:
(input_data, centers, scales) = center_scale(X, scaling='0to1')
projection_dependent_outputs = S / scales

# Instantiate QoIAwareProjection class object:
qoi_aware = QoIAwareProjection(input_data,
                               n_components,
                               projection_independent_outputs=input_data[:,0:3],
                               projection_dependent_outputs=projection_dependent_outputs,
                               activation_decoder=('tanh', 'tanh', 'linear'),
                               decoder_interior_architecture=(5,8),
                               encoder_weights_init=None,
                               decoder_weights_init=None,
                               hold_initialization=10,
                               hold_weights=2,
                               transformed_projection_dependent_outputs='signed-square-root',
                               loss='MSE',
                               optimizer='Adam',
                               batch_size=100,
                               n_epochs=200,
                               learning_rate=0.001,
                               validation_perc=10,
                               random_seed=100,
                               verbose=True)

# Begin model training:
qoi_aware.train()
A summary of the current QoI-aware encoder-decoder model and its hyperparameter settings can be printed using the
summary()
function:# Print the QoI-aware encoder-decoder model summary qoi_aware.summary()
QoI-aware encoder-decoder model summary...

(Model has been trained)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Projection dimensionality:
- 2D projection
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Encoder-decoder architecture:
8-2-5-8-7
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Activation functions:
(8)--linear--(2)--tanh--(5)--tanh--(8)--linear--(7)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variables at the decoder output:
- 3 projection independent variables
- 2 projection dependent variables
- 2 transformed projection dependent variables using signed-square-root
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model validation:
- Using 10% of input data as validation data
- Model will be trained on 90% of input data
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hyperparameters:
- Batch size: 100
- # of epochs: 200
- Optimizer: Adam
- Learning rate: 0.001
- Loss function: MSE
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the encoder:
- Glorot uniform
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the decoder:
- Glorot uniform
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights updates in the encoder:
- Initial weights in the encoder will be kept for 10 first epochs
- Weights in the encoder will change once every 2 epochs
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Results reproducibility:
- Reproducible neural network training will be assured using random seed: 100
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Training results:
- Minimum training loss: 0.0852246955037117
- Minimum training loss at epoch: 199
- Minimum validation loss: 0.06681100279092789
- Minimum validation loss at epoch: 182
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- Parameters
input_data –
numpy.ndarray
specifying the data set used as the input to the encoder-decoder. It should be of size(n_observations,n_variables)
.n_components –
int
specifying the dimensionality of the QoI-aware encoder-decoder projection. This is equal to the number of neurons in the bottleneck layer.projection_independent_outputs – (optional)
numpy.ndarray
specifying any projection-independent outputs at the decoder. It should be of size(n_observations,n_projection_independent_outputs)
.projection_dependent_outputs – (optional)
numpy.ndarray
specifying any projection-dependent outputs at the decoder. During training,projection_dependent_outputs
is projected onto the current basis matrix and the decoder outputs are updated accordingly. It should be of size(n_observations,n_projection_dependent_outputs)
.activation_decoder – (optional)
str
ortuple
specifying activation functions in all the decoding layers. If set tostr
, the same activation function is used in all decoding layers. If set to atuple
ofstr
, a different activation function can be set at different decoding layers. The number of elements in thetuple
should match the number of decoding layers!str
andstr
elements of thetuple
can only be'linear'
,'sigmoid'
, or'tanh'
. Note, that the activation function in the encoder is hardcoded to'linear'
.decoder_interior_architecture – (optional)
tuple
ofint
specifying the number of neurons in the interior architecture of a decoder. For example, ifdecoder_interior_architecture=(4,5)
, two interior decoding layers will be created and the overall network architecture will be(Input)-(Bottleneck)-(4)-(5)-(Output)
. If set to an empty tuple,decoder_interior_architecture=()
, the overall network architecture will be(Input)-(Bottleneck)-(Output)
. Keep in mind that if you’d like to create just one interior layer, you should use a comma after the integer:decoder_interior_architecture=(4,)
.encoder_weights_init – (optional)
numpy.ndarray
specifying the custom initialization of the weights in the encoder. It should be of size(n_variables, n_components)
. If set toNone
, weights in the encoder will be initialized using the Glorot uniform distribution.decoder_weights_init – (optional)
tuple
ofnumpy.ndarray
specifying the custom initialization of the weights in the decoder. Each element in the tuple should have a shape that matches the architecture. If set toNone
, weights in the decoder will be initialized using the Glorot uniform distribution.hold_initialization – (optional)
int
specifying the number of first epochs during which the initial weights in the encoder are held constant. If set toNone
, weights in the encoder will change at the first epoch. This parameter can be used in conjunction withhold_weights
.hold_weights – (optional)
int
specifying how frequently the weights should be changed in the encoder. For example, if set tohold_weights=2
, the weights in the encoder will only be updated once every two epochs throughout the whole training process. If set toNone
, weights in the encoder will change at every epoch. This parameter can be used in conjunction withhold_initialization
.transformed_projection_dependent_outputs – (optional)
str
specifying if any nonlinear transformation of the projection-dependent outputs should be added at the decoder output. It can be'symlog'
or'signed-square-root'
.transform_power – (optional)
int
orfloat
as perpreprocess.power_transform()
.transform_shift – (optional)
int
orfloat
as perpreprocess.power_transform()
.transform_sign_shift – (optional)
int
orfloat
as perpreprocess.power_transform()
.loss – (optional)
str
specifying the loss function. It can be'MAE'
or'MSE'
.optimizer – (optional)
str
specifying the optimizer used during training. It can be'Adam'
or'Nadam'
.batch_size – (optional)
int
specifying the batch size.n_epochs – (optional)
int
specifying the number of epochs.learning_rate – (optional)
float
specifying the learning rate passed to the optimizer.validation_perc – (optional)
int
specifying the percentage of the input data to be used as validation data during training. It should be a number larger than or equal to 0 and smaller than 100. Note that if it is set above 0, not all of the input data will be used as training data. Note that validation data does not impact model training!random_seed – (optional)
int
specifying the random seed to be used for any random operations. It is highly recommended to set a fixed random seed, as this allows for complete reproducibility of the results.verbose – (optional)
bool
for printing verbose details.
Attributes:
input_data - (read only)
numpy.ndarray
specifying the data set used as the input to the encoder-decoder.n_components - (read only)
int
specifying the dimensionality of the QoI-aware encoder-decoder projection.projection_independent_outputs - (read only)
numpy.ndarray
specifying any projection-independent outputs at the decoder.projection_dependent_outputs - (read only)
numpy.ndarray
specifying any projection-dependent outputs at the decoder.architecture - (read only)
str
specifying the QoI-aware encoder-decoder architecture.n_total_outputs - (read only)
int
counting the total number of outputs at the decoder.qoi_aware_encoder_decoder - (read only) object of
Keras.models.Sequential
class that stores the QoI-aware encoder-decoder neural network.weights_and_biases_init - (read only)
list
ofnumpy.ndarray
specifying weights and biases with which the QoI-aware encoder-decoder was intialized.weights_and_biases_trained - (read only)
list
ofnumpy.ndarray
specifying weights and biases after training the QoI-aware encoder-decoder. Only available after callingQoIAwareProjection.train()
.training_loss - (read only)
list
of losses computed on the training data. Only available after callingQoIAwareProjection.train()
.validation_loss - (read only)
list
of losses computed on the validation data. Only available after callingQoIAwareProjection.train()
and only whenvalidation_perc
was not equal to 0.bases_across_epochs - (read only)
list
ofnumpy.ndarray
specifying all basis matrices from all epochs. Only available after callingQoIAwareProjection.train()
.
QoIAwareProjection.summary
#
- PCAfold.utilities.QoIAwareProjection.summary(self)#
Prints the QoI-aware encoder-decoder model summary.
QoIAwareProjection.train
#
- PCAfold.utilities.QoIAwareProjection.train(self)#
Trains the QoI-aware encoder-decoder neural network model.
After training, the optimized basis matrix for low-dimensional data projection can be obtained.
QoIAwareProjection.print_weights_and_biases_init
#
- PCAfold.utilities.QoIAwareProjection.print_weights_and_biases_init(self)#
Prints initial weights and biases from all layers of the QoI-aware encoder-decoder.
QoIAwareProjection.print_weights_and_biases_trained
#
- PCAfold.utilities.QoIAwareProjection.print_weights_and_biases_trained(self)#
Prints trained weights and biases from all layers of the QoI-aware encoder-decoder.
QoIAwareProjection.get_best_basis
#
- PCAfold.utilities.QoIAwareProjection.get_best_basis(self, method='min-training-loss')#
Returns the best low-dimensional basis according to the selected method.
- Parameters
method – (optional)
str
specifying the method used to select the best basis. It should be'min-training-loss'
,'min-validation-loss'
, or'last-epoch'
.- Returns
best_basis -
numpy.ndarray
specifying the best basis extracted from thebases_across_epochs
attribute.
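A minimal usage sketch is shown below; it assumes that a QoIAwareProjection object named projection has already been constructed with the parameters documented above, and it only chains the methods described in this section:
# Print a summary of the QoI-aware encoder-decoder architecture:
projection.summary()

# Train the QoI-aware encoder-decoder:
projection.train()

# Extract the best low-dimensional basis found during training:
best_basis = projection.get_best_basis(method='min-training-loss')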
QoIAwareProjection.plot_losses
#
- PCAfold.utilities.QoIAwareProjection.plot_losses(self, markevery=100, figure_size=(15, 5), save_filename=None)#
Plots training and validation losses.
- Parameters
markevery – (optional)
int
specifying how frequently the epoch number on the x-axis should be labelled.figure_size – (optional)
tuple
specifying figure size.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns
plt -
matplotlib.pyplot
plot handle.
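For instance, after training, the loss history could be plotted and saved to a file (a sketch reusing the hypothetical projection object from the example above; the filename is arbitrary):
# Plot the training (and, if available, validation) losses:
plt = projection.plot_losses(markevery=50, figure_size=(10,4), save_filename='QoI-aware-losses.pdf')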
manifold_informed_forward_variable_addition
#
- PCAfold.utilities.manifold_informed_forward_variable_addition(X, X_source, variable_names, scaling, bandwidth_values, target_variables=None, add_transformed_source=True, target_manifold_dimensionality=3, bootstrap_variables=None, penalty_function=None, power=1, vertical_shift=1, norm='max', integrate_to_peak=False, verbose=False)#
Manifold-informed feature selection algorithm based on forward variable addition introduced in [UZSP22]. The goal of the algorithm is to select a meaningful subset of the original variables such that undesired behaviors on a PCA-derived manifold of a given dimensionality are minimized. The algorithm uses the cost function, \(\mathcal{L}\), based on minimizing the area under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables (as per
cost_function_normalized_variance_derivative
function). The algorithm can be bootstrapped in two ways:Automatic bootstrap when
bootstrap_variables=None
: the first best variable is selected automatically as the one that gives the lowest cost.User-defined bootstrap when
bootstrap_variables
is set to a user-defined list of the bootstrap variables.
The algorithm iterates, adding a new variable that exhibits the lowest cost at each iteration. The original variables in a data set get ordered according to their effect on the manifold topology. Assuming that the original data set is composed of \(Q\) variables, the first output is a list of indices of the ordered original variables, \(\mathbf{X} = [X_1, X_2, \dots, X_Q]\). The second output is a list of indices of the selected subset of the original variables, \(\mathbf{X}_S = [X_1, X_2, \dots, X_n]\), that correspond to the minimum cost, \(\mathcal{L}\).
More information can be found in [UZSP22].
Note
The algorithm can be very expensive (for large data sets) due to multiple computations of the normalized variance derivative. Try running it on multiple cores or on a sampled data set.
If the algorithm fails because it cannot determine the peak location, try increasing the range of the
bandwidth_values
parameter.Example:
from PCAfold import manifold_informed_forward_variable_addition as FVA
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)
X_source = np.random.rand(100,10)

# Define original variables to add to the optimization:
target_variables = X[:,0:3]

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Specify the bandwidth values to compute the optimization on:
bandwidth_values = np.logspace(-4, 2, 50)

# Run the subset selection algorithm:
(ordered, selected, min_cost, costs) = FVA(X,
                                           X_source,
                                           variable_names,
                                           scaling='auto',
                                           bandwidth_values=bandwidth_values,
                                           target_variables=target_variables,
                                           add_transformed_source=True,
                                           target_manifold_dimensionality=2,
                                           bootstrap_variables=None,
                                           penalty_function='peak',
                                           norm='max',
                                           integrate_to_peak=True,
                                           verbose=True)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.X_source –
numpy.ndarray
specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09]. It should be of size(n_observations,n_variables)
.variable_names –
list
ofstr
specifying variable names.scaling – (optional)
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.bandwidth_values –
numpy.ndarray
specifying the bandwidth values, \(\sigma\), for \(\hat{\mathcal{D}}(\sigma)\) computation.target_variables – (optional)
numpy.ndarray
specifying the dependent variables that should be used in \(\hat{\mathcal{D}}(\sigma)\) computation. It should be of size(n_observations,n_target_variables)
.add_transformed_source – (optional)
bool
specifying if the PCA-transformed source terms of the state-space variables should be added in \(\hat{\mathcal{D}}(\sigma)\) computation, alongside the user-defined dependent variables.target_manifold_dimensionality – (optional)
int
specifying the target dimensionality of the PCA manifold.bootstrap_variables – (optional)
list
specifying the user-selected variables to bootstrap the algorithm with. If set toNone
, automatic bootstrapping is performed.penalty_function – (optional)
str
specifying the weighting applied to each area. Setpenalty_function='peak'
to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Setpenalty_function='sigma'
to weight each area continuously by the bandwidth. Setpenalty_function='log-sigma-over-peak'
to weight each area continuously by the \(\log_{10}\)-transformed bandwidth, normalized by the rightmost peak location, \(\sigma_{peak, i}\). Ifpenalty_function=None
, the area is not weighted.power – (optional)
float
orint
specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.vertical_shift – (optional)
float
orint
specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.norm – (optional)
str
specifying the norm to apply for all areas \(A_i\).norm='average'
uses an arithmetic average,norm='max'
uses the \(L_{\infty}\) norm,norm='median'
uses a median area,norm='cumulative'
uses a cumulative area andnorm='min'
uses a minimum area.integrate_to_peak – (optional)
bool
specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.verbose – (optional)
bool
for printing verbose details.
- Returns
ordered_variables -
list
specifying the indices of the ordered variables.selected_variables -
list
specifying the indices of the selected variables that correspond to the minimum cost \(\mathcal{L}\).optimized_cost -
float
specifying the cost corresponding to the optimized subset.costs -
list
specifying the costs, \(\mathcal{L}\), from each iteration.
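Since the returned quantities are lists of indices into the columns of \(\mathbf{X}\), the corresponding variable names can be recovered directly, for example (reusing the names from the example above):
# Map the selected variable indices back to variable names:
selected_names = [variable_names[i] for i in selected]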
manifold_informed_backward_variable_elimination
#
- PCAfold.utilities.manifold_informed_backward_variable_elimination(X, X_source, variable_names, scaling, bandwidth_values, target_variables=None, add_transformed_source=True, source_space=None, target_manifold_dimensionality=3, penalty_function=None, power=1, vertical_shift=1, norm='max', integrate_to_peak=False, verbose=False)#
Manifold-informed feature selection algorithm based on backward variable elimination introduced in [UZSP22]. The goal of the algorithm is to select a meaningful subset of the original variables such that undesired behaviors on a PCA-derived manifold of a given dimensionality are minimized. The algorithm uses the cost function, \(\mathcal{L}\), based on minimizing the area under the normalized variance derivatives curves, \(\hat{\mathcal{D}}(\sigma)\), for the selected \(n_{dep}\) dependent variables (as per
cost_function_normalized_variance_derivative
function).The algorithm iterates, at each iteration removing the variable whose removal decreases the cost the most. The original variables in a data set get ordered according to their effect on the manifold topology. Assuming that the original data set is composed of \(Q\) variables, the first output is a list of indices of the ordered original variables, \(\mathbf{X} = [X_1, X_2, \dots, X_Q]\). The second output is a list of indices of the selected subset of the original variables, \(\mathbf{X}_S = [X_1, X_2, \dots, X_n]\), that correspond to the minimum cost, \(\mathcal{L}\).
More information can be found in [UZSP22].
Note
The algorithm can be very expensive (for large data sets) due to multiple computations of the normalized variance derivative. Try running it on multiple cores or on a sampled data set.
If the algorithm fails because it cannot determine the peak location, try increasing the range of the
bandwidth_values
parameter.Example:
from PCAfold import manifold_informed_backward_variable_elimination as BVE
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,10)
X_source = np.random.rand(100,10)

# Define original variables to add to the optimization:
target_variables = X[:,0:3]

# Specify variables names
variable_names = ['X_' + str(i) for i in range(0,10)]

# Specify the bandwidth values to compute the optimization on:
bandwidth_values = np.logspace(-4, 2, 50)

# Run the subset selection algorithm:
(ordered, selected, min_cost, costs) = BVE(X,
                                           X_source,
                                           variable_names,
                                           scaling='auto',
                                           bandwidth_values=bandwidth_values,
                                           target_variables=target_variables,
                                           add_transformed_source=True,
                                           target_manifold_dimensionality=2,
                                           penalty_function='peak',
                                           norm='max',
                                           integrate_to_peak=True,
                                           verbose=True)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.X_source –
numpy.ndarray
specifying the source terms, \(\mathbf{S_X}\), corresponding to the state-space variables in \(\mathbf{X}\). This parameter is applicable to data sets representing reactive flows. More information can be found in [TSP09]. It should be of size(n_observations,n_variables)
.variable_names –
list
ofstr
specifying variable names. Order of names in thevariable_names
list should match the order of variables (columns) inX
.scaling – (optional)
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.bandwidth_values –
numpy.ndarray
specifying the bandwidth values, \(\sigma\), for \(\hat{\mathcal{D}}(\sigma)\) computation.target_variables – (optional)
numpy.ndarray
specifying the dependent variables that should be used in \(\hat{\mathcal{D}}(\sigma)\) computation. It should be of size(n_observations,n_target_variables)
.add_transformed_source – (optional)
bool
specifying if the PCA-transformed source terms of the state-space variables should be added in \(\hat{\mathcal{D}}(\sigma)\) computation, alongside the user-defined dependent variables.source_space – (optional)
str
specifying the space to which the PC source terms should be transformed before computing the cost. It can be one of the following:symlog
,continuous-symlog
,original-and-symlog
,original-and-continuous-symlog
. If set toNone
, PC source terms are kept in their original PCA-space.target_manifold_dimensionality – (optional)
int
specifying the target dimensionality of the PCA manifold.penalty_function – (optional)
str
specifying the weighting applied to each area. Setpenalty_function='peak'
to weight each area by the rightmost peak location, \(\sigma_{peak, i}\), for the \(i^{th}\) dependent variable. Setpenalty_function='sigma'
to weight each area continuously by the bandwidth. Setpenalty_function='log-sigma-over-peak'
to weight each area continuously by the \(\log_{10}\)-transformed bandwidth, normalized by the rightmost peak location, \(\sigma_{peak, i}\). Ifpenalty_function=None
, the area is not weighted.power – (optional)
float
orint
specifying the power, \(r\). It can be used to control how much penalty should be applied to variance happening at the smallest length scales.vertical_shift – (optional)
float
orint
specifying the vertical shift multiplier, \(b\). It can be used to control how much penalty should be applied to feature sizes.norm – (optional)
str
specifying the norm to apply for all areas \(A_i\).norm='average'
uses an arithmetic average,norm='max'
uses the \(L_{\infty}\) norm,norm='median'
uses a median area,norm='cumulative'
uses a cumulative area andnorm='min'
uses a minimum area.integrate_to_peak – (optional)
bool
specifying whether an individual area for the \(i^{th}\) dependent variable should be computed only up to the rightmost peak location.verbose – (optional)
bool
for printing verbose details.
- Returns
ordered_variables -
list
specifying the indices of the ordered variables.selected_variables -
list
specifying the indices of the selected variables that correspond to the minimum cost \(\mathcal{L}\).optimized_cost -
float
specifying the cost corresponding to the optimized subset.costs -
list
specifying the costs, \(\mathcal{L}\), from each iteration.
Class QoIAwareProjectionPOUnet
#
- class PCAfold.utilities.QoIAwareProjectionPOUnet(projection_weights, partition_centers, partition_shapes, basis_type, projection_biases=None, basis_coeffs=None, dtype='float64', **kwargs)#
This is analogous to
QoIAwareProjection
but usesPartitionOfUnityNetwork
as the decoder.Example:
from PCAfold import init_uniform_partitions, PCA, QoIAwareProjectionPOUnet
import numpy as np
import tensorflow as tf

# generate dummy data set:
ivars = np.random.rand(100,3)

# initialize a projection (e.g., using PCA)
pca = PCA(ivars, scaling='none', n_components=2)
ivar_proj = pca.transform(ivars)

# initialize the QoIAwareProjectionPOUnet parameters
net = QoIAwareProjectionPOUnet(pca.A[:,:2], **init_uniform_partitions([5,7], ivar_proj), basis_type='linear')

# function for defining the training dependent variables (can include a projection)
dvar = np.vstack((ivars[:,0] + ivars[:,1], 2.*ivars[:,0] + 3.*ivars[:,1], 3.*ivars[:,0] + 5.*ivars[:,1])).T

def dvar_func(proj_weights):
    temp = tf.Variable(np.expand_dims(dvar, axis=2), name='eval_qoi', dtype=net._reconstruction._dtype)
    temp = net.tf_projection(temp, nobias=True)
    return temp

# build the training graph with provided training data
net.build_training_graph(ivars, dvar_func)

# train the projection
net.train(1000)

# compute new projected variables
net.projection(ivars)

# evaluate the encoder-decoder
net(ivars)

# Save the data to a file
net.write_data_to_file('filename.pkl')

# reload projection data from file
net2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')
- Parameters
projection_weights – array of the projection matrix weights
partition_centers – array size (number of partitions) x (number of ivar inputs) for partition locations
partition_shapes – array size (number of partitions) x (number of ivar inputs) for partition shapes influencing the RBF widths
basis_type – string (
'constant'
,'linear'
, or'quadratic'
) for the degree of polynomial basisprojection_biases – (optional, default None) array of the biases (offsets) corresponding to the projection weights, if
None
the projections are offset by zerosbasis_coeffs – (optional, default
None
) if the array of polynomial basis coefficients is known, it may be provided here, otherwise it will be initialized withbuild_training_graph
and trained withtrain
dtype – (optional, default
'float64'
) string specifying either float type'float64'
or'float32'
Attributes:
projection_weights - (read only) array of the current projection weights
projection_biases - (read only) array of the projection biases
reconstruction_model - (read only) the current POUnet decoder
partition_centers - (read only) array of the current partition centers
partition_shapes - (read only) array of the current partition shape parameters
basis_type - (read only) string relaying the basis degree
basis_coeffs - (read only) array of the current basis coefficients
proj_ivar_center - (read only) array of the centering parameters used in the POUnet for the projected ivar inputs
proj_ivar_scale - (read only) array of the scaling parameters used in the POUnet for the projected ivar inputs
dtype - (read only) string relaying the data type (
'float64'
or'float32'
)training_archive - (read only) dictionary of the errors and POUnet states archived during training
iterations - (read only) array of the iterations archived during training
QoIAwareProjectionPOUnet.projection
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.projection(self, ivars, nobias=False)#
Projects the independent variable inputs using the current projection weights and biases
- Parameters
ivars – array of independent variable query points
nobias – (optional, default False) whether or not to apply the projection bias. Analogous to
nocenter
in the PCAtransform
function.
- Returns
array of the projected independent variable query points
QoIAwareProjectionPOUnet.tf_projection
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.tf_projection(self, y, nobias=False)#
Version of
projection
using TensorFlow operations and Tensors.
QoIAwareProjectionPOUnet.update_lr
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.update_lr(self, lr)#
Updates the learning rate for training.
- Parameters
lr – float for the learning rate
QoIAwareProjectionPOUnet.update_l2reg
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.update_l2reg(self, l2reg)#
Updates the least-squares regularization for training.
- Parameters
l2reg – float for the least-squares regularization
QoIAwareProjectionPOUnet.build_training_graph
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.build_training_graph(self, ivars, dvars_function, error_type='abs', constrain_positivity=False, first_trainable_idx=0)#
Constructs the graph used during training (including defining the training errors) with the provided training data.
- Parameters
ivars – array of independent variables for training
dvars_function – function (using tensorflow operations) for defining the dependent variable(s) for training. This must take a single argument of the projection weights which, if used, will be evaluated with the weights as they are updated
error_type – (optional, default
'abs'
) the type of training error: relative'rel'
or absolute'abs'
constrain_positivity – (optional, default False) when True, it penalizes the training error with \(f - |f|\) for dependent variables \(f\). This can be useful for defining projected source term dependent variables, for example.
first_trainable_idx – (optional, default 0) This separates the trainable projection weights (with index greater than or equal to
first_trainable_idx
) from the nontrainable projection weights.
QoIAwareProjectionPOUnet.train
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.train(self, iterations, archive_rate=100, use_best_archive_sse=True, verbose=False)#
Performs training using a least-squares gradient descent block coordinate descent strategy. This alternates between updating the partition and projection parameters with gradient descent and updating the basis coefficients with least-squares.
- Parameters
iterations – integer for number of training iterations to perform
archive_rate – (optional, default 100) the rate at which the errors and parameters are archived during training. These can be accessed with the
training_archive
attributeuse_best_archive_sse – (optional, default True) when True will set the POUnet parameters to those with the lowest error observed during training, otherwise the parameters from the last iteration are used
verbose – (optional, default False) when True will print progress
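As a sketch of how the archived quantities could be inspected afterwards (reusing the net object from the class example above; the iteration count and archive rate are arbitrary):
# Train with archiving every 50 iterations:
net.train(500, archive_rate=50, use_best_archive_sse=True, verbose=False)

# Inspect the archived errors/states and the corresponding iteration numbers:
archive = net.training_archive
archived_iterations = net.iterations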
QoIAwareProjectionPOUnet.__call__
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.__call__(self, xeval)#
Evaluates the encoder-decoder.
- Parameters
xeval – array of independent variable query points
- Returns
array of predictions
QoIAwareProjectionPOUnet.write_data_to_file
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.write_data_to_file(self, filename)#
Save class data to a specified file using pickle. This does not include the archived data from training, which can be separately accessed with training_archive and saved outside of
QoIAwareProjectionPOUnet
.- Parameters
filename – string
QoIAwareProjectionPOUnet.load_data_from_file
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.load_data_from_file(filename)#
Load data from a specified
filename
with pickle (followingwrite_data_to_file
)- Parameters
filename – string
- Returns
dictionary of the encoder-decoder data
QoIAwareProjectionPOUnet.load_from_file
#
- PCAfold.utilities.QoIAwareProjectionPOUnet.load_from_file(filename)#
Load class from a specified
filename
with pickle (followingwrite_data_to_file
)- Parameters
filename – string
- Returns
QoIAwareProjectionPOUnet
Bibliography#
- AAS21
Elizabeth Armstrong and James C. Sutherland. A technique for characterising feature size and quality of manifolds. Combustion Theory and Modelling, 0(0):1–23, 2021. doi:10.1080/13647830.2021.1931715.
- AZASP22
Kamila Zdybał, Elizabeth Armstrong, James C. Sutherland, and Alessandro Parente. Cost function for low-dimensional manifold topology assessment. Scientific Reports, 12:14496, 2022. URL: https://www.nature.com/articles/s41598-022-18655-1, doi:https://doi.org/10.1038/s41598-022-18655-1.
- UZPS23
Kamila Zdybał, Alessandro Parente, and James C. Sutherland. Improving reduced-order models through nonlinear decoding of projection-dependent model outputs. Article in preparation for PNAS, 2023.
- UZSP22(1,2,3,4)
Kamila Zdybał, James C. Sutherland, and Alessandro Parente. Manifold-informed state vector subset for reduced-order modeling. Proceedings of the Combustion Institute, 2022. URL: https://www.sciencedirect.com/science/article/pii/S1540748922000153, doi:https://doi.org/10.1016/j.proci.2022.06.019.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Preprocessing#
In this tutorial, we present data manipulation functionalities of the preprocess
module. To import the module:
from PCAfold import preprocess
Centering, scaling and constant variable removal#
We begin by generating a dummy data set:
import numpy as np
X = np.random.rand(100,20)
Several popular scaling options have been implemented such as Auto (std), Range,
VAST or Pareto. Centering and scaling of data sets can be performed using
preprocess.center_scale
function:
(X_cs, X_center, X_scale) = preprocess.center_scale(X, 'range', nocenter=False)
To invert the centering and scaling using the current centers and scales
preprocess.invert_center_scale
function can be used:
X = preprocess.invert_center_scale(X_cs, X_center, X_scale)
If constant variables are present in the data set, they can be removed using
preprocess.remove_constant_vars
function, which can be a useful pre-processing step
before PCA is applied to a data set. If an artificial constant column is injected:
X[:,5] = np.ones((100,))
it can be removed by:
(X_removed, idx_removed, idx_retained) = preprocess.remove_constant_vars(X)
In addition to that, an object of the PreProcessing
class can be created and
used to store the combination of the above pre-processing steps:
preprocessed = preprocess.PreProcessing(X, 'range', nocenter=False)
The centered and scaled data set can then be accessed as a class attribute:
preprocessed.X_cs
as well as centers and scales:
preprocessed.X_center
preprocessed.X_scale
Conditional statistics#
In this section, we demonstrate how conditional statistics can be computed and plotted for the original data set. A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and the chemical mechanism by Hawkes et al. [CHSSC07], is used as a demo data set. We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding mixture fraction observations, \(Z\), that will serve as the conditioning variable:
X = np.genfromtxt('data-state-space.csv', delimiter=',')
Z = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')
First, we create an object of the ConditionalStatistics
class. We condition the entire data set \(\mathbf{X}\) using the mixture fraction as the conditioning variable. We compute the conditional statistics in 20 bins of the conditioning variable:
cond = preprocess.ConditionalStatistics(X, Z, k=20)
We can then retrieve the centroids of the bins for which the conditional statistics have been computed:
cond.centroids
and retrieve different conditional statistics. For instance, the conditional mean can be accessed through:
conditional_mean = cond.conditional_mean
The conditional statistics can also be plotted using a dedicated function:
plt = preprocess.plot_conditional_statistics(X[:,0], Z, k=20, x_label='Mixture fraction [-]', y_label='$T$ [K]', color='#c0c0c0', statistics_to_plot=['mean', 'max', 'min'], figure_size=(10,4), save_filename=save_filename)
Note that the original data set plotted in the background can be colored using any vector variable:
plt = preprocess.plot_conditional_statistics(X[:,0], Z, k=20, statistics_to_plot=['mean', 'max', 'min'], x_label='Mixture fraction [-]', y_label='$T$ [K]', color=X[:,2], color_map='inferno', colorbar_label='$Y_{O_2}$ [-]', figure_size=(12.5,4), save_filename=save_filename)
Multivariate outlier detection#
We first generate a synthetic data set with artificially appended outliers. This data set, with outliers visible as a cloud in the top right corner, can be seen below:
We will first detect outliers with 'MULTIVARIATE TRIMMING'
method and we
will demonstrate the effect of setting two levels of trimming_threshold
.
We first set trimming_threshold=0.6
:
(idx_outliers_removed, idx_outliers) = preprocess.outlier_detection(X, scaling='auto', detection_method='MULTIVARIATE TRIMMING', trimming_threshold=0.6, n_iterations=0, verbose=True)
With verbose=True
we will see some more information on outliers detected:
Number of observations classified as outliers: 20
We can visualize the observations that were classified as outliers using the
preprocess.plot_2d_clustering
, assuming that cluster \(k_0\) (blue) contains the
observations with outliers removed and cluster \(k_1\) (red) contains the detected outliers.
We first create a dummy idx_new
vector of cluster classifications based on
idx_outliers
obtained. This can for instance be done in the following way:
idx_new = np.zeros((n_observations,))
for i in range(0, n_observations):
    if i in idx_outliers:
        idx_new[i] = 1
where n_observations
is the total number of observations in the data set.
The result of this detection can be seen below:
We then set the trimming_threshold=0.3
which will capture outliers earlier (at smaller
Mahalanobis distances from the variables’ centroids).
(idx_outliers_removed, idx_outliers) = preprocess.outlier_detection(X, scaling='auto', detection_method='MULTIVARIATE TRIMMING', trimming_threshold=0.3, n_iterations=0, verbose=True)
With verbose=True
we will see some more information on outliers detected:
Number of observations classified as outliers: 180
The result of this detection can be seen below:
It can be seen that the algorithm started to pick up outlier observations at the perimeter of the original data set.
Kernel density weighting#
In this tutorial we reproduce results on a synthetic data set from the following paper:
Coussement, A., Gicquel, O., & Parente, A. (2012). Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9), 2844-2855.
We begin by generating the synthetic data set that has two distinct clouds with many observations and an intermediate region with few observations:
from PCAfold import KernelDensity
from PCAfold import PCA
from PCAfold import reduction
import numpy as np
n_observations = 2021
x1 = np.zeros((n_observations,1))
x2 = np.zeros((n_observations,1))
for i in range(0,n_observations):
    R = np.random.rand()
    if i <= 999:
        x1[i] = -1 + 20*R
        x2[i] = 5*x1[i] + 100*R
    if i >= 1000 and i <= 1020:
        x1[i] = 420 + 8*(i+1 - 1001)
        x2[i] = 5000/200 * (x1[i] - 400) + 500*R
    if i >= 1021 and i <= 2020:
        x1[i] = 1000 + 20*R
        x2[i] = 5*x1[i] + 100*R
X = np.hstack((x1, x2))
This data set can be seen below:
We perform PCA on the data set and approximate it with a single principal component:
pca = PCA(X, scaling='auto', n_components=1)
PCs = pca.transform(X)
X_rec = pca.reconstruct(PCs)
Using the reduction.plot_parity
function we can visualize how each variable is reconstructed:
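A sketch of such a parity plot call is given below for the first variable; the argument names are assumed by analogy with the other plotting functions used in these tutorials (x_label, y_label, figure_size, save_filename) and may differ slightly from the actual signature:
# Parity plot of the original versus the PCA-reconstructed first variable:
plt = reduction.plot_parity(X[:,0], X_rec[:,0], x_label='Observed $x_1$', y_label='Reconstructed $x_1$', figure_size=(5,5), save_filename=None)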
We thus note that PCA reconstructs well the two regions with many observations, while the intermediate region is not reconstructed well.
Single-variable case#
We will weight the data set using the kernel density weighting method in order to
give more importance to the intermediate region. Kernel density weighting
can be performed by instantiating an object of the KernelDensity
class.
As the first argument we pass the entire centered and scaled data set, and as the
second argument we specify the conditioning variable based on
which the weighting will be computed:
kernd_single = KernelDensity(pca.X_cs, pca.X_cs[:,0], verbose=True)
With verbose=True
we will see which case is being run:
Single-variable case will be applied.
In general, whenever the conditioning variable is a single vector, the single-variable case is used.
We then obtain the weighted data set:
X_weighted_single = kernd_single.X_weighted
Weights \(\mathbf{W_c}\) used to scale the data set can be accessed as well:
weights_single = kernd_single.weights
We perform PCA on the weighted data set and we project the centered and scaled
original data set onto the basis identified on X_weighted_single
:
pca_single = PCA(X_weighted_single, 'none', n_components=1, nocenter=True)
PCs_single = pca_single.transform(pca.X_cs)
Reconstruction of that data set can be obtained:
X_rec_single = pca_single.reconstruct(PCs_single)
X_rec_single = (X_rec_single * pca.X_scale) + pca.X_center
We can now use reduction.plot_parity
function to visualize the new reconstruction:
We note that this time the intermediate region is better represented in the PCA reconstruction.
Multi-variable case#
In a similar way, the multi-variable case can be used by passing the entire two-dimensional data set as the conditioning variable:
kernd_multi = KernelDensity(pca.X_cs, pca.X_cs, verbose=True)
We then perform analogous steps to obtain the new reconstruction:
X_weighted_multi = kernd_multi.X_weighted
weights_multi = kernd_multi.weights
pca_multi = PCA(X_weighted_multi, 'none', n_components=1)
PCs_multi = pca_multi.transform(pca.X_cs)
X_rec_multi = pca_multi.reconstruct(PCs_multi)
X_rec_multi = (X_rec_multi * pca.X_scale) + pca.X_center
The result of this reconstruction can be seen below:
Bibliography#
- CHan20
Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.
- CHSSC07
Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Data clustering#
In this tutorial, we present the clustering functionalities from the preprocess
module.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.cluster import KMeans
and we set some initial parameters:
x_label = '$x$'
y_label = '$y$'
z_label = '$z$'
figure_size = (6,3)
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
save_filename = None
random_seed = 200
Visualize the clustering result in 2D#
We begin by demonstrating how the result of clustering can be visualized using the
plotting functionalities from the preprocess
module.
We generate a synthetic 2D data set composed of two distinct clouds:
np.random.seed(seed=random_seed)
n_observations = 1000
mean_1 = [0,1]
mean_2 = [6,4]
covariance_1 = [[2, 0.5], [0.5, 0.5]]
covariance_2 = [[3, 0.3], [0.3, 0.5]]
x_1, y_1 = np.random.multivariate_normal(mean_1, covariance_1, n_observations).T
x_2, y_2 = np.random.multivariate_normal(mean_2, covariance_2, n_observations).T
x = np.concatenate([x_1, x_2])
y = np.concatenate([y_1, y_2])
The original data set can be visualized using the function from the reduction
module:
plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)
We divide the data into two clusters using the K-Means algorithm:
idx_kmeans = KMeans(n_clusters=2).fit(np.column_stack((x, y))).labels_
As soon as the idx
vector of cluster classification is known for the data set,
the result of clustering can be visualized using the plot_2d_clustering
function.
We plot the result of K-Means clustering on the 2D data set:
plt = preprocess.plot_2d_clustering(x, y, idx_kmeans, x_label=x_label, y_label=y_label, color_map=color_map, first_cluster_index_zero=False, figure_size=figure_size, save_filename=None)
Note that the numbers in the legend next to each cluster number represent the number of samples in a particular cluster. The populations of each cluster can also be computed and printed, for instance through:
print(preprocess.get_populations(idx_kmeans))
which in this case will print:
[991, 1009]
Visualize the clustering result in 3D#
Clustering result can also be visualized in a three-dimensional space. In this example, we generate a synthetic 3D data set composed of three connected planes:
n_observations = 50
x = np.tile(np.linspace(0,50,n_observations), n_observations)
y = np.zeros((n_observations,1))
z = np.zeros((n_observations*n_observations,1))
for i in range(1,n_observations):
    y = np.vstack((y, np.ones((n_observations,1))*i))
y = y.ravel()
for observation, x_value in enumerate(x):
    y_value = y[observation]
    if x_value <= 10:
        z[observation] = 2 * x_value + y_value
    elif x_value > 10 and x_value <= 35:
        z[observation] = 10 * x_value + y_value - 80
    elif x_value > 35:
        z[observation] = 5 * x_value + y_value + 95
(x, _, _) = preprocess.center_scale(x[:,None], scaling='0to1')
(y, _, _) = preprocess.center_scale(y[:,None], scaling='0to1')
(z, _, _) = preprocess.center_scale(z, scaling='0to1')
The original data set can be visualized using the function from the reduction
module:
plt = reduction.plot_3d_manifold(x, y, z, elev=30, azim=-100, x_label=x_label, y_label=y_label, z_label=z_label, figure_size=(12,8), save_filename=None)
We divide the data into four clusters using the K-Means algorithm:
idx_kmeans = KMeans(n_clusters=4).fit(np.hstack((x, y, z))).labels_
The result of K-Means clustering can then be plotted in 3D:
plt = preprocess.plot_3d_clustering(x, y, z, idx_kmeans, elev=30, azim=-100, x_label=x_label, y_label=y_label, z_label=z_label, color_map=color_map, first_cluster_index_zero=False, figure_size=(12,8), save_filename=None)
Clustering based on binning a single variable#
In this section, we demonstrate a few clustering functions that are implemented in PCAfold. All of them cluster data sets based on binning a single variable.
First, we generate a synthetic two-dimensional data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
The data set can be visualized using the function from the reduction
module:
plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)
We will now cluster the 2D data set according to bins of a single variable, \(x\).
Cluster into equal variable bins#
This clustering will divide the data set based on equal bins of a variable vector.
(idx_variable_bins, borders_variable_bins) = preprocess.variable_bins(x, 4, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[-1.0, -0.5, 0.0, 0.5, 1.0]
Bounds for cluster 0:
-1.0, -0.5152
Bounds for cluster 1:
-0.4949, -0.0101
Bounds for cluster 2:
0.0101, 0.4949
Bounds for cluster 3:
0.5152, 1.0
The result of clustering can be plotted in 2D:
plt = preprocess.plot_2d_clustering(x, y, idx_variable_bins, x_label=x_label, y_label=y_label, color_map=color_map, first_cluster_index_zero=False, grid_on=True, figure_size=figure_size, save_filename=None)
The visual result of this clustering can be seen below:
Note that this clustering function created four equal bins in the space of \(x\). In this case, since \(x\) ranges from -1 to 1, the bins are created as intervals of length 0.5 in the \(x\)-space.
Cluster into pre-defined variable bins#
This clustering will divide the data set into bins of a one-dimensional variable vector whose borders are specified by the user. Let’s specify the split values as split_values = [-0.6, 0.4, 0.8]
:
split_values = [-0.6, 0.4, 0.8]
(idx_predefined_variable_bins, borders_predefined_variable_bins) = preprocess.predefined_variable_bins(x, split_values, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[-1.0, -0.6, 0.4, 0.8, 1.0]
Bounds for cluster 0:
-1.0, -0.6162
Bounds for cluster 1:
-0.596, 0.3939
Bounds for cluster 2:
0.4141, 0.798
Bounds for cluster 3:
0.8182, 1.0
The visual result of this clustering can be seen below:
This clustering function created four bins in the space of \(x\), where the splits in the \(x\)-space are located at \(x=-0.6\), \(x=0.4\) and \(x=0.8\).
Cluster into zero-neighborhood variable bins#
This partitioning relies on an unbalanced variable vector which, in principle,
is assumed to have many observations whose values are close to zero and
relatively few observations with values away from zero.
This function can be used to separate close-to-zero observations into one
cluster (split_at_zero=False
) or two clusters (split_at_zero=True
).
Without splitting at zero, split_at_zero=False
#
(idx_zero_neighborhood_bins, borders_zero_neighborhood_bins) = preprocess.zero_neighborhood_bins(x, 3, zero_offset_percentage=10, split_at_zero=False, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[-1. -0.2 0.2 1. ]
Bounds for cluster 0:
-1.0, -0.2121
Bounds for cluster 1:
-0.1919, 0.1919
Bounds for cluster 2:
0.2121, 1.0
The visual result of this clustering can be seen below:
We note that the observations corresponding to \(x \approx 0\) have been classified into one cluster (\(k_2\)).
With splitting at zero, split_at_zero=True
#
(idx_zero_neighborhood_bins_split_at_zero, borders_zero_neighborhood_bins_split_at_zero) = preprocess.zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[-1. -0.2 0. 0.2 1. ]
Bounds for cluster 0:
-1.0, -0.2121
Bounds for cluster 1:
-0.1919, -0.0101
Bounds for cluster 2:
0.0101, 0.1919
Bounds for cluster 3:
0.2121, 1.0
The visual result of this clustering can be seen below:
We note that the observations corresponding to \(x \approx 0^{-}\) have been classified into one cluster (\(k_2\)) and the observations corresponding to \(x \approx 0^{+}\) have been classified into another cluster (\(k_3\)).
Clustering combustion data sets#
In this section, we present functions that are specifically aimed at clustering reactive flow data sets. We will use a data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and the chemical mechanism by Hawkes et al. [CHSSC07].
We import the flamelet data set:
X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
mixture_fraction = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')
Cluster into bins of the mixture fraction vector#
In this example, we partition the data set into five bins of the mixture fraction vector.
This is a feasible clustering strategy for non-premixed flames which takes advantage
of the physics-based (supervised) partitioning of the data set based on local stoichiometry.
The partitioning function requires specifying the value for
the stoichiometric mixture fraction, \(Z_{st}\) (Z_stoich
). Note that the first split in the data set is
performed at \(Z_{st}\) and further splits are performed automatically on
the fuel-lean and the fuel-rich branch.
Z_stoich = 0.273
(idx_mixture_fraction_bins, borders_mixture_fraction_bins) = preprocess.mixture_fraction_bins(mixture_fraction, 5, Z_stoich, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[0. 0.1365 0.273 0.51533333 0.75766667 1. ]
Bounds for cluster 0:
0.0, 0.1313
Bounds for cluster 1:
0.1414, 0.2727
Bounds for cluster 2:
0.2828, 0.5152
Bounds for cluster 3:
0.5253, 0.7576
Bounds for cluster 4:
0.7677, 1.0
The visual result of this clustering can be seen below:
It can be seen that the data set is divided at the stoichiometric value of mixture fraction, in this case \(Z_{st} \approx 0.273\). The fuel-lean branch (the part of the flamelet to the left of \(Z_{st}\)) is divided into two clusters (\(k_1\) and \(k_2\)) and the fuel-rich branch (the part of the flamelet to the right of \(Z_{st}\)) is divided into three clusters (\(k_3\), \(k_4\) and \(k_5\)), since this branch has a longer range in the mixture fraction space.
Separating close-to-zero principal component source terms#
The function zero_neighborhood_bins
can be used to separate close-to-zero
source terms of the original variables (or close-to-zero source terms of the principal components (PCs)).
The zero source terms physically correspond to the steady-state.
We first compute the source terms of the principal components by transforming the source terms of the original variables to the new PC-basis:
pca_X = reduction.PCA(X, scaling='auto', n_components=2)
S_Z = pca_X.transform(S_X, nocenter=True)
and we use the first PC source term, \(S_{Z,1}\), as the conditioning variable for the clustering function:
(idx_close_to_zero_source_terms, borders_close_to_zero_source_terms) = preprocess.zero_neighborhood_bins(S_Z[:,0], 4, zero_offset_percentage=5, split_at_zero=True, verbose=True)
With verbose=True
we will see some detailed information on clustering:
Border values for bins:
[-87229.83051401 -5718.91469641 0. 5718.91469641
27148.46341416]
Bounds for cluster 0:
-87229.8305, -5722.1432
Bounds for cluster 1:
-5717.5228, -0.0
Bounds for cluster 2:
0.0, 5705.7159
Bounds for cluster 3:
5719.0347, 27148.4634
The visual result of this clustering can be seen below:
From the verbose information, we can see that the first cluster (\(k_1\)) contains observations corresponding to the highly negative values of \(S_{Z,1}\), the second cluster (\(k_2\)) to the close-to-zero but negative values of \(S_{Z,1}\), the third cluster (\(k_3\)) to the close-to-zero but positive values of \(S_{Z,1}\) and the fourth cluster (\(k_4\)) to the highly positive values of \(S_{Z,1}\).
We can further merge the two clusters that contain observations corresponding to the high magnitudes
of \(S_{Z, 1}\) into one cluster. This can be achieved using the function
flip_clusters
. We change the label of the fourth cluster to 0
and thus all
observations from the fourth cluster are now assigned to the first cluster.
idx_merged = preprocess.flip_clusters(idx_close_to_zero_source_terms, {3:0})
The visual result of this merged clustering can be seen below:
If we further plot the two-dimensional flamelet manifold, colored by \(S_{Z, 1}\), we can check that the clustering technique correctly identified the regions on the manifold where \(S_{Z, 1} \approx 0\) as well as the regions where \(S_{Z, 1}\) has high positive or high negative magnitudes.
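A sketch of such a plot is given below; it assumes that reduction.plot_2d_manifold accepts a color argument and a colorbar_label argument analogous to preprocess.plot_conditional_statistics used earlier in these tutorials:
# Two-dimensional PCA manifold colored by the first PC source term:
Z = pca_X.transform(X)
plt = reduction.plot_2d_manifold(Z[:,0], Z[:,1], color=S_Z[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$S_{Z,1}$', figure_size=(6,5), save_filename=None)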
Bibliography#
- CHan20
Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.
- CHSSC07
Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Data sampling#
In this tutorial, we present how train and test samples can be selected using the
sampling functionalities of the preprocess
module. In general, train and test
samples will always be some subset of the entire data set X
:
We import the necessary modules:
from PCAfold import DataSampler
from PCAfold import preprocess
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
and we set some initial parameters:
save_filename = None
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
first_cluster = False
figure_size = (5,5)
random_seed = 200
np.random.seed(seed=random_seed)
We generate a synthetic data set composed of four distinct clusters with an imbalanced number of observations (100, 250, 400 and 500 observations respectively, 1250 in total):
N_1, N_2, N_3, N_4 = 100, 250, 400, 500
n_observations = N_1 + N_2 + N_3 + N_4
mean_k1, mean_k2, mean_k3, mean_k4 = [-3, 3], [3, 3], [-3, -3], [3, -3]
covariance = [[1, 0.2], [0.2, 1]]
x_k1, y_k1 = np.random.multivariate_normal(mean_k1, covariance, N_1).T
x_k2, y_k2 = np.random.multivariate_normal(mean_k2, covariance, N_2).T
x_k3, y_k3 = np.random.multivariate_normal(mean_k3, covariance, N_3).T
x_k4, y_k4 = np.random.multivariate_normal(mean_k4, covariance, N_4).T
x = np.vstack((x_k1[:,np.newaxis], x_k2[:,np.newaxis], x_k3[:,np.newaxis], x_k4[:,np.newaxis]))
y = np.vstack((y_k1[:,np.newaxis], y_k2[:,np.newaxis], y_k3[:,np.newaxis], y_k4[:,np.newaxis]))
idx = np.vstack((np.zeros((N_1, 1)), np.ones((N_2, 1)), 2*np.ones((N_3, 1)), 3*np.ones((N_4, 1)))).astype(int).ravel()
populations = preprocess.get_populations(idx)
We visualize the original data set:
The only information about the original data set that will be needed is
the vector idx
of cluster classifications.
Note
Note that
idx_train
and idx_test
, which are the outputs of the sampling functions in this
module, have a different interpretation than idx
. They are vectors containing
observation indices, not cluster classifications.
For instance, if train samples are composed of the first, second and tenth
observation then idx_train=[0,1,9]
.
You can find which cluster each observation in idx_train
(or idx_test
)
belongs to, for instance through:
idx[idx_train,]
idx[idx_test,]
You can also extract a subset of idx_train
that are only the indices belonging to a
particular cluster.
For instance, for the first cluster you can extract them by:
train_indices_in_cluster_1 = [i for i in idx_train if idx[i,]==0]
for the second cluster:
train_indices_in_cluster_2 = [i for i in idx_train if idx[i,]==1]
and so on.
We start by initializing an object of the DataSampler
class. For the moment,
we will set the parameter idx_test
to None, but we will demonstrate
an example for setting that parameter to something else later. Note that we can
set a fixed random seed if we want the sampling results to be reproducible. With
verbose=True
, we will additionally see some detailed information about the current
sampling.
sample = DataSampler(idx, idx_test=None, random_seed=random_seed, verbose=True)
Sample a fixed number#
We first select a fixed number of samples using the DataSampler.number
function. Let’s request 15% of the total data to be the train data. The function
calculates that it needs to select 46 samples from each cluster, which
amounts to 14.7% of the total number of samples in the data set. Whenever the exact percentage
requested by the user cannot be achieved, the function always under-samples.
Select test data with test_selection_option=1
#
There are always two ways in which the complementary test data can be selected.
They can be selected using the test_selection_option
parameter.
We start with test_selection_option=1
, which selects all remaining
observations as the test data:
(idx_train, idx_test) = sample.number(15, test_selection_option=1)
Setting verbose=True
lets us see some detailed information on sampling:
Cluster 0: taking 46 train samples out of 100 observations (46.0%).
Cluster 1: taking 46 train samples out of 250 observations (18.4%).
Cluster 2: taking 46 train samples out of 400 observations (11.5%).
Cluster 3: taking 46 train samples out of 500 observations (9.2%).
Cluster 0: taking 54 test samples out of 54 remaining observations (100.0%).
Cluster 1: taking 204 test samples out of 204 remaining observations (100.0%).
Cluster 2: taking 354 test samples out of 354 remaining observations (100.0%).
Cluster 3: taking 454 test samples out of 454 remaining observations (100.0%).
Selected 184 train samples (14.7%) and 1066 test samples (85.3%).
A dedicated plotting function from the preprocess
module can be used to visualize
the train and test samples. This function takes as inputs the obtained idx_train
and idx_test
vectors. Note that a custom colormap can be specified by the user.
plt = preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, color_map=color_map, first_cluster_index_zero=False, figure_size=(10,5), save_filename=None)
The visual result of this sampling can be seen below:
Select test data with test_selection_option=2
#
We then set test_selection_option=2
which selects a fixed number of
test samples from each cluster, calculated based on the smallest cluster. This
amounts to 54 test samples from each cluster.
(idx_train, idx_test) = sample.number(15, test_selection_option=2)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 46 train samples out of 100 observations (46.0%).
Cluster 1: taking 46 train samples out of 250 observations (18.4%).
Cluster 2: taking 46 train samples out of 400 observations (11.5%).
Cluster 3: taking 46 train samples out of 500 observations (9.2%).
Cluster 0: taking 54 test samples out of 54 remaining observations (100.0%).
Cluster 1: taking 54 test samples out of 204 remaining observations (26.5%).
Cluster 2: taking 54 test samples out of 354 remaining observations (15.3%).
Cluster 3: taking 54 test samples out of 454 remaining observations (11.9%).
Selected 184 train samples (14.7%) and 216 test samples (17.3%).
The visual result of this sampling can be seen below:
Sample a fixed percentage#
Next, we select a percentage of samples from each cluster using the
DataSampler.percentage
function. Let’s request 10% of the total data to be the train
data - the function selects 10% of samples from each cluster.
Select test data with test_selection_option=1
#
We start with test_selection_option=1
, which selects all remaining
observations as the test data:
(idx_train, idx_test) = sample.percentage(10, test_selection_option=1)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 10 train samples out of 100 observations (10.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 50 train samples out of 500 observations (10.0%).
Cluster 0: taking 90 test samples out of 90 remaining observations (100.0%).
Cluster 1: taking 225 test samples out of 225 remaining observations (100.0%).
Cluster 2: taking 360 test samples out of 360 remaining observations (100.0%).
Cluster 3: taking 450 test samples out of 450 remaining observations (100.0%).
Selected 125 train samples (10.0%) and 1125 test samples (90.0%).
The visual result of this sampling can be seen below:
Select test data with test_selection_option=2
#
We then set test_selection_option=2
which uses the same procedure
to select the test data as was used to select the train data. In this case,
it also selects 10% of samples from each cluster as the test samples.
(idx_train, idx_test) = sample.percentage(10, test_selection_option=2)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 10 train samples out of 100 observations (10.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 50 train samples out of 500 observations (10.0%).
Cluster 0: taking 10 test samples out of 90 remaining observations (11.1%).
Cluster 1: taking 25 test samples out of 225 remaining observations (11.1%).
Cluster 2: taking 40 test samples out of 360 remaining observations (11.1%).
Cluster 3: taking 50 test samples out of 450 remaining observations (11.1%).
Selected 125 train samples (10.0%) and 125 test samples (10.0%).
The visual result of this sampling can be seen below:
Sample manually#
We select samples manually from each cluster using the DataSampler.manual
function.
Select test data with test_selection_option=1
#
We start with test_selection_option=1
which selects all remaining
observations as the test data.
Let’s request 4, 5, 10 and 2 samples from the first, second, third and fourth cluster respectively.
The sampling dictionary will thus have to be:
sampling_dictionary={0:4, 1:5, 2:10, 3:2}
. Note that the function
still selects those samples randomly from each cluster.
We should also change sampling_type
to 'number'
so that samples are
selected on a number and not a percentage basis:
(idx_train, idx_test) = sample.manual({0:4, 1:5, 2:10, 3:2}, sampling_type='number', test_selection_option=1)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 4 train samples out of 100 observations (4.0%).
Cluster 1: taking 5 train samples out of 250 observations (2.0%).
Cluster 2: taking 10 train samples out of 400 observations (2.5%).
Cluster 3: taking 2 train samples out of 500 observations (0.4%).
Cluster 0: taking 96 test samples out of 96 remaining observations (100.0%).
Cluster 1: taking 245 test samples out of 245 remaining observations (100.0%).
Cluster 2: taking 390 test samples out of 390 remaining observations (100.0%).
Cluster 3: taking 498 test samples out of 498 remaining observations (100.0%).
Selected 21 train samples (1.7%) and 1229 test samples (98.3%).
The visual result of this sampling can be seen below:
Select test data with test_selection_option=2
#
We then set test_selection_option=2
which uses the same procedure
to select the test data as was used to select the train data. This time, let’s request
50%, 10%, 10% and 20% from the first, second, third and fourth cluster respectively.
The sampling dictionary will thus have to be:
sampling_dictionary={0:50, 1:10, 2:10, 3:20}
and we should change the
sampling_type
to 'percentage'
:
(idx_train, idx_test) = sample.manual({0:50, 1:10, 2:10, 3:20}, sampling_type='percentage', test_selection_option=2)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 50 train samples out of 100 observations (50.0%).
Cluster 1: taking 25 train samples out of 250 observations (10.0%).
Cluster 2: taking 40 train samples out of 400 observations (10.0%).
Cluster 3: taking 100 train samples out of 500 observations (20.0%).
Cluster 0: taking 50 test samples out of 50 remaining observations (100.0%).
Cluster 1: taking 25 test samples out of 225 remaining observations (11.1%).
Cluster 2: taking 40 test samples out of 360 remaining observations (11.1%).
Cluster 3: taking 100 test samples out of 400 remaining observations (25.0%).
Selected 215 train samples (17.2%) and 215 test samples (17.2%).
The visual result of this sampling can be seen below:
Sample at random#
Finally, we select random samples using the DataSampler.random
function.
Let’s request 10% of the total data to be the train data.
Note
Random sampling will typically give a very similar sample distribution as
percentage sampling. The only difference is that percentage sampling will
maintain the percentage perc
exactly within each cluster, while random sampling
will typically result in some small variations from perc
in each cluster
since it is sampling independently of cluster definitions.
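To see this difference numerically, one can compare the per-cluster fractions of train samples returned by the two approaches. Below is a minimal sketch that assumes the idx vector and the DataSampler object sample defined earlier in this tutorial:
import numpy as np
def per_cluster_fraction(idx, idx_train):
    # fraction of each cluster's observations that ended up in the train data
    return {k: np.sum(idx[idx_train] == k) / np.sum(idx == k) for k in np.unique(idx)}
(idx_train_pct, _) = sample.percentage(10, test_selection_option=1)
(idx_train_rnd, _) = sample.random(10, test_selection_option=1)
print('Percentage sampling:', per_cluster_fraction(idx, idx_train_pct))
print('Random sampling:    ', per_cluster_fraction(idx, idx_train_rnd))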
Select test data with test_selection_option=1
#
We start with test_selection_option=1
which selects all remaining
observations as test data.
(idx_train, idx_test) = sample.random(10, test_selection_option=1)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 14 train samples out of 100 observations (14.0%).
Cluster 1: taking 28 train samples out of 250 observations (11.2%).
Cluster 2: taking 42 train samples out of 400 observations (10.5%).
Cluster 3: taking 41 train samples out of 500 observations (8.2%).
Cluster 0: taking 86 test samples out of 86 remaining observations (100.0%).
Cluster 1: taking 222 test samples out of 222 remaining observations (100.0%).
Cluster 2: taking 358 test samples out of 358 remaining observations (100.0%).
Cluster 3: taking 459 test samples out of 459 remaining observations (100.0%).
Selected 125 train samples (10.0%) and 1125 test samples (90.0%).
The visual result of this sampling can be seen below:
Select test data with test_selection_option=2
#
We then set test_selection_option=2
which uses the same procedure
to select the test data as was used to select the train data. In this case, it will also sample
10% of the total data set as the test data.
(idx_train, idx_test) = sample.random(10, test_selection_option=2)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 14 train samples out of 100 observations (14.0%).
Cluster 1: taking 28 train samples out of 250 observations (11.2%).
Cluster 2: taking 42 train samples out of 400 observations (10.5%).
Cluster 3: taking 41 train samples out of 500 observations (8.2%).
Cluster 0: taking 8 test samples out of 86 remaining observations (9.3%).
Cluster 1: taking 25 test samples out of 222 remaining observations (11.3%).
Cluster 2: taking 29 test samples out of 358 remaining observations (8.1%).
Cluster 3: taking 63 test samples out of 459 remaining observations (13.7%).
Selected 125 train samples (10.0%) and 125 test samples (10.0%).
The visual result of this sampling can be seen below:
Maintaining a fixed test data set#
In this example, we further illustrate how the functionality for maintaining a fixed test data set can be utilized.
Suppose that in every cluster you have a very distinct set of observations on
which you always want to test your model.
You can specify those observations when initializing a DataSampler
object through the use of the idx_test
parameter.
We simulate this situation by appending additional samples to the previously defined data set. We add 20 samples in each cluster - those samples can be seen in the figure below as smaller clouds next to each cluster:
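A minimal sketch of how such a situation could be constructed is shown below. The variable name data_set for the two-dimensional synthetic data set, the offset of the appended clouds and the 20-sample count are illustrative assumptions; the essential outcome is an extended data set together with an idx_test array holding the indices of the appended observations:
import numpy as np
n_appended = 20
appended_points = []
appended_idx = []
for k in np.unique(idx):
    # append a small cloud of 20 extra observations next to each cluster (illustrative offset)
    cluster_center = np.mean(data_set[idx==k,:], axis=0)
    cloud = cluster_center + np.array([2.0, 2.0]) + 0.1*np.random.randn(n_appended, 2)
    appended_points.append(cloud)
    appended_idx.append(np.full((n_appended,), k))
n_original = data_set.shape[0]
data_set = np.vstack([data_set] + appended_points)
idx = np.concatenate([idx] + appended_idx)
# indices of the appended observations - these will form the fixed test data
idx_test = np.arange(n_original, data_set.shape[0])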
Assuming that we know the indices of points that represent the appended clouds, stored in
idx_test
, we can use that array of indices as an input parameter:
sample = DataSampler(idx, idx_test=idx_test, random_seed=random_seed, verbose=True)
Any sampling function now called will maintain those samples as the test data and the
train data will be sampled ignoring the indices in idx_test
.
Note also that if idx_test
is specified, the test_selection_option
parameter is ignored.
We will demonstrate this sampling using the DataSampler.random
function, but
any other sampling function that we demonstrated earlier can be used as well.
(idx_train, idx_test) = sample.random(80, test_selection_option=2)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 86 train samples out of 120 observations (71.7%).
Cluster 1: taking 211 train samples out of 270 observations (78.1%).
Cluster 2: taking 347 train samples out of 420 observations (82.6%).
Cluster 3: taking 420 train samples out of 520 observations (80.8%).
Cluster 0: taking 20 test samples out of 34 remaining observations (58.8%).
Cluster 1: taking 20 test samples out of 59 remaining observations (33.9%).
Cluster 2: taking 20 test samples out of 73 remaining observations (27.4%).
Cluster 3: taking 20 test samples out of 100 remaining observations (20.0%).
Selected 1064 train samples (80.0%) and 80 test samples (6.0%).
The visual result of this sampling can be seen below:
Chaining sampling functions#
Finally, we discuss an interesting use case for chaining two sampling functions, where the train samples obtained from one sampling become the fixed test data for another sampling.
Suppose that our target is to have a fixed test data set composed of:
10 samples from the first cluster
20 samples from the second cluster
10 samples from the third cluster
50 samples from the fourth cluster
and, at the same time, select a fixed number of train samples from each cluster.
We can start with generating the desired test samples using the
DataSampler.manual
function. We capture the returned train samples, which will later serve as the fixed test data:
sample = DataSampler(idx, random_seed=random_seed, verbose=True)
(idx_test, _) = sample.manual({0:10, 1:20, 2:10, 3:50}, sampling_type='number', test_selection_option=1)
Now we feed the obtained test set as a fixed test set for the target sampling:
sample.idx_test = idx_test
(idx_train, idx_test) = sample.number(19.5, test_selection_option=1)
With verbose=True
we will see some detailed information on sampling:
Cluster 0: taking 60 train samples out of 100 observations (60.0%).
Cluster 1: taking 60 train samples out of 250 observations (24.0%).
Cluster 2: taking 60 train samples out of 400 observations (15.0%).
Cluster 3: taking 60 train samples out of 500 observations (12.0%).
Cluster 0: taking 10 test samples out of 40 remaining observations (25.0%).
Cluster 1: taking 20 test samples out of 190 remaining observations (10.5%).
Cluster 2: taking 10 test samples out of 340 remaining observations (2.9%).
Cluster 3: taking 50 test samples out of 440 remaining observations (11.4%).
Selected 240 train samples (19.2%) and 90 test samples (7.2%).
The visual result of this sampling can be seen below:
Notice that we have achieved our goal: we generated the desired test data set with 10, 20, 10 and 50 samples per cluster, and we also selected an equal number of train samples from each cluster - in this case, 60 samples.
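As a quick sanity check, the per-cluster counts can be verified directly from the returned index arrays (a small numpy sketch):
import numpy as np
train_counts = {k: int(np.sum(idx[idx_train] == k)) for k in np.unique(idx)}
test_counts = {k: int(np.sum(idx[idx_test] == k)) for k in np.unique(idx)}
# Expected: 60 train samples in every cluster and 10, 20, 10, 50 test samples
print('Train samples per cluster:', train_counts)
print('Test samples per cluster: ', test_counts)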
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Global and local PCA#
In this tutorial, we present how global and local PCA can be performed on a
synthetic data set using the reduction
module.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import PCA, LPCA
import matplotlib.pyplot as plt
from matplotlib import gridspec
from matplotlib.colors import ListedColormap
import numpy as np
and we set some initial parameters:
n_points = 1000
save_filename = None
global_color = '#454545'
k1_color = '#0e7da7'
k2_color = '#ceca70'
color_map = ListedColormap([k1_color, k2_color])
Generate a synthetic data set for global PCA#
We generate a synthetic data set on which the global PCA will be performed. This data set is composed of a single cloud of points.
mean_global = [0,1]
covariance_global = [[3.4, 1.1], [1.1, 2.1]]
x_noise, y_noise = np.random.multivariate_normal(mean_global, covariance_global, n_points).T
y_global = np.linspace(0,4,n_points)
x_global = -(y_global**2) + 7*y_global + 4
y_global = y_global + y_noise
x_global = x_global + x_noise
Dataset_global = np.hstack((x_global[:,np.newaxis], y_global[:,np.newaxis]))
This data set can be seen below:
Global PCA#
We perform global PCA to obtain global principal components, global eigenvectors and global eigenvalues:
pca = PCA(Dataset_global, 'none', n_components=2)
principal_components_global = pca.transform(Dataset_global, nocenter=False)
eigenvectors_global = pca.A
eigenvalues_global = pca.L
We also retrieve the centered and scaled data set:
Dataset_global_pp = pca.X_cs
Generate a synthetic data set for local PCA#
Similarly, we generate another synthetic data set that is composed of two distinct clouds of points.
mean_local_1 = [0,1]
mean_local_2 = [6,4]
covariance_local_1 = [[2, 0.5], [0.5, 0.5]]
covariance_local_2 = [[3, 0.3], [0.3, 0.5]]
x_noise_1, y_noise_1 = np.random.multivariate_normal(mean_local_1, covariance_local_1, n_points).T
x_noise_2, y_noise_2 = np.random.multivariate_normal(mean_local_2, covariance_local_2, n_points).T
x_local = np.concatenate([x_noise_1, x_noise_2])
y_local = np.concatenate([y_noise_1, y_noise_2])
Dataset_local = np.hstack((x_local[:,np.newaxis], y_local[:,np.newaxis]))
This data set can be seen below:
Cluster the data set for local PCA#
We perform clustering of this data set based on pre-defined bins using the available
preprocess.predefined_variable_bins
function.
We obtain cluster classifications and centroids for each cluster:
(idx, borders) = preprocess.predefined_variable_bins(Dataset_local[:,0], [2.5], verbose=False)
centroids = preprocess.get_centroids(Dataset_local, idx)
The result of this clustering can be seen below:
In local PCA, PCA is applied in each cluster separately.
Local PCA#
We perform local PCA to obtain local principal components, local eigenvectors and local eigenvalues:
lpca = LPCA(Dataset_local, idx, scaling='none')
principal_components_local = lpca.principal_components
eigenvectors_local = lpca.A
eigenvalues_local = lpca.L
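Conceptually, local PCA is equivalent to looping over the clusters and running a separate PCA in each one. The sketch below illustrates the idea using the PCA class directly; it is not the internal implementation of LPCA:
local_eigenvectors = []
local_eigenvalues = []
for k in np.unique(idx):
    # perform PCA only on the observations that belong to cluster k
    pca_k = PCA(Dataset_local[idx==k,:], scaling='none', n_components=2)
    local_eigenvectors.append(pca_k.A)
    local_eigenvalues.append(pca_k.L)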
Plotting global versus local PCA#
Finally, for demonstration purposes, we plot the identified global and local eigenvectors on top of both synthetic data sets. The visual result of performing PCA globally and locally can be seen below:
Note that in local PCA, a separate set of eigenvectors is found in each cluster; the same is true for the principal components and eigenvalues.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Plotting PCA results#
In this tutorial, we present plotting functionalities from the reduction
module that aid in visualizing PCA results.
We import the necessary modules:
from PCAfold import PCA
from PCAfold import reduction
import numpy as np
and we set some initial parameters:
title = None
save_filename = None
As an example, we will use a data set representing combustion of syngas (CO/H2 mixture) in air generated from the steady laminar flamelet model. This data set has 11 variables and 50,000 observations. The data set was generated using Spitfire software [CHan20] and a chemical mechanism by Hawkes et al. [CHSSC07]. To load the data set from the tutorials directory:
X = np.genfromtxt('data-state-space.csv', delimiter=',')
X_names = ['$T$', '$H_2$', '$O_2$', '$O$', '$OH$', '$H_2O$', '$H$', '$HO_2$', '$CO$', '$CO_2$', '$HCO$']
We generate four PCA objects corresponding to four scaling criteria:
pca_X_Auto = PCA(X, scaling='auto', n_components=3)
pca_X_Range = PCA(X, scaling='range', n_components=3)
pca_X_Vast = PCA(X, scaling='vast', n_components=3)
pca_X_Pareto = PCA(X, scaling='pareto', n_components=3)
and we will plot PCA results from the generated objects.
Eigenvectors#
Weights of a single eigenvector can be plotted using the reduction.plot_eigenvectors
function. Note that multiple eigenvectors can be passed as an input, in which case this function will
generate as many plots as there are eigenvectors supplied.
Below is an example of plotting just the first eigenvector:
plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], variable_names=X_names)
To plot all eigenvectors resulting from a single PCA
class object:
plts = reduction.plot_eigenvectors(pca_X_Auto.A, variable_names=X_names)
Two weight normalizations are available:
No normalization. To use this variant set
plot_absolute=False
. Example can be seen below:
plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], eigenvectors_indices=[], variable_names=X_names, plot_absolute=False, save_filename=save_filename)
Absolute values. To use this variant set
plot_absolute=True
. Example can be seen below:
plt = reduction.plot_eigenvectors(pca_X_Auto.A[:,0], eigenvectors_indices=[], variable_names=X_names, plot_absolute=True, save_filename=save_filename)
Eigenvectors comparison#
Eigenvectors resulting from, for instance, different PCA
class objects can
be compared on a single plot using the reduction.plot_eigenvectors_comparison
function.
Two weight normalizations are available:
No normalization. To use this variant set
plot_absolute=False
. Example can be seen below:
plt = reduction.plot_eigenvectors_comparison((pca_X_Auto.A[:,0], pca_X_Range.A[:,0], pca_X_Vast.A[:,0], pca_X_Pareto.A[:,0]), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], variable_names=X_names, plot_absolute=False, color_map='coolwarm', save_filename=save_filename)
Absolute values. To use this variant set
plot_absolute=True
. Example can be seen below:
plt = reduction.plot_eigenvectors_comparison((pca_X_Auto.A[:,0], pca_X_Range.A[:,0], pca_X_Vast.A[:,0], pca_X_Pareto.A[:,0]), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], variable_names=X_names, plot_absolute=True, color_map='coolwarm', save_filename=save_filename)
Eigenvalue distribution#
Eigenvalue distribution can be plotted using the reduction.plot_eigenvalue_distribution
function.
Two eigenvalue normalizations are available:
No normalization. To use this variant set
normalized=False
. Example can be seen below:
plt = reduction.plot_eigenvalue_distribution(pca_X_Auto.L, normalized=False, save_filename=save_filename)
Normalized to 1. To use this variant set
normalized=True
. Example can be seen below:
plt = reduction.plot_eigenvalue_distribution(pca_X_Auto.L, normalized=True, save_filename=save_filename)
Eigenvalue distribution comparison#
Eigenvalues resulting from, for instance, different PCA
class objects can
be compared on a single plot using the reduction.plot_eigenvalues_comparison
function.
Two eigenvalue normalizations are available:
No normalization. To use this variant set
normalized=False
. Example can be seen below:
plt = reduction.plot_eigenvalue_distribution_comparison((pca_X_Auto.L, pca_X_Range.L, pca_X_Vast.L, pca_X_Pareto.L), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], normalized=False, color_map='coolwarm', save_filename=save_filename)
Normalized to 1. To use this variant set
normalized=True
. Example can be seen below:
plt = reduction.plot_eigenvalue_distribution_comparison((pca_X_Auto.L, pca_X_Range.L, pca_X_Vast.L, pca_X_Pareto.L), legend_labels=['Auto', 'Range', 'Vast', 'Pareto'], normalized=True, color_map='coolwarm', save_filename=save_filename)
Cumulative variance#
Cumulative variance computed from eigenvalues can be plotted using the
reduction.plot_cumulative_variance
function. Example of a plot:
plt = reduction.plot_cumulative_variance(pca_X_Auto.L, n_components=0, save_filename=save_filename)
The number of eigenvalues shown can also be truncated by setting the
n_components
input parameter accordingly. Example of a plot with
n_components=5
:
plt = reduction.plot_cumulative_variance(pca_X_Auto.L, n_components=5, save_filename=save_filename)
Two-dimensional manifold#
Two-dimensional manifold resulting from performing PCA transformation can be
plotted using the reduction.plot_2d_manifold
function. We first calculate
the principal components by transforming the original data set to the new basis:
principal_components = pca_X_Vast.transform(X)
By setting the color=X[:,0]
parameter, the manifold can additionally be
colored by the first variable in the data set (in this case, the temperature). Note that you can select the colormap to use through the color_map
parameter. Example of using color_map='inferno'
and coloring by the first variable in the data set:
plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(10,4), save_filename=save_filename)
Example of an uncolored plot:
plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], x_label='$Z_1$', y_label='$Z_2$', figure_size=(10,4), save_filename=save_filename)
Example of using color_map='Blues'
and coloring by the first variable in the data set:
plt = reduction.plot_2d_manifold(principal_components[:,0], principal_components[:,1], color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', colorbar_label='$T$ [K]', color_map='Blues', figure_size=(10,4), save_filename=save_filename)
Three-dimensional manifold#
Similarly, a three-dimensional manifold can be visualized:
plt = reduction.plot_3d_manifold(principal_components[:,0], principal_components[:,1], principal_components[:,2], elev=30, azim=-20, color=X[:,0], x_label='$Z_1$', y_label='$Z_2$', z_label='$Z_3$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(15,8), save_filename=save_filename)
Parity plot#
Parity plots of reconstructed variables can be visualized using the
reduction.plot_parity
function. We approximate the data set using the previously
obtained two principal components:
X_rec = pca_X_Vast.reconstruct(principal_components)
and we generate a parity plot which visualizes the reconstruction of the first variable:
plt = reduction.plot_parity(X[:,0], X_rec[:,0], color=X[:,0], x_label='Observed $T$', y_label='Reconstructed $T$', colorbar_label='$T$ [K]', color_map='inferno', figure_size=(7,7), save_filename=None)
As with the plot_2d_manifold
function, you can select the colormap to use.
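Beyond the visual parity plots, a simple quantitative check of the reconstruction quality can be helpful. The sketch below computes the coefficient of determination (R²) for every reconstructed variable using plain numpy; it is an illustration rather than a PCAfold function:
for i, name in enumerate(X_names):
    ss_res = np.sum((X[:,i] - X_rec[:,i])**2)
    ss_tot = np.sum((X[:,i] - np.mean(X[:,i]))**2)
    print(f'{name}: R2 = {1 - ss_res/ss_tot:.4f}')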
Bibliography#
- CHan20
Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.
- CHSSC07
Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
PCA on sampled data sets#
In this tutorial, we present how PCA can be performed on sampled data sets using
various helpful functions from the preprocess
and the reduction
module. Those functions essentially allow one to compare PCA performed on the original full data set, \(\mathbf{X}\), with PCA performed on the sampled data set, \(\mathbf{X_r}\). We are first going to present the major functionalities for performing and analyzing PCA on a sampled data set using a special case of sampling - taking an equal number of samples from each cluster. Next, we are going to show a more general way to
perform PCA on data sets that are sampled in any way of choice. A general overview for performing PCA on a sampled data set is presented below:
The main goal is to inform the PCA transformation with some of the characteristics of the sampled data set, \(\mathbf{X_r}\). There are several ways in which that information
can be incorporated and they can be controlled using a selected biasing option and setting the biasing_option
input parameter whenever needed. The user is referred to the documentation for more information on the available options (under User guide \(\rightarrow\) Data reduction \(\rightarrow\) Biasing options). It is understood that PCA performed on a sampled data set is biased in some way, since that data set contains different
proportions of features in terms of sample density compared to their original
contribution within the full original data set, \(\mathbf{X}\). Those features can be identified using any clustering technique of choice.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import DataSampler
from PCAfold import PCA
import numpy as np
from matplotlib.colors import ListedColormap
and we set some initial parameters:
scaling = 'auto'
biasing_option = 2
n_clusters = 4
n_components = 2
random_seed = 100
legend_label = ['$\mathbf{X}$', '$\mathbf{X_r}$']
color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
save_filename = None
Load and cluster the data set#
As an example, we will use a data set representing combustion of syngas in air generated from the steady laminar flamelet model using chemical mechanism by Hawkes et al. [CHSSC07]. This data set has 11 variables and 50,000 observations. The data set was generated using Spitfire software [CHan20]. To load the data set from the tutorials directory:
X = np.genfromtxt('data-state-space.csv', delimiter=',')
X_names = ['$T$', '$H_2$', '$O_2$', '$O$', '$OH$', '$H_2O$', '$H$', '$HO_2$', '$CO$', '$CO_2$', '$HCO$']
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
Z = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')
We start with clustering the data set that will result in an idx
vector of cluster classifications.
Clustering can be performed with any technique of choice. Here we will use one
of the available functions from the preprocess
module, preprocess.zero_neighborhood_bins,
and use the first principal component source term as the conditioning variable.
Perform global PCA on the data set and transform source terms of the original variables:
pca_X = PCA(X, scaling=scaling, n_components=n_components)
S_Z = pca_X.transform(S_X, nocenter=True)
Cluster the data set:
(idx, borders) = preprocess.zero_neighborhood_bins(S_Z[:,0], k=4, zero_offset_percentage=2, split_at_zero=True, verbose=True)
Visualize the result of clustering:
plt = preprocess.plot_2d_clustering(Z, X[:,0], idx, x_label='Mixture fraction [-]', y_label='$T$ [K]', color_map=color_map, first_cluster_index_zero=False, grid_on=True, figure_size=(8, 3), save_filename=save_filename)
Special case of PCA on sampled data sets#
In this section, we present the special case for performing PCA on data sets formed by taking equal number of samples from local clusters.
The reduction.EquilibratedSamplePCA
class enables a special case of performing PCA on a sampled data set. It uses an equal number of samples from each cluster and allows one to
analyze what happens when the data set is sampled gradually. It begins with
performing PCA on the original data set and then, over
n_iterations
, it gradually decreases the population of every cluster that is larger than the
smallest cluster, heading towards the population of the smallest cluster.
At each iteration, we obtain a new sampled data set on which PCA is performed.
At the last iteration, the populations of all clusters are equal and,
finally, PCA is performed on this equilibrated data set.
A schematic representation of this procedure is presented in the figure below:
Run cluster equilibration#
equilibrated_pca = reduction.EquilibratedSamplePCA(X,
idx,
scaling=scaling,
X_source=S_X,
n_components=n_components,
biasing_option=biasing_option,
n_iterations=10,
stop_iter=0,
random_seed=random_seed,
verbose=True)
With verbose=True
we will see some detailed information on the number of samples
in each cluster at each iteration:
Biasing is performed with option 2.
At iteration 1 taking samples:
{0: 4144, 1: 14719, 2: 24689, 3: 2416}
At iteration 2 taking samples:
{0: 3953, 1: 13352, 2: 22215, 3: 2416}
At iteration 3 taking samples:
{0: 3762, 1: 11985, 2: 19741, 3: 2416}
At iteration 4 taking samples:
{0: 3571, 1: 10618, 2: 17267, 3: 2416}
At iteration 5 taking samples:
{0: 3380, 1: 9251, 2: 14793, 3: 2416}
At iteration 6 taking samples:
{0: 3189, 1: 7884, 2: 12319, 3: 2416}
At iteration 7 taking samples:
{0: 2998, 1: 6517, 2: 9845, 3: 2416}
At iteration 8 taking samples:
{0: 2807, 1: 5150, 2: 7371, 3: 2416}
At iteration 9 taking samples:
{0: 2616, 1: 3783, 2: 4897, 3: 2416}
At iteration 10 taking samples:
{0: 2416, 1: 2416, 2: 2416, 3: 2416}
eigenvalues = equilibrated_pca.eigenvalues
eigenvectors = equilibrated_pca.eigenvectors
PCs = equilibrated_pca.pc_scores
PC_sources = equilibrated_pca.pc_sources
idx_train = equilibrated_pca.idx_train
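The per-iteration sample counts reported in the verbose output above follow a simple linear schedule towards the population of the smallest cluster. The short sketch below reproduces these numbers for illustration only; it is not the internal implementation of the reduction.EquilibratedSamplePCA class:
populations = np.array(preprocess.get_populations(idx))
n_iterations = 10
n_min = populations.min()
# constant per-iteration decrement for every cluster (integer step)
step = (populations - n_min) // n_iterations
for i in range(1, n_iterations+1):
    if i < n_iterations:
        samples = populations - i*step
    else:
        # the final iteration equilibrates all clusters at the smallest population
        samples = np.full_like(populations, n_min)
    print('At iteration ' + str(i) + ' taking samples:')
    print({k: int(v) for k, v in enumerate(samples)})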
Analyze centers change#
The reduction.analyze_centers_change
function compares centers computed on the original data set, \(\mathbf{X}\), versus on the sampled data set, \(\mathbf{X_r}\).
The idx_train
input parameter could for instance be obtained
from reduction.EquilibratedSamplePCA
and will thus represent the equilibrated data set sampled from the original data
set. It could also be obtained as sampled indices using any of the sampling
functions from the preprocess.DataSampler
class.
This function will produce a plot that shows the normalized centers and a percentage by which the new centers have moved with respect to the original ones. Example of a plot:
(centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, legend_label=legend_label, save_filename=save_filename)
If you do not wish to plot all variables present in a data set, use the
plot_variables
list as an input parameter to select indices of variables to
plot:
(centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, plot_variables=[1,3,4,6,8], legend_label=legend_label, save_filename=save_filename)
Analyze eigenvector weights change#
The eigenvectors
3D array obtained from reduction.EquilibratedSamplePCA
can now be used as an input parameter for plotting the eigenvector weights change
as we were gradually equilibrating cluster populations.
We are going to plot the first eigenvector (corresponding to PC-1) weights change with three variants of normalization. To access the first eigenvector one can simply do:
eigenvectors[:,0,:]
similarly, to access the second eigenvector:
eigenvectors[:,1,:]
and so on.
Three weight normalization variants are available:
No normalization, the absolute values of the eigenvector weights are plotted. To use this variant set
normalize=False
. Example can be seen below:
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=False, zero_norm=False, save_filename=save_filename)
Normalizing so that the highest weight is equal to 1 and the smallest weight is between 0 and 1. This is useful for judging the severity of the weight change. To use this variant set
normalize=True
and zero_norm=False
. Example can be seen below:
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=False, save_filename=save_filename)
Normalizing so that weights are between 0 and 1. This is useful for judging the change trends since it will blow up even the smallest changes to the entire range 0-1. To use this variant set
normalize=True
and zero_norm=True
. Example can be seen below:
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=True, save_filename=save_filename)
Note that in the above example, the color bar marks the iteration number, and so the \(0^{th}\) iteration represents eigenvectors from the original data set, \(\mathbf{X}\). The last iteration, in this example the \(10^{th}\) iteration, represents eigenvectors computed on the equilibrated, sampled data set.
If you do not wish to plot all variables present in a data set, use the
plot_variables
list as an input parameter to select indices of variables to
plot:
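For example, following the same pattern as the earlier analyze_eigenvector_weights_change calls (the variable indices chosen here are illustrative):
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[1,3,4,6,8], normalize=True, zero_norm=False, save_filename=save_filename)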
If you are only interested in plotting a comparison in the eigenvector weights
change between the original data set, \(\mathbf{X}\), and one target sampled data set,
\(\mathbf{X_r}\), (for instance the equilibrated data set) you can set the eigenvectors
input parameter to only
contain these two sets of weights. The function will then understand that only these two should be compared:
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename)
Such plot can be done for the pre-selected variables as well using the
plot_variables
list:
plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, plot_variables=[1,3,4,6,8], normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename)
Analyze eigenvalue distribution#
The reduction.analyze_eigenvalue_distribution
function will produce a plot that shows the normalized eigenvalues distribution for the original data set, \(\mathbf{X}\), and for the sampled data set, \(\mathbf{X_r}\). Example of a plot:
plt = reduction.analyze_eigenvalue_distribution(X, idx_train, scaling, biasing_option, legend_label=legend_label, save_filename=save_filename)
Visualize the re-sampled manifold#
Using the function reduction.plot_2d_manifold
you can visualize any
two-dimensional manifold and additionally color it with a variable of choice.
Here we are going to plot the re-sampled manifold resulting from performing PCA on
the sampled data set. Example of a plot:
plt = reduction.plot_2d_manifold(PCs[:,0,-1], PCs[:,1,-1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename)
Generalization of PCA on sampled data sets#
A more general approach to performing PCA on sampled data sets (instead of using the
reduction.EquilibratedSamplePCA
class) is to use the reduction.SamplePCA
class. This class allows one to perform PCA on
a data set that has been sampled in any way (in contrast to equilibrated sampling,
which always takes an equal number of samples from each cluster).
Note
It is worth noting that the class reduction.EquilibratedSamplePCA
uses reduction.SamplePCA
inside.
We first inspect how many samples each cluster has (in the clusters we identified earlier by binning the first principal component source term):
print(preprocess.get_populations(idx))
which shows us the populations of each cluster:
[4335, 16086, 27163, 2416]
We begin by performing manual sampling. Suppose that we would like to severely under-represent the two largest clusters and over-represent the features of the two smallest clusters. Let’s select 4000 samples from \(k_0\), 1000 samples from \(k_1\), 1000 samples from \(k_2\) and 2400 samples from \(k_3\). In this example we are not interested in generating test samples, so we can suppress returning those.
sample = DataSampler(idx, idx_test=None, random_seed=random_seed, verbose=True)
(idx_manual, _) = sample.manual({0:4000, 1:1000, 2:1000, 3:2400}, sampling_type='number', test_selection_option=1)
The verbose information will tell us how sample densities compare in terms of percentage of samples in each cluster:
Cluster 0: taking 4000 train samples out of 4335 observations (92.3%).
Cluster 1: taking 1000 train samples out of 16086 observations (6.2%).
Cluster 2: taking 1000 train samples out of 27163 observations (3.7%).
Cluster 3: taking 2400 train samples out of 2416 observations (99.3%).
Cluster 0: taking 335 test samples out of 335 remaining observations (100.0%).
Cluster 1: taking 15086 test samples out of 15086 remaining observations (100.0%).
Cluster 2: taking 26163 test samples out of 26163 remaining observations (100.0%).
Cluster 3: taking 16 test samples out of 16 remaining observations (100.0%).
Selected 8400 train samples (16.8%) and 41600 test samples (83.2%).
We now perform PCA on a data set that has been sampled according to
idx_manual
using the reduction.SamplePCA
class:
sample_pca = reduction.SamplePCA(X,
idx_manual,
scaling,
n_components,
biasing_option)
eigenvalues_manual = sample_pca.eigenvalues
eigenvectors_manual = sample_pca.eigenvectors
PCs_manual = sample_pca.pc_scores
Finally, we can generate all the same plots that were shown before. Here, we are only going to present the new re-sampled manifold resulting from the current manual sampling:
plt = reduction.plot_2d_manifold(PCs_manual[:,0], PCs_manual[:,1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename)
- CHan20
Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.
- CHSSC07
Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Handling source terms#
This tutorial can be of interest to researchers working with reactive flow data sets. We present how source terms of the original state variables can be handled using the PCAfold software. Specifically, PCAfold functionalities accommodate treatment of the sources of principal components (PCs), which can be valuable for implementing PC-transport approaches such as the one proposed in [TSP09].
Theory#
The methodology for the standard PC-transport approach was first proposed in [TSP09]. As an illustrative example, the PC-transport equations appropriate for a 0D chemical reactor are presented below. The reader is referred to [TBS15], [TEM15] for treatment of the full PC-transport equations, including diffusion.
We assume that the data set containing the original state-space variables is \(\mathbf{X} = [T, Y_1, Y_2, \dots, Y_{N_s}]\),
where \(T\) is temperature and \(Y_i\) is a mass fraction of species \(i\). \(N_s\) is the total number of chemical species. \(\mathbf{X}\) is also referred to as the state vector, see [THS18] for various definitions of the state vector. The corresponding source terms of the original state-space variables are \(\mathbf{S_X} = \Big[ -\frac{1}{\rho c_p} \sum_{i=1}^{N_s} h_i \omega_i, \; \frac{\omega_1}{\rho}, \; \dots, \; \frac{\omega_{N_s}}{\rho} \Big]\),
where \(\rho\) is the density of the mixture and \(c_p\) is the specific heat capacity of the mixture, \(\omega_i\) is the net mass production rate of species \(i\) and \(h_i\) is the enthalpy of species \(i\).
For a 0D system, we can write the evolution equation as \(\frac{d\mathbf{X}}{dt} = \mathbf{S_X}\).
This equation can instead be written in the space of principal components by applying a linear operator, \(\mathbf{A}\), identified by PCA. We can also account for centering and scaling the original data set, \(\mathbf{X}\), using centers \(\mathbf{C}\) and scales \(\mathbf{D}\): \(\frac{d}{dt} \Big[ \Big( \frac{\mathbf{X} - \mathbf{C}}{\mathbf{D}} \Big) \mathbf{A} \Big] = \frac{\mathbf{S_X}}{\mathbf{D}} \mathbf{A}\).
It is worth noting that when the original data set is centered and scaled, the corresponding source terms should only be scaled and not centered, since \(\frac{d}{dt} \Big( \frac{\mathbf{X} - \mathbf{C}}{\mathbf{D}} \Big) \mathbf{A} = \frac{1}{\mathbf{D}} \frac{d\mathbf{X}}{dt} \mathbf{A}\)
for constant \(\mathbf{C}\), \(\mathbf{D}\) and \(\mathbf{A}\).
We finally obtain the 0D PC-transport equation where the evolved variables are the principal components instead of the original state-space variables: \(\frac{d\mathbf{Z}}{dt} = \mathbf{S_Z}\),
where \(\mathbf{Z} = \Big( \frac{\mathbf{X} - \mathbf{C}}{\mathbf{D}} \Big) \mathbf{A}\) and \(\mathbf{S_{Z}} = \frac{\mathbf{S_X}}{\mathbf{D}}\mathbf{A}\).
Code implementation#
We import the necessary modules:
from PCAfold import PCA
import numpy as np
A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the Spitfire software [CHan20] and a chemical mechanism by Hawkes et al. [CHSSC07], is used as a demo data set.
We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):
X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
We perform PCA on the original data:
pca_X = PCA(X, scaling='auto', n_components=2)
We transform the original data set to the newly identified basis and compute the principal components (PCs), \(\mathbf{Z}\):
Z = pca_X.transform(X, nocenter=False)
Transform the source terms to the newly identified basis and compute the sources of principal components, \(\mathbf{S_Z}\):
S_Z = pca_X.transform(S_X, nocenter=True)
Note that we set the flag nocenter=True
, which is a specific setting that should be applied when transforming source terms.
With that setting, only the scales \(\mathbf{D}\) will be applied when transforming \(\mathbf{S_X}\)
to the new basis defined by \(\mathbf{A}\), and thus the transformation will be consistent with the discussion presented
in the previous section.
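As a quick sanity check, \(\mathbf{S_Z}\) can be recomputed manually from the scales and the basis matrix. The sketch below assumes that auto scaling corresponds to dividing each variable by its sample standard deviation; the exact convention used internally by PCAfold (for example, the degrees of freedom in the standard deviation) may differ slightly, so the comparison is only approximate:
D = np.std(X, axis=0)   # approximate 'auto' scales (assumption, see note above)
A = pca_X.A             # basis of retained eigenvectors
S_Z_manual = (S_X / D) @ A
# the two results should agree up to the exact scaling convention
print(np.max(np.abs(S_Z_manual - S_Z)))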
Bibliography#
- TBS15
Amir Biglari and James C. Sutherland. An a-posteriori evaluation of principal component analysis-based models for turbulent combustion simulations. Combustion and Flame, 162(10):4025–4035, 2015.
- TEM15
Tarek Echekki and Hessam Mirgolbabaei. Principal component transport in turbulent combustion: a posteriori analysis. Combustion and Flame, 162(5):1919–1933, 2015.
- THS18
Michael A. Hansen and James C. Sutherland. On the consistency of state vectors and Jacobian matrices. Combustion and Flame, 193:257–271, 2018.
- CHan20
Michael Alan Hansen. Spitfire. National Technology & Engineering Solutions of Sandia, LLC (NTESS), 2020. URL: https://github.com/sandialabs/Spitfire.
- CHSSC07
Evatt R. Hawkes, Ramanan Sankaran, James C. Sutherland, and Jacqueline H. Chen. Scalar mixing in direct numerical simulations of temporally evolving plane jet flames with skeletal CO/H2 kinetics. Proceedings of the Combustion Institute, 31(1):1633–1640, 2007.
- TSP09(1,2)
James C. Sutherland and Alessandro Parente. Combustion modeling using principal component analysis. Proceedings of the Combustion Institute, 32(1):1563–1570, 2009.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Manifold Assessment#
In this tutorial, we demonstrate tools that may be used for assessing manifold quality and dimensionality as well as comparing manifolds (parameterizations) in terms of representing dependent variables of interest.
import numpy as np
import matplotlib.pyplot as plt
from PCAfold import compute_normalized_variance, PCA, normalized_variance_derivative,\
find_local_maxima, plot_normalized_variance, plot_normalized_variance_comparison,\
plot_normalized_variance_derivative, plot_normalized_variance_derivative_comparison, random_sampling_normalized_variance
Here we are creating a two-dimensional manifold to assess with a dependent variable. Independent variables \(x\) and \(y\) and dependent variable \(f\) will be defined as \(x = e^{g}\cos^2(g)\), \(y = \cos^2(g)\), and \(f = g^3 + g\)
for a grid \(g\) between \([-0.5, 1]\).
npts = 1001
grid = np.linspace(-0.5,1.,npts)
x = np.exp(grid)*np.cos(grid)**2
y = np.cos(grid)**2
f = grid**3+grid
depvar_name = 'f' # dependent variable name
plt.scatter(x, y, c=f, s=5, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.title('colored by f')
plt.show()

We now want to assess the manifold in one and two dimensions using
compute_normalized_variance
. In order to use this function, the
independent and dependent variables must be arranged into
two-dimensional arrays of size npts by the number of variables. This is done in
the following code.
indepvars = np.vstack((x, y)).T
depvars = np.expand_dims(f, axis=1)
print('indepvars shape:', indepvars.shape, '\n depvars shape:', depvars.shape)
indepvars shape: (1001, 2)
depvars shape: (1001, 1)
We can now call compute_normalized_variance
on both the
two-dimensional manifold and one-dimensional slices of it in order to
assess the true dimensionality of the manifold (which should be two in
this case). A normalized variance is computed at various bandwidths
(Gaussian kernel filter widths) which can provide indications of
overlapping states in the manifold (or non-uniqueness) as well as
indications of how spread out the dependent variables are. A unique
manifold with large spread in the data should better facilitate building
models for accurate representations of the dependent variables of
interest. Details on the normalized variance equations may be found in
the documentation.
The bandwidths are applied to the independent variables after they are centered and scaled inside a unit box (by default). The bandwidth values may be computed by default according to interpoint distances or may be specified directly by the user.
Below is a demonstration of using default bandwidth values and plotting the resulting normalized variance.
orig2D_default = compute_normalized_variance(indepvars, depvars, [depvar_name])
plt = plot_normalized_variance(orig2D_default)
plt.show()

Now we will define an array for the bandwidths in order for the same values to be applied to our manifolds of interest.
bandwidth = np.logspace(-6,1,100) # array of bandwidth values
# one-dimensional manifold represented by x
orig1Dx = compute_normalized_variance(indepvars[:,:1], depvars, [depvar_name], bandwidth_values=bandwidth)
# one-dimensional manifold represented by y
orig1Dy = compute_normalized_variance(indepvars[:,1:], depvars, [depvar_name], bandwidth_values=bandwidth)
# original two-dimensional manifold
orig2D = compute_normalized_variance(indepvars, depvars, [depvar_name], bandwidth_values=bandwidth)
The following plot shows the normalized variance calculated for the dependent variable on each of the three manifolds. A single smooth rise in the normalized variance over bandwidth values indicates a unique manifold. Multiple rises, as can be seen in the one-dimensional manifolds, indicate multiple scales of variation. In this example, those smaller scales can be attributed to non-uniqueness introduced through the projection into one dimension. A curve that rises at larger bandwidth values also indicates more spread in the dependent variable over the manifold. Therefore the desired curve for an optimal manifold is one that has a single smooth rise that occurs at larger bandwidth values.
plt = plot_normalized_variance_comparison((orig1Dx, orig1Dy, orig2D), ([], [], []), ('Blues', 'Reds', 'Greens'), title='Normalized variance for '+depvar_name)
plt.legend(['orig,1D_x', 'orig,1D_y', 'orig,2D'])
plt.show()

In order to better highlight the fastest changes in the normalized variance, we look at a scaled derivative over the logarithmically scaled bandwidths which relays how fast the variance is changing as the bandwidth changes. Specifically we compute \(\hat{\mathcal{D}}(\sigma)\), whose equation can be found in the documentation. Below we show this quantity for the original two-dimensional manifold.
We see a single peak in \(\hat{\mathcal{D}}(\sigma)\) corresponding to the single rise in \(\mathcal{N}(\sigma)\) pointed out above. The location of this peak gives an idea of the feature sizes or length scales associated with variation in the dependent variable over the manifold.
plt = plot_normalized_variance_derivative(orig2D)
plt.show()

We can also plot a comparison of these peaks using
plot_normalized_variance_derivative_comparison
for the three
manifold representations discussed thus far. In the plot below, we can
see that the two one-dimensional projections have two peaks in
\(\hat{\mathcal{D}}(\sigma)\) corresponding to the two humps in the
normalized variance. This clearly shows that the projections are
introducing a significant scale of variation not present on the original
two-dimensional manifold. The locations of these peaks indicate the
feature sizes or scales of variation present in the dependent variable
on the manifolds.
plt = plot_normalized_variance_derivative_comparison((orig1Dx, orig1Dy, orig2D), ([],[],[]), ('Blues', 'Reds','Greens'))
plt.legend(['orig,1D_x', 'orig,1D_y', 'orig,2D'])
plt.show()

We can also break down the analysis of these peaks to determine the
\(\sigma\) where they occur. The normalized_variance_derivative
function will return a dictionary of \(\hat{\mathcal{D}}(\sigma)\)
for each dependent variable along with the corresponding \(\sigma\)
values. The find_local_maxima
function can then be used to report
the locations of the peaks in \(\hat{\mathcal{D}}(\sigma)\) along
with the peak values themselves. In order to properly analyze these
peaks, we leave the logscaling
parameter at its default value of True.
We can also set show_plot
to True to display the peaks found. This
is demonstrated for the one-dimensional projection onto x below.
orig1Dx_derivative, orig1Dx_sigma, _ = normalized_variance_derivative(orig1Dx)
orig1Dx_peak_locs, orig1Dx_peak_values = find_local_maxima(orig1Dx_derivative[depvar_name], orig1Dx_sigma, show_plot=True)
print('peak locations:', orig1Dx_peak_locs)
print('peak values:', orig1Dx_peak_values)

peak locations: [0.00086033 0.5070298 ]
peak values: [1.01351778 0.60217727]
In this example, we know in the case of the one-dimensional projections that non-uniqueness or overlap is introduced in the dependent variable representation. This shows up as an additional peak in \(\hat{\mathcal{D}}(\sigma)\) compared to the original two-dimensional manifold. In general, though, we may not know whether that additional scale of variation is due to non-uniqueness or is a new characteristic feature from sharpening gradients. We can analyze sensitivity to data sampling in order to distinguish between the two.
As an example, we will analyze the projection onto x. We can use the
random_sampling_normalized_variance
to compute the normalized
variance for various random samplings based on the provided
sampling_percentages
argument. We can also specify multiple
realizations through the n_sample_iterations
argument, which will be
averaged for returning \(\hat{\mathcal{D}}(\sigma)\). We will test
100%, 50%, and 25% specified as [1., 0.5, 0.25]. Note that specifying
100% returns the same result as calling compute_normalized_variance on
the full dataset as we did above.
pctdict, pctsig, _ = random_sampling_normalized_variance([1., 0.5, 0.25],
indepvars[:,:1],
depvars,
[depvar_name],
bandwidth_values=bandwidth,
n_sample_iterations=5)
sampling 100.0 % of the data
iteration 1 of 5
iteration 2 of 5
iteration 3 of 5
iteration 4 of 5
iteration 5 of 5
sampling 50.0 % of the data
iteration 1 of 5
iteration 2 of 5
iteration 3 of 5
iteration 4 of 5
iteration 5 of 5
sampling 25.0 % of the data
iteration 1 of 5
iteration 2 of 5
iteration 3 of 5
iteration 4 of 5
iteration 5 of 5
We then plot the result below and report the peak locations for the two dominant peaks. We can see that the peak at the larger \(\sigma\) isn’t very sensitive to data sampling. It remains around 0.5. The peak at smaller \(\sigma\) though experiences a shift to larger \(\sigma\) as less data is included (lower percent sampling). This is because variation from non-uniqueness is much more sensitive to data spacing than characteristic feature variation. We would therefore conclude that the second scale of variation introduced by the projection onto x is due to non-uniqueness, not a characteristic feature size, and therefore the projection is unacceptable. This confirms what we already knew from the visual analysis.
peakthreshold = 0.4
for pct in pctdict.keys():
plt.semilogx(pctsig, pctdict[pct][depvar_name], '--', linewidth=2, label=pct)
peak_locs, peak_vals = find_local_maxima(pctdict[pct][depvar_name], pctsig, threshold=peakthreshold)
print(f'{pct*100:3.0f}% sampling peak locations: {peak_locs[0]:.2e}, {peak_locs[1]:.2e}')
plt.grid()
plt.xlabel('$\sigma$')
plt.ylabel('$\hat{\mathcal{D}}$')
plt.legend()
plt.xlim([np.min(pctsig), np.max(pctsig)])
plt.ylim([0,1.02])
plt.title('Detecting non-uniqueness through sensitivity to sampling')
plt.show()
100% sampling peak locations: 8.60e-04, 5.07e-01
50% sampling peak locations: 1.15e-03, 5.06e-01
25% sampling peak locations: 3.68e-03, 4.98e-01

As an example of comparing multiple representations of a manifold in the
same dimensional space, we will use PCA. Below, two PCA objects are
created with different scalings. The first uses the default scaling
std
while the second uses the scaling pareto
. The plots of the
resulting manifolds are shown below for comparison to the original. The
dimensions for the PCA manifolds are referred to as PC1 and PC2.
# PCA using std scaling
pca_std = PCA(indepvars)
eta_std = pca_std.transform(indepvars)
plt.scatter(eta_std[:,0], eta_std[:,1], c=f, s=2, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('std scaling')
plt.show()
# PCA using pareto scaling
pca_pareto = PCA(indepvars,'pareto')
eta_pareto = pca_pareto.transform(indepvars)
plt.scatter(eta_pareto[:,0], eta_pareto[:,1], c=f, s=2, cmap='rainbow')
plt.colorbar()
plt.grid()
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('pareto scaling')
plt.show()


We call compute_normalized_variance
in order to assess these
manifolds in one- and two-dimensional space. Since PCA orders the PCs
according to the amount of variance explained, we will use PC1 for
representing a one-dimensional manifold.
pca1D_std = compute_normalized_variance(eta_std[:,:1], depvars, [depvar_name],bandwidth_values=bandwidth)
pca2D_std = compute_normalized_variance(eta_std, depvars, [depvar_name],bandwidth_values=bandwidth)
pca1D_pareto = compute_normalized_variance(eta_pareto[:,:1], depvars, [depvar_name],bandwidth_values=bandwidth)
pca2D_pareto = compute_normalized_variance(eta_pareto, depvars, [depvar_name],bandwidth_values=bandwidth)
We then go straight to plotting \(\hat{\mathcal{D}}\) to see if new peaks are introduced compared to the original two-dimensional manifold, indicating new scales of variation. We again find that the one-dimensional projections are introducing a new scale. We could perform a similar analysis as shown above on the projection onto x to conclude that these new scales are also from non-uniqueness introduced in the projection. We therefore continue the analysis only considering two-dimensional parameterizations to figure out which one may be best in representing f.
plt = plot_normalized_variance_derivative_comparison((pca1D_std, pca2D_std, pca1D_pareto, pca2D_pareto, orig2D),
([],[],[],[],[]),
('Blues', 'Reds', 'Purples', 'Oranges', 'Greens'))
plt.legend(['pca1D_std', 'pca2D_std', 'pca1D_pareto', 'pca2D_pareto', 'orig,2D'])
plt.show()

We compute the locations of the peaks in \(\hat{\mathcal{D}}\) over \(\sigma\) below.
pca2D_std_derivative, pca2D_std_sigma, _ = normalized_variance_derivative(pca2D_std)
pca2D_pareto_derivative, pca2D_pareto_sigma, _ = normalized_variance_derivative(pca2D_pareto)
orig2D_derivative, orig2D_sigma, _ = normalized_variance_derivative(orig2D)
pca2D_std_peak_locs, _ = find_local_maxima(pca2D_std_derivative[depvar_name], pca2D_std_sigma)
pca2D_pareto_peak_locs, _ = find_local_maxima(pca2D_pareto_derivative[depvar_name], pca2D_pareto_sigma)
orig2D_peak_locs, _ = find_local_maxima(orig2D_derivative[depvar_name], orig2D_sigma)
print('peak locations:')
print('orig2D',orig2D_peak_locs)
print('pca2D_std',pca2D_std_peak_locs)
print('pca2D_pareto',pca2D_pareto_peak_locs)
peak locations:
orig2D [0.66762295]
pca2D_std [0.78185085]
pca2D_pareto [0.67063695]
The results show that PCA with std
scaling results in the largest
feature size (largest \(\sigma\)) and is therefore the best for
parameterizing f. This representation should better facilitate modeling
of f as the features are more spread out.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Local feature size estimation#
In this tutorial, we present the local feature size estimation tool from the analysis
module.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
import numpy as np
import pandas as pd
import time
import matplotlib
import matplotlib.pyplot as plt
and we set some initial parameters:
save_filename = None
bandwidth_values = np.logspace(-5, 1, 40)
We load the dataset, which comes from solving the Brusselator PDE. The dataset has two independent variables, \(x\) and \(y\), and one dependent variable, \(\phi\). The dataset is generated on a uniform \(x\)-\(y\) grid.
data = pd.read_csv('brusselator-PDE.csv', header=None).to_numpy()
indepvars = data[:,0:2]
depvar = data[:,2:3]

Compute the feature sizes map on a synthetic dataset#
We start by computing the normalized variance, \(\hat{\mathcal{N}}(\sigma)\). In order to compute the quantities necessary for drawing the feature size map, we need to set either compute_sample_norm_var=True
or compute_sample_norm_range=True
.
tic = time.perf_counter()
variance_data = analysis.compute_normalized_variance(indepvars,
depvars=depvar,
depvar_names=['phi'],
bandwidth_values=bandwidth_values,
compute_sample_norm_range=True)
toc = time.perf_counter()
print(f'\tTime it took: {(toc - tic)/60:0.1f} minutes.\n' + '-'*40)
We compute the normalized variance derivative, \(\hat{\mathcal{D}}(\sigma)\):
derivative, sigmas, _ = analysis.normalized_variance_derivative(variance_data)
derivatives = derivative['phi']
The local feature size estimation algorithm iteratively updates the size of the local features by running the “bandwidth descent” algorithm. The goal is to compute the bandwidth vector \(\mathbf{B}\), which contains an estimate of the local feature size tied to every data point. The vector \(\mathbf{B}\) is first initialized with the largest feature size indicated by the starting_bandwidth_idx
parameter. Entries in \(\mathbf{B}\) are then iteratively updated based on the cutoff
value.
starting_bandwidth_idx = 29

We run the bandwidth descent algorithm. This will update the bandwidth vector at each location where the sample normalized variance is above the cutoff
percentage of its maximum value.
cutoff = 15
B = analysis.feature_size_map(variance_data,
variable_name='phi',
cutoff=cutoff,
starting_bandwidth_idx=starting_bandwidth_idx,
use_variance=False,
verbose=True)
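One simple way to inspect the resulting feature size map is to color the data set by the estimated local bandwidths. This is a minimal matplotlib sketch rather than a dedicated PCAfold plotting function:
plt.scatter(indepvars[:,0], indepvars[:,1], c=np.log10(B), s=5, cmap='inferno')
plt.colorbar(label='log10(B)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Local feature size map')
plt.show()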

Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Cost function for manifold topology assessment and optimization#
In this tutorial, we present the cost function from the analysis
module which distills information from the normalized variance derivative into a single number. The cost function can be used for low-dimensional manifold topology assessment and manifold optimization.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import utilities
from PCAfold import manifold_informed_backward_variable_elimination as BVE
import numpy as np
import time
and we set some initial parameters:
save_filename = None
random_seed = 100
Upload a combustion data set#
A data set representing combustion of syngas in air, generated from the steady laminar flamelet model using Spitfire and a chemical mechanism by Hawkes et al., is used as a demo data set.
We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):
X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
X_names = ['T', 'H2', 'O2', 'O', 'OH', 'H2O', 'H', 'HO2', 'CO', 'CO2', 'HCO']
(n_observations, n_variables) = np.shape(X)
Generate low-dimensional manifolds using PCA#
Below, we generate two- and three-dimensional projections of the original data set from PCA for further assessment.
pca_X_2D = reduction.PCA(X, scaling='auto', n_components=2)
Z_2D = pca_X_2D.transform(X)
S_Z_2D = pca_X_2D.transform(S_X, nocenter=True)
pca_X_3D = reduction.PCA(X, scaling='auto', n_components=3)
Z_3D = pca_X_3D.transform(X)
S_Z_3D = pca_X_3D.transform(S_X, nocenter=True)
We visualize the generated manifolds:





Manifold assessment using the cost function#
We are going to compute the cost function for the PC source terms as the target dependent variables.
We specify the penalty function to use:
penalty_function = 'log-sigma-over-peak'
and the bandwidth values, \(\sigma\), for normalized variance derivative computation:
bandwidth_values = np.logspace(-7, 3, 50)
We specify the cost function’s hyper-parameters, the power \(r\) and the vertical shift \(b\). Increasing the power parameter allows for a stronger penalty for non-uniqueness and increasing the vertical shift parameter allows for a stronger penalty for small feature sizes.
power = 1
vertical_shift = 1
We sample the dataset to decrease the computational time of this tutorial:
sample_random = preprocess.DataSampler(np.zeros((n_observations,)).astype(int), random_seed=random_seed, verbose=False)
(idx_sample, _) = sample_random.random(50)
We create lists of the target dependent variables names:
depvar_names_2D = ['SZ' + str(i) for i in range(1,3)]
depvar_names_3D = ['SZ' + str(i) for i in range(1,4)]
and we begin with computing the normalized variance derivative for the two-dimensional PCA projection:
variance_data_2D = analysis.compute_normalized_variance(Z_2D[idx_sample,:],
S_Z_2D[idx_sample,:],
depvar_names=depvar_names_2D,
bandwidth_values=bandwidth_values)
The associated costs are computed from the generated object of the VarianceData class. With norm=None, the costs are not aggregated over all target variables (in this case the PC source terms); instead, the output gives the individual cost for each target variable.
costs_2D = analysis.cost_function_normalized_variance_derivative(variance_data_2D,
penalty_function=penalty_function,
power=power,
vertical_shift=vertical_shift,
norm=None)
We can print the individual costs:
for i, variable in enumerate(depvar_names_2D):
    print(variable + ':\t' + str(round(costs_2D[i],3)))
SZ1: 4.238
SZ2: 1.567
Finally, we repeat the cost function computation for the three-dimensional PCA projection:
variance_data_3D = analysis.compute_normalized_variance(Z_3D[idx_sample,:],
S_Z_3D[idx_sample,:],
depvar_names=depvar_names_3D,
bandwidth_values=bandwidth_values)
costs_3D = analysis.cost_function_normalized_variance_derivative(variance_data_3D,
penalty_function=penalty_function,
power=power,
vertical_shift=vertical_shift,
norm=None)
and we print the individual costs:
for i, variable in enumerate(depvar_names_3D):
    print(variable + ':\t' + str(round(costs_3D[i],3)))
SZ1: 1.157
SZ2: 1.23
SZ3: 1.422
The cost function provides information about the quality of the low-dimensional data projection with respect to the target dependent variables, which in this case are the PC source terms. A higher cost indicates a worse manifold topology. The two topological aspects that the cost function takes into account are non-uniqueness and feature sizes.
We observe that individual costs are higher for the two-dimensional than for the three-dimensional PCA projection. This can be understood from our visualization of the manifolds, where we have seen a significant overlap affecting the first PC source term in particular. With the third manifold parameter added in the three-dimensional projection, the projection quality improves and the costs drop.
Moreover, for the two-dimensional PCA projection, the cost associated with the first PC source term is higher than the cost associated with the second PC source term. This can also be understood by comparing the two-dimensional projections colored by \(S_{Z, 1}\) and by \(S_{Z, 2}\). The high magnitudes of \(S_{Z, 1}\) values occur at the location where the manifold exhibits overlap, while the same overlap does not affect the \(S_{Z, 2}\) values to the same extent.
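A single aggregated cost per projection can also be convenient for direct comparisons between projections. A minimal sketch, assuming that the norm options accepted here match the ones used later with the manifold-informed backward variable elimination algorithm (e.g. 'cumulative'):
# Sketch only: aggregate the individual costs into one number per projection,
# assuming norm='cumulative' is an accepted option of this function.
costs_2D_aggregated = analysis.cost_function_normalized_variance_derivative(variance_data_2D,
                                                                            penalty_function=penalty_function,
                                                                            power=power,
                                                                            vertical_shift=vertical_shift,
                                                                            norm='cumulative')
costs_3D_aggregated = analysis.cost_function_normalized_variance_derivative(variance_data_3D,
                                                                            penalty_function=penalty_function,
                                                                            power=power,
                                                                            vertical_shift=vertical_shift,
                                                                            norm='cumulative')
print(costs_2D_aggregated, costs_3D_aggregated)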
Manifold optimization using the cost function#
The utilities.manifold_informed_backward_variable_elimination
function implements an iterative feature selection algorithm that uses the cost function as an objective function. The algorithm selects an optimal subset of the original state variables that results in an optimized PCA manifold topology. Below, we demonstrate the algorithm on a 10% sample of the original data. The data is sampled to speed up the calculations for the purpose of this demonstration; in real applications it is recommended to use the full data set.
Sample the original data:
sample_random = preprocess.DataSampler(np.zeros((n_observations,)).astype(int), random_seed=100, verbose=False)
(idx_sample, _) = sample_random.random(10)
sampled_X = X[idx_sample,:]
sampled_S_X = S_X[idx_sample,:]
Specify the target variables to assess on the manifold (we will also add the PC source terms to the target variables by setting add_transformed_source=True
). In this case we take the temperature, \(T\), and the mass fractions of several important chemical species: \(H_2\), \(O_2\), \(H_2O\), \(CO\) and \(CO_2\):
target_variables = sampled_X[:,[0,1,2,5,8,9]]
Set the norm to take over all target dependent variables:
norm = 'cumulative'
Set the target manifold dimensionality:
q = 2
Run the algorithm:
_, selected_variables, _, _ = BVE(sampled_X,
sampled_S_X,
X_names,
scaling='auto',
bandwidth_values=bandwidth_values,
target_variables=target_variables,
add_transformed_source=True,
target_manifold_dimensionality=q,
penalty_function=penalty_function,
power=power,
vertical_shift=vertical_shift,
norm=norm,
verbose=True)
With verbose=True
we will see additional information on costs at each iteration:
Iteration No.4
Currently eliminating variable from the following list:
['T', 'H2', 'O2', 'O', 'OH', 'H2O', 'H', 'CO2']
Currently eliminated variable: T
Running PCA for a subset:
H2, O2, O, OH, H2O, H, CO2
Cost: 11.4539
WORSE
Currently eliminated variable: H2
Running PCA for a subset:
T, O2, O, OH, H2O, H, CO2
Cost: 13.4908
WORSE
Currently eliminated variable: O2
Running PCA for a subset:
T, H2, O, OH, H2O, H, CO2
Cost: 14.8488
WORSE
Currently eliminated variable: O
Running PCA for a subset:
T, H2, O2, OH, H2O, H, CO2
Cost: 12.6549
WORSE
Currently eliminated variable: OH
Running PCA for a subset:
T, H2, O2, O, H2O, H, CO2
Cost: 10.0785
SAME OR BETTER
Currently eliminated variable: H2O
Running PCA for a subset:
T, H2, O2, O, OH, H, CO2
Cost: 10.7182
WORSE
Currently eliminated variable: H
Running PCA for a subset:
T, H2, O2, O, OH, H2O, CO2
Cost: 11.8644
WORSE
Currently eliminated variable: CO2
Running PCA for a subset:
T, H2, O2, O, OH, H2O, H
Cost: 10.9898
WORSE
Variable OH is removed.
Cost: 10.0785
Iteration time: 0.8 minutes.
Finally, we generate the PCA projection of the optimized subset of the original data set:
pca_X_optimized = reduction.PCA(X[:,selected_variables], scaling='auto', n_components=2)
Z_optimized = pca_X_optimized.transform(X[:,selected_variables])
S_Z_optimized = pca_X_optimized.transform(S_X[:,selected_variables], nocenter=True)


From the plots above, we observe that the optimized two-dimensional PCA projection exhibits much less overlap compared to the two-dimensional PCA projection that we computed earlier using the full set of state variables.
Below, we compute the costs for the two PC source terms again for this optimized projection:
variance_data_optimized = analysis.compute_normalized_variance(Z_optimized,
S_Z_optimized,
depvar_names=depvar_names_2D,
bandwidth_values=bandwidth_values)
costs_optimized = analysis.cost_function_normalized_variance_derivative(variance_data_optimized,
penalty_function=penalty_function,
power=power,
vertical_shift=vertical_shift,
norm=None)
for i, variable in enumerate(depvar_names_2D):
    print(variable + ':\t' + str(round(costs_optimized[i],3)))
SZ1: 1.653
SZ2: 1.179
We note that the costs for the two PC source terms are lower than the costs that we computed earlier, when the PCA projection was generated using the full set of state variables.
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Nonlinear regression#
In this tutorial, we present the nonlinear regression utilities from the analysis
module.
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import reconstruction
import numpy as np
and we set some initial parameters:
save_filename = None
Generating a synthetic data set#
We begin by generating a synthetic data set with two independent variables, \(x\) and \(y\), and one dependent variable, \(\phi\), that we will nonlinearly regress using kernel regression.
Generate independent variables \(x\) and \(y\) from a uniform grid:
n_points = 100
grid = np.linspace(0,100,n_points)
x, y = np.meshgrid(grid, grid)
x = x.flatten()
y = y.flatten()
xy = np.hstack((x[:,None],y[:,None]))
(n_observations, _) = np.shape(xy)
Generate a dependent variable \(\phi\) as a quadratic function of \(x\):
phi = xy[:,0:1]**2
Visualize the generated data set:
plt = reduction.plot_2d_manifold(x,
y,
color=phi,
x_label='x',
y_label='y',
colorbar_label='$\phi$',
color_map='inferno',
figure_size=(8,4),
save_filename=save_filename)
Kernel regression#
We first generate train and test samples using the DataSampler
class:
train_perc = 80
random_seed = 100
idx = np.zeros((n_observations,)).astype(int)
sample_random = preprocess.DataSampler(idx, random_seed=random_seed, verbose=False)
(idx_train, idx_test) = sample_random.random(train_perc, test_selection_option=1)
xy_train = xy[idx_train,:]
xy_test = xy[idx_test,:]
phi_train = phi[idx_train]
phi_test = phi[idx_test]
Specify the bandwidth for the Nadaraya-Watson kernel:
bandwidth = 10
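For reference, the Nadaraya-Watson estimator predicts the dependent variable at a query point \(\mathbf{x}\) as a kernel-weighted average of the training observations, \(\hat{\phi}(\mathbf{x}) = \sum_i K_{\sigma}(\mathbf{x}, \mathbf{x}_i) \phi_i / \sum_i K_{\sigma}(\mathbf{x}, \mathbf{x}_i)\), where \(K_{\sigma}\) is a Gaussian kernel whose width is set by the bandwidth \(\sigma\) (the exact kernel normalization used by KReg may differ from this sketch). Larger bandwidths yield smoother but potentially more biased predictions.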
Fit the kernel regression model with train data:
model = analysis.KReg(xy_train, phi_train)
Predict the test data:
phi_test_predicted = model.predict(xy_test, bandwidth=bandwidth)
Predict all data:
phi_predicted = model.predict(xy, bandwidth=bandwidth)
Nonlinear regression assessment#
In this section we will perform a few assessments of the quality of the nonlinear regression.
Visual assessment#
We begin by visualizing the regressed (predicted) dependent variable \(\phi\). This can be done either in 2D:
plt = reconstruction.plot_2d_regression(x,
phi,
phi_predicted,
x_label='$x$',
y_label='$\phi$',
figure_size=(10,4),
save_filename=save_filename)
or in 3D:
plt = reconstruction.plot_3d_regression(x,
y,
phi,
phi_predicted,
elev=20,
azim=-100,
x_label='$x$',
y_label='$y$',
z_label='$\phi$',
figure_size=(10,7),
save_filename=save_filename)
Predicted 2D field for scalar quantities#
When the predicted variable is a scalar quantity, a scatter plot of the regressed scalar field can be generated using the function plot_2d_regression_scalar_field. Regression of the scalar field can be tested on any user-defined grid, including outside the bounds of the training data. This can be of particular importance when generating reduced-order models, where the behavior of the regression should be tested outside of the training manifold.
Below, we show an example on a combustion data set.
X = np.genfromtxt('data-state-space.csv', delimiter=',')
S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
pca_X = reduction.PCA(X, scaling='vast', n_components=2)
PCs = pca_X.transform(X)
PC_sources = pca_X.transform(S_X, nocenter=True)
(PCs_pp, centers_PCs, scales_PCs) = preprocess.center_scale(PCs, '-1to1')
Fit the kernel regression model with the train data:
KReg_model = analysis.KReg(PCs_pp, PC_sources)
We define the regression model function that will make predictions for any query point:
def regression_model(regression_input):
    regression_input_CS = (regression_input - centers_PCs)/scales_PCs
    regressed_value = KReg_model.predict(regression_input_CS, 'nearest_neighbors_isotropic', n_neighbors=10)[0,1]
    return regressed_value
We first visualize the training manifold, colored by the dependent variable being predicted:
reduction.plot_2d_manifold(PCs[:,0],
PCs[:,1],
x_label='$Z_1$',
y_label='$Z_2$',
color=PC_sources[:,1],
color_map='viridis',
colorbar_label='$S_{Z_2}$',
figure_size=(8,6),
save_filename=save_filename)

Define the bounds for the scalar field:
grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])
Plot the regressed scalar field:
plt = reconstruction.plot_2d_regression_scalar_field(grid_bounds,
regression_model,
x=PCs[:,0],
y=PCs[:,1],
resolution=(200,200),
extension=(10,10),
s_field=10,
s_manifold=1,
x_label='$Z_1$ [$-$]',
y_label='$Z_2$ [$-$]',
manifold_color='r',
colorbar_label='$S_{Z, 2}$',
color_map='viridis',
colorbar_range=(np.min(PC_sources[:,1]), np.max(PC_sources[:,1])),
manifold_alpha=1,
grid_on=False,
figure_size=(10,6),
save_filename=save_filename);

Streamplots for predicted vector quantities#
In a special case, when the predicted variable is a vector, a streamplot of the regressed vector field can be plotted using the function plot_2d_regression_streamplot
. Regression of the vector field can be tested on any user-defined grid, including outside the bounds of the training data. This can be of particular importance when generating reduced-order models, where the behavior of the regression should be tested outside of the training manifold.
Below, we show an example on a synthetic data set:
X = np.random.rand(100,5)
S_X = np.random.rand(100,5)
pca_X = reduction.PCA(X, n_components=2)
PCs = pca_X.transform(X)
S_Z = pca_X.transform(S_X, nocenter=True)
vector_model = analysis.KReg(PCs, S_Z)
We define the regression model function that will make predictions for any query point:
def regression_model(query):
    predicted = vector_model.predict(query, 'nearest_neighbors_isotropic', n_neighbors=1)
    return predicted
Define the bounds for the streamplot:
grid_bounds = ([np.min(PCs[:,0]),np.max(PCs[:,0])],[np.min(PCs[:,1]),np.max(PCs[:,1])])
Plot the regression streamplot:
plt = reconstruction.plot_2d_regression_streamplot(grid_bounds,
regression_model,
x=PCs[:,0],
y=PCs[:,1],
resolution=(15,15),
extension=(20,20),
color='k',
x_label='$Z_1$',
y_label='$Z_2$',
manifold_color=X[:,0],
colorbar_label='$X_1$',
color_map='plasma',
colorbar_range=(0,1),
manifold_alpha=1,
grid_on=False,
figure_size=(10,6),
title='Streamplot',
save_filename=None)
Error metrics#
Several error metrics are available to measure how well the dependent variable(s) are predicted. The metrics can be accessed individually or computed collectively; below, we show examples of both. The available metrics are:
Mean absolute error
Mean squared error
Root mean squared error
Normalized root mean squared error
Turning points
Good estimate
Good direction estimate
An example of computing mean absolute error is shown below:
MAE = reconstruction.mean_absolute_error(phi, phi_predicted)
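The remaining error metrics can be computed analogously. A minimal sketch, assuming the function names in the reconstruction module mirror the metric names listed above (a normalization option analogous to the norm='std' argument of RegressionAssessment may also be available for the normalized metric):
# Sketch only: assumed function names mirroring the listed metrics.
MSE = reconstruction.mean_squared_error(phi, phi_predicted)
RMSE = reconstruction.root_mean_squared_error(phi, phi_predicted)
NRMSE = reconstruction.normalized_root_mean_squared_error(phi, phi_predicted)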
We also compute the coefficient of determination, \(R^2\), values for the test data and entire data:
r2_test = reconstruction.coefficient_of_determination(phi_test, phi_test_predicted)
r2_all = reconstruction.coefficient_of_determination(phi, phi_predicted)
print('All R2:\t\t' + str(round(r2_all, 6)) + '\nTest R2:\t' + str(round(r2_test, 6)))
The code above will print:
All R2: 0.997378
Test R2: 0.997366
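For reference, the coefficient of determination follows the standard definition, \(R^2 = 1 - \sum_i (\phi_i - \hat{\phi}_i)^2 / \sum_i (\phi_i - \bar{\phi})^2\), where \(\hat{\phi}_i\) are the predicted values and \(\bar{\phi}\) is the mean of the observed values; values close to 1 indicate an accurate regression.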
By instantiating an object of the RegressionAssessment
class, one can compute all available metrics at once:
regression_metrics = reconstruction.RegressionAssessment(phi, phi_predicted, variable_names=['$\phi$'], norm='std')
As an example, mean absolute error can be accessed by:
regression_metrics.mean_absolute_error
All computed metrics can be printed with the use of the RegressionAssessment.print_metrics
function. A few output formats are available.
Raw text format:
regression_metrics.print_metrics(table_format=['raw'], float_format='.4f')
--------------------
$\phi$
R2: 0.9958
MAE: 98.4007
MSE: 37762.8664
RMSE: 194.3267
NRMSE: 0.0645
GDE: nan
tex
format:
regression_metrics.print_metrics(table_format=['tex'], float_format='.4f')
\begin{table}[h!]
\begin{center}
\begin{tabular}{ll} \toprule
& \textit{$\phi$} \\ \midrule
$R^2$ & 0.9958 \\
MAE & 98.4007 \\
MSE & 37762.8664 \\
RMSE & 194.3267 \\
NRMSE & 0.0645 \\
GDE & nan \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}
pandas.DataFrame
format (recommended for Jupyter notebooks):
regression_metrics.print_metrics(table_format=['pandas'], float_format='.4f')

Note that with the float_format
parameter you can change the number of digits displayed:
regression_metrics.print_metrics(table_format=['pandas'], float_format='.2f')

Stratified error metrics#
In addition to a single value of \(R^2\) for the entire data set, we can also compute stratified \(R^2\) values. This allows us to observe how kernel regression performed in each stratum (bin) of the dependent variable \(\phi\). We will compute the stratified \(R^2\) in 20 bins of \(\phi\):
n_bins = 20
use_global_mean = False
verbose = True
(idx, bins_borders) = preprocess.variable_bins(phi, k=n_bins, verbose=False)
r2_in_bins = reconstruction.stratified_coefficient_of_determination(phi, phi_predicted, idx=idx, use_global_mean=use_global_mean, verbose=verbose)
The code above will print:
Bin 1 | size 2300 | R2 0.868336
Bin 2 | size 900 | R2 0.870357
Bin 3 | size 700 | R2 0.863821
Bin 4 | size 600 | R2 0.880655
Bin 5 | size 500 | R2 0.875764
Bin 6 | size 500 | R2 0.889148
Bin 7 | size 400 | R2 0.797888
Bin 8 | size 400 | R2 0.773907
Bin 9 | size 400 | R2 0.79479
Bin 10 | size 400 | R2 0.862069
Bin 11 | size 300 | R2 0.864022
Bin 12 | size 300 | R2 0.93599
Bin 13 | size 300 | R2 0.972185
Bin 14 | size 300 | R2 0.988894
Bin 15 | size 300 | R2 0.979975
Bin 16 | size 300 | R2 0.766598
Bin 17 | size 300 | R2 -0.46525
Bin 18 | size 200 | R2 -11.158072
Bin 19 | size 300 | R2 -10.94865
Bin 20 | size 300 | R2 -28.00655
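The values printed above apply the same \(R^2\) definition within each bin, \(R^2_j = 1 - \sum_{i \in j} (\phi_i - \hat{\phi}_i)^2 / \sum_{i \in j} (\phi_i - \bar{\phi}_j)^2\), where \(\bar{\phi}_j\) is the in-bin mean when use_global_mean=False (replacing it with the global mean of \(\phi\) when the flag is set to True is our reading of the parameter name). Negative values, as in the last few bins, indicate that the regression performs worse in those bins than simply predicting the mean.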
We can plot the stratified \(R^2\) values across the bin centers:
plt = reconstruction.plot_stratified_metric(r2_in_bins,
bins_borders,
variable_name='$\phi$',
metric_name='$R^2$',
yscale='linear',
figure_size=(10,2),
save_filename=save_filename)
This last plot lets us see that kernel regression performed very well in the middle range of the dependent variable values but very poorly at both edges of that range. This is consistent with what we have seen in a 3D plot that visualized the regression result.
All other regression metrics can also be computed in the data bins, similarly to the example shown for the stratified \(R^2\) values.
We will create five bins:
(idx, bins_borders) = preprocess.variable_bins(phi, k=5, verbose=False)
stratified_regression_metrics = reconstruction.RegressionAssessment(phi, phi_predicted, idx=idx, variable_names=['$\phi$'], norm='std')
All computed stratified metrics can be printed with the use of the RegressionAssessment.print_stratified_metrics
function. A few output formats are available.
Raw text format:
stratified_regression_metrics.print_stratified_metrics(table_format=['raw'], float_format='.4f')
-------------------------
k1
N. samples: 4500
R2: 0.9920
MAE: 53.2295
MSE: 2890.8754
RMSE: 53.7669
NRMSE: 0.0892
-------------------------
k2
N. samples: 1800
R2: 0.9906
MAE: 53.8869
MSE: 3032.0995
RMSE: 55.0645
NRMSE: 0.0971
-------------------------
k3
N. samples: 1400
R2: 0.9912
MAE: 50.4640
MSE: 2865.7682
RMSE: 53.5329
NRMSE: 0.0936
-------------------------
k4
N. samples: 1200
R2: 0.9956
MAE: 28.4107
MSE: 1492.1498
RMSE: 38.6284
NRMSE: 0.0665
-------------------------
k5
N. samples: 1100
R2: 0.1271
MAE: 493.3956
MSE: 321235.7188
RMSE: 566.7766
NRMSE: 0.9343
tex
format:
stratified_regression_metrics.print_stratified_metrics(table_format=['tex'], float_format='.4f')
\begin{table}[h!]
\begin{center}
\begin{tabular}{llllll} \toprule
 & \textit{k1} & \textit{k2} & \textit{k3} & \textit{k4} & \textit{k5} \\ \midrule
N. samples & 4500.0000 & 1800.0000 & 1400.0000 & 1200.0000 & 1100.0000 \\
$R^2$ & 0.9920 & 0.9906 & 0.9912 & 0.9956 & 0.1271 \\
MAE & 53.2295 & 53.8869 & 50.4640 & 28.4107 & 493.3956 \\
MSE & 2890.8754 & 3032.0995 & 2865.7682 & 1492.1498 & 321235.7188 \\
RMSE & 53.7669 & 55.0645 & 53.5329 & 38.6284 & 566.7766 \\
NRMSE & 0.0892 & 0.0971 & 0.0936 & 0.0665 & 0.9343 \\
\end{tabular}
\caption{}\label{}
\end{center}
\end{table}
pandas.DataFrame
format (recommended for Jupyter notebooks):
stratified_regression_metrics.print_stratified_metrics(table_format=['pandas'], float_format='.4f')

Comparison of two regression solutions#
Two objects of the RegressionAssessment
class can be compared when printing the metrics. This results in a color-coded comparison where worse results are colored red and better results are colored green.
Below, we generate a new regression solution that will be compared with the one obtained above. We will increase the bandwidth to get different regression metrics:
phi_predicted_comparison = model.predict(xy, bandwidth=bandwidth+2)
Comparison can be done for the global metrics, where each variable will be compared separately:
regression_metrics_comparison = reconstruction.RegressionAssessment(phi, phi_predicted_comparison, variable_names=['$\phi$'], norm='std')
regression_metrics.print_metrics(table_format=['pandas'], float_format='.4f', comparison=regression_metrics_comparison)

and for the stratified metrics, where each bin will be compared separately:
stratified_regression_metrics_comparison = reconstruction.RegressionAssessment(phi, phi_predicted_comparison, idx=idx)
stratified_regression_metrics.print_stratified_metrics(table_format=['raw'], float_format='.2f', comparison=stratified_regression_metrics_comparison)
-------------------------
k1
N. samples: 4500
R2: 0.99 BETTER
MAE: 53.23 BETTER
MSE: 2890.88 BETTER
RMSE: 53.77 BETTER
NRMSE: 0.09 BETTER
-------------------------
k2
N. samples: 1800
R2: 0.99 BETTER
MAE: 53.89 BETTER
MSE: 3032.10 BETTER
RMSE: 55.06 BETTER
NRMSE: 0.10 BETTER
-------------------------
k3
N. samples: 1400
R2: 0.99 BETTER
MAE: 50.46 BETTER
MSE: 2865.77 BETTER
RMSE: 53.53 BETTER
NRMSE: 0.09 BETTER
-------------------------
k4
N. samples: 1200
R2: 1.00 BETTER
MAE: 28.41 BETTER
MSE: 1492.15 BETTER
RMSE: 38.63 BETTER
NRMSE: 0.07 BETTER
-------------------------
k5
N. samples: 1100
R2: 0.13 BETTER
MAE: 493.40 BETTER
MSE: 321235.72 BETTER
RMSE: 566.78 BETTER
NRMSE: 0.93 BETTER
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
Partition of Unity Networks (POUnets)#
In this tutorial, we demonstrate how POUnets may be initialized and trained to reconstruct quantities of interest (QoIs).
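In broad strokes, a POUnet represents a QoI as a blend of localized polynomial fits, \(\hat{q}(\mathbf{x}) = \sum_{k} \Phi_k(\mathbf{x}) \sum_{b} c_{k,b} P_b(\mathbf{x})\), where the partition functions \(\Phi_k\) sum to one everywhere, \(P_b\) are the polynomial basis terms (a constant and linear terms for a linear basis), the partition parameters (centers and shapes) are trained, and the basis coefficients \(c_{k,b}\) are typically obtained by a least-squares solve (hence the lstsq calls and the l2 regularization setting used later in this tutorial). The exact parameterization of \(\Phi_k\) is given in the POUnet references.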
from PCAfold import PartitionOfUnityNetwork, init_uniform_partitions
import numpy as np
import matplotlib.pyplot as plt
First, we create a two-dimensional manifold with vacant patches. This is shown in the first plot, colored by a dependent variable or QoI. We then ask to initialize partitions over a 5x2 grid. We find that only 8 of the 10 partitions are retained, as those initialized in the vacant spaces are discarded. We then visualize the locations of these partition centers, which exist in the normalized manifold space, along with the normalized data.
ivar1 = np.linspace(1,2,20)
ivar1 = ivar1[np.argwhere((ivar1<1.4)|(ivar1>1.6))[:,0]] # create hole
ivars = np.meshgrid(ivar1, ivar1) # make 2D
ivars = np.vstack([b.ravel() for b in ivars]).T # reshape (nobs x ndim)
dvar = 2.*ivars[:,0] + 0.1*ivars[:,1]**2
plt.scatter(ivars[:,0],ivars[:,1], s=3, c=dvar)
plt.colorbar()
plt.grid()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
init_data = init_uniform_partitions([5,2], ivars, verbose=True) # initialize partitions
ivars_cs = (ivars - init_data['ivar_center'])/init_data['ivar_scale'] # center/scale ivars
plt.plot(ivars_cs[:,0],ivars_cs[:,1], 'b.', label='normalized training data')
plt.plot(init_data['partition_centers'][:,0], init_data['partition_centers'][:,1], 'r*', label='partition centers')
plt.grid()
plt.xlabel('normalized x1')
plt.ylabel('normalized x2')
plt.legend()
plt.show()

kept 8 partitions out of 10

We can now initialize a POUnet with a linear basis, build the graph with absolute training errors, and train for 1000 iterations.
There are also options, as outlined in the documentation, to set transformation parameters for training on a transformed dvar.
net = PartitionOfUnityNetwork(**init_data,
basis_type='linear',
# transform_power=1.,
# transform_shift=0.,
# transform_sign_shift=0.
)
net.build_training_graph(ivars, dvar, error_type='abs')
net.train(1000, archive_rate=100, verbose=True)
------------------------------------------------------------
iteration | mean sqr | % max | sum sqr
------------------------------------------------------------
100 | 1.93e-06 | 0.22% | 4.93e-04
resetting best error
200 | 1.75e-06 | 0.21% | 4.49e-04
resetting best error
300 | 1.69e-06 | 0.20% | 4.33e-04
resetting best error
400 | 1.66e-06 | 0.20% | 4.25e-04
resetting best error
500 | 1.64e-06 | 0.20% | 4.21e-04
resetting best error
600 | 1.05e-06 | 0.20% | 2.68e-04
resetting best error
700 | 5.25e-07 | 0.21% | 1.34e-04
resetting best error
800 | 2.07e-07 | 0.22% | 5.29e-05
resetting best error
900 | 2.57e-10 | 0.01% | 6.58e-08
resetting best error
1000 | 1.06e-10 | 0.01% | 2.72e-08
resetting best error
The learning rate (default 1e-3) and least squares l2 regularization (default 1e-10) can also be updated at any time.
net.update_lr(1.e-4)
net.update_l2reg(1.e-12)
net.train(200, archive_rate=100, verbose=True)
updating lr: 0.0001
updating l2reg: 1e-12
------------------------------------------------------------
iteration | mean sqr | % max | sum sqr
------------------------------------------------------------
100 | 1.01e-10 | 0.01% | 2.58e-08
resetting best error
200 | 9.61e-11 | 0.01% | 2.46e-08
resetting best error
Here we visualize the error during training at every 100th iteration, which is the default archive rate.
err_dict = net.training_archive
for k in ['mse', 'sse', 'inf']:
    plt.loglog(net.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()

We can evaluate the POUnet and its derivatives.
pred = net(ivars)
plt.plot(dvar,dvar,'k-')
plt.plot(dvar,pred,'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('QoI')
plt.show()

der = net.derivatives(ivars) # predicted
der1 = 2.*np.ones_like(dvar) # observed
der2 = 0.2*ivars[:,1] # observed
plt.plot(der1,der1,'k-')
plt.plot(der1,der[:,0],'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('d/dx1')
plt.show()
plt.plot(der2,der2,'k-')
plt.plot(der2,der[:,1],'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('d/dx2')
plt.show()


We can then save and load the POUnet parameters to/from file. The training history needs to be saved separately if desired.
# Save the POUnet to a file
net.write_data_to_file('filename.pkl')
# Load a POUnet from file
net2 = PartitionOfUnityNetwork.load_from_file('filename.pkl')
# Evaluate the loaded POUnet (without needing to build the graph)
pred2 = net2(ivars)
It is also possible to continue training a POUnet after loading it from file…
net2.build_training_graph(ivars, dvar, error_type='abs')
net2.train(1000, archive_rate=100, verbose=False)
Notice how the error history for the loaded POUnet only includes the recent training.
err_dict = net2.training_archive
for k in ['mse', 'sse', 'inf']:
    plt.loglog(net2.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()

More training may be beneficial if new training data, perhaps with more resolution, become available…
ivars2 = np.meshgrid(np.linspace(1,2,20), np.linspace(1,2,20))
ivars2 = np.vstack([b.ravel() for b in ivars2]).T
dvar2 = 2.*ivars2[:,0] + 0.1*ivars2[:,1]**2
net2.build_training_graph(ivars2, dvar2, error_type='abs')
net2.train(1000, archive_rate=100, verbose=False)
If we have a different QoI that we want to use the same partitions for, we may also create a new POUnet from trained parameters and redo the least squares regression to update the basis coefficients appropriately…
dvar_new = ivars[:,0]*2 + 0.5*ivars[:,1]
net_new = PartitionOfUnityNetwork.load_from_file('filename.pkl')
net_new.build_training_graph(ivars, dvar_new)
net_new.lstsq()
pred_new = net_new(ivars)
plt.plot(dvar_new,dvar_new,'k-')
plt.plot(dvar_new,pred_new,'r.')
plt.grid()
plt.xlabel('observed')
plt.ylabel('predicted')
plt.title('QoI new')
plt.show()
performing least-squares solve

There is also flexibility in adding/removing partitions or changing the basis degree, but the parameters must be appropriately resized for such changes.
Below, we remove the 4th partition from the originally trained POUnet. The partition parameters are shaped as n_partition x n_dim, while the basis coefficients can easily be reshaped into n_basis x n_partition as shown below. Since we used a linear basis, the number of terms in each partition's basis function is 3: a constant, a term linear in x1, and a term linear in x2 (so, with the 8 retained partitions, the coefficients are stored as a single row of 24 values).
pou_data = PartitionOfUnityNetwork.load_data_from_file('filename.pkl')
i_partition_remove = 3 # index to remove the 4th partition
old_coeffs = pou_data['basis_coeffs'].reshape(3,pou_data['partition_centers'].shape[0]) # reshape basis coeffs into n_basis x n_partition
pou_data['partition_centers'] = np.delete(pou_data['partition_centers'], i_partition_remove, axis=0) # remove the 4th row
pou_data['partition_shapes'] = np.delete(pou_data['partition_shapes'], i_partition_remove, axis=0) # remove the 4th row
pou_data['basis_coeffs'] = np.expand_dims(np.delete(old_coeffs, i_partition_remove, axis=1).ravel(), axis=0) # remove the 4th column
We then simply initialize a new POUnet with the modified data and continue training.
net_modified = PartitionOfUnityNetwork(**pou_data)
net_modified.build_training_graph(ivars, dvar, error_type='abs')
net_modified.train(1000, archive_rate=100, verbose=False)
We could also change the basis type and modify the basis coefficient size accordingly. Below, we change the basis from linear to quadratic, which adds 3 additional terms: x1^2, x2^2, and x1x2. We initialize these coefficients to zero and perform the least squares to update them appropriately. Further training could be performed if desired.
pou_data = PartitionOfUnityNetwork.load_data_from_file('filename.pkl')
old_coeffs = pou_data['basis_coeffs'].reshape(3,pou_data['partition_centers'].shape[0]) # reshape basis coeffs into n_basis x n_partition
old_coeffs = np.vstack((old_coeffs, np.zeros((3,old_coeffs.shape[1])))) # add basis terms for x1^2, x2^2, and x1x2
pou_data['basis_coeffs'] = np.expand_dims(old_coeffs.ravel(), axis=0)
pou_data['basis_type'] = 'quadratic'
net_modified = PartitionOfUnityNetwork(**pou_data)
net_modified.build_training_graph(ivars, dvar, error_type='abs')
net_modified.lstsq()
performing least-squares solve
Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
QoI-aware encoder-decoder#
In this tutorial, we present the QoI-aware encoder-decoder dimensionality reduction strategy from the utilities
module.
The QoI-aware encoder-decoder is an autoencoder-like neural network that reconstructs important quantities of interest (QoIs) at the output of a decoder. The QoIs can be set to projection-independent variables (such as the original state variables) or projection-dependent variables, whose definition changes during neural network training.
We introduce an intrusive modification to the neural network training process such that at each epoch, a low-dimensional basis matrix is computed from the current weights in the encoder. Any projection-dependent variables at the output get re-projected onto that basis.
The rationale for performing dimensionality reduction with the QoI-aware strategy is that any poor topological behaviors on a low-dimensional projection will immediately increase the loss during training. These behaviors could be non-uniqueness in representing QoIs due to overlaps on a projection, or large gradients in QoIs caused by data compression in certain regions of a projection. Thus, the QoI-aware strategy naturally promotes improved projection topologies and can be useful in reduced-order modeling.
An illustrative explanation of how the QoI-aware encoder-decoder works is presented in the figure below:

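The same mechanism can also be sketched in code. The snippet below is illustrative only, not PCAfold's internal implementation, and the helper name build_decoder_targets is hypothetical; it shows how the decoder targets built from projection-dependent variables follow the current encoder weights:
import numpy as np

# Illustrative sketch only: whenever the encoder weight matrix A changes during training,
# projection-dependent QoIs are rebuilt on the new basis before the loss is evaluated.
def build_decoder_targets(X_scaled, S_X_scaled, A, projection_independent_qois):
    Z = X_scaled.dot(A)                                      # encoder output (low-dimensional projection)
    S_Z = S_X_scaled.dot(A)                                  # projection-dependent QoIs on the current basis
    targets = np.hstack((projection_independent_qois, S_Z))  # decoder targets for this epoch
    return Z, targets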
We import the necessary modules:
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import utilities
import numpy as np
and we set some initial parameters:
save_filename = None
Upload a combustion data set#
A data set representing combustion of hydrogen in air, generated from the steady laminar flamelet model using Spitfire, is used as a demo data set.
We begin by importing the data set composed of the original state space variables, \(\mathbf{X}\), and the corresponding source terms, \(\mathbf{S_X}\):
X = np.genfromtxt('H2-air-state-space.csv', delimiter=',')[:,0:-2]
S_X = np.genfromtxt('H2-air-state-space-sources.csv', delimiter=',')[:,0:-2]
X_names = np.genfromtxt('H2-air-state-space-names.csv', delimiter='\n', dtype=str)[0:-2]
(n_observations, n_variables) = np.shape(X)
Train the QoI-aware encoder-decoder#
We are going to generate 2D projections of the state-space:
n_components = 2
First, we are going to scale the state-space variables to a \(\langle 0, 1 \rangle\) range. This is done to help the neural network training process.
We are also going to apply an adequate scaling to the source terms. This is done for consistency in reduced-order modeling (see: Handling source terms). The scaled source terms will serve as projection-dependent variables.
(input_data, centers, scales) = preprocess.center_scale(X, scaling='0to1')
projection_dependent_outputs = S_X / scales
We create a PCA-initialization of the encoder:
pca = reduction.PCA(X, n_components=n_components, scaling='auto')
encoder_weights_init = pca.A[:,0:n_components]
We visualize the initial projection:
X_projected = np.dot(input_data, encoder_weights_init)
S_X_projected = np.dot(projection_dependent_outputs, encoder_weights_init)

We select several important state variables to be used as the projection-independent variables:
selected_state_variables = [0, 2, 4, 5, 6]
First, we fix the random seed for results reproducibility:
random_seed = 100
We set several important hyper-parameters:
activation_decoder = 'tanh'
decoder_interior_architecture = (6,9)
optimizer = 'Adam'
learning_rate = 0.001
loss = 'MSE'
batch_size = n_observations
validation_perc = 10
We are not going to hold initial weights constant, and we are going to allow the encoder to update weights at each epoch:
hold_initialization = None
hold_weights = None
We are going to train the model for 5000 epochs:
n_epochs = 5000
We instantiate an object of the QoIAwareProjection class with various parameters:
projection = utilities.QoIAwareProjection(input_data,
n_components=2,
projection_independent_outputs=input_data[:,selected_state_variables],
projection_dependent_outputs=projection_dependent_outputs,
activation_decoder=activation_decoder,
decoder_interior_architecture=decoder_interior_architecture,
encoder_weights_init=encoder_weights_init,
decoder_weights_init=None,
hold_initialization=hold_initialization,
hold_weights=hold_weights,
transformed_projection_dependent_outputs='signed-square-root',
loss=loss,
optimizer=optimizer,
batch_size=batch_size,
n_epochs=n_epochs,
learning_rate=learning_rate,
validation_perc=validation_perc,
random_seed=random_seed,
verbose=True)
Before we begin neural network training, we can print the summary of the current Keras model:
projection.summary()
QoI-aware encoder-decoder model summary...
(Model has not been trained yet)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Projection dimensionality:
- 2D projection
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Encoder-decoder architecture:
9-2-6-9-9
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Activation functions:
(9)--linear--(2)--tanh--(6)--tanh--(9)--tanh--(9)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Variables at the decoder output:
- 5 projection independent variables
- 2 projection dependent variables
- 2 transformed projection dependent variables using signed-square-root
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model validation:
- Using 10% of input data as validation data
- Model will be trained on 90% of input data
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hyperparameters:
- Batch size: 58101
- # of epochs: 5000
- Optimizer: Adam
- Learning rate: 0.001
- Loss function: MSE
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the encoder:
- User-provided custom initialization of the encoder
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights initialization in the decoder:
- Glorot uniform
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Weights updates in the encoder:
- Initial weights in the encoder will change after first epoch
- Weights in the encoder will change at every epoch
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Results reproducibility:
- Reproducible neural network training will be assured using random seed: 100
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
We train the current Keras model:
projection.train()
We can visualize the MSE loss computed on training and validation data during training:
projection.plot_losses(markevery=100,
figure_size=(15, 4),
save_filename=save_filename)

After training, additional information is available in the model summary:
projection.summary()
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Training results:
- Minimum training loss: 0.0018488304922357202
- Minimum training loss at epoch: 5000
- Minimum validation loss: 0.0019012088887393475
- Minimum validation loss at epoch: 5000
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
We extract the best lower-dimensional basis that corresponds to the epoch with the smallest training loss:
basis = projection.get_best_basis(method='min-training-loss')
We project the original dataset onto that basis:
X_projected = np.dot(input_data, basis)
S_X_projected = np.dot(projection_dependent_outputs, basis)
We visualize the current manifold topology:

Note
This tutorial was generated from a Jupyter notebook that can be accessed here.
QoI-aware encoder-decoders employing Partition of Unity Networks (POUnets)#
This demo takes the general formulation of QoI-aware encoder-decoders available in PCAfold and implements POUnets as the decoder.
from PCAfold import QoIAwareProjectionPOUnet, init_uniform_partitions, PCA, center_scale, PartitionOfUnityNetwork
import numpy as np
import matplotlib.pyplot as plt
import tensorflow.compat.v1 as tf
We will load the combustion dataset and remove temperature from the state variable list.
X = np.genfromtxt('H2-air-state-space.csv', delimiter=',')[:,1:-2]
S_X = np.genfromtxt('H2-air-state-space-sources.csv', delimiter=',')[:,1:-2]
X_names = np.genfromtxt('H2-air-state-space-names.csv', delimiter='\n', dtype=str)[1:-2]
X_names
array(['H', 'H2', 'O', 'OH', 'H2O', 'O2', 'HO2', 'H2O2'], dtype='<U4')
We then initialize the encoder weights using PCA. Notice how the 2D manifold is squeezed tightly and has overlapping states in some regions.
n_components = 2
pca = PCA(X, n_components=n_components, scaling='auto')
encoder_weights_init = pca.A[:,:n_components]
X_projected = X.dot(encoder_weights_init)
S_X_projected = S_X.dot(encoder_weights_init)
plt.scatter(X_projected[:,0], X_projected[:,1],s=3, c=S_X_projected[:,0], cmap='viridis')
plt.colorbar()
plt.grid()
plt.show()

Next, we finish initializing the encoder-decoder with the POUnet
parameters. The helper function init_uniform_partitions
is used as
done in the POUnet demo, but note the independent variable space for the
POUnet is the projected state variables X_projected
. We have chosen
a linear basis below.
When building the graph for the encoder-decoder, a function is required for computing the dependent training variables (QoIs). This allows for flexibility in using the projection parameters, which are themselves updated during training, in the dependent variable definitions. The projection training can therefore be informed by how well the projected source terms are represented, for example. The function must take in the encoder weights as an argument, but these do not have to be used.
We also perform a nonlinear transformation on the source terms, which can help penalize projections that introduce overlap in values. Below, we build a function that computes the projected source terms and concatenates these values with the OH and water mass fractions. These four variables provide the QoIs for which the loss function is computed during training. Note that the QoI function must be written using TensorFlow operations.
The graph is then built. Below we have turned on the optional
constrain_positivity
flag. As mass fractions are naturally positive,
this only penalizes projections that create negative projected source
terms. This can have advantages in simplifying regression and reducing
the impact of regression errors with the wrong sign during simulation.
Finally, we train the QoIAwareProjectionPOUnet
for 1000 iterations,
archiving every 100th iteration, and save the parameters with the lowest
overall errors. This is for demonstration, but more iterations are
generally needed to converge to an optimal solution.
ednet = QoIAwareProjectionPOUnet(encoder_weights_init,
**init_uniform_partitions([8,8], X_projected),
basis_type='linear'
)
# define the function to produce the dependent variables for training
def define_dvar(proj_weights):
    dvar_y = tf.Variable(X[:,3:5], name='non_transform', dtype=tf.float64) # mass fractions for OH and H2O
    dvar_s = tf.Variable(np.expand_dims(S_X, axis=2), name='non_transform', dtype=tf.float64)
    dvar_s = ednet.tf_projection(dvar_s, nobias=True) # projected source terms
    dvar_st = tf.math.sqrt(tf.cast(tf.abs(dvar_s+1.e-4), dtype=tf.float64)) * tf.math.sign(dvar_s+1.e-4)+1.e-2 * tf.math.sign(dvar_s+1.e-4) # power transform source terms
    dvar_st_norm = dvar_st/tf.reduce_max(tf.cast(tf.abs(dvar_st), dtype=tf.float64), axis=0, keepdims=True) # normalize
    dvar = tf.concat([dvar_y, dvar_st_norm], axis=1) # train on combination
    return dvar
ednet.build_training_graph(X, define_dvar, error_type='abs', constrain_positivity=True)
ednet.train(1000, archive_rate=100, verbose=True)
------------------------------------------------------------
iteration | mean sqr | % max | sum sqr
------------------------------------------------------------
100 | 7.85e-04 | 79.79% | 4.56e+01
resetting best error
200 | 6.72e-04 | 79.53% | 3.91e+01
resetting best error
300 | 6.04e-04 | 78.90% | 3.51e+01
resetting best error
400 | 5.29e-04 | 78.87% | 3.07e+01
resetting best error
500 | 4.49e-04 | 76.84% | 2.61e+01
resetting best error
600 | 3.53e-04 | 73.46% | 2.05e+01
resetting best error
700 | 2.20e-04 | 67.99% | 1.28e+01
resetting best error
800 | 1.18e-04 | 59.24% | 6.86e+00
resetting best error
900 | 9.86e-05 | 59.00% | 5.73e+00
resetting best error
1000 | 9.25e-05 | 59.07% | 5.37e+00
resetting best error
The learning rate (default 1e-3) and least squares l2 regularization (default 1e-10) can also be updated at any time.
ednet.update_lr(1.e-4)
ednet.update_l2reg(1.e-11)
ednet.train(200, archive_rate=100, verbose=True)
updating lr: 0.0001
updating l2reg: 1e-11
------------------------------------------------------------
iteration | mean sqr | % max | sum sqr
------------------------------------------------------------
100 | 9.03e-05 | 59.40% | 5.25e+00
resetting best error
200 | 8.96e-05 | 59.41% | 5.21e+00
resetting best error
We can look at the trained projection weights:
print(ednet.projection_weights)
[[-0.35640105 0.05729153]
[ 0.42022997 0.05765012]
[-0.48311619 -0.24169431]
[-0.24244533 -0.20839019]
[-0.11743472 0.78807212]
[-0.24317541 -0.00543714]
[-1.16916608 -0.3213446 ]
[-1.52308699 0.10786119]]
We can now look at the projection after this initial training. We see how the projected source term values are closer to being positive than before and the overlap has been removed. We would expect further training to create more separation between observations. Other choices of training QoIs may also lead to better separation more quickly.
X_projected = ednet.projection(X)
S_X_projected = ednet.projection(X, nobias=True)
plt.scatter(X_projected[:,0], X_projected[:,1],s=3, c=S_X_projected[:,0], cmap='viridis')
plt.colorbar()
plt.grid()
plt.show()

Below we grab the archived states during training and visualize the errors.
err_dict = ednet.training_archive
for k in ['mse', 'sse', 'inf']:
    plt.loglog(ednet.iterations,err_dict[k],'-', label=k)
plt.grid()
plt.xlabel('iterations')
plt.ylabel('error')
plt.legend()
plt.show()

We may also save and load a QoIAwareProjectionPOUnet
to/from file.
Rebuilding the graph is not necessary to grab the projection off a
loaded QoIAwareProjectionPOUnet
.
# Save the data to a file
ednet.write_data_to_file('filename.pkl')
# reload projection data from file
ednet2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')
#compute projection without needing to rebuild graph:
X_projected = ednet2.projection(X)
It can then be useful to create multiple POUnets for separate variables
using the same trained projection and partitions from the
QoIAwareProjectionPOUnet
. Below we demonstrate this for the water
mass fraction.
net = PartitionOfUnityNetwork(
partition_centers=ednet.partition_centers,
partition_shapes=ednet.partition_shapes,
basis_type=ednet.basis_type,
ivar_center=ednet.proj_ivar_center,
ivar_scale=ednet.proj_ivar_scale
)
i_dvar = 4
dvar1 = X[:,i_dvar]
net.build_training_graph(ednet.projection(X), dvar1)
net.lstsq()
pred = net(ednet.projection(X))
plt.plot(dvar1, dvar1, 'k-')
plt.plot(dvar1, pred.ravel(), 'r.')
plt.title(X_names[i_dvar])
plt.grid()
plt.show()
performing least-squares solve

There is also an option when building the QoIAwareProjectionPOUnet
graph for separating trainable from nontrainable projection weights. This
can be useful if certain dimensions of the projection are predefined,
such as mixture fraction commonly used in combustion. In order to set
certain columns of the projection weight matrix constant, specify the
first index for which the weights are trainable
(first_trainable_idx
).
Below is an example of holding the first column of the projection weights constant. We see how the weights in the second column change after training, but those in the first column do not.
ednet2 = QoIAwareProjectionPOUnet.load_from_file('filename.pkl')
ednet2.build_training_graph(X, define_dvar, first_trainable_idx=1)
old_weights = ednet2.projection_weights
ednet2.train(10, archive_rate=1)
print('difference in weights before and after training:\n', old_weights-ednet2.projection_weights)
difference in weights before and after training:
[[ 0. -0.001 ]
[ 0. -0.001 ]
[ 0. -0.001 ]
[ 0. -0.001 ]
[ 0. 0.001 ]
[ 0. 0.001 ]
[ 0. 0.001 ]
[ 0. 0.0009999]]