Data preprocessing#
The preprocess
module can be used for performing data preprocessing
including centering and scaling, outlier detection and removal, kernel density
weighting of data sets, data clustering and data sampling. It also includes
functionalities that allow the user to perform initial data inspection such
as computing conditional statistics, calculating statistically representative sample sizes,
or ordering variables in a data set according to a criterion.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
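For instance, a data set with \(N = 1000\) observations of \(Q = 5\) variables could be represented as follows (a minimal sketch using randomly generated data):

import numpy as np

# A dummy data set with N = 1000 observations (rows) and Q = 5 variables (columns):
X = np.random.rand(1000, 5)

# The dimensions can be recovered from the array shape:
(n_observations, n_variables) = np.shape(X)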
Data manipulation#
This section includes functions for performing basic data manipulation such as centering and scaling and outlier detection and removal.
center_scale
#
- PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#
Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).
Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):
\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]

Centers for each column are computed as:

\[c_j = mean(X_j)\]

with the only exceptions of '0to1' and '-1to1' scalings, which introduce a different quantity to center each column.

Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):

\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]

If both centering and scaling are applied:

\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]

Several scaling options are implemented here:
| Scaling method | scaling | Scaling factor \(d_j\) |
|---|---|---|
| None | 'none' or '' | 1 |
| Auto [PvdBHW+06] | 'auto' or 'std' | \(\sigma\) |
| Pareto [PNod08] | 'pareto' | \(\sqrt{\sigma}\) |
| VAST [PKEA+03] | 'vast' | \(\sigma^2 / mean(X_j)\) |
| Range [PvdBHW+06] | 'range' | \(max(X_j) - min(X_j)\) |
| 0 to 1 | '0to1' | \(d_j = max(X_j) - min(X_j)\), \(c_j = min(X_j)\) |
| -1 to 1 | '-1to1' | \(d_j = 0.5 \cdot (max(X_j) - min(X_j))\), \(c_j = 0.5 \cdot (max(X_j) + min(X_j))\) |
| Level [PvdBHW+06] | 'level' | \(mean(X_j)\) |
| Max | 'max' | \(max(X_j)\) |
| Variance | 'variance' | \(var(X_j)\) |
| Median | 'median' | \(median(X_j)\) |
| Poisson [PKK04] | 'poisson' | \(\sqrt{mean(X_j)}\) |
| S1 | 'vast_2' | \(\sigma^2 k^2 / mean(X_j)\) |
| S2 | 'vast_3' | \(\sigma^2 k^2 / max(X_j)\) |
| S3 | 'vast_4' | \(\sigma^2 k^2 / (max(X_j) - min(X_j))\) |
| L2-norm | 'l2-norm' | \(\|X_j\|_2\) |
where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).
The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PPS13].
Example:
from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
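As a sanity check on the definitions above, 'auto' scaling can be reproduced manually with numpy (a minimal sketch; the exact standard deviation convention used internally is an assumption here):

from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale with auto (std) scaling:
(X_cs, X_center, X_scale) = center_scale(X, 'auto')

# Reproduce the result from the definitions above, X_cs = (X - C) * D^(-1):
X_cs_manual = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# The two results should agree closely (up to the ddof convention used for the standard deviation):
print(np.allclose(X_cs, X_cs_manual))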
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
,'l2-norm'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
- Returns
X_cs -
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size(n_observations,n_variables)
.X_center -
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.X_scale -
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.
invert_center_scale
#
- PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#
Inverts whatever centering and scaling was done by the center_scale function:

\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]

Example:
from PCAfold import center_scale, invert_center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)

# Uncenter and unscale:
X = invert_center_scale(X_cs, X_center, X_scale)
- Parameters
X_cs –
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center –
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale –
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.
- Returns
X -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It has size(n_observations,n_variables)
.
power_transform
#
- PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#
Performs a power transformation of the provided data. The equation for the transformation of a variable \(X\) is:

\[(|X + s_1|)^\alpha \, \text{sign}(X + s_1) + s_2 \, \text{sign}(X + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.

Example:
from PCAfold import power_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform power transformation:
X_pow = power_transform(X, 0.5)

# Undo the transformation:
X_orig = power_transform(X_pow, 0.5, invert=True)
- Parameters
X – array of the variable(s) to be transformed
transform_power – the power parameter used in the transformation equation
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation
invert – (optional, default False) when True, will undo the transformation
- Returns
array of the transformed variables
log_transform
#
- PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#
Performs a log transformation of the original data set, \(\mathbf{X}\).

The symlog transformation can be obtained with method='symlog', and the continuous symlog transformation can be obtained with method='continuous-symlog'.

Example:
from PCAfold import log_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform log transformation:
X_log = log_transform(X)

# Perform symlog transformation:
X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
specifying the log-transformation method. It can be one of the following:log
,ln
,symlog
,continuous-symlog
.threshold – (optional)
float
orint
specifying the threshold for symlog transformation.
- Returns
X_transformed -
numpy.ndarray
specifying the log-transformed data set. It has size(n_observations,n_variables)
.
remove_constant_vars
#
- PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#
Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:

The maximum absolute value of the column \(X_j\) is less than maxtol:

\[max(|X_j|) < \verb|maxtol|\]

The ratio of the range of values in the column \(X_j\) to \(max(|X_j|)\) is less than rangetol:

\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]

In particular, this function can be used as a preprocessing step for PCA, since constant columns would otherwise break the eigenvalue calculation.
Example:
from PCAfold import remove_constant_vars
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Remove the constant column:
(X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.maxtol – (optional)
float
specifying the tolerance for \(max(|X_j|)\).rangetol – (optional)
float
specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).
- Returns
X_removed -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\) with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed -
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained -
list
specifying the indices of columns retained in \(\mathbf{X}\).
order_variables
#
- PCAfold.preprocess.order_variables(X, method='mean', descending=True)#
Orders variables in the original data set, \(\mathbf{X}\), using a selected method.
Example:
from PCAfold import order_variables
import numpy as np

# Generate a dummy data set:
X = np.array([[100, 1, 10],
              [200, 2, 20],
              [300, 3, 30]])

# Order variables by the mean value in the descending order:
(X_ordered, idx) = order_variables(X, method='mean', descending=True)
The code above should return an ordered data set:
array([[100,  10,   1],
       [200,  20,   2],
       [300,  30,   3]])
and the list of ordered variable indices:
[0, 2, 1]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
orlist
ofint
specifying the ordering method. Ifstr
, it can be one of the following:'mean'
,'min'
,'max'
,'std'
or'var'
. Iflist
, it is a custom user-provided list of indices for how the variables should be ordered.descending – (optional)
bool
specifying whether variables should be ordered in the descending order. If set toFalse
, variables will be ordered in the ascending order.
- Returns
X_ordered -
numpy.ndarray
specifying the original data set with ordered variables. It has size(n_observations,n_variables)
.idx -
list
specifying the indices of the ordered variables. It has lengthn_variables
.
Class PreProcessing
#
- class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#
Performs a composition of the data manipulation done by the remove_constant_vars and center_scale functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:

checks for constant columns in a data set and removes them,

centers and scales the data.
Example:
from PCAfold import PreProcessing
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Instantiate PreProcessing class object:
preprocessed = PreProcessing(X, 'range', nocenter=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
Attributes:
X_removed - (read only)
numpy.ndarray
specifying the original data set with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed - (read only)
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained - (read only)
list
specifying the indices of columns retained in \(\mathbf{X}\).X_cs - (read only)
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center - (read only)
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale - (read only)
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.
outlier_detection
#
- PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#
Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:
'MULTIVARIATE TRIMMING'

Outliers are detected based on the multivariate Mahalanobis distance, \(D_M\):

\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]

where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\), storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per the PCA class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since the Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered outliers. For more detailed information on the Mahalanobis distance the user is referred to [PBis06] or [PDMJRM00].

The threshold above which observations will be classified as outliers can be specified using the trimming_threshold parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:

\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]

'PC CLASSIFIER'

Outliers are detected based on major and minor principal components (PCs). The principal component classifier (PCC) method was first proposed in [PSCSC03]. The application of this technique to combustion data sets was studied in [PPS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier, based on the \(q\) first (major) PCs:

\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]

or if the second PC classifier, based on the \((Q-k+1)\) last (minor) PCs:

\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]

where \(z_{ij}\) is the \((i, j)^{th}\) element of the principal components matrix \(\mathbf{Z}\) and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per the PCA class). Major PCs are selected such that the total variance explained is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.

Coefficients \(c_1\) and \(c_2\) are found such that they represent the quantile_threshold (by default 98.99%) quantile of the empirical distributions of the first and second PC classifier, respectively.

Example:
from PCAfold import outlier_detection
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Find outliers:
(idx_outliers_removed, idx_outliers) = outlier_detection(X,
                                                         scaling='auto',
                                                         method='MULTIVARIATE TRIMMING',
                                                         trimming_threshold=0.8,
                                                         verbose=True)

# New data set without outliers can be obtained as:
X_outliers_removed = X[idx_outliers_removed,:]

# Observations that were classified as outliers can be obtained as:
X_outliers = X[idx_outliers,:]
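The PC classifier method can be requested analogously; a minimal sketch, reusing the dummy data set and import from the example above together with the default quantile threshold:

# Find outliers with the principal component classifier instead:
(idx_outliers_removed_pcc, idx_outliers_pcc) = outlier_detection(X,
                                                                 scaling='auto',
                                                                 method='PC CLASSIFIER',
                                                                 quantile_threshold=0.9899,
                                                                 verbose=False)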
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.method – (optional)
str
specifying the outlier detection method to use. It should be'MULTIVARIATE TRIMMING'
or'PC CLASSIFIER'
.trimming_threshold – (optional)
float
specifying the trimming threshold to use in combination with'MULTIVARIATE TRIMMING'
method.quantile_threshold – (optional)
float
specifying the quantile threshold to use in combination with'PC CLASSIFIER'
method.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_outliers_removed -
list
specifying the indices of observations without outliers.idx_outliers -
list
specifying the indices of observations that were classified as outliers.
representative_sample_size
#
- PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#
Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.
Two general approaches are available:
If method='kl-divergence', the representative sample size is computed based on the Kullback-Leibler divergence.

If method='mean', method='median', method='variance', or method='std', the representative sample size is computed based on convergence of a first-order (mean or median) or second-order (variance, standard deviation) statistic.
Example:
from PCAfold import center_scale, representative_sample_size
import numpy as np

# Generate dummy data set and two dependent variables:
x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100))
xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None]))
phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2)))
phi_1 = phi_1.ravel()[:,None]
phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2)))
phi_2 = phi_2.ravel()[:,None]
depvars = np.column_stack((phi_1, phi_2))
depvars, _, _ = center_scale(depvars, scaling='0to1')

# Specify the list of percentages to explore:
percentages = list(np.linspace(1,99.9,200))

# Specify the list of thresholds for each dependent variable:
thresholds = [10**-4, 10**-4]

# Specify the names of the dependent variables:
variable_names = ['Phi-1', 'Phi-2']

# Compute representative sample size for each dependent variable:
(idx, sample_sizes, statistics) = representative_sample_size(depvars,
                                                             percentages,
                                                             thresholds=thresholds,
                                                             variable_names=variable_names,
                                                             method='kl-divergence',
                                                             statistics='median',
                                                             n_resamples=20,
                                                             random_seed=100,
                                                             verbose=True)
With verbose=True we will see some detailed information:

Dependent variable Phi-1 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data).

Dependent variable Phi-2 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
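A variant based on convergence of a statistic rather than Kullback-Leibler divergence can be requested in a similar way. A minimal sketch, reusing the variables from the example above (the threshold values here are illustrative assumptions, not recommended defaults):

# Compute representative sample sizes based on convergence of the mean:
(idx_mean, sample_sizes_mean, statistics_mean) = representative_sample_size(depvars,
                                                                            percentages,
                                                                            thresholds=[0.01, 0.01],
                                                                            variable_names=variable_names,
                                                                            method='mean',
                                                                            statistics='max',
                                                                            n_resamples=20,
                                                                            random_seed=100,
                                                                            verbose=False)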
- Parameters
depvars –
numpy.ndarray
specifying the dependent variables that should be well represented in a sampled dataset. It should be of size(n_observations,n_dependent_variables)
.percentages –
list
of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.thresholds – (optional)
list
offloat
specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.variable_names – (optional)
list
ofstr
specifying names for all dependent variables. If set toNone
, dependent variables are called with consecutive integers.method – (optional)
str
specifying the method used to compute the sample size statistics. It can bemean
,median
,variance
,std
, or'kl-divergence'
.statistics – (optional)
str
specifying the overall statistics that should be computed from a given method. It can bemin
,max
,mean
, ormedian
.n_resamples – (optional)
int
specifying the number of resamples to perform for each percentage in thepercentages
vector. It is recommended to set this parameter to a value above 1, since it might accidentally happen that a random sample is statistically representative of the full dataset. Re-sampling helps to average out the effect of such one-off “lucky” random samples.random_seed – (optional)
int
specifying the random seed.verbose – (optional)
bool
for printing verbose details.
- Returns
threshold_idx -
list
ofint
specifying the highest indices from thepercentages
list where the representative number of samples condition was still met. It has lengthn_depvars
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the list for that dependent variable.representative_sample_sizes -
numpy.ndarray
ofint
specifying the representative number of samples. It has size(1,n_depvars)
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the array for that dependent variable.sample_size_statistics -
numpy.ndarray
specifying the full vector of computed statistics corresponding to each entry in
and each dependent variable. It has size(n_percentages,n_depvars)
.
Class ConditionalStatistics
#
- class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#
Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:
conditional mean
conditional minimum
conditional maximum
conditional standard deviation
Other quantities can be added in the future at the user’s request.
Example:
from PCAfold import ConditionalStatistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Instantiate an object of the ConditionalStatistics class
# and compute conditional statistics in 10 bins of the conditioning variable:
cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10)

# Access conditional statistics:
conditional_mean = cond.conditional_mean
conditional_min = cond.conditional_minimum
conditional_max = cond.conditional_maximum
conditional_std = cond.conditional_standard_deviation

# Access the centroids of the created bins:
centroids = cond.centroids
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. It should be of size(n_observations,1)
or(n_observations,)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.verbose – (optional)
bool
for printing verbose details.
Attributes:
idx - (read only)
numpy.ndarray
of cluster (bins) classifications. It has size(n_observations,)
.borders - (read only)
list
of values that define borders for the clusters (bins). It has lengthk+1
.centroids - (read only)
list
of values that specify bins centers. It has lengthk
.conditional_mean - (read only)
numpy.ndarray
specifying the conditional means of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_minimum - (read only)
numpy.ndarray
specifying the conditional minimums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_maximum - (read only)
numpy.ndarray
specifying the conditional maximums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_standard_deviation - (read only)
numpy.ndarray
specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size(k,n_variables)
.
Class KernelDensity
#
- class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#
Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on a single-variable or multi-variable case, as proposed in [PCGP12].
The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).
Note
The kernel density weighting technique is usually computationally expensive, even on data sets with a relatively small number of observations. Since the single-variable case is cheaper than the multi-variable case, it is recommended to try the single-variable case first on larger data sets.
A Gaussian kernel is used in both approaches:

\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} \exp \Big(- \frac{d^2}{2 h^2} \Big)\]

\(h\) is the kernel bandwidth:

\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]

where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.

\(d\) is the distance between two observations \(c\) and \(c'\):

\[d = |x_c - x_{c'}|\]

Single-variable

If the conditioning_variable argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:

\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]

and weights are then computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Multi-variable

If the conditioning_variable argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for the \(k^{th}\) variable:

\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]

The global density, taking into account all variables, is then obtained as:

\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]

where \(Q\) is the total number of conditioning variables, and weights are computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Example:
from PCAfold import KernelDensity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Perform kernel density weighting based on the first variable:
kerneld = KernelDensity(X, X[:,0])

# Access the weighted data set:
X_weighted = kerneld.X_weighted

# Access the weights used to scale the data set:
weights = kerneld.weights
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying either a single variable or multiple variables to be used as a conditioning variable for kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).
Attributes:
weights -
numpy.ndarray
specifying the computed weights, \(\mathbf{W_c}\). It has size(n_observations,1)
.X_weighted -
numpy.ndarray
specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size(n_observations,n_variables)
.
Class DensityEstimation
#
- class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#
Enables density estimation on point-cloud data.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_neighbors –
int
specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.
DensityEstimation.average_knn_distance
#
- PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#
Computes the average Euclidean distance to the \(k\) nearest neighbors on a manifold defined by the independent variables.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute average distances on a manifold defined by the PCs:
average_distances = density_estimation.average_knn_distance(verbose=True)
With verbose=True, the minimum, maximum, average and median distance will be printed:

Minimum distance: 0.1388300829487847
Maximum distance: 0.4689587542132183
Average distance: 0.20824964953425693
Median distance: 0.18333873029179215
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Parameters
verbose – (optional)
bool
for printing verbose details.- Returns
average_distances -
numpy.ndarray
specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_codensity
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#
Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data codensity defined as:
\[\delta_k(x) = d(x, v_k(x))\]where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_codensity = density_estimation.kth_nearest_neighbor_codensity()
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Returns
data_codensity -
numpy.ndarray
specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_density
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#
Computes an inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data density defined as:
\[\rho_k(x) = \frac{1}{\delta_k(x)}\]where \(\delta_k(x)\) is the codensity.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_density = density_estimation.kth_nearest_neighbor_density()
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Returns
data_density -
numpy.ndarray
specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
Data clustering#
This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [PELL09], [PKR09].
Clustering functions#
Each function that clusters the data set returns a vector of integers idx of type numpy.ndarray and of size (n_observations,) that specifies the classification of each observation from the original data set, \(\mathbf{X}\), to a local cluster.
Note
The first cluster has index 0
within all idx
vectors returned.
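Since all clustering functions share this idx format, cluster membership can be used directly for indexing. A minimal sketch (using variable_bins, which is documented below):

from PCAfold import variable_bins
import numpy as np

# Generate a dummy variable and cluster it into 4 bins:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=False)

# The first cluster has index 0, so its observations can be extracted as:
observations_in_first_cluster = x[idx==0]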
variable_bins
#
- PCAfold.preprocess.variable_bins(var, k, verbose=False)#
Clusters the data by dividing a variable vector var into bins of equal lengths.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to bins of x:
(idx, borders) = variable_bins(x, 4, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
predefined_variable_bins
#
- PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#
Clusters the data by dividing a variable vector var into bins such that splits are performed at user-specified values. Split values can be specified in the split_values list. In general:

split_values = [value_1, value_2, ..., value_n]

Note: When a split is performed at a given value_i, the observation in var that takes exactly that value is assigned to the newly created bin.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import predefined_variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to pre-defined bins of x:
(idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.split_values –
list
specifying values at which splits should be performed.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
mixture_fraction_bins
#
- PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#
Clusters the data by dividing a mixture fraction vector Z into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PPSTS09]. The vector is first split into the lean and rich sides (according to the stoichiometric mixture fraction Z_stoich) and then each side is divided further into clusters. When k is odd, there will always be one more cluster on the side with the larger range in mixture fraction space compared to the other side.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import mixture_fraction_bins
import numpy as np

# Generate dummy mixture fraction variable:
Z = np.linspace(0,1,100)

# Create partitioning according to bins of mixture fraction:
(idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
- Parameters
Z –
numpy.ndarray
specifying the mixture fraction values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.Z_stoich –
float
specifying the stoichiometric mixture fraction. It has to be between 0 and 1.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
zero_neighborhood_bins
#
- PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#
Clusters the data by separating close-to-zero observations in a vector into one cluster (split_at_zero=False) or two clusters (split_at_zero=True). The offset from zero at which splits are performed is computed based on the input parameter zero_offset_percentage:

\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]

Further clusters are found by alternately splitting positive and negative values in the vector into bins of equal lengths.

This clustering technique can be useful for partitioning any variable that has many observations clustered around zero and relatively few observations far away from zero on either side.

Two examples of how a vector can be partitioned with this function are presented below.

With split_at_zero=False:

If split_at_zero=False, the smallest allowed number of clusters is 3. This ensures that there are at least three clusters: one with negative values, one with close-to-zero values, and one with positive values. When k is even, there will always be one more cluster on the side with the larger range compared to the other side.

With split_at_zero=True:

If split_at_zero=True, the smallest allowed number of clusters is 4. This ensures that there are at least four clusters: one with negative values, one with negative values close to zero, one with positive values close to zero, and one with positive values. When k is odd, there will always be one more cluster on the side with the larger range compared to the other side.

Note

This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\) (as per [TSP09]), since it relies on unbalanced vectors that have many observations numerically close to zero. Using split_at_zero=True, it can further differentiate between negative and positive sources.

Example:
from PCAfold import zero_neighborhood_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-100,100,1000)

# Create partitioning according to bins of x:
(idx, borders) = zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 ifsplit_at_zero=False
or smaller than 4 ifsplit_at_zero=True
.zero_offset_percentage – (optional) percentage of \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero value. For instance, set
zero_offset_percentage=10
if you want 10% as offset.split_at_zero – (optional)
bool
specifying whether partitioning should be done atvar=0
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
Auxiliary functions#
degrade_clusters
#
- PCAfold.preprocess.degrade_clusters(idx, verbose=False)#
Renumbers cluster labels if either of these two cases is true:

idx is composed of non-consecutive integers, or

the smallest cluster index in idx is not equal to 0.
Example:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 2, 0, 5, 10])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
The code above will produce:
>>> idx_degraded
array([0, 0, 1, 0, 2, 3])
Alternatively:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([1, 1, 2, 2, 3, 3])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
will produce:
>>> idx_degraded
array([0, 0, 1, 1, 2, 2])
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_degraded -
numpy.ndarray
of degraded cluster classifications. It has size(n_observations,)
.k_update -
int
specifying the updated number of clusters.
flip_clusters
#
- PCAfold.preprocess.flip_clusters(idx, dictionary)#
Flips cluster labelling according to instructions provided in a dictionary. For dictionary = {key : value}, a cluster with number key will get number value.

Example:
from PCAfold import flip_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0,0,0,1,1,1,1,2,2])

# Swap cluster number 1 with cluster number 2:
flipped_idx = flip_clusters(idx, {1:2, 2:1})
The code above will produce:
>>> flipped_idx
array([0, 0, 0, 2, 2, 2, 2, 1, 1])
Note
This function can also be used to merge clusters. Using the idx from the example above, if we call:

flipped_idx = flip_clusters(idx, {2:1})
the result will be:
>>> flipped_idx
array([0,0,0,1,1,1,1,1,1])
where clusters 1 and 2 have been merged into one cluster numbered 1.

- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.dictionary –
dict
specifying instructions for cluster label flipping.
- Returns
flipped_idx -
numpy.ndarray
specifying the re-labelled cluster classifications. It has size(n_observations,)
.
get_centroids
#
- PCAfold.preprocess.get_centroids(X, idx)#
Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the idx vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster is computed as:

\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]

Centroids for all variables from all clusters are stored in the returned matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\):

\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]

Example:
from PCAfold import get_centroids
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute the centroids of each cluster:
centroids = get_centroids(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
centroids -
numpy.ndarray
specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size(k,n_variables)
.
get_partition
#
- PCAfold.preprocess.get_partition(X, idx)#
Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to the idx provided.

Example:
from PCAfold import get_partition
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Generate partitioning of the data set according to idx:
(X_in_clusters, idx_in_clusters) = get_partition(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
X_in_clusters -
list
of \(k\)numpy.ndarray
that contains original data set observations partitioned to \(k\) clusters. It has lengthk
.idx_in_clusters -
list
of \(k\)numpy.ndarray
that contains indices of the original data set observations partitioned to \(k\) clusters. It has lengthk
.
get_populations
#
- PCAfold.preprocess.get_populations(idx)#
Computes populations (number of observations) in the clusters specified in the idx vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster, this function will return the list [100, 500].

Example:
from PCAfold import variable_bins, get_populations
import numpy as np

# Generate dummy partitioning:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=True)

# Compute cluster populations:
populations = get_populations(idx)
The code above will produce:
>>> populations
[25, 25, 25, 25]
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.- Returns
populations -
list
of cluster populations. Each entry refers to one cluster ordered according to idx
. It has lengthk
.
get_average_centroid_distance
#
- PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#
Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.
The average can be computed as an arithmetic average from all clusters (weighted=False) or as a weighted average (weighted=True). In the latter, the distances are weighted by the number of observations in a cluster, so that the average centroid distance will approach the average distance in the largest cluster.

Example:
from PCAfold import get_average_centroid_distance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute average distance from cluster centroids:
average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.weighted – (optional)
bool
specifying whether distances from centroid should be weighted by the number of observations in a cluster. If set toFalse
, arithmetic average will be computed.
- Returns
average_centroid_distance -
float
specifying the average distance from centroids, averaged over all observations and all clusters.
Data sampling#
This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting that can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [PNey92], such as drawing samples at certain amount from local clusters [PMMD10], [PGSB04]. These functionalities can be specifically used to tackle imbalanced data sets [PHG09], [PRLM+16].
The general idea is to divide the entire data set X (or a portion of it) into train and test samples.

Train data is always sampled in the same way for a given sampling function. Depending on the option selected, test data will be sampled differently: either as all remaining samples that were not included in the train data, or as a subset of those. You can select the option by setting the test_selection_option parameter for each sampling function. Refer to the documentation of a specific sampling function to see which options are available.
All splitting functions in this module return a tuple of two variables: (idx_train, idx_test). Both idx_train and idx_test are vectors of integers of type numpy.ndarray and of size (_,). These variables contain the indices of observations that went into the train data and the test data, respectively.
In your model learning algorithm you can then get the train and test observations, for instance in the following way:
X_train = X[idx_train,:]
X_test = X[idx_test,:]
All functions are equipped with a verbose parameter. If it is set to True, some additional information on train and test selection is printed.
Note
It is assumed that the first cluster has index 0
within all input idx
vectors.
Class DataSampler
#
- class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#
Enables selecting train and test data samples.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_test – (optional)
numpy.ndarray
specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices inidx_test
and the test data will be returned the same as the user-providedidx_test
. If not specified, test samples will be selected according to thetest_selection_option
parameter (see documentation for each sampling function). Setting fixedidx_test
parameter may be useful if training a machine learning model on specific test samples is desired. It should be of size(n_test_samples,)
or(n_test_samples,1)
.random_seed – (optional)
int
specifying random seed for random sample selection.verbose – (optional)
bool
for printing verbose details.
DataSampler.number
#
- PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.number(20, test_selection_option=1)
Train data:

The number of train samples is estimated based on the percentage perc provided. First, the total number of samples for training is estimated as a percentage perc of the total number of observations n_observations in a data set. Next, this number is divided equally into \(k\) clusters. The result, n_of_samples, is the number of samples that will be selected from each cluster:

\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the smallest cluster is found and the remaining number of observations \(m\) in that cluster are taken as test data. Next, the same number of samples \(m\) is taken from all remaining larger clusters.

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).

- Parameters
perc – percentage of data to be selected as training data from the entire data set. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.percentage
#
- PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a certain percentage perc from every cluster as the training data.

Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.percentage(20, test_selection_option=1)
Note: If the cluster sizes are comparable, this function will give a similar train sample distribution as random sampling (DataSampler.random). This sampling can be useful in cases where one cluster is significantly smaller than others and there is a chance that this cluster will not be covered in the train data if random sampling were used.

Train data:

The number of train samples is estimated based on the percentage perc provided. First, the size of the \(i^{th}\) cluster, cluster_size_i, is estimated and then a percentage perc of that number is selected.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(p\) is the percentage perc provided.

- Parameters
perc – percentage of data to be selected as training data from each cluster. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.manual
#
- PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#
Uses classifications into \(k\) clusters and a dictionary sampling_dictionary in which you manually specify what 'percentage' (or what 'number') of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as per idx and the dictionary values are either the percentage or the number of train samples to be selected. By default the dictionary values are interpreted as percentages, but you can select sampling_type='number' in order to interpret the values as a number of samples.

Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)
Train data:

The number of train samples selected from each cluster is estimated based on the sampling_dictionary. For key : value, the percentage value (or number value) of samples will be selected from cluster key.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as a number or as a percentage. This can be achieved by setting:

sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
- Parameters
sampling_dictionary –
dict
specifying manual sampling. Keys are cluster classifications and values are eitherpercentage
ornumber
of samples to be taken from that cluster. Keys should match the cluster classifications as peridx
.sampling_type – (optional)
str
specifying whether percentage or number is given in thesampling_dictionary
. Available options:percentage
ornumber
. The default ispercentage
.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.random
#
- PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#
Samples train data at random from the entire data set.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Due to the nature of this sampling technique, it is not necessary to have idx classifications, since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummy idx vector that has the same number of observations, n_observations, as your data set. For instance:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
n_observations = 100
idx = np.zeros(n_observations)

# Instantiate DataSampler class object:
selection = DataSampler(idx)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
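Both test-selection options described under Test data below are available here as well. A minimal sketch (parameter values are illustrative) that samples 20% of the data as train data and selects the test data with the same random procedure, which is allowed since the requested percentage is below 50%:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Select 20% of the data at random as train data and sample
# the test data with the same random procedure:
(idx_train, idx_test) = selection.random(20, test_selection_option=2)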
Train data:
The total number of train samples is computed as a percentage perc of the total number of observations in the data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.

Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure is used to select test data as was used to select train data (only allowed if perc is less than 50%).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(p\) is the percentage perc provided.

- Parameters
perc – percentage of the data to be selected as train data. Set perc=20 if you want 20%.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
Plotting functions#
This section includes functions for plotting related to data preprocessing, such as visualizing the formed clusters, visualizing the selected train and test samples, or plotting conditional statistics.
plot_2d_clustering
#
- PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into clusters. The number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_2d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(10,6), title='x-y data set', save_filename='clustering.pdf')
plt.close()
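As a further illustration, a sketch of the optional styling parameters documented below (a custom colormap, per-cluster opacity, and marker settings); the specific colors and values are arbitrary choices:

from PCAfold import variable_bins, plot_2d_clustering
from matplotlib.colors import ListedColormap
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set into four clusters:
(idx, _) = variable_bins(x, 4, verbose=False)

# Custom colormap and one opacity value per cluster (arbitrary choices):
custom_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])

# Plot the clustering result with the custom styling:
plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map=custom_map, alphas=[1, 0.5, 0.5, 1], s=20, markerscale=2, grid_on=True, figure_size=(10,6))
plt.close()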
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points is plotted.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether a grid should be plotted.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_3d_clustering
#
- PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold divided into clusters. The number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_3d_clustering(x, y, z, idx, x_label='$x$', y_label='$y$', z_label='$z$', color_map='viridis', first_cluster_index_zero=False, figure_size=(10,6), title='x-y-z data set', save_filename='clustering.pdf')
plt.close()
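The elev and azim parameters documented below control the viewing angle of the three-dimensional plot. A minimal sketch (the angles are arbitrary choices) that views the same clustering from a different perspective:

from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result viewed from a different elevation and azimuth:
plt = plot_3d_clustering(x, y, z, idx, elev=30, azim=120, x_label='$x$', y_label='$y$', z_label='$z$', figure_size=(8,8))
plt.close()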
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
z – numpy.ndarray specifying the variable on the \(z\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
elev – (optional) elevation angle.
azim – (optional) azimuth angle.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
z_label – (optional) str specifying \(z\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_2d_train_test_samples
#
- PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into train and test samples. The number of observations in the train and test data, respectively, will be plotted in the legend.
Example:
from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, borders) = variable_bins(x, 4, verbose=False)

# Generate dummy sampling of the data set:
sample = DataSampler(idx, random_seed=None, verbose=True)
(idx_train, idx_test) = sample.number(40, test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(12,6), title='x-y data set', save_filename='sampling.pdf')
plt.close()
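A sketch combining the manual sampling documented earlier in this section with the train/test visualization (the per-cluster percentages are arbitrary choices for illustration):

from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set into four clusters:
(idx, _) = variable_bins(x, 4, verbose=False)

# Sample a different percentage of train data from each cluster:
sample = DataSampler(idx, verbose=False)
(idx_train, idx_test) = sample.manual({0:10, 1:40, 2:40, 3:10}, sampling_type='percentage', test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', figure_size=(12,6))
plt.close()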
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
idx_train – numpy.ndarray specifying the indices of the train data. It should be of size (n_train,) or (n_train,1).
idx_test – numpy.ndarray specifying the indices of the test data. It should be of size (n_test,) or (n_test,1).
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether a grid should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_conditional_statistics
#
- PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given by variable and conditioning_variable and the selected conditional statistics (as per preprocess.ConditionalStatistics).

Example:
from PCAfold import PCA, plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional statistics:
plt = plot_conditional_statistics(y, conditioning_variable, k=10, x_label='$x$', y_label='$y$', figure_size=(10,3), title='Conditional mean', save_filename='conditional-mean.pdf')
plt.close()
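A sketch of conditioning with user-defined bin edges instead of \(k\) equal variable bins, and plotting more than one conditional statistic (the split locations are arbitrary choices):

from PCAfold import plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional mean and standard deviation in bins
# defined by explicit split values:
plt = plot_conditional_statistics(y, conditioning_variable, split_values=[-0.5, 0.0, 0.5], statistics_to_plot=['mean', 'std'], x_label='$x$', y_label='$y$', figure_size=(10,3))
plt.close()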
- Parameters
variable – numpy.ndarray specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
k – (optional) int specifying the number of bins to create in the conditioning variable. It has to be a positive number.
split_values – (optional) list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.
statistics_to_plot – (optional) list of str specifying the conditional statistics to plot. The strings can be 'mean', 'min', 'max' or 'std'.
color – (optional) vector or string specifying the color for the manifold. If it is a vector, it has to have length consistent with the number of observations in the x and y vectors. It should be of type numpy.ndarray and of size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, data will be plotted in black.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
colorbar_label – (optional) str specifying the colorbar label annotation. If set to None, the colorbar label will not be plotted.
color_map – (optional) colormap to use as per matplotlib.cm. Default is 'viridis'.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
Bibliography#
- PBis06
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
- PCVJ21
Gunnar Carlsson and Mikael Vejdemo-Johansson. Topological Data Analysis with Applications. Cambridge University Press, 2021.
- PCGP12
Axel Coussement, Olivier Gicquel, and Alessandro Parente. Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9):2844–2855, 2012.
- PDMJRM00
Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.
- PELL09
Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009. ISBN 0340761199.
- PGSB04
Abdul A. Gill, George D. Smith, and Anthony J. Bagnall. Improving decision tree performance through induction-and cluster-based stratified sampling. In International Conference on Intelligent Data Engineering and Automated Learning, 339–344. Springer, 2004.
- PHG09
Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
- PKR09
Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Volume 344. John Wiley & Sons, 2009.
- PKK04
Michael R Keenan and Paul G Kotula. Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images. Surface and Interface Analysis: An International Journal devoted to the development and application of techniques for the analysis of surfaces, interfaces and thin films, 36(3):203–212, 2004.
- PKEA+03
Hector C Keun, Timothy MD Ebbels, Henrik Antti, Mary E Bollard, Olaf Beckonert, Elaine Holmes, John C Lindon, and Jeremy K Nicholson. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490(1-2):265–276, 2003.
- PMMD10
Robert J. May, Holger R. Maier, and Graeme C. Dandy. Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks, 23(2):283–294, 2010.
- PNey92
Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in Statistics, pages 123–150. Springer, 1992.
- PNod08
Isao Noda. Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure, 883:216–227, 2008.
- PPS13
Alessandro Parente and James C. Sutherland. Principal component analysis of turbulent combustion data: data pre-processing and manifold sensitivity. Combustion and flame, 160(2):340–350, 2013.
- PPSTS09
Alessandro Parente, James C. Sutherland, Leonardo Tognotti, and Philip J. Smith. Identification of low-dimensional manifolds in turbulent flames. Proceedings of the Combustion Institute, 32(1):1579–1586, 2009.
- PRLM+16
Mojdeh Rastgoo, Guillaume Lemaitre, Joan Massich, Olivier Morel, Franck Marzani, Rafael Garcia, and Fabrice Meriaudeau. Tackling the problem of data imbalancing for melanoma classification. BIOSTEC - 3rd International Conference on BIOIMAGING, 2016.
- PSCSC03
Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering, 2003.
- PvdBHW+06
Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7(1):1–15, 2006.