Data preprocessing#
The preprocess module can be used for performing data preprocessing
including centering and scaling, outlier detection and removal, kernel density
weighting of data sets, data clustering and data sampling. It also includes
functionalities that allow the user to perform initial data inspection such
as computing conditional statistics, calculating statistically representative sample sizes,
or ordering variables in a data set according to a criterion.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The convention throughout this documentation is that \(i\) indexes observations and \(j\) indexes variables.
The representation of the user-supplied data matrix in PCAfold
is the input parameter X, which should be of type numpy.ndarray
and of size (n_observations,n_variables).
Data manipulation#
This section includes functions for performing basic data manipulation, such as centering and scaling, and outlier detection and removal.
center_scale#
- PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#
Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).
Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):
\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]
Centers for each column are computed as:
\[c_j = mean(X_j)\]
with the only exceptions being the '0to1' and '-1to1' scalings, which use a different quantity to center each column.
Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):
\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]
If both centering and scaling are applied:
\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]
Several scaling options are implemented here:
| Scaling method | scaling | Scaling factor \(d_j\) |
|---|---|---|
| None | 'none' or '' | 1 |
| Auto [PvdBHW+06] | 'auto' or 'std' | \(\sigma\) |
| Pareto [PNod08] | 'pareto' | \(\sqrt{\sigma}\) |
| VAST [PKEA+03] | 'vast' | \(\sigma^2 / mean(X_j)\) |
| Range [PvdBHW+06] | 'range' | \(max(X_j) - min(X_j)\) |
| 0 to 1 | '0to1' | \(d_j = max(X_j) - min(X_j)\), with \(c_j = min(X_j)\) |
| -1 to 1 | '-1to1' | \(d_j = 0.5 \cdot (max(X_j) - min(X_j))\), with \(c_j = 0.5 \cdot (max(X_j) + min(X_j))\) |
| Level [PvdBHW+06] | 'level' | \(mean(X_j)\) |
| Max | 'max' | \(max(X_j)\) |
| Variance | 'variance' | \(var(X_j)\) |
| Median | 'median' | \(median(X_j)\) |
| Poisson [PKK04] | 'poisson' | \(\sqrt{mean(X_j)}\) |
| S1 | 'vast_2' | \(\sigma^2 k^2 / mean(X_j)\) |
| S2 | 'vast_3' | \(\sigma^2 k^2 / max(X_j)\) |
| S3 | 'vast_4' | \(\sigma^2 k^2 / (max(X_j) - min(X_j))\) |
| L2-norm | 'l2-norm' | \(\lVert X_j \rVert_2\) |
where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).
The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PPS13].
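The centering and scaling formulas above can be sketched in plain NumPy. This is only an illustration of the math for a handful of the scalings, not the PCAfold implementation: the function name center_scale_sketch is hypothetical, and the population standard deviation (NumPy's default) is assumed for \(\sigma\).

```python
import numpy as np

def center_scale_sketch(X, scaling="range"):
    """Illustrative sketch of X_cs = (X - C) D^{-1} for a few scalings."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    sigma = X.std(axis=0)  # assumption: population standard deviation

    # By default the center c_j is the column mean;
    # '0to1' and '-1to1' replace it with a different quantity.
    center = X.mean(axis=0)
    if scaling == "none":
        scale = np.ones(X.shape[1])
    elif scaling == "auto":
        scale = sigma
    elif scaling == "pareto":
        scale = np.sqrt(sigma)
    elif scaling == "range":
        scale = x_max - x_min
    elif scaling == "0to1":
        center, scale = x_min, x_max - x_min
    elif scaling == "-1to1":
        center, scale = 0.5 * (x_max + x_min), 0.5 * (x_max - x_min)
    else:
        raise ValueError(f"scaling '{scaling}' is not covered by this sketch")

    # Broadcasting applies c_j and d_j column by column.
    return (X - center) / scale, center, scale

rng = np.random.default_rng(0)
X = rng.random((100, 20))
X_cs, X_center, X_scale = center_scale_sketch(X, "0to1")
```

With '0to1', every column of X_cs spans exactly the interval from 0 to 1, which is a quick way to check the centers and scales.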
Example:
    from PCAfold import center_scale
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Center and scale:
    (X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'variance', 'median', 'poisson', 'vast_2', 'vast_3', 'vast_4', 'l2-norm'.
nocenter – (optional) bool specifying whether data should be centered by mean. If set to True, data will not be centered.
- Returns
X_cs - numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size (n_observations,n_variables).
X_center - numpy.ndarray specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).
X_scale - numpy.ndarray specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).
invert_center_scale#
- PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#
Inverts whatever centering and scaling was done by the center_scale function:
\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]
Example:
    from PCAfold import center_scale, invert_center_scale
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Center and scale:
    (X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)

    # Uncenter and unscale:
    X = invert_center_scale(X_cs, X_center, X_scale)
- Parameters
X_cs – numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size (n_observations,n_variables).
X_center – numpy.ndarray specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).
X_scale – numpy.ndarray specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size (n_variables,).
- Returns
X - numpy.ndarray specifying the original data set, \(\mathbf{X}\). It has size (n_observations,n_variables).
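The inversion formula can be checked with a short NumPy round trip. This sketch assumes the 'range' scaling with mean centering and does not call PCAfold; the variable X_recovered is introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))

# Center by the column mean and scale by the column range ('range' scaling).
X_center = X.mean(axis=0)
X_scale = X.max(axis=0) - X.min(axis=0)
X_cs = (X - X_center) / X_scale

# Invert: multiply by the scales d_j, then add the centers c_j back.
X_recovered = X_cs * X_scale + X_center
```

Up to floating-point round-off, X_recovered matches the original X.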
power_transform#
- PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#
Performs a power transformation of the provided data. The equation for the transformation of variable \(X\) is
\[(|X + s_1|)^\alpha \text{sign}(X + s_1) + s_2 \text{sign}(X + s_1)\]
where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.
Example:
    from PCAfold import power_transform
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20) + 1

    # Perform power transformation:
    X_pow = power_transform(X, 0.5)

    # Undo the transformation:
    X_orig = power_transform(X_pow, 0.5, invert=True)
- Parameters
X – array of the variable(s) to be transformed
transform_power – the power parameter used in the transformation equation
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation
invert – (optional, default False) when True, will undo the transformation
- Returns
array of the transformed variables
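The transformation equation and one possible inverse can be sketched in NumPy as follows. The function names are hypothetical, and the inverse shown here is derived directly from the equation under the assumptions \(\alpha > 0\) and \(s_2 \geq 0\) (so that the sign of the transformed value equals the sign of \(X + s_1\)); it is not necessarily how invert=True is implemented in PCAfold.

```python
import numpy as np

def power_transform_sketch(X, alpha, s1=0.0, s2=0.0):
    """|X + s1|^alpha * sign(X + s1) + s2 * sign(X + s1)."""
    shifted = X + s1
    sign = np.sign(shifted)
    return np.abs(shifted) ** alpha * sign + s2 * sign

def inverse_power_transform_sketch(Y, alpha, s1=0.0, s2=0.0):
    """Inverse of the sketch above, assuming alpha > 0 and s2 >= 0."""
    sign = np.sign(Y)
    # |Y| = |X + s1|^alpha + s2; clip at zero to guard against round-off.
    magnitude = np.maximum(np.abs(Y) - s2, 0.0) ** (1.0 / alpha)
    return sign * magnitude - s1

X = np.linspace(-2.0, 2.0, 9)
Y = power_transform_sketch(X, alpha=0.5, s1=0.3, s2=0.1)
X_back = inverse_power_transform_sketch(Y, alpha=0.5, s1=0.3, s2=0.1)
```

The sign factor is what allows a fractional power to be applied to data that contains negative values.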
log_transform#
- PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#
Performs log transformation of the original data set, \(\mathbf{X}\).
The symlog transformation can be obtained with method='symlog', and the continuous symlog transformation can be obtained with method='continuous-symlog'.
Example:
    from PCAfold import log_transform
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20) + 1

    # Perform log transformation:
    X_log = log_transform(X)

    # Perform symlog transformation:
    X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
method – (optional) str specifying the log-transformation method. It can be one of the following: 'log', 'ln', 'symlog', 'continuous-symlog'.
threshold – (optional) float or int specifying the threshold for symlog transformation.
- Returns
X_transformed - numpy.ndarray specifying the log-transformed data set. It has size (n_observations,n_variables).
remove_constant_vars#
- PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#
Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:
The maximum of the absolute value of a column \(X_j\) is less than maxtol:
\[max(|X_j|) < \verb|maxtol|\]
The ratio of the range of values in a column \(X_j\) to \(max(|X_j|)\) is less than rangetol:
\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]
In particular, it can be used as a preprocessing step for PCA so that the eigenvalue calculation does not break.
Example:
    from PCAfold import remove_constant_vars
    import numpy as np

    # Generate dummy data set with a constant variable:
    X = np.random.rand(100,20)
    X[:,5] = np.ones((100,))

    # Remove the constant column:
    (X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
maxtol – (optional) float specifying the tolerance for \(max(|X_j|)\).
rangetol – (optional) float specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).
- Returns
X_removed - numpy.ndarray specifying the original data set, \(\mathbf{X}\), with any constant columns removed. It has size (n_observations,n_variables).
idx_removed - list specifying the indices of columns removed from \(\mathbf{X}\).
idx_retained - list specifying the indices of columns retained in \(\mathbf{X}\).
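The two criteria for a constant column can be sketched in NumPy as below. The function name constant_columns_sketch is hypothetical and the loop is written for readability rather than speed; it only mirrors the two inequalities stated above.

```python
import numpy as np

def constant_columns_sketch(X, maxtol=1e-12, rangetol=1e-4):
    """Split column indices into removed (constant) and retained."""
    idx_removed, idx_retained = [], []
    for j in range(X.shape[1]):
        max_abs = np.max(np.abs(X[:, j]))
        value_range = np.max(X[:, j]) - np.min(X[:, j])
        # The maxtol check comes first, so with a positive maxtol
        # the ratio is never evaluated for an all-zero column.
        if max_abs < maxtol or value_range / max_abs < rangetol:
            idx_removed.append(j)
        else:
            idx_retained.append(j)
    return X[:, idx_retained], idx_removed, idx_retained

rng = np.random.default_rng(0)
X = rng.random((100, 20))
X[:, 5] = 1.0  # make one column constant
X_removed, idx_removed, idx_retained = constant_columns_sketch(X)
```

Here only column 5 satisfies the second criterion (its range is zero), so it is the only one dropped.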
order_variables#
- PCAfold.preprocess.order_variables(X, method='mean', descending=True)#
Orders variables in the original data set, \(\mathbf{X}\), using a selected method.
Example:
    from PCAfold import order_variables
    import numpy as np

    # Generate a dummy data set:
    X = np.array([[100, 1, 10],
                  [200, 2, 20],
                  [300, 3, 30]])

    # Order variables by the mean value in the descending order:
    (X_ordered, idx) = order_variables(X, method='mean', descending=True)
The code above should return an ordered data set:
    array([[100, 10, 1],
           [200, 20, 2],
           [300, 30, 3]])
and the list of ordered variable indices:
    [0, 2, 1]
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
method – (optional) str or list of int specifying the ordering method. If str, it can be one of the following: 'mean', 'min', 'max', 'std' or 'var'. If list, it is a custom user-provided list of indices for how the variables should be ordered.
descending – (optional) bool specifying whether variables should be ordered in the descending order. If set to False, variables will be ordered in the ascending order.
- Returns
X_ordered - numpy.ndarray specifying the original data set with ordered variables. It has size (n_observations,n_variables).
idx - list specifying the indices of the ordered variables. It has length n_variables.
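As a rough illustration of ordering by the column mean in descending order, the same dummy data set can be reordered with np.argsort. This is a sketch of the idea only, under the assumption that idx lists the original column indices in their new order; it does not call PCAfold.

```python
import numpy as np

X = np.array([[100, 1, 10],
              [200, 2, 20],
              [300, 3, 30]])

# Ordering criterion: the mean of each column.
criterion = X.mean(axis=0)

# np.argsort sorts in ascending order, so reverse it for descending order.
idx = np.argsort(criterion)[::-1].tolist()

# Reorder the columns of X accordingly.
X_ordered = X[:, idx]
```

Column 0 has the largest mean (200), followed by column 2 (20) and column 1 (2), so the columns of X_ordered appear in that order.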
Class PreProcessing#
- class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#
Performs a composition of data manipulation done by the remove_constant_vars and center_scale functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:
checks for constant columns in a data set and removes them,
centers and scales the data.
Example:
    from PCAfold import PreProcessing
    import numpy as np

    # Generate dummy data set with a constant variable:
    X = np.random.rand(100,20)
    X[:,5] = np.ones((100,))

    # Instantiate PreProcessing class object:
    preprocessed = PreProcessing(X, 'range', nocenter=False)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
scaling – (optional) str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.
nocenter – (optional) bool specifying whether data should be centered by mean. If set to True, data will not be centered.
Attributes:
X_removed - (read only) numpy.ndarray specifying the original data set with any constant columns removed. It has size (n_observations,n_variables).
idx_removed - (read only) list specifying the indices of columns removed from \(\mathbf{X}\).
idx_retained - (read only) list specifying the indices of columns retained in \(\mathbf{X}\).
X_cs - (read only) numpy.ndarray specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size (n_observations,n_variables).
X_center - (read only) numpy.ndarray specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).
X_scale - (read only) numpy.ndarray specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size (n_variables,).
outlier_detection#
- PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#
Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:
'MULTIVARIATE TRIMMING'
Outliers are detected based on multivariate Mahalanobis distance, \(D_M\):
\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]
where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\) storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per the PCA class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered outliers. For more detailed information on Mahalanobis distance the user is referred to [PBis06] or [PDMJRM00].
The threshold above which observations will be classified as outliers can be specified using the trimming_threshold parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:
\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]
'PC CLASSIFIER'
Outliers are detected based on major and minor principal components (PCs). The method of principal component classifier (PCC) was first proposed in [PSCSC03]. The application of this technique to combustion data sets was studied in [PPS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier, based on the first \(q\) (major) PCs, satisfies:
\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]
or if the second PC classifier, based on the last \((Q-k+1)\) (minor) PCs, satisfies:
\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]
where \(z_{ij}\) is the \(i^{th}, j^{th}\) element from the principal components matrix \(\mathbf{Z}\) and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per the PCA class). Major PCs are selected such that the total variance explained is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.
Coefficients \(c_1\) and \(c_2\) are found such that they represent the quantile_threshold (by default 98.99%) quantile of the empirical distributions of the first and second PC classifier, respectively.
Example:
    from PCAfold import outlier_detection
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Find outliers:
    (idx_outliers_removed, idx_outliers) = outlier_detection(X, scaling='auto', method='MULTIVARIATE TRIMMING', trimming_threshold=0.8, verbose=True)

    # New data set without outliers can be obtained as:
    X_outliers_removed = X[idx_outliers_removed,:]

    # Observations that were classified as outliers can be obtained as:
    X_outliers = X[idx_outliers,:]
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
scaling – str specifying the scaling methodology. It can be one of the following: 'none', '', 'auto', 'std', 'pareto', 'vast', 'range', '0to1', '-1to1', 'level', 'max', 'poisson', 'vast_2', 'vast_3', 'vast_4'.
method – (optional) str specifying the outlier detection method to use. It should be 'MULTIVARIATE TRIMMING' or 'PC CLASSIFIER'.
trimming_threshold – (optional) float specifying the trimming threshold to use in combination with the 'MULTIVARIATE TRIMMING' method.
quantile_threshold – (optional) float specifying the quantile threshold to use in combination with the 'PC CLASSIFIER' method.
verbose – (optional) bool for printing verbose details.
- Returns
idx_outliers_removed - list specifying the indices of observations without outliers.
idx_outliers - list specifying the indices of observations that were classified as outliers.
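The 'MULTIVARIATE TRIMMING' criterion can be sketched in NumPy as follows. This is an illustration under simplifying assumptions: the data is centered but not scaled, and the covariance matrix comes from np.cov rather than from the PCA class. The function name multivariate_trimming_sketch is hypothetical.

```python
import numpy as np

def multivariate_trimming_sketch(X, trimming_threshold=0.5):
    """Flag observations whose Mahalanobis distance exceeds a fraction of the maximum."""
    X_centered = X - X.mean(axis=0)
    S = np.cov(X_centered, rowvar=False)  # assumption: unscaled covariance
    S_inv = np.linalg.inv(S)

    # Row-wise quadratic form: d2_i = x_i^T S^{-1} x_i for every observation i.
    d2 = np.einsum("ij,jk,ik->i", X_centered, S_inv, X_centered)
    D_M = np.sqrt(d2)

    is_outlier = D_M > trimming_threshold * D_M.max()
    idx_outliers = np.where(is_outlier)[0]
    idx_outliers_removed = np.where(~is_outlier)[0]
    return idx_outliers_removed, idx_outliers

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0, :] = [50.0, 50.0, -50.0]  # plant one obvious outlier
idx_outliers_removed, idx_outliers = multivariate_trimming_sketch(X, 0.5)
```

Because the threshold is relative to the largest distance in the data set, the observation with the maximum \(D_M\) is always flagged whenever trimming_threshold is below 1.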
representative_sample_size#
- PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#
Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.
Two general approaches are available:
If method='kl-divergence', the representative sample size is computed based on Kullback-Leibler divergence.
If method='mean', method='median', method='variance', or method='std', the representative sample size is computed based on convergence of a first-order (mean or median) or second-order (variance or standard deviation) statistic.
Example:
    from PCAfold import center_scale, representative_sample_size
    import numpy as np

    # Generate dummy data set and two dependent variables:
    x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100))
    xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None]))
    phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2)))
    phi_1 = phi_1.ravel()[:,None]
    phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2)))
    phi_2 = phi_2.ravel()[:,None]
    depvars = np.column_stack((phi_1, phi_2))
    depvars, _, _ = center_scale(depvars, scaling='0to1')

    # Specify the list of percentages to explore:
    percentages = list(np.linspace(1,99.9,200))

    # Specify the list of thresholds for each dependent variable:
    thresholds = [10**-4, 10**-4]

    # Specify the names of the dependent variables:
    variable_names = ['Phi-1', 'Phi-2']

    # Compute representative sample size for each dependent variable:
    (idx, sample_sizes, statistics) = representative_sample_size(depvars,
                                                                 percentages,
                                                                 thresholds=thresholds,
                                                                 variable_names=variable_names,
                                                                 method='kl-divergence',
                                                                 statistics='median',
                                                                 n_resamples=20,
                                                                 random_seed=100,
                                                                 verbose=True)
With verbose=True we will see some detailed information:
    Dependent variable Phi-1 ...
    KL divergence threshold used: 0.0001
    Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data).

    Dependent variable Phi-2 ...
    KL divergence threshold used: 0.0001
    Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
- Parameters
depvars – numpy.ndarray specifying the dependent variables that should be well represented in a sampled dataset. It should be of size (n_observations,n_dependent_variables).
percentages – list of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.
thresholds – list of float specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.
variable_names – (optional) list of str specifying names for all dependent variables. If set to None, dependent variables are called with consecutive integers.
method – (optional) str specifying the method used to compute the sample size statistics. It can be 'mean', 'median', 'variance', 'std', or 'kl-divergence'.
statistics – (optional) str specifying the overall statistics that should be computed from a given method. It can be 'min', 'max', 'mean', or 'median'.
n_resamples – (optional) int specifying the number of resamples to perform for each percentage in the percentages vector. It is recommended to set this parameter above 1, since a single random sample might happen to be statistically representative of the full dataset by chance. Re-sampling helps to average out the effect of such one-off “lucky” random samples.
random_seed – (optional) int specifying the random seed.
verbose – (optional) bool for printing verbose details.
- Returns
threshold_idx - list of int specifying the highest indices from the percentages list where the representative number of samples condition was still met. It has length n_depvars. If the condition for a representative sample size was not met for a dependent variable, a value of -1 is returned in the list for that dependent variable.
representative_sample_sizes - numpy.ndarray of int specifying the representative number of samples. It has size (1,n_depvars). If the condition for a representative sample size was not met for a dependent variable, a value of -1 is returned in the array for that dependent variable.
sample_size_statistics - numpy.ndarray specifying the full vector of computed statistics corresponding to each entry in percentages and each dependent variable. It has size (n_percentages,n_depvars).
Class ConditionalStatistics#
- class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#
Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:
conditional mean
conditional minimum
conditional maximum
conditional standard deviation
Other quantities can be added in the future at the user’s request.
Example:
    from PCAfold import ConditionalStatistics
    import numpy as np

    # Generate dummy variables:
    conditioning_variable = np.linspace(-1,1,100)
    y = -conditioning_variable**2 + 1

    # Instantiate an object of the ConditionalStatistics class
    # and compute conditional statistics in 10 bins of the conditioning variable:
    cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10)

    # Access conditional statistics:
    conditional_mean = cond.conditional_mean
    conditional_min = cond.conditional_minimum
    conditional_max = cond.conditional_maximum
    conditional_std = cond.conditional_standard_deviation

    # Access the centroids of the created bins:
    centroids = cond.centroids
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. It should be of size (n_observations,1) or (n_observations,).
k – (optional) int specifying the number of bins to create in the conditioning variable. It has to be a positive number.
split_values – (optional) list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.
verbose – (optional) bool for printing verbose details.
Attributes:
idx - (read only) numpy.ndarray of cluster (bin) classifications. It has size (n_observations,).
borders - (read only) list of values that define borders for the clusters (bins). It has length k+1.
centroids - (read only) list of values that specify bin centers. It has length k.
conditional_mean - (read only) numpy.ndarray specifying the conditional means of all original variables in the \(k\) bins created. It has size (k,n_variables).
conditional_minimum - (read only) numpy.ndarray specifying the conditional minimums of all original variables in the \(k\) bins created. It has size (k,n_variables).
conditional_maximum - (read only) numpy.ndarray specifying the conditional maximums of all original variables in the \(k\) bins created. It has size (k,n_variables).
conditional_standard_deviation - (read only) numpy.ndarray specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size (k,n_variables).
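To make the meaning of the conditional statistics concrete, the sketch below bins a conditioning variable into \(k\) equal-width bins and averages a dependent variable within each bin, using NumPy only. The binning via np.digitize is an assumption made for illustration and may differ in edge handling from the class itself.

```python
import numpy as np

conditioning_variable = np.linspace(-1, 1, 100)
y = (-conditioning_variable**2 + 1)[:, None]  # shape (n_observations, 1)
k = 10

# k equal-width bins between the minimum and maximum of the conditioning variable.
borders = np.linspace(conditioning_variable.min(), conditioning_variable.max(), k + 1)

# Digitize against the interior borders only, so that bin indices run from 0 to k-1
# and the maximum value falls into the last bin.
idx = np.digitize(conditioning_variable, borders[1:-1])

# Bin centers:
centroids = 0.5 * (borders[:-1] + borders[1:])

# One row of statistics per bin, one column per variable:
conditional_mean = np.array([y[idx == i].mean(axis=0) for i in range(k)])
conditional_std = np.array([y[idx == i].std(axis=0) for i in range(k)])
```

For this parabola, the conditional mean is largest in the two central bins and smallest in the two outermost bins.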
Class KernelDensity#
- class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#
Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on single-variable or multi-variable case as proposed in [PCGP12].
The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).
Note
The kernel density weighting technique is usually very expensive, even on data sets with a relatively small number of observations. Since the single-variable case is a cheaper option than the multi-variable case, it is recommended that the single-variable case is tried first for larger data sets.
A Gaussian kernel is used in both approaches:
\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} exp(- \frac{d^2}{2 h^2})\]
where \(h\) is the kernel bandwidth:
\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]
where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.
\(d\) is the distance between two observations \(c\) and \(c'\):
\[d = |x_c - x_{c'}|\]
Single-variable
If the conditioning_variable argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:
\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]
and weights are then computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]
Multi-variable
If the conditioning_variable argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for a \(k^{th}\) variable:
\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]
Global density taking into account all variables is then obtained as:
\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]
where \(Q\) is the total number of conditioning variables, and weights are computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]
Example:
    from PCAfold import KernelDensity
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Perform kernel density weighting based on the first variable:
    kerneld = KernelDensity(X, X[:,0])

    # Access the weighted data set:
    X_weighted = kerneld.X_weighted

    # Access the weights used to scale the data set:
    weights = kerneld.weights
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
conditioning_variable – numpy.ndarray specifying either a single variable or multiple variables to be used as a conditioning variable for the kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).
Attributes:
weights - numpy.ndarray specifying the computed weights, \(\mathbf{W_c}\). It has size (n_observations,1).
X_weighted - numpy.ndarray specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size (n_observations,n_variables).
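The single-variable case can be written out in a few lines of NumPy, following the formulas above term by term. This is a sketch rather than the class implementation: the function name is hypothetical, the population standard deviation is assumed for \(\hat{\sigma}\), and the full \(n \times n\) distance matrix is formed explicitly, which also shows why the technique becomes expensive for large \(n\).

```python
import numpy as np

def kernel_density_weights_sketch(x):
    """Single-variable kernel density weights W_c for a 1D conditioning variable."""
    n = x.size
    # Bandwidth as written above: h = (4 sigma / (3 n))^(1/5).
    h = (4.0 * np.std(x) / (3.0 * n)) ** (1.0 / 5.0)

    # Pairwise distances d = |x_c - x_c'|, shape (n, n).
    d = np.abs(x[:, None] - x[None, :])

    # Gaussian kernel for every pair of observations.
    K = np.sqrt(1.0 / (2.0 * np.pi * h**2)) * np.exp(-d**2 / (2.0 * h**2))

    # K_c: sum over c' of K / n, i.e. the row-wise mean.
    K_c = K.mean(axis=1)

    # Weights: inverse density normalized by its maximum.
    inverse_density = 1.0 / K_c
    return inverse_density / inverse_density.max()

rng = np.random.default_rng(0)
X = rng.random((100, 20))
weights = kernel_density_weights_sketch(X[:, 0])

# Each observation (row) is multiplied by its weight.
X_weighted = X * weights[:, None]
```

Observations in sparsely populated regions of the conditioning variable receive weights close to 1, while observations in dense regions are weighted down.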
Class DensityEstimation#
- class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#
Enables density estimation on point-cloud data.
Example:
    from PCAfold import PCA, DensityEstimation
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Instantiate PCA class object:
    pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

    # Calculate the principal components:
    principal_components = pca_X.transform(X)

    # Instantiate an object of the DensityEstimation class:
    density_estimation = DensityEstimation(principal_components, n_neighbors=10)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
n_neighbors – int specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.
DensityEstimation.average_knn_distance#
- PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#
Computes the average Euclidean distance to the \(k\) nearest neighbors on a manifold defined by the independent variables.
Example:
    from PCAfold import PCA, DensityEstimation
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Instantiate PCA class object:
    pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

    # Calculate the principal components:
    principal_components = pca_X.transform(X)

    # Instantiate an object of the DensityEstimation class:
    density_estimation = DensityEstimation(principal_components, n_neighbors=10)

    # Compute average distances on a manifold defined by the PCs:
    average_distances = density_estimation.average_knn_distance(verbose=True)
With verbose=True, the minimum, maximum, average and median distances will be printed:
    Minimum distance: 0.1388300829487847
    Maximum distance: 0.4689587542132183
    Average distance: 0.20824964953425693
    Median distance: 0.18333873029179215
Note
This function requires the scikit-learn module. You can install it through:
    pip install scikit-learn
- Parameters
verbose – (optional) bool for printing verbose details.
- Returns
average_distances - numpy.ndarray specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size (n_observations,).
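For small data sets, the quantity being computed can be illustrated with a brute-force NumPy sketch that needs no scikit-learn. The function name is hypothetical, and it is assumed here that a point is not counted as its own neighbor.

```python
import numpy as np

def average_knn_distance_sketch(X, n_neighbors):
    """Average Euclidean distance from each observation to its k nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_observations, n_observations).
    diff = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diff**2).sum(axis=-1))

    # Sort each row; column 0 is the zero distance of a point to itself.
    sorted_distances = np.sort(distances, axis=1)
    return sorted_distances[:, 1:n_neighbors + 1].mean(axis=1)

# Four points on a line, one unit apart:
points = np.array([[0.0], [1.0], [2.0], [3.0]])
average_distances = average_knn_distance_sketch(points, n_neighbors=2)
# → array([1.5, 1. , 1. , 1.5])
```

The end points have neighbors at distances 1 and 2, while the interior points have two neighbors at distance 1, which explains the larger averages at the ends.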
DensityEstimation.kth_nearest_neighbor_codensity#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#
Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data codensity defined as:
\[\delta_k(x) = d(x, v_k(x))\]
where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).
Example:
    from PCAfold import PCA, DensityEstimation
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Instantiate PCA class object:
    pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

    # Calculate the principal components:
    principal_components = pca_X.transform(X)

    # Instantiate an object of the DensityEstimation class:
    density_estimation = DensityEstimation(principal_components, n_neighbors=10)

    # Compute the distance to the kth nearest neighbor:
    data_codensity = density_estimation.kth_nearest_neighbor_codensity()
Note
This function requires the scikit-learn module. You can install it through:
    pip install scikit-learn
- Returns
data_codensity - numpy.ndarray specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size (n_observations,).
DensityEstimation.kth_nearest_neighbor_density#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#
Computes the inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data density defined as:
\[\rho_k(x) = \frac{1}{\delta_k(x)}\]
where \(\delta_k(x)\) is the codensity.
Example:
    from PCAfold import PCA, DensityEstimation
    import numpy as np

    # Generate dummy data set:
    X = np.random.rand(100,20)

    # Instantiate PCA class object:
    pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

    # Calculate the principal components:
    principal_components = pca_X.transform(X)

    # Instantiate an object of the DensityEstimation class:
    density_estimation = DensityEstimation(principal_components, n_neighbors=10)

    # Compute the inverse of the distance to the kth nearest neighbor:
    data_density = density_estimation.kth_nearest_neighbor_density()
Note
This function requires the scikit-learn module. You can install it through:
    pip install scikit-learn
- Returns
data_density - numpy.ndarray specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size (n_observations,).
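The codensity \(\delta_k(x)\) and density \(\rho_k(x)\) definitions can be illustrated on a tiny data set with brute-force NumPy. This is a sketch, not the scikit-learn based implementation; the function name is hypothetical and a point is assumed not to count as its own neighbor.

```python
import numpy as np

def kth_neighbor_codensity_sketch(X, k):
    """Distance from each observation to its k-th nearest neighbor."""
    diff = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diff**2).sum(axis=-1))
    # After sorting, column 0 is the point itself, so column k is the k-th neighbor.
    return np.sort(distances, axis=1)[:, k]

# Four points on a line with increasing spacing:
points = np.array([[0.0], [1.0], [3.0], [7.0]])

codensity = kth_neighbor_codensity_sketch(points, k=2)
# → array([3., 2., 3., 6.])

# Density is the inverse of the codensity:
density = 1.0 / codensity
```

The isolated point at 7 has the largest codensity and therefore the smallest density, which matches the intuition of a sparsely populated region.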
Data clustering#
This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [PELL09], [PKR09].
Clustering functions#
Each function that clusters the data set returns a vector of integers idx
of type numpy.ndarray of size (n_observations,) that specifies
classification of each observation from the original data set,
\(\mathbf{X}\), to a local cluster.
Note
The first cluster has index 0 within all idx vectors returned.
variable_bins#
- PCAfold.preprocess.variable_bins(var, k, verbose=False)#
Clusters the data by dividing a variable vector var into bins of equal lengths.
Example:
    from PCAfold import variable_bins
    import numpy as np

    # Generate dummy variable:
    x = np.linspace(-1,1,100)

    # Create partitioning according to bins of x:
    (idx, borders) = variable_bins(x, 4, verbose=True)
- Parameters
var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).
k – int specifying the number of clusters to create. It has to be a positive number.
verbose – (optional) bool for printing verbose details.
- Returns
idx - numpy.ndarray of cluster classifications. It has size (n_observations,).
borders - list of values that define borders for the clusters. It has length k+1.
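As a rough illustration of equal-length binning, the sketch below computes k+1 borders and a cluster index for each value using only the standard library. The function name is hypothetical, and the treatment of values that fall exactly on a border (assigned to the bin that starts there, with the maximum placed in the last bin) is an assumption rather than documented PCAfold behavior.

```python
def equal_length_bins(values, k):
    """Hypothetical sketch: split the range of `values` into k bins of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # k+1 borders, from the minimum to the maximum value:
    borders = [lo + i * width for i in range(k + 1)]
    # Clamp the index so that the maximum value lands in the last bin:
    idx = [min(int((v - lo) / width), k - 1) for v in values]
    return idx, borders

# Plain-Python equivalent of np.linspace(-1, 1, 100):
x = [-1 + 2 * i / 99 for i in range(100)]
idx, borders = equal_length_bins(x, 4)
populations = [idx.count(cluster) for cluster in range(4)]
```

For this evenly spaced vector each of the four bins receives 25 observations.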
predefined_variable_bins#
- PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#
Clusters the data by dividing a variable vector var into bins such that splits are done at user-specified values. Split values can be specified in the split_values list. In general: split_values = [value_1, value_2, ..., value_n].
Note: When a split is performed at a given value_i, the observation in var that takes exactly that value is assigned to the newly created bin.
An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import predefined_variable_bins
import numpy as np
# Generate dummy variable:
x = np.linspace(-1,1,100)
# Create partitioning according to pre-defined bins of x:
(idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
- Parameters
var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).
split_values – list specifying values at which splits should be performed.
verbose – (optional) bool for printing verbose details.
- Returns
idx - numpy.ndarray of cluster classifications. It has size (n_observations,).
borders - list of values that define borders for the clusters. It has length k+1, where k is the number of clusters created.
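The rule that an observation equal to a split value goes to the newly created bin can be sketched with the standard library's bisect module. This is only an illustration with a hypothetical function name; it assumes that the borders list consists of the minimum, the split values and the maximum, which is consistent with the stated length of k+1.

```python
from bisect import bisect_right

def bins_at_split_values(values, split_values):
    """Hypothetical sketch: classify values into bins delimited by split_values."""
    splits = sorted(split_values)
    # bisect_right counts the splits that are <= v, so a value equal to a split
    # is assigned to the bin that starts at that split.
    idx = [bisect_right(splits, v) for v in values]
    # Assumed layout of the borders: minimum, all split values, maximum.
    borders = [min(values)] + splits + [max(values)]
    return idx, borders

x = [-1.0, -0.6, 0.0, 0.4, 0.5, 0.8, 1.0]
idx, borders = bins_at_split_values(x, [-0.6, 0.4, 0.8])
```

Three split values inside the data range produce four clusters, so borders has five entries.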
mixture_fraction_bins#
- PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#
Clusters the data by dividing a mixture fraction vector Z into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PPSTS09]. The vector is first split into a lean and a rich side (according to the stoichiometric mixture fraction Z_stoich) and then the sides get divided further into clusters. When k is odd, there will always be one more cluster on the side with the larger range in mixture fraction space compared to the other side.
An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import mixture_fraction_bins
import numpy as np
# Generate dummy mixture fraction variable:
Z = np.linspace(0,1,100)
# Create partitioning according to bins of mixture fraction:
(idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
- Parameters
Z – numpy.ndarray specifying the mixture fraction values. It should be of size (n_observations,) or (n_observations,1).
k – int specifying the number of clusters to create. It has to be a positive number.
Z_stoich – float specifying the stoichiometric mixture fraction. It has to be between 0 and 1.
verbose – (optional) bool for printing verbose details.
- Returns
idx - numpy.ndarray of cluster classifications. It has size (n_observations,).
borders - list of values that define borders for the clusters. It has length k+1.
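One possible reading of the lean/rich split is sketched below. The function name and the exact allocation rule are assumptions: the sketch gives each side k // 2 clusters and, for odd k, hands the extra cluster to the side with the larger range, as the description states. It is not taken from the PCAfold source.

```python
def mixture_fraction_borders(z_min, z_max, k, z_stoich):
    """Hypothetical sketch: borders of k bins split at the stoichiometric value."""
    if k < 2:
        raise ValueError("This sketch needs at least one cluster on each side.")
    lean_range = z_stoich - z_min
    rich_range = z_max - z_stoich
    k_lean = k_rich = k // 2
    if k % 2 == 1:
        # Odd k: the extra cluster goes to the side with the larger range.
        if lean_range > rich_range:
            k_lean += 1
        else:
            k_rich += 1
    # Equal-length bins on each side; z_stoich is always one of the borders.
    lean = [z_min + i * lean_range / k_lean for i in range(k_lean)]
    rich = [z_stoich + i * rich_range / k_rich for i in range(k_rich + 1)]
    return lean + rich, k_lean, k_rich

borders, k_lean, k_rich = mixture_fraction_borders(0.0, 1.0, 4, 0.4)
borders_odd, k_lean_odd, k_rich_odd = mixture_fraction_borders(0.0, 1.0, 5, 0.4)
```

With Z between 0 and 1, k=4 and Z_stoich=0.4, the borders are approximately 0, 0.2, 0.4, 0.7 and 1. With k=5 the rich side (range 0.6) receives three clusters and the lean side (range 0.4) two.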
zero_neighborhood_bins#
- PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#
Clusters the data by separating close-to-zero observations in a vector into one cluster (split_at_zero=False) or two clusters (split_at_zero=True). The offset from zero at which splits are performed is computed based on the input parameter zero_offset_percentage:
\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]
Further clusters are found by splitting positive and negative values in a vector alternately into bins of equal lengths.
This clustering technique can be useful for partitioning any variable that has many observations clustered around zero and relatively few observations far away from zero on either side.
Two examples of how a vector can be partitioned with this function are presented below:
With split_at_zero=False:
If split_at_zero=False, the smallest allowed number of clusters is 3. This ensures that there are at least three clusters: with negative values, with close-to-zero values and with positive values.
When k is even, there will always be one more cluster on the side with the larger range compared to the other side.
With split_at_zero=True:
If split_at_zero=True, the smallest allowed number of clusters is 4. This ensures that there are at least four clusters: with negative values, with negative values close to zero, with positive values close to zero and with positive values.
When k is odd, there will always be one more cluster on the side with the larger range compared to the other side.
Note
This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\), (as per [TSP09]) since it relies on unbalanced vectors that have many observations numerically close to zero. Using split_at_zero=True, it can further differentiate between negative and positive sources.
Example:
from PCAfold import zero_neighborhood_bins
import numpy as np
# Generate dummy variable:
x = np.linspace(-100,100,1000)
# Create partitioning according to bins of x:
(idx, borders) = zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
- Parameters
var – numpy.ndarray specifying the variable values. It should be of size (n_observations,) or (n_observations,1).
k – int specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 if split_at_zero=False or smaller than 4 if split_at_zero=True.
zero_offset_percentage – (optional) percentage of the \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero. For instance, set zero_offset_percentage=10 if you want 10% as the offset.
split_at_zero – (optional) bool specifying whether partitioning should be done at var=0.
verbose – (optional) bool for printing verbose details.
- Returns
idx - numpy.ndarray of cluster classifications. It has size (n_observations,).
borders - list of values that define borders for the clusters. It has length k+1.
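The offset formula can be checked with a few lines of plain Python. The helper names are hypothetical, and the placement of the near-zero borders at plus and minus the offset (with an additional border at zero when split_at_zero=True) is an interpretation of the description above, not code taken from PCAfold.

```python
def zero_offset(values, zero_offset_percentage):
    """Offset from zero, as a percentage of the range of `values`."""
    return (max(values) - min(values)) * zero_offset_percentage / 100

def near_zero_borders(values, zero_offset_percentage, split_at_zero=False):
    """Hypothetical sketch: borders that delimit the close-to-zero cluster(s)."""
    offset = zero_offset(values, zero_offset_percentage)
    if split_at_zero:
        # Two close-to-zero clusters: one negative, one positive.
        return [-offset, 0.0, offset]
    # One close-to-zero cluster.
    return [-offset, offset]

# Plain-Python equivalent of np.linspace(-100, 100, 1000):
x = [-100 + 200 * i / 999 for i in range(1000)]
offset = zero_offset(x, 10)
inner_borders = near_zero_borders(x, 10, split_at_zero=True)
```

For a vector spanning -100 to 100, an offset percentage of 10 gives an offset of 20, so the close-to-zero observations are those between -20 and 20.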
Auxiliary functions#
degrade_clusters#
- PCAfold.preprocess.degrade_clusters(idx, verbose=False)#
Renumbers clusters if either of these two cases is true:
- idx is composed of non-consecutive integers, or
- the smallest cluster index in idx is not equal to 0.
Example:
from PCAfold import degrade_clusters
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 2, 0, 5, 10])
# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
The code above will produce:
>>> idx_degraded
array([0, 0, 1, 0, 2, 3])
Alternatively:
from PCAfold import degrade_clusters
import numpy as np
# Generate dummy idx vector:
idx = np.array([1, 1, 2, 2, 3, 3])
# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
will produce:
>>> idx_degraded
array([0, 0, 1, 1, 2, 2])
- Parameters
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
verbose – (optional) bool for printing verbose details.
- Returns
idx_degraded - numpy.ndarray of degraded cluster classifications. It has size (n_observations,).
k_update - int specifying the updated number of clusters.
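The renumbering shown in the two examples can be reproduced with a short standard-library sketch. The function name is hypothetical, and the sketch assumes that the original ordering of cluster labels is preserved, which matches the outputs documented above.

```python
def renumber_clusters(idx):
    """Hypothetical sketch: map cluster labels to consecutive integers from 0."""
    # Each distinct label, taken in ascending order, gets the next integer:
    mapping = {label: new for new, label in enumerate(sorted(set(idx)))}
    return [mapping[label] for label in idx], len(mapping)

idx_degraded, k_update = renumber_clusters([0, 0, 2, 0, 5, 10])
idx_shifted, k_shifted = renumber_clusters([1, 1, 2, 2, 3, 3])
```

The first call collapses the gaps in the labels 0, 2, 5, 10 and the second shifts the labels 1, 2, 3 down so that the first cluster is indexed 0.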
flip_clusters#
- PCAfold.preprocess.flip_clusters(idx, dictionary)#
Flips cluster labelling according to instructions provided in a dictionary. For a dictionary = {key : value}, a cluster with a number key will get a number value.
Example:
from PCAfold import flip_clusters
import numpy as np
# Generate dummy idx vector:
idx = np.array([0,0,0,1,1,1,1,2,2])
# Swap cluster number 1 with cluster number 2:
flipped_idx = flip_clusters(idx, {1:2, 2:1})
The code above will produce:
>>> flipped_idx
array([0, 0, 0, 2, 2, 2, 2, 1, 1])
Note
This function can also be used to merge clusters. Using the idx from the example above, if we call:
flipped_idx = flip_clusters(idx, {2:1})
the result will be:
>>> flipped_idx
array([0, 0, 0, 1, 1, 1, 1, 1, 1])
where clusters 1 and 2 have been merged into one cluster numbered 1.
- Parameters
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
dictionary – dict specifying instructions for cluster label flipping.
- Returns
flipped_idx - numpy.ndarray specifying the re-labelled cluster classifications. It has size (n_observations,).
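The relabelling rule amounts to a dictionary lookup that leaves unlisted labels unchanged. The sketch below uses a hypothetical function name and plain lists; it reproduces both the swap and the merge shown above.

```python
def relabel_clusters(idx, dictionary):
    """Hypothetical sketch: replace each label `key` with `value`."""
    # Labels that are not keys in the dictionary are kept as they are:
    return [dictionary.get(label, label) for label in idx]

idx = [0, 0, 0, 1, 1, 1, 1, 2, 2]
swapped = relabel_clusters(idx, {1: 2, 2: 1})
merged = relabel_clusters(idx, {2: 1})
```

Because every label is looked up in the original idx, a swap such as {1:2, 2:1} does not overwrite itself halfway through.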
get_centroids#
- PCAfold.preprocess.get_centroids(X, idx)#
Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the idx vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster is computed as:
\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]
Centroids for all variables from all clusters are stored in the matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\) returned:
\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]
Example:
from PCAfold import get_centroids
import numpy as np
# Generate dummy data set:
X = np.random.rand(100,5)
# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)
# Compute the centroids of each cluster:
centroids = get_centroids(X, idx)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
- Returns
centroids - numpy.ndarray specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size (k,n_variables).
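The centroid definition can be made concrete with a small plain-Python sketch that averages each variable over the observations of each cluster. The function name is hypothetical, and the sketch assumes that cluster labels run from 0 to k-1 and that no cluster is empty.

```python
def cluster_centroids(X, idx):
    """Hypothetical sketch: per-cluster mean of every variable (k rows, Q columns)."""
    k = max(idx) + 1
    centroids = []
    for n in range(k):
        # Observations (rows of X) that belong to cluster n:
        rows = [row for row, label in zip(X, idx) if label == n]
        # Mean of every variable (column) over those observations:
        centroids.append([sum(column) / len(rows) for column in zip(*rows)])
    return centroids

X = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
idx = [0, 0, 1, 1]
centroids = cluster_centroids(X, idx)
```

Row \(n\) of the result holds the centroid of cluster \(n\), so the result has k rows and n_variables columns.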
get_partition#
- PCAfold.preprocess.get_partition(X, idx)#
Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to the idx provided.
Example:
from PCAfold import get_partition
import numpy as np
# Generate dummy data set:
X = np.random.rand(100,5)
# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)
# Generate partitioning of the data set according to idx:
(X_in_clusters, idx_in_clusters) = get_partition(X, idx)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
- Returns
X_in_clusters - list of \(k\) numpy.ndarray that contains original data set observations partitioned to \(k\) clusters. It has length k.
idx_in_clusters - list of \(k\) numpy.ndarray that contains indices of the original data set observations partitioned to \(k\) clusters. It has length k.
get_populations#
- PCAfold.preprocess.get_populations(idx)#
Computes populations (number of observations) in clusters specified in the idx vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster, this function will return a list: [100, 500].
Example:
from PCAfold import variable_bins, get_populations
import numpy as np
# Generate dummy partitioning:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=True)
# Compute cluster populations:
populations = get_populations(idx)
The code above will produce:
>>> populations
[25, 25, 25, 25]
- Parameters
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
- Returns
populations - list of cluster populations. Each entry refers to one cluster, ordered according to idx. It has length k.
get_average_centroid_distance#
- PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#
Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.
The average can be computed as an arithmetic average from all clusters (weighted=False) or as a weighted average (weighted=True). In the latter case, the distances are weighted by the number of observations in a cluster, so the average centroid distance is pulled towards the average distance in the largest cluster.
Example:
from PCAfold import get_average_centroid_distance
import numpy as np
# Generate dummy data set:
X = np.random.rand(100,5)
# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)
# Compute average distance from cluster centroids:
average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
- Parameters
X – numpy.ndarray specifying the original data set, \(\mathbf{X}\). It should be of size (n_observations,n_variables).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
weighted – (optional) bool specifying whether distances from centroids should be weighted by the number of observations in a cluster. If set to False, the arithmetic average will be computed.
- Returns
average_centroid_distance - float specifying the average distance from centroids, averaged over all observations and all clusters.
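The difference between the two averaging modes can be seen in the following sketch. It is one plausible reading of the description, not PCAfold's code: the function name is hypothetical, the unweighted result is taken as the plain mean of the per-cluster mean distances, and the weighted result multiplies each per-cluster mean by the cluster population before normalizing.

```python
import math

def average_centroid_distance(X, idx, weighted=False):
    """Hypothetical sketch: average Euclidean distance to the cluster centroids."""
    k = max(idx) + 1
    mean_distances, populations = [], []
    for n in range(k):
        rows = [row for row, label in zip(X, idx) if label == n]
        centroid = tuple(sum(column) / len(rows) for column in zip(*rows))
        # Mean distance between the observations of cluster n and its centroid:
        mean_distances.append(sum(math.dist(row, centroid) for row in rows) / len(rows))
        populations.append(len(rows))
    if weighted:
        # Larger clusters contribute proportionally more (assumed weighting):
        return sum(d * p for d, p in zip(mean_distances, populations)) / sum(populations)
    return sum(mean_distances) / k

X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 6.0), (10.0, 0.0), (10.0, 6.0)]
idx = [0, 0, 1, 1, 1, 1]
unweighted = average_centroid_distance(X, idx, weighted=False)
weighted = average_centroid_distance(X, idx, weighted=True)
```

Here the per-cluster mean distances are 1 and 3. The unweighted average is 2, while the weighted average is (2·1 + 4·3)/6 ≈ 2.33, closer to the value in the larger cluster.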
Data sampling#
This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting, which can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [PNey92], such as drawing a specified amount of samples from local clusters [PMMD10], [PGSB04]. These functionalities can be used specifically to tackle imbalanced data sets [PHG09], [PRLM+16].
The general idea is to divide the entire data set X (or its portion) into train and test samples as presented below:
Train data is always sampled in the same way for a given sampling function. Depending on the option selected, test data will be sampled differently, either as all remaining samples that were not included in train data or as a subset of those. You can select the option by setting the test_selection_option parameter for each sampling function. Refer to the documentation for a specific sampling function to see what options are available.
All splitting functions in this module return a tuple of two variables: (idx_train, idx_test).
Both idx_train and idx_test are vectors of integers of type numpy.ndarray and of size (_,).
These variables contain indices of observations that went into train data and test data respectively.
In your model learning algorithm you can then get the train and test observations, for instance in the following way:
X_train = X[idx_train,:]
X_test = X[idx_test,:]
All functions are equipped with the verbose parameter. If it is set to True, some additional information on train and test selection is printed.
Note
It is assumed that the first cluster has index 0 within all input idx vectors.
Class DataSampler#
- class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#
Enables selecting train and test data samples.
Example:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Instantiate DataSampler class object:
selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
- Parameters
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
idx_test – (optional) numpy.ndarray specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices in idx_test and the test data will be returned the same as the user-provided idx_test. If not specified, test samples will be selected according to the test_selection_option parameter (see documentation for each sampling function). Setting a fixed idx_test parameter may be useful if evaluating a machine learning model on specific test samples is desired. It should be of size (n_test_samples,) or (n_test_samples,1).
random_seed – (optional) int specifying the random seed for random sample selection.
verbose – (optional) bool for printing verbose details.
DataSampler.number#
- PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.
Example:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)
# Generate sampling:
(idx_train, idx_test) = selection.number(20, test_selection_option=1)
Train data:
The number of train samples is estimated based on the percentage perc provided. First, the total number of samples for training is estimated as a percentage perc of the total number of observations n_observations in a data set. Next, this number is divided equally among the \(k\) clusters. The result n_of_samples is the number of samples that will be selected from each cluster:
\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]
Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the smallest cluster is found and all \(m\) observations remaining in that cluster are taken as test data. Next, the same number of samples \(m\) is taken from each of the remaining larger clusters.
The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:
Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).
- Parameters
perc – percentage of data to be selected as training data from the entire data set. For instance, set perc=20 if you want to select 20%.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
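To make the n_of_samples formula concrete, the sketch below draws the same number of train samples from every cluster and, mirroring test_selection_option=1, treats all remaining observations as test data. It uses only the standard library; the function name and the seed argument are hypothetical, and the random draws will not match those of DataSampler.

```python
import random

def sample_fixed_number(idx, perc, seed=None):
    """Hypothetical sketch: equal number of train samples from every cluster."""
    rng = random.Random(seed)
    clusters = sorted(set(idx))
    k = len(clusters)
    # Number of samples per cluster, as in the formula for n_of_samples:
    n_of_samples = int(perc * len(idx) / (k * 100))
    idx_train = []
    for cluster in clusters:
        members = [i for i, label in enumerate(idx) if label == cluster]
        # Raises ValueError if a cluster holds fewer than n_of_samples observations:
        idx_train.extend(rng.sample(members, n_of_samples))
    idx_train.sort()
    # All observations not selected for training become test data:
    selected = set(idx_train)
    idx_test = [i for i in range(len(idx)) if i not in selected]
    return idx_train, idx_test, n_of_samples

idx = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
idx_train, idx_test, n_of_samples = sample_fixed_number(idx, 20, seed=100)
```

With 10 observations, 2 clusters and perc=20, the formula gives int(20·10/(2·100)) = 1 sample per cluster, so 2 observations are used for training and the other 8 for testing.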
DataSampler.percentage#
- PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a certain percentage perc from every cluster as the training data.
Example:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)
# Generate sampling:
(idx_train, idx_test) = selection.percentage(20, test_selection_option=1)
Note: If the cluster sizes are comparable, this function will give a similar train sample distribution as random sampling (DataSampler.random). This sampling can be useful in cases where one cluster is significantly smaller than the others and there is a chance that this cluster would not be covered in the train data if random sampling were used.
Train data:
The number of train samples is estimated based on the percentage perc provided. First, the size of the \(i^{th}\) cluster, cluster_size_i, is determined, and then a percentage perc of that number is selected.
Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).
The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:
Here \(p\) is the percentage perc provided.
- Parameters
perc – percentage of data to be selected as training data from each cluster. For instance, set perc=20 if you want to select 20%.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
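The per-cluster bookkeeping behind percentage sampling can be sketched as follows. The function name is hypothetical, and truncating with int is an assumption made by analogy with the n_of_samples formula of DataSampler.number; the actual rounding used by PCAfold may differ.

```python
from collections import Counter

def train_samples_per_cluster(idx, perc):
    """Hypothetical sketch: number of train samples taken from each cluster."""
    cluster_sizes = Counter(idx)
    # The same percentage is applied to the size of every cluster:
    return {cluster: int(cluster_sizes[cluster] * perc / 100)
            for cluster in sorted(cluster_sizes)}

# Strongly imbalanced clustering: 1000 observations vs. 10 observations.
idx = [0] * 1000 + [1] * 10
counts = train_samples_per_cluster(idx, 10)
```

Even though the second cluster holds only about 1% of the data, it is guaranteed one train sample here, whereas purely random sampling of 10% of the data could miss it entirely.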
DataSampler.manual#
- PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#
Uses classifications into \(k\) clusters and a dictionary sampling_dictionary in which you manually specify what 'percentage' (or what 'number') of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as per idx and the dictionary values are either percentage or number of train samples to be selected. The default dictionary values are percentage but you can select sampling_type='number' in order to interpret the values as a number of samples.
Example:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)
# Generate sampling:
(idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)
Train data:
The number of train samples selected from each cluster is estimated based on the sampling_dictionary. For key : value, percentage value (or number value) of samples will be selected from cluster key.
Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).
The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:
Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as number or as percentage. This can be achieved by setting:
sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
- Parameters
sampling_dictionary – dict specifying manual sampling. Keys are cluster classifications and values are either percentage or number of samples to be taken from that cluster. Keys should match the cluster classifications as per idx.
sampling_type – (optional) str specifying whether percentage or number is given in the sampling_dictionary. Available options: percentage or number. The default is percentage.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
DataSampler.random#
- PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#
Samples train data at random from the entire data set.
Example:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)
# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Due to the nature of this sampling technique, it is not necessary to have idx classifications since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummy idx vector that has the same number of observations n_observations as your data set. For instance:
from PCAfold import DataSampler
import numpy as np
# Generate dummy idx vector:
n_observations = 100
idx = np.zeros(n_observations)
# Instantiate DataSampler class object:
selection = DataSampler(idx)
# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Train data:
The total number of train samples is computed as a percentage perc of the total number of observations in a data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.
Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure is used to select test data as was used to select train data (only allowed if perc is less than 50%).
The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:
Here \(p\) is the percentage perc provided.
- Parameters
perc – percentage of data to be selected as training data from the entire data set. For instance, set perc=20 if you want to select 20%.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
Plotting functions#
This section includes functions for data preprocessing related plotting such as visualizing the formed clusters, visualizing the selected train and test samples or plotting the conditional statistics.
plot_2d_clustering#
- PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into clusters. The number of observations in each cluster is shown in the legend.
Example:
from PCAfold import variable_bins, plot_2d_clustering
import numpy as np
# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)
# Plot the clustering result:
plt = plot_2d_clustering(x,
                         y,
                         idx,
                         x_label='$x$',
                         y_label='$y$',
                         color_map='viridis',
                         first_cluster_index_zero=False,
                         grid_on=True,
                         figure_size=(10,6),
                         title='x-y data set',
                         save_filename='clustering.pdf')
plt.close()
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing but the data points is plotted.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether the grid should be plotted.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying figure size.
title – (optional) str specifying plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_3d_clustering#
- PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold divided into clusters. The number of observations in each cluster is shown in the legend.
Example:
from PCAfold import variable_bins, plot_3d_clustering
import numpy as np
# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10
# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)
# Plot the clustering result:
plt = plot_3d_clustering(x,
                         y,
                         z,
                         idx,
                         x_label='$x$',
                         y_label='$y$',
                         z_label='$z$',
                         color_map='viridis',
                         first_cluster_index_zero=False,
                         figure_size=(10,6),
                         title='x-y-z data set',
                         save_filename='clustering.pdf')
plt.close()
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
z – numpy.ndarray specifying the variable on the \(z\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
elev – (optional) elevation angle.
azim – (optional) azimuth angle.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
z_label – (optional) str specifying \(z\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying figure size.
title – (optional) str specifying plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_2d_train_test_samples#
- PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into train and test samples. The numbers of observations in the train and test data are shown in the legend.
Example:
from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np
# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
# Generate dummy clustering of the data set:
(idx, borders) = variable_bins(x, 4, verbose=False)
# Generate dummy sampling of the data set:
sample = DataSampler(idx, random_seed=None, verbose=True)
(idx_train, idx_test) = sample.number(40, test_selection_option=1)
# Plot the sampling result:
plt = plot_2d_train_test_samples(x,
                                 y,
                                 idx,
                                 idx_train,
                                 idx_test,
                                 x_label='$x$',
                                 y_label='$y$',
                                 color_map='viridis',
                                 first_cluster_index_zero=False,
                                 grid_on=True,
                                 figure_size=(12,6),
                                 title='x-y data set',
                                 save_filename='sampling.pdf')
plt.close()
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
idx_train – numpy.ndarray specifying the indices of the train data. It should be of size (n_train,) or (n_train,1).
idx_test – numpy.ndarray specifying the indices of the test data. It should be of size (n_test,) or (n_test,1).
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether the grid should be plotted.
figure_size – (optional) tuple specifying figure size.
title – (optional) str specifying plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
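The relationship between idx, idx_train and idx_test can be checked before plotting. The sketch below is not part of PCAfold: the helper train_test_counts and the sampling scheme (10 train observations drawn from each cluster, everything else used as test data) are assumptions made for illustration, using NumPy only. It computes the kind of train and test counts that the legend reports.

```python
import numpy as np

def train_test_counts(idx, idx_train, idx_test):
    """Count train and test observations, in total and per cluster.

    Hypothetical helper for illustration; not a PCAfold function.
    """
    # Accept vectors of size (n,) or (n,1):
    idx = np.asarray(idx).ravel()
    idx_train = np.asarray(idx_train).ravel()
    idx_test = np.asarray(idx_test).ravel()

    per_cluster = {}
    for k in np.unique(idx):
        n_train_k = int(np.sum(idx[idx_train] == k))
        n_test_k = int(np.sum(idx[idx_test] == k))
        per_cluster[int(k)] = (n_train_k, n_test_k)

    return idx_train.size, idx_test.size, per_cluster

# Dummy clustering: 100 observations split into 4 clusters of 25.
idx = np.repeat(np.arange(4), 25)

# Assumed sampling: 10 train observations per cluster, the rest is test data.
rng = np.random.default_rng(0)
idx_train = np.concatenate([
    rng.choice(np.where(idx == k)[0], size=10, replace=False)
    for k in range(4)
])
idx_test = np.setdiff1d(np.arange(idx.size), idx_train)

n_train, n_test, per_cluster = train_test_counts(idx, idx_train, idx_test)
print(n_train, n_test)   # 40 60
print(per_cluster)
```

Note that idx_train and idx_test hold indices of observations (row positions), not cluster labels, which is why they are used to index into idx.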
plot_conditional_statistics#
- PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given by variable and conditioning_variable and the selected conditional statistics (as per preprocess.ConditionalStatistics).
Example:
    from PCAfold import plot_conditional_statistics
    import numpy as np

    # Generate dummy variables:
    conditioning_variable = np.linspace(-1,1,100)
    y = -conditioning_variable**2 + 1

    # Plot the conditional statistics:
    plt = plot_conditional_statistics(y,
                                      conditioning_variable,
                                      k=10,
                                      x_label='$x$',
                                      y_label='$y$',
                                      figure_size=(10,3),
                                      title='Conditional mean',
                                      save_filename='conditional-mean.pdf')
    plt.close()
- Parameters
variable – numpy.ndarray specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
k – (optional) int specifying the number of bins to create in the conditioning variable. It has to be a positive number.
split_values – (optional) list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.
statistics_to_plot – (optional) list of str specifying conditional statistics to plot. The strings can be 'mean', 'min', 'max' or 'std'.
color – (optional) vector or string specifying color for the manifold. If it is a vector, its length has to be consistent with the number of observations in variable and conditioning_variable. It should be of type numpy.ndarray and size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, data will be plotted in black.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
colorbar_label – (optional) str specifying colorbar label annotation. If set to None, the colorbar label will not be plotted.
color_map – (optional) colormap to use as per matplotlib.cm. Default is 'viridis'.
figure_size – (optional) tuple specifying figure size.
title – (optional) str specifying plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
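To make the roles of k, split_values and the plotted statistics concrete, the following NumPy-only sketch bins the conditioning variable and evaluates the mean, minimum, maximum and standard deviation of the dependent variable in each bin. It is an illustration under stated assumptions, not the implementation used by preprocess.ConditionalStatistics: the function conditional_statistics is hypothetical, bins are assumed to be equal-width when split_values is None, and statistics are reported at bin centers.

```python
import numpy as np

def conditional_statistics(variable, conditioning_variable, k=20, split_values=None):
    """Compute per-bin statistics of `variable` conditioned on `conditioning_variable`.

    Hypothetical helper for illustration; not a PCAfold function.
    """
    # Accept vectors of size (n,) or (n,1):
    y = np.asarray(variable, dtype=float).ravel()
    x = np.asarray(conditioning_variable, dtype=float).ravel()

    if split_values is None:
        # Assumption: k equal-width bins spanning the conditioning variable.
        edges = np.linspace(x.min(), x.max(), k + 1)
    else:
        # User-supplied interior splits, bracketed by the data range.
        edges = np.concatenate(([x.min()], np.sort(split_values), [x.max()]))

    n_bins = edges.size - 1
    # Only interior edges are passed, so the maximum lands in the last bin.
    bin_index = np.digitize(x, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])

    names = ("mean", "min", "max", "std")
    statistics = {name: np.full(n_bins, np.nan) for name in names}
    for b in range(n_bins):
        in_bin = y[bin_index == b]
        if in_bin.size > 0:  # empty bins stay NaN
            statistics["mean"][b] = in_bin.mean()
            statistics["min"][b] = in_bin.min()
            statistics["max"][b] = in_bin.max()
            statistics["std"][b] = in_bin.std()

    return centers, statistics

# Same dummy variables as in the example above:
conditioning_variable = np.linspace(-1, 1, 100)
y = -conditioning_variable**2 + 1

centers, statistics = conditional_statistics(y, conditioning_variable, k=10)
print(centers.shape, statistics["mean"].shape)
```

Plotting statistics["mean"] against centers on top of the scatter of y versus conditioning_variable gives a picture comparable in spirit to the conditional mean drawn by plot_conditional_statistics, although the exact bin placement in PCAfold may differ.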
Bibliography#
- PBis06
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
- PCVJ21
Gunnar Carlsson and Mikael Vejdemo-Johansson. Topological Data Analysis with Applications. Cambridge University Press, 2021.
- PCGP12
Axel Coussement, Olivier Gicquel, and Alessandro Parente. Kernel density weighted principal component analysis of combustion processes. Combustion and Flame, 159(9):2844–2855, 2012.
- PDMJRM00
Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.
- PELL09
Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009. ISBN 0340761199.
- PGSB04
Abdul A. Gill, George D. Smith, and Anthony J. Bagnall. Improving decision tree performance through induction- and cluster-based stratified sampling. In International Conference on Intelligent Data Engineering and Automated Learning, 339–344. Springer, 2004.
- PHG09
Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
- PKR09
Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Volume 344. John Wiley & Sons, 2009.
- PKK04
Michael R Keenan and Paul G Kotula. Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images. Surface and Interface Analysis, 36(3):203–212, 2004.
- PKEA+03
Hector C Keun, Timothy MD Ebbels, Henrik Antti, Mary E Bollard, Olaf Beckonert, Elaine Holmes, John C Lindon, and Jeremy K Nicholson. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490(1-2):265–276, 2003.
- PMMD10
Robert J. May, Holger R. Maier, and Graeme C. Dandy. Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks, 23(2):283–294, 2010.
- PNey92
Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in Statistics, pages 123–150. Springer, 1992.
- PNod08
Isao Noda. Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure, 883:216–227, 2008.
- PPS13
Alessandro Parente and James C. Sutherland. Principal component analysis of turbulent combustion data: data pre-processing and manifold sensitivity. Combustion and Flame, 160(2):340–350, 2013.
- PPSTS09
Alessandro Parente, James C. Sutherland, Leonardo Tognotti, and Philip J. Smith. Identification of low-dimensional manifolds in turbulent flames. Proceedings of the Combustion Institute, 32(1):1579–1586, 2009.
- PRLM+16
Mojdeh Rastgoo, Guillaume Lemaitre, Joan Massich, Olivier Morel, Franck Marzani, Rafael Garcia, and Fabrice Meriaudeau. Tackling the problem of data imbalancing for melanoma classification. BIOSTEC - 3rd International Conference on BIOIMAGING, 2016.
- PSCSC03
Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering, 2003.
- PvdBHW+06
Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1):1–15, 2006.