Module: preprocess#
The preprocess
module can be used for performing data preprocessing
including centering and scaling, outlier detection and removal, kernel density
weighting of data sets, data clustering and data sampling. It also includes
functionalities that allow the user to perform initial data inspection such
as computing conditional statistics, calculating statistically representative sample sizes,
or ordering variables in a data set according to a criterion.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold
is the input parameter X
, which should be of type numpy.ndarray
and of size (n_observations,n_variables)
.
Data manipulation#
This section includes functions for performing basic data manipulation such as centering and scaling, nonlinear transformations, and outlier detection and removal.
Class PreProcessing
#
- class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#
Performs a composition of data manipulation done by
remove_constant_vars
andcenter_scale
functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:checks for constant columns in a data set and removes them,
centers and scales the data.
Example:
from PCAfold import PreProcessing import numpy as np # Generate dummy data set with a constant variable: X = np.random.rand(100,20) X[:,5] = np.ones((100,)) # Instantiate PreProcessing class object: preprocessed = PreProcessing(X, 'range', nocenter=False)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
Attributes:
X_removed - (read only)
numpy.ndarray
specifying the original data set with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed - (read only)
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained - (read only)
list
specifying the indices of columns retained in \(\mathbf{X}\).X_cs - (read only)
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center - (read only)
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale - (read only)
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.
Preprocessing tools#
- PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#
Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).
Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):
\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]Centers for each column are computed as:
\[c_j = mean(X_j)\]with the only exceptions of
'0to1'
and'-1to1'
scalings, which introduce a different quantity to center each column.Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):
\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]If both centering and scaling is applied:
\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]Several scaling options are implemented here:
Scaling method
scaling
Scaling factor \(d_j\)
None
'none'
or''
1
Auto [vdBHW+06]
'auto'
or'std'
\(\sigma\)
Pareto [Nod08]
'pareto'
\(\sqrt{\sigma}\)
VAST [KEA+03]
'vast'
\(\sigma^2 / mean(X_j)\)
Range [vdBHW+06]
'range'
\(max(X_j) - min(X_j)\)
0 to 1'0to1'
\(d_j = max(X_j) - min(X_j)\)\(c_j = min(X_j)\)-1 to 1'-1to1'
\(d_j = 0.5 \cdot (max(X_j) - min(X_j))\)\(c_j = 0.5 \cdot (max(X_j) + min(X_j))\)Level [vdBHW+06]
'level'
\(mean(X_j)\)
Max
'max'
\(max(X_j)\)
Variance
'variance'
\(var(X_j)\)
Median
'median'
\(median(X_j)\)
Poisson [KK04]
'poisson'
\(\sqrt{mean(X_j)}\)
S1
'vast_2'
\(\sigma^2 k^2 / mean(X_j)\)
S2
'vast_3'
\(\sigma^2 k^2 / max(X_j)\)
S3
'vast_4'
\(\sigma^2 k^2 / (max(X_j) - min(X_j))\)
L2-norm
'l2-norm'
\(\|X_j\|_2\)
where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).
The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PS13].
Example:
from PCAfold import center_scale import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Center and scale: (X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
,'l2-norm'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
- Returns:
X_cs -
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size(n_observations,n_variables)
.X_center -
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.X_scale -
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.
- PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#
Inverts whatever centering and scaling was done by the
center_scale
function:\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]Example:
from PCAfold import center_scale, invert_center_scale import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Center and scale: (X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False) # Uncenter and unscale: X = invert_center_scale(X_cs, X_center, X_scale)
- Parameters:
X_cs –
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center –
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale –
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.
- Returns:
X -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It has size(n_observations,n_variables)
.
- PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#
Performs a power transformation of the provided data. The equation for the transformation of variable \(X\) is
\[(|X + s_1|)^\alpha \text{sign}(X + s_1) + s_2 \text{sign}(X + s_1)\]where \(\alpha\) is the
transform_power
, \(s_1\) is thetransform_shift
, and \(s_2\) is thetransform_sign_shift
.Example:
from PCAfold import power_transform import numpy as np # Generate dummy data set: X = np.random.rand(100,20) + 1 # Perform power transformation: X_pow = power_transform(X, 0.5) # undo the transformation: X_orig = power_transform(X_pow, 0.5, invert=True)
- Parameters:
X – array of the variable(s) to be transformed
transform_power – the power parameter used in the transformation equation
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation
invert – (optional, default False) when True, will undo the transformation
- Returns:
array of the transformed variables
- PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#
Performs log transformation of the original data set, \(\mathbf{X}\).
For an example original function:
The symlog transformation can be obtained with
method='symlog'
:The continuous symlog transformation can be obtained with
method='continuous-symlog'
:Example:
from PCAfold import log_transform import numpy as np # Generate dummy data set: X = np.random.rand(100,20) + 1 # Perform log transformation: X_log = log_transform(X) # Perform symlog transformation: X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
specifying the log-transformation method. It can be one of the following:log
,ln
,symlog
,continuous-symlog
.threshold – (optional)
float
orint
specifying the threshold for symlog transformation.
- Returns:
X_transformed -
numpy.ndarray
specifying the log-transformed data set. It has size(n_observations,n_variables)
.
- PCAfold.preprocess.zero_pivot_transform(X, method='-1to1', verbose=False)#
Scales negative and positive values according to a user-specified method, but preserves values that are zero. This function treats zero values in a variable as pivots that remain unchanged by scaling.
Example:
from PCAfold import zero_pivot_transform import numpy as np # Generate dummy data set: X = np.array([[1,10], [2,0], [3,30], [0,0], [-1,-10]]) # Perform transformation with zero pivot: X_transformed, maximum_positive, minimum_negative = zero_pivot_transform(X, verbose=True)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
specifying the log-transformation method. It can be one of the following:'-1to1'
.verbose – (optional)
bool
for printing verbose details.
- Returns:
X_transformed -
numpy.ndarray
specifying the transformed data set. It has size(n_observations,n_variables)
.maximum_positive -
numpy.ndarray
specifying the maximum positive value in a dataset. It has size(n_variables,)
.minimum_negative -
numpy.ndarray
specifying the minimum negative value in a dataset. It has size(n_variables,)
.
- PCAfold.preprocess.invert_zero_pivot_transform(X_transformed, maximum_positive, minimum_negative, method='-1to1')#
Inverts the zero-pivot transform.
Example:
from PCAfold import zero_pivot_transform, invert_zero_pivot_transform import numpy as np # Generate dummy data set: X = np.array([[1,10], [2,0], [3,30], [0,0], [-1,-10]]) # Perform transformation with zero pivot: X_transformed, maximum_positive, minimum_negative = zero_pivot_transform(X, verbose=True) # Undo transformation with zero pivot: X_back = invert_zero_pivot_transform(X_transformed, maximum_positive, minimum_negative)
- Parameters:
X_transformed –
numpy.ndarray
specifying the transformed data set, \(\widetilde{\mathbf{X}}\). It should be of size(n_observations,n_variables)
.maximum_positive –
numpy.ndarray
specifying the maximum positive value in a dataset. It should be of size(n_variables,)
.minimum_negative –
numpy.ndarray
specifying the minimum negative value in a dataset. It should be of size(n_variables,)
.method – (optional)
str
specifying the log-transformation method. It can be one of the following:'-1to1'
.
- Returns:
X -
numpy.ndarray
specifying the original data set. It has size(n_observations,n_variables)
.
- PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#
Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:
The maximum of an absolute value of a column \(X_j\) is less than
maxtol
:
\[max(|X_j|) < \verb|maxtol|\]The ratio of the range of values in a column \(X_j\) to \(max(|X_j|)\) is less than
rangetol
:
\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]Specifically, it can be used as preprocessing for PCA so the eigenvalue calculation doesn’t break.
Example:
from PCAfold import remove_constant_vars import numpy as np # Generate dummy data set with a constant variable: X = np.random.rand(100,20) X[:,5] = np.ones((100,)) # Remove the constant column: (X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.maxtol – (optional)
float
specifying the tolerance for \(max(|X_j|)\).rangetol – (optional)
float
specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).
- Returns:
X_removed -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\) with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed -
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained -
list
specifying the indices of columns retained in \(\mathbf{X}\).
- PCAfold.preprocess.order_variables(X, method='mean', descending=True)#
Orders variables in the original data set, \(\mathbf{X}\), using a selected method.
Example:
from PCAfold import order_variables import numpy as np # Generate a dummy data set: X = np.array([[100, 1, 10], [200, 2, 20], [300, 3, 30]]) # Order variables by the mean value in the descending order: (X_ordered, idx) = order_variables(X, method='mean', descending=True)
The code above should return an ordered data set:
array([[100, 10, 1], [200, 20, 2], [300, 30, 3]])
and the list of ordered variable indices:
[1, 2, 0]
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
orlist
ofint
specifying the ordering method. Ifstr
, it can be one of the following:'mean'
,'min'
,'max'
,'std'
or'var'
. Iflist
, it is a custom user-provided list of indices for how the variables should be ordered.descending – (optional)
bool
specifying whether variables should be ordered in the descending order. If set toFalse
, variables will be ordered in the ascending order.
- Returns:
X_ordered -
numpy.ndarray
specifying the original data set with ordered variables. It has size(n_observations,n_variables)
.idx -
list
specifying the indices of the ordered variables. It has lengthn_variables
.
- PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#
Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:
'MULTIVARIATE TRIMMING'
Outliers are detected based on multivariate Mahalanobis distance, \(D_M\):
\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\) storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per
PCA
class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered as outliers. For more detailed information on Mahalanobis distance the user is referred to [Bis06] or [DMJRM00].The threshold above which observations will be classified as outliers can be specified using
trimming_threshold
parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]'PC CLASSIFIER'
Outliers are detected based on major and minor principal components (PCs). The method of principal component classifier (PCC) was first proposed in [SCSC03]. The application of this technique to combustion data sets was studied in [PS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier based on \(q\)-first (major) PCs:
\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]or if the second PC classifier based on \((Q-k+1)\)-last (minor) PCs:
\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]where \(z_{ij}\) is the \(i^{th}, j^{th}\) element from the principal components matrix \(\mathbf{Z}\) and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per
PCA
class). Major PCs are selected such that the total variance explained is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.Coefficients \(c_1\) and \(c_2\) are found such that they represent the
quantile_threshold
(by default 98.99%) quantile of the empirical distributions of the first and second PC classifier respectively.Example:
from PCAfold import outlier_detection import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Find outliers: (idx_outliers_removed, idx_outliers) = outlier_detection(X, scaling='auto', method='MULTIVARIATE TRIMMING', trimming_threshold=0.8, verbose=True) # New data set without outliers can be obtained as: X_outliers_removed = X[idx_outliers_removed,:] # Observations that were classified as outliers can be obtained as: X_outliers = X[idx_outliers,:]
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.method – (optional)
str
specifying the outlier detection method to use. It should be'MULTIVARIATE TRIMMING'
or'PC CLASSIFIER'
.trimming_threshold – (optional)
float
specifying the trimming threshold to use in combination with'MULTIVARIATE TRIMMING'
method.quantile_threshold – (optional)
float
specifying the quantile threshold to use in combination with'PC CLASSIFIER'
method.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx_outliers_removed -
list
specifying the indices of observations without outliers.idx_outliers -
list
specifying the indices of observations that were classified as outliers.
- PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#
Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.
Two general approaches are available:
If
method='kl-divergence'
, the representative sample size is computed based on Kullback-Leibler divergence.If
method='mean'
,method='median'
,method='variance'
, ormethod='std'
, the representative sample size is computed based on convergence of a first order (mean or median) or of second order (variance, standard deviation) statistics.
Example:
from PCAfold import center_scale, representative_sample_size import numpy as np # Generate dummy data set and two dependent variables: x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100)) xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None])) phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2))) phi_1 = phi_1.ravel()[:,None] phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2))) phi_2 = phi_2.ravel()[:,None] depvars = np.column_stack((phi_1, phi_2)) depvars, _, _ = center_scale(depvars, scaling='0to1') # Specify the list of percentages to explore: percentages = list(np.linspace(1,99.9,200)) # Specify the list of thresholds for each dependent variable: thresholds = [10**-4, 10**-4] # Specify the names of the dependent variables: variable_names = ['Phi-1', 'Phi-2'] # Compute representative sample size for each dependent variable: (idx, sample_sizes, statistics) = representative_sample_size(depvars, percentages, thresholds=thresholds, variable_names=variable_names, method='kl-divergence', statistics='median', n_resamples=20, random_seed=100, verbose=True)
With
verbose=True
we will see some detailed information:Dependent variable Phi-1 ... KL divergence threshold used: 0.0001 Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data). Dependent variable Phi-2 ... KL divergence threshold used: 0.0001 Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
- Parameters:
depvars –
numpy.ndarray
specifying the dependent variables that should be well represented in a sampled dataset. . It should be of size(n_observations,n_dependent_variables)
.percentages –
list
of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.thresholds – (optional)
list
offloat
specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.variable_names – (optional)
list
ofstr
specifying names for all dependent variables. If set toNone
, dependent variables are called with consecutive integers.method – (optional)
str
specifying the method used to compute the sample size statistics. It can bemean
,median
,variance
,std
, or'kl-divergence'
.statistics – (optional)
str
specifying the overall statistics that should be computed from a given method. It can bemin
,max
,mean
, ormedian
.n_resamples – (optional)
int
specifying the number of resamples to perform for each percentage in thepercentages
vector. It is recommended to set this parameters to above 1, since it might accidentally happen that a random sample is statistically representative of the full dataset. Re-sampling helps to average-out the effect of such one-off “lucky” random samples.random_seed – (optional)
int
specifying the random seed.verbose – (optional)
bool
for printing verbose details.
- Returns:
threshold_idx -
list
ofint
specifying the highest indices from thepercentages
list where the representative number of samples condition was still met. It has lengthn_depvars
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the list for that dependent variable.representatitive_sample_sizes -
numpy.ndarray
ofint
specifying the representative number of samples. It has size(1,n_depvars)
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the array for that dependent variable.sample_size_statistics -
numpy.ndarray
specifying the full vector of computed statistics correponding to each entry inpercentages
and each dependent variable. It has size(n_percentages,n_depvars)
.
Class ConditionalStatistics
#
- class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#
Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:
conditional mean
conditional minimum
conditional maximum
conditional standard deviation
Other quantities can be added in the future at the user’s request.
Example:
from PCAfold import ConditionalStatistics import numpy as np # Generate dummy variables: conditioning_variable = np.linspace(-1,1,100) y = -conditioning_variable**2 + 1 # Instantiate an object of the ConditionalStatistics class # and compute conditional statistics in 10 bins of the conditioning variable: cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10) # Access conditional statistics: conditional_mean = cond.conditional_mean conditional_min = cond.conditional_minimum conditional_max = cond.conditional_maximum conditional_std = cond.conditional_standard_deviation # Access the centroids of the created bins: centroids = cond.centroids
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. It should be of size(n_observations,1)
or(n_observations,)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.verbose – (optional)
bool
for printing verbose details.
Attributes:
idx - (read only)
numpy.ndarray
of cluster (bins) classifications. It has size(n_observations,)
.borders - (read only)
list
of values that define borders for the clusters (bins). It has lengthk+1
.centroids - (read only)
list
of values that specify bins centers. It has lengthk
.conditional_mean - (read only)
numpy.ndarray
specifying the conditional means of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_minimum - (read only)
numpy.ndarray
specifying the conditional minimums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_maximum - (read only)
numpy.ndarray
specifying the conditional maximums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_standard_deviation - (read only)
numpy.ndarray
specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size(k,n_variables)
.
Class KernelDensity
#
- class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#
Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on single-variable or multi-variable case as proposed in [CGP12].
The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).
Note
Kernel density weighting technique is usually very expensive, even on data sets with relatively small number of observations. Since the single-variable case is a cheaper option than the multi-variable case, it is recommended that this technique is tried first for larger data sets.
Gaussian kernel is used in both approaches:
\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} exp(- \frac{d^2}{2 h^2})\]\(h\) is the kernel bandwidth:
\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.
\(d\) is the distance between two observations \(c\) and \(c'\):
\[d = |x_c - x_{c'}|\]Single-variable
If the
conditioning_variable
argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]and weights are then computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]Multi-variable
If the
conditioning_variable
argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for a \(k^{th}\) variable:\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]Global density taking into account all variables is then obtained as:
\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]where \(Q\) is the total number of conditioning variables, and weights are computed as:
\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]Example:
from PCAfold import KernelDensity import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Perform kernel density weighting based on the first variable: kerneld = KernelDensity(X, X[:,0]) # Access the weighted data set: X_weighted = kerneld.X_weighted # Access the weights used to scale the data set: weights = kerneld.weights
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying either a single variable or multiple variables to be used as a conditioning variable for kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).
Attributes:
weights -
numpy.ndarray
specifying the computed weights, \(\mathbf{W_c}\). It has size(n_observations,1)
.X_weighted -
numpy.ndarray
specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size(n_observations,n_variables)
.
Class DensityEstimation
#
- class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#
Enables density estimation on point-cloud data.
Example:
from PCAfold import PCA, DensityEstimation import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Instantiate PCA class object: pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False) # Calculate the principal components: principal_components = pca_X.transform(X) # Instantiate an object of the DensityEstimation class: density_estimation = DensityEstimation(principal_components, n_neighbors=10)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_neighbors –
int
specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.
- PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#
Computes an average Euclidean distances to \(k\) nearest neighbors on a manifold defined by the independent variables.
Example:
from PCAfold import PCA, DensityEstimation import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Instantiate PCA class object: pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False) # Calculate the principal components: principal_components = pca_X.transform(X) # Instantiate an object of the DensityEstimation class: density_estimation = DensityEstimation(principal_components, n_neighbors=10) # Compute average distances on a manifold defined by the PCs: average_distances = density_estimation.average_knn_distance(verbose=True)
With
verbose=True
, minimum, maximum and average distance will be printed:Minimum distance: 0.1388300829487847 Maximum distance: 0.4689587542132183 Average distance: 0.20824964953425693 Median distance: 0.18333873029179215
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Parameters:
verbose – (optional)
bool
for printing verbose details.- Returns:
average_distances -
numpy.ndarray
specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size(n_observations,)
.
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#
Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [CVJ21]. This value has an interpretation of a data codensity defined as:
\[\delta_k(x) = d(x, v_k(x))\]where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).
Example:
from PCAfold import PCA, DensityEstimation import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Instantiate PCA class object: pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False) # Calculate the principal components: principal_components = pca_X.transform(X) # Instantiate an object of the DensityEstimation class: density_estimation = DensityEstimation(principal_components, n_neighbors=10) # Compute the distance to the kth nearest neighbor: data_codensity = density_estimation.kth_nearest_neighbor_codensity()
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Returns:
data_codensity -
numpy.ndarray
specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#
Computes an inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [CVJ21]. This value has an interpretation of a data density defined as:
\[\rho_k(x) = \frac{1}{\delta_k(x)}\]where \(\delta_k(x)\) is the codensity.
Example:
from PCAfold import PCA, DensityEstimation import numpy as np # Generate dummy data set: X = np.random.rand(100,20) # Instantiate PCA class object: pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False) # Calculate the principal components: principal_components = pca_X.transform(X) # Instantiate an object of the DensityEstimation class: density_estimation = DensityEstimation(principal_components, n_neighbors=10) # Compute the distance to the kth nearest neighbor: data_density = density_estimation.kth_nearest_neighbor_density()
Note
This function requires the
scikit-learn
module. You can install it through:pip install scikit-learn
- Returns:
data_density -
numpy.ndarray
specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
Data clustering#
This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [ELL09], [KR09].
Clustering functions#
Each function that clusters the data set returns a vector of integers idx
of type numpy.ndarray
of size (n_observations,)
that specifies
classification of each observation from the original data set,
\(\mathbf{X}\), to a local cluster.
Note
The first cluster has index 0
within all idx
vectors returned.
- PCAfold.preprocess.variable_bins(var, k, verbose=False)#
Clusters the data by dividing a variable vector
var
into bins of equal lengths.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import variable_bins import numpy as np # Generate dummy variable: x = np.linspace(-1,1,100) # Create partitioning according to bins of x: (idx, borders) = variable_bins(x, 4, verbose=True)
- Parameters:
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
- PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#
Clusters the data by dividing a variable vector
var
into bins such that splits are done at user-specified values. Split values can be specified in thesplit_values
list. In general:split_values = [value_1, value_2, ..., value_n]
.Note: When a split is performed at a given
value_i
, the observation invar
that takes exactly that value is assigned to the newly created bin.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import predefined_variable_bins import numpy as np # Generate dummy variable: x = np.linspace(-1,1,100) # Create partitioning according to pre-defined bins of x: (idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
- Parameters:
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.split_values –
list
specifying values at which splits should be performed.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
- PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#
Clusters the data by dividing a mixture fraction vector
Z
into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PSTS09]. The vector is first split to lean and rich side (according to the stoichiometric mixture fractionZ_stoich
) and then the sides get divided further into clusters. Whenk
is odd, there will always be one more cluster on the side with larger range in mixture fraction space compared to the other side.An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import mixture_fraction_bins import numpy as np # Generate dummy mixture fraction variable: Z = np.linspace(0,1,100) # Create partitioning according to bins of mixture fraction: (idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
- Parameters:
Z –
numpy.ndarray
specifying the mixture fraction values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.Z_stoich –
float
specifying the stoichiometric mixture fraction. It has to be between 0 and 1.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
- PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#
Clusters the data by separating close-to-zero observations in a vector into one cluster (
split_at_zero=False
) or two clusters (split_at_zero=True
). The offset from zero at which splits are performed is computed based on the input parameterzero_offset_percentage
:\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]Further clusters are found by splitting positive and negative values in a vector alternatingly into bins of equal lengths.
This clustering technique can be useful for partitioning any variable that has many observations clustered around zero value and relatively few observations far away from zero on either side.
Two examples of how a vector can be partitioned with this function are presented below:
With
split_at_zero=False
:
If
split_at_zero=False
the smallest allowed number of clusters is 3. This is to assure that there are at least three clusters: with negative values, with close to zero values, with positive values.When
k
is even, there will always be one more cluster on the side with larger range compared to the other side.With
split_at_zero=True
:
If
split_at_zero=True
the smallest allowed number of clusters is 4. This is to assure that there are at least four clusters: with negative values, with negative values close to zero, with positive values close to zero and with positive values.When
k
is odd, there will always be one more cluster on the side with larger range compared to the other side.Note
This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\), (as per [SP09]) since it relies on unbalanced vectors that have many observations numerically close to zero. Using
split_at_zero=True
it can further differentiate between negative and positive sources.Example:
from PCAfold import zero_neighborhood_bins import numpy as np # Generate dummy variable: x = np.linspace(-100,100,1000) # Create partitioning according to bins of x: (idx, borders) = zero_neighborhood_bins(x, k=4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
- Parameters:
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 ifsplit_at_zero=False
or smaller than 4 ifsplit_at_zero=True
.zero_offset_percentage – (optional) percentage of \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero value. For instance, set
zero_offset_percentage=10
if you want 10% as offset.split_at_zero – (optional)
bool
specifying whether partitioning should be done atvar=0
.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
Auxiliary functions#
- PCAfold.preprocess.degrade_clusters(idx, verbose=False)#
Re-numerates clusters if either of these two cases is true:
idx
is composed of non-consecutive integers, orthe smallest cluster index in
idx
is not equal to0
.
Example:
from PCAfold import degrade_clusters import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 2, 0, 5, 10]) # Degrade clusters: (idx_degraded, k_update) = degrade_clusters(idx)
The code above will produce:
>>> idx_degraded array([0, 0, 1, 0, 2, 3])
Alternatively:
from PCAfold import degrade_clusters import numpy as np # Generate dummy idx vector: idx = np.array([1, 1, 2, 2, 3, 3]) # Degrade clusters: (idx_degraded, k_update) = degrade_clusters(idx)
will produce:
>>> idx_degraded array([0, 0, 1, 1, 2, 2])
- Parameters:
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing verbose details.
- Returns:
idx_degraded -
numpy.ndarray
of degraded cluster classifications. It has size(n_observations,)
.k_update -
int
specifying the updated number of clusters.
- PCAfold.preprocess.flip_clusters(idx, dictionary)#
Flips cluster labelling according to instructions provided in a dictionary. For a
dictionary = {key : value}
, a cluster with a numberkey
will get a numbervalue
.Example:
from PCAfold import flip_clusters import numpy as np # Generate dummy idx vector: idx = np.array([0,0,0,1,1,1,1,2,2]) # Swap cluster number 1 with cluster number 2: flipped_idx = flip_clusters(idx, {1:2, 2:1})
The code above will produce:
>>> flipped_idx array([0, 0, 0, 2, 2, 2, 2, 1, 1])
Note
This function can also be used to merge clusters. Using the
idx
from the example above, if we call:flipped_idx = flip_clusters(idx, {2:1})
the result will be:
>>> flipped_idx array([0,0,0,1,1,1,1,1,1])
where clusters
1
and2
have been merged into one cluster numbered1
.- Parameters:
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.dictionary –
dict
specifying instructions for cluster label flipping.
- Returns:
flipped_idx -
numpy.ndarray
specifying the re-labelled cluster classifications. It has size(n_observations,)
.
- PCAfold.preprocess.get_centroids(X, idx)#
Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the
idx
vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster, is computed as:\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]Centroids for all variables from all clusters are stored in the matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\) returned:
\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]Example:
from PCAfold import get_centroids import numpy as np # Generate dummy data set: X = np.random.rand(100,5) # Generate dummy clustering of the data set: idx = np.zeros((100,)) idx[50:80] = 1 idx = idx.astype(int) # Compute the centroids of each cluster: centroids = get_centroids(X, idx)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns:
centroids -
numpy.ndarray
specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size(k,n_variables)
.
- PCAfold.preprocess.get_partition(X, idx)#
Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to
idx
provided.Example:
from PCAfold import get_partition import numpy as np # Generate dummy data set: X = np.random.rand(100,5) # Generate dummy clustering of the data set: idx = np.zeros((100,)) idx[50:80] = 1 idx = idx.astype(int) # Generate partitioning of the data set according to idx: (X_in_clusters, idx_in_clusters) = get_partition(X, idx)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns:
X_in_clusters -
list
of \(k\)numpy.ndarray
that contains original data set observations partitioned to \(k\) clusters. It has lengthk
.idx_in_clusters -
list
of \(k\)numpy.ndarray
that contains indices of the original data set observations partitioned to \(k\) clusters. It has lengthk
.
- PCAfold.preprocess.get_populations(idx)#
Computes populations (number of observations) in clusters specified in the
idx
vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster this function will return a list:[100, 500]
.Example:
from PCAfold import variable_bins, get_populations import numpy as np # Generate dummy partitioning: x = np.linspace(-1,1,100) (idx, borders) = variable_bins(x, 4, verbose=True) # Compute cluster populations: populations = get_populations(idx)
The code above will produce:
>>> populations [25, 25, 25, 25]
- Parameters:
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.- Returns:
populations -
list
of cluster populations. Each entry referes to one cluster ordered according toidx
. It has lengthk
.
- PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#
Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.
The average can be computed as an arithmetic average from all clusters (
weighted=False
) or as a weighted average (weighted=True
). In the latter, the distances are weighted by the number of observations in a cluster so that the average centroid distance will approach the average distance in the largest cluster.Example:
from PCAfold import get_average_centroid_distance import numpy as np # Generate dummy data set: X = np.random.rand(100,5) # Generate dummy clustering of the data set: idx = np.zeros((100,)) idx[50:80] = 1 idx = idx.astype(int) # Compute average distance from cluster centroids: average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
- Parameters:
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.weighted – (optional)
bool
specifying whether distances from centroid should be weighted by the number of observations in a cluster. If set toFalse
, arithmetic average will be computed.
- Returns:
average_centroid_distance -
float
specifying the average distance from centroids, averaged over all observations and all clusters.
Data sampling#
This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting that can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [Ney92], such as drawing samples at certain amount from local clusters [MMD10], [GSB04]. These functionalities can be specifically used to tackle imbalanced data sets [HG09], [RLM+16].
The general idea is to divide the entire data set X
(or its portion) into train and test samples as presented below:
Train data is always sampled in the same way for a given sampling function.
Depending on the option selected, test data will be sampled differently, either as all
remaining samples that were not included in train data or as a subset of those.
You can select the option by setting the test_selection_option
parameter for each sampling function.
Reach out to the documentation for a specific sampling function to see what options are available.
All splitting functions in this module return a tuple of two variables: (idx_train, idx_test)
.
Both idx_train
and idx_test
are vectors of integers of type numpy.ndarray
and of size (_,)
.
These variables contain indices of observations that went into train data and test data respectively.
In your model learning algorithm you can then get the train and test observations, for instance in the following way:
X_train = X[idx_train,:]
X_test = X[idx_test,:]
All functions are equipped with verbose
parameter. If it is set to True
some additional information on train and test selection is printed.
Note
It is assumed that the first cluster has index 0
within all input idx
vectors.
Class DataSampler
#
- class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#
Enables selecting train and test data samples.
Example:
from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1]) # Instantiate DataSampler class object: selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
- Parameters:
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_test – (optional)
numpy.ndarray
specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices inidx_test
and the test data will be returned the same as the user-providedidx_test
. If not specified, test samples will be selected according to thetest_selection_option
parameter (see documentation for each sampling function). Setting fixedidx_test
parameter may be useful if training a machine learning model on specific test samples is desired. It should be of size(n_test_samples,)
or(n_test_samples,1)
.random_seed – (optional)
int
specifying random seed for random sample selection.verbose – (optional)
bool
for printing verbose details.
- PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.
Example:
from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1]) # Instantiate DataSampler class object: selection = DataSampler(idx, verbose=True) # Generate sampling: (idx_train, idx_test) = selection.number(20, test_selection_option=1)
Train data:
The number of train samples is estimated based on the percentage
perc
provided. First, the total number of samples for training is estimated as a percentageperc
from the total number of observationsn_observations
in a data set. Next, this number is divided equally into \(k\) clusters. The resultn_of_samples
is the number of samples that will be selected from each cluster:\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
, the smallest cluster is found and the remaining number of observations \(m\) are taken as test data in that cluster. Next, the same number of samples \(m\) is taken from all remaining larger clusters.The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).
- Parameters:
perc – percentage of data to be selected as training data from the entire data set. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns:
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
- PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a certain percentage
perc
from every cluster as the training data.Example:
from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1]) # Instantiate DataSampler class object: selection = DataSampler(idx, verbose=True) # Generate sampling: (idx_train, idx_test) = selection.percentage(20, test_selection_option=1)
Note: If the cluster sizes are comparable, this function will give a similar train sample distribution as random sampling (
DataSampler.random
). This sampling can be useful in cases where one cluster is significantly smaller than others and there is a chance that this cluster will not get covered in the train data if random sampling was used.Train data:
The number of train samples is estimated based on the percentage
perc
provided. First, the size of the \(i^{th}\) cluster is estimatedcluster_size_i
and then a percentageperc
of that number is selected.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(p\) is the percentage
perc
provided.- Parameters:
perc – percentage of data to be selected as training data from each cluster. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns:
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
- PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#
Uses classifications into \(k\) clusters and a dictionary
sampling_dictionary
in which you manually specify what'percentage'
(or what'number'
) of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as peridx
and the dictionary values are either percentage or number of train samples to be selected. The default dictionary values are percentage but you can selectsampling_type='number'
in order to interpret the values as a number of samples.Example:
from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]) # Instantiate DataSampler class object: selection = DataSampler(idx, verbose=True) # Generate sampling: (idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)
Train data:
The number of train samples selected from each cluster is estimated based on the
sampling_dictionary
. Forkey : value
, percentagevalue
(or numbervalue
) of samples will be selected from clusterkey
.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as number or as percentage. This can be achieved by setting:
sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
- Parameters:
sampling_dictionary –
dict
specifying manual sampling. Keys are cluster classifications and values are eitherpercentage
ornumber
of samples to be taken from that cluster. Keys should match the cluster classifications as peridx
.sampling_type – (optional)
str
specifying whether percentage or number is given in thesampling_dictionary
. Available options:percentage
ornumber
. The default ispercentage
.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns:
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
- PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#
Samples train data at random from the entire data set.
Example:
from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1]) # Instantiate DataSampler class object: selection = DataSampler(idx, verbose=True) # Generate sampling: (idx_train, idx_test) = selection.random(20, test_selection_option=1)
Due to the nature of this sampling technique, it is not necessary to have
idx
classifications since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummyidx
vector that has the same number of observationsn_observations
as your data set. For instance:from PCAfold import DataSampler import numpy as np # Generate dummy idx vector: n_observations = 100 idx = np.zeros(n_observations) # Instantiate DataSampler class object: selection = DataSampler(idx) # Generate sampling: (idx_train, idx_test) = selection.random(20, test_selection_option=1)
Train data:
The total number of train samples is computed as a percentage
perc
from the total number of observations in a data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.Test data:
Two options for sampling test data are implemented. If you select
test_selection_option=1
all remaining samples that were not taken as train data become the test data. If you selecttest_selection_option=2
the same procedure is used to select test data as was used to select train data (only allowed ifperc
is less than 50%).The scheme below presents graphically how train and test data can be selected using
test_selection_option
parameter:Here \(p\) is the percentage
perc
provided.- Parameters:
perc – percentage of data to be selected as training data from each cluster. Set
perc=20
if you want 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns:
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
Plotting functions#
This section includes functions for data preprocessing related plotting such as visualizing the formed clusters, visualizing the selected train and test samples or plotting the conditional statistics.
- PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into clusters. Number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_2d_clustering import numpy as np # Generate dummy data set: x = np.linspace(-1,1,100) y = -x**2 + 1 # Generate dummy clustering of the data set: (idx, _) = variable_bins(x, 4, verbose=False) # Plot the clustering result: plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(10,6), title='x-y data set', save_filename='clustering.pdf') plt.close()
- Parameters:
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.clean – (optional)
bool
specifying if a clean plot should be made. If set toTrue
, nothing else but the data points is plotted.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.alphas – (optional)
list
specifying the opacity of each cluster.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.grid_on –
bool
specifying whether grid should be plotted.s – (optional)
int
orfloat
specifying the scatter point size.markerscale – (optional)
int
orfloat
specifying the scale for the legend marker.legend – (optional)
bool
specifying the whether legend should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns:
plt -
matplotlib.pyplot
plot handle.
- PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold divided into clusters. Number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_3d_clustering import numpy as np # Generate dummy data set: x = np.linspace(-1,1,100) y = -x**2 + 1 z = x + 10 # Generate dummy clustering of the data set: (idx, _) = variable_bins(x, 4, verbose=False) # Plot the clustering result: plt = plot_3d_clustering(x, y, z, idx, x_label='$x$', y_label='$y$', z_label='$z$', color_map='viridis', first_cluster_index_zero=False, figure_size=(10,6), title='x-y-z data set', save_filename='clustering.pdf') plt.close()
- Parameters:
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.z –
numpy.ndarray
specifying the variable on the \(z\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.elev – (optional) elevation angle.
azim – (optional) azimuth angle.
x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.z_label – (optional)
str
specifying \(z\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.alphas – (optional)
list
specifying the opacity of each cluster.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.s – (optional)
int
orfloat
specifying the scatter point size.markerscale – (optional)
int
orfloat
specifying the scale for the legend marker.legend – (optional)
bool
specifying the whether legend should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns:
plt -
matplotlib.pyplot
plot handle.
- PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into train and test samples. Number of observations in train and test data respectively will be plotted in the legend.
Example:
from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples import numpy as np # Generate dummy data set: x = np.linspace(-1,1,100) y = -x**2 + 1 # Generate dummy clustering of the data set: (idx, borders) = variable_bins(x, 4, verbose=False) # Generate dummy sampling of the data set: sample = DataSampler(idx, random_seed=None, verbose=True) (idx_train, idx_test) = sample.number(40, test_selection_option=1) # Plot the sampling result: plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(12,6), title='x-y data set', save_filename='sampling.pdf') plt.close()
- Parameters:
x –
numpy.ndarray
specifying the variable on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.y –
numpy.ndarray
specifying the variable on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_train –
numpy.ndarray
specifying the indices of the train data. It should be of size(n_train,)
or(n_train,1)
.idx_test –
numpy.ndarray
specifying the indices of the test data. It should be of size(n_test,)
or(n_test,1)
.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.color_map – (optional)
str
ormatplotlib.colors.ListedColormap
specifying the colormap to use as permatplotlib.cm
. Default is'viridis'
.first_cluster_index_zero – (optional)
bool
specifying if the first cluster should be indexed0
on the plot. If set toFalse
the first cluster will be indexed1
.grid_on –
bool
specifying whether grid should be plotted.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns:
plt -
matplotlib.pyplot
plot handle.
- PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given by
variable
andconditioning_variable
and the selected conditional statistics (as perpreprocess.ConditionalStatistics
).Example:
from PCAfold import PCA, plot_conditional_statistics import numpy as np # Generate dummy variables: conditioning_variable = np.linspace(-1,1,100) y = -conditioning_variable**2 + 1 # Plot the conditional statistics: plt = plot_conditional_statistics(y, conditioning_variable, k=10, x_label='$x$', y_label='$y$', figure_size=(10,3), title='Conditional mean', save_filename='conditional-mean.pdf') plt.close()
- Parameters:
variable –
numpy.ndarray
specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.statistics_to_plot –
list
ofstr
specifying conditional statistics to plot. The strings can bemean
,min
,max
orstd
.color – (optional) vector or string specifying color for the manifold. If it is a vector, it has to have length consistent with the number of observations in
x
andy
vectors. It should be of typenumpy.ndarray
and size(n_observations,)
or(n_observations,1)
. It can also be set to a string specifying the color directly, for instance'r'
or'#006778'
. If not specified, data will be plotted in black.x_label – (optional)
str
specifying \(x\)-axis label annotation. If set toNone
label will not be plotted.y_label – (optional)
str
specifying \(y\)-axis label annotation. If set toNone
label will not be plotted.colorbar_label – (optional) string specifying colorbar label annotation. If set to
None
, colorbar label will not be plotted.color_map – (optional) colormap to use as per
matplotlib.cm
. Default is viridis.figure_size – (optional)
tuple
specifying figure size.title – (optional)
str
specifying plot title. If set toNone
title will not be plotted.save_filename – (optional)
str
specifying plot save location/filename. If set toNone
plot will not be saved. You can also set a desired file extension, for instance.pdf
. If the file extension is not specified, the default is.png
.
- Returns:
plt -
matplotlib.pyplot
plot handle.