Data preprocessing#
The preprocess
module can be used for performing data preprocessing
including centering and scaling, outlier detection and removal, kernel density
weighting of data sets, data clustering and data sampling. It also includes
functionalities that allow the user to perform initial data inspection such
as computing conditional statistics, calculating statistically representative sample sizes,
or ordering variables in a data set according to a criterion.
Note
The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).
The general agreement throughout this documentation is that \(i\) will index observations and \(j\) will index variables.
The representation of the user-supplied data matrix in PCAfold is the input parameter X, which should be of type numpy.ndarray and of size (n_observations,n_variables).
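For instance, a data set with \(N = 1000\) observations of \(Q = 5\) variables could be represented as follows (a minimal sketch using randomly generated data):

import numpy as np

# A dummy data set with N = 1000 observations (rows) and Q = 5 variables (columns):
X = np.random.rand(1000, 5)

# The dimensions can be recovered from the array shape:
(n_observations, n_variables) = np.shape(X)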
Data manipulation#
This section includes functions for performing basic data manipulation such as centering and scaling and outlier detection and removal.
center_scale
#
- PCAfold.preprocess.center_scale(X, scaling, nocenter=False)#
Centers and scales the original data set, \(\mathbf{X}\). In the discussion below, we understand that \(X_j\) is the \(j^{th}\) column of \(\mathbf{X}\).
Centering is performed by subtracting the center, \(c_j\), from \(X_j\), where centers for all columns are stored in the matrix \(\mathbf{C}\):
\[\mathbf{X_c} = \mathbf{X} - \mathbf{C}\]

Centers for each column are computed as:

\[c_j = mean(X_j)\]

with the only exceptions of '0to1' and '-1to1' scalings, which introduce a different quantity to center each column.

Scaling is performed by dividing \(X_j\) by the scaling factor, \(d_j\), where scaling factors for all columns are stored in the diagonal matrix \(\mathbf{D}\):

\[\mathbf{X_s} = \mathbf{X} \cdot \mathbf{D}^{-1}\]

If both centering and scaling are applied:

\[\mathbf{X_{cs}} = (\mathbf{X} - \mathbf{C}) \cdot \mathbf{D}^{-1}\]

Several scaling options are implemented here:
| Scaling method | scaling | Scaling factor \(d_j\) |
|---|---|---|
| None | 'none' or '' | 1 |
| Auto [PvdBHW+06] | 'auto' or 'std' | \(\sigma\) |
| Pareto [PNod08] | 'pareto' | \(\sqrt{\sigma}\) |
| VAST [PKEA+03] | 'vast' | \(\sigma^2 / mean(X_j)\) |
| Range [PvdBHW+06] | 'range' | \(max(X_j) - min(X_j)\) |
| 0 to 1 | '0to1' | \(d_j = max(X_j) - min(X_j)\), \(c_j = min(X_j)\) |
| -1 to 1 | '-1to1' | \(d_j = 0.5 \cdot (max(X_j) - min(X_j))\), \(c_j = 0.5 \cdot (max(X_j) + min(X_j))\) |
| Level [PvdBHW+06] | 'level' | \(mean(X_j)\) |
| Max | 'max' | \(max(X_j)\) |
| Variance | 'variance' | \(var(X_j)\) |
| Median | 'median' | \(median(X_j)\) |
| Poisson [PKK04] | 'poisson' | \(\sqrt{mean(X_j)}\) |
| S1 | 'vast_2' | \(\sigma^2 k^2 / mean(X_j)\) |
| S2 | 'vast_3' | \(\sigma^2 k^2 / max(X_j)\) |
| S3 | 'vast_4' | \(\sigma^2 k^2 / (max(X_j) - min(X_j))\) |
| L2-norm | 'l2-norm' | \(\|X_j\|_2\) |
where \(\sigma\) is the standard deviation of \(X_j\) and \(k\) is the kurtosis of \(X_j\).
The effect of data preprocessing (including scaling) on low-dimensional manifolds was studied in [PPS13].
Example:
from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)
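As a sanity check on the definitions above, 'auto' scaling can be reproduced manually with numpy (a minimal sketch; the exact standard deviation convention used internally is an assumption here):

from PCAfold import center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale with auto (std) scaling:
(X_cs, X_center, X_scale) = center_scale(X, 'auto')

# Reproduce the result from the definitions above, X_cs = (X - C) * D^(-1):
X_cs_manual = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# The two results should agree closely (up to the ddof convention used for the standard deviation):
print(np.allclose(X_cs, X_cs_manual))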
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'variance'
,'median'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
,'l2-norm'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
- Returns
X_cs -
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It has size(n_observations,n_variables)
.X_center -
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.X_scale -
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It has size(n_variables,)
.
invert_center_scale
#
- PCAfold.preprocess.invert_center_scale(X_cs, X_center, X_scale)#
Inverts whatever centering and scaling was done by the center_scale function:

\[\mathbf{X} = \mathbf{X_{cs}} \cdot \mathbf{D} + \mathbf{C}\]

Example:
from PCAfold import center_scale, invert_center_scale
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Center and scale:
(X_cs, X_center, X_scale) = center_scale(X, 'range', nocenter=False)

# Uncenter and unscale:
X = invert_center_scale(X_cs, X_center, X_scale)
- Parameters
X_cs –
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center –
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale –
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set, \(\mathbf{X}\). It should be of size(n_variables,)
.
- Returns
X -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It has size(n_observations,n_variables)
.
power_transform
#
- PCAfold.preprocess.power_transform(X, transform_power, transform_shift=0.0, transform_sign_shift=0.0, invert=False)#
Performs a power transformation of the provided data. The equation for the transformation of a variable \(X\) is:

\[(|X + s_1|)^\alpha \, \text{sign}(X + s_1) + s_2 \, \text{sign}(X + s_1)\]

where \(\alpha\) is the transform_power, \(s_1\) is the transform_shift, and \(s_2\) is the transform_sign_shift.

Example:
from PCAfold import power_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform power transformation:
X_pow = power_transform(X, 0.5)

# Undo the transformation:
X_orig = power_transform(X_pow, 0.5, invert=True)
- Parameters
X – array of the variable(s) to be transformed
transform_power – the power parameter used in the transformation equation
transform_shift – (optional, default 0.) the shift parameter used in the transformation equation
transform_sign_shift – (optional, default 0.) the signed shift parameter used in the transformation equation
invert – (optional, default False) when True, will undo the transformation
- Returns
array of the transformed variables
log_transform
#
- PCAfold.preprocess.log_transform(X, method='log', threshold=1e-06)#
Performs a log transformation of the original data set, \(\mathbf{X}\).

The symlog transformation can be obtained with method='symlog', and the continuous symlog transformation can be obtained with method='continuous-symlog'.

Example:
from PCAfold import log_transform
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20) + 1

# Perform log transformation:
X_log = log_transform(X)

# Perform symlog transformation:
X_symlog = log_transform(X, method='symlog', threshold=1.e-4)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
specifying the log-transformation method. It can be one of the following:log
,ln
,symlog
,continuous-symlog
.threshold – (optional)
float
orint
specifying the threshold for symlog transformation.
- Returns
X_transformed -
numpy.ndarray
specifying the log-transformed data set. It has size(n_observations,n_variables)
.
remove_constant_vars
#
- PCAfold.preprocess.remove_constant_vars(X, maxtol=1e-12, rangetol=0.0001)#
Removes any constant columns from the original data set, \(\mathbf{X}\). The \(j^{th}\) column, \(X_j\), is considered constant if either of the following is true:

The maximum absolute value of the column \(X_j\) is less than maxtol:

\[max(|X_j|) < \verb|maxtol|\]

The ratio of the range of values in the column \(X_j\) to \(max(|X_j|)\) is less than rangetol:

\[\frac{max(X_j) - min(X_j)}{max(|X_j|)} < \verb|rangetol|\]

In particular, this function can be used as a preprocessing step for PCA, since constant columns would otherwise break the eigenvalue calculation.
Example:
from PCAfold import remove_constant_vars
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Remove the constant column:
(X_removed, idx_removed, idx_retained) = remove_constant_vars(X)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.maxtol – (optional)
float
specifying the tolerance for \(max(|X_j|)\).rangetol – (optional)
float
specifying the tolerance for \(max(X_j) - min(X_j)\) over \(max(|X_j|)\).
- Returns
X_removed -
numpy.ndarray
specifying the original data set, \(\mathbf{X}\) with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed -
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained -
list
specifying the indices of columns retained in \(\mathbf{X}\).
order_variables
#
- PCAfold.preprocess.order_variables(X, method='mean', descending=True)#
Orders variables in the original data set, \(\mathbf{X}\), using a selected method.
Example:
from PCAfold import order_variables
import numpy as np

# Generate a dummy data set:
X = np.array([[100, 1, 10],
              [200, 2, 20],
              [300, 3, 30]])

# Order variables by the mean value in the descending order:
(X_ordered, idx) = order_variables(X, method='mean', descending=True)
The code above should return an ordered data set:
array([[100,  10,   1],
       [200,  20,   2],
       [300,  30,   3]])
and the list of ordered variable indices:
[0, 2, 1]
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.method – (optional)
str
orlist
ofint
specifying the ordering method. Ifstr
, it can be one of the following:'mean'
,'min'
,'max'
,'std'
or'var'
. Iflist
, it is a custom user-provided list of indices for how the variables should be ordered.descending – (optional)
bool
specifying whether variables should be ordered in the descending order. If set toFalse
, variables will be ordered in the ascending order.
- Returns
X_ordered -
numpy.ndarray
specifying the original data set with ordered variables. It has size(n_observations,n_variables)
.idx -
list
specifying the indices of the ordered variables. It has lengthn_variables
.
Class PreProcessing
#
- class PCAfold.preprocess.PreProcessing(X, scaling='none', nocenter=False)#
Performs a composition of the data manipulation done by the remove_constant_vars and center_scale functions on the original data set, \(\mathbf{X}\). It can be used to store the result of that manipulation. Specifically, it:

checks for constant columns in a data set and removes them,

centers and scales the data.
Example:
from PCAfold import PreProcessing
import numpy as np

# Generate dummy data set with a constant variable:
X = np.random.rand(100,20)
X[:,5] = np.ones((100,))

# Instantiate PreProcessing class object:
preprocessed = PreProcessing(X, 'range', nocenter=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.nocenter – (optional)
bool
specifying whether data should be centered by mean. If set toTrue
data will not be centered.
Attributes:
X_removed - (read only)
numpy.ndarray
specifying the original data set with any constant columns removed. It has size(n_observations,n_variables)
.idx_removed - (read only)
list
specifying the indices of columns removed from \(\mathbf{X}\).idx_retained - (read only)
list
specifying the indices of columns retained in \(\mathbf{X}\).X_cs - (read only)
numpy.ndarray
specifying the centered and scaled data set, \(\mathbf{X_{cs}}\). It should be of size(n_observations,n_variables)
.X_center - (read only)
numpy.ndarray
specifying the centers, \(c_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.X_scale - (read only)
numpy.ndarray
specifying the scales, \(d_j\), applied on the original data set \(\mathbf{X}\). It should be of size(n_variables,)
.
outlier_detection
#
- PCAfold.preprocess.outlier_detection(X, scaling, method='MULTIVARIATE TRIMMING', trimming_threshold=0.5, quantile_threshold=0.9899, verbose=False)#
Finds outliers in the original data set, \(\mathbf{X}\), and returns indices of observations without outliers as well as indices of the outliers themselves. Two options are implemented here:
'MULTIVARIATE TRIMMING'

Outliers are detected based on the multivariate Mahalanobis distance, \(D_M\):

\[D_M = \sqrt{(\mathbf{X} - \mathbf{\bar{X}})^T \mathbf{S}^{-1} (\mathbf{X} - \mathbf{\bar{X}})}\]

where \(\mathbf{\bar{X}}\) is a matrix of the same size as \(\mathbf{X}\), storing in each column a copy of the average value of the same column in \(\mathbf{X}\). \(\mathbf{S}\) is the covariance matrix computed as per the PCA class. Note that the scaling option selected will affect the covariance matrix \(\mathbf{S}\). Since the Mahalanobis distance takes into account covariance between variables, observations with sufficiently large \(D_M\) can be considered outliers. For more detailed information on the Mahalanobis distance the user is referred to [PBis06] or [PDMJRM00].

The threshold above which observations will be classified as outliers can be specified using the trimming_threshold parameter. Specifically, the \(i^{th}\) observation is classified as an outlier if:

\[D_{M, i} > \verb|trimming_threshold| \cdot max(D_M)\]

'PC CLASSIFIER'

Outliers are detected based on major and minor principal components (PCs). The principal component classifier (PCC) method was first proposed in [PSCSC03]. The application of this technique to combustion data sets was studied in [PPS13]. Specifically, the \(i^{th}\) observation is classified as an outlier if the first PC classifier, based on the \(q\) first (major) PCs:

\[\sum_{j=1}^{q} \frac{z_{ij}^2}{L_j} > c_1\]

or if the second PC classifier, based on the \((Q-k+1)\) last (minor) PCs:

\[\sum_{j=k}^{Q} \frac{z_{ij}^2}{L_j} > c_2\]

where \(z_{ij}\) is the \((i, j)^{th}\) element of the principal components matrix \(\mathbf{Z}\) and \(L_j\) is the \(j^{th}\) eigenvalue from \(\mathbf{L}\) (as per the PCA class). Major PCs are selected such that the total variance explained is 50%. Minor PCs are selected such that the remaining variance they explain is 20%.

Coefficients \(c_1\) and \(c_2\) are found such that they represent the quantile_threshold (by default 98.99%) quantile of the empirical distributions of the first and second PC classifier, respectively.

Example:
from PCAfold import outlier_detection
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Find outliers:
(idx_outliers_removed, idx_outliers) = outlier_detection(X,
                                                         scaling='auto',
                                                         method='MULTIVARIATE TRIMMING',
                                                         trimming_threshold=0.8,
                                                         verbose=True)

# New data set without outliers can be obtained as:
X_outliers_removed = X[idx_outliers_removed,:]

# Observations that were classified as outliers can be obtained as:
X_outliers = X[idx_outliers,:]
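The PC classifier method can be requested analogously; a minimal sketch, reusing the dummy data set and import from the example above together with the default quantile threshold:

# Find outliers with the principal component classifier instead:
(idx_outliers_removed_pcc, idx_outliers_pcc) = outlier_detection(X,
                                                                 scaling='auto',
                                                                 method='PC CLASSIFIER',
                                                                 quantile_threshold=0.9899,
                                                                 verbose=False)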
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.scaling –
str
specifying the scaling methodology. It can be one of the following:'none'
,''
,'auto'
,'std'
,'pareto'
,'vast'
,'range'
,'0to1'
,'-1to1'
,'level'
,'max'
,'poisson'
,'vast_2'
,'vast_3'
,'vast_4'
.method – (optional)
str
specifying the outlier detection method to use. It should be'MULTIVARIATE TRIMMING'
or'PC CLASSIFIER'
.trimming_threshold – (optional)
float
specifying the trimming threshold to use in combination with'MULTIVARIATE TRIMMING'
method.quantile_threshold – (optional)
float
specifying the quantile threshold to use in combination with'PC CLASSIFIER'
method.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_outliers_removed -
list
specifying the indices of observations without outliers.idx_outliers -
list
specifying the indices of observations that were classified as outliers.
representative_sample_size
#
- PCAfold.preprocess.representative_sample_size(depvars, percentages, thresholds, variable_names=None, method='kl-divergence', statistics='median', n_resamples=10, random_seed=None, verbose=False)#
Computes a representative sample size given dependent variables that serve as ground truth (100% of data). It is assumed that the full dataset is representative of some physical phenomena.
Two general approaches are available:
If method='kl-divergence', the representative sample size is computed based on the Kullback-Leibler divergence.

If method='mean', method='median', method='variance', or method='std', the representative sample size is computed based on convergence of a first-order (mean or median) or second-order (variance, standard deviation) statistic.
Example:
from PCAfold import center_scale, representative_sample_size
import numpy as np

# Generate dummy data set and two dependent variables:
x, y = np.meshgrid(np.linspace(-1,1,100), np.linspace(-1,1,100))
xy = np.hstack((x.ravel()[:,None],y.ravel()[:,None]))
phi_1 = np.exp(-((x*x+y*y) / (1 * 1**2)))
phi_1 = phi_1.ravel()[:,None]
phi_2 = np.exp(-((x*x+y*y) / (0.01 * 1**2)))
phi_2 = phi_2.ravel()[:,None]
depvars = np.column_stack((phi_1, phi_2))
depvars, _, _ = center_scale(depvars, scaling='0to1')

# Specify the list of percentages to explore:
percentages = list(np.linspace(1,99.9,200))

# Specify the list of thresholds for each dependent variable:
thresholds = [10**-4, 10**-4]

# Specify the names of the dependent variables:
variable_names = ['Phi-1', 'Phi-2']

# Compute representative sample size for each dependent variable:
(idx, sample_sizes, statistics) = representative_sample_size(depvars,
                                                             percentages,
                                                             thresholds=thresholds,
                                                             variable_names=variable_names,
                                                             method='kl-divergence',
                                                             statistics='median',
                                                             n_resamples=20,
                                                             random_seed=100,
                                                             verbose=True)
With verbose=True we will see some detailed information:

Dependent variable Phi-1 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-1: 2833 samples (28.3% of data).

Dependent variable Phi-2 ...
KL divergence threshold used: 0.0001
Representative sample size for dependent variable Phi-2: 9890 samples (98.9% of data).
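A variant based on convergence of a statistic rather than Kullback-Leibler divergence can be requested in a similar way. A minimal sketch, reusing the variables from the example above (the threshold values here are illustrative assumptions, not recommended defaults):

# Compute representative sample sizes based on convergence of the mean:
(idx_mean, sample_sizes_mean, statistics_mean) = representative_sample_size(depvars,
                                                                            percentages,
                                                                            thresholds=[0.01, 0.01],
                                                                            variable_names=variable_names,
                                                                            method='mean',
                                                                            statistics='max',
                                                                            n_resamples=20,
                                                                            random_seed=100,
                                                                            verbose=False)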
- Parameters
depvars –
numpy.ndarray
specifying the dependent variables that should be well represented in a sampled dataset. It should be of size(n_observations,n_dependent_variables)
.percentages –
list
of percentages to explore. It should be ordered in ascending order. Elements should be larger than 0 and not larger than 100.thresholds – (optional)
list
offloat
specifying the target thresholds for each dependent variable. The thresholds should be appropriate to the method based on which a representative sample size is computed.variable_names – (optional)
list
ofstr
specifying names for all dependent variables. If set toNone
, dependent variables are called with consecutive integers.method – (optional)
str
specifying the method used to compute the sample size statistics. It can bemean
,median
,variance
,std
, or'kl-divergence'
.statistics – (optional)
str
specifying the overall statistics that should be computed from a given method. It can bemin
,max
,mean
, ormedian
.n_resamples – (optional)
int
specifying the number of resamples to perform for each percentage in thepercentages
vector. It is recommended to set this parameter to a value above 1, since it might accidentally happen that a random sample is statistically representative of the full dataset. Re-sampling helps to average out the effect of such one-off “lucky” random samples.random_seed – (optional)
int
specifying the random seed.verbose – (optional)
bool
for printing verbose details.
- Returns
threshold_idx -
list
ofint
specifying the highest indices from thepercentages
list where the representative number of samples condition was still met. It has lengthn_depvars
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the list for that dependent variable.representative_sample_sizes -
numpy.ndarray
ofint
specifying the representative number of samples. It has size(1,n_depvars)
. If the condition for a representative sample size was not met for a dependent variable, a value of-1
is returned in the array for that dependent variable.sample_size_statistics -
numpy.ndarray
specifying the full vector of computed statistics corresponding to each entry in
and each dependent variable. It has size(n_percentages,n_depvars)
.
Class ConditionalStatistics
#
- class PCAfold.preprocess.ConditionalStatistics(X, conditioning_variable, k=20, split_values=None, verbose=False)#
Enables computing conditional statistics on the original data set, \(\mathbf{X}\). This includes:
conditional mean
conditional minimum
conditional maximum
conditional standard deviation
Other quantities can be added in the future at the user’s request.
Example:
from PCAfold import ConditionalStatistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Instantiate an object of the ConditionalStatistics class
# and compute conditional statistics in 10 bins of the conditioning variable:
cond = ConditionalStatistics(y[:,None], conditioning_variable, k=10)

# Access conditional statistics:
conditional_mean = cond.conditional_mean
conditional_min = cond.conditional_minimum
conditional_max = cond.conditional_maximum
conditional_std = cond.conditional_standard_deviation

# Access the centroids of the created bins:
centroids = cond.centroids
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying a single variable to be used as a conditioning variable. It should be of size(n_observations,1)
or(n_observations,)
.k –
int
specifying the number of bins to create in the conditioning variable. It has to be a positive number.split_values –
list
specifying values at which splits should be performed. If set toNone
, splits will be performed using \(k\) equal variable bins.verbose – (optional)
bool
for printing verbose details.
Attributes:
idx - (read only)
numpy.ndarray
of cluster (bins) classifications. It has size(n_observations,)
.borders - (read only)
list
of values that define borders for the clusters (bins). It has lengthk+1
.centroids - (read only)
list
of values that specify bins centers. It has lengthk
.conditional_mean - (read only)
numpy.ndarray
specifying the conditional means of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_minimum - (read only)
numpy.ndarray
specifying the conditional minimums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_maximum - (read only)
numpy.ndarray
specifying the conditional maximums of all original variables in the \(k\) bins created. It has size(k,n_variables)
.conditional_standard_deviation - (read only)
numpy.ndarray
specifying the conditional standard deviations of all original variables in the \(k\) bins created. It has size(k,n_variables)
.
Class KernelDensity
#
- class PCAfold.preprocess.KernelDensity(X, conditioning_variable, verbose=False)#
Enables kernel density weighting of the original data set, \(\mathbf{X}\), based on a single-variable or multi-variable case, as proposed in [PCGP12].
The goal of both cases is to obtain a vector of weights, \(\mathbf{W_c}\), that has the same number of elements as there are observations in the original data set, \(\mathbf{X}\). Each observation will then get multiplied by the corresponding weight from \(\mathbf{W_c}\).
Note
The kernel density weighting technique is usually computationally expensive, even on data sets with a relatively small number of observations. Since the single-variable case is cheaper than the multi-variable case, it is recommended to try the single-variable case first on larger data sets.
A Gaussian kernel is used in both approaches:

\[K_{c, c'} = \sqrt{\frac{1}{2 \pi h^2}} \exp \Big(- \frac{d^2}{2 h^2} \Big)\]

\(h\) is the kernel bandwidth:

\[h = \Big( \frac{4 \hat{\sigma}}{3 n} \Big)^{1/5}\]

where \(\hat{\sigma}\) is the standard deviation of the considered variable and \(n\) is the number of observations in the data set.

\(d\) is the distance between two observations \(c\) and \(c'\):

\[d = |x_c - x_{c'}|\]

Single-variable

If the conditioning_variable argument is a single vector, weighting will be performed according to the single-variable case. It begins by summing Gaussian kernels:

\[\mathbf{K_c} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c'}\]

and weights are then computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Multi-variable

If the conditioning_variable argument is a matrix of multiple variables, weighting will be performed according to the multi-variable case. It begins by summing Gaussian kernels for the \(k^{th}\) variable:

\[\mathbf{K_c}_{, k} = \sum_{c' = 1}^{c' = n} \frac{1}{n} K_{c, c', k}\]

The global density, taking into account all variables, is then obtained as:

\[\mathbf{K_{c}} = \prod_{k=1}^{k=Q} \mathbf{K_c}_{, k}\]

where \(Q\) is the total number of conditioning variables, and weights are computed as:

\[\mathbf{W_c} = \frac{\frac{1}{\mathbf{K_c}}}{max(\frac{1}{\mathbf{K_c}})}\]

Example:
from PCAfold import KernelDensity
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Perform kernel density weighting based on the first variable:
kerneld = KernelDensity(X, X[:,0])

# Access the weighted data set:
X_weighted = kerneld.X_weighted

# Access the weights used to scale the data set:
weights = kerneld.weights
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.conditioning_variable –
numpy.ndarray
specifying either a single variable or multiple variables to be used as a conditioning variable for kernel weighting procedure. Note that it can also be passed as the data set \(\mathbf{X}\).
Attributes:
weights -
numpy.ndarray
specifying the computed weights, \(\mathbf{W_c}\). It has size(n_observations,1)
.X_weighted -
numpy.ndarray
specifying the weighted data set (each observation in \(\mathbf{X}\) is multiplied by the corresponding weight in \(\mathbf{W_c}\)). It has size(n_observations,n_variables)
.
Class DensityEstimation
#
- class PCAfold.preprocess.DensityEstimation(X, n_neighbors)#
Enables density estimation on point-cloud data.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.n_neighbors –
int
specifying the number of nearest neighbors, or the \(k\) th nearest neighbor when applicable.
DensityEstimation.average_knn_distance
#
- PCAfold.preprocess.DensityEstimation.average_knn_distance(self, verbose=False)#
Computes the average Euclidean distance to the \(k\) nearest neighbors on a manifold defined by the independent variables.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute average distances on a manifold defined by the PCs:
average_distances = density_estimation.average_knn_distance(verbose=True)
With verbose=True, the minimum, maximum, average and median distance will be printed:

Minimum distance: 0.1388300829487847
Maximum distance: 0.4689587542132183
Average distance: 0.20824964953425693
Median distance: 0.18333873029179215
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Parameters
verbose – (optional)
bool
for printing verbose details.- Returns
average_distances -
numpy.ndarray
specifying the vector of average distances for every observation in a data set to its \(k\) nearest neighbors. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_codensity
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_codensity(self)#
Computes the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data codensity defined as:
\[\delta_k(x) = d(x, v_k(x))\]where \(v_k(x)\) is the \(k\) th nearest neighbor of \(x\).
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_codensity = density_estimation.kth_nearest_neighbor_codensity()
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Returns
data_codensity -
numpy.ndarray
specifying the vector of distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
DensityEstimation.kth_nearest_neighbor_density
#
- PCAfold.preprocess.DensityEstimation.kth_nearest_neighbor_density(self)#
Computes an inverse of the Euclidean distance to the \(k\) th nearest neighbor on a manifold defined by the independent variables as per [PCVJ21]. This value has an interpretation of a data density defined as:
\[\rho_k(x) = \frac{1}{\delta_k(x)}\]where \(\delta_k(x)\) is the codensity.
Example:
from PCAfold import PCA, DensityEstimation
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,20)

# Instantiate PCA class object:
pca_X = PCA(X, scaling='none', n_components=2, use_eigendec=True, nocenter=False)

# Calculate the principal components:
principal_components = pca_X.transform(X)

# Instantiate an object of the DensityEstimation class:
density_estimation = DensityEstimation(principal_components, n_neighbors=10)

# Compute the distance to the kth nearest neighbor:
data_density = density_estimation.kth_nearest_neighbor_density()
Note
This function requires the scikit-learn module. You can install it through:

pip install scikit-learn
- Returns
data_density -
numpy.ndarray
specifying the vector of inverse distances to the \(k\) th nearest neighbor of every data observation. It has size(n_observations,)
.
Data clustering#
This section includes functions for classifying data sets into local clusters and performing some basic operations on clusters [PELL09], [PKR09].
Clustering functions#
Each function that clusters the data set returns a vector of integers idx of type numpy.ndarray and of size (n_observations,) that specifies the classification of each observation from the original data set, \(\mathbf{X}\), to a local cluster.
Note
The first cluster has index 0
within all idx
vectors returned.
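Since all clustering functions share this idx format, cluster membership can be used directly for indexing. A minimal sketch (using variable_bins, which is documented below):

from PCAfold import variable_bins
import numpy as np

# Generate a dummy variable and cluster it into 4 bins:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=False)

# The first cluster has index 0, so its observations can be extracted as:
observations_in_first_cluster = x[idx==0]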
variable_bins
#
- PCAfold.preprocess.variable_bins(var, k, verbose=False)#
Clusters the data by dividing a variable vector var into bins of equal lengths.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to bins of x:
(idx, borders) = variable_bins(x, 4, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
predefined_variable_bins
#
- PCAfold.preprocess.predefined_variable_bins(var, split_values, verbose=False)#
Clusters the data by dividing a variable vector var into bins such that splits are performed at user-specified values. Split values can be specified in the split_values list. In general:

split_values = [value_1, value_2, ..., value_n]

Note: When a split is performed at a given value_i, the observation in var that takes exactly that value is assigned to the newly created bin.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import predefined_variable_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-1,1,100)

# Create partitioning according to pre-defined bins of x:
(idx, borders) = predefined_variable_bins(x, [-0.6, 0.4, 0.8], verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.split_values –
list
specifying values at which splits should be performed.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
mixture_fraction_bins
#
- PCAfold.preprocess.mixture_fraction_bins(Z, k, Z_stoich, verbose=False)#
Clusters the data by dividing a mixture fraction vector Z into bins of equal lengths. This technique can be used to partition combustion data sets as proposed in [PPSTS09]. The vector is first split into the lean and rich sides (according to the stoichiometric mixture fraction Z_stoich) and then each side is divided further into clusters. When k is odd, there will always be one more cluster on the side with the larger range in mixture fraction space compared to the other side.

An example of how a vector can be partitioned with this function is presented below:
Example:
from PCAfold import mixture_fraction_bins
import numpy as np

# Generate dummy mixture fraction variable:
Z = np.linspace(0,1,100)

# Create partitioning according to bins of mixture fraction:
(idx, borders) = mixture_fraction_bins(Z, 4, 0.4, verbose=True)
- Parameters
Z –
numpy.ndarray
specifying the mixture fraction values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number.Z_stoich –
float
specifying the stoichiometric mixture fraction. It has to be between 0 and 1.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
zero_neighborhood_bins
#
- PCAfold.preprocess.zero_neighborhood_bins(var, k, zero_offset_percentage=0.1, split_at_zero=False, verbose=False)#
Clusters the data by separating close-to-zero observations in a vector into one cluster (split_at_zero=False) or two clusters (split_at_zero=True). The offset from zero at which splits are performed is computed based on the input parameter zero_offset_percentage:

\[\verb|offset| = \frac{(max(\verb|var|) - min(\verb|var|)) \cdot \verb|zero_offset_percentage|}{100}\]

Further clusters are found by alternately splitting positive and negative values in the vector into bins of equal lengths.

This clustering technique can be useful for partitioning any variable that has many observations clustered around zero and relatively few observations far away from zero on either side.

Two examples of how a vector can be partitioned with this function are presented below.

With split_at_zero=False:

If split_at_zero=False, the smallest allowed number of clusters is 3. This ensures that there are at least three clusters: one with negative values, one with close-to-zero values, and one with positive values. When k is even, there will always be one more cluster on the side with the larger range compared to the other side.

With split_at_zero=True:

If split_at_zero=True, the smallest allowed number of clusters is 4. This ensures that there are at least four clusters: one with negative values, one with negative values close to zero, one with positive values close to zero, and one with positive values. When k is odd, there will always be one more cluster on the side with the larger range compared to the other side.

Note

This clustering technique is well suited for partitioning chemical source terms, \(\mathbf{S_X}\), or sources of principal components, \(\mathbf{S_Z}\) (as per [TSP09]), since it relies on unbalanced vectors that have many observations numerically close to zero. Using split_at_zero=True, it can further differentiate between negative and positive sources.

Example:
from PCAfold import zero_neighborhood_bins
import numpy as np

# Generate dummy variable:
x = np.linspace(-100,100,1000)

# Create partitioning according to bins of x:
(idx, borders) = zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)
- Parameters
var –
numpy.ndarray
specifying the variable values. It should be of size(n_observations,)
or(n_observations,1)
.k –
int
specifying the number of clusters to create. It has to be a positive number. It cannot be smaller than 3 ifsplit_at_zero=False
or smaller than 4 ifsplit_at_zero=True
.zero_offset_percentage – (optional) percentage of \(max(\verb|var|) - min(\verb|var|)\) range to take as the offset from zero value. For instance, set
zero_offset_percentage=10
if you want 10% as offset.split_at_zero – (optional)
bool
specifying whether partitioning should be done atvar=0
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx -
numpy.ndarray
of cluster classifications. It has size(n_observations,)
.borders -
list
of values that define borders for the clusters. It has lengthk+1
.
Auxiliary functions#
degrade_clusters
#
- PCAfold.preprocess.degrade_clusters(idx, verbose=False)#
Renumbers cluster labels if either of these two cases is true:

idx is composed of non-consecutive integers, or

the smallest cluster index in idx is not equal to 0.
Example:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 2, 0, 5, 10])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
The code above will produce:
>>> idx_degraded
array([0, 0, 1, 0, 2, 3])
Alternatively:
from PCAfold import degrade_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([1, 1, 2, 2, 3, 3])

# Degrade clusters:
(idx_degraded, k_update) = degrade_clusters(idx)
will produce:
>>> idx_degraded
array([0, 0, 1, 1, 2, 2])
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.verbose – (optional)
bool
for printing verbose details.
- Returns
idx_degraded -
numpy.ndarray
of degraded cluster classifications. It has size(n_observations,)
.k_update -
int
specifying the updated number of clusters.
flip_clusters
#
- PCAfold.preprocess.flip_clusters(idx, dictionary)#
Flips cluster labelling according to instructions provided in a dictionary. For dictionary = {key : value}, a cluster with number key will get number value.

Example:
from PCAfold import flip_clusters
import numpy as np

# Generate dummy idx vector:
idx = np.array([0,0,0,1,1,1,1,2,2])

# Swap cluster number 1 with cluster number 2:
flipped_idx = flip_clusters(idx, {1:2, 2:1})
The code above will produce:
>>> flipped_idx
array([0, 0, 0, 2, 2, 2, 2, 1, 1])
Note
This function can also be used to merge clusters. Using the idx from the example above, if we call:

flipped_idx = flip_clusters(idx, {2:1})
the result will be:
>>> flipped_idx
array([0,0,0,1,1,1,1,1,1])
where clusters 1 and 2 have been merged into one cluster numbered 1.

- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.dictionary –
dict
specifying instructions for cluster label flipping.
- Returns
flipped_idx -
numpy.ndarray
specifying the re-labelled cluster classifications. It has size(n_observations,)
.
get_centroids
#
- PCAfold.preprocess.get_centroids(X, idx)#
Computes the centroids for all variables in the original data set, \(\mathbf{X}\), and for each cluster specified in the idx vector. The centroid \(c_{n, j}\) for variable \(X_j\) in the \(n^{th}\) cluster is computed as:

\[c_{n, j} = mean(X_j), \,\,\,\, \text{for} \,\, X_j \in \text{cluster} \,\, n\]

Centroids for all variables from all clusters are stored in the returned matrix \(\mathbf{c} \in \mathbb{R}^{k \times Q}\):

\[\begin{split}\mathbf{c} = \begin{bmatrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, Q} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, Q} \\ \vdots & \vdots & \vdots & \vdots \\ c_{k, 1} & c_{k, 2} & \dots & c_{k, Q} \\ \end{bmatrix}\end{split}\]

Example:
from PCAfold import get_centroids
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute the centroids of each cluster:
centroids = get_centroids(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
centroids -
numpy.ndarray
specifying the centroids matrix, \(\mathbf{c}\), for all clusters and for all variables. It has size(k,n_variables)
.
get_partition
#
- PCAfold.preprocess.get_partition(X, idx)#
Partitions the observations from the original data set, \(\mathbf{X}\), into \(k\) clusters according to the idx provided.

Example:
from PCAfold import get_partition
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Generate partitioning of the data set according to idx:
(X_in_clusters, idx_in_clusters) = get_partition(X, idx)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.
- Returns
X_in_clusters -
list
of \(k\)numpy.ndarray
that contains original data set observations partitioned to \(k\) clusters. It has lengthk
.idx_in_clusters -
list
of \(k\)numpy.ndarray
that contains indices of the original data set observations partitioned to \(k\) clusters. It has lengthk
.
get_populations
#
- PCAfold.preprocess.get_populations(idx)#
Computes populations (number of observations) in the clusters specified in the idx vector. As an example, if there are 100 observations in the first cluster and 500 observations in the second cluster, this function will return the list [100, 500].

Example:
from PCAfold import variable_bins, get_populations
import numpy as np

# Generate dummy partitioning:
x = np.linspace(-1,1,100)
(idx, borders) = variable_bins(x, 4, verbose=True)

# Compute cluster populations:
populations = get_populations(idx)
The code above will produce:
>>> populations
[25, 25, 25, 25]
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.- Returns
populations -
list
of cluster populations. Each entry refers to one cluster ordered according to idx
. It has lengthk
.
get_average_centroid_distance
#
- PCAfold.preprocess.get_average_centroid_distance(X, idx, weighted=False)#
Computes the average Euclidean distance between observations and the centroids of clusters to which each observation belongs.
The average can be computed as an arithmetic average from all clusters (weighted=False) or as a weighted average (weighted=True). In the latter, the distances are weighted by the number of observations in a cluster, so that the average centroid distance will approach the average distance in the largest cluster.

Example:
from PCAfold import get_average_centroid_distance
import numpy as np

# Generate dummy data set:
X = np.random.rand(100,5)

# Generate dummy clustering of the data set:
idx = np.zeros((100,))
idx[50:80] = 1
idx = idx.astype(int)

# Compute average distance from cluster centroids:
average_centroid_distance = get_average_centroid_distance(X, idx, weighted=False)
- Parameters
X –
numpy.ndarray
specifying the original data set, \(\mathbf{X}\). It should be of size(n_observations,n_variables)
.idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.weighted – (optional)
bool
specifying whether distances from centroid should be weighted by the number of observations in a cluster. If set toFalse
, arithmetic average will be computed.
- Returns
average_centroid_distance -
float
specifying the average distance from centroids, averaged over all observations and all clusters.
Data sampling#
This section includes functions for splitting data sets into train and test data for use in machine learning algorithms. Apart from random splitting that can be achieved with the commonly used sklearn.model_selection.train_test_split, extended methods are implemented here that allow for purposive sampling [PNey92], such as drawing samples at certain amount from local clusters [PMMD10], [PGSB04]. These functionalities can be specifically used to tackle imbalanced data sets [PHG09], [PRLM+16].
The general idea is to divide the entire data set X (or a portion of it) into train and test samples.

Train data is always sampled in the same way for a given sampling function. Depending on the option selected, test data will be sampled differently: either as all remaining samples that were not included in the train data, or as a subset of those. You can select the option by setting the test_selection_option parameter for each sampling function. Refer to the documentation of a specific sampling function to see which options are available.
All splitting functions in this module return a tuple of two variables: (idx_train, idx_test). Both idx_train and idx_test are vectors of integers of type numpy.ndarray and of size (_,). These variables contain the indices of observations that went into the train data and the test data, respectively.
In your model learning algorithm you can then get the train and test observations, for instance in the following way:
X_train = X[idx_train,:]
X_test = X[idx_test,:]
All functions are equipped with a verbose parameter. If it is set to True, some additional information on train and test selection is printed.
Note
It is assumed that the first cluster has index 0
within all input idx
vectors.
Class DataSampler
#
- class PCAfold.preprocess.DataSampler(idx, idx_test=None, random_seed=None, verbose=False)#
Enables selecting train and test data samples.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, idx_test=np.array([5,9]), random_seed=100, verbose=True)
- Parameters
idx –
numpy.ndarray
of cluster classifications. It should be of size(n_observations,)
or(n_observations,1)
.idx_test – (optional)
numpy.ndarray
specifying the user-provided indices for test data. If specified, train data will be selected ignoring the indices inidx_test
and the test data will be returned the same as the user-providedidx_test
. If not specified, test samples will be selected according to thetest_selection_option
parameter (see documentation for each sampling function). Setting fixedidx_test
parameter may be useful if training a machine learning model on specific test samples is desired. It should be of size(n_test_samples,)
or(n_test_samples,1)
.random_seed – (optional)
int
specifying random seed for random sample selection.verbose – (optional)
bool
for printing verbose details.
DataSampler.number
#
- PCAfold.preprocess.DataSampler.number(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a fixed number of observations from every cluster as training data. In general, this results in a balanced representation of features identified by a clustering algorithm.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.number(20, test_selection_option=1)
Train data:

The number of train samples is estimated based on the percentage perc provided. First, the total number of samples for training is estimated as a percentage perc of the total number of observations n_observations in a data set. Next, this number is divided equally into \(k\) clusters. The result, n_of_samples, is the number of samples that will be selected from each cluster:

\[\verb|n_of_samples| = \verb|int| \Big( \frac{\verb|perc| \cdot \verb|n_observations|}{k \cdot 100} \Big)\]

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the smallest cluster is found and the remaining number of observations \(m\) in that cluster are taken as test data. Next, the same number of samples \(m\) is taken from all remaining larger clusters.

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(n\) and \(m\) are fixed numbers for each cluster. In general, \(n \neq m\).

- Parameters
perc – percentage of data to be selected as training data from the entire data set. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.percentage
#
- PCAfold.preprocess.DataSampler.percentage(self, perc, test_selection_option=1)#
Uses classifications into \(k\) clusters and samples a certain percentage perc from every cluster as the training data.

Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.percentage(20, test_selection_option=1)
Note: If the cluster sizes are comparable, this function will give a similar train sample distribution as random sampling (DataSampler.random). This sampling can be useful in cases where one cluster is significantly smaller than others and there is a chance that this cluster will not be covered in the train data if random sampling were used.

Train data:

The number of train samples is estimated based on the percentage perc provided. First, the size of the \(i^{th}\) cluster, cluster_size_i, is estimated and then a percentage perc of that number is selected.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(p\) is the percentage perc provided.

- Parameters
perc – percentage of data to be selected as training data from each cluster. For instance, set
perc=20
if you want to select 20%.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.manual
#
- PCAfold.preprocess.DataSampler.manual(self, sampling_dictionary, sampling_type='percentage', test_selection_option=1)#
Uses classifications into \(k\) clusters and a dictionary sampling_dictionary in which you manually specify what 'percentage' (or what 'number') of samples will be selected as the train data from each cluster. The dictionary keys are cluster classifications as per idx and the dictionary values are either the percentage or the number of train samples to be selected. By default the dictionary values are interpreted as percentages, but you can select sampling_type='number' in order to interpret the values as a number of samples.

Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.manual({0:1, 1:1, 2:1}, sampling_type='number', test_selection_option=1)
Train data:

The number of train samples selected from each cluster is estimated based on the sampling_dictionary. For key : value, the percentage value (or number value) of samples will be selected from cluster key.

Test data:

Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure will be used to select test data as was used to select train data (only allowed if the number of samples taken as train data from any cluster did not exceed 50% of observations in that cluster).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here it is understood that \(n_1\) train samples were requested from the first cluster, \(n_2\) from the second cluster and \(n_3\) from the third cluster, where \(n_i\) can be interpreted as a number or as a percentage. This can be achieved by setting:

sampling_dictionary = {0:n_1, 1:n_2, 2:n_3}
- Parameters
sampling_dictionary –
dict
specifying manual sampling. Keys are cluster classifications and values are eitherpercentage
ornumber
of samples to be taken from that cluster. Keys should match the cluster classifications as peridx
.sampling_type – (optional)
str
specifying whether percentage or number is given in thesampling_dictionary
. Available options:percentage
ornumber
. The default ispercentage
.test_selection_option – (optional)
int
specifying the option for how the test data is selected. Selecttest_selection_option=1
if you want all remaining samples to become test data. Selecttest_selection_option=2
if you want to select a subset of the remaining samples as test data.
- Returns
idx_train -
numpy.ndarray
of indices of the train data. It has size(n_train,)
.idx_test -
numpy.ndarray
of indices of the test data. It has size(n_test,)
.
DataSampler.random
#
- PCAfold.preprocess.DataSampler.random(self, perc, test_selection_option=1)#
Samples train data at random from the entire data set.
Example:
from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
Due to the nature of this sampling technique, it is not necessary to have idx classifications, since random samples can also be selected from unclassified data sets. You can achieve that by generating a dummy idx vector that has the same number of observations, n_observations, as your data set. For instance:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
n_observations = 100
idx = np.zeros(n_observations)

# Instantiate DataSampler class object:
selection = DataSampler(idx)

# Generate sampling:
(idx_train, idx_test) = selection.random(20, test_selection_option=1)
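Both test-selection options described under Test data below are available here as well. A minimal sketch (parameter values are illustrative) that samples 20% of the data as train data and selects the test data with the same random procedure, which is allowed since the requested percentage is below 50%:

from PCAfold import DataSampler
import numpy as np

# Generate dummy idx vector:
idx = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Instantiate DataSampler class object:
selection = DataSampler(idx, verbose=True)

# Select 20% of the data at random as train data and sample
# the test data with the same random procedure:
(idx_train, idx_test) = selection.random(20, test_selection_option=2)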
Train data:
The total number of train samples is computed as a percentage perc of the total number of observations in the data set. These samples are then drawn at random from the entire data set, independent of cluster classifications.

Test data:
Two options for sampling test data are implemented. If you select test_selection_option=1, all remaining samples that were not taken as train data become the test data. If you select test_selection_option=2, the same procedure is used to select test data as was used to select train data (only allowed if perc is less than 50%).

The scheme below presents graphically how train and test data can be selected using the test_selection_option parameter:

Here \(p\) is the percentage perc provided.

- Parameters
perc – percentage of the data to be selected as train data. Set perc=20 if you want 20%.
test_selection_option – (optional) int specifying the option for how the test data is selected. Select test_selection_option=1 if you want all remaining samples to become test data. Select test_selection_option=2 if you want to select a subset of the remaining samples as test data.
- Returns
idx_train - numpy.ndarray of indices of the train data. It has size (n_train,).
idx_test - numpy.ndarray of indices of the test data. It has size (n_test,).
Plotting functions#
This section includes functions for plotting related to data preprocessing, such as visualizing the formed clusters, visualizing the selected train and test samples, or plotting conditional statistics.
plot_2d_clustering
#
- PCAfold.preprocess.plot_2d_clustering(x, y, idx, clean=False, x_label=None, y_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, grid_on=False, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into clusters. The number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_2d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(10,6), title='x-y data set', save_filename='clustering.pdf')
plt.close()
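As a further illustration, a sketch of the optional styling parameters documented below (a custom colormap, per-cluster opacity, and marker settings); the specific colors and values are arbitrary choices:

from PCAfold import variable_bins, plot_2d_clustering
from matplotlib.colors import ListedColormap
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set into four clusters:
(idx, _) = variable_bins(x, 4, verbose=False)

# Custom colormap and one opacity value per cluster (arbitrary choices):
custom_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])

# Plot the clustering result with the custom styling:
plt = plot_2d_clustering(x, y, idx, x_label='$x$', y_label='$y$', color_map=custom_map, alphas=[1, 0.5, 0.5, 1], s=20, markerscale=2, grid_on=True, figure_size=(10,6))
plt.close()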
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
clean – (optional) bool specifying if a clean plot should be made. If set to True, nothing else but the data points is plotted.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether a grid should be plotted.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_3d_clustering
#
- PCAfold.preprocess.plot_3d_clustering(x, y, z, idx, elev=45, azim=-45, x_label=None, y_label=None, z_label=None, color_map='viridis', alphas=None, first_cluster_index_zero=True, s=None, markerscale=None, legend=True, figure_size=(7, 7), title=None, save_filename=None)#
Plots a three-dimensional manifold divided into clusters. The number of observations in each cluster will be plotted in the legend.
Example:
from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result:
plt = plot_3d_clustering(x, y, z, idx, x_label='$x$', y_label='$y$', z_label='$z$', color_map='viridis', first_cluster_index_zero=False, figure_size=(10,6), title='x-y-z data set', save_filename='clustering.pdf')
plt.close()
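The elev and azim parameters documented below control the viewing angle of the three-dimensional plot. A minimal sketch (the angles are arbitrary choices) that views the same clustering from a different perspective:

from PCAfold import variable_bins, plot_3d_clustering
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1
z = x + 10

# Generate dummy clustering of the data set:
(idx, _) = variable_bins(x, 4, verbose=False)

# Plot the clustering result viewed from a different elevation and azimuth:
plt = plot_3d_clustering(x, y, z, idx, elev=30, azim=120, x_label='$x$', y_label='$y$', z_label='$z$', figure_size=(8,8))
plt.close()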
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
z – numpy.ndarray specifying the variable on the \(z\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
elev – (optional) elevation angle.
azim – (optional) azimuth angle.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
z_label – (optional) str specifying \(z\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
alphas – (optional) list specifying the opacity of each cluster.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
s – (optional) int or float specifying the scatter point size.
markerscale – (optional) int or float specifying the scale for the legend marker.
legend – (optional) bool specifying whether the legend should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_2d_train_test_samples
#
- PCAfold.preprocess.plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label=None, y_label=None, color_map='viridis', first_cluster_index_zero=True, grid_on=False, figure_size=(14, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold divided into train and test samples. The number of observations in the train and test data, respectively, will be plotted in the legend.
Example:
from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set:
(idx, borders) = variable_bins(x, 4, verbose=False)

# Generate dummy sampling of the data set:
sample = DataSampler(idx, random_seed=None, verbose=True)
(idx_train, idx_test) = sample.number(40, test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', color_map='viridis', first_cluster_index_zero=False, grid_on=True, figure_size=(12,6), title='x-y data set', save_filename='sampling.pdf')
plt.close()
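A sketch combining the manual sampling documented earlier in this section with the train/test visualization (the per-cluster percentages are arbitrary choices for illustration):

from PCAfold import variable_bins, DataSampler, plot_2d_train_test_samples
import numpy as np

# Generate dummy data set:
x = np.linspace(-1,1,100)
y = -x**2 + 1

# Generate dummy clustering of the data set into four clusters:
(idx, _) = variable_bins(x, 4, verbose=False)

# Sample a different percentage of train data from each cluster:
sample = DataSampler(idx, verbose=False)
(idx_train, idx_test) = sample.manual({0:10, 1:40, 2:40, 3:10}, sampling_type='percentage', test_selection_option=1)

# Plot the sampling result:
plt = plot_2d_train_test_samples(x, y, idx, idx_train, idx_test, x_label='$x$', y_label='$y$', figure_size=(12,6))
plt.close()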
- Parameters
x – numpy.ndarray specifying the variable on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
y – numpy.ndarray specifying the variable on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
idx – numpy.ndarray of cluster classifications. It should be of size (n_observations,) or (n_observations,1).
idx_train – numpy.ndarray specifying the indices of the train data. It should be of size (n_train,) or (n_train,1).
idx_test – numpy.ndarray specifying the indices of the test data. It should be of size (n_test,) or (n_test,1).
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
color_map – (optional) str or matplotlib.colors.ListedColormap specifying the colormap to use as per matplotlib.cm. Default is 'viridis'.
first_cluster_index_zero – (optional) bool specifying if the first cluster should be indexed 0 on the plot. If set to False, the first cluster will be indexed 1.
grid_on – (optional) bool specifying whether a grid should be plotted.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
plot_conditional_statistics
#
- PCAfold.preprocess.plot_conditional_statistics(variable, conditioning_variable, k=20, split_values=None, statistics_to_plot=['mean'], color=None, x_label=None, y_label=None, colorbar_label=None, color_map='viridis', figure_size=(7, 7), title=None, save_filename=None)#
Plots a two-dimensional manifold given by variable and conditioning_variable and the selected conditional statistics (as per preprocess.ConditionalStatistics).

Example:
from PCAfold import PCA, plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional statistics:
plt = plot_conditional_statistics(y, conditioning_variable, k=10, x_label='$x$', y_label='$y$', figure_size=(10,3), title='Conditional mean', save_filename='conditional-mean.pdf')
plt.close()
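A sketch of conditioning with user-defined bin edges instead of \(k\) equal variable bins, and plotting more than one conditional statistic (the split locations are arbitrary choices):

from PCAfold import plot_conditional_statistics
import numpy as np

# Generate dummy variables:
conditioning_variable = np.linspace(-1,1,100)
y = -conditioning_variable**2 + 1

# Plot the conditional mean and standard deviation in bins
# defined by explicit split values:
plt = plot_conditional_statistics(y, conditioning_variable, split_values=[-0.5, 0.0, 0.5], statistics_to_plot=['mean', 'std'], x_label='$x$', y_label='$y$', figure_size=(10,3))
plt.close()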
- Parameters
variable – numpy.ndarray specifying a single dependent variable to condition. This will be plotted on the \(y\)-axis. It should be of size (n_observations,) or (n_observations,1).
conditioning_variable – numpy.ndarray specifying a single variable to be used as a conditioning variable. This will be plotted on the \(x\)-axis. It should be of size (n_observations,) or (n_observations,1).
k – (optional) int specifying the number of bins to create in the conditioning variable. It has to be a positive number.
split_values – (optional) list specifying values at which splits should be performed. If set to None, splits will be performed using \(k\) equal variable bins.
statistics_to_plot – (optional) list of str specifying the conditional statistics to plot. The strings can be 'mean', 'min', 'max' or 'std'.
color – (optional) vector or string specifying the color for the manifold. If it is a vector, it has to have length consistent with the number of observations in the x and y vectors. It should be of type numpy.ndarray and of size (n_observations,) or (n_observations,1). It can also be set to a string specifying the color directly, for instance 'r' or '#006778'. If not specified, data will be plotted in black.
x_label – (optional) str specifying \(x\)-axis label annotation. If set to None, the label will not be plotted.
y_label – (optional) str specifying \(y\)-axis label annotation. If set to None, the label will not be plotted.
colorbar_label – (optional) str specifying the colorbar label annotation. If set to None, the colorbar label will not be plotted.
color_map – (optional) colormap to use as per matplotlib.cm. Default is 'viridis'.
figure_size – (optional) tuple specifying the figure size.
title – (optional) str specifying the plot title. If set to None, the title will not be plotted.
save_filename – (optional) str specifying the plot save location/filename. If set to None, the plot will not be saved. You can also set a desired file extension, for instance .pdf. If the file extension is not specified, the default is .png.
- Returns
plt - matplotlib.pyplot plot handle.
Bibliography#
- PBis06
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
- PCVJ21
Gunnar Carlsson and Mikael Vejdemo-Johansson. Topological Data Analysis with Applications. Cambridge University Press, 2021.
- PCGP12
Axel Coussement, Olivier Gicquel, and Alessandro Parente. Kernel density weighted principal component analysis of combustion processes. Combustion and flame, 159(9):2844–2855, 2012.
- PDMJRM00
Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.
- PELL09
Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Wiley Publishing, 4th edition, 2009. ISBN 0340761199.
- PGSB04
Abdul A. Gill, George D. Smith, and Anthony J. Bagnall. Improving decision tree performance through induction-and cluster-based stratified sampling. In International Conference on Intelligent Data Engineering and Automated Learning, 339–344. Springer, 2004.
- PHG09
Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
- PKR09
Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Volume 344. John Wiley & Sons, 2009.
- PKK04
Michael R Keenan and Paul G Kotula. Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images. Surface and Interface Analysis: An International Journal devoted to the development and application of techniques for the analysis of surfaces, interfaces and thin films, 36(3):203–212, 2004.
- PKEA+03
Hector C Keun, Timothy MD Ebbels, Henrik Antti, Mary E Bollard, Olaf Beckonert, Elaine Holmes, John C Lindon, and Jeremy K Nicholson. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. Analytica Chimica Acta, 490(1-2):265–276, 2003.
- PMMD10
Robert J. May, Holger R. Maier, and Graeme C. Dandy. Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks, 23(2):283–294, 2010.
- PNey92
Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In Breakthroughs in Statistics, pages 123–150. Springer, 1992.
- PNod08
Isao Noda. Scaling techniques to enhance two-dimensional correlation spectra. Journal of Molecular Structure, 883:216–227, 2008.
- PPS13
Alessandro Parente and James C. Sutherland. Principal component analysis of turbulent combustion data: data pre-processing and manifold sensitivity. Combustion and flame, 160(2):340–350, 2013.
- PPSTS09
Alessandro Parente, James C. Sutherland, Leonardo Tognotti, and Philip J. Smith. Identification of low-dimensional manifolds in turbulent flames. Proceedings of the Combustion Institute, 32(1):1579–1586, 2009.
- PRLM+16
Mojdeh Rastgoo, Guillaume Lemaitre, Joan Massich, Olivier Morel, Franck Marzani, Rafael Garcia, and Fabrice Meriaudeau. Tackling the problem of data imbalancing for melanoma classification. BIOSTEC - 3rd International Conference on BIOIMAGING, 2016.
- PSCSC03
Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, University of Miami, Coral Gables, FL, Department of Electrical and Computer Engineering, 2003.
- PvdBHW+06
Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7(1):1–15, 2006.