.. note:: This tutorial was generated from a Jupyter notebook that can be accessed `here `_. ################################## PCA on sampled data sets ################################## In this tutorial, we present how PCA can be performed on sampled data sets using various helpful functions from the ``preprocess`` and the ``reduction`` module. Those functions essentially allow to compare PCA done on the original full data set, :math:`\mathbf{X}`, and on the sampled data set, :math:`\mathbf{X_r}`. We are first going to present major functionalities for performing and analyzing PCA on a sampled data set using a special case of sampling - by taking equal number of samples from each cluster. Next, we are going to show a more general way to perform PCA on data sets that are sampled in any way of choice. A general overview for performing PCA on a sampled data set is presented below: .. image:: ../images/pca-on-sampled-data-set.svg :width: 700 The main goal is to inform the PCA transformation with some of the characteristics of the sampled data set, :math:`\mathbf{X_r}`. There are several ways in which that information can be incorporated and they can be controlled using a selected *biasing option* and setting the ``biasing_option`` input parameter whenever needed. The user is referred to the documentation for more information on the available options (under **User guide** :math:`\rightarrow` **Data reduction** :math:`\rightarrow` **Biasing options**). It is understood, that PCA performed on a sampled data set is *biased* in some way, since that data set contains different proportions of features in terms of sample density compared to their original contribution within the full original data set, :math:`\mathbf{X}`. Those features can be identified using any clustering technique of choice. We import the necessary modules: .. code:: python from PCAfold import preprocess from PCAfold import reduction from PCAfold import DataSampler from PCAfold import PCA import numpy as np from matplotlib.colors import ListedColormap and we set some initial parameters: .. code:: python scaling = 'auto' biasing_option = 2 n_clusters = 4 n_components = 2 random_seed = 100 legend_label = ['$\mathbf{X}$', '$\mathbf{X_r}$'] color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54']) save_filename = None -------------------------------------------------------------------------------- ************************************ Load and cluster the data set ************************************ As an example, we will use a data set representing combustion of syngas in air generated from the steady laminar flamelet model using chemical mechanism by Hawkes et al. :cite:`Hawkes2007`. This data set has 11 variables and 50,000 observations. The data set was generated using *Spitfire* software :cite:`Hansen2020`. To load the data set from the tutorials directory: .. code:: python X = np.genfromtxt('data-state-space.csv', delimiter=',') X_names = ['$T$', '$H_2$', '$O_2$', '$O$', '$OH$', '$H_2O$', '$H$', '$HO_2$', '$CO$', '$CO_2$', '$HCO$'] S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',') Z = np.genfromtxt('data-mixture-fraction.csv', delimiter=',') We start with clustering the data set that will result in an ``idx`` vector of cluster classifications. Clustering can be performed with any technique of choice. Here we will use one of the available functions from the ``preprocess`` module ``preprocess.zero_neighborhood_bins`` and use the first principal component source term as the conditioning variable. Perform global PCA on the data set and transform source terms of the original variables: .. code:: python pca_X = PCA(X, scaling=scaling, n_components=n_components) S_Z = pca_X.transform(S, nocenter=True) Cluster the data set: .. code:: python (idx, borders) = preprocess.zero_neighborhood_bins(S_Z[:,0], k=4, zero_offset_percentage=2, split_at_zero=True, verbose=True) Visualize the result of clustering: .. code:: python plt = preprocess.plot_2d_clustering(Z, X[:,0], idx, x_label='Mixture fraction [-]', y_label='$T$ [K]', color_map=color_map, first_cluster_index_zero=False, grid_on=True, figure_size=(8, 3), save_filename=save_filename) .. image:: ../images/tutorial-sampled-pca-clustering.svg :width: 600 :align: center -------------------------------------------------------------------------------- *********************************************** Special case of PCA on sampled data sets *********************************************** In this section, we present the special case for performing PCA on data sets formed by taking equal number of samples from local clusters. The ``reduction.EquilibratedSamplePCA`` class enables a special case of performing PCA on a sampled data set. It uses equal number of samples from each cluster and allows to analyze what happens when the data set is sampled gradually. It begins with performing PCA on the original data set and then in ``n_iterations`` it will gradually decrease the number of populations in each cluster larger than the smallest cluster, heading towards population of the smallest cluster, in each cluster. At each iteration, we obtain a new sampled data set on which PCA is performed. At the last iteration, the number of populations in each cluster are equal and, finally, PCA is performed on this *equilibrated* data set. A schematic representation of this procedure is presented in the figure below: .. image:: ../images/cluster-equilibration-scheme.svg :width: 700 :align: center Run cluster equilibration ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python equilibrated_pca = reduction.EquilibratedSamplePCA(X, idx, scaling=scaling, X_source=S_X, n_components=n_components, biasing_option=biasing_option, n_iterations=10, stop_iter=0, random_seed=random_seed, verbose=True) With ``verbose=True`` we will see some detailed information on thee number of samples in each cluster at each iteration: .. code-block:: text Biasing is performed with option 2. At iteration 1 taking samples: {0: 4144, 1: 14719, 2: 24689, 3: 2416} At iteration 2 taking samples: {0: 3953, 1: 13352, 2: 22215, 3: 2416} At iteration 3 taking samples: {0: 3762, 1: 11985, 2: 19741, 3: 2416} At iteration 4 taking samples: {0: 3571, 1: 10618, 2: 17267, 3: 2416} At iteration 5 taking samples: {0: 3380, 1: 9251, 2: 14793, 3: 2416} At iteration 6 taking samples: {0: 3189, 1: 7884, 2: 12319, 3: 2416} At iteration 7 taking samples: {0: 2998, 1: 6517, 2: 9845, 3: 2416} At iteration 8 taking samples: {0: 2807, 1: 5150, 2: 7371, 3: 2416} At iteration 9 taking samples: {0: 2616, 1: 3783, 2: 4897, 3: 2416} At iteration 10 taking samples: {0: 2416, 1: 2416, 2: 2416, 3: 2416} .. code:: python eigenvalues = equilibrated_pca.eigenvalues eigenvectors = equilibrated_pca.eigenvectors PCs = equilibrated_pca.pc_scores PC_sources = equilibrated_pca.pc_sources idx_train = equilibrated_pca.idx_train Analyze centers change ^^^^^^^^^^^^^^^^^^^^^^^^^ The ``reduction.analyze_centers_change`` function compares centers computed on the original data set, :math:`\mathbf{X}`, versus on the sampled data set, :math:`\mathbf{X_r}`. The ``idx_train`` input parameter could for instance be obtained from ``reduction.EquilibratedSamplePCA`` and will thus represent the equilibrated data set sampled from the original data set. It could also be obtained as sampled indices using any of the sampling function from the ``preprocess.DataSampler`` class. This function will produce a plot that shows the normalized centers and a percentage by which the new centers have moved with respect to the original ones. Example of a plot: .. code:: python (centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, legend_label=legend_label, save_filename=save_filename) .. image:: ../images/centers-change.svg :width: 500 :align: center If you do not wish to plot all variables present in a data set, use the ``plot_variables`` list as an input parameter to select indices of variables to plot: .. code:: python (centers_X, centers_X_r, perc, plt) = reduction.analyze_centers_change(X, idx_train, variable_names=X_names, plot_variables=[1,3,4,6,8], legend_label=legend_label, save_filename=save_filename) .. image:: ../images/centers-change-selected-variables.svg :width: 260 :align: center Analyze eigenvector weights change ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``eigenvectors`` 3D array obtained from ``reduction.EquilibratedSamplePCA`` can now be used as an input parameter for plotting the eigenvector weights change as we were gradually equilibrating cluster populations. We are going to plot the first eigenvector (corresponding to PC-1) weights change with three variants of normalization. To access the first eigenvector one can simply do: .. code:: python eigenvectors[:,0,:] similarly, to access the second eigenvector: .. code:: python eigenvectors[:,1,:] and so on. Three weight normalization variants are available: - No normalization, the absolute values of the eigenvector weights are plotted. \ To use this variant set ``normalize=False``. Example can be seen below: .. code:: python plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=False, zero_norm=False, save_filename=save_filename) .. image:: ../images/eigenvector-weights-movement-non-normalized.svg :width: 500 :align: center - Normalizing so that the highest weight is equal to 1 and the smallest weight \ is between 0 and 1. This is useful for judging the severity of the weight change. \ To use this variant set ``normalize=True`` and ``zero_norm=False``. \ Example can be seen below: .. code:: python plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=False, save_filename=save_filename) .. image:: ../images/eigenvector-weights-movement-normalized.svg :width: 500 :align: center - Normalizing so that weights are between 0 and 1. This is useful for judging \ the change trends since it will blow up even the smallest changes to the entire \ range 0-1. To use this variant set ``normalize=True`` and ``zero_norm=True``. \ Example can be seen below: .. code:: python plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,:], X_names, plot_variables=[], normalize=True, zero_norm=True, save_filename=save_filename) .. image:: ../images/eigenvector-weights-movement-normalized-to-zero.svg :width: 500 :align: center Note, that in the above example, the color bar marks the iteration number and so the :math:`0^{th}` iteration represents eigenvectors from the original data set, :math:`\mathbf{X}`. The last iteration, in this example the :math:`10^{th}` iteration, represents eigenvectors computed on the *equilibrated*, sampled data set. If you do not wish to plot all variables present in a data set, use the ``plot_variables`` list as an input parameter to select indices of variables to plot: .. image:: ../images/eigenvector-weights-movement-selected-variables.svg :width: 280 :align: center If you are only interested in plotting a comparison in the eigenvector weights change between the original data set, :math:`\mathbf{X}`, and one target sampled data set, :math:`\mathbf{X_r}`, (for instance the *equilibrated* data set) you can set the ``eigenvectors`` input parameter to only contain these two sets of weights. The function will then understand that only these two should be compared: .. code:: python plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename) .. image:: ../images/eigenvector-weights-movement-X-Xr.svg :width: 500 :align: center Such plot can be done for the pre-selected variables as well using the ``plot_variables`` list: .. code:: python plt = reduction.analyze_eigenvector_weights_change(eigenvectors[:,0,[0,-1]], X_names, plot_variables=[1,3,4,6,8], normalize=False, zero_norm=False, legend_label=legend_label, save_filename=save_filename) .. image:: ../images/eigenvector-weights-movement-X-Xr-selected-variables.svg :width: 280 :align: center Analyze eigenvalue distribution ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``reduction.analyze_eigenvalue_distribution`` function will produce a plot that shows the normalized eigenvalues distribution for the original data set, :math:`\mathbf{X}`, and for the sampled data set, :math:`\mathbf{X_r}`. Example of a plot: .. code:: python plt = reduction.analyze_eigenvalue_distribution(state_space, idx_train, scal_crit, biasing_option, legend_label=legend_label, save_filename=save_filename) .. image:: ../images/eigenvalue-distribution.svg :width: 500 :align: center Visualize the re-sampled manifold ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Using the function ``reduction.plot_2d_manifold`` you can visualize any two-dimensional manifold and additionally color it with a variable of choice. Here we are going to plot the re-sampled manifold resulting from performing PCA on the sampled data set. Example of a plot: .. code:: python plt = reduction.plot_2d_manifold(PCs[:,0,-1], PCs[:,1,-1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename) .. image:: ../images/re-sampled-manifold.svg :width: 500 :align: center -------------------------------------------------------------------------------- *********************************************** Generalization of PCA on sampled data sets *********************************************** A more general approach to performing PCA on sampled data sets (instead of using the ``reduction.EquilibratedSamplePCA`` class) is to use the the ``reduction.SamplePCA`` class. This function allows to perform PCA on a data set that has been sampled in any way (in contrast to *equilibrated* sampling which always samples equal number of samples from each cluster). .. note:: It is worth noting that the class ``reduction.EquilibratedSamplePCA`` uses ``reduction.SamplePCA`` inside. We first inspect how many samples each cluster has (in the clusters we identified earlier by binning the first principal component source term): .. code:: python print(preprocess.get_populations(idx)) which shows us populations of each cluster to be: .. code-block:: text [4335, 16086, 27163, 2416] We begin by performing manual sampling. Suppose that we would like to severely under-represent the two largest clusters and over-represent the features of the two smallest clusters. Let's select 4000 samples from :math:`k_0`, 1000 samples from :math:`k_1`, 1000 samples from :math:`k_2` and 2400 samples from :math:`k_3`. In this example we are not interested in generating test samples, so we can suppress returning those. .. code:: python sample = DataSampler(idx, idx_test=None, random_seed=random_seed, verbose=True) (idx_manual, _) = sample.manual({0:4000, 1:1000, 2:1000, 3:2400}, sampling_type='number', test_selection_option=1) The verbose information will tell us how sample densities compare in terms of percentage of samples in each cluster: .. code-block:: text Cluster 0: taking 4000 train samples out of 4335 observations (92.3%). Cluster 1: taking 1000 train samples out of 16086 observations (6.2%). Cluster 2: taking 1000 train samples out of 27163 observations (3.7%). Cluster 3: taking 2400 train samples out of 2416 observations (99.3%). Cluster 0: taking 335 test samples out of 335 remaining observations (100.0%). Cluster 1: taking 15086 test samples out of 15086 remaining observations (100.0%). Cluster 2: taking 26163 test samples out of 26163 remaining observations (100.0%). Cluster 3: taking 16 test samples out of 16 remaining observations (100.0%). Selected 8400 train samples (16.8%) and 41600 test samples (83.2%). We now perform PCA on a data set that has been sampled according to ``idx_manual`` using the ``reduction.SamplePCA`` class: .. code:: python sample_pca = reduction.SamplePCA(X, idx_manual, scaling, n_components, biasing_option) .. code:: python eigenvalues_manual = sample_pca.eigenvalues eigenvectors_manual = sample_pca.eigenvectors PCs_manual = sample_pca.pc_scores Finally, we can generate all the same plots that were shown before. Here, we are only going to present the new re-sampled manifold resulting from current manual sampling: .. code:: plt = reduction.plot_2d_manifold(PCs_manual[:,0], PCs_manual[:,1], color=X[:,0], x_label='$Z_{r, 1}$', y_label='$Z_{r, 2}$', colorbar_label='$T$ [K]', save_filename=save_filename) .. image:: ../images/generalize-sampling-re-sampled-manifold.svg :width: 500 :align: center