.. note:: This tutorial was generated from a Jupyter notebook that can be accessed `here `_.

##################
Data clustering
##################

In this tutorial, we present the clustering functionalities from the ``preprocess`` module.

We import the necessary modules:

.. code:: python

    from PCAfold import preprocess
    from PCAfold import reduction
    import numpy as np
    from matplotlib.colors import ListedColormap
    from sklearn.cluster import KMeans

and we set some initial parameters:

.. code:: python

    x_label = '$x$'
    y_label = '$y$'
    z_label = '$z$'
    figure_size = (6,3)
    color_map = ListedColormap(['#0e7da7', '#ceca70', '#b45050', '#2d2d54'])
    save_filename = None
    random_seed = 200

--------------------------------------------------------------------------------

******************************************************
Visualize the clustering result in 2D
******************************************************

We begin by demonstrating how the result of clustering can be visualized using the plotting functionalities from the ``preprocess`` module.

We generate a synthetic 2D data set composed of two distinct clouds:

.. code:: python

    np.random.seed(seed=random_seed)

    n_observations = 1000
    mean_1 = [0,1]
    mean_2 = [6,4]
    covariance_1 = [[2, 0.5], [0.5, 0.5]]
    covariance_2 = [[3, 0.3], [0.3, 0.5]]

    # Draw each cloud from its own bivariate normal distribution
    x_1, y_1 = np.random.multivariate_normal(mean_1, covariance_1, n_observations).T
    x_2, y_2 = np.random.multivariate_normal(mean_2, covariance_2, n_observations).T
    x = np.concatenate([x_1, x_2])
    y = np.concatenate([y_1, y_2])

The original data set can be visualized using the ``plot_2d_manifold`` function from the ``reduction`` module:

.. code:: python

    plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)

.. image:: ../images/tutorial-clustering-cloud-2d-data-set.svg
    :width: 450
    :align: center

We divide the data into two clusters using the K-Means algorithm:

.. code:: python

    # Stack the 1D coordinate vectors as columns of an (n_samples, 2) array,
    # since KMeans expects a 2D input
    idx_kmeans = KMeans(n_clusters=2).fit(np.column_stack((x, y))).labels_

Once the ``idx`` vector of cluster classifications is known for the data set, the result of clustering can be visualized using the ``plot_2d_clustering`` function. We plot the result of K-Means clustering on the 2D data set:

.. code:: python

    plt = preprocess.plot_2d_clustering(x, y, idx_kmeans,
                                        x_label=x_label,
                                        y_label=y_label,
                                        color_map=color_map,
                                        first_cluster_index_zero=False,
                                        figure_size=figure_size,
                                        save_filename=None)

.. image:: ../images/tutorial-clustering-cloud-2d-data-set-kmeans.svg
    :width: 600
    :align: center

Note that the numbers in the legend, next to each cluster number, give the number of samples in that cluster. The population of each cluster can also be computed and printed, for instance through:

.. code:: python

    print(preprocess.get_populations(idx_kmeans))

which in this case will print:

.. code-block:: text

    [991, 1009]

--------------------------------------------------------------------------------

******************************************************
Visualize the clustering result in 3D
******************************************************

The clustering result can also be visualized in three-dimensional space. In this example, we generate a synthetic 3D data set composed of three connected planes:

.. code:: python

    n_observations = 50

    x = np.tile(np.linspace(0,50,n_observations), n_observations)
    y = np.zeros((n_observations,1))
    z = np.zeros((n_observations*n_observations,1))

    # Build the y coordinate: each value 0, 1, ..., 49 repeated n_observations times
    for i in range(1,n_observations):
        y = np.vstack((y, np.ones((n_observations,1))*i))
    y = y.ravel()

    # Piecewise-linear z: three planes that meet at x = 10 and x = 35
    for observation, x_value in enumerate(x):

        y_value = y[observation]

        if x_value <= 10:
            z[observation] = 2 * x_value + y_value
        elif x_value > 10 and x_value <= 35:
            z[observation] = 10 * x_value + y_value - 80
        elif x_value > 35:
            z[observation] = 5 * x_value + y_value + 95

    # Scale each coordinate to the range from 0 to 1
    (x, _, _) = preprocess.center_scale(x[:,None], scaling='0to1')
    (y, _, _) = preprocess.center_scale(y[:,None], scaling='0to1')
    (z, _, _) = preprocess.center_scale(z, scaling='0to1')

The original data set can be visualized using the ``plot_3d_manifold`` function from the ``reduction`` module:

.. code:: python

    plt = reduction.plot_3d_manifold(x, y, z,
                                     elev=30,
                                     azim=-100,
                                     x_label=x_label,
                                     y_label=y_label,
                                     z_label=z_label,
                                     figure_size=(12,8),
                                     save_filename=None)

.. image:: ../images/tutorial-clustering-3d-data-set.svg
    :width: 500
    :align: center

We divide the data into four clusters using the K-Means algorithm:

.. code:: python

    idx_kmeans = KMeans(n_clusters=4).fit(np.hstack((x, y, z))).labels_

The result of K-Means clustering can then be plotted in 3D:

.. code:: python

    plt = preprocess.plot_3d_clustering(x, y, z, idx_kmeans,
                                        elev=30,
                                        azim=-100,
                                        x_label=x_label,
                                        y_label=y_label,
                                        z_label=z_label,
                                        color_map=color_map,
                                        first_cluster_index_zero=False,
                                        figure_size=(12,8),
                                        save_filename=None)

.. image:: ../images/tutorial-clustering-3d-data-set-kmeans.svg
    :width: 630
    :align: center

--------------------------------------------------------------------------------

******************************************************
Clustering based on binning a single variable
******************************************************

In this section, we demonstrate a few clustering functions that are implemented in **PCAfold**.
All of them cluster data sets based on binning a single variable. First, we generate a synthetic two-dimensional data set:

.. code:: python

    x = np.linspace(-1,1,100)
    y = -x**2 + 1

The data set can be visualized using the ``plot_2d_manifold`` function from the ``reduction`` module:

.. code:: python

    plt = reduction.plot_2d_manifold(x, y, x_label=x_label, y_label=y_label, figure_size=figure_size, save_filename=None)

.. image:: ../images/tutorial-clustering-original-data-set.svg
    :width: 400
    :align: center

We will now cluster the 2D data set according to bins of a single variable, :math:`x`.

Cluster into equal variable bins
=================================

.. image:: ../images/clustering-variable-bins.svg
    :width: 600
    :align: center

This clustering divides the data set into equal bins of a variable vector.

.. code:: python

    (idx_variable_bins, borders_variable_bins) = preprocess.variable_bins(x, 4, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [-1.0, -0.5, 0.0, 0.5, 1.0]

    Bounds for cluster 0:
        -1.0, -0.5152
    Bounds for cluster 1:
        -0.4949, -0.0101
    Bounds for cluster 2:
        0.0101, 0.4949
    Bounds for cluster 3:
        0.5152, 1.0

The result of clustering can be plotted in 2D:

.. code:: python

    plt = preprocess.plot_2d_clustering(x, y, idx_variable_bins,
                                        x_label=x_label,
                                        y_label=y_label,
                                        color_map=color_map,
                                        first_cluster_index_zero=False,
                                        grid_on=True,
                                        figure_size=figure_size,
                                        save_filename=None)

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-variable-bins-k4.svg
    :width: 500
    :align: center

Note that this clustering function created four equal bins in the space of :math:`x`. In this case, since :math:`x` ranges from -1 to 1, the bins are created as intervals of length 0.5 in the :math:`x`-space.

Cluster into pre-defined variable bins
======================================

.. image:: ../images/clustering-predefined-variable-bins.svg
    :width: 600
    :align: center

This clustering divides the data set into bins of a one-dimensional variable vector whose borders are specified by the user. Let's specify the split values as ``split_values = [-0.6, 0.4, 0.8]``:

.. code:: python

    split_values = [-0.6, 0.4, 0.8]
    (idx_predefined_variable_bins, borders_predefined_variable_bins) = preprocess.predefined_variable_bins(x, split_values, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [-1.0, -0.6, 0.4, 0.8, 1.0]

    Bounds for cluster 0:
        -1.0, -0.6162
    Bounds for cluster 1:
        -0.596, 0.3939
    Bounds for cluster 2:
        0.4141, 0.798
    Bounds for cluster 3:
        0.8182, 1.0

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-predefined-variable-bins-k4.svg
    :width: 500
    :align: center

This clustering function created four bins in the space of :math:`x`, where the splits in the :math:`x`-space are located at :math:`x=-0.6`, :math:`x=0.4` and :math:`x=0.8`.

Cluster into zero-neighborhood variable bins
============================================

This partitioning relies on an unbalanced variable vector which, in principle, is assumed to have many observations whose values are close to zero and relatively few observations with values away from zero. This function can be used to separate close-to-zero observations into one cluster (``split_at_zero=False``) or two clusters (``split_at_zero=True``).

Without splitting at zero, ``split_at_zero=False``
------------------------------------------------------

.. image:: ../images/clustering-zero-neighborhood-bins.svg
    :width: 700
    :align: center

.. code:: python

    (idx_zero_neighborhood_bins, borders_zero_neighborhood_bins) = preprocess.zero_neighborhood_bins(x, 3, zero_offset_percentage=10, split_at_zero=False, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [-1. -0.2 0.2 1. ]

    Bounds for cluster 0:
        -1.0, -0.2121
    Bounds for cluster 1:
        -0.1919, 0.1919
    Bounds for cluster 2:
        0.2121, 1.0

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-zero-neighborhood-bins-k3.svg
    :width: 500
    :align: center

We note that the observations corresponding to :math:`x \approx 0` have been classified into one cluster (:math:`k_2`).

With splitting at zero, ``split_at_zero=True``
------------------------------------------------------

.. image:: ../images/clustering-zero-neighborhood-bins-zero-split.svg
    :width: 700
    :align: center

.. code:: python

    (idx_zero_neighborhood_bins_split_at_zero, borders_zero_neighborhood_bins_split_at_zero) = preprocess.zero_neighborhood_bins(x, 4, zero_offset_percentage=10, split_at_zero=True, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [-1. -0.2 0. 0.2 1. ]

    Bounds for cluster 0:
        -1.0, -0.2121
    Bounds for cluster 1:
        -0.1919, -0.0101
    Bounds for cluster 2:
        0.0101, 0.1919
    Bounds for cluster 3:
        0.2121, 1.0

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-zero-neighborhood-bins-split-at-zero-k4.svg
    :width: 500
    :align: center

We note that the observations corresponding to :math:`x \approx 0^{-}` have been classified into one cluster (:math:`k_2`) and the observations corresponding to :math:`x \approx 0^{+}` have been classified into another cluster (:math:`k_3`).

--------------------------------------------------------------------------------

******************************************************
Clustering combustion data sets
******************************************************

In this section, we present functions that are specifically aimed at clustering reactive flow data sets.
We will use a data set representing combustion of syngas in air, generated from the steady laminar flamelet model using the *Spitfire* software :cite:`Hansen2020` and a chemical mechanism by Hawkes et al. :cite:`Hawkes2007`.

We import the flamelet data set:

.. code:: python

    X = np.genfromtxt('data-state-space.csv', delimiter=',')
    S_X = np.genfromtxt('data-state-space-sources.csv', delimiter=',')
    mixture_fraction = np.genfromtxt('data-mixture-fraction.csv', delimiter=',')

Cluster into bins of the mixture fraction vector
================================================

.. image:: ../images/clustering-mixture-fraction-bins.svg
    :width: 600
    :align: center

In this example, we partition the data set into five bins of the mixture fraction vector. This is a feasible clustering strategy for non-premixed flames, which takes advantage of the physics-based (supervised) partitioning of the data set based on local stoichiometry. The partitioning function requires specifying the value of the stoichiometric mixture fraction, :math:`Z_{st}` (``Z_stoich``). Note that the first split in the data set is performed at :math:`Z_{st}` and further splits are performed automatically on the fuel-lean and the fuel-rich branch.

.. code:: python

    Z_stoich = 0.273
    (idx_mixture_fraction_bins, borders_mixture_fraction_bins) = preprocess.mixture_fraction_bins(mixture_fraction, 5, Z_stoich, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [0. 0.1365 0.273 0.51533333 0.75766667 1. ]

    Bounds for cluster 0:
        0.0, 0.1313
    Bounds for cluster 1:
        0.1414, 0.2727
    Bounds for cluster 2:
        0.2828, 0.5152
    Bounds for cluster 3:
        0.5253, 0.7576
    Bounds for cluster 4:
        0.7677, 1.0

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-mixture-fraction-bins-k4.svg
    :width: 550
    :align: center

It can be seen that the data set is divided at the stoichiometric value of mixture fraction, in this case :math:`Z_{st} \approx 0.273`. The fuel-lean branch (the part of the flamelet to the left of :math:`Z_{st}`) is divided into two clusters (:math:`k_1` and :math:`k_2`) and the fuel-rich branch (the part of the flamelet to the right of :math:`Z_{st}`) is divided into three clusters (:math:`k_3`, :math:`k_4` and :math:`k_5`), since this branch spans a wider range in the mixture fraction space.

Separating close-to-zero principal component source terms
=========================================================

The function ``zero_neighborhood_bins`` can be used to separate close-to-zero source terms of the original variables (or close-to-zero source terms of the principal components (PCs)). The zero source terms physically correspond to the steady state.

We first compute the source terms of the principal components by transforming the source terms of the original variables to the new PC-basis:

.. code:: python

    pca_X = reduction.PCA(X, scaling='auto', n_components=2)
    # Source terms are transformed without centering
    S_Z = pca_X.transform(S_X, nocenter=True)

and we use the first PC source term, :math:`S_{Z,1}`, as the conditioning variable for the clustering function:

.. code:: python

    (idx_close_to_zero_source_terms, borders_close_to_zero_source_terms) = preprocess.zero_neighborhood_bins(S_Z[:,0], 4, zero_offset_percentage=5, split_at_zero=True, verbose=True)

With ``verbose=True`` we will see some detailed information on clustering:

.. code-block:: text

    Border values for bins:
    [-87229.83051401 -5718.91469641 0. 5718.91469641 27148.46341416]

    Bounds for cluster 0:
        -87229.8305, -5722.1432
    Bounds for cluster 1:
        -5717.5228, -0.0
    Bounds for cluster 2:
        0.0, 5705.7159
    Bounds for cluster 3:
        5719.0347, 27148.4634

The visual result of this clustering can be seen below:

.. image:: ../images/tutorial-clustering-close-to-zero-source-terms-k4.svg
    :width: 550
    :align: center

From the verbose information, we can see that the first cluster (:math:`k_1`) contains observations corresponding to the highly negative values of :math:`S_{Z,1}`, the second cluster (:math:`k_2`) to the close-to-zero but negative values of :math:`S_{Z,1}`, the third cluster (:math:`k_3`) to the close-to-zero but positive values of :math:`S_{Z,1}` and the fourth cluster (:math:`k_4`) to the highly positive values of :math:`S_{Z,1}`.

We can further merge the two clusters that contain observations corresponding to the high magnitudes of :math:`S_{Z, 1}` into one cluster. This can be achieved using the function ``flip_clusters``. We change the label of the fourth cluster from ``3`` to ``0``, so that all observations from the fourth cluster are now assigned to the first cluster.

.. code:: python

    idx_merged = preprocess.flip_clusters(idx_close_to_zero_source_terms, {3:0})

The visual result of this merged clustering can be seen below:

.. image:: ../images/tutorial-clustering-close-to-zero-source-terms-merged-k4.svg
    :width: 550
    :align: center

If we further plot the two-dimensional flamelet manifold, colored by :math:`S_{Z, 1}`, we can check that the clustering technique correctly identified the regions on the manifold where :math:`S_{Z, 1} \approx 0` as well as the regions where :math:`S_{Z, 1}` has high positive or high negative magnitudes.

.. image:: ../images/tutorial-clustering-close-to-zero-source-terms-manifold.svg
    :width: 590
    :align: center
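
The border values printed by ``zero_neighborhood_bins`` in this tutorial are consistent with a simple rule: the offset from zero equals ``zero_offset_percentage`` percent of the full range of the conditioning variable. For :math:`x`, 10% of a range of 2 gives 0.2, and for :math:`S_{Z,1}`, 5% of its range gives approximately 5718.9. The plain-Python sketch below reproduces the borders and cluster populations for the :math:`x` example under that assumption. The helper names ``zero_neighborhood_borders`` and ``assign_bins`` are hypothetical and are not part of **PCAfold**; the actual implementation may differ, for instance in how values lying exactly on a border are treated.

```python
from bisect import bisect_right


def zero_neighborhood_borders(values, zero_offset_percentage, split_at_zero):
    """Hypothetical helper: bin borders inferred from the printed outputs."""
    lo, hi = min(values), max(values)
    # Assumption: the offset is a percentage of the full range of the variable.
    offset = zero_offset_percentage / 100 * (hi - lo)
    inner = [-offset, 0.0, offset] if split_at_zero else [-offset, offset]
    return [lo] + inner + [hi]


def assign_bins(values, borders):
    """Label each value with the index of the bin it falls into."""
    inner = borders[1:-1]
    # The number of inner borders at or below a value is its bin index.
    return [bisect_right(inner, v) for v in values]


# Same variable as in the tutorial: 100 equally spaced points spanning [-1, 1].
x = [-1 + 2 * i / 99 for i in range(100)]

borders = zero_neighborhood_borders(x, zero_offset_percentage=10, split_at_zero=True)
idx = assign_bins(x, borders)
populations = [idx.count(k) for k in range(len(borders) - 1)]

print(borders)      # [-1.0, -0.2, 0.0, 0.2, 1.0]
print(populations)  # [40, 10, 10, 40]
```

The two middle bins hold only 20 of the 100 observations here because :math:`x` is uniformly spaced; for a variable such as :math:`S_{Z,1}`, with many near-zero values, the same borders would isolate a much larger share of the data.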