Getting started#

Installation#

Dependencies#

PCAfold requires Python>=3.8 and the latest versions of the following packages:

pip install Cython
pip install matplotlib
pip install numpy
pip install scipy
pip install termcolor
pip install pandas
pip install scikit-learn
pip install tensorflow
pip install keras
pip install tqdm

Build from source#

Clone the PCAfold repository and move into the PCAfold directory created:

git clone http://gitlab.multiscale.utah.edu/common/PCAfold.git
cd PCAfold

Run the installation from setup.py:

python setup.py build_ext --inplace
python setup.py install

Note, that this will be deprecated in the future, but should still work.

Alternatively, run the installation using pip:

python -m pip install .

If the installation was successful, you are ready to import PCAfold! In Python, you can now import all modules:

from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
from PCAfold import reconstruction
from PCAfold import utilities

Testing#

To run regression tests from the base repo directory run:

python -m unittest discover

To switch verbose on, use the -v flag.

All tests should be passing. If any of the tests is failing and you can’t sort out why, please open an issue on GitLab.

Local documentation build#

To build the documentation locally, you need sphinx installed on your machine, along with few extensions:

pip install Sphinx
pip install sphinxcontrib-bibtex
pip install furo

Then, navigate to docs/ directory and build the documentation:

sphinx-build -b html . builddir

make html

Documentation main page _build/html/index.html can be opened in a web browser.

On MacOS, you can open it directly from the terminal:

open _build/html/index.html

Plotting#

Some functions within PCAfold result in plot outputs. Global styles for the plots, such as font types and sizes, are set using the PCAfold/styles.py file. This file can be updated with new settings that will be seen globally by all PCAfold modules. Re-install the project after changing styles.py file:

python -m pip install .

Note, that all plotting functions return handles to generated plots.

Workflows#

In this section, we present several popular workflows that can be achieved using functionalities of PCAfold. An overview for combining PCAfold modules into a complete workflow is presented in the diagram below:

../_images/PCAfold-software-architecture.svg

Each module’s functionalities can also be used as a standalone tool for performing a specific task and can easily combine with techniques from outside of this software.

The format for the user-supplied input data matrix \(\mathbf{X} \in \mathbb{R}^{N \times Q}\), common to all modules, is that \(N\) observations are stored in rows and \(Q\) variables are stored in columns. Since typically \(N \gg Q\), the initial dimensionality of the data set is determined by the number of variables, \(Q\).

\[\begin{split}\mathbf{X} = \begin{bmatrix} \vdots & \vdots & & \vdots \\ X_1 & X_2 & \dots & X_{Q} \\ \vdots & \vdots & & \vdots \\ \end{bmatrix}\end{split}\]

Below are brief descriptions of several workflows that utilize functionalities of PCAfold:

Data manipulation#

Basic data manipulation such as centering, scaling, outlier detection and removal or kernel density weighting of data sets can be achieved using the preprocess module.

Data clustering#

Data clustering can be achieved using the preprocess module. This functionality can be useful for data analysis or feature detection and can also be the first step for applying data reduction techniques locally (on local portions of the data). It is also worth pointing out that clustering algorithms from outside of PCAfold software can be brought into the workflow.

Data sampling#

Data sampling can be achieved using the preprocess module. Possible use-case for sampling data sets could be to split data sets into train and test samples for other Machine Learning algorithms. Another use-case can be sampling imbalanced data sets.

Global PCA#

Global PCA can be performed using PCA class available in the reduction module.

Local PCA#

Local PCA can be performed using LPCA class available in the reduction module.

PCA on sampled data sets#

PCA on sampled data sets can be performed by combining sampling techniques from the preprocess module, with PCA class available in the reduction module. The reduction module additionally contains a few more functions specifically designed to help analyze the results of performing PCA on sampled data sets.

Assessing manifold quality#

Once a low-dimensional manifold is obtained, the quality of the manifold can be assessed using functionalities available in the analysis module. It is worth noting that the manifold assessment metrics available can be equally applied to manifolds derived by means of techniques other than PCA.

Reconstructing quantities of interest (QoIs)#

Using the reconstruction module, quantities of interest (QoIs) can be reconstructed from the reduced data representations using kernel regression, artificial neural networks (ANN) and a novel approach called partition of unity networks (POUnets).

Improving projection topologies#

Two novel algorithms based on the quantitative cost function are introduced in the utilities module that can help improve topologies of PCA projections through appropriate variable selection. We also introduce an autoencoder-like strategy that optimizes the projection topology directly based on the custom projection-independent and projection-dependent quantities of interest (QoIs).