Getting started
===============

Installation
------------

Dependencies
^^^^^^^^^^^^

**PCAfold** requires Python>=3.8 and the latest versions of the following packages:

.. code-block:: text

  pip install Cython
  pip install matplotlib
  pip install numpy
  pip install scipy
  pip install termcolor
  pip install pandas
  pip install scikit-learn
  pip install tensorflow
  pip install keras
  pip install tqdm

Build from source
^^^^^^^^^^^^^^^^^

Clone the **PCAfold** repository and move into the newly created ``PCAfold`` directory:

.. code-block:: text

  git clone http://gitlab.multiscale.utah.edu/common/PCAfold.git
  cd PCAfold

Run the installation from ``setup.py``:

.. code-block:: text

  python setup.py build_ext --inplace
  python setup.py install

Note that this installation method will be deprecated in the future, but it should still work for now.

Alternatively, run the installation using ``pip``:

.. code-block:: text

  python -m pip install .

If the installation was successful, you are ready to import **PCAfold**!
In Python, you can now import all modules:

.. code-block:: python

  from PCAfold import preprocess
  from PCAfold import reduction
  from PCAfold import analysis
  from PCAfold import reconstruction
  from PCAfold import utilities
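
As a quick sanity check, the sketch below reports which of the modules listed
above can be imported, without stopping at the first failure. The helper name
``check_pcafold_modules`` is hypothetical and not part of **PCAfold**; only the
module names are taken from the import statements above.

.. code-block:: python

  import importlib

  MODULES = ["preprocess", "reduction", "analysis", "reconstruction", "utilities"]

  def check_pcafold_modules(package="PCAfold", modules=MODULES):
      """Return a dict mapping each module name to True if it can be imported."""
      status = {}
      for name in modules:
          try:
              importlib.import_module(f"{package}.{name}")
              status[name] = True
          except ImportError:
              # Covers a missing package as well as a missing dependency.
              status[name] = False
      return status

  for name, ok in check_pcafold_modules().items():
      print(f"{name}: {'OK' if ok else 'not importable'}")

If any module is reported as not importable, re-check the dependency list and
the installation steps above.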

Testing
^^^^^^^

To run the regression tests, run the following from the base repository directory:

.. code-block:: text

  python -m unittest discover

To enable verbose output, add the ``-v`` flag.

All tests should pass. If any test fails and you cannot work out why,
please open an issue on `GitLab <https://gitlab.multiscale.utah.edu/common/PCAfold>`_.

Local documentation build
^^^^^^^^^^^^^^^^^^^^^^^^^

To build the documentation locally, you need ``sphinx`` installed on your machine,
along with a few extensions:

.. code-block:: text

  pip install Sphinx
  pip install sphinxcontrib-bibtex
  pip install furo

Then, navigate to the ``docs/`` directory and build the documentation with either of the following commands:

.. code-block:: text

  sphinx-build -b html . builddir

  make html

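The documentation main page, ``_build/html/index.html``, can then be opened in a web browser
(if you built with the ``sphinx-build`` command above, the output is written to ``builddir/`` instead).

On macOS, you can open it directly from the terminal: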

.. code-block:: text

  open _build/html/index.html

Plotting
--------

Some functions within **PCAfold** produce plots. Global plot styles, such as
font types and sizes, are set in the ``PCAfold/styles.py`` file.
You can update this file with new settings, which then apply globally to all
**PCAfold** modules. Re-install the project after changing the ``styles.py`` file:

.. code-block:: text

  python -m pip install .

Note that all plotting functions return handles to the generated plots.

Workflows
---------

In this section, we present several popular workflows that can be built from
the functionalities of **PCAfold**. The diagram below gives an overview of how
**PCAfold** modules combine into a complete workflow:

.. image:: ../images/PCAfold-software-architecture.svg
  :width: 800
  :align: center

Each module’s functionalities can also be used as a standalone tool for
performing a specific task and can easily be combined with techniques from outside of
this software.

The format for the user-supplied input data matrix
:math:`\mathbf{X} \in \mathbb{R}^{N \times Q}`, common to all modules, is that
:math:`N` observations are stored in rows and :math:`Q` variables are stored
in columns. Since typically :math:`N \gg Q`, the initial dimensionality of the
data set is determined by the number of variables, :math:`Q`.

.. math::

  \mathbf{X} =
  \begin{bmatrix}
  \vdots & \vdots & & \vdots \\
  X_1 & X_2 & \dots & X_{Q} \\
  \vdots & \vdots & & \vdots \\
  \end{bmatrix}
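
As a minimal illustration of this layout, the sketch below builds a synthetic
NumPy array with observations in rows and variables in columns. The sizes and
the random data are made up for illustration only.

.. code-block:: python

  import numpy as np

  # Hypothetical sizes: N observations of Q variables, with N >> Q.
  N, Q = 1000, 5

  rng = np.random.default_rng(seed=0)
  X = rng.normal(size=(N, Q))

  n_observations, n_variables = X.shape

  first_observation = X[0, :]  # one row: all Q variables at a single observation
  first_variable = X[:, 0]     # one column: the variable X_1 at all N observations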

Below are brief descriptions of several workflows that utilize functionalities of **PCAfold**:

Data manipulation
^^^^^^^^^^^^^^^^^

Basic data manipulation, such as centering, scaling, outlier detection and removal,
or kernel density weighting of data sets, can be achieved using the ``preprocess``
module.
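
To make centering and scaling concrete, here is a plain-NumPy sketch that
centers each variable by its mean and scales it by its standard deviation.
It is not the ``preprocess`` API: the function names are hypothetical, and the
choice of mean and standard deviation is an assumption made for illustration.

.. code-block:: python

  import numpy as np

  def center_and_scale(X):
      """Center each column by its mean and scale it by its standard deviation."""
      centers = X.mean(axis=0)
      scales = X.std(axis=0)
      return (X - centers) / scales, centers, scales

  def invert_center_and_scale(X_cs, centers, scales):
      """Map centered and scaled data back to the original units."""
      return X_cs * scales + centers

  rng = np.random.default_rng(seed=0)
  X = rng.normal(loc=10.0, scale=3.0, size=(200, 4))

  X_cs, centers, scales = center_and_scale(X)
  X_back = invert_center_and_scale(X_cs, centers, scales)

Keeping the centers and scales alongside the transformed data is what allows
results to be mapped back to the original units later in a workflow.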

Data clustering
^^^^^^^^^^^^^^^

Data clustering can be achieved using the ``preprocess`` module. This functionality can be
useful for data analysis or feature detection and can also be the first
step for applying data reduction techniques locally (on local portions of the data).
Clustering algorithms from outside of **PCAfold** can also be brought into
the workflow.
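
One simple way to picture a clustering result is as a vector of integer labels,
one per observation. The sketch below splits observations into ``k`` bins of
equal width along a chosen variable. It uses NumPy only; the function name
``bin_variable`` and the equal-width strategy are assumptions for illustration,
not a description of the ``preprocess`` module.

.. code-block:: python

  import numpy as np

  def bin_variable(var, k):
      """Assign each observation to one of k equal-width bins of `var`."""
      borders = np.linspace(var.min(), var.max(), k + 1)
      # Digitize against the interior borders only, so that labels fall in
      # 0..k-1 and the maximum value lands in the last bin.
      idx = np.digitize(var, borders[1:-1])
      return idx, borders

  rng = np.random.default_rng(seed=0)
  X = rng.uniform(size=(500, 3))

  idx, borders = bin_variable(X[:, 0], k=4)
  cluster_sizes = np.bincount(idx, minlength=4)

A label vector produced by any external clustering algorithm could be used in
the same way in the later steps.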

Data sampling
^^^^^^^^^^^^^

Data sampling can be achieved using the ``preprocess`` module. One possible
use case is splitting a data set into train and test samples for other machine
learning algorithms. Another is sampling imbalanced data sets.
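
The train and test split can be sketched as follows, assuming a label vector
``idx`` that assigns each observation to a cluster. The function
``stratified_split`` is hypothetical and written with NumPy only. It takes the
same fraction of every cluster for training, so that a small cluster is still
represented in both samples.

.. code-block:: python

  import numpy as np

  def stratified_split(idx, train_fraction, seed=None):
      """Select `train_fraction` of the observations in every cluster for training."""
      rng = np.random.default_rng(seed)
      train_parts = []
      for label in np.unique(idx):
          members = np.flatnonzero(idx == label)
          n_train = int(round(train_fraction * members.size))
          train_parts.append(rng.choice(members, size=n_train, replace=False))
      idx_train = np.sort(np.concatenate(train_parts))
      # Everything that was not selected for training becomes test data.
      idx_test = np.setdiff1d(np.arange(idx.size), idx_train)
      return idx_train, idx_test

  # Imbalanced labels: 900 observations in cluster 0 and 100 in cluster 1.
  idx = np.array([0] * 900 + [1] * 100)
  idx_train, idx_test = stratified_split(idx, train_fraction=0.8, seed=0)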

Global PCA
^^^^^^^^^^

Global PCA can be performed using the ``PCA`` class available in the ``reduction`` module.
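
For readers who want to see what global PCA computes, below is a NumPy-only
sketch based on the eigendecomposition of the covariance matrix of centered and
scaled data. It does not use the ``PCA`` class; the function ``pca_fit``, its
return values and the synthetic data are assumptions made for illustration.

.. code-block:: python

  import numpy as np

  def pca_fit(X, n_components):
      """PCA via eigendecomposition of the covariance of centered, scaled data."""
      centers = X.mean(axis=0)
      scales = X.std(axis=0)
      X_cs = (X - centers) / scales
      covariance = np.cov(X_cs, rowvar=False)
      eigenvalues, eigenvectors = np.linalg.eigh(covariance)
      order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
      A = eigenvectors[:, order[:n_components]]
      return A, eigenvalues[order], centers, scales

  rng = np.random.default_rng(seed=0)
  # Synthetic data: 3 variables driven by 2 latent variables, plus small noise.
  latent = rng.normal(size=(1000, 2))
  mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 1.0]])
  X = latent @ mixing + 0.01 * rng.normal(size=(1000, 3))

  A, eigenvalues, centers, scales = pca_fit(X, n_components=2)
  Z = ((X - centers) / scales) @ A      # principal components, shape (N, 2)
  X_rec = (Z @ A.T) * scales + centers  # approximate reconstruction of X

Because the synthetic data is close to two-dimensional, two principal
components are enough to reconstruct it almost exactly.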

Local PCA
^^^^^^^^^

Local PCA can be performed using the ``LPCA`` class available in the ``reduction`` module.
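
The idea behind local PCA can be sketched as running PCA separately within each
cluster of the data. The NumPy-only example below assumes a label vector ``idx``
and a hypothetical function ``local_pca``; it is an illustration of the concept
and not the ``LPCA`` class interface.

.. code-block:: python

  import numpy as np

  def local_pca(X, idx, n_components):
      """Run PCA in each cluster; return eigenvectors keyed by cluster label."""
      local_eigenvectors = {}
      for label in np.unique(idx):
          X_local = X[idx == label]
          X_centered = X_local - X_local.mean(axis=0)
          covariance = np.cov(X_centered, rowvar=False)
          eigenvalues, eigenvectors = np.linalg.eigh(covariance)
          order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
          local_eigenvectors[int(label)] = eigenvectors[:, order[:n_components]]
      return local_eigenvectors

  rng = np.random.default_rng(seed=0)
  # Two synthetic clusters elongated along different directions.
  cluster_0 = rng.normal(size=(300, 2)) * np.array([5.0, 0.5])
  cluster_1 = rng.normal(size=(300, 2)) * np.array([0.5, 5.0]) + 20.0
  X = np.vstack([cluster_0, cluster_1])
  idx = np.array([0] * 300 + [1] * 300)

  A_local = local_pca(X, idx, n_components=1)

In this example the leading local direction differs between the two clusters,
which a single global PCA could not capture.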

PCA on sampled data sets
^^^^^^^^^^^^^^^^^^^^^^^^

PCA on sampled data sets can be performed by combining sampling techniques from
the ``preprocess`` module with the ``PCA`` class available in the ``reduction``
module. The ``reduction`` module also contains a few functions specifically
designed to help analyze the results of performing PCA on sampled data sets.
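
To illustrate why sampling can matter for PCA, the NumPy-only sketch below
compares the leading eigenvector found on a full, imbalanced data set with the
one found after taking an equal number of observations from each cluster. The
helper names and the synthetic data are hypothetical, and equal sampling is
only one of many possible sampling strategies.

.. code-block:: python

  import numpy as np

  def leading_eigenvector(X):
      """Return the direction of largest variance of the centered data."""
      X_centered = X - X.mean(axis=0)
      _, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
      return eigenvectors[:, -1]  # eigh sorts eigenvalues in ascending order

  def equal_sample(idx, n_per_cluster, seed=None):
      """Pick the same number of observations from every cluster."""
      rng = np.random.default_rng(seed)
      picks = [
          rng.choice(np.flatnonzero(idx == label), size=n_per_cluster, replace=False)
          for label in np.unique(idx)
      ]
      return np.sort(np.concatenate(picks))

  rng = np.random.default_rng(seed=0)
  # Imbalanced data: a large cluster spread along x, a small one spread along y.
  big = rng.normal(size=(2000, 2)) * np.array([3.0, 0.3])
  small = rng.normal(size=(100, 2)) * np.array([0.3, 6.0])
  X = np.vstack([big, small])
  idx = np.array([0] * 2000 + [1] * 100)

  sample = equal_sample(idx, n_per_cluster=100, seed=0)
  v_full = leading_eigenvector(X)
  v_sampled = leading_eigenvector(X[sample])

  # Absolute cosine between the two leading directions (1 means unchanged).
  alignment = abs(v_full @ v_sampled)

On the full data set the large cluster dominates the leading direction, while
after equal sampling the small cluster has much more influence on it.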

Assessing manifold quality
^^^^^^^^^^^^^^^^^^^^^^^^^^

Once a low-dimensional manifold is obtained, the quality of the manifold can be
assessed using functionalities available in the ``analysis`` module.
It is worth noting that the manifold assessment metrics available can be
equally applied to manifolds derived by means of techniques other than PCA.

Reconstructing quantities of interest (QoIs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the ``reconstruction`` module, quantities of interest (QoIs) can be reconstructed from the reduced
data representations using kernel regression, artificial neural networks (ANNs), and a novel
approach called partition of unity networks (POUnets).
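
As an illustration of the kernel regression idea, the sketch below implements a
Nadaraya-Watson estimator with a Gaussian kernel in plain NumPy and uses it to
predict a QoI from a one-dimensional manifold parameter. The function
``kernel_regression``, the fixed bandwidth and the synthetic QoI are
assumptions for illustration; this is not the interface of the
``reconstruction`` module.

.. code-block:: python

  import numpy as np

  def kernel_regression(train_points, train_values, query_points, bandwidth):
      """Nadaraya-Watson estimate with a Gaussian kernel of fixed bandwidth."""
      # Squared Euclidean distances between every query and training point.
      diff = query_points[:, None, :] - train_points[None, :, :]
      sq_dist = np.sum(diff**2, axis=2)
      weights = np.exp(-sq_dist / bandwidth**2)
      # Weighted average of the training values for each query point.
      return (weights @ train_values) / weights.sum(axis=1)

  rng = np.random.default_rng(seed=0)
  # Hypothetical 1D manifold parameter and a QoI that depends on it smoothly.
  z_train = rng.uniform(-1.0, 1.0, size=(400, 1))
  qoi_train = np.sin(3.0 * z_train[:, 0])

  z_query = np.linspace(-0.8, 0.8, 50).reshape(-1, 1)
  qoi_pred = kernel_regression(z_train, qoi_train, z_query, bandwidth=0.05)
  max_error = np.max(np.abs(qoi_pred - np.sin(3.0 * z_query[:, 0])))

The bandwidth controls the trade-off between smoothing and fidelity: a larger
value averages over more neighbors, a smaller one follows the training data
more closely.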

Improving projection topologies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two novel algorithms based on a quantitative cost function are introduced in the ``utilities`` module.
They can help improve the topologies of PCA projections through appropriate variable selection.
We also introduce an autoencoder-like strategy that optimizes the projection topology directly,
based on custom projection-independent and projection-dependent quantities of interest (QoIs).