Difference between revisions of "Principal component analysis"

From Dynamo
 

Latest revision as of 18:03, 28 March 2020

In general, Principal Component Analysis (PCA) aims at analyzing a data set and discovering a set of coordinates that capture its most representative features. The term PCA classification is often used loosely: PCA is not itself a classification method, and the classification is performed on the features extracted through PCA.

In Dynamo, PCA is the process of finding a reduced set of "eigenvolumes" that allow each particle in the data set to be approximated as a combination of these eigenvolumes. With this representation, a generic particle can be described by the contribution of each eigenvolume to the particle, i.e., by a set of "eigencomponents", normally not many more than 20.

Once the particles are represented by small sets of scalars, they can be classified with standard methods like k-means.
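As an illustration of this last step, the sketch below runs a minimal k-means (Lloyd's algorithm) on a synthetic array of eigencomponents, one row per particle. It is plain NumPy, not Dynamo code: the function name `kmeans`, the farthest-point initialization and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Minimal k-means on an (N, K) array of eigencomponents (hypothetical helper)."""
    # Farthest-point initialization: deterministic, and spreads the centers out.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each particle to its nearest center (squared Euclidean distance).
        dist = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        # Move each center to the mean of its members; keep it if the class is empty.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic eigencomponents: 40 particles, 5 components, two well-separated classes.
rng = np.random.default_rng(4)
class_a = rng.normal(0.0, 0.1, size=(20, 5))
class_b = rng.normal(3.0, 0.1, size=(20, 5))
eigcomp = np.vstack([class_a, class_b])

labels, centers = kmeans(eigcomp, k=2)
```

The point of the example is that the classifier never sees the volumes themselves, only the short vector of scalars attached to each particle.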

Operative steps

PCA classifications are most easily handled through classification workflows. These projects can be controlled through GUIs or the command line.

Whichever way you control the classification project, a PCA-based classification requires the completion of these steps:

Selecting the input
a data folder, a table, a mask
Computing a cross-correlation matrix
Computing the eigenvalues, eigenvolumes and eigencomponents
Using the eigencomponents to create a classification.

Input

PCA is computed on a set of aligned particles. Thus, you need a data folder and a table that describes the alignment. In the most common case, you want to focus the classification on a region of the box, so you also need a classification mask.

Additionally, some fine-tuning parameters can be passed: particles can be symmetrized, resized or bandpassed.

Computation of cross-correlation matrix

Main article: Cross correlation matrix

All the aligned particles are compared to each other through cross-correlation. This produces an NxN matrix for a set of N particles. This is typically the most time-consuming part of the PCA workflow.
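The all-against-all comparison can be sketched as follows. This is a minimal NumPy illustration, not Dynamo's implementation: it assumes the particles are already aligned and stored in one array, and it uses the normalized correlation coefficient inside the mask as the similarity measure. The name `cc_matrix` and the synthetic data are hypothetical.

```python
import numpy as np

def cc_matrix(particles, mask):
    """Normalized cross-correlation between all pairs of particles.

    particles: array of shape (N, nx, ny, nz), assumed already aligned.
    mask: boolean array of shape (nx, ny, nz) selecting the voxels to compare.
    """
    # Keep only the voxels inside the mask: one row per particle.
    X = particles[:, mask].astype(float)
    # Normalize each particle to zero mean and unit norm inside the mask.
    X -= X.mean(axis=1, keepdims=True)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    # Entry (i, j) is the correlation coefficient of particles i and j.
    return X @ X.T

rng = np.random.default_rng(0)
particles = rng.normal(size=(6, 8, 8, 8))   # 6 synthetic "particles"
mask = np.zeros((8, 8, 8), dtype=bool)
mask[2:6, 2:6, 2:6] = True                  # cubic classification mask

ccm = cc_matrix(particles, mask)
print(ccm.shape)  # (6, 6)
```

The cost grows with the square of the number of particles, which is why this step dominates the workflow for large data sets.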

Computation of PCA

Eigenvalues

The cross-correlation matrix is diagonalized, producing a set of eigenvalues which should decay to zero (the slower the decay, the more eigenvolumes will be relevant). This computation is very fast.
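A minimal sketch of the diagonalization, using NumPy's `numpy.linalg.eigh` for symmetric matrices. The matrix here is a synthetic stand-in for a real ccmatrix, and the helper `sorted_eigen` and the cumulative `explained` fraction are assumptions added to show one way of inspecting the decay.

```python
import numpy as np

def sorted_eigen(ccm):
    """Eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(ccm)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Synthetic symmetric positive semi-definite matrix standing in for a ccmatrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(10, 40))
ccm = A @ A.T

evals, evecs = sorted_eigen(ccm)

# Cumulative fraction of the total carried by the first k eigenvalues:
# a slow rise means that many eigenvolumes are relevant.
explained = np.cumsum(evals) / np.sum(evals)
```

For N particles the matrix is only NxN, so this step is cheap compared with filling the matrix in the first place.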

Eigenvolumes

An eigenvector is attached to each eigenvalue. Eigenvectors are called eigenvolumes in this context. Note that they are defined only inside the classification mask attached to the classification.
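One common way to turn the eigenvectors of an NxN particle-by-particle matrix into volumes is to form weighted sums of the particles (the "snapshot" construction). The sketch below illustrates that idea under explicit assumptions: it uses the plain Gram matrix of mean-subtracted particles rather than a normalized cross-correlation, and nothing here is taken from Dynamo's code. The name `eigenvolumes_from_particles` is hypothetical.

```python
import numpy as np

def eigenvolumes_from_particles(particles, mask, n_keep):
    """Build n_keep eigenvolumes as weighted sums of particles (snapshot construction).

    n_keep must be smaller than the rank of the matrix (at most N - 1 here).
    """
    # One row per particle, restricted to the voxels inside the mask.
    X = particles[:, mask].astype(float)
    X -= X.mean(axis=0, keepdims=True)        # remove the average particle
    C = X @ X.T                               # N x N particle-by-particle matrix
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_keep]  # keep the largest eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    # Each eigenvolume is a combination of particles weighted by an eigenvector,
    # scaled so that it has unit norm inside the mask.
    V = (X.T @ evecs) / np.sqrt(evals)        # shape (voxels_in_mask, n_keep)
    # Place the values back in the box; voxels outside the mask stay at zero.
    vols = np.zeros((n_keep,) + mask.shape)
    vols[:, mask] = V.T
    return vols

rng = np.random.default_rng(2)
particles = rng.normal(size=(6, 8, 8, 8))
mask = np.zeros((8, 8, 8), dtype=bool)
mask[2:6, 2:6, 2:6] = True

vols = eigenvolumes_from_particles(particles, mask, n_keep=3)
```

Because only masked voxels enter the computation, the resulting volumes carry no information outside the mask, which matches the note above.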

Eigencomponents

Main article: Eigentable

This is also a time-consuming step, although much less intensive than the computation of the ccmatrix: each particle is compared to each eigenvolume.
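The comparison of each particle with each eigenvolume can be sketched as a projection inside the mask. This is an illustrative NumPy example that assumes the comparison is a plain scalar product; the function `eigencomponents` and the synthetic eigenvolumes are hypothetical, and the format of Dynamo's eigentable is not reproduced here.

```python
import numpy as np

def eigencomponents(particles, eigenvolumes, mask):
    """Scalar product of every particle with every eigenvolume inside the mask.

    Returns an array of shape (N, K): row i holds the K eigencomponents of particle i.
    """
    X = particles[:, mask].astype(float)      # (N, voxels_in_mask)
    E = eigenvolumes[:, mask].astype(float)   # (K, voxels_in_mask)
    return X @ E.T

rng = np.random.default_rng(3)
mask = np.zeros((6, 6, 6), dtype=bool)
mask[1:5, 1:5, 1:5] = True
n_vox = int(mask.sum())

# Synthetic orthonormal "eigenvolumes" inside the mask (columns of Q).
Q, _ = np.linalg.qr(rng.normal(size=(n_vox, 4)))
eigvols = np.zeros((4,) + mask.shape)
eigvols[:, mask] = Q.T

# Synthetic particles built as known combinations of those eigenvolumes.
coeffs = rng.normal(size=(10, 4))
particles = np.zeros((10,) + mask.shape)
particles[:, mask] = coeffs @ Q.T

comps = eigencomponents(particles, eigvols, mask)
```

With N particles and K eigenvolumes this takes N x K comparisons instead of N x N, which is consistent with the step being much lighter than the ccmatrix computation when K is around 20.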