Multivariate Statistical Analysis (MSA)
There are essentially only two steps here:
- Dimension-reduction -- expression of a set of mxn
images using only a few terms of an expansion into eigenvectors,
or factors. This expansion results from an analysis of the
interimage variability of the entire image set. A low-dimensional
space spanned by only a few factors is often sufficient to represent
each image of the set.
- Classification of the images in the low-dimensional factor space.
For more details, consult pp. 145-192 in Frank, Oxford University Press
(2006).
Note that for the purpose of classification, the dimension-reduction step is optional.
In principle, one could classify the raw images (which is what SPIDER operation
'AP CM' does). Dimension-reduction has two purposes:
(1) it greatly reduces the amount of data that needs to be analyzed, and (2) it
removes a large amount of noise, or information without any systematic trend among
the images.
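As a rough illustration of the dimension-reduction idea (not the SPIDER implementation), the sketch below runs a tiny PCA in pure Python. It assumes each image is a flat list of pixel values. The function names (center, covariance, leading_factors, project) are hypothetical, and power iteration is used only to keep the example free of external libraries; a numerical library would normally do this step.

```python
import math
import random


def center(images):
    """Subtract the mean image from every image (each image is a flat pixel list)."""
    n = len(images)
    mean = [sum(col) / n for col in zip(*images)]
    return [[p - m for p, m in zip(img, mean)] for img in images]


def covariance(centered):
    """Pixel-by-pixel covariance matrix of the centered image set."""
    n = len(centered)
    npix = len(centered[0])
    return [[sum(img[i] * img[j] for img in centered) / (n - 1)
             for j in range(npix)] for i in range(npix)]


def leading_factors(cov, n_factors, iters=200, seed=0):
    """Leading (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    rng = random.Random(seed)
    npix = len(cov)
    work = [row[:] for row in cov]  # deflate a copy, not the caller's matrix
    factors = []
    for _ in range(n_factors):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(npix)]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        for _ in range(iters):
            prod = [sum(c * x for c, x in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in prod))
            if norm < 1e-12:  # no variance left to extract
                break
            vec = [x / norm for x in prod]
        # Rayleigh quotient of the (unit-length) vector gives its eigenvalue.
        eigval = sum(vec[i] * work[i][j] * vec[j]
                     for i in range(npix) for j in range(npix))
        factors.append((eigval, vec))
        # Remove the factor just found so the next pass finds the following one.
        for i in range(npix):
            for j in range(npix):
                work[i][j] -= eigval * vec[i] * vec[j]
    return factors


def project(centered, factors):
    """Coordinates of every image in the low-dimensional factor space."""
    return [[sum(p * e for p, e in zip(img, vec)) for _, vec in factors]
            for img in centered]


# Toy data: five 4-pixel "images" that differ only along one pattern.
pattern = [0.5, -0.5, 0.5, -0.5]
images = [[10 + a * x for x in pattern] for a in (-2, -1, 0, 1, 2)]
centered = center(images)
factors = leading_factors(covariance(centered), n_factors=1)
coords = project(centered, factors)  # one number per image instead of four
```

In this toy set a single factor carries all of the interimage variability, so each 4-pixel image is described by one coordinate. With real particles, the eigenvalues returned alongside the eigenvectors play the role of the eigenvalue histogram used to decide how many factors to keep.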
The example given below uses correspondence analysis for the
dimension-reduction. A similar method is principal component
analysis (PCA); to run PCA, one needs to change an option
under SPIDER operation 'CA S'
in the batch file ca-pca.spi.
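The two methods differ mainly in how the data are scaled before the eigenvectors are computed. The sketch below contrasts the two preprocessing steps in pure Python. It follows the textbook definitions, not SPIDER's code, and the function names and the additive_constant argument are hypothetical. It also assumes that correspondence analysis needs non-negative data, which would explain why an additive constant appears only in that mode.

```python
import math


def ca_residuals(images, additive_constant=0.0):
    """Chi-square standardized residuals, the matrix that correspondence
    analysis decomposes. Assumes every row and column sum is positive."""
    data = [[p + additive_constant for p in img] for img in images]
    if any(p < 0 for img in data for p in img):
        raise ValueError("correspondence analysis needs non-negative data")
    total = sum(sum(img) for img in data)
    row_mass = [sum(img) / total for img in data]            # one per image
    col_mass = [sum(col) / total for col in zip(*data)]      # one per pixel
    return [[(p / total - r * c) / math.sqrt(r * c)
             for p, c in zip(img, col_mass)]
            for img, r in zip(data, row_mass)]


def pca_residuals(images):
    """Mean-subtracted data, the matrix that PCA decomposes."""
    n = len(images)
    mean = [sum(col) / n for col in zip(*images)]
    return [[p - m for p, m in zip(img, mean)] for img in images]


# Two images that differ only by an overall intensity factor.
images = [[1.0, 2.0], [2.0, 4.0]]
ca = ca_residuals(images)    # all residuals vanish: same profile, different scale
pca = pca_residuals(images)  # non-zero: PCA still sees the intensity difference
```

The toy input shows one practical consequence: under the chi-square scaling, images that differ only in overall intensity look identical, whereas plain mean subtraction keeps that difference.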
There are three methods for classification presented
here: Diday's method, Ward's method, and K-means.
Use of the individual SPIDER operations is described in more depth
here.
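To make the hierarchical idea behind Ward's method concrete, here is a naive pure-Python sketch. It assumes each particle is represented by its factor coordinates, and it stops merging once the requested number of classes remains, which corresponds to cutting the dendrogram at that level. The function name ward_classes is hypothetical, and this is not the code behind the SPIDER batch files.

```python
def ward_classes(points, n_classes):
    """Agglomerative clustering with Ward's criterion.

    Returns (classes, merge_costs): the member indices of each class and
    the cost of every merge performed, in order (the dendrogram heights).
    """
    clusters = [{"members": [i], "centroid": list(p)}
                for i, p in enumerate(points)]
    merge_costs = []
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                dist2 = sum((x - y) ** 2
                            for x, y in zip(ca["centroid"], cb["centroid"]))
                # Ward: increase in within-class variance caused by the merge.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        merge_costs.append(cost)
    return [sorted(c["members"]) for c in clusters], merge_costs


# Five particles in a 2-factor space: two tight pairs and one outlier.
coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
classes, merge_costs = ward_classes(coords, n_classes=3)
```

Because merging starts from individual particles, any number of classes between 1 and the number of particles can be requested. The pairwise search is quadratic per merge, so this sketch is only practical for small sets.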
- Dimension-reduction
- BATCH FILE: ca-pca.spi
- uses Python script eigendoc.py (J.S.L.) and Gnuplot script
ploteigen.gnu (J.S.L. & T.R.S.)
- INPUT PARAMETER: number of eigenfactors to calculate
- INPUT: particles
- OUTPUTS: eigenvalue histogram (example),
eigenimages, factormaps (example)
- To switch to PCA, change the option in ca-pca.spi under the operation 'CA S'
from 'C' to 'P', and remove the line referring to the additive constant.
- After running, examine the eigenimages and decide which ones to use.
Typically all but the first few are noisy.
- Classification -- choose one of three options:
- Diday's method -- I hear that this method works exceedingly well.
In practice, though, I find that I have limited control over the number
of classes, which may or may not be a problem depending on the application.
- BATCH FILE: cluster.spi
- INPUT PARAMETER: number of eigenfactors to use
- OUTPUT: dendrograms, in PostScript (example) and SPIDER formats
- After running, decide how many classes to use. The SPIDER-format
dendrogram document can be viewed with WEB (Commands
-> Dendrogram).
- BATCH FILE: classavg.spi
- INPUT PARAMETER: desired number of classes
- OUTPUT: class averages
- Ward's method -- The advantage is that, unlike Diday's method above, the
dendrogram can be cut at any number of classes, all the way down to
individual particles. The disadvantage is that the dendrogram is unreadable if
there are too many branches. You can truncate the dendrogram in WEB as
described below.
- BATCH FILE: hierarchical.spi
- INPUT PARAMETER: number of eigenfactors to use
- OUTPUT: dendrograms, PostScript and SPIDER formats
- After running, decide how many classes to use. The PostScript
file may be highly branched, and nodes may be unreadable
(example). The SPIDER-format
dendrogram document can be viewed and truncated with WEB (Commands
-> Dendrogram; example).
- BATCH FILE: classavg.spi
- INPUT PARAMETER: desired number of classes
- OUTPUT: class averages
- K-means classification -- The primary input is the number of classes
to divide the particles into.
- BATCH FILE: kmeans.spi
- INPUT PARAMETERS: number of eigenfactors, number of classes
- OUTPUT: class averages
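The two K-means input parameters listed above can be illustrated with a minimal pure-Python sketch. It assumes each particle is given as a list of factor coordinates. The function name kmeans and its arguments are hypothetical, and the random choice of starting centers is only one of several possible initializations, not necessarily the one SPIDER uses.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(coords, n_factors, n_classes, iters=100, seed=0):
    """Assign each particle to one of n_classes using its first n_factors
    factor coordinates. Returns (labels, centers)."""
    points = [list(c[:n_factors]) for c in coords]
    rng = random.Random(seed)
    # Start from randomly chosen particles as class centers.
    centers = [p[:] for p in rng.sample(points, n_classes)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(n_classes), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(n_classes):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a class becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Six particles with three factor coordinates each; only the first two are used.
coords = [
    (0.0, 0.0, 9.0), (0.0, 1.0, -3.0), (1.0, 0.0, 4.0),
    (10.0, 10.0, 0.5), (10.0, 11.0, 7.0), (11.0, 10.0, -8.0),
]
labels, centers = kmeans(coords, n_factors=2, n_classes=2)
```

Truncating to n_factors is where the earlier inspection of the eigenimages pays off: coordinates along noisy factors (the third column in the toy data) are simply left out of the distance calculation.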
There is a Python utility,
classavg.py,
that displays the constituent individual particles
when you click on a class average.
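The relationship between a class average and its constituent particles can be sketched as follows. This is not the code of classavg.py, only a pure-Python illustration of the bookkeeping, assuming each particle is a flat list of pixels and each has a class label; the function names are hypothetical.

```python
from collections import defaultdict


def class_members(labels):
    """Map each class label to the indices of the particles assigned to it."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    return dict(members)


def class_averages(images, labels):
    """Pixel-wise mean of the particles in each class."""
    averages = {}
    for label, indices in class_members(labels).items():
        stack = [images[i] for i in indices]
        averages[label] = [sum(pixels) / len(stack) for pixels in zip(*stack)]
    return averages


# Three 2-pixel particles; the first two belong to class 0.
images = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]
members = class_members(labels)            # {0: [0, 1], 1: [2]}
averages = class_averages(images, labels)  # {0: [1.0, 3.0], 1: [10.0, 10.0]}
```

Looking up members[label] for a given class is the lookup a viewer would perform to show the particles behind one average.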
Source: techs/MSA/index.htm
Page updated: 01/20/05
Tanvir Shaikh