Multivariate Data Analysis
Previously known as multivariate statistical analysis
There are essentially only four steps here:
- Low-pass filtration
- Alignment in two dimensions
- Dimension-reduction -- expression of an m x n image
using only a few terms, i.e., eigenvectors
- Classification
The low-pass filtration is optional, but if you plan to look at individual
particles, this step will help.
For the classification below to be sensible, the images will need to
have been aligned. The alignment step here is optional if the images
have been aligned already.
Even the dimension-reduction step is optional, in theory:
one could classify the raw images (which is what SPIDER operation
'AP C'
does). As an example here, I'm using correspondence analysis for
the dimension-reduction. A similar method is principal-component
analysis (PCA); to run PCA, one needs to change an option under SPIDER
operation 'CA S'
(here, in the batch file ca-pca.spi).
For classification, there are three methods illustrated here: Diday's
method, Ward's method, and K-means. The individual classification
operations are described in more depth in the
classification tutorial.
Getting started
- All files are located under: techs/MSA
(Modified June 2009).
Procedure
- Low-pass filtration
- BATCH FILE: filtershrink.spi
- INPUT PARAMETERS: filter radii, decimation factor
- INPUTS: selection file, unfiltered particles
If you don't have a selection file, run
mkfilenums.py (W.T.B. & T.R.S.),
substituting the appropriate filenames:
mkfilenums.py listparticles.dat win/ser*.dat
- OUTPUTS: filtered particles
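To give a rough idea of what filtering and shrinking do to a particle image, here is a minimal NumPy sketch. It is not what filtershrink.spi does internally: the Gaussian filter shape, the block-averaging decimation, and the names `lowpass_and_shrink`, `cutoff`, and `factor` are all assumptions made for illustration.

```python
import numpy as np

def lowpass_and_shrink(image, cutoff=0.1, factor=2):
    """Gaussian low-pass filter in Fourier space, then decimate.

    cutoff is in cycles/pixel (Nyquist is 0.5); factor is the
    decimation factor. Both names are hypothetical, not SPIDER's.
    """
    ny, nx = image.shape
    # Spatial frequency of each Fourier pixel, in cycles/pixel
    fy = np.fft.fftfreq(ny)[:, None]
    fx = np.fft.fftfreq(nx)[None, :]
    gauss = np.exp(-(fx**2 + fy**2) / (2.0 * cutoff**2))
    filtered = np.fft.ifft2(np.fft.fft2(image) * gauss).real

    # Crop to a multiple of the decimation factor, then average blocks
    ny2, nx2 = (ny // factor) * factor, (nx // factor) * factor
    blocks = filtered[:ny2, :nx2].reshape(ny2 // factor, factor,
                                          nx2 // factor, factor)
    return blocks.mean(axis=(1, 3))
```

Note that the object diameter requested by the alignment batch files is in pixels after decimation, so a 2-fold shrink halves it.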
- Reference-free alignment -- choose one of these two options:
- Using 'AP SR'
- BATCH FILE: apsr4class.spi
- INPUT PARAMETER: object diameter (pixels, after decimation)
- INPUTS: unaligned particles, selection file
- OUTPUTS: aligned particles, averages
- There may be a memory limit in
'AP SR'.
If you get a core dump, truncate the selection file and try again.
- Using pairwise alignment
- BATCH FILE: pairwise.spi
- INPUT PARAMETER: object diameter (pixels, after decimation)
- INPUTS: unaligned particles, selection file
- OUTPUTS: aligned particles, averages
- Conceptually, this alignment first aligns pairs of images and averages them.
Then, it aligns pairs of those pairs and averages them, and so forth.
This type of alignment appears to be less random than
'AP SR',
which chooses seed images as alignment references.
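The pairwise scheme described above can be sketched in a few lines of NumPy. This is only an illustration of the pairing-and-averaging idea, not the contents of pairwise.spi: it assumes translation-only alignment with circular shifts (the real alignment also handles rotation), and the function names `align_shift` and `pairwise_average` are hypothetical.

```python
import numpy as np

def align_shift(ref, img):
    """Circularly shift img to best match ref (translation only)."""
    # Cross-correlation via FFT; the peak position gives the shift
    cc = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(img))).real
    dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
    return np.roll(img, (dy, dx), axis=(0, 1))

def pairwise_average(images):
    """Align and average pairs, then pairs of pairs, and so forth."""
    level = list(images)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            aligned = align_shift(level[i], level[i + 1])
            next_level.append(0.5 * (level[i] + aligned))
        if len(level) % 2:
            # An odd image out is carried up to the next level unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

Because no seed image is picked at random, the result depends only on the order of the input images, which is one way to read the "less random" remark above.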
- Dimension-reduction
- BATCH FILE: ca-pca.spi
Uses Python script
eigendoc.py
(J.S.L.) and Gnuplot script ploteigen.gnu
(J.S.L. & T.R.S.)
- INPUT PARAMETER: number of eigenfactors to calculate
(more than 99 will require some user modification)
- INPUT: aligned particles
- OUTPUTS: eigenvalue histogram,
eigenimages,
factormaps, &
reconstituted images
- To switch to PCA (or iterative PCA), change the option in
ca-pca.spi
under the operation 'CA S' from 'C' to 'P' (or 'I', respectively), and remove the
line referring to the additive constant.
- After running, examine the eigenimages and decide which ones to use.
Typically, all but the first few are noisy. If even the last eigenimages
still show signal, increase the number of eigenfactors to calculate, and
re-run this batch file.
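For readers who want to see what eigenvalues, eigenimages, factor coordinates, and reconstituted images are numerically, here is a minimal sketch. It uses plain PCA via a singular-value decomposition in NumPy, not correspondence analysis and not SPIDER's 'CA S'; the function names `pca_images` and `reconstitute` are hypothetical.

```python
import numpy as np

def pca_images(stack, n_factors):
    """PCA of an image stack with shape (n_images, height, width)."""
    n, h, w = stack.shape
    data = stack.reshape(n, h * w).astype(float)
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the eigenvectors of the pixel covariance matrix
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s**2 / (n - 1)
    eigenimages = vt[:n_factors].reshape(n_factors, h, w)
    # Coordinates of each image along each factor (the factor map axes)
    coords = centered @ vt[:n_factors].T
    return mean.reshape(h, w), eigenvalues[:n_factors], eigenimages, coords

def reconstitute(mean, eigenimages, coords):
    """Rebuild images from the mean plus a few weighted eigenimages."""
    k, h, w = eigenimages.shape
    flat = coords @ eigenimages.reshape(k, h * w)
    return mean + flat.reshape(-1, h, w)
```

Plotting the eigenvalues in decreasing order gives the analogue of the eigenvalue histogram, and plotting one column of `coords` against another gives a factor map.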
- Classification -- choose one of three options:
- 1. Diday's method, using
'CL CLA' -- I hear that this method works exceedingly well. In
practice, though, I find that I have limited control over the number of
classes, which may or may not be a problem depending on the
application. I also sometimes get errors with this method on large
data sets.
- BATCH FILE: cluster.spi
- INPUT PARAMETER: number of eigenfactors to use
- OUTPUT: dendrograms
(PostScript
and SPIDER formats)
- After running, decide how many classes to include, using
WEB/
JWEB
(Commands -> Dendrogram) and clicking on
Show averaged images.
- BATCH FILE: classavg.spi
- INPUT PARAMETER: desired number of classes
- OUTPUT: class averages
- 2. Ward's method, using
'CL HC' --
The pro is that, unlike Diday's method above, the
dendrogram branches to any arbitrary number of classes, down in size to
individual particles. The con is that the dendrogram becomes unreadable if
there are too many branches. You can truncate the dendrogram in
WEB/JWEB as described below.
- BATCH FILE: hierarchical.spi
- INPUT PARAMETER: number of eigenfactors to use
- OUTPUT: dendrograms (PostScript and
SPIDER formats)
- After running, decide how many classes to use.
The PostScript file may be highly branched, and nodes may be
unreadable.
- The SPIDER-format dendrogram document can be viewed
with WEB/JWEB and
truncated. In WEB,
go to Commands -> Dendrogram
(example).
In JWEB,
go to File -> Open SPIDER Document File.
- BATCH FILE: classavg.spi
- INPUT PARAMETER: desired number of classes
- OUTPUT: class averages
- 3. K-means classification, using
'CL KM' -- The primary input is the number of classes
to divide the particles into.
- BATCH FILE: kmeans.spi
- INPUT PARAMETERS: number of eigenfactors, number of classes
- OUTPUT: class averages
- It can be informative to look at the individual particles from a class.
You can use
WEB/
JWEB, or
montagefromdoc.py.
Usage:
./montagefromdoc.py KM/docclass001.dat
If you have requested too many classes, there will be
similar-looking class averages.
If you have requested too few, there will be dissimilar
particles within a class.
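To illustrate what K-means does with the factor coordinates, here is a minimal NumPy sketch of the standard iteration (assign each particle to the nearest center, then recompute the centers). It is not the implementation behind 'CL KM'; the function name `kmeans`, the random seeding of the initial centers, and the parameter names are assumptions.

```python
import numpy as np

def kmeans(coords, n_classes, n_iter=50, seed=0):
    """Cluster rows of coords (n_particles, n_factors) into n_classes."""
    rng = np.random.default_rng(seed)
    # Start from n_classes distinct particles chosen at random
    centers = coords[rng.choice(len(coords), n_classes, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every particle to every center
        dist2 = ((coords[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Recompute centers; an empty class keeps its previous center
        new_centers = np.array([
            coords[labels == k].mean(axis=0) if np.any(labels == k)
            else centers[k]
            for k in range(n_classes)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Averaging the aligned particles that share a label then gives one class average per class. Since the outcome can depend on the starting centers, it may be worth trying more than one seed.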
Miscellaneous tools:
- There is a Python utility,
classavg.py,
that displays the constituent individual particles
when you click on a class average.
- Binary tree -- It is often not clear where to truncate the dendrogram.
In X-Window WEB,
one sees averages only for the terminal nodes of the dendrogram.
(In JWEB,
averaging of images is not implemented at the time of this writing,
although Bill Rice says that if the prefix is two characters long, it works.)
- BATCH FILE: binarytree.spi
- Visualize the output using tree.py. (Requires
Spiderutils.py,
part of the SPIRE
installation.) Syntax:
tree.py labeled001.dat 4 2 1024
where:
- labeled001.dat is an example filename
(without a wild card)
- 4 (optional) is the tree depth (default is 6)
- 2 (optional) is the margin width (default is 2)
- 1024 (optional) is the canvas width
- If Spiderutils.py
is not installed, try tree.spi.
The output is a SPIDER-format image. However, the file size
may be very large.
- INPUT PARAMETERS: tree depth (number of averages
will be (2**depth - 1))
- INPUTS: averages from
binarytree.spi
- OUTPUTS: SPIDER-format tree image
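The tree depth fixes the number of averages, which grows quickly and explains why the tree image can get very large. A small worked example of the 2**depth - 1 formula follows; the helper names `tree_size` and `nodes_per_level` are hypothetical and not part of tree.py or tree.spi.

```python
def tree_size(depth):
    """Total number of averages in a full binary tree of this depth."""
    return 2**depth - 1

def nodes_per_level(depth):
    """Number of averages on each level, from the root downward."""
    return [2**level for level in range(depth)]

# Depth 4 (as in the tree.py example): 1 + 2 + 4 + 8 = 15 averages
# Depth 6 (the tree.py default):       63 averages
for depth in (4, 6):
    print(depth, nodes_per_level(depth), tree_size(depth))
```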
Source: techs/MSA/index.htm
Page updated: 8/03/09
Tanvir Shaikh