| Table of Contents | SPIDER operation |
|---|---|
| Correspondence Analysis or Principal Component Analysis | CA S |
| Eigenvalues: determine which variations reflect image attributes and which reflect noise | view with Gnuplot |
| Factor Maps | CA SM |
| Clustering and Hierarchical Classification | CL HC |
| Reconstitute images from eigenvectors | CA SR |
| Create "virtual" images from eigenvectors | CA SR |
| Difference images | CA SRD |
| Dendrograms | View in WEB |
| Subgrouping images | CA SMI |
| Viewing Eigenimages | CA SRE |
| Reconstitute Arbitrary (Virtual) Images | CA SRA |
| References | |
Makefaces.bat and face.bat were used to create the eight original faces below. The faces differ in three ways: oval vs. round head, left vs. right eyes, and big vs. small mouth. Ten copies of each face with random noise created the sample data set. These procedure files create four kinds of files. Scr* files are the face templates, seen here. Sma* files are the noise-filled data set, example below. Sca* files carry the average of the ten noise images for each template, and scv* hold the variance for each template.
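The data set construction described above can be sketched in a few lines of Python. This is only an illustration, not the contents of makefaces.bat or face.bat: the 2x2 template, the Gaussian noise model, the noise level, and the use of the population variance are all assumptions.

```python
import random

def make_noisy_copies(template, copies=10, sigma=0.5, seed=0):
    """Return `copies` noisy versions of a template (a flat list of pixels).

    Gaussian noise and sigma=0.5 are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    return [[p + rng.gauss(0.0, sigma) for p in template] for _ in range(copies)]

def pixel_mean_and_variance(images):
    """Per-pixel average and population variance across a list of images."""
    n = len(images)
    columns = list(zip(*images))
    mean = [sum(col) / n for col in columns]
    var = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, mean)]
    return mean, var

template = [0.0, 1.0, 1.0, 0.0]                      # hypothetical tiny "face" template
noisy = make_noisy_copies(template)                  # plays the role of the sma* files
average, variance = pixel_mean_and_variance(noisy)   # roles of the sca* and scv* files
```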

CA is the preferred method of
finding variations, and we will principally be discussing inter-image
variance. PCA computes the distance between data vectors with Euclidean
distances, while CA uses chi-squared distances. The chi-squared distance is superior because
it ignores differences in exposure between images, eliminating the need
to rescale between images.
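The exposure argument can be checked numerically. The sketch below uses the textbook chi-squared distance between row profiles (each image divided by its own sum, each pixel weighted by the inverse of its column mass). It is not SPIDER code, the three four-pixel "images" are made up, and the data are assumed to be non-negative.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two images (flat pixel lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chi2_distance(images, i, k):
    """Chi-squared distance between images i and k of a non-negative data set."""
    total = sum(sum(img) for img in images)
    col_mass = [sum(col) / total for col in zip(*images)]   # weight of each pixel
    profile_i = [v / sum(images[i]) for v in images[i]]     # image i scaled to sum 1
    profile_k = [v / sum(images[k]) for v in images[k]]
    return math.sqrt(sum((a - b) ** 2 / c
                         for a, b, c in zip(profile_i, profile_k, col_mass)))

base = [1.0, 2.0, 3.0, 4.0]
brighter = [3.0 * v for v in base]     # same image at three times the exposure
other = [4.0, 3.0, 2.0, 1.0]           # a genuinely different image
images = [base, brighter, other]

d_euclid = euclidean(base, brighter)        # large: Euclidean sees the exposure change
d_chi2_same = chi2_distance(images, 0, 1)   # zero up to rounding: profiles are identical
d_chi2_diff = chi2_distance(images, 0, 2)   # positive: the images really differ
```

Because each image is divided by its own total before comparison, a uniform change in exposure drops out, which is why no rescaling step is needed.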
Cas.bat is
a procedure file that runs the CA S command. The procedure file assumes you
prefer CA and creates a user-defined circular mask. Cas.bat also
creates eigendoc.dat.
CA S Hints
Determining Useful Eigenvalues
Display the eigenvalues as a histogram to see which ones contain useful information. The histogram below shows most information is contained in the first three eigenvalues, indicating they represent most of the inter-image variance. The last five eigenvalues are small and level, which is typical of noise.
To create an eigenvalue histogram:
    python eigenhist.py xxx_EIG.ext [output_docfile] > gnuplot_file
xxx_EIG.ext is the output from CA S.
output_docfile is a doc file of eigenvalues as percentages (this argument is optional; the default value is eigenhist.doc).
gnuplot_file is a file of gnuplot commands for displaying the histogram.
    gnuplot -persist gnuplot_file
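The percentage conversion behind such a histogram is simple arithmetic. The sketch below is not eigenhist.py and does not read the _EIG file format; the eigenvalues are made-up numbers chosen to mimic three strong factors followed by a flat noise floor.

```python
def eigen_percentages(eigenvalues):
    """Express each eigenvalue as a percentage of the total."""
    total = sum(eigenvalues)
    return [100.0 * v / total for v in eigenvalues]

def text_histogram(percentages, width=40):
    """Render one bar of '#' characters per eigenvalue, scaled to the largest."""
    top = max(percentages)
    lines = []
    for n, p in enumerate(percentages, start=1):
        bar = "#" * round(width * p / top)
        lines.append(f"{n:2d} {p:6.2f}% {bar}")
    return "\n".join(lines)

# Hypothetical eigenvalues, not taken from a real CA S run.
eigenvalues = [0.9, 0.6, 0.3, 0.05, 0.04, 0.04, 0.04, 0.03]
pct = eigen_percentages(eigenvalues)
print(text_histogram(pct))
```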
Creating Factor Maps to View Groupings, If Any
Once we know which eigenvectors carry meaning and which are noise, we can see whether there are any clear groupings. CA SM is the operation used to view 2D factor maps (graphs) of selected factors. CA S MUST be run before these maps can be viewed.
CA SM Hints
The three factor maps below are of images and were created using Gnuplot. PostScript files are similar.
Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3
CA SM also works for pixel factor maps. A common
problem when creating a pixel factor map is the
"*** ERROR: *** MAP ABORTED, MORE THAN 264 POINTS ON FRAME" error message.
This is caused by the large number of points. It can be fixed by increasing
the value of ".NUMBER OF SD OR
Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3
CL CLA is limited in that it only uses Diday's and Ward's methods. CL HC is more robust because the user controls the clustering criterion and can alter the "weights" for each factor. To best represent the "truth", set all factor weights equal to each other. CL HC has been tested only with _IMC files produced with CORAN analysis.
CLHC Help
Clhc.bat is a procedure file used to automate running CL HC.
CLHCdendro02.gif and
CLHCdendro05.gif
are IRIX snapshots of the PostScript dendrogram of the face data using
clustering option 2 (complete linkage) and option 5 (Ward's method),
respectively.
Clhcdoc02.dat and
Clhcdoc05.dat are
the dendrogram .doc files associated with the above dendrograms.
Clhcweb2.gif is a screenshot of
WEB displaying the complete linkage dendrogram. Note that all of the
images are properly separated by their head shape and eye direction.
Mouth size is not so clear because its eigenvalue is close to the
eigenvalue for noise.
CL HD
is an operation to be used with CL HC. It determines how many classes there are and
how many images are in each class for a given threshold. It is similar to viewing
the dendrogram in WEB, setting a threshold, and recording the number of average images
and the count shown above each one.
Clhd02.dat is the result of a CL HD run with a threshold of 0.15,
using clhcdoc02.dat as input. It corresponds exactly with the WEB
screenshot above.
CL HE
is another operation to be used after CL HC. Its purpose is to create lists of the
images that are in each class for a given threshold. To recreate the average
image of a given class at a certain threshold, this operation will output which
images need to be averaged together.
Clhe201.dat ,
Clhe202.dat ,
Clhe203.dat ,
Clhe204.dat
are output files from a CL HE run with a threshold of 0.5.
The images are grouped perfectly according to head shape and eye direction.
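What CL HD and CL HE report can be illustrated by cutting a small dendrogram at a threshold. The sketch below is not SPIDER code and does not read the dendrogram doc file format: the merge list is made up, and treating a merge as accepted when its height is at or below the threshold is an assumption.

```python
def classes_at_threshold(n_items, merges, threshold):
    """Return the classes (lists of item indices) that exist at `threshold`.

    `merges` is a list of (height, i, j) tuples, each joining the clusters
    that currently contain items i and j. This layout is hypothetical.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the representative of x's cluster.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for height, i, j in merges:
        if height <= threshold:          # only merges below the cut are applied
            parent[find(i)] = find(j)

    classes = {}
    for item in range(n_items):
        classes.setdefault(find(item), []).append(item)
    return sorted(classes.values())

# Six hypothetical images: three tight pairs, joined at much greater heights.
merges = [(0.05, 0, 1), (0.06, 2, 3), (0.10, 4, 5), (0.40, 0, 2), (0.90, 0, 4)]

low_cut = classes_at_threshold(6, merges, 0.15)    # membership lists, as from CL HE
high_cut = classes_at_threshold(6, merges, 0.5)
sizes_low = [len(c) for c in low_cut]              # class sizes, as from CL HD
```

Raising the threshold can only merge classes, never split them, which is why the class count drops as the cut moves up the dendrogram.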
With the "useful" eigenvectors known, we can determine the representative clusters much more efficiently. This also allows compression of information with minimal loss. Some output of CL CLA can also work with CA SM to produce more meaningful maps.
CL CLA differs from the other clustering operations in that for clustering it uses Diday's method, and for HAC it uses only Ward's criterion. A disadvantage of the K-means method is that the final grouping is very dependent on which seeds are initially chosen. Diday overcame this by applying the K-means technique multiple times with different seeds, then cross-tabulating the results and using only the clusters that were repeatedly formed. Ward's criterion states that merging HAC clusters should be focused on minimizing the added intraclass (within-cluster) variance. The two clusters that differ the least from each other will be merged to create a new group, one "level" higher.
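Ward's criterion can be made concrete with the standard merge-cost formula: joining clusters A and B raises the within-cluster sum of squares by n_A n_B / (n_A + n_B) times the squared distance between their centroids. The sketch below checks that identity on made-up 2D points. It illustrates the criterion only and is not the CL CLA implementation.

```python
def centroid(points):
    """Mean position of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

# Hypothetical clusters in a two-factor space.
a = [[0.0, 0.0], [0.0, 2.0]]
b = [[4.0, 0.0], [4.0, 2.0]]

cost = ward_cost(a, b)
direct = within_ss(a + b) - within_ss(a) - within_ss(b)   # same quantity, computed directly
```

At each level of the hierarchy, the pair of clusters with the smallest such cost is the pair that gets merged.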
CL CLA Hints
Dendro.gif is a .gif conversion of a PostScript dendrogram output. A vertical line meeting a horizontal line signifies the joining of the two groups below it. A representative reconstruction can be formed for each group with a CA SR command. Longer vertical bars signify a greater difference between groups. The many small differences at the bottom can be eliminated by increasing the "% cutoff". The .gif file was obtained by using the IRIX snapshot command.
The results.bat.* files that are formed after every SPIDER run also hold a large amount of information after a CL CLA run. The useful information begins after line 93 "OPENED (SU):"
K-means is a method of clustering that divides the data into a user-defined number of groups. Two random images ("seeds") are chosen, and their centers of gravity are computed. A partition is drawn down the middle between the centers, the new centers of gravity are computed, and the process is repeated a given number of times. The final result is VERY dependent on which image seeds are chosen first.
Because our faces data set is manufactured, we know exactly which images are identical apart from the random noise, and we know the exact number of groups. The output discussed was obtained with 8 classes, using factors 1-3, and an even factor weight of 1.0 for each of those three factors.
IMC453doc.dat is the summary file of a run of CL KM with the above input values and a random seed number of 453. The third column gives the image number and the fourth column is the class in which CL KM placed the image. Images 1-10 were all placed in group 6, which is what we expect because they are all noisy images of the same protoimage. CL KM mostly kept the images from the same protoimage grouped together, except for the last ten images. However, it placed images 11-20 and 31-40 in the same group instead of giving each set its own group. The average images for images 11-20 and 31-40 are shown below. They differ by their mouth size.

IMC789doc.dat is the file from a CL KM run with exactly the same inputs as the previous run, except with a random number of 789. Once again, most images were placed into the correct protoimage group, except for the last few images. But in addition to images 11-20 and 31-40 being grouped together, images 1-10 and 21-30 were placed in the same group as well. This clearly demonstrates that K-means is highly dependent on the first image chosen, and should be used with extreme caution. Below are the average images for 1-10 and 21-30.
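The seed dependence seen in the two CL KM runs can be reproduced on a toy example. The sketch below is a minimal K-means written for illustration, not the CL KM implementation. The four 2D points are made up, and the starting seeds are passed in explicitly instead of being drawn from a random number so that the outcome is reproducible.

```python
def kmeans(points, seeds, iterations=20):
    """Minimal K-means: assign each point to its nearest center, then recompute centers."""
    centers = [list(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:   # keep the old center if a class ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two tight pairs, far apart horizontally (hypothetical factor coordinates).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

labels_a = kmeans(points, seeds=[points[0], points[2]])   # one seed from each pair
labels_b = kmeans(points, seeds=[points[0], points[1]])   # both seeds from one pair
```

With one seed in each pair the natural left/right grouping is found. With both seeds in the same pair the run settles on a top/bottom split and never recovers, even though the data are unchanged.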

SEQ453doc.dat and PIX453doc.dat are outputs of CL KM run on the same data as above, but using the SEQ and PIX files, respectively. The PIX453doc.dat file is 95Kb in size. The results for the SEQ run should be the same as those of the previous runs, because it is still comparing images. However, the PIX results are expected to be different because CL KM is trying to place the PIXELS in eight different classes.
CL KM Hints
Clkm.bat is a procedure file that automates CL KM. It also places all outputs in a folder named with the random number used.
Re-creating sample images from eigenvectors can eliminate the noise in the reconstituted image, and it also results in large data compression. Below is an image from our sample data set and four separate reconstructions from the first four eigenvectors. The "halo" of noise and the dark corners are a result of the masking function in the procedure files used to create this data (face.bat and makefaces.bat).

It is difficult to see what traits the sample image has because of noise. With the first three single-eigenvector reconstructions we can see that it has an oval head, eyes looking left, and a small mouth. The fourth eigenvector does not "add" anything to our knowledge of the image because there are only three attributes that carry information. Therefore the fourth carries only noise.
Below is the sample image again, and a reconstruction from the first three eigenvectors combined. The image is from the seventh prototype image shown in the source data area.

Below we have the same "protoimage" images used to create the source data, a sample image from each protoimage, and re-creations of the noisy samples using each eigenvector. The first row is the protoimages used to create the 80 sample images, and the second row consists of a sample image created from the protoimage above it. The third row is each sample image re-created with only the first eigenvector used. The next row includes the second eigenvector, and the last includes all relevant eigenvectors.

CA SRD can create images or pixels that were not actually captured,
using eigenimages. To do this, the eigenvalues must be given. This can
be useful to interpolate images in between groups.
To choose what values to use for eigenvalues, use the factor maps. If you try to use a value outside the range
of values for an eigenvalue, the results will be difficult to predict
and interpret.
The images for eigenvector 1 equal to -0.1 and 0.1 are shown in the first two images. They "make sense" in that this vector controls head shape only. These values were chosen because they are the most extreme values shown on the factor maps. But the images for eigenvector 1 equal to -1 and 1, the last two images, do not represent "extreme" roundness as one might think. This is because these values are outside the range in which the data actually exist, i.e. the range shown on the factor maps.

If it is useful to see the different eigenimages that are used to recreate images, use CA SRD.

The first image is the average among all 80 sample images. The next
three are the difference images for each useful eigenvector for one
image. The dark slivers in the first eigenvector image show that in
order to obtain the correct head shape for this image, we must add
black to either side of the face. Similarly, for the eyes we would
make the left side of each eye socket brighter, and the right side
darker. The fourth image shows that this image has a wide mouth.
Below are the separate difference eigenimages for another
image, as well as the combination of all three. This image has the
opposite of every trait of the image represented above. We can tell
because of white slivers on a dark background, right-hand side of the
eyes light, etc. The last image below is the composite of the first
three eigenimages.

Below is the average of all the images as well as the composite difference eigenimage from above. When we add (superimpose) these together we get the third image, a re-creation of the original sample image. However, the re-creation has no noise. This is because we only used the first three eigenimages. If we included the other five, then we would have re-created the sample image, with noise.
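The "average plus difference eigenimages" arithmetic can be written out directly. The sketch below is PCA-style and illustrative only: it assumes orthonormal eigenimages and ignores the chi-squared weighting that a real CA reconstitution involves, and the four-pixel average, eigenimages, and sample image are all made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstitute(image, average, eigenimages, n_factors):
    """Rebuild `image` from the average plus its first `n_factors` difference images."""
    centered = [p - m for p, m in zip(image, average)]
    result = list(average)
    for eigenimage in eigenimages[:n_factors]:
        coord = dot(centered, eigenimage)   # coordinate of the image on this factor
        result = [r + coord * e for r, e in zip(result, eigenimage)]
    return result

# Hypothetical orthonormal eigenimages for four-pixel images.
average = [1.0, 1.0, 1.0, 1.0]
eigenimages = [
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
# A sample dominated by factor 1, with a small "noise" contribution on factor 3.
image = [a + 2.0 * u + 0.1 * w
         for a, u, w in zip(average, eigenimages[0], eigenimages[2])]

clean = reconstitute(image, average, eigenimages, n_factors=1)
full = reconstitute(image, average, eigenimages, n_factors=3)
```

Here `clean` keeps only the strong factor and so drops the small contribution, while `full` uses every factor and returns the original sample, noise included.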

WEB can be used to view a dendrogram of the data instead of using CL CLA. It should also work with the output of CL HC. WEB will display not only the usual dendrogram, but also the average of all the images below a threshold, and the number of images in each average.
How To Use The WEB Dendrogram
Webdendro.bat is a procedure file that
automates the running of CL CLA so that a new dendrogram document file is the
only thing created. It is used to change the lower threshold.
Webdendoc.dat is an example of
a dendrogram document file.
Numbers.bat is a procedure file to change the
extensions of a whole series of images. This should not be a problem if you follow
number three above.
In WEB you view a factor map and select which images are similar using a "lasso" interface. Then the average of that group is computed and can be viewed or stored as a new image. Also, a SPIDER document file can be written listing which group each image was placed in. I believe that only images can be used, not pixels.
How To Use The WEB Factor Map
Sdcout.dat is the result of running
SD C on a _IMC file. It lists the image number and factor coordinates
of each image.
Corrmap.gif is a
screenshot of a corr-map run. Please note the placement of the average
images. The upper-right image is overlapped with its respective mask.
We can also see which two traits were being compared in this factor map:
head shape and eye direction.
Docimg001.dat is the file created with
the "save images in Doc. file" option. It lists the image number, X and
Y coordinates, and the order in which its group was formed. This
particular group was the third formed, but the first saved as an image document
list.
The primary purpose of CA SMI is to separate a series of images into active/inactive groups. It appears that this can actually perform operations on a series of images using the CA S files from another series. This has not been tested. If the images used in CA SMI will be used with their own CA S run, it is a good idea to create a PostScript map before running CA SMI for comparison later. The online manual page is straightforward, with a minor change.
CA SMI Hints
Casmi.bat is a procedure file that
automates the command. It is correct as of 5/11/04.
Below are the CA SM maps with no CA SMI input, and below those, maps with CA SMI
input. Note the labeled images are the same in all three, but the axes
switch with an odd-numbered factor.
5/13/04 The axis switch is caused by using a non-transposed data set. In order to have CA SMI run on this data, I forced CA S not to transpose the data, with the CN entry. Because of this, CA SM reads in the images differently from the transposed data. Be sure to take note of the axes if using CA SMI -> CA SM.
Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3
Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3
The purpose of CA SRE is to easily create eigenimages from CA S outputs. With eigenimages, the user can easily see which factors the computer has determined for classifying the images. This has been tested with CORAN output only.
CA SRE Hints
Casre.bat is a procedure file to automate running CA SRE. It assumes that if you want more than one factor, the factors are consecutive. It also assumes that you are using CORAN output. Both of these assumptions are in the procedure file, not the operation. Edit the procedure file to change them. Below are some example results.

CA SRA is very similar to the virtual CA SR above, but simpler. CA SRA does not require the input of a non-existing image to begin the virtual reconstitution.
CA SRA Hints
Casra.bat is the procedure file automation.
Casramontage.bat is the procedure file
written to create the image below.
The image below shows virtual images made by changing only one eigenvalue at a time
from -0.2 to 0.2 at regular intervals. The top row is eigenvalue one, and the
bottom row eigenvalue three. Along each row all other eigenvalues were held at
zero. The center column is identical because it is at this point that the
varying eigenvalues are equal to zero. Reconstitution with all eigenvalues set
to zero is equivalent to reconstituting the average of the series. If
reconstitution of each individual eigenvalue had progressed to one, the result
would be the first three images directly above. Reconstitution of the three
eigenvalues together will result in the fourth image above.
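The montage layout described above can be mimicked with a short sweep. As before, this is a PCA-style sketch with made-up four-pixel eigenimages, not what casramontage.bat or CA SRA actually computes: each row varies one coordinate from -0.2 to 0.2 while the others stay at zero.

```python
def virtual_image(average, eigenimages, coords):
    """Average image plus each eigenimage scaled by its chosen coordinate."""
    image = list(average)
    for eigenimage, c in zip(eigenimages, coords):
        image = [p + c * e for p, e in zip(image, eigenimage)]
    return image

# Hypothetical average and eigenimages for four-pixel images.
average = [1.0, 1.0, 1.0, 1.0]
eigenimages = [
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
steps = [i / 10 for i in range(-2, 3)]       # -0.2, -0.1, 0.0, 0.1, 0.2

montage = []                                  # one row per varied factor
for k in range(len(eigenimages)):
    row = []
    for s in steps:
        coords = [0.0] * len(eigenimages)     # all other factors held at zero
        coords[k] = s
        row.append(virtual_image(average, eigenimages, coords))
    montage.append(row)
```

The middle entry of every row has all coordinates at zero, so it reduces to the average image, matching the identical center column in the montage.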

The official SPIDER web page: http://www.wadsworth.org/spider_doc/spider/docs/master.html
Frank, J. (2006) Three-Dimensional Electron Microscopy of Macromolecular Assemblies. Oxford University Press, New York.
Updated: Jan 19, 2006