Classification and Clustering Tutorial


Contents

  1. Correspondence or Principal Component Analysis.
  2. Determine what variations are attribute or noise based.
  3. Create Factor Maps .
  4. Controlled Clustering and Hierarchical Classification.
  5. Re-create certain images from eigenvectors.
  6. Create "virtual" images from eigenvectors.
  7. View dendrogram averages & factor (correspondence) map averages.
  8. References.


Automatic Classification/Clustering -- General Steps

  1. Execute CORAN or PCA
  2. Determine What Factors Are Important
  3. View Images
  4. Web Use

Source Data

SPIDER procedures: makefaces.spi (and the face.spi procedure which it calls) were used to create the eight original faces below. The faces differ in three ways: oval vs. round head, left vs. right eyes, and big vs. small mouth. Ten copies of each face with random noise created the sample data set. These procedure files create four kinds of files. The scr* files are the face templates, seen here. The sma* files are the noise-filled data set, example below. The sca* files carry the average of the ten noise images for each template, and scv* hold the variance for each template.


Correspondence Analysis (CA) or Principal Component Analysis (PCA)

CA is the preferred method of finding variations and we will principally be discussing inter-image variance. PCA computes the distance between data vectors with Euclidean distances, while CA uses Chi-squared distance. This is superior because it ignores differences in exposure between images, eliminating the need to rescale between images.
The procedure file: cas.spi invokes the 'CA S' operation. The procedure file assumes you prefer CA and creates a user-defined circular mask. The procedure cas.spi also produces eigendoc.dat document file which contains listing of eigenvectors for each image for factors 1 and 2.

Hints for: 'CA S'

Description of Output Files (links lead to output from faces example)


Determining Which Eigenvalues Are Useful, And Which are from Noise

The best method to determine what eigenvalues are useful to use is to view a histogram of them. As the histogram to the right shows, the last five eigenvalues are small compared to the first three. Also they are level, typical of noise. This is interpreted as the first three eigenvectors control more inter-image variance than the last five.

How To View Eigenvalue Histogram
  1. Run: python eigendoc.py name_EIG.* > gnuplot_file
    where
  2. View file of gnuplot commands: gnuplot -persist gnuplot_file
    Note: The gnuplot output can be converted to a gif image using: gnu2gif

Creating Factor Maps to View Groupings, If Any

Once we know which eigenvectors have some meaning and which are noise, we can see if there are any clear clustering. 'CA SM' is an operation used to view 2D factor maps (graphs) of selected eigenvalues. Note that to view these maps you must have already run: CA S first.

Hints for: 'CA SM'

The procedure file: factor.spi invokes the SPIDER procedure: factor_casm.spi three times, creating three different factor maps using 'CA SM' and selecting different pairs of factors. 'CA SM' is run under the assumptions that you want image space with "id's" to create PostScript plots.

The three factor maps below were created from image (IMC) factor maps using GNUplot. Postscript files are similar.

Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3

'CA SM' also works for pixel factor maps (PIX)as well. A common problem when creating a pixel factor map is the "ERROR: *** MAP ABORTED, MORE THAN 264 POINTS ON FRAME" error message. This is because of the large number of points. This can be fixed by increasing the value of ".NUMBER OF SD OR ≤CR>=2.3:". This increases the scale of the factor map. The pixel factor maps below were created from a screen snapshot. The large number of pixels located in the center show that most pixels vary little if at all between eigenvectors.

Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3


Controlled Hierarchical Clustering

The operation 'CL CLA' is limited in that it only uses Diday's and Ward's methods. 'CL HC' is more robust because the user controls the clustering criterion and can alter the "weights" for each factor. To best represent the "truth" set all factor weights equal to each other.

Hints for: 'CL HC'

The procedure: clhc.spi invokes 'CL HC' which reads the CA/PCA files from 'CA S' and creates dendroram plot files and document files.

Screen snaphots of the PostScript dendrogram from 'CL HC' for our "faces" data using clustering option 2 (complete linkage) and option 5 (Ward's method):

Complete linkage --- Ward's method

Dendrogram doc files affiliated with the above dendrograms:

Complete linkage doc --- Ward's method doc

Screen snapshot of Web display of the complete linkage dendrogram truncated at 20% level. Note that all of the images are properly seperated by head shape and eye direction. Mouth size is not so clear because it's eigenvalue is close to the eigenvalue for noise:

Dendrogram snapshot

The operation 'CL HD' uses the dendrogram doc. file from 'CL HC' or 'CL CLA' to create a document file listing how many classes there for a given threshold and how many images there are in each class. This operation is similar to viewing the dendrogram in Web, setting a threshold, and recording the number of images and the number above each image.

The operation 'CL HE' uses the dendrogram doc. file from 'CL HC' or 'CL CLA' to create a series of class doc. files (e.g. class : 1) , listing which images are assigned to each class.

The procedure: clhe.spi invokes 'CL HD', 'CL HE', and then 'AS R' to threshold the dendrogram produced by 'CL HC' at a theshold of 66%, then average together the 20 images in each of the resulting classes. The class averages are then placed in a montage. The images are classified perfectly according to head shape and eye direction.


Automatic Clustering and Hierarchical Ascendant Classifications (HAC)

With the "useful" eigenvectors known, we can much more efficiently determine the representative clusters. This also allows compression of information with minimal loss. Some output of 'CL CLA' can also work with 'CA SM' to produce more meaningful maps.

'CL CLA' differs from the other clustering operations in that for clustering it uses Diday's method, and for HAC it uses only Ward's criterion. A disadvantage of the K-means method is that the final clustering is very dependent of what seeds are initially chosen. Diday surpassed this by appplying the K-means technique multiple times with different seeds. Then, cross-tabuluating the results, and using only the clusters that were repeatedly formed. Ward's criterion states that merging HAC clusters should be focused on minimizing the added interclass variance. The two clusters that differ the least between each other will be merged and create a new cluster, one "level" higher.

Hints for using:
'CL CLA'

The procedure: clcla.spi invokes 'CL CLA' to create a dendrogram plot containing an image of a PostScript dendrogram output. A vertical and horizontal line meeting signify the joining of two classes below it. A representitive reconstruction can be formed for each class with a 'CA SR' comand. The larger vertical bars signify a greater difference between classes. The many small differences at the bottom can be eliminated with an increase of the "% cutoff" command. The jpg file was obtained by using screen snapshot.

The results file that is formed after every SPIDER run also holds a large amount of information after a 'CA CLA' run.


Examination Of K-Means Clustering

K-Means is a method of clustering that devides the data into a user defined number of clusters. Two random images "seeds" are chosen, and their centers of gravity are computed. A partition is drawn down the middle between the centers, the new centers of gravity are computed, and the process is repeated for a given number of times. The final result is VERY dependent on which image seeds are the first chosen.

Because our faces data set is manufactured. We know exactly which images are identical, except the random noise, and the exact number of clusters. The output discussed was obtained with 8 classes, using factors 1-3, and an even factor weight of 1.0 between those three factors. (NOTE: The results were from a data set with somewhat different noise from other runs discussed in this tutorial)

The doc file: IMC453doc.dat was produced by a run of 'CL KM' with the above input values with a random seed number of 453. The third column describes the image number and the fourth column is the class that 'CL KM' placed the image in. Images 1-10 were all placed in cluster 6, which is what we expect because they are all noisy images of the same protoimage. 'CL KM' kept the images from the same protoimage clustered together somewhat, except for the last ten images. However, it prefered to place images 11-20 and 31-40 in the same cluster, instead of each giving them each their own cluster. The average image for images 11-20 and 31-40 are shown below. They differ by their mouth size.

The doc file: IMC789doc.dat is a file from a 'CL KM' run exactly same as the previous, except with a random number of 789. Once again, most images were placed into the correct protoimage cluster correctly, except for the last few images. But in addition to images 11-20 and 31-40 being clustered together, images 1-10 and 21-30 were placed in the same cluster as well. This clearly demonstrates that K-means is highly dependent on the first image chosen, and should be used with extreme caution. Below are the average images for 1-10 and 21-30.

The doc files: SEQ453doc.dat and PIX453doc.dat are outputs from 'CL KM' being run on the same data as above, but using the SEQ and PIX files, respectively. The PIX453doc.dat file is 95Kb in size. The results for the SEQ run should be the same result as the previous runs, because it is still comparing images. However, the PIX results are expected to be different because it is trying to place the PIXELS in eight different classes.

Hints for: 'CL KM'

The procedure file: clkm.spi invokes 'CL KM' and created class averages for specified number of classes.


Re-Create Images From Eigenvectors

Re-creation of sample images from eigenvectors can eliminate the noise in the reconstituted image, this also results in large data compression. Below, is an image from our sample data set and four seperate reconstructions using the first four eigenvectors. The "halo" of noise and dark corners is a result of the masking function in the procedure files (face.spi and makefaces.spi) used to create this data.

It is difficult to see what traits the original image has because of noise. With the first three single eigenvalue reconstructions we can see that it has an oval head, with eyes looking left, and a small mouth. The fourth eigenvector does not "add" anything to our knowledge of the image because there are only three attributes that carry information. Threfore the fourth carries only noise.

Below is the sample image again, and a reconstruction from the first three eigenvectors combined. The image is from the 69'th prototype image shown in the source data area.

Hints for: 'CA SR'

The procedure file: casr.spi invokes 'CA SR' to re-constitute any image, or images, from the eigenvectors.

Below we have the same "protoimage" images used to create the source data, a sample image from each protoimage,and we also have re-created the noisy samples using each eigenvector. The first row is the protoimages used to create the 80 sample images, and the second row consists of a sample image created from the protoimage above it. The third row is each sample image re-created with only the first eigenvector used. The next row includes the second eigenvector, and the last includes all relevant eigenvectors.


Create Arbitrary "Virtual" Images From Eigenvectors

The operation: 'CA SRA' can create images or pixels that were not actually captured, using eigenimages. To do this, the eigenvalues must be given. This can be useful to interpolate images in between classes.
To choose what values to use for eigenvalues, use the factor maps. If you try to use a value outside the range of values for an eigenvalue, the results will be difficult to predict and interpret.

The images for eigenvector 1 equal to -0.1 and 0.1 are shown in the first two images. They "make sense" in that this vector controls headshape only. These values were chosen because they are the most extreme values shown on the factor maps. But the images for eigenvector 1 equal to -1 and 1,the last two images, do not represent "extreme" roundness like one might think. This is because these values are outside the range that they actually exist, i.e. the factor maps.

Hints for: 'CA SRA'

The procedure file: casra.spi invokes 'CA SR'. The procedure file: casramontage.spi was used to create the image below.
The virtual images are created by changing only one eigenvalue at a time from -0.2 to 0.2 at regular intervals. The top row is eigenvalue one, and the bottom row eigenvalue three. Along each row all other eigenvalues were held to zero. The center column is identical because it is at this point that the varrying eigenvalues are equal to zero. Reconstitution with all eigenvalues set to zero is equivalent to reconstiting the average of the series. If reconstitution of each indiviual eigenvalue had progressed to one, the result would be the first three images directly above. Reconstitution of the three eigenvalues together will result in the fourth image above.


Viewing Eigenimages

The operation: 'CA SRE' creates eigenimages from 'CA S' outputs. With eigenimages, the user can easily see what the computer has determined as the factors to classify images by.

Hints for: 'CA SRE'

The procedure file:casre.spi invokes 'CA SRE'. It assumes that if you want more than one factor, they are continuous. Also assumes that you are using CORAN output. Both of these assumptions are in the procedure file, not the operation. Edit the procedure file to change them. Example results.


Show Differences Between Images (Eigenimages)

Use 'CA SRD' to see the different eigenimages that are used to recreate images.

The first image is the average among all 80 sample images. The next three are the difference image for each useful eigenvector for one image. The dark slivers in the first eigenvector image show that in order to obtain the correct head shape for this image, we must add black to either side of the face. Similarily, for the eyes we would make the left side of each eye socket brighter, and the right side darker. From the fourth image this image has a wide mouth .

Below are the seperate difference eigenimages for another image, as well as the combination of all three. This image has the opposite of every trait of the image represented above. We can tell because of white slivers on a dark background, right-hand side of the eyes light, etc. The last image below is the composite of the first three eigenimages.

Below is the average of all the images as well as the composite difference eigenimage from above. When we add (superimpose) these together we get the third image, a re-creation of the original sample image. However, the re-creation has no noise. This is because we only used the first three eigenimages. If we included the other five, then we would have re-created the sample image, with noise.

Hints for: 'CA SRD'

The procedure file: casrd.spi invokes 'CA SRD'. Note that it is similar to casr.spi except for minor changes.


Subgrouping images

Use 'CA SMI' to separate a series of images into active/inactive clusters. It appears that this can actually perform operations on a series of images using the 'CA S' files from another series. This has not been tested. If the images used in 'CA SMI' will be used with their 'CA S' run, it is a good idea to create a Postscript map before running 'CA SMI' for comparison later.

Hints for: 'CA SMI'

The procedure file: casmi.spi invokes 'CA SMI'.
Below are the 'CA SM' maps with no 'CA SMI' input, and below those maps with 'CA SMI' input. Note the labeled images are the same in all three, but the axis switch with an odd numbered factor. This axis switch is caused by using non-transposed data set. In order to have 'CA SMI' run on this data, forced 'CA S' to not transpose the data, with the "CN entry". Because of this, 'CA SM' reads in the images different from the transposed data. Be sure to take note of axes if using 'CA SMI' -> 'CA SM'

Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3

Factor 1 vs. Factor 2 --- Factor 1 vs. Factor 3 --- Factor 2 vs. Factor 3


View Dendrogram In Web

Web can be used to view a dendrogram of the data instead of using CL CLA. Should also work with the output of CL HC. Web will display not only a usual dendrogram, but also the average of all the images below a threshold, and the number of images in each average.

How To Use The Web Dendrogram
  1. Run 'CL CLA' (or 'CL HC'), and request a dendrogram doc file.
  2. Make sure that your image files and dendrogram doc file have the same extension and are in the same folder, it's easier.
  3. Start Web with your extension (i.e."Web dat &") and select Command -> image from the menus to navigate to where the images and cluster files are. Press "OK".
  4. Commands -> Dendrogram and select the cluster file.
  5. When the Dendrogram window pops up, to see all of what you selected, choose lowest threshold option. All the variation that was carried over from the 'CL CLA' or 'CL HC' operation is scaled from what the lowest level selected to +1 above.
  6. If the "Show average images" option is selected, Web will show the average of the lowest level shown. This is why having the images in the same folder as the dendrogram doc file is easier.

The procedure file: webdendro.spi invokes 'CL CLA' so that a new dendrogram document file is the only thing created, used to change the lower threshold.
The document file: webdendoc.dat is an example of a dendrogram document file.

The procedure file: rename.spi can be used to change the extensions of a whole series of images. Should not be a problem if follow hint number three above.


View Factor (Correspondence) Map In Web and Compute Averages

In Web you view a factor map and select which images are similar using a "lasso" interface. Then the average of that class is computed and can be viewed or stored as a new image. Also, a SPIDER document file can be printed to what class each image was placed in. I believe that only images can be used, not pixels.

How To Use The Web Factor Map
  1. Run the 'SD C' command in SPIDER.
  2. Start Web with your data file extension (i.e."Web dat &") and use the command -> image and "accept" to navigate to where the cluster file is.
  3. If you are going to make image averages, the data files and the 'SD C' doc. file should be in the same folder and have the same extension. It is easier.
  4. In Web, Commands -> Corr-map. Explanation of window options.
  5. When you "accept", a Web window will appear, with either numbers or circles representing the images, depending on the "show image numbers" toggle.
  6. "Lasso" a group of images with the left mouse button, and after completely enclosing the intended images, press the center button. A new window will appear.
  7. Continue selecting image markers with new/continue masking until satisfied. Note that one image can be in many different groups.

The doc. file: sdcout.dat is the result of running 'SD C' on a _IMC file. It lists the image number and factor co-ordinates of each image.

The image: corrmap.gif is a screen-shot of a corr-map run. Note the placement of the average images. The upper-right image is overlapped with it's respective mask. We can also see what two traits were being compared in this factor map, head shape and eye direction.
docimg001.dat is the file created with the "save images in Doc. file" option. It lists the image number, X and Y co-ordinates and the order of class it was formed from. This particular class was the third formed, but the first image document list.


References


Source: tutorial.html     Last update: 19 June 2009