9 Mar. 2011 ArDean Leith
Since hardware speeds are stagnant or decreasing, there is increased interest in optimizing SPIDER's processing speed. Since SPIDER is a general-purpose EM imaging package, this means different things to different users. Locally, the biggest time demand for our single particle reconstructions is alignment of images with reference projections (SPIDER operations: 'AP SH' and 'AP REF'). To assess the effect of changes in compiler options I used the operation 'AP SHC', which is the latest highly 'tweaked' version of 'AP SH'. The usual test data was a set of 375x375 pixel images, comparing 50 experimental images against 550 references.
30 Sep. 2010 ArDean Leith
Nvidia GPUs vary in their compute capability and in the amounts of three different types of memory, which have a critical influence on how a problem can be approached. In addition, alignment tasks usually take more than 5 minutes of GPU time, which means that the GPU cannot currently be shared with graphics. Thus there must be a dedicated GPU (often a Tesla/Fermi board).
Computer science publications and anecdotes commonly report speed-ups as the increase in speed of the parallelized portion of the application over its speed on a single processor. In usual reconstructions (e.g. realistic ribosome reconstructions) significant time is required to read images from disk. Such input typically occupies 3-10% of the time during an alignment. If only 4% of the time is spent loading, the largest possible overall speed-up is 25X; 100X is impossible overall. Another trick is to report speed-ups from a cluster of GPU-enabled compute nodes, sometimes with multiple GPUs per processor.
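The 25X ceiling follows from Amdahl's law. A minimal sketch in Python, assuming the disk input is the only part that is not accelerated; the function names are illustrative and not part of SPIDER:

```python
def overall_speedup(serial_fraction, parallel_speedup):
    """Amdahl's law: overall speed-up when only the non-serial part is accelerated."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)


def max_overall_speedup(serial_fraction):
    """Upper bound reached when the accelerated part takes no time at all."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Input fractions taken from the 3-10% range quoted in the text.
    for io_fraction in (0.03, 0.04, 0.10):
        bound = max_overall_speedup(io_fraction)
        with_100x = overall_speedup(io_fraction, 100.0)
        print(f"I/O {io_fraction:.0%}: bound {bound:.1f}X, "
              f"with a 100X kernel {with_100x:.1f}X")
```

With 4% of the time spent on input, even a 100X faster alignment kernel yields only about 20X overall, and the bound of 25X is approached only as the kernel time goes to zero.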
SPIDER and other single particle reconstruction software usually have highly optimized alignment operations, commonly using OpenMP or MPI. Alignment speed as tested on our dual-hexcore computer scales very well with an increased number of cores (11X). Few computers today have a single core, and a useful speed-up should be defined in comparison to a reasonable computer setup, not versus speed on a single core.
In EM single particle reconstruction from reference projections using programs such as SPIDER, there is a vast range of different practical applications. The number of experimental images (x), the number of reference images (y), and the size of the images (z) can vary over orders of magnitude. E.g. x = 200-10,000 experimental images; y = 80-5000 reference images; z = 50x50 - 480x480 pixels.
The gold standard for alignment is still exhaustive search within a translation/rotation space, and the alignment is usually implemented with Fourier space cross-correlation of polar images. The common algorithm has an excess of ways in which the processing can be parallelized. A naive implementation on a GPU seldom results in more than a 2X speed-up. Only by tediously tuning the transfer of data within the GPU among the different memories can a speed-up of 12-20X be achieved. However, a small change in the x, y, z variables mentioned above, or a change of compute capability in the GPU, can completely negate the speed-up, resulting in even poorer performance than without a GPU. Such a change requires a new implementation.
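To illustrate the rotational part of such a search, here is a minimal sketch of Fourier space cross-correlation on rings of a polar image. It is not SPIDER's implementation: the function names are hypothetical, each ring is assumed to be already resampled to the same number of angular samples, and a naive O(n^2) DFT stands in for the FFT library a real program would use.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT library call)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


def circular_cross_correlation(reference, experimental):
    """c[s] = sum_t reference[t] * experimental[(t + s) % n], computed in Fourier space."""
    ref_f = dft(reference)
    exp_f = dft(experimental)
    product = [r.conjugate() * e for r, e in zip(ref_f, exp_f)]
    return [c.real for c in dft(product, inverse=True)]


def best_rotation(reference_rings, experimental_rings):
    """Sum the ring correlations and return (shift in samples, angle in degrees)."""
    n = len(reference_rings[0])
    summed = [0.0] * n
    for ref_ring, exp_ring in zip(reference_rings, experimental_rings):
        for s, value in enumerate(circular_cross_correlation(ref_ring, exp_ring)):
            summed[s] += value
    shift = max(range(n), key=summed.__getitem__)
    return shift, 360.0 * shift / n


if __name__ == "__main__":
    n = 36  # 10 degree sampling, chosen only to keep the demo small
    ring = [math.sin(2 * math.pi * t / n) + 0.5 * math.cos(6 * math.pi * t / n)
            for t in range(n)]
    rotated = [ring[(t - 5) % n] for t in range(n)]  # rotate by 5 samples
    print(best_rotation([ring], [rotated]))
```

Every reference/experimental pair, every ring, and every trial translation is an independent unit of work here, which is one way to see why there are so many possible ways to parallelize the search.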
It is probably possible to create implementations that will give 12-20X speed-ups for any specific set of x, y, z and hardware. However, a general implementation giving such a speed-up is currently impossible. Multiple (10-20?) implementations will be needed for each hardware configuration, and the logic to select the implementation is complex. Each implementation requires substantial programming effort.
Currently reported alignment implementations admit that there have been unreported changes (degradations) in search algorithms or severe restrictions on various parameters. One report gives a rotational alignment resolution of only 6 degrees. Such a restriction makes the implementation useless on images greater than 100 pixels.
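A rough way to see why a 6 degree step is too coarse is to compute how far a pixel at the image edge moves for one rotational step. The sketch below assumes, as an illustration not taken from the reports, that a step should displace the edge by no more than about one pixel; the function names are hypothetical.

```python
import math


def edge_displacement_pixels(image_size, step_degrees):
    """Arc length, in pixels, swept at the image edge by one rotational step."""
    radius = image_size / 2.0
    return radius * math.radians(step_degrees)


def step_for_one_pixel(image_size):
    """Rotational step in degrees that moves the image edge by one pixel."""
    radius = image_size / 2.0
    return math.degrees(1.0 / radius)


if __name__ == "__main__":
    # A 6 degree step on a 100 pixel image moves the edge by about 5.2 pixels.
    print(f"{edge_displacement_pixels(100, 6.0):.1f} pixels")
    # Step needed for one pixel at the edge of a 375 pixel image.
    print(f"{step_for_one_pixel(375):.2f} degrees")
```

Under that assumption the error at the edge grows linearly with image size, so a fixed 6 degree resolution becomes steadily worse for larger images.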
We can provide a single implementation in SPIDER that can give a 16X speed-up for a specific small range of parameters. However, the overhead required to do so, including instructions on how to interact with 9 different run-time libraries for FFT, BLAS, and NVIDIA, makes even this minimally useful implementation painful. When compared to a run on a dual-hexcore computer, this is really only an effective speed-up of about 1.5X!
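One plausible reading of the 1.5X figure, assuming it is the ratio of the 16X GPU speed-up to the 11X multi-core scaling quoted earlier (both measured against a single core), is the following small calculation; the function name is illustrative only:

```python
def speedup_over_baseline(accelerated_vs_single_core, baseline_vs_single_core):
    """Speed-up of an accelerated setup relative to a realistic multi-core baseline."""
    if baseline_vs_single_core <= 0.0:
        raise ValueError("baseline speed-up must be positive")
    return accelerated_vs_single_core / baseline_vs_single_core


if __name__ == "__main__":
    # 16X (GPU vs one core) against 11X (dual-hexcore vs one core).
    print(f"{speedup_over_baseline(16.0, 11.0):.2f}X")
```

The same function can be used to restate any published single-core speed-up against the machine that would actually be bought instead of the GPU.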
Currently my advice is to carefully evaluate multi-core computers versus GPU-enabled computers. Only if you have an extremely heavy compute load involving a single set of x, y, z parameters would it be worthwhile to go to a GPU solution. Then you will need software that is capable of handling your specific problem parameters. Otherwise, split the problem among standard multi-core compute nodes. It probably will not be much more expensive to do so. If you still need increased speed, invest in a parallel filesystem for enhanced disk access (e.g. a Panasas disk array).
This recommendation may change in the future and I will revisit this subject when I get access to the new Tesla GPU and the newly announced CUDA 4.0.
6 Mar. 2009 ArDean Leith
While getting ready to retire a bunch of old SGI MIPS-based servers and workstations, I wondered how much faster our current AMD Opteron 64-bit Linux boxes are than our trusted old machines of 5-10 years ago. See the benchmark table.
11 Feb. 2009 ArDean Leith
If you are using a Beowulf-type cluster for parallel execution of time-consuming operations during single particle reconstruction, there are three common methods of parallelizing discussed on our website. Since the iterative alignment and defocus group backprojection steps typically consume more than 98% of the compute time and are trivially parallelizable by defocus group, we commonly use a simple PubSub script for distributing jobs to different compute nodes. Other sites have their own scripts to handle the distribution. However, if you have an inexpensive cluster with simple Ethernet networking, this method is very inefficient when many nodes access a single storage disk or simple RAID array on a file server using NFS mounts from the compute nodes.
When many compute nodes attempt to access a single disk (or RAID array) using NFS, there is a significant slowdown in overall throughput. There is a lot of effort currently to overcome this problem with various methods, e.g. Parallel NFS. However, if all your compute nodes include adequate local storage, there is a simple solution that may improve throughput. At the beginning of a compute node's computation, copy all the files that are accessed to the local disk with a system call, then carry out the computations. At the end of the compute node's processing, copy any altered files back to the file server.
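The copy-in, compute, copy-back pattern can be sketched as follows. This is a generic Python illustration, not the SPIDER procedure itself: the function names, the flat directory layout, and the use of a content hash to detect altered files are all assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large image stacks need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_with_local_copies(server_dir, file_names, compute, scratch_root=None):
    """Copy inputs to local scratch, run compute(local_dir), copy changes back.

    Returns the sorted names of the files copied back to server_dir.
    """
    server_dir = Path(server_dir)
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_dir = Path(scratch)
        before = {}
        for name in file_names:
            # One sequential read per file from the file server.
            shutil.copy2(server_dir / name, local_dir / name)
            before[name] = file_digest(local_dir / name)

        compute(local_dir)  # all further disk access hits the local disk

        copied_back = []
        for path in sorted(local_dir.iterdir()):
            # New files have no recorded digest, so they are copied back too.
            if path.is_file() and before.get(path.name) != file_digest(path):
                shutil.copy2(path, server_dir / path.name)
                copied_back.append(path.name)
        return copied_back
```

Here `scratch_root` would point at the node's local disk, and `compute` is whatever callable runs the actual job. The benefit comes from replacing many small, contended NFS accesses with one bulk transfer in each direction.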
We have recently altered the scripts that we use during the projection matching step of 3D reconstruction so that pub_refine.pam and its associated procedures (especially pub_refine_start.pam) handle the cloning of the necessary files onto local compute nodes and the transmission back to the server at the end of the processing on the compute nodes.
On our compute cluster this modification is very productive. The speed increase will of course depend on the number of simultaneous processes, and the pattern of disk access.
Source: random.html Page updated: 9 Mar. 2011 ArDean Leith