SPIDER: Random Info

Occasional Thoughts about SPIDER, etc.


9 Mar. 2011     ArDean Leith

Optimization

Since hardware speeds are stagnant or decreasing, there is increased interest in optimizing SPIDER's processing speed. Since SPIDER is a general purpose EM imaging package, this means different things to different users. Locally the biggest time demand for our single particle reconstructions is alignment of images with reference projections (SPIDER operations: 'AP SH' and 'AP REF'). In order to assess the effect of changes in compiler options I used the operation 'AP SHC', which is the latest highly 'tweaked' version of 'AP SH'. The usual data was a set of 375x375 pixel images, comparing 50 experimental images against 550 references.

Compiler choice
We have access to both PGI and Intel Fortran compilers. I chose to use the PGI compilers because the Intel compiler produces poorly optimized executables for use on AMD Opteron hardware. The PGI compiled executables work well on both Intel and AMD hardware. The results reported here use the current release of the PGI compiler (Release 11.1).

Optimization Level
Aggressive optimization with PGI -O3 gives a 3-4% speedup on the benchmark code. However this optimization level can only be used with great care. Some SPIDER operations give erroneous results with this compilation. This is probably due to differences in the execution order of statements, which is a problem with floating point data whose absolute values can vary widely. Changing the order of arithmetic operations like subtract and divide can sometimes affect the accuracy of the output. Thus use of -O3 can only be justified with careful testing. Code for the operation 'AP SH' is now mostly compiled at level -O3 following such extensive testing. Most non-alignment operations are compiled at level -O2.
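
The sensitivity to evaluation order can be seen in a few lines of Python. This is only an illustrative sketch, not SPIDER code: the function names and the values are chosen for the example, and it shows the general floating point effect rather than what any particular compiler does at -O3.

```python
def add_in_order(a, b, c):
    """Evaluate (a + b) + c, i.e. strictly left to right."""
    return (a + b) + c


def add_reordered(a, b, c):
    """Evaluate a + (b + c), a reordering an optimizer might consider."""
    return a + (b + c)


# Illustrative values with a wide range of magnitudes.
big, small = 1.0e17, 1.0

in_order = add_in_order(big, -big, small)    # big cancels first, small survives
reordered = add_reordered(big, -big, small)  # small is absorbed into -big first

print(in_order, reordered)
```

The two results differ because 1.0 is smaller than the spacing between adjacent double precision numbers near 1.0e17, so it is lost when added to -1.0e17 before the cancellation takes place.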

Kieee Flag
Since SPIDER was ported to Linux from SGI I have always used the PGI flag -Kieee, which requests strict use of IEEE conventions inside mathematical operations. Originally I used this in order to get the same results from code compiled with PGI as from SGI code. PGI says this flag may slow operation, but I am surprised to find that it increases the speed of my benchmark by as much as 8%. Since it is also presumed to be more accurate, including this flag is a no-brainer.

Inlining Subroutines
Inlining subroutines/functions is expected to increase speed, since there is less overhead from stacking the current subroutine's data when invoking a called function. However in my benchmark it has a negative effect on speed, slowing operation by as much as 10%. Inlining is also dangerous, since it is tricky to ensure that the inlined code is kept in sync with the latest source. Thus inlining is not helpful.

Compiling for Large data
PGI compilers have the flags -mcmodel=medium and -Mlarge_arrays, which affect the ability of the executable to handle large static data and large dynamically allocated data (typical of some operations which import large files of data). Depending on how SPIDER is used (particularly if inline/incore files are defined) some sites require the ability to handle these large files. The executables distributed with SPIDER have usually been compiled with -mcmodel=medium for handling large static arrays. Benchmarking shows that this has an insignificant impact on executable speed.

Compiling Static vs Dynamic Executables
Statically compiled executables do not require the presence of certain PGI or system libraries at execution time. In return the executable is larger than a dynamically linked executable. SPIDER has usually been distributed with static executables. My benchmark shows no difference in speed between these two types of executables. Since static executables have far fewer installation problems across varied Linux distributions and ages, I have always preferred this option.

Compiling for use with OpenMP
PGI compilers have the flag -mp for creation of code that utilizes OpenMP parallelization on suitable hardware. The executables distributed with SPIDER have been compiled with this flag for 20 years. Using all 12 cores of a dual-hexcore AMD Opteron gives a 905% speedup over a single process on my benchmark.

Compiling For NUMA
AMD Opterons should support NUMA (non-uniform memory architecture) execution when used on multi-processor hardware. PGI compilers have the flag -mp=numa, which would utilize this capability inside OpenMP. My benchmark shows no difference in speed for executables compiled with/without this flag on a dual-hexcore AMD Opteron compute node. Since use of this flag also requires dynamic executables, it is not used in our distributed executables.

Compiling for use with SSE SIMD Vectorization
PGI and Intel compilers have flags, e.g. -fastsse, which allow optimization for use with SSE SIMD. This vectorization increases speed on suitable hardware. The executables distributed with SPIDER have been compiled with this flag for several years.

Compiling with Interprocedural analysis
PGI compilers have the flag -ipa, which allows optimization across procedural boundaries. This may increase speed. My benchmark shows no difference in speed for executables compiled with/without this flag. However I am not certain that the compiler applies this analysis when source code is in different files, so this may not have been a complete test of the option.

Compiling with Older Compiler Releases
The executables distributed with SPIDER have been created with PGI Release 8.6 for several years. This was done because this release had good support for creation of static executables. Release 11.1 now supports quality static executable creation and will be used in future distributions. I see no significant speed increase in executables built with the newer release, but it allows use of newer Fortran 2003 conventions which are valuable in coding.


30 Sep. 2010     ArDean Leith

CUDA SnakeOil

Question:
Alignment is the major time step in creating an EM single particle reconstruction and is easily parallelized with many different schemes. Why aren't GPUs more useful in alignment of images during EM single particle reconstruction? What is the hold up? These techniques have been available for five years now and are common in other fields.
Answer:
News reports and anecdotes about the tremendous speed increases coming from application of graphics processing units (GPUs), usually involving Nvidia and CUDA, to computing tasks have created unrealistic expectations. For some problems GPUs offer great improvement. However for some easily parallelizable problems such as alignment they lack utility. Some of the claims about the use of GPUs can even be characterized as 'snake oil'.

Nvidia GPUs vary in their compute capability and in the amounts of three different types of memory, which have a critical influence on how a problem can be approached. In addition alignment tasks usually take more than 5 minutes of GPU time, which means that the GPU cannot currently be shared with graphics. Thus there must be a dedicated GPU (often a Tesla/Fermi board).

Computer science publications and anecdotes commonly report speed-ups as the increase in speed of the parallelized portion of the application over its speed on a single processor. In usual reconstructions (e.g. realistic ribosome reconstructions) significant time is required to read images from disk. Such input typically occupies 3-10% of the time during an alignment. If only 4% of the time is spent loading, the largest possible overall speed-up is 25X, because that 4% is not accelerated at all. 100X is impossible overall. Another trick is to report speed-ups from a cluster of GPU enabled compute nodes, sometimes with multiple GPUs per processor.
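
The 25X ceiling is an instance of Amdahl's law, which can be sketched as follows. The function name and the sample speed-up of 100X for the accelerated portion are illustrative assumptions; only the 4% and 3-10% loading fractions come from the text above.

```python
def overall_speedup(serial_fraction, parallel_speedup):
    """Amdahl's law: overall speed-up when only the non-serial part is accelerated.

    serial_fraction  -- fraction of run time that is not accelerated (e.g. disk input)
    parallel_speedup -- speed-up achieved on the remaining fraction
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Upper bound with 4% of the time spent loading: the rest takes no time at all.
ceiling = overall_speedup(0.04, float("inf"))

# A hypothetical 100X speed-up of the alignment itself, same 4% loading time.
realistic = overall_speedup(0.04, 100.0)

print(f"ceiling {ceiling:.1f}X, with 100X on the parallel part {realistic:.1f}X")
```

With 4% loading time the ceiling is 25X, and a 100X gain on the alignment alone yields only about 20X overall. At the 10% end of the quoted range the ceiling drops to 10X.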

SPIDER and other single particle reconstruction software usually have highly optimized alignment operations, commonly using OpenMP or MPI. Alignment speed as tested on our dual-hexcore computer scales very well with increased number of cores (11X). Few computers today have a single core, and a useful speed-up should be defined in comparison to a reasonable computer setup, not versus speed on a single core.

In EM single particle reconstruction from reference projections using programs such as SPIDER, there is a vast range of different practical applications. The number of experimental images (x), the number of reference images (y), and the size of the images (z) can vary over orders of magnitude, e.g. x = 200-10,000 experimental images, y = 80-5000 reference images, z = 50x50 - 480x480 pixels.

The gold standard for alignment is still exhaustive search within a translation/rotation space, and the alignment is usually implemented with Fourier space cross-correlation of polar images. The common algorithm has an excess of ways that the processing can be parallelized. A naive implementation on a GPU seldom results in more than a 2X speed-up. Only by tediously tuning the transfer of data within the GPU among the different memories can a speed-up of 12-20X be achieved. However a small change in the x, y, z variables mentioned above, or a change of compute capability in the GPU, can completely negate the speed-up, resulting in even poorer performance than without a GPU. Such a change requires a new implementation.
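
The core of the rotational search can be sketched on a single polar ring. This is a deliberately simplified illustration and not the 'AP SH' code: it uses one ring only, no translation search, and a slow direct DFT from the standard library where a real implementation would use an optimized FFT. All function names and the test ring are made up for the example.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct O(n^2) discrete Fourier transform (stand-in for an FFT library)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


def best_shift(reference_ring, image_ring):
    """Return s maximizing sum_t image[t] * reference[t - s] (circular)."""
    ref_f = dft(reference_ring)
    img_f = dft(image_ring)
    # Correlation theorem: the product IMG * conj(REF) transforms back
    # to the circular cross-correlation over all shifts at once.
    product = [i * r.conjugate() for i, r in zip(img_f, ref_f)]
    scores = [c.real for c in dft(product, inverse=True)]
    return max(range(len(scores)), key=scores.__getitem__)


# A synthetic ring of 72 samples (5 degrees per sample) with two unequal bumps.
n = 72
reference = [math.exp(-((t - 10) ** 2) / 8.0) + 0.5 * math.exp(-((t - 40) ** 2) / 18.0)
             for t in range(n)]
rotated = reference[-7:] + reference[:-7]   # rotated[t] == reference[t - 7]

shift = best_shift(reference, rotated)
print(f"best shift: {shift} samples = {shift * 360.0 / n:.0f} degrees")
```

Every reference/image pair, every ring, and every trial translation is an independent instance of this calculation, which is one way to see why the algorithm offers so many choices of what to parallelize.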

It is probably possible to create implementations that will give 12-20X speed-ups for any specific set of x,y,z and hardware. However a general implementation giving such speed-up is currently impossible. Multiple (10-20?) implementations will be needed for each hardware and the logic to select the implementation is complex. Each implementation requires substantial programming effort.

Currently reported alignment implementations admit that there have been unreported changes (degradations) in search algorithms or severe restrictions on various parameters. One report gives a rotational alignment resolution of only 6 degrees. Such a restriction makes the implementation useless on images greater than 100 pixels.
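
One plausible reading of the 100 pixel limit is geometric: a rotation step moves a pixel by an arc length that grows with its distance from the image center. The sketch below shows that arithmetic. The function name is illustrative, and the assumption that the shift at the image edge is the relevant measure is mine, not taken from the cited report.

```python
import math


def edge_shift_pixels(image_size, angle_step_degrees):
    """Arc length, in pixels, moved by a point at the image edge for one rotation step."""
    radius = image_size / 2.0
    return radius * math.radians(angle_step_degrees)


# Image sizes taken from the ranges quoted elsewhere on this page.
for size in (50, 100, 375):
    print(f"{size}x{size} image, 6 degree step: "
          f"{edge_shift_pixels(size, 6.0):.1f} pixels at the edge")
```

For a 100 pixel image a 6 degree step already corresponds to about 5 pixels at the edge, and for the 375 pixel benchmark images to nearly 20 pixels.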

We can provide a single implementation in SPIDER that can give a 16X speed-up for a specific small range of parameters. However the overhead required to do so, including instructions on how to interact with 9 different run-time libraries for FFT, BLAS, and NVIDIA, makes even this minimally useful implementation painful. When compared to a run on a dual-hexcore computer, the effective speed-up is really only about 1.5X!

Currently my advice is to carefully evaluate multi-core computers versus GPU enabled computers. Only if you have an extremely heavy compute load involving a single set of x,y,z parameters would it be worthwhile to go to a GPU solution. Then you will need software that is capable of handling your specific problem parameters. Otherwise split the problem among standard multi-core compute nodes. It probably will not be much more expensive to do so. If you still need increased speed, invest in a parallel filesystem for enhanced disk access (e.g. Panasas disk array).

This recommendation may change in the future and I will revisit this subject when I get access to the new Tesla GPU and the newly announced CUDA 4.0.


6 Mar. 2009     ArDean Leith

While getting ready to retire a bunch of old SGI MIPS based servers and workstations, I wondered how much faster our current AMD Opteron 64 bit Linux boxes are than our trusted old machines of 5-10 years ago. Benchmark table.


11 Feb. 2009     ArDean Leith

If you are using a Beowulf type cluster for parallel execution of time consuming operations during single particle reconstruction, there are three common methods of parallelizing discussed on our website. Since the iterative alignment and defocus group backprojection steps typically consume more than 98% of the compute time and are trivially parallelizable by defocus group, we commonly use a simple PubSub script for distributing jobs to different compute nodes. Other sites have their own scripts to handle the distribution. However if you have an inexpensive cluster with simple Ethernet networking, this method is very inefficient when many nodes access a single storage disk or simple RAID array on a file server using NFS mounts from the compute nodes.

When many compute nodes attempt to access a single disk (or RAID array) using NFS there is a significant slowdown in overall throughput. There is a lot of effort currently to overcome this problem with various methods, e.g. Parallel NFS. However if your compute nodes include adequate local storage on all the nodes, there is a simple solution that may improve throughput. At the beginning of a compute node's computation, copy all the files that are accessed to the local disk with a system call, then carry out the computations. At the end of the compute node's processing, copy any altered files back to the file server.
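
The copy-in, compute, copy-back pattern can be sketched as follows. This is a hypothetical Python illustration of the idea only; it is not the SPIDER procedure, and the function name, its arguments, and the file names in the usage lines are all invented for the example.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def run_with_local_staging(server_dir, filenames, compute, scratch_root=None):
    """Copy inputs to local scratch, run compute(local_dir), copy changes back.

    server_dir   -- directory on the (NFS mounted) file server
    filenames    -- names of the files the computation will read
    compute      -- callable that does all of its I/O inside the local directory
    scratch_root -- local disk location for the temporary directory (optional)
    Returns the names of the files that were copied back to the server.
    """
    server_dir = Path(server_dir)
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_dir = Path(scratch)
        for name in filenames:
            shutil.copy2(server_dir / name, local_dir / name)

        compute(local_dir)  # all disk access during the run stays local

        copied_back = []
        for path in sorted(local_dir.iterdir()):
            if not path.is_file():
                continue
            target = server_dir / path.name
            # Only new or altered files are sent back over the network.
            if not target.exists() or not filecmp.cmp(path, target, shallow=False):
                shutil.copy2(path, target)
                copied_back.append(path.name)
        return copied_back


# Usage with a throwaway directory standing in for the file server.
with tempfile.TemporaryDirectory() as server:
    (Path(server) / "images.dat").write_text("raw")
    (Path(server) / "refs.dat").write_text("references")

    def align(work_dir):
        # Stand-in for the real computation: writes one new output file.
        (work_dir / "align_out.dat").write_text("aligned")

    changed = run_with_local_staging(server, ["images.dat", "refs.dat"], align)
    print(changed)
```

Unchanged input files are compared by content and left alone, so the file server sees one burst of reads at the start and a small burst of writes at the end instead of continuous traffic from every node.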

We have recently altered the scripts that we use during the projection matching step of 3D reconstruction so that pub_refine.pam and its associated procedures (especially pub_refine_start.pam) handle the cloning of the necessary files onto local compute nodes and their transmission back to the server at the end of the processing on the compute nodes.

On our compute cluster this modification is very productive. The speed increase will of course depend on the number of simultaneous processes, and the pattern of disk access.


Source: random.html     Page updated: 9 Mar. 2011     ArDean Leith


© Copyright Notice /       Enquiries: spider@wadsworth.org