Installing SPIDER's PubSub System for Distributed Processing

Introduction

With PubSub, SPIDER procedures can be run in parallel across a distributed group of computers or within a single cluster. The user places a SPIDER job in a shared queue, and each subscriber machine can take jobs from that queue. Each subscriber machine can specify when it will take jobs and how many jobs it can run at a time. If the machines vary greatly in processing power, it is best to partition the SPIDER work into jobs that each take a moderate length of time (e.g. 20 to 60 minutes), so that the subscription process works most efficiently.


Requirements for PubSub

  1. Systems must have Perl and standard POSIX utilities. If Perl is not located at /usr/local/bin/perl, you will have to place a link there or alter the first line of each Perl script.
  2. Systems must have disks cross-mounted so that they are accessible from all processors using the same path, e.g. /net/location/.
  3. Systems must be able to use rsh to run operations remotely on all computers in the cluster.
  4. The file which will be used for the 'publisher queue' must have read/write permissions suitable for all users.
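
The requirements above can be spot-checked before installation. The following is a minimal sh sketch and is not part of the PubSub distribution: all function names are hypothetical, the queue check only covers the current user, and the paths in the usage comment are examples.

```shell
#!/bin/sh
# Hypothetical helpers for spot-checking the PubSub requirements.
# They are NOT part of the PubSub distribution.

# Succeeds if an executable exists at the path the Perl scripts expect.
perl_in_place() {
    [ -x "${1:-/usr/local/bin/perl}" ]
}

# Succeeds if the cross-mounted directory is visible under the given path.
mount_visible() {
    [ -d "$1" ]
}

# Succeeds if the current user can both read and write the queue file.
que_usable() {
    [ -r "$1" ] && [ -w "$1" ]
}

# Succeeds if rsh can connect to the given host. rsh reports connection
# failures only, not the exit status of the remote command.
rsh_reachable() {
    rsh -n "$1" true >/dev/null 2>&1
}

# Runs one check and prints "LABEL: ok" or "LABEL: FAILED".
report() {
    label=$1
    shift
    if "$@"; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
    fi
}

# check_requirements PERL_PATH SHARED_DIR QUEUE_FILE
check_requirements() {
    report "perl"  perl_in_place "$1"
    report "mount" mount_visible "$2"
    report "queue" que_usable    "$3"
}

# Example call (paths are site specific):
#   check_requirements /usr/local/bin/perl /net/location /net/location/pubsub.que
#   report "rsh to bali" rsh_reachable bali
```

Run the checks on every machine that will subscribe, since a path that is visible on the master node may not be mounted elsewhere.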

PubSub Software Installation

  1. Create the environment variables PUBSUB_DIR, for the location of the PubSub installation directory, and PUBSUB_MASTER, for the name of the host where the PubSub master is run. These environment variables must be set in each user's startup file (e.g. in .cshrc for csh users):
    setenv   PUBSUB_DIR   /net/bali/usr1/spider/pubsub
    setenv   PUBSUB_MASTER   bali

  2. Perform the following steps while logged in on PUBSUB_MASTER as a member of the group that plans to use PubSub, NOT as root.

  3. Create the PubSub directory and change to it, e.g.
    mkdir   $PUBSUB_DIR ;   cd $PUBSUB_DIR

  4. Copy the PubSub files, normally distributed in SPIDER_DIR/pubsub, to your PubSub directory, e.g.
    cp   SPIDER_DIR/pubsub/*   $PUBSUB_DIR

  5. Edit pubsub.permit to set machine-specific permissions, e.g.
    xedit pubsub.permit
    Each entry currently contains: machine name, limit on the number of simultaneous jobs, permitted run days, permitted start time, permitted end time, queue check frequency (seconds), and comments. The machine names listed here determine where jobs can be run.

  6. Create an empty queue file, e.g.
    touch pubsub.que

  7. Tune NFS (if the master node responds slowly).
    If your master and compute nodes will be accessing lots of data from an NFS-mounted disk, you may want to speed up access by altering the /etc/fstab entry for the data disks to increase the read and write buffer sizes, e.g.:
    tonga2:/usr6 /usr6 nfs rsize=8192,wsize=8192 0 0
    See: NFS tuning for discussion.

    You may also want to increase the number of NFS threads on the master node and on any other machines where the data is located, using the following command, where nproc is the number of threads:
    /usr/sbin/rpc.nfsd nproc
    This should be placed in your init file so that it is preserved on reboot. On Red Hat GNU/Linux this is set in /etc/rc.d/init.d/nfs. Both changes require root access to the machine. See your Unix manual pages for fstab and nfsd.
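
Installation steps 3 to 6 can be collected into one small script. The sketch below uses sh syntax rather than csh. The function name install_pubsub and its arguments are hypothetical, and the chmod mode is only an assumption about what "suitable for all users" means at your site.

```shell
#!/bin/sh
# Hypothetical wrapper around installation steps 3 to 6. Run it on
# PUBSUB_MASTER as a member of the PubSub group, not as root.

# install_pubsub SOURCE_DIR TARGET_DIR
#   SOURCE_DIR: directory holding the distributed PubSub files
#   TARGET_DIR: the PubSub installation directory ($PUBSUB_DIR)
install_pubsub() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "no such source directory: $src" >&2
        return 1
    fi

    mkdir -p "$dest" || return 1            # step 3: create the directory
    cp "$src"/* "$dest"/ || return 1        # step 4: copy the PubSub files
    # step 5 (editing pubsub.permit) is left to the administrator
    touch "$dest/pubsub.que" || return 1    # step 6: empty queue file
    chmod a+rw "$dest/pubsub.que"           # assumption: world read/write
}

# Usage for sh users, with the variables from step 1 set via export:
#   export PUBSUB_DIR=/net/bali/usr1/spider/pubsub
#   export PUBSUB_MASTER=bali
#   install_pubsub SPIDER_DIR/pubsub "$PUBSUB_DIR"
```

Because touch does not truncate an existing file, rerunning the function leaves any jobs already in pubsub.que in place.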

Starting PubSub

  1. Change to your PubSub directory, e.g.
    cd   $PUBSUB_DIR
  2. Run startsub.perl to start a subscription process on the master node. If this process dies, you will have to restart it in the same way, e.g.
    startsub.perl
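
Since a dead subscriber must be restarted by hand, it can help to check whether one is still running. The sketch below is not part of PubSub. It assumes that the subscriber appears in the process list with "subscribe.perl" in its command line, which may not hold at every site.

```shell
#!/bin/sh
# Hypothetical check, NOT part of PubSub. Assumes the subscriber shows
# up in the process list with "subscribe.perl" in its command line.

# Reads a process listing on standard input and succeeds if a
# subscriber is listed.
subscriber_listed() {
    grep 'subscribe\.perl' >/dev/null
}

# Prints a one-line status for the local machine.
subscriber_status() {
    if ps -e -o args | subscriber_listed; then
        echo "subscriber running"
    else
        echo "subscriber not running: cd \$PUBSUB_DIR and run startsub.perl"
    fi
}

# Usage on the master node:
#   subscriber_status
```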

Killing PubSub Subscription Process

  1. Run killsub, e.g.
    killsub

Running SPIDER jobs using PubSub

    Instructions are available for use of PubSub.


PubSub components

Note: You do not need to understand this section to use PubSub. Perl code that may need to be altered because it is site-specific is marked with %%%% in the source files.

startsub.perl
Starts the subscriber process.

subscribe.perl
The subscriber process. It watches the publisher queue for new jobs. If a job appears, the subscriber looks for a suitable machine. When a machine is found, the subscriber signals the publishing process where to run the job. The subscriber checks the publisher queue at the specified frequency until it dies or is 'killed'.

publish.perl
Submits a job to the publisher queue. System flock is used internally to avoid update collisions.
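
The locking idea can be illustrated outside Perl. The sketch below is not the publish.perl source. It uses the flock(1) utility from util-linux, which takes the same kind of lock, and the function name append_locked and the job line contents are hypothetical.

```shell
#!/bin/sh
# Illustration of lock-protected appends. This is NOT the publish.perl
# source. Requires the flock(1) utility from util-linux.

# append_locked FILE LINE
# Appends LINE to FILE while holding an exclusive lock on FILE, so two
# writers cannot interleave their updates.
append_locked() {
    file=$1
    line=$2
    if ! command -v flock >/dev/null 2>&1; then
        echo "flock(1) not found" >&2
        return 1
    fi
    (
        flock -x 9 || exit 1          # wait until the lock is ours
        printf '%s\n' "$line" >&9     # fd 9 is open in append mode
    ) 9>>"$file"                      # the lock is released when fd 9 closes
}

# Usage (the job line layout is a placeholder):
#   append_locked "$PUBSUB_DIR/pubsub.que" "some job line"
```

The lock is advisory, so it only protects against other programs that also take the lock before touching the file.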

delete.perl
Places job statistics in pubsub.log when a job is finished. System flock is used to avoid update collisions.

pubsub.permit
A single shared file containing machine-specific permissions. Each entry currently contains: machine name, limit on the number of simultaneous jobs, permitted run days, permitted start time, permitted end time, queue check frequency (seconds), and comments.
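
To show how such an entry could gate job placement, here is a small sh sketch. It is not the subscribe.perl logic. The line layout (whitespace-separated fields, times written as HHMM) is an assumption made purely for illustration, so consult your own pubsub.permit for the real layout. The function name may_run is hypothetical.

```shell
#!/bin/sh
# Hypothetical permit check, NOT the subscribe.perl logic. The layout
# below is an assumption made for illustration only:
#   machine  max_jobs  days  start  end  check_seconds  comment...
# with start and end written as HHMM.

# may_run PERMIT_LINE RUNNING_JOBS NOW_HHMM
# Succeeds if the job limit is not yet reached and NOW lies in the
# half-open window [start, end). The days field is ignored here, and a
# window that crosses midnight is not handled.
may_run() {
    printf '%s\n' "$1" | awk -v running="$2" -v now="$3" '{
        limit = $2 + 0
        start = $4 + 0
        stop  = $5 + 0
        ok = (running + 0 < limit) && (now + 0 >= start) && (now + 0 < stop)
        exit(ok ? 0 : 1)
    }'
}

# Usage with a made-up entry:
#   may_run "bali 2 mon-fri 0800 1800 30 main server" 1 0930 && echo "may run"
```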

pubsub.que
The publisher queue. This is a single shared file that is accessed by the subscriber process to obtain jobs. System flock is used to avoid update collisions. The job number becomes negative when a job is 'subscribed' to. Jobs are deleted from the queue when delete.perl runs.
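
The sign convention for job numbers can be sketched as follows. This is not the PubSub source. It assumes, for illustration only, that each line starts with the job number followed by a space, and the function names claim_first and pending_count are hypothetical.

```shell
#!/bin/sh
# Hypothetical queue handling, NOT the PubSub source. Assumes each
# line starts with the job number followed by a space.

# claim_first: reads queue lines on standard input, negates the first
# positive job number, and writes the updated queue to standard output.
# Fields on the changed line are rejoined with single spaces.
claim_first() {
    awk '!claimed && $1 + 0 > 0 { $1 = -$1; claimed = 1 } { print }'
}

# pending_count: counts the lines that have not been subscribed to yet.
pending_count() {
    awk '$1 + 0 > 0 { n++ } END { print n + 0 }'
}

# Usage on a copy of the queue:
#   claim_first < pubsub.que.copy
#   pending_count < pubsub.que.copy
```

A real update would have to hold the file lock for the whole read and rewrite of pubsub.que, otherwise two subscribers could claim the same job.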

killsub
Kills the PubSub subscriber process.

wherespi
Intended to report where SPIDER is currently running on all nodes. This script is currently specific to our installation.

pubsub.log
The PubSub log. This file is created in the user's directory to log job progress. System flock is used to avoid update collisions. The run time of each job is recorded here, along with the node name.


Source: spider/pubsub/pubsub_inst.html     Last page update: 9 Mar. 2005     ArDean Leith