Introduction to the Roth Lab Llama Cluster

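Here is a basic introduction to Llama, the Roth Lab Linux cluster. It covers what the cluster is, how to connect to it, where files live, how to run programs on the nodes (in batch and interactively), how to view results, how the queues work, and where to go with questions.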

What is the Llama Cluster?

The Llama Cluster is a collection of Intel-based personal computers, connected to one another by a network switch. Currently Llama consists of 40 CPUs in 20 dual-processor nodes (12 nodes at 3 GHz, 4 Intel Xeon nodes at 2.4 GHz, and 4 AMD Athlon nodes at 1.5 GHz), each running the Linux operating system.

The head node is called "llama.med.harvard.edu"; the other nodes are "dolly01" through "dolly20". Many users require Web/FTP access (e.g., to access databases), so this is available from all nodes.

Connecting to Llama

The cluster is reached by logging in to "llama.med.harvard.edu". For security reasons, the traditional means of login and file transfer, telnet and ftp, will not work for connections to Llama. You will need to obtain telnet-equivalent and ftp-equivalent client software that uses SSH (secure shell).

Purpose                           Program
Telnet equivalent for PC users    PuTTY
FTP equivalent for PC users       WinSCP
Telnet equivalent for Mac users   MacSSH
FTP equivalent for Mac users      F-Secure SSH (not free, but a demo is available)

These are only suggestions; other applications that support SSH2 may also work.

File locations and accessibility

All users should have a home directory on the head node (type 'echo $HOME' to see the path to this directory). This directory is NFS-mounted on all of the daughter nodes, so files in your home directory are accessible from everywhere within the cluster via the same path name.

A scratch directory exists on each of the daughter nodes. When your application requires large files or frequent file access, performance will be improved by placing these files in the scratch space. You may not know on which daughter node your application will be launched, so you may need to copy required files to all nodes simultaneously using the "rdist" command.
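As a rough sketch of the copy step, the loop below prints one rdist command per daughter node rather than running it, so you can inspect the commands first. The node names follow the dolly01 through dolly20 naming given earlier. The scratch path /scratch and the file name mydata.db are placeholders (the actual scratch location on Llama is not stated here), and the '-c' form of rdist is assumed to be supported by the installed version.

```shell
#!/bin/sh
# Print the names of the daughter nodes, dolly01 through dolly20.
dolly_nodes() {
    i=1
    while [ "$i" -le 20 ]; do
        printf 'dolly%02d\n' "$i"
        i=$((i + 1))
    done
}

# Print (do not run) an rdist command for each node.
# Usage: print_rdist_commands FILE SCRATCH_DIR
print_rdist_commands() {
    file=$1
    scratch_dir=$2
    for node in $(dolly_nodes); do
        echo "rdist -c $file $node:$scratch_dir"
    done
}

# Placeholder file name and scratch path: substitute your own.
print_rdist_commands mydata.db /scratch
```

Once the printed commands look right, they can be pasted into the shell or piped to 'sh' to perform the copies.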

Running programs on the nodes

Llama uses job queuing software called PBS (Portable Batch System). You submit your job to a queue using the "qsub" command (or "qqsub"; see the note below for details), and the PBS software sends your job to a daughter node with an open CPU. Standard output and standard error are returned in two separate files, with file names derived from the job ID.

NOTE: The 'qsub' command is part of PBS, but it is fairly primitive, which causes a number of problems. The 'qqsub' command was written at the Roth Lab to overcome those problems, and it is highly recommended that you use 'qqsub' in preference to 'qsub'. Throughout this guide we refer only to qqsub, since it is hard to conceive of circumstances where you would choose the more basic 'qsub'. The syntax is almost identical: qqsub accepts all the same arguments as qsub. For details (including a list of all the problems that qqsub helps you avoid), type 'man qqsub' at the shell prompt. For an example of how to submit a job to the queue, try submitting the job "sleeptest.sh".
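The contents of sleeptest.sh are not reproduced here, so the following is only a guess at what a minimal test job of that kind could look like: it reports the node it runs on, sleeps, and reports again. The script body and the 30-second default are assumptions. The qqsub and qstat invocations are shown as comments because they only work on the cluster.

```shell
#!/bin/sh
# Write a small stand-in test job (hypothetical contents, not the lab's own script).
cat > sleeptest.sh <<'EOF'
#!/bin/sh
# Sleep for the number of seconds given as $1 (default 30).
seconds=${1:-30}
echo "sleeptest started on $(uname -n)"
sleep "$seconds"
echo "sleeptest finished after $seconds seconds"
EOF
chmod +x sleeptest.sh

# On the cluster, you would then submit the job and watch it in the queue:
#   qqsub sleeptest.sh
#   qstat
```

Running './sleeptest.sh 1' on the head node first is a quick way to confirm that the script itself works before it goes through PBS.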

"qqsub" and "qstat" are pretty straightforward features of PBS. For more advanced tricks, see the PBS manual. You may use the login "RothLabHMS" and password "Llamarama".

If you do not like the defaults supplied by "qqsub", you can override them by specifying new values using embedded #PBS commands just as you would with 'qsub'. See 'man qsub' for details on how to do this.
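For illustration, a job script with embedded directives might look like the sketch below. The directives used (-N for the job name, -q for the queue, -j oe to merge standard error into standard output) are standard qsub options. Which defaults qqsub actually supplies is not listed here, so the values are examples only, and 'myjob' is a made-up name.

```shell
#!/bin/sh
# Write an example job script with embedded PBS directives.
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -N myjob
#PBS -q guest
#PBS -j oe
# Lines starting with "#PBS" are read at submission time and are
# plain comments to the shell, so this script also runs locally.
echo "job running on $(uname -n)"
EOF

# On the cluster, submit it with:
#   qqsub myjob.sh
```

Keeping the directives at the top of the script, before any executable line, matters because PBS stops looking for them once the commands begin.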

One last thing: there is a FAQ ('frequently asked questions') for Llama. You may want to check it out if you run into problems getting your program to run on Llama (most people do; that's where the FAQ came from!).

Interactive use of the nodes

You can start an interactive job by using the '-I' flag. (Interactive mode doesn't accept a program as an argument, since it starts a shell on the remote node.) You'll see some informational messages, and after a few seconds you'll be connected to one of the dollies. Here's an example, executed from the directory /home/fgibbons/cvs/ToolTime:
llama ToolTime> qqsub -I
Concurrent jobs running (including this one): 1 (limit: 18)
Establishing an interactive session on PBS.
This may take several seconds, depending on system load...

qsub: waiting for job 201541.llama.med.harvard.edu to start
qsub: job 201541.llama.med.harvard.edu ready

cd /d0/home/fgibbons/cvs/ToolTime

dolly04 ToolTime>
Any commands I type at this point are executed on dolly04, not on llama. This allows me to run interactive jobs (debugging springs to mind, and there are probably other examples) on a dolly, so I don't have to worry about hogging resources on the head node. It is also better than logging in directly to a node, because PBS keeps track of which nodes are free and sends you to one automatically. You'll notice that it's transparent: the session automatically changes into the directory from which I issued the qqsub command. To make this work properly, you need to add the following lines to your .bashrc file:
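# Act only inside a PBS job, where PBS_ENVIRONMENT is set.
if [ -n "$PBS_ENVIRONMENT" ]; then
  if [ "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ]; then
    # Show, then source, the file that qqsub wrote for this session.
    cat "$HOME/.qqsub_interactive"
    . "$HOME/.qqsub_interactive"
  fi
fi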
These lines merely check whether the current shell is running under PBS's interactive mode. If it is not, they do nothing, leaving regular PBS jobs and direct log-ins unaffected. If it *is* interactive, they execute the contents of the file .qqsub_interactive, which qqsub creates automatically in your home directory. They also show you what is being executed, via the 'cat' command (remove it if you don't like it). That is how the magic change of directory is accomplished. When you're done, you can quit the interactive session by typing 'exit', which terminates the job with this message:

qsub: job 201541.llama.med.harvard.edu completed
llama ToolTime>
You are encouraged to use this feature whenever you have programs to debug, or have some kind of interactive work to do (e.g., run a two-minute job, check results, change parameters, run again; check results, change params, run again; etc.). That way we can save the head node for its intended purpose: to control the other nodes. There is one caveat: leaving an interactive job like this running will tie up the node, since PBS allows only two concurrent jobs per node. Consideration for your fellow llama-users is probably the best motivator here: "do unto others..." Lastly, you can see which jobs are interactive (see Viewing results) by looking for the word 'Interactive' in the output of qstat.

Viewing results

As mentioned above, standard output and standard error for jobs launched using "qqsub" are returned in two separate files when the job is complete. Standard output and error file names are derived from the job ID, with standard error files ending in ".e###" and standard output ending in ".o###", where ### is your qqsub job number.

Normally PBS does not deliver the output/error files until the job has completed. "qcat" is a script that allows you to monitor the progress of your jobs while they are still running. If your job was assigned the ID number 456 when you submitted it to PBS, you would check on the output with 'qcat 456'. You can look at any errors with 'qcat -e 456'.
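Once a job has finished, its two result files can be picked out of a directory by their suffixes. The helper below is a small sketch of that lookup. The function name is made up, and it assumes the files were delivered to the current directory.

```shell
#!/bin/sh
# List the output (.oNNN) and error (.eNNN) files for a given job number.
# Usage: job_output_files 456
job_output_files() {
    job=$1
    for f in *.o"$job" *.e"$job"; do
        # Skip patterns that matched nothing.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}
```

For example, 'job_output_files 456' prints the names of any files ending in ".o456" or ".e456", and prints nothing if the job has not yet delivered its files.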

Handling queues on the Llama

There are two queues, called 'rothlab' and 'guest'. Members of the Roth Lab may submit to either queue; other users are restricted to the 'guest' queue. At present (March 2002), only four jobs may run at one time in the guest queue (not four per user, but four in total). The 'rothlab' queue has no limits beyond the number of nodes online. If all nodes are busy and there are jobs waiting in both queues, those in the 'rothlab' queue will be run first. The exact balance between the queues may shift from time to time to reflect the needs of the Roth Lab.

Type 'groups' at the UNIX prompt to see which UNIX groups you're a member of. By default, qqsub will submit the jobs of users who are in the 'rothlab' group to the 'rothlab' queue. It will submit the jobs of users who are NOT in that group to the 'guest' queue. (There is no 'guest' group.)

If you're in the 'rothlab' group and wish to submit your job to the 'guest' queue, add '-q guest' to the qqsub command line. Non-members of the 'rothlab' group may submit only to the 'guest' queue. Explicitly naming your default queue does no harm. An example should illustrate (attempting to submit the program 'myProg' to PBS):

For users in the 'rothlab' group:
qqsub myProg               No queue specified; submitted to the 'rothlab' (default) queue.
qqsub myProg -q rothlab    Default queue specified; submitted to the 'rothlab' queue, same as above.
qqsub myProg -q guest      Non-default queue; submitted to the 'guest' queue.

For users NOT in the 'rothlab' group:
qqsub myProg               No queue specified; submitted to the 'guest' (default) queue.
qqsub myProg -q guest      Default queue specified; submitted to the 'guest' queue, same as above.
qqsub myProg -q rothlab    Non-default queue; generates a 'qsub: Unauthorized Request' error message.
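
The default-queue rule described in this section can be written out as a few lines of shell. This is a sketch of the rule only, not qqsub's actual source. The function name is invented, and 'id -Gn' is used here to list the current user's groups.

```shell
#!/bin/sh
# Print the default queue for a space-separated list of group names:
# 'rothlab' if the list contains the rothlab group, otherwise 'guest'.
default_queue() {
    for g in $1; do
        if [ "$g" = "rothlab" ]; then
            echo rothlab
            return 0
        fi
    done
    echo guest
}

# Default queue for the current user.
default_queue "$(id -Gn)"
```

The group names are compared as whole words, so a group such as 'rothlab2' would not be mistaken for 'rothlab'.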

You can monitor the queues in a number of ways:
qstat -Q    Lists each queue, showing the number of jobs allowed ('Max'), queued ('Que'), and running ('Run').
qstat       Shows the status of each job; the rightmost column indicates the queue.

If you have any trouble using qqsub please let qqsub's maintainer know.

Llama etiquette

Where to go with questions

This server is supported by the West Quad Computing Group. Please visit wqcg.med.harvard.edu to request support or report problems.

Enjoy!

This page was developed by Frank Gibbons and last modified by Fritz Roth on 10 October 2006.