
Information about TIFR-CAM cluster

Hardware and software

Feature            Available
Number of nodes    16
Number of cores    104
RAM                1 GB/core, total 104 GB
Interconnect       Gigabit ethernet
Operating system   CentOS 5, Rocks 5
Compilers          gcc 4.1.2, gfortran 4.1.2, pgi 7.2-2
MPI                MPICH2, OpenMPI

Torque/PBS

The compute nodes are divided into two groups; see the file /opt/torque/server_priv/nodes.

Group  Nodes          Cores/node  Total cores  CPU
nash   c0-0 to c0-4   4           20           Dual-core AMD Opteron 2220 @ 1 GHz
hardy  c0-5 to c0-14  8           80           Quad-core AMD Opteron 2352 @ 1.05 GHz

PBS is a batch system that manages the parallel jobs submitted by users. On this cluster, PBS uses Maui as the scheduler. Jobs are submitted to PBS using a script; examples are given below in the OpenMPI and MPICH2 sections. If the script is called famosa.pbs, you can submit the job to PBS using

$ qsub famosa.pbs

Here are some parameters that can be given in a PBS script file:

  • -N jobname (name the job “jobname”)
  • -q @nic-cluster.cc.umr.edu (the cluster address to send the job to)
  • -e errfile (redirect standard error to a file named errfile)
  • -o outfile (redirect standard output to a file named outfile)
  • -j oe (combine standard output and standard error)
  • -l walltime=N (request a walltime of N in the form hh:mm:ss)
  • -l cput=N (request N seconds of CPU time; N may also be given in the form hh:mm:ss)
  • -l mem=N[KMG][BW] (request a total of N {kilo|mega|giga}{bytes|words} of memory across all requested processors together)
  • -l nodes=N:ppn=M (request N nodes with M processors per node)
  • -m abe (mail the user when the job aborts, begins running, or ends)
  • -S shell (use shell instead of your login shell to interpret the batch script; must include a complete path)
  • -V (job inherits the full environment of the current shell, including $DISPLAY)
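
As a rough sketch of how several of these options combine, the snippet below writes a small job script into a scratch directory and counts its #PBS directive lines. The file name example.pbs, the job name, and the resource values are hypothetical and are not taken from the cluster configuration.

```shell
# Write a hypothetical job script into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/example.pbs" <<'EOF'
#!/bin/bash
#PBS -N example
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -o example.log
#PBS -m abe
#PBS -V

# Move to the directory from which qsub was run.
cd "$PBS_O_WORKDIR"
echo "Job $PBS_JOBID started on $(hostname)"
EOF

# Count the directive lines that qsub would read from the script.
n_directives=$(grep -c '^#PBS' "$workdir/example.pbs")
echo "example.pbs contains $n_directives PBS directives"
```

On the cluster, such a script would then be submitted with qsub example.pbs.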

Once a job is submitted, you can check its status using qstat:

[praveen@master piaggio_pso]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25.master                 20080919_3       roms            22:58:50 R default        
28.master                 FAMOSA           praveen         00:11:11 R default 
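
The listing above can also be filtered in a script. The following is a minimal sketch, assuming the default column layout shown here (job id, name, user, time use, state, queue); the helper name count_running is hypothetical. It counts the jobs in state R for a given user, using the sample listing as input.

```shell
# Count jobs in state R (running) for one user, reading qstat output on stdin.
# Assumes the default listing: two header lines, then one job per line with
# the user in column 3 and the state in column 5.
count_running() {
    awk -v user="$1" 'NR > 2 && $3 == user && $5 == "R" { n++ } END { print n + 0 }'
}

# Sample input copied from the listing above.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25.master                 20080919_3       roms            22:58:50 R default
28.master                 FAMOSA           praveen         00:11:11 R default'

running=$(printf '%s\n' "$sample" | count_running praveen)
echo "praveen has $running running job(s)"
```

On the cluster, the same function could be fed directly with qstat | count_running praveen.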

To get more detailed information, use qstat -f (all jobs) or qstat -f <jobid> (a single job).

To delete a running job, use

$ qdel <jobid>

If the job is not killed by the above command, then force it using

$ qdel -p <jobid>

Note: qpeek did not work when torque was installed from Rocks 5. It gave an error that there is no file in /opt/torque/spool. After commenting out line 142 in /opt/torque/bin/qpeek, it works.

OpenMPI

OpenMPI was compiled with the following configure options:

./configure --with-tm=/opt/torque --prefix=/opt/openmpi-1.2.7 \
                      --enable-prefix-by-default --enable-static

After compiling, check that all required features are enabled using ompi_info. In particular, to verify that torque support is built in, do

[praveen@master ]$ /opt/openmpi-1.2.7/bin/ompi_info |grep tm
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.7)
                 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
                 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)

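The following is an example PBS script for use with OpenMPI. Because OpenMPI was built with torque (tm) support, mpirun obtains the list of allocated nodes from PBS, so no process count or machine file is given.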

famosa.pbs

#PBS -N "rae2822"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=48:00:00"
#PBS -j oe
#PBS -o famosa.log
#PBS -m e

export OPENMPI=/opt/openmpi-1.2.7
export PATH=$OPENMPI/bin:$PATH
export LD_LIBRARY_PATH=$OPENMPI/lib
 
cd $PBS_O_WORKDIR
 
mpirun $HOME/src/famosa/build/bin/Famosa_mpi

MPICH2

MPICH2 is installed using Rocks in /opt/mpich2/gnu and uses gfortran as the Fortran compiler.

Using /opt/mpich2/gnu/bin/mpirun

famosa.pbs

#PBS -N "FAMOSA"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=00:10:00"
#PBS -j oe
#PBS -o "famosa.log"
#PBS -m e

export LD_LIBRARY_PATH=/opt/mpich2/gnu/lib
export PATH=/opt/mpich2/gnu/bin:$PATH
 
# go to working directory
cd $PBS_O_WORKDIR
 
# start the mpd daemon on all allocated nodes
N_ALL=`cat $PBS_NODEFILE | wc -l`
N_UNI=`sort -u < $PBS_NODEFILE | wc -l`
 
cp $PBS_NODEFILE  ./nodes_all.txt
sort -u < $PBS_NODEFILE > nodes_unique.txt
 
mpdboot -n $N_UNI -f nodes_unique.txt
sleep 10
mpirun -n $N_ALL -machinefile nodes_all.txt ~/src/famosa/build/bin/Famosa_mpi
mpdallexit
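
To illustrate what N_ALL and N_UNI hold in the script above, here is a small sketch that uses a made-up node file in place of $PBS_NODEFILE. It assumes that PBS lists each host once per allocated processor; the host names and the nodes=2:ppn=3 layout are only an example.

```shell
# Build a hypothetical node file such as PBS might provide for nodes=2:ppn=3:
# each host appears once per allocated processor.
nodefile=$(mktemp)
printf '%s\n' c0-5 c0-5 c0-5 c0-6 c0-6 c0-6 > "$nodefile"

# Total number of lines = number of MPI processes to start.
N_ALL=$(( $(wc -l < "$nodefile") ))
# Number of distinct hosts = number of mpd daemons to boot.
N_UNI=$(( $(sort -u "$nodefile" | wc -l) ))

echo "processes: $N_ALL, hosts: $N_UNI"
rm -f "$nodefile"
```

This is why mpdboot is given the unique host list while mpirun is given the full list.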

Using /opt/mpiexec/bin/mpiexec

Use mpiexec in /opt/mpiexec to launch MPICH2 programs together with PBS. An example script is given below.

famosa.pbs

#PBS -N "rae2822"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=48:00:00"
#PBS -j oe
#PBS -o famosa.log
#PBS -m e

cd $PBS_O_WORKDIR
 
/opt/mpiexec/bin/mpiexec --comm=pmi $HOME/src/famosa/build/bin/Famosa_mpi

Useful commands

cluster-fork

This command can be used to execute a command on all nodes. For example, to see the list of processes for user praveen, do

cluster-fork ps -U praveen

To run some command only on a particular set of nodes, use

cluster-fork -n "c0-0 c0-1 c0-2 c0-3 c0-4" ps -U praveen

Another way is to give a node name pattern with a numeric range:

cluster-fork --nodes="c0-%d:5-14" ps -U praveen
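
To make the range notation concrete, the sketch below expands a pattern of the form name-%d:first-last into explicit host names in plain shell. The function expand_nodes is a hypothetical helper that mimics the notation as used above; it is not part of Rocks and does not reproduce how cluster-fork itself parses the option.

```shell
# Expand a range such as "c0-%d:5-14" into one host name per line.
expand_nodes() {
    pattern=${1%%:*}     # printf-style name pattern, e.g. "c0-%d"
    range=${1##*:}       # numeric range, e.g. "5-14"
    first=${range%-*}    # lower bound, e.g. 5
    last=${range#*-}     # upper bound, e.g. 14
    i=$first
    while [ "$i" -le "$last" ]; do
        printf "$pattern\n" "$i"
        i=$((i + 1))
    done
}

# List the hosts that the pattern above refers to.
expand_nodes "c0-%d:5-14"
```

For the pattern above this lists the ten nodes c0-5 to c0-14, that is, the hardy group.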

checkjob

This command gives detailed information about a submitted job:

checkjob -v <JOBID>

where JOBID is given by qstat.

showq

showq gives a concise summary of all jobs running or in the queue.

showscript

showscript prints the contents of the PBS script that you have submitted. Its only argument is the job’s PBS jobid.

mjobctl

You can use this to suspend or resume a PBS job. See the help output:

[praveen@master ]$ mjobctl --help
Usage: mjobctl [FLAGS]
  --about
  --configfile=<FILENAME>
  --format=<FORMAT>
  --help
  --host=<SERVERHOSTNAME>
  --keyfile=<FILENAME>
  --loglevel=<LOGLEVEL>
  --port=<SERVERPORT>
  --version

  -c <JOBID> // CANCEL
  -C <JOBID> // CHECKPOINT
  -h <JOBID> // HOLD
  -r <JOBID> // RESUME
  -R <JOBID> // REQUEUE
  -s <JOBID> // SUSPEND
  -S <JOBID> // SUBMIT
  -x <JOBID> // EXECUTE
 
comp/tifr-cluster.txt · Last modified: 2009/07/02 09:22 by pc