| Feature | Available |
|---|---|
| Number of nodes | 16 |
| Number of cores | 104 |
| RAM | 1 GB/core, total 104 GB |
| Interconnect | Gigabit ethernet |
| Operating system | CentOS 5, Rocks 5 |
| Compilers | gcc 4.1.2, gfortran 4.1.2, pgi 7.2-2 |
| MPI | MPICH2, OpenMPI |
The compute nodes are divided into two groups (see the file /opt/torque/server_priv/nodes):
| Group | Nodes | Cores/node | Total cores | CPU |
|---|---|---|---|---|
| nash | c0-0 to c0-4 | 4 | 20 | Dual-core AMD Opteron 2220 @ 1 GHz |
| hardy | c0-5 to c0-14 | 8 | 80 | Quad-core AMD Opteron 2352 @ 1.05 GHz |
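Torque's nodes file lists one compute node per line, giving its core count as np=N followed by any properties, such as the group name. The sketch below assumes entries of the form `c0-5 np=8 hardy` (the actual contents of /opt/torque/server_priv/nodes are not reproduced here). It totals the cores in each group with awk, which gives a quick cross-check against the table above. The function name `cores_per_group` is hypothetical.

```shell
#!/bin/bash
# Sum the cores (np=N) per node property in a Torque nodes file.
# Assumes lines like "c0-5 np=8 hardy"; lines starting with # are comments.
cores_per_group() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }           # skip comments and blank lines
        {
            np = 1                               # Torque default when np is not given
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^np=/) { np = substr($i, 4) + 0 }
            }
            for (i = 2; i <= NF; i++) {
                if ($i !~ /=/) { total[$i] += np }   # fields without = are properties
            }
        }
        END { for (g in total) print g, total[g] }
    ' "${1:-/opt/torque/server_priv/nodes}" | sort
}
```

If the file matches the table, running `cores_per_group` should print `hardy 80` and `nash 20`.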
PBS is a batch system that manages the parallel applications submitted by users; on this cluster it uses Maui as the scheduler. Jobs are submitted to PBS as a script; example scripts are given below for OpenMPI and MPICH2. If the script is called famosa.pbs, submit the job with
$ qsub famosa.pbs
Here are some parameters that can be given in a PBS script file:
Once a job is submitted, you can check its status using qstat:
[praveen@master piaggio_pso]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25.master                 20080919_3       roms            22:58:50 R default
28.master                 FAMOSA           praveen         00:11:11 R default
For more detailed information, use qstat -f (all jobs) or qstat -f <jobid> (a single job).
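When many jobs are queued it can help to filter the qstat listing. As a minimal sketch, assuming the default column layout shown above (job id, name, user, time used, state, queue), the hypothetical function `jobs_of` below prints the ids of one user's jobs in a given state. Torque's own `qstat -u <user>` is an alternative when only filtering by user is needed.

```shell
#!/bin/bash
# Print the job ids of USER's jobs in STATE (e.g. R = running, Q = queued)
# from default-format qstat output read on stdin. The two header lines are
# skipped; columns are: Job id, Name, User, Time Use, S, Queue.
jobs_of() {
    local user=$1 state=${2:-R}
    awk -v u="$user" -v s="$state" 'NR > 2 && $3 == u && $5 == s { print $1 }'
}
```

For example, `qstat | jobs_of praveen R` lists praveen's running jobs; with the output above it would give 28.master.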
To delete a running job, use
$ qdel <jobid>
If the job is not killed by the above command, then force it using
$ qdel -p <jobid>
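The two-step deletion can be wrapped in a small function. This is a hypothetical helper, not part of Torque. It assumes that `qstat <jobid>` exits with a non-zero status once the job is gone. The commands are held in the variables QDEL and QSTAT only so that the logic can be tried without a PBS server.

```shell
#!/bin/bash
# kill_job JOBID [DELAY]: try a normal qdel first; if the job is still known
# to the server after DELAY seconds, purge it with qdel -p.
QDEL=${QDEL:-qdel}
QSTAT=${QSTAT:-qstat}
kill_job() {
    local id=$1 delay=${2:-30}
    "$QDEL" "$id"
    sleep "$delay"
    # Assumption: qstat returns non-zero when the job id is unknown
    if "$QSTAT" "$id" > /dev/null 2>&1; then
        echo "job $id still present, forcing removal" >&2
        "$QDEL" -p "$id"
    fi
}
```

Usage: `kill_job 28.master`.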
Note: qpeek did not work with the Torque installed by Rocks 5; it reported that there was no file in /opt/torque/spool. It works after commenting out line 142 of /opt/torque/bin/qpeek.
OpenMPI was compiled with the following configure options:
./configure --with-tm=/opt/torque --prefix=/opt/openmpi-1.2.7 \
--enable-prefix-by-default --enable-static
After compiling, use ompi_info to check that all required features are enabled. In particular, to verify that Torque support is built in, run
[praveen@master ]$ /opt/openmpi-1.2.7/bin/ompi_info |grep tm
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.7)
MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
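The same check can be scripted, for example at the top of a job script, so that an OpenMPI build without Torque support is caught early. This is a minimal sketch that assumes the `MCA ras: tm` and `MCA pls: tm` lines shown above are the ones indicating Torque support for this OpenMPI version. The function name `has_tm_support` is hypothetical.

```shell
#!/bin/bash
# Succeeds only if the ompi_info output on stdin lists both the Torque (tm)
# resource allocation (ras) and process launch (pls) components.
has_tm_support() {
    local info
    info=$(cat)
    grep -q 'MCA ras: tm' <<< "$info" && grep -q 'MCA pls: tm' <<< "$info"
}
```

Usage: `/opt/openmpi-1.2.7/bin/ompi_info | has_tm_support || echo "OpenMPI lacks Torque support" >&2`.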
The following is an example PBS script for use with openmpi.
famosa.pbs
#PBS -N "rae2822"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=48:00:00"
#PBS -j oe
#PBS -o famosa.log
#PBS -m e
export OPENMPI=/opt/openmpi-1.2.7
export PATH=$OPENMPI/bin:$PATH
export LD_LIBRARY_PATH=$OPENMPI/lib
cd $PBS_O_WORKDIR
mpirun $HOME/src/famosa/build/bin/Famosa_mpi
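MPICH2 is installed by Rocks in /opt/mpich2/gnu and uses gfortran as the Fortran compiler. The example job script below starts an mpd daemon on each allocated node, runs the program with mpirun, and then shuts the daemons down.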
famosa.pbs
#PBS -N "FAMOSA"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=00:10:00"
#PBS -j oe
#PBS -o "famosa.log"
#PBS -m e
export LD_LIBRARY_PATH=/opt/mpich2/gnu/lib
export PATH=/opt/mpich2/gnu/bin:$PATH
# go to working directory
cd $PBS_O_WORKDIR
# total number of MPI processes (one line per core in $PBS_NODEFILE)
N_ALL=`cat $PBS_NODEFILE | wc -l`
# number of distinct nodes
N_UNI=`sort -u < $PBS_NODEFILE | wc -l`
cp $PBS_NODEFILE ./nodes_all.txt
sort -u < $PBS_NODEFILE > nodes_unique.txt
# run one mpd daemon on each node and give the ring time to start
mpdboot -n $N_UNI -f nodes_unique.txt
sleep 10
mpirun -n $N_ALL -machinefile nodes_all.txt ~/src/famosa/build/bin/Famosa_mpi
# shut down the mpd daemons
mpdallexit
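For a request such as nodes=5:hardy:ppn=6, Torque writes each allocated node into $PBS_NODEFILE once per requested core, so the file has 30 lines naming 5 distinct hosts. That is why the script above passes the unique list to mpdboot and the full list to mpirun. The sketch below reproduces this bookkeeping on a hand-made node file so it can be tried outside a job. The function name `count_slots` and the host names are illustrative.

```shell
#!/bin/bash
# Given a PBS-style node file (one line per allocated core), print the
# number of MPI processes and the number of distinct nodes, i.e. the
# N_ALL and N_UNI values computed in the MPICH2 script.
count_slots() {
    local nodefile=$1
    local n_all n_uni
    n_all=$(wc -l < "$nodefile")
    n_uni=$(sort -u < "$nodefile" | wc -l)
    # arithmetic expansion strips any padding that wc adds
    echo "$((n_all)) $((n_uni))"
}

# Example: a fake node file for nodes=5:hardy:ppn=6 (illustrative host names)
fake=$(mktemp)
for node in c0-5 c0-6 c0-7 c0-8 c0-9; do
    for core in 1 2 3 4 5 6; do echo "$node"; done
done > "$fake"
count_slots "$fake"   # prints "30 5"
rm -f "$fake"
```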
Alternatively, MPICH2 programs can be launched under PBS with the mpiexec installed in /opt/mpiexec, which avoids the mpdboot and mpdallexit steps. An example script:
famosa.pbs
#PBS -N "rae2822"
#PBS -l "nodes=5:hardy:ppn=6"
#PBS -l "walltime=48:00:00"
#PBS -j oe
#PBS -o famosa.log
#PBS -m e
cd $PBS_O_WORKDIR
/opt/mpiexec/bin/mpiexec --comm=pmi $HOME/src/famosa/build/bin/Famosa_mpi
cluster-fork can be used to execute a command on all nodes. For example, to see the list of processes for user praveen, run
cluster-fork ps -U praveen
To run some command only on a particular set of nodes, use
cluster-fork -n "c0-0 c0-1 c0-2 c0-3 c0-4" ps -U praveen
Another way is to give a numeric range, here selecting nodes c0-5 to c0-14:
cluster-fork --nodes="c0-%d:5-14" ps -U praveen
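The --nodes argument is a printf-style pattern followed by a numeric range, so "c0-%d:5-14" stands for c0-5 through c0-14, the hardy nodes in the table at the top. The hypothetical function `expand_nodes` below expands such a pattern in plain bash. It only illustrates the syntax and is not how cluster-fork itself is implemented; it handles a single pattern with one range.

```shell
#!/bin/bash
# expand_nodes PATTERN:FIRST-LAST, e.g. "c0-%d:5-14" -> c0-5 c0-6 ... c0-14
expand_nodes() {
    local spec=$1
    local pattern=${spec%%:*}     # text before the first ':', e.g. "c0-%d"
    local range=${spec#*:}        # text after it, e.g. "5-14"
    local first=${range%-*} last=${range#*-}
    local i out=()
    for ((i = first; i <= last; i++)); do
        out+=("$(printf "$pattern" "$i")")
    done
    echo "${out[*]}"
}
```

It can also build the quoted list for the -n form, e.g. `cluster-fork -n "$(expand_nodes 'c0-%d:0-4')" ps -U praveen`, which selects the same nash nodes as the earlier example.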
checkjob gives detailed information about a submitted job:
checkjob -v <JOBID>
where JOBID is given by qstat.
showq gives a concise summary of all jobs running or in the queue.
showscript will return the contents of the PBS script that you have submitted. The only argument is the job’s PBS jobid.
mjobctl can be used to suspend or resume a PBS job. See its help:
[praveen@master ]$ mjobctl --help
Usage: mjobctl [FLAGS]
  --about
  --configfile=<FILENAME>
  --format=<FORMAT>
  --help
  --host=<SERVERHOSTNAME>
  --keyfile=<FILENAME>
  --loglevel=<LOGLEVEL>
  --port=<SERVERPORT>
  --version
  -c <JOBID> // CANCEL
  -C <JOBID> // CHECKPOINT
  -h <JOBID> // HOLD
  -r <JOBID> // RESUME
  -R <JOBID> // REQUEUE
  -s <JOBID> // SUSPEND
  -S <JOBID> // SUBMIT
  -x <JOBID> // EXECUTE