About the Poulailler

The Poulailler is a Linux cluster at the Centre Universitaire d'Informatique (CUI) of the University of Geneva. It consists of:

  • 1 master node: AMD Athlon, 1600 MHz, 1.5 GB memory
  • 52 computation nodes: Intel Pentium 4, 1500 MHz, 512 MB memory
  • Network: switched Gigabit Ethernet (1 Gbit/s)
  • Shared filesystem: NFS
  • The occasional feather…

The activity of the cluster can be monitored with Ganglia. For questions, problems, et cetera, contact Nicolas Mayencourt, the system administrator of the Poulailler:

EMail: Nicolas [dot] Mayencourt [at] cui [dot] unige [dot] ch
Phone: +41 (0)22 379 0198

If any information on this page is incorrect or outdated, contact the system administrator or try to find someone in the SPC group who knows what to do. And update the webpage while you're at it…

Usage

All users should log in and compile their code on node poule00. The nodes poule01-poule10 may be used for any purpose, such as launching parallel jobs with mpirun or distributed compiling. The rest of the cluster, poule11-poule52, is reserved for jobs started with PBS. PBS jobs should be launched from poule00, the PBS server. See the following sections for more details. Users should not log in to, or run programs on, the computation nodes reserved for PBS.
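
For example, a typical session might look like the sketch below: set up the environment, compile on poule00, run a short interactive test on the free nodes, and submit a production run through PBS. This is only an illustration - the names `myprogram.c', `input.txt' and `job.sh' are placeholders, and each command is explained in the sections that follow:

[user@poule00 ~] source /opt/env/mpich-gcc41.csh
[user@poule00 ~] mpicc -o myprogram myprogram.c
[user@poule00 ~] mpirun -map poule01:poule02 myprogram input.txt
[user@poule00 ~] qsub -l nodes=6 -q default job.sh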

Software

All commands like `setenv', `source' or `export' can be placed in your ~/.cshrc, ~/.login, ~/.bashrc or ~/.profile; refer to the shell documentation. All commands should be executed on poule00. The C-shell is the default shell on poule00.

Note: It is advised to check the contents of ~/.cshrc, ~/.login, ~/.bashrc and ~/.profile and to remove anything that might interfere with the instructions given on this page, such as manually setting the path to include an alternative compiler or the library path for Matlab.

MPI

There are several compilers available on the Poulailler, and for each of them a corresponding version of MPI is installed. To use a particular compiler and MPI version, you need to `source' the script corresponding to the environment of your choice.

Setting up the environment

The recommended environment is the GNU compiler version 4.1; see the section on compilers. To set up the environment, you need to source the appropriate script. It is advised to place these commands in your ~/.cshrc and ~/.bashrc!

In a C-Shell:

[user@poule00 ~] source /opt/env/mpich-gcc41.csh

For BASH:

user@poule00:~$ source /opt/env/mpich-gcc41.sh
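
If you follow the advice above, the corresponding lines in your startup files could look like this minimal sketch (shown for the recommended GNU 4.1 environment):

In ~/.cshrc (C-shell):

source /opt/env/mpich-gcc41.csh

In ~/.bashrc (BASH):

source /opt/env/mpich-gcc41.sh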

Here is an overview of scripts for C-Shell:

/opt/env/mpich-intel.csh Intel compiler 7.1, Mpich 1.2.5..12
/opt/env/mpich-gcc41.csh GNU compiler 4.1.0, Mpich 1.2.7
/opt/env/mpich-gcc4.csh GNU compiler 4.0.2, Mpich 1.2.7
/opt/env/mpich-gcc.csh GNU compiler 3.4.2, Mpich 1.2.6..13b

The scripts for BASH have the extension `.sh' instead of `.csh'.

Note: The MPI scripts in `/unige/util/env/' are old - do not use them anymore.

Using the environment

After the environment has been set up, use `mpicc' as your compiler for C code and `mpiCC' for C++. All MPI-related programs, i.e. `mpicc', `mpiCC' and `mpirun', should be used without including full paths to the binaries in `/opt/'. So, do NOT do:

[user@poule00 ~] /opt/mpich-1.2.6..13b-gcc/bin/mpicc ..... # WRONG !
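
Instead, simply call the commands by name and let the environment script provide the correct PATH. For example, to compile a C program (`myprogram.c' is a placeholder name used only for illustration):

[user@poule00 ~] mpicc -O3 -o myprogram myprogram.c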

MPI and C++

There is a conflict between MPI and C++: the constants `SEEK_SET', `SEEK_CUR' and `SEEK_END' are defined in both standards. This is considered a bug in the MPI standard. It is recommended not to use these constants, for instance by replacing `fseek()' with `fgetpos()'.

Compilers

The GNU compiler

The default compiler on the system is the GNU compiler version 3.3.1, which should not be used for MPI programs. For MPI, versions 3.4.2, 4.0.2 and 4.1.0 of the GNU compiler are installed; see the section on setting up the environment for MPI. The version 4.1 compiler should be faster.

To use the GCC compiler without MPICH, the following commands need to be used. In a C-shell:

[user@poule00 ~] source /opt/env/gcc41.csh

In BASH:

user@poule00:~$ source /opt/env/gcc41.sh

It is advised to place these commands in ~/.cshrc and ~/.bashrc, respectively.

The following compiler flags might be interesting:

-O2 or -O3 Enable compiler optimizations.
-march=pentium4 Generate code exclusively for the type of processor in the computation nodes, enabling SSE2 instructions.
-funroll-loops Unroll loops if the number of iterations can be determined at compile-time.
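
As an illustration, a compile command combining these flags (with `myprogram.c' as a placeholder source file) could look like:

[user@poule00 ~] gcc -O3 -march=pentium4 -funroll-loops -o myprogram myprogram.c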

The Intel compiler

The Intel compiler version 7.1, which is installed on the Poulailler, probably produces the fastest code. However, it is not fully compliant with the C++ standard and may fail to compile valid source code. The Intel compiler can be used after executing the following command when using a C-shell:

[user@poule00 ~] source /unige/util/env/intel.csh

In BASH:

user@poule00:~$ source /unige/util/env/intel.sh

The following compiler flags might be interesting:

-ipo Enable interprocedural optimizations.
-xW Generate code exclusively for the type of processor in the computation nodes.
-fno-alias Assume no aliasing in the program. Use with caution - only if you know what you are doing; your program might show unexpected behavior otherwise.
-fno-fnalias Assume aliasing in the program, but not within functions. Use with caution - only if you know what you are doing; your program might show unexpected behavior otherwise.
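
As an illustration, a compile command using some of these flags (with `myprogram.c' as a placeholder source file) could look like:

[user@poule00 ~] icc -O3 -xW -ipo -o myprogram myprogram.c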

For more information, execute `icc -help | less'.

Matlab

First, source the appropriate script. In a C-Shell:

[user@poule00 ~] source /unige/util/env/matlab_7.csh

In BASH:

user@poule00:~$ source /unige/util/env/matlab_7.sh

To use the graphical environment, log in to poule00 using `ssh -X poule00' or `ssh -Y poule00' and launch `matlab' normally. To avoid the graphical interface, start Matlab with `matlab -nodesktop'.
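
To run a Matlab script non-interactively, something like the following should work (`myscript.m' is a placeholder for a script in the current directory):

[user@poule00 ~] matlab -nodesktop < myscript.m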

Mpirun

If you do not source the appropriate script to set up the environment in your ~/.cshrc or ~/.bashrc, running programs with `mpirun' will most likely fail, because the environment is not properly set up on the nodes.

To launch an mpi-capable program, use the command mpirun. Example:

[user@poule00 ~] mpirun -map poule01:poule02:poule03 myprogram input.txt

In this example, a program named `myprogram' is started on nodes poule01, poule02 and poule03. `input.txt' is a command-line parameter to `myprogram'.

Do not launch any jobs manually on poule00 or on poule11-poule52; this will interfere with other people's programs.

Please clean up after your jobs are finished. For various reasons, jobs are often not properly terminated on all nodes. Copy the following script to a file `killscript.csh' on poule00:

#!/bin/csh

# Kill all of your remaining processes on nodes poule01-poule10.
# kill -9 -1 sends SIGKILL to every process you own on that node.
foreach a (`seq -w 1 10`)
  ssh poule${a} kill -9 -1
end

Make the script executable with `chmod 755 killscript.csh'. Run it with the command `./killscript.csh' on poule00. It will kill all your jobs on nodes poule01-poule10. You should see 10 lines like:

Connection to poule01 closed by remote host.

Portable Batch System

The Portable Batch System (PBS) is a system designed to run batch jobs from queues. The scheduler and the queues implement policies that allow coordinated use of the cluster's resources.

The configuration assigns nodes exclusively to one job at a time, for a given duration. At the end of the job, all processes of the user on the assigned nodes are killed. Jobs are never suspended and nodes are never shared between jobs, because this could lead to reduced performance and complications. The setup has been kept simple to make sure the system works correctly and is transparent to the users.

PBS script

It is assumed here that your environment is set up in your ~/.cshrc and/or ~/.bashrc. To run a job, a script is needed. Example:

#!/bin/sh
 
cd /program/working/directory
mpirun myprogram input.txt

In the rest of the text we suppose that the above script is called `job.sh'.

Another example:

#!/bin/sh
#PBS -l nodes=3
#PBS -q test
#PBS -l walltime=00:01:00
 
cd /program/working/directory
mpirun myprogram input.txt

This script contains directives for PBS, so they don't need to be specified on the command line when submitting a job. In the rest of the text we suppose that the above script is called `job_pbs.sh'.

If you do not set up your environment in ~/.cshrc and/or ~/.bashrc, the PBS script itself needs to source the environment script. The first example would then look like:

#!/bin/sh
 
source /opt/env/mpich-gcc41.sh
 
cd /program/working/directory
mpirun myprogram input.txt

Submitting a job

To run the program, one must submit a job to PBS with the command `qsub'. Several parameters must be taken into account: the number of required nodes, the runtime and the job queue. If your job exceeds its specified runtime, it will be killed!

Before you submit a job, set up the environment by sourcing the proper script, just as you did to compile your program.

Example 1

qsub -l nodes=3 -q test -l walltime=00:01:00 job.sh

This will submit the job in script `job.sh' to queue `test', to run on 3 nodes with a runtime of up to 1 minute.

Example 2

qsub -l nodes=6 -q default -l walltime=01:00:00 job.sh

This will submit the job in script `job.sh' to queue `default', to run on 6 nodes with a runtime of up to 1 hour.

Example 3

qsub -l nodes=6 -l walltime=01:00:00 job.sh

Equivalent to the previous example - `default' is the default queue.

Example 4

qsub job.sh

Submits `job.sh' to the default queue, using the default number of nodes and the default runtime.

Example 5

qsub job_pbs.sh

Submits `job_pbs.sh' to the test queue, using 3 nodes and running up to 1 minute, as specified by the PBS directives in the script.

Example 6

qsub -l nodes=4 job_pbs.sh

Submits `job_pbs.sh' to the test queue, running up to 1 minute, as specified by the PBS directives in the script. The number of nodes is 4, not 3, because the command-line parameter overrides the PBS directive in the script.

Example 7

qsub -l nodes=poule11+poule22+poule33 job.sh

Submits `job.sh' to the default queue, with the default runtime, but specifically requesting nodes poule11, poule22 and poule33.

Queues

There are two queues: `default' and `test'.

Queue `default'

The default queue is meant for all serious computations. The following defaults apply to all jobs:

  • default number of nodes: 1
  • default runtime: 10 days (240:00:00)

The following limits apply to all jobs:

  • maximal number of nodes per job: 32
  • maximal runtime: 21 days (504:00:00)

There is an additional limit for queue `default': the maximal number of nodes for all jobs combined in this queue is 38.

Although the limits allow you to use 32 nodes for up to 3 weeks, please consider that the cluster is a resource to be shared with colleagues. For jobs on more than 16 nodes, please satisfy at least one of the following conditions:

  • Limit walltime to 4 hours.
  • Send an email to all other users to ask if they mind if you hog the cluster. Launch only when no objections are voiced within a day. Compromise otherwise.

This should allow people to run benchmarks and make full use of the cluster when it is idle without leading to conflicts.
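
For instance, a large benchmark run satisfying the first condition could be submitted as:

qsub -l nodes=32 -q default -l walltime=04:00:00 job.sh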

Queue `test'

The test queue is meant to test programs during development. The following defaults apply:

  • default number of nodes per job: 1
  • default runtime: 5 minutes (00:05:00)

The following limit applies:

  • maximal runtime: 5 minutes (00:05:00)

Because only 38 nodes can be used by the default queue while 42 nodes are available for PBS, there are always at least 4 nodes available to the test queue. Users can take advantage of these by limiting their jobs to at most 4 nodes when submitting to the test queue; such jobs will then be scheduled very quickly. Jobs using more than 4 nodes can also be submitted to the test queue, but they must then wait for the requested resources to become available.
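
For example, a quick test run on these nodes could be submitted as:

qsub -l nodes=4 -q test job.sh

The walltime then defaults to the 5-minute limit of the test queue.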

How to determine the right parameters

The performance of the cluster is directly influenced by the performance of your programs. Besides that, the ability of the scheduler to realize high throughput and short turnaround times largely depends on the accuracy of the specified runtimes. The scalability and efficiency of your jobs therefore affect not only you but also the other users of the cluster.

Gather information

You can use the test queue to gain insight into the performance and scalability of your program, which should help you determine the `right' number of nodes for a job and estimate the corresponding runtime.

It is possible to observe the behaviour of your program at runtime. Use `qstat -n' to find out on which nodes it is running. Then use Ganglia to observe, for example, the CPU usage (metric `cpu_user', sorted `by hostname') of your program. If all CPUs appear to be largely idle, the performance bottleneck of your job is not computation. If only some nodes are idle, your program is not well balanced and you should consider addressing this issue instead of using more resources.

Number of nodes

In theory, increasing the number of nodes should make your job run faster and decrease the computation time. However, there are numerous reasons why increasing the number of nodes might not be a good idea. Hence, the `right' number of nodes is not necessarily `as many as possible'.

The `right' number of nodes is the smallest number that offers a noticeable advantage over using fewer nodes.

If you have several jobs, it is more efficient to launch them at the same time on a few nodes each than sequentially on many nodes - scalability is generally less than linear.

The following is a list of reasons to limit the number of nodes you want to use. Some can be explained by the fact that scalability is generally less than linear.

  • The Poulailler is a shared resource. Consider the needs of your colleagues and respect the rules described for submitting jobs.
  • Even if your program scales well, there might be a number of nodes beyond which computation is no longer the bottleneck. If communication or I/O becomes the bottleneck, adding more nodes will generally not reduce the runtime.
  • Increasing the number of nodes makes it harder to schedule your job in the near future.
  • The probability that your job is terminated prematurely because of a dying node increases with the number of nodes.

Runtime

The runtime of a job is an important parameter. Although each queue has a default and a limit for the runtime of its jobs, you should specify a value yourself. If you do not, the queue will use the default runtime for your job, with possibly disastrous results.

You should minimize the specified runtime without making it too short: if your job exceeds its specified runtime, it will be killed! On the other hand, you will be rewarded for specifying short runtimes, as the scheduler might be able to start your job earlier. If you do not specify a runtime, check whether the default is sufficient. Remember to add some extra time to your estimate.
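
For example, if test runs indicate that a job needs roughly two hours on 6 nodes, a submission with some safety margin could look like:

qsub -l nodes=6 -q default -l walltime=02:30:00 job.sh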

To reduce your runtime, try out different compilers and compiler options to create faster programs and write your programs with performance and scalability in mind.

Useful PBS commands

These commands should be used on poule00. To run `xpbs' and `xpbsmon', log in using `ssh -X poule00' or `ssh -Y poule00'.

qstat Show the status of the jobs in the queues, such as which jobs are running or queued, who owns them and the corresponding job-ids.
qstat -a Show more information of the status of the jobs in the queues.
watch qstat -a Like `qstat -a', but frequently updated.
qstat -an Show nodes assigned to jobs in addition to `qstat -a'.
qstat -f <job-id> Show detailed information on the specified job.
qstat -Q Display information about the configuration of the queues.
qdel <job-id> Remove the job from the queue. If the job is running, it will be killed.
qalter -l … <job-id> Alter the resources requested for a queued job.
xpbs Graphical user interface for PBS commands.
xpbsmon Graphical user interface to monitor the PBS nodes.
showstart <job-id> Let MAUI show the estimated start time of a job.
showq Let MAUI show queue status.
diagnose Diagnose various problems with queues, the scheduling of jobs, et cetera in MAUI.
checkjob <job-id> Let MAUI display information about a job.
showbf Let MAUI show the backfill window.
showres Let MAUI show the reservations.

Further notes

Hints

If you have big jobs, it is advised to write your program such that it can continue where it left off in case of a crash or hardware failure.

PBS setup

The `default' queue acts as one big job pool. This `one pool' strategy is an alternative to a `multiple queue' strategy and has two advantages. First, the simple setup is easy to use. Second, the purpose of a `multiple queue' system is to assign priorities to jobs with different properties; however, priorities should be determined by the scheduler to optimize for throughput and turnaround time, and setting them manually generally has unpredictable effects and degrades performance.

Network

There is only one network, which is used for all network traffic: MPI communication, interactive logins and shared-filesystem access through NFS. This might affect MPI performance and therefore interfere with benchmarks of MPI programs. Since this is a hardware limitation, nothing can be done about it.

  • The bigger sister of the Poulailler is the Myri cluster.
  • The cluster Pleiades at the EPFL has a page with instructions.
  • For a better understanding of PBS, please consult the manual.

Credits

The policies implemented by the cluster configuration are based on recommendations and analyses by Vincent Keller, Bernhard Sonderegger and Jonas Latt. The specifications based on this information were created by Fokko Beekhof and implemented by Nicolas Mayencourt.

Page created by F.P. Beekhof, 10 February 2006.