This page also exists in French.

About the Myrinet

The Myrinet is a Linux cluster at the University of Geneva, managed by the Centre Universitaire d'Informatique (CUI). It consists of:

  • 1 master node: dual Xeon 2.4 GHz, 2 GB RAM
  • 32 computation nodes: Sun V60x, dual Intel Xeon 2.8 GHz, 2 GB RAM
  • Network: switched Gigabit Ethernet and Myrinet connections
  • Shared filesystem: NFS

The activity of the cluster can be monitored with ganglia. For questions, problems, etc., contact Nicolas Mayencourt, the system administrator of the Myrinet, by email or by phone:

Phone: +41 (0)22 379 0198

If any information on this page is incorrect or outdated, contact the system administrator or try to find someone who knows what to do. And update the webpage while you're at it…

A mailing list, myrinet, is available for announcements and discussions. Archives for this list are here.

Getting an account

To obtain a login on the Myrinet cluster, please send an email to Nicolas Mayencourt.

Usage

All users should log in and compile their code on node myri00. The nodes myri01-myri32 are reserved for batch jobs. Batch jobs should be launched from myri00, the TORQUE server. See the following sections for more details. Users should not log in to or run programs on the computation nodes.

Software

Commands like `setenv', `source' or `export' can be placed in your ~/.cshrc, ~/.login, ~/.bashrc or ~/.profile; refer to the shell documentation. All commands should be executed on myri00. The C shell is the default shell on myri00.

Note: It is advised to check the contents of ~/.cshrc, ~/.login, ~/.bashrc and ~/.profile and remove anything that might interfere with the instructions given on this page, such as manually setting the path to include an alternative compiler or the library path for Matlab.

MPI

There are several compilers available on the Myrinet; for each, a corresponding version of MPI is installed. To use a particular compiler and MPI version, you need to `source' a script corresponding to the environment of your choice.

Setting up the environment

The recommended environment is the GNU compiler version 4.1; see the section on compilers. To set up the environment, you need to source the appropriate script. It is advised to place these commands in your ~/.cshrc and ~/.bashrc!

In a C-Shell:

[user@myri00 ~] source /opt/env/mpich-gcc41.csh

For BASH:

user@myri00:~$ source /opt/env/mpich-gcc41.sh
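To check that the environment is active, you can verify which compiler wrapper is picked up first in your path (a quick sanity check, not part of the official setup):

[user@myri00 ~] which mpicc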

Here is an overview of scripts for C-Shell:

(default environment)       GNU compiler 4.0.1, Mpich 1.2.6..14b
/opt/env/mpich-intel.csh    Intel compiler 9.1, Mpich 1.2.7..15
/opt/env/mpich-gcc41.csh    GNU compiler 4.1.0, Mpich 1.2.6..14b
/opt/env/mpich-gcc346.csh   GNU compiler 3.4.6, Mpich 1.2.6..14b

The scripts for BASH have the extension `.sh' instead of `.csh'.

Note: The MPI scripts in `/unige/util/env/' are old; do not use them anymore.

Using the environment

After the environment has been set up, use `mpicc' as your compiler for C code and `mpiCC' for C++. All MPI-related programs, i.e. `mpicc', `mpiCC' and `mpirun', should be used without including full paths to binaries in `/opt/'. So, do NOT do:

[user@myri00 ~] /opt/mpich-1.2.6..13b-gcc/bin/mpicc ..... # WRONG !
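Instead, simply call the wrappers provided by the sourced environment, for example (program and source file names are hypothetical):

[user@myri00 ~] mpicc -O2 -o myprogram myprogram.c    # correct
[user@myri00 ~] mpiCC -O2 -o myprogram myprogram.cpp  # correct (C++)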

MPI and C++

There is a conflict between MPI and C++: the constants `SEEK_SET', `SEEK_CUR' and `SEEK_END' are defined in both standards. This is considered a bug in the MPI standard. It is recommended not to use these constants, for instance by replacing `fseek()' with `fgetpos()'.

Compilers

The GNU compiler

The default compiler on the system is the GNU compiler version 4.0.1. For MPI, versions 3.4.6, 4.0.1 and 4.1.0 of the GNU compiler are installed; see the section on setting up the environment for MPI. The version 4.1 compiler should be faster.

To use the GCC compiler without Mpich, source the corresponding script. In a C-shell:

[user@myri00 ~] source /opt/env/gcc41.csh

In BASH:

user@myri00:~$ source /opt/env/gcc41.sh

It is advised to place these commands in ~/.cshrc and ~/.bashrc, respectively.

The following compiler flags might be interesting:

-O2 or -O3 Enable compiler optimizations.
-march=pentium4 Generate code exclusively for the type of processor in the computation nodes, including vectorization (SSE2).
-funroll-loops Unroll loops if the number of iterations can be determined at compile-time.
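As an illustration, these flags could be combined on the command line as follows (program and file names are hypothetical; the same flags can be passed to `mpicc' when a GCC-based MPI environment is loaded):

[user@myri00 ~] gcc -O3 -march=pentium4 -funroll-loops -o myprogram myprogram.c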

The Intel compiler

The Intel compiler version 9.1, which is installed on myri00, is probably the fastest compiler available. The compiler can only be used on myri00, but the executables it produces can be run on every node of the cluster. The Intel compiler can be used after executing the following command. In a C-shell:

[user@myri00 ~] source /unige/util/env/intel.csh

In BASH:

user@myri00:~$ source /unige/util/env/intel.sh

The following compiler flags might be interesting:

-ipo Enable interprocedural optimizations.
-xW Generate code exclusively for the type of processor in the computation nodes.
-fno-alias Assume no aliasing in the program. Use with caution - only if you know what you are doing; your program might show unexpected behavior otherwise.
-fno-fnalias Assume aliasing in the program, but not within functions. Use with caution - only if you know what you are doing; your program might show unexpected behavior otherwise.

For more information, execute `icc --help | less'.
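For illustration, the flags above could be combined as follows (program and file names are hypothetical):

[user@myri00 ~] icc -ipo -xW -o myprogram myprogram.c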

Matlab

First, source the appropriate script. In a C-Shell:

[user@myri00 ~] source /unige/util/env/matlab_7.csh

In BASH:

user@myri00:~$ source /unige/util/env/matlab_7.sh

In order to use the graphical environment, log in to myri00 using `ssh -X myri00' or `ssh -Y myri00' and launch `matlab' normally. To avoid the graphical interface, start Matlab with `matlab -nodesktop'.
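Put together, a session with the graphical environment could look like this (a minimal sketch; the local machine name is hypothetical, and `-X' can be used instead of `-Y'):

user@mymachine:~$ ssh -Y myri00
user@myri00:~$ source /unige/util/env/matlab_7.sh
user@myri00:~$ matlab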

Running Batch Jobs

Batch jobs must be submitted to TORQUE to be run. TORQUE is a derivative of the Portable Batch System (OpenPBS), a system designed to run batch jobs from queues. The scheduler and the queues implement policies, which allow some form of coordinated use of resources. Please do not bypass the batch system. If there is something you would like to do but do not know how to do using the batch system, ask for help; bypassing the batch system will only annoy your colleagues.

TORQUE script

It is assumed here that your environment is set up using your ~/.cshrc and / or ~/.bashrc. To run a job, a script is needed. Example:

#!/bin/sh
 
cd /program/working/directory
mpiexec myprogram input.txt

In the rest of the text we suppose that the above script is called `job.sh'.

Another example:

#!/bin/sh
#PBS -l nodes=3
#PBS -q test
#PBS -l walltime=00:01:00
 
cd /program/working/directory
mpiexec myprogram input.txt

This script contains directives for TORQUE, so they don't need to be specified on the command line when submitting a job. In the rest of the text we suppose that the above script is called `job_pbs.sh'.

If you do not set up your environment in ~/.cshrc and/or ~/.bashrc, the TORQUE script itself needs to source the environment setup script. The first example would then look like:

#!/bin/sh
 
source /opt/env/mpich-gcc41.sh
 
cd /program/working/directory
mpiexec myprogram input.txt

Submitting a job

To run the program, one must submit a job to TORQUE with the command 'qsub'. However, several parameters must be taken into account: the number of required nodes, the running time and the job queue. If your job exceeds its specified runtime, it will be killed!

Before you submit a job you need to set up the environment as was done to compile your program by sourcing the proper script.

Example 1 - Submit a job to the test queue

qsub -l nodes=3 -q test -l walltime=00:01:00 job.sh

This will submit a job in script `job.sh' to queue `test' to run on 3 nodes with a runtime of up to 1 minute.

Example 2 - Using 2 processors per node

qsub -l nodes=2:ppn=2 -q test -l walltime=00:01:00 job.sh

This will submit a job in script `job.sh' to queue `test' to run on 2 nodes using 2 processors per node (a total of 4 processors) with a runtime of up to 1 minute.

Example 3 - Submitting a job to the batch queue

qsub -l nodes=6 -q batch -l walltime=01:00:00 job.sh

This will submit a job in script `job.sh' to queue `batch' to run on 6 nodes with a runtime of up to 1 hour.

Example 4 - An easier way

qsub -l nodes=6 -l walltime=01:00:00 job.sh

Equivalent to the previous example - `batch' is the default queue.

Example 5 - The very easy way

qsub job.sh

Submits `job.sh' to the default queue, using the default number of nodes and the default runtime.

Example 6 - Using directives in the script

qsub job_pbs.sh

Submits `job_pbs.sh' to the test queue, using 3 nodes, running up to 1 minute - as specified by the TORQUE directives in the script.

Example 7 - Overriding directives in the script

qsub -l nodes=4 job_pbs.sh

Submits `job_pbs.sh' to the test queue, running up to 1 minute - as specified by the TORQUE directives in the script. The number of nodes is 4, not 3, as the command-line parameter overrides the TORQUE directive in the script.

Example 8 - Using specific nodes

qsub -l nodes=myri01+myri02 job.sh

Submits `job.sh' to the default queue, with the default runtime, but specifically demands that the job be run on nodes myri01 and myri02.

Queues

There are two queues: `batch' and `test'.

Queue `batch'

The batch queue is meant for all serious computations. The following defaults apply to all jobs:

  • default number of nodes: 1
  • default runtime: 1 hour

The following limit applies:

  • maximal number of processors per user: 20

There is an additional limit for queue `batch': the maximal number of processors for all jobs combined in this queue is 60.

Although the limits allow you to use 20 processors for an indefinite period of time, please consider that the cluster is a resource to be shared with colleagues.

Queue `test'

The test queue is meant to test programs during development. The following defaults apply:

  • default number of nodes per job: 1
  • default runtime: 5 minutes (00:05:00)

The following limit applies:

  • maximal runtime: 5 minutes (00:05:00)

Because only 60 processors can be used by the batch queue while 64 processors are available for computation, there are always 4 processors available to the test queue. Users can take advantage of these reserved resources by limiting their jobs to at most 4 processors when submitting to the test queue; such jobs will then be scheduled very quickly. Jobs using more than 4 processors can still be submitted to the test queue, but they must then wait for the requested resources to become available.

How to determine the right parameters

Performance of the cluster is directly influenced by the performance of your programs. Besides that, the ability of the scheduler to achieve high throughput and low turnaround times is largely influenced by the accuracy of the specified runtimes. The scalability and efficiency of your jobs affect not only you but also the other users of the cluster.

Gather information

You can use the test queue to gain insight into the performance and scalability of your program, which should provide you with the information to determine the `right' number of nodes for a job and to estimate the corresponding runtime.

It is possible to observe the behaviour of your program at runtime. Use `qstat -n' to find out on which nodes it is running. Then, use ganglia to observe, for example, the CPU usage (metric `cpu_user', sorted `by hostname') of your program. If all CPUs appear to be largely idle, the performance bottleneck of your job is not computation. If only some nodes are idle, your program is not well balanced and you should consider addressing this issue instead of using more resources.
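A minimal sketch of this workflow (the job-id is hypothetical; `qsub' prints the actual id when the job is submitted):

[user@myri00 ~] qsub -l nodes=2:ppn=2 -q test job.sh
[user@myri00 ~] qstat -n 1234.myri00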

Number of nodes

In theory, increasing the number of nodes should make your job run faster and decrease the computation time. However, there are numerous reasons why increasing the number of nodes might not be a good idea. Hence, the `right' number of nodes is not necessarily `as many as possible'.

For parallel jobs, it is possible to use two tasks per node, i.e. `-l nodes=n:ppn=2' where n is the number of desired nodes; this will reduce communication. If you do not specify two tasks per node, the node will most likely be shared with another job, and your process will have to compete with the other job for the Myrinet connection. On the other hand, it will only be possible to put your job on nodes with two free processors, so you might want to check how the cluster is loaded and base your decision on that.

The `right' number of nodes is the smallest number that offers a noticeable advantage over using fewer nodes.

If you have several jobs, it is more efficient to launch them at the same time on a few nodes each than sequentially on many nodes - scalability is generally less than linear.

The following is a list of reasons to limit the number of nodes you want to use. Some can be explained by the fact that scalability is generally less than linear.

  • The Myrinet is a shared resource. Consider the needs of your colleagues and respect the rules described for submitting jobs.
  • Even if your program scales well, there might be a number of nodes beyond which computation is no longer the bottleneck. If communication or I/O becomes the bottleneck, increasing the number of nodes any further will generally not reduce the runtime.
  • Increasing the number of nodes will make it harder to schedule your job in the near future.
  • The probability that your job will be prematurely terminated because of a dying node increases.

Runtime

The runtime of a job is an important parameter. Although each queue has a limit and a default for the runtimes of its jobs, you should consider specifying a value yourself. If you do not specify a runtime, the queue will use the default runtime for your job, with possibly disastrous results.

You should minimize the specified runtime without making it too short. If your job exceeds its specified runtime, it will be killed! However, you will be rewarded for having short runtimes as the scheduler might be able to start your job earlier. If you do not specify a runtime, check whether the default is sufficient. Remember to add some extra time to your estimate.

To reduce your runtime, try out different compilers and compiler options to create faster programs and write your programs with performance and scalability in mind.

Useful commands

The commands should be used on myri00. You should log in using `ssh -X myri00' or `ssh -Y myri00' to run `xpbs' and `xpbsmon'.

qstat Show the status of the jobs in the queues, such as which jobs are running or queued, who owns them and the corresponding job-ids.
qstat -a Show more information of the status of the jobs in the queues.
watch qstat -a Like `qstat -a', but frequently updated.
qstat -an Show nodes assigned to jobs in addition to `qstat -a'.
qstat -f <job-id> Show detailed information on the specified job.
qstat -Q Display information about the configuration of the queues.
qdel <job-id> Remove the job from the queue. If the job is running, it will be killed.
qalter -l … <job-id> Alter the resources requested for a queued job.
xpbs Graphical user interface for PBS commands.
xpbsmon Graphical user interface to monitor the PBS nodes.
showstart <job-id> Let MAUI show estimated time to start of a job.
showq Let MAUI show queue status.
diagnose Diagnose various problems with queues, the scheduling of jobs, et cetera in MAUI.
checkjob <job-id> Let MAUI display information about a job.
showbf Let MAUI show the backfill window.
showres Let MAUI show the reservations.

Further notes

Temporary Disk Space

Temporary space is available locally on each node: /scratch. You can create a personal folder for your temporary files, but don't forget to clean up this space after use.
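A possible pattern inside a job script (directory, program and file names are hypothetical; note that /scratch is local to each node, so this only manages files on the node where the script itself runs):

#!/bin/sh

# create a personal directory in the local scratch space
# (assumes $USER is set in the job's environment)
SCRATCHDIR=/scratch/$USER/myjob
mkdir -p $SCRATCHDIR
cd $SCRATCHDIR

# run the program; temporary files are written locally
mpiexec myprogram input.txt

# copy the results back to the shared home directory and clean up
cp results.txt $HOME/
rm -rf $SCRATCHDIR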

Hints

If you have big jobs, you are advised to write your program in such a way that it can pick up where it left off in case of a crash or hardware failure.

Torque setup

The `batch' queue acts as one big job pool. This `one pool' strategy is an alternative to, and has advantages over, a `multiple queue' strategy. First, the simple setup should be easy to use. Second, the purpose of a `multiple queue' system is to assign priorities to jobs with different properties; however, priorities should be determined by the scheduler, which optimizes for throughput, turnaround time and fairness. Setting priorities manually will generally have unpredictable effects and degrade performance.

Network

There is a separate network, independent of the Myrinet, which is used for network traffic other than MPI, such as logging in and shared filesystem access through NFS. MPI performance is therefore not affected by this traffic. Safe benchmark tests of MPI programs are still difficult, as nodes are shared between jobs. Specifying two tasks per node eliminates this problem, but introduces another: it might allow tasks to communicate with tasks on the same node directly rather than through the Myrinet connection.