OpenMP machine

About OpenMP

The OpenMP machine was acquired for lab courses, but can be used freely in the intervals between courses.

The OpenMP machine has 4 Intel© Xeon CPUs with HyperThreading at 2.4 GHz and 32 GB of memory. Because of HyperThreading, there are 8 logical CPUs but only 4 physical CPUs. It is a shared-memory parallel machine, intended for programs that follow the OpenMP programming paradigm, but it is also an attractive choice for MPI programs because communication within a shared-memory machine is very fast. There is a basic C++ OpenMP tutorial in French.

The administrator of the OpenMP machine is Nicolas Mayencourt.

Logging in

Just use ssh:

ssh openmp.unige.ch

Once you are logged in, you need to set up your environment to use the Intel compilers and the debugger. The C / C++ compiler is called icc, the Fortran compiler is called ifort, and the debugger is called idb.

C-Shell:

% source /opt/env/ifortvars.csh
% source /opt/env/iccvars.csh
% source /opt/env/idbvars.csh

Bash:

% source /opt/env/ifortvars.sh
% source /opt/env/iccvars.sh
% source /opt/env/idbvars.sh

On-line documentation is available for the compilers and the debugger.

Creating a program

A simple example in C

The following program, 'openmp.c', gives a very simple example of the most common use of OpenMP.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memset */

#define N  256
#define T 1000

int main(void)
{
      int data[N][N];
      int t, sum;
      int i, j;

      /* Clear array: set all to zero. */
      memset(data, 0, N * N * sizeof(int));

      /* All variables are by default shared between all threads,
         except 'i', 'j' and 't', of which each thread has a private copy.
         The parallel loop runs over the rows 'i', so each thread updates
         its own rows of the shared array and no two threads ever write
         to the same element. */

#pragma omp parallel for default(shared) private(t, i, j)
      for (i = 0; i < N; ++i)
              for (j = 0; j < N; ++j)
                      for (t = 0; t < T; ++t)
                              /* '% 10' keeps the totals small enough for an int. */
                              data[i][j] += random() % 10;

      /* This loop is kept serial: unsynchronized concurrent updates of
         'sum' would be a data race. Parallelizing it requires an OpenMP
         reduction clause. */
      sum = 0;
      for (i = 0; i < N; ++i)
              for (j = 0; j < N; ++j)
                      sum += data[i][j];

      printf("Sum: %d\n", sum);

      return 0;
}

Compiling

Use the Intel compiler. The next sections use the C / C++ compiler icc, but the same switches apply to the Fortran compiler ifort.

For performance

Use the following switches to maximize performance:

icc -xN -O3 -ipo -openmp -o openmp openmp.c
-xN Generate code for Xeon CPU with MMX / SSE vectorization
-O3 Optimize a lot
-ipo Optimize between functions and files
-openmp Enable OpenMP
-Wall Emit all possible warnings (optional)

Another interesting switch is '-parallel', which lets the compiler parallelize loops automatically. Unfortunately, it is not clear how many threads a program compiled with -parallel will use. If you know how to control the number of threads, please update this page.

Debugging

Use only the -g switch for debugging:

icc -g -o openmp openmp.c

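This disables all optimizations and adds extra information to the program to make debugging easier. Note that without the -openmp switch the OpenMP pragmas are ignored and the program runs serially; to debug the threaded version, -openmp has to be added as well.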

Running a Job

Selecting the number of threads

There is no batch system to coordinate the use of the machine between users, so you need to check how many resources are available before you start a job. To do this, run the command 'top', then press '1' to show the individual CPUs. You should see something like this:

top - 14:33:32 up 96 days, 12:24,  8 users,  load average: 2.22, 1.80, 1.49
Tasks: 123 total,   3 running, 120 sleeping,   0 stopped,   0 zombie
Cpu0  : 97.7% us,  2.3% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu1  : 98.0% us,  2.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  : 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3  : 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu4  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu5  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu6  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu7  :  0.0% us,  0.3% sy,  0.0% ni, 99.7% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   2074760k total,  1968320k used,   106440k free,    69504k buffers
Swap:  4168828k total,    61112k used,  4107716k free,  1502140k cached
----------------------------------------------------------------------------
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28507 beekhof   25   0  7340 1616 1092 R 37.5  0.1   0:31.22 a.out
12689 latt      25   0  173m 171m  960 R 12.5  8.5  20355:10 bb_dipole
28506 beekhof   16   0  2124 1168  884 R  0.0  0.1   0:00.10 top
    1 root      16   0  1560  108   84 S  0.0  0.0   0:03.19 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.45 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0

This shows the 8 logical CPUs, of which 4 are currently busy. If you are curious, you can use 'dmesg' to find the relation between physical and logical CPUs. At the time of writing it is: CPU0(cpu0, cpu7), CPU1(cpu1, cpu4), CPU2(cpu2, cpu5), CPU3(cpu3, cpu6).

If 4 or more logical CPUs are in use, the machine is full: use more than 4 CPUs only if you are alone on the machine. Linux assigns a thread to an idle physical CPU if possible; otherwise it picks a free logical CPU on a physical CPU that is already busy. If you are not alone, you therefore run the risk of starting a thread on a CPU that is already running someone else's thread. The two threads on that CPU then compete for cache, memory bandwidth and ALUs, which results in cache thrashing and impressive slowdowns for both programs. It is a great way to annoy people.

Remember to pay attention to the memory as well!

If you are alone on the machine, you can try to use 8 threads instead of 4, but whether the performance improves depends on your program.

Launching a job

First, you need to specify the number of threads you want your OpenMP program to use. The easiest way to do this is to set the environment variable OMP_NUM_THREADS.

In Bash:

export OMP_NUM_THREADS=4

In C-Shell:

setenv OMP_NUM_THREADS 4

Then, just run your job from the command line.