About OpenMP
The OpenMP machine was acquired for lab courses, but can be used freely in the intervals between courses.
The OpenMP machine has 4 Intel® Xeon CPUs with Hyper-Threading at 2.4 GHz and 32 GB of memory. Because of Hyper-Threading, there are 8 logical CPUs although there are only 4 physical CPUs. It is a shared-memory parallel machine, intended for programs that follow the OpenMP programming paradigm, but it is also an attractive choice for MPI programs because communication within a shared-memory machine is very fast. There is a basic C++ OpenMP tutorial in French.
The administrator of the OpenMP machine is Nicolas Mayencourt.
Logging in
Just use ssh:
ssh openmp.unige.ch
Once you are logged in, you need to set up your environment to use the Intel compilers and the debugger. The C / C++ compiler is called icc, the Fortran compiler is ifort, and the debugger is idb.
C-Shell:
% source /opt/env/ifortvars.csh
% source /opt/env/iccvars.csh
% source /opt/env/idbvars.csh
Bash:
% source /opt/env/ifortvars.sh
% source /opt/env/iccvars.sh
% source /opt/env/idbvars.sh
Online documentation is available for the compilers and the debugger.
Creating a program
A simple example in C
The following program, 'openmp.c', gives a very simple example of the most common use of OpenMP.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memset */

#define N 256
#define T 1000

int main(void)
{
    int data[N][N];
    long long sum;
    int t, i, j;

    /* Clear array: set all to zero. */
    memset(data, 0, sizeof(data));

    /* All variables are by default shared between all threads, except
       'i', 'j' and 't', of which each thread has a private copy.
       The parallel loop runs over the rows 'i', so each thread has its
       own indices and writes to its own rows of the same array. (Making
       't' the parallel loop would let several threads update the same
       element at once, which is a data race.) */
#pragma omp parallel for default(shared) private(t, i, j)
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            for (t = 0; t < T; ++t)
                data[i][j] += random() % 100;  /* small values, so 'int' cannot overflow */

    /* This loop is kept serial: unsynchronized concurrent writes to
       'sum' would be a data race. Parallelizing it requires a
       reduction(+:sum) clause. */
    sum = 0;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            sum += data[i][j];

    printf("Sum: %lld\n", sum);
    return 0;
}
Compiling
Use the Intel compiler! The next sections use the C / C++ compiler icc, but the same switches apply to the Fortran compiler ifort.
For performance
Use the following switches to maximize performance:
icc -xN -O3 -ipo -openmp -o openmp openmp.c
-xN | Generate code for Xeon CPU with MMX / SSE vectorization |
-O3 | Optimize a lot |
-ipo | Optimize between functions and files |
-openmp | Enable OpenMP |
-Wall | Emit all possible warnings (optional, not included in the command above) |
Another interesting switch is '-parallel', which lets the compiler parallelize loops automatically. Unfortunately, it is not clear how many threads a program compiled with -parallel will use. If you know how to control the number of threads, please update this page.
Debugging
Use only the -g switch for debugging:
icc -g -o openmp openmp.c
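This disables all optimizations and adds extra information to the program to make debugging easier. Note that without -openmp the OpenMP pragmas are ignored, so a program built with this command runs with a single thread.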
Running a Job
Selecting the number of threads
There is no batch system to coordinate the use of the machine between users, so you need to check how many resources are available before you start a job. To do this, run the command 'top', then press '1' to show the individual CPUs. You will see something like this:
top - 14:33:32 up 96 days, 12:24, 8 users, load average: 2.22, 1.80, 1.49
Tasks: 123 total, 3 running, 120 sleeping, 0 stopped, 0 zombie
Cpu0 :  97.7% us,  2.3% sy,  0.0% ni,   0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu1 :  98.0% us,  2.0% sy,  0.0% ni,   0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2 : 100.0% us,  0.0% sy,  0.0% ni,   0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3 : 100.0% us,  0.0% sy,  0.0% ni,   0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu4 :   0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu5 :   0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu6 :   0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu7 :   0.0% us,  0.3% sy,  0.0% ni,  99.7% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  2074760k total, 1968320k used,  106440k free,   69504k buffers
Swap: 4168828k total,   61112k used, 4107716k free, 1502140k cached
----------------------------------------------------------------------------
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28507 beekhof  25   0  7340 1616 1092 R 37.5  0.1   0:31.22 a.out
12689 latt     25   0  173m 171m  960 R 12.5  8.5  20355:10 bb_dipole
28506 beekhof  16   0  2124 1168  884 R  0.0  0.1   0:00.10 top
    1 root     16   0  1560  108   84 S  0.0  0.0   0:03.19 init
    2 root     RT   0     0    0    0 S  0.0  0.0   0:00.45 migration/0
    3 root     34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
This shows the 8 logical CPUs, of which 4 are currently busy. If you are curious, you can use 'dmesg' to find the mapping between physical and logical CPUs; at the time of writing it is: CPU0(cpu0, cpu7), CPU1(cpu1, cpu4), CPU2(cpu2, cpu5), CPU3(cpu3, cpu6).
If 4 or more logical CPUs are in use, the machine is full; use more than 4 only if you are alone on the machine. Linux assigns a thread to an idle physical CPU if possible; otherwise it picks a logical CPU on a physical CPU that is already busy. If you are not alone, you therefore run the risk of starting a thread on a CPU that is already running someone else's thread. The threads on that CPU then compete for cache, memory bandwidth and ALUs, resulting in cache thrashing and impressive slowdowns for both programs. It is a great way to annoy people.
Remember to pay attention to the memory as well!
If you are alone on the machine, you can try to use 8 threads instead of 4, but whether performance improves depends on your program.
Launching a job
First, you need to specify the number of threads you wish your OpenMP program to use. The easiest way to do this is by setting an environment variable OMP_NUM_THREADS.
In Bash:
export OMP_NUM_THREADS=4
In C-Shell:
setenv OMP_NUM_THREADS 4
Then, just run your job from the command line.