BIT1 was originally an ordinary serial program with a number of graphical windows (xgrafix). Because the program is extremely computationally demanding, it was parallelized with the help of experts at the Barcelona Supercomputing Center (BSC). This made it far more practical, since serial runs could take several months. In parallel (MPI) mode the graphical interface is disabled.
The author describes BIT1 as follows: BIT1 is an electrostatic 1D3V PIC + 2D3V direct Monte Carlo code for plasma simulations. It can run in serial as well as massively parallel versions. BIT – Berkeley (the first version of the serial BIT1 was based on XPDP1 code from University of Berkeley) – Innsbruck (code has been developed by D. Tskhakaya at University of Innsbruck) – Tbilisi (D. Tskhakaya is from Tbilisi).
Installing BIT1
We build the program from source: we unpack it into a new directory and set up the build environment with the Intel compilers and the OpenMPI parallel environment:
[leon@prelog ~]$ mkdir bit1
[leon@prelog ~]$ cd bit1
[leon@prelog bit1]$ cp ../Downloads/BIT1_032010.tar .
[leon@prelog bit1]$ tar xf BIT1_032010.tar
[leon@prelog bit1]$ cd codeUnix
[leon@prelog codeUnix]$ module load intel openmpi
Loading intel/12.0
Loading openmpi/1.5.1 for compiler intel-12.0
Compiling
The Intel compilers offer advanced features and, in many cases, better optimization than GCC. Intel also provides faster replacements for a large number of math functions, so we always put -limf in front of the standard math library (-lm). The Intel compiler has no -qinitauto switch, which is available in the IBM compilers (used on MareNostrum at BSC), so for now we simply remove it by editing the makefile. This turns off the automatic initialization of variables to zero. Because the use of -qinitauto=00 suggests that some variable in the program is left uninitialized, we prefer to track those variables down later.
[leon@prelog codeUnix]$ diff -u /home/leon/bit1.vanilla/codeUnix/makefile makefile
--- makefile 2009-12-23 13:45:21.000000000 +0100
+++ /home/leon/bit1.fixed/codeUnix/makefile 2010-12-25 02:07:27.121474246 +0100
@@ -14,13 +14,13 @@
##
## compiler for BSC
-CC= mpicc -O2 -qinitauto=00
+CC= mpicc -O2
## for others
##CC = cc -O2
##CC = gcc -g -O2
##
##
-LIBS= -L/usr/X11R6/lib64 -lm -lX11
+LIBS= -L/usr/X11R6/lib64 -limf -lm -lX11
##LIBS= -L/usr/X11R6/lib -lm -lX11
##
## Libraries and their directories used in loading. Normal is -lm and -lX11
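The same makefile edit can also be scripted. The following is a minimal sketch, assuming the CC and LIBS lines look exactly as in the diff above; the file names makefile.sample and makefile.intel are made up for this demonstration, and the result should be checked with diff before it replaces the real makefile.

```shell
# Sketch only: reproduce the manual makefile edit with sed.
# makefile.sample and makefile.intel are hypothetical names used for this demo.
workdir=$(mktemp -d)
cat > "$workdir/makefile.sample" <<'EOF'
CC= mpicc -O2 -qinitauto=00
##CC = gcc -g -O2
LIBS= -L/usr/X11R6/lib64 -lm -lX11
##LIBS= -L/usr/X11R6/lib -lm -lX11
EOF

# 1) drop the IBM-only -qinitauto=00 switch from the active CC line
# 2) put -limf in front of -lm on the active LIBS line
#    (commented-out lines start with ## and are left untouched)
sed -e '/^CC=/s/ -qinitauto=00//' \
    -e '/^LIBS=/s/ -lm / -limf -lm /' \
    "$workdir/makefile.sample" > "$workdir/makefile.intel"

cat "$workdir/makefile.intel"
```

Writing to a new file instead of editing in place keeps the original makefile available for comparison.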
We then compile the program with the usual make command.
Running in the batch queue system
To make submission to LSF easier, we prepare a script:
[leon@prelog codeUnix]$ cat bit1.lsf
#!/bin/bash
#BSUB -a openmpi
#BSUB -n 24
#BSUB -o bit1.log
#BSUB -J bit1
mpirun.lsf ./BIT1 ../tok-em4e22.inp
We submit it on 24 processors with the command
[leon@prelog codeUnix]$ bsub < bit1.lsf
Job <7> is submitted to default queue <normal>.
[leon@prelog codeUnix]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
7 leon RUN normal prelog 12*cn52 bit1 Dec 25 15:41
12*cn55
[leon@prelog codeUnix]$ bhist -l 7|grep CPU
Sat Dec 25 15:42:05: Done successfully. The CPU time used is 885.6 seconds;
The job finishes after a short time.
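The CPU time reported by bhist is far larger than the elapsed time of the job, so it is evidently accumulated over all processes. As a rough illustration, assuming the line format shown above and an even load across the 24 processes, the average per process can be estimated as follows (cpu_per_process is a hypothetical helper name):

```shell
# Sketch: average CPU seconds per process from a bhist "CPU time used" line.
# cpu_per_process is a hypothetical helper; it assumes the number sits
# directly before the word "seconds;" as in the bhist output shown above.
cpu_per_process() {
    # $1 = number of processes; the bhist text is read from standard input
    LC_ALL=C awk -v n="$1" '{
        for (i = 2; i <= NF; i++)
            if ($i == "seconds;") printf "%.1f\n", $(i - 1) / n
    }'
}

echo "Sat Dec 25 15:42:05: Done successfully. The CPU time used is 885.6 seconds;" |
    cpu_per_process 24
```

For the run above this gives roughly 36.9 s of CPU time per process.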
Source code fixes
Once the program finishes, the file bit1.log shows that it always exits with status 1, so LSF reports:
...
Code is based on XPDP1 code from University of California - Berkeley: Plasma Theory and Simulation Group
Final tstep= 50000
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Job /opt/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper ./BIT1 ../tok-em4e22.inp
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 cn53 ./BIT1 ../tok-em Exit (1) 12/25/2010 13:23:15
....
We therefore patch the program so that a normal run ends with status 0, by changing the file somef.c:
[leon@prelog codeUnix]$ diff -u ~leon/bit1.vanilla/codeUnix/somef.c somef.c
--- /home/leon/bit1.vanilla/codeUnix/somef.c 2009-12-23 13:45:20.000000000 +0100
+++ somef.c 2010-12-25 13:25:40.919774527 +0100
@@ -104,7 +104,7 @@
void exitrun()
{
if (fmpi) MPI_Finalize();
- exit(1);
+ exit(0);
}
/*********************************************************************
This stops LSF from complaining when the program has in fact finished correctly.
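The effect of the exit status can be reproduced without LSF. In this minimal sketch, report_status is a hypothetical helper that only mimics the Done / Exit (n) column of the job report; it is not part of LSF or BIT1.

```shell
# Sketch: how a launcher sees a program's exit status.
# report_status is a hypothetical helper, not an LSF command.
report_status() {
    if "$@"; then
        echo "Done"
    else
        # inside the else branch, $? still holds the status of "$@"
        echo "Exit ($?)"
    fi
}

report_status sh -c 'exit 1'   # status 1, as with the original exitrun()
report_status sh -c 'exit 0'   # status 0, as after the fix
```

Any nonzero status is treated as a failure by the launcher, which is why exit(1) at the end of a successful run is reported as an error.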
Finding bugs in the source code
Bugs appear constantly during program development. Whenever possible, we look for them with tools that can find the basic ones. Errors in the handling of variables can be detected by the compilers themselves or by tools such as Valgrind, which guard the memory reserved for variables and so catch the offending code when it writes out of bounds. For the Intel compilers we therefore replace -qinitauto with the switches that check for uninitialized variables (-check-uninit -ftrapuv), which works even without the debugging switch (-g).
[leon@prelog codeUnix]$ diff -u ~leon/bit1.vanilla/codeUnix/makefile makefile
--- /home/leon/bit1.vanilla/codeUnix/makefile 2009-12-23 13:45:21.000000000 +0100
+++ makefile 2010-12-25 15:55:18.208471392 +0100
@@ -14,13 +14,13 @@
##
## compiler for BSC
-CC= mpicc -O2 -qinitauto=00
+CC= mpicc -g -O2 -check-uninit -ftrapuv
## for others
##CC = cc -O2
##CC = gcc -g -O2
We start the bug hunt by rebuilding and rerunning the program:
[leon@prelog codeUnix]$ make clean
rm *.o BIT1
make -C xgrafix clean
make[1]: Entering directory `/home/leon/bit1/codeUnix/xgrafix'
rm xgrafix.o
make[1]: Leaving directory `/home/leon/bit1/codeUnix/xgrafix'
[leon@prelog codeUnix]$ make
mpicc -O2 -check-uninit -ftrapuv -c -O2 fft.c
mpicc -O2 -check-uninit -ftrapuv -c -O2 mcc.c
...
...
The run then reports the problematic variables in bit1.log as:
Run-Time Check Failure: The variable 'vv' is being used without being initialized
[cn55:04806] *** Process received signal ***
[cn55:04806] Signal: Aborted (6)
[cn55:04806] Signal code: (-6)
Run-Time Check Failure: The variable 'vv' is being used without being initialized
...
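With 24 processes the same message is repeated many times in the log. A small filter, sketched here under the assumption that every message has exactly the form shown above, lists each reported variable once (uninit_vars is a hypothetical helper name):

```shell
# Sketch: list each variable named in "Run-Time Check Failure" messages once.
# uninit_vars is a hypothetical helper; it assumes the message format above.
uninit_vars() {
    # $1 = log file to scan
    sed -n "s/^Run-Time Check Failure: The variable '\([^']*\)' is being used without being initialized.*/\1/p" "$1" |
        sort -u
}

# Demo on a small sample log assembled from the lines shown above
sample=$(mktemp)
cat > "$sample" <<'EOF'
Run-Time Check Failure: The variable 'vv' is being used without being initialized
[cn55:04806] *** Process received signal ***
Run-Time Check Failure: The variable 'vv' is being used without being initialized
EOF
uninit_vars "$sample"
rm -f "$sample"
```

On the real log the call would be uninit_vars bit1.log. Note that the check aborts at the first failure in each process, so further variables may only show up after the reported ones are fixed.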
After a quick fix, that is, initializing the variables, we rerun the job in the (pseudo-)foreground; bsub -K waits until the job has finished:
[leon@prelog codeUnix]$ grep vv *.c |grep float
[leon@prelog codeUnix]$ emacs datf_disp_mpi.c datf_fv_mpi.c dmprest.c movie.c tdf.c &
[leon@prelog codeUnix]$ make
[leon@prelog codeUnix]$ bsub -K < bit1.lsf
Job <13> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Job is finished>>
When the program has ended, we can locate the fault post mortem: we collect the remaining "bones" (the core file), open up the "patient" in a debugger, and ask (where) on which line death occurred:
[leon@prelog codeUnix]$ gdb ./BIT1 core.4096
...
Program terminated with signal 6, Aborted.
#0 0x0000003fc5230265 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x0000003fc5230265 in raise () from /lib64/libc.so.6
#1 0x0000003fc5231eb8 in abort () from /lib64/libc.so.6
#2 0x0000000000468aa0 in __intel_rtc_uninit_use ()
#3 0x0000000000449a01 in datf_t (datf=4096) at datf_t_mpi.c:199
#4 0x0000000000440800 in cridat_mpi (datfile=4096) at dmprest_mpi.c:449
#5 0x000000000040f4fc in main (argc=2, argv=0x7fff0aa620b8) at bit1.c:99
(gdb) quit
After a few attempts we find that the uninitialized variable vv1 was involved as well, not only vv. Instead of gdb we could also use the Intel debugger idb. The module datf_t_mpi.c is thus fixed with
[leon@prelog codeUnix]$ diff -u ~leon/bit1.vanilla/codeUnix/datf_t_mpi.c datf_t_mpi.c
--- /home/leon/bit1.vanilla/codeUnix/datf_t_mpi.c 2009-12-23 13:45:17.000000000 +0100
+++ datf_t_mpi.c 2010-12-25 17:25:06.738668233 +0100
@@ -7,7 +7,7 @@
int datf;
{
int i, j, isp;
- float vv, vv1, vv2, vv3, vv4, vv5, vv6, vv7, vv8;
+ float vv = 0.0, vv1 = 0.0, vv2, vv3, vv4, vv5, vv6, vv7, vv8;
char fn[10], fname[90];
FILE *mydmp;
With that, we can finally run the program without the checks for uninitialized variables, since they slow the program down. The final output of the program, together with the job statistics, is:
Sender: LSF System <lsfadmin@cn52>
Subject: Job 22: <bit1> Done
Job <bit1> was submitted from host <prelog> by user <leon> in cluster <hpcfs>.
Job was executed on host(s) <12*cn52>, in queue <normal>, as user <leon> in cluster <hpcfs>.
<12*cn55>
</home/leon> was used as the home directory.
</home/leon/bit1/codeUnix> was used as the working directory.
Started at Sat Dec 25 17:27:48 2010
Results reported at Sat Dec 25 17:28:48 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -a openmpi
#BSUB -n 24
#BSUB -o bit1.log
#BSUB -J bit1
mpirun.lsf ./BIT1 ../tok-em4e22.inp
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 965.01 sec.
Max Memory : 5670 MB
Max Swap : 11982 MB
Max Processes : 42
Max Threads : 92
The output (if any) follows:
BIT1 - Bounded Electrostatic 1 Dimensional PIC-MCC Code (Berkeley-Innsbruck-Tbilisi)
Developed in High Energy and Plasma Physics Group, University of Innsbruck
Author: D.Tskhakaya, Permanent address:
Institute of Physics, Georgian Academy of Sciences, Tbilisi
Code is based on XPDP1 code from University of California - Berkeley: Plasma Theory and Simulation Group
Final tstep= 50000
Job /opt/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper ./BIT1 ../tok-em4e22.inp
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00001 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00002 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00003 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00004 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00005 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00006 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00007 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00008 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00009 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00010 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00011 cn55 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00012 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00013 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00014 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00015 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00016 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00017 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00018 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00019 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00020 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00021 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00022 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
00023 cn52 ./BIT1 ../tok-em Done 12/25/2010 17:28:30
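The results, which are written to several files *.inp.dat.<processor.rank>, are merged by calling the shell function zdruzi tok-em4e22.inp.dat.*. The function sorts the file names numerically by rank and concatenates the files, dropping the last line of each:
zdruzi() { printf '%s\n' "$@" | sort -t . -n -k 4 | xargs head -q -n -1 > izpis.txt; }
The resulting file izpis.txt can be viewed directly in gnuplot with the command plot 'izpis.txt' using 4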
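The numeric sort on the rank suffix matters because the shell expands tok-em4e22.inp.dat.* in lexical order (0, 1, 10, 11, 2, ...). The sketch below shows the idea on made-up sample files. The names merge_by_rank and demo.inp.dat.* are hypothetical, and the sketch assumes GNU head (for -q and -n -1) and file names without spaces.

```shell
# Sketch: concatenate per-rank files in numeric rank order, dropping the
# last line of each file. merge_by_rank is a hypothetical name.
merge_by_rank() {
    # one file name per line, sorted numerically on the 4th dot-separated field
    printf '%s\n' "$@" | sort -t . -n -k 4,4 | xargs head -q -n -1
}

# Demo with made-up files for ranks 0..11, each holding one data line
# plus a trailing line that is to be dropped.
demo=$(mktemp -d)
for r in 0 1 2 3 4 5 6 7 8 9 10 11; do
    printf 'rank %s\ntrailer\n' "$r" > "$demo/demo.inp.dat.$r"
done

# Run inside the directory so that the rank is the 4th field of each name
# (the temporary directory path may itself contain dots).
( cd "$demo" && merge_by_rank demo.inp.dat.* ) > "$demo/merged.txt"
cat "$demo/merged.txt"
```

The merged file lists the ranks in the order 0, 1, 2, ..., 11, whereas a plain cat demo.inp.dat.* would place ranks 10 and 11 right after rank 1.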
Conclusion
Further information about the program is available from {leon.kos, nikola.jelic}@lecad.fs.uni-lj.si or from the author himself.