LIGO code benchmarking
PyCBC Inspiral
These are instructions for checking out and compiling an example LIGO executable and running it on the Raven system.
First, log on to Raven and save the following script somewhere in your home directory. It will download, build and install a basic system to start profiling; by default everything is installed under your scratch directory:
#!/bin/bash

# Pull in required modules
module load gsl/1.15
module load fftw/3.3-avx
module load fftw/3.3-avx-float
module load libframe/8.04-2
module load libmetaio/8.4
module load python/2.6.6-ligo
module load swig/2.0.11

WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
CONFIGOPTS="--enable-swig-python --disable-laldetchar --disable-lalburst --disable-lalstochastic --disable-gcc-flags --disable-debug"

mkdir -p $WORKDIR
cd $WORKDIR

# Download LALSuite
if [ ! -d lalsuite ]; then
    git clone git://ligo-vcs.phys.uwm.edu/lalsuite.git
fi

# Build LALSuite
cd lalsuite
./00boot
./configure --prefix $INSTALLDIR $CONFIGOPTS
make -j 4
make install

# Build GLUE
cd $WORKDIR/lalsuite/glue
python setup.py install --prefix=$INSTALLDIR

# Source LALSuite for building next parts
. $INSTALLDIR/etc/lalapps-user-env.sh
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh

# Download PyCBC
cd $WORKDIR
if [ ! -d pycbc ]; then
    git clone git://ligo-vcs.phys.uwm.edu/pycbc.git
fi

# Build PyCBC
cd pycbc
python setup.py install --prefix=$INSTALLDIR
. $INSTALLDIR/etc/pycbc-user-env.sh
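Assuming the script has been saved as, say, build_lalsuite.sh (a hypothetical filename), it can be run and its lengthy output logged with:

# Run the build script and keep a log of the output for debugging failed builds
bash build_lalsuite.sh 2>&1 | tee build.log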
Make a directory where you'll do the runs and copy over the required files:
RUNDIR=/scratch/$USER/benchmarking/runs
mkdir $RUNDIR
cd $RUNDIR
cp /home/spxph/BenchmarkingFiles/* .
Then make a job submission script, PyCBC_Profile.sh, with the following text:
#!/bin/bash
#PBS -S /bin/bash
#PBS -q serial
#PBS -l select=1:ncpus=1
#PBS -l walltime=2:00:00
#PROJECT=PR37

module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load fftw/3.3-sse2
module load fftw/3.3-sse2-float
module load python/2.6.6-ligo

WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs

. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh

cd $RUNDIR

python -m cProfile -o PyCBC_Profile_Report `which pycbc_inspiral` \
    --trig-end-time 0 \
    --cluster-method template \
    --bank-file templatebank.xml \
    --strain-high-pass 30 \
    --approximant SPAtmplt \
    --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN \
    --snr-threshold 5.5 \
    --trig-start-time 0 \
    --gps-start-time 1026019572 \
    --chisq-bins 16 \
    --segment-length 256 \
    --low-frequency-cutoff 40.0 \
    --pad-data 8 \
    --sample-rate 4096 \
    --chisq-threshold 10.0 \
    --chisq-delta 0.2 \
    --user-tag FULL_DATA \
    --order 7 \
    --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --processing-scheme cpu \
    --psd-estimation median \
    --maximization-interval 30 \
    --segment-end-pad 64 \
    --segment-start-pad 64 \
    --psd-segment-stride 128 \
    --psd-inverse-length 16 \
    --psd-segment-length 256 \
    --output PYCBC_OUTPUT.xml.gz \
    --verbose
Now you should be ready to do the runs. The job loops through the supplied templatebank.xml file, which contains around 3,000 templates, each of which is processed independently through the data. For a quick test you can use small_bank.xml instead, which contains only about 50 templates. To run the job, submit it to the PBS queue with qsub PyCBC_Profile.sh. When it completes you can analyse the profiling output using Python:
import pstats
p = pstats.Stats('PyCBC_Profile_Report')
p.sort_stats('cumulative').print_stats(10)
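The report can also be interrogated without an interactive session; for example, the following one-liner (a sketch using the same pstats calls) sorts by the time spent in each function itself rather than cumulative time, which highlights the low-level hotspots:

# Print the ten functions with the largest self (non-cumulative) time
python -c "import pstats; pstats.Stats('PyCBC_Profile_Report').sort_stats('time').print_stats(10)"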
Using multiple cores
PyCBC can use multiple cores by supplying the options

--processing-scheme cpu:N --fftw-threads-backend pthreads|openmp

where N is the number of cores to use and the second option selects either the openmp or pthreads FFTW threading model. The openmp backend does not currently work with the modules on Raven, but I have found that it works with my own FFTW build. It is also necessary to bind the processes to the correct CPU cores; I use taskset -c 0-N pycbc_inspiral, where 0-N is the range of cores to bind to (i.e. N is one less than the number of cores used).
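For example, a four-core pthreads run bound to cores 0-3 might look like the sketch below, assuming the fixed pycbc_inspiral options (bank file, GPS times, PSD settings, etc.) have been collected into a shell variable RUNOPTS, as the benchmarking script in the Results section does; the output filename is hypothetical:

# Bind the process to cores 0-3 and let FFTW use four threads via pthreads
taskset -c 0-3 pycbc_inspiral $RUNOPTS \
    --processing-scheme cpu:4 \
    --fftw-threads-backend pthreads \
    --output PYCBC_OUTPUT_4core.xml.gz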
Using FFTW Wisdom
FFTW can find the quickest methods for computing FFTs and store this information in a file for later use. PyCBC uses the following options to calculate and record the wisdom:
--fftw-measure-level Level --fftw-output-float-wisdom-file FFTW_Float_Wisdom --fftw-output-double-wisdom-file FFTW_Double_Wisdom
where the measure levels correspond to:

0: FFTW_ESTIMATE
1: FFTW_MEASURE
2: FFTW_MEASURE|FFTW_PATIENT
3: FFTW_MEASURE|FFTW_PATIENT|FFTW_EXHAUSTIVE
Level 2 is recommended to find the best plan, but it can take several hours to run, especially if you are using threaded schemes.
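The wisdom only needs to be generated once per node type, core count and threading backend; later runs can then load it with the corresponding input options, as the benchmarking script in the Results section does. A minimal sketch, assuming the fixed options are held in a RUNOPTS shell variable and using the placeholder wisdom filenames from above:

# One-off planning run: measure and write single- and double-precision wisdom
pycbc_inspiral $RUNOPTS \
    --fftw-measure-level 2 \
    --fftw-output-float-wisdom-file FFTW_Float_Wisdom \
    --fftw-output-double-wisdom-file FFTW_Double_Wisdom

# Subsequent runs: load the stored wisdom instead of re-planning
pycbc_inspiral $RUNOPTS \
    --fftw-input-float-wisdom-file FFTW_Float_Wisdom \
    --fftw-input-double-wisdom-file FFTW_Double_Wisdom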
Results
The most comprehensive results I have so far are for measure level 2, for a range of core counts on the Westmere (12 HT cores) and Sandy Bridge (16 HT cores) nodes, using either the openmp or pthreads FFTW threading model. Generally, the more cores a single job uses, the quicker that job runs. However, for both types of node using more cores per job is NOT quicker overall than running multiple copies of the process in parallel. These results were obtained from the script below, which starts multiple copies of the process and times how long they all take to finish.
Total cores on node | Cores used per job | Total time for 12 or 16 jobs, FFTW pthreads (s) | Total time for 12 or 16 jobs, FFTW OpenMP (s)
---|---|---|---
12 | 1 | 1857 | 1717
12 | 2 | 2072 | 1940
12 | 3 | 2328 | 2289
12 | 4 | 3096 |
12 | 6 | 3084 | 2436
12 | 12 | 8652 |
16 | 1 | 1372 | 1389
16 | 2 | 1952 | 1374
16 | 4 | 2404 | 1940
16 | 8 | 3840 | 2688
16 | 16 | 5120 |
The script will first calculate the FFTW wisdom if required. It also includes a pause between starting jobs; this is necessary to prevent errors with the runtime compilation of inlined C code, and other resource conflicts.
#!/bin/bash
#PBS -S /bin/bash
#PBS -l walltime=8:00:00
#PROJECT=PR37

module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load python/2.6.6-ligo

FFTW_BASE=/home/spxph/fftw/install
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$FFTW_BASE/lib
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$FFTW_BASE/lib/pkgconfig

WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
FFTW_MEASURE_LEVEL=2
JOBPAUSE=10

. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh

RUNOPTS="--trig-end-time 0 \
    --cluster-method template \
    --bank-file templatebank.xml \
    --strain-high-pass 30 \
    --approximant SPAtmplt \
    --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN \
    --snr-threshold 5.5 \
    --trig-start-time 0 \
    --gps-start-time 1026019572 \
    --chisq-bins 16 \
    --segment-length 256 \
    --low-frequency-cutoff 40.0 \
    --pad-data 8 \
    --sample-rate 4096 \
    --chisq-threshold 10.0 \
    --chisq-delta 0.2 \
    --user-tag FULL_DATA \
    --order 7 \
    --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --processing-scheme cpu:${nodes} \
    --psd-estimation median \
    --maximization-interval 30 \
    --segment-end-pad 64 \
    --segment-start-pad 64 \
    --psd-segment-stride 128 \
    --psd-inverse-length 16 \
    --psd-segment-length 256 \
    --fft-backends fftw \
    --fftw-threads-backend pthreads \
    --verbose"

cd $RUNDIR

HARDWARE=SandyBridge
MAXCPUS=16
echo $PBS_QUEUE
if [ "$PBS_QUEUE" == "serial" ]; then
    HARDWARE=Westmere
    MAXCPUS=12
fi

FFTW_WISDOM_FLOAT=FFTW_Wisdom_Float_${HARDWARE}_${nodes}
FFTW_WISDOM_DOUBLE=FFTW_Wisdom_Double_${HARDWARE}_${nodes}

# Generate the FFTW wisdom if it does not already exist
if [[ ! -f $FFTW_WISDOM_FLOAT || ! -f $FFTW_WISDOM_DOUBLE ]]; then
    echo "Measure job using" 0-$(($nodes-1))
    taskset -c 0-$(($nodes-1)) pycbc_inspiral $RUNOPTS --bank-file small_bank.xml \
        --fftw-measure-level $FFTW_MEASURE_LEVEL \
        --fftw-output-float-wisdom-file $FFTW_WISDOM_FLOAT \
        --fftw-output-double-wisdom-file $FFTW_WISDOM_DOUBLE \
        --output PYCBC_OUTPUT_${HARDWARE}_${nodes}_0.xml.gz
fi

PLANTIME=$SECONDS
echo "Plan took" $PLANTIME

# Start MAXCPUS/nodes copies of the job, each bound to its own set of cores
for ((i = 0; i < $(($MAXCPUS/${nodes})); i++)); do
    echo Job $i using $(($i*${nodes}))-$((($i+1)*${nodes} - 1))
    taskset -c $(($i*${nodes}))-$((($i+1)*${nodes} - 1)) pycbc_inspiral $RUNOPTS \
        --fftw-input-float-wisdom-file $FFTW_WISDOM_FLOAT \
        --fftw-input-double-wisdom-file $FFTW_WISDOM_DOUBLE \
        --output PYCBC_OUTPUT_${HARDWARE}_${nodes}_$i.xml.gz &
    sleep $JOBPAUSE
done
wait

echo "Finished in" $(($SECONDS-$PLANTIME))
echo Total time for $MAXCPUS using $nodes : $((($SECONDS-$PLANTIME)*$nodes))
It was started using these commands:
qsub -q serial -l select=1:ncpus=12 -v nodes=1 prof_options.sub
qsub -q serial -l select=1:ncpus=12 -v nodes=2 prof_options.sub
....
qsub -q workq -l select=1:ncpus=16 -v nodes=1 prof_options.sub
qsub -q workq -l select=1:ncpus=16 -v nodes=2 prof_options.sub
...
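The same submissions can be generated with a small loop; a sketch, assuming the benchmarking script is saved as prof_options.sub and looping over the per-job core counts shown in the table above:

# Sweep over cores-per-job on the Westmere (serial queue) and Sandy Bridge (workq) nodes
for n in 1 2 3 4 6 12; do
    qsub -q serial -l select=1:ncpus=12 -v nodes=$n prof_options.sub
done
for n in 1 2 4 8 16; do
    qsub -q workq -l select=1:ncpus=16 -v nodes=$n prof_options.sub
done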
CUDA
The PyCBC guide gives information on the dependencies and on installing PyCUDA, scikits.cuda, and Mako. Briefly, the steps I used were:
git clone http://git.tiker.net/trees/pycuda.git
cd pycuda
git submodule init
git submodule update
export PATH=$PATH:/usr/local/cuda-6.0/bin
./configure.py
python setup.py build
python setup.py install --user

easy_install --prefix=/home/paul.hopkins/.local/ scikits.cuda
easy_install --prefix=/home/paul.hopkins/.local/ Mako
I then executed the same command as above but using the CUDA processing scheme, optionally selecting the GPU device ID, and removing all FFTW-specific options:
. ~/lalsuite/etc/lal-user-env.sh
. ~/lalsuite/etc/pycbc-user-env.sh

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.0/lib64
export PATH=$PATH:/usr/local/cuda-6.0/bin

RUNOPTS="--trig-end-time 0 --cluster-method template --bank-file templatebank.xml \
    --strain-high-pass 30 --approximant SPAtmplt --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN --snr-threshold 5.5 --trig-start-time 0 \
    --gps-start-time 1026019572 --chisq-bins 16 --segment-length 256 \
    --low-frequency-cutoff 40.0 --pad-data 8 --sample-rate 4096 \
    --chisq-threshold 10.0 --chisq-delta 0.2 --user-tag FULL_DATA \
    --order 7 --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --psd-estimation median --maximization-interval 30 --segment-end-pad 64 \
    --segment-start-pad 64 --psd-segment-stride 128 \
    --psd-inverse-length 16 --psd-segment-length 256 \
    --output PYCBC_OUTPUT.xml.gz --verbose \
    --processing-scheme cuda \
    --processing-device-id 3"

pycbc_inspiral $RUNOPTS

echo "Finished in " $SECONDS
The device ID and information about the available GPUs can be obtained by running the nvidia-smi command, with the CUDA binaries on your PATH:
export PATH=$PATH:/usr/local/cuda-6.0/bin
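For example, to list the available GPUs and their device IDs (a minimal usage sketch of the standard NVIDIA driver tool):

# List the GPUs visible on this node together with their device indices
nvidia-smi -L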
Using an NVIDIA Tesla K10 this completed in 330 seconds.
LALInference Nest
Download the testinj.xml file (wget https://gravity.astro.cf.ac.uk/dokuwiki/_media/public/testinj.xml) and then run the lalinference_nest application:
source $INSTALLDIR/etc/lalinference-user-env.sh

time lalinference_nest --Neff 1000 \
    --psdlength 255 \
    --inj testinj.xml \
    --V1-cache LALVirgo \
    --nlive 1024 \
    --V1-timeslide 0 \
    --srate 1024 \
    --event 3 \
    --seglen 8 \
    --L1-channel LALLIGO \
    --H1-timeslide 0 \
    --trigtime 894377300 \
    --use-logdistance \
    --psdstart 894377022 \
    --H1-cache LALLIGO \
    --progress \
    --L1-timeslide 0 \
    --H1-channel LALLIGO \
    --V1-channel LALVirgo \
    --outfile lalinferencenest-3-V1H1L1-894377300.0-0.dat \
    --L1-cache LALLIGO \
    --randomseed 1778223270 \
    --dataseed 1237 \
    --ifo V1 \
    --ifo H1 \
    --ifo L1 \
    --Nmcmc 100
The runtime can be controlled using the --Nmcmc parameter. Times on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge), using --Nmcmc 100:
Turbo Boost On: 3500 MHz
real    15m17.527s
user    15m14.425s
sys     0m0.404s

Turbo Boost Off: 2700 MHz
real    19m38.241s
user    19m35.629s
sys     0m0.540s
and for --Nmcmc 200:

Turbo Boost On
real    35m40.957s
user    20m12.832s
sys     0m0.334s
And on an ARCCA Sandy Bridge node:

real    23m21.891s
user    23m20.324s
sys     0m0.586s
Haswell node:
Turbo Boost Off
real    20m52.627s
user    20m52.242s
sys     0m0.435s

Turbo Boost Off, icc -O3 -xHost
real    12m43.876s
user    12m43.007s
sys     0m0.363s

and for 24: 13m09s - 13m53s
Heterodyne Pulsar
Using the lalapps_heterodyne_pulsar application and the following script:
cp -R /home/spxph/AcceptanceTestsHeterodyne/* .

. $INSTALLDIR/etc/lalapps-user-env.sh

time lalapps_heterodyne_pulsar --heterodyne-flag 0 --ifo H1 --pulsar J0000+0000 --param-file J0000+0000.par \
    --sample-rate 16384 --resample-rate 1 --filter-knee 0.25 \
    --data-file H_cache.txt --seg-file H_segs.txt --channel H1:LDAS-STRAIN \
    --output-dir H1/931052708-931453056 --freq-factor 2
Running on Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge) for the times 931052708-931453056 gives times of:
Turbo Boost On: 3500 MHz
real    12m43.314s
user    12m0.647s
sys     0m34.930s

Turbo Boost Off: 2700 MHz
real    16m9.190s
user    15m26.190s
sys     0m35.840s
Raven Sandy Bridge node for the times 931052708-931453056:
real    19m17.915s
user    16m28.514s
sys     1m26.912s
Raven Haswell node:
real    16m12.350s
user    13m54.145s
sys     1m12.344s
This command was produced using the following script:
#!/bin/bash

set -e

# Test the heterodyne code on the Raven cluster

export S6_SEGMENT_SERVER=https://segdb.ligo.caltech.edu

# set the data start and end times (giving about 7 days of S6 H1 science mode data)
starttime=931052708
#endtime=931052828
endtime=932601620

obs=H
ftype=H1_LDAS_C02_L2

cache=${obs}_cache.txt
if [ ! -f $cache ]; then
    echo Getting data
    # use ligo_data_find to get the data frame cache file
    /usr/bin/ligo_data_find -o $obs -s $starttime -e $endtime -t $ftype -l --url-type file >> $cache
fi

segfile=${obs}_segs.txt
if [ ! -f $segfile ]; then
    echo Making segments
    # use make_segs.sh script to generate segment list (in format needed by lalapps_heterodyne_pulsar)
    ./make_segs.sh ${obs}1 $starttime $endtime $segfile
fi

# set up stuff for running lalapps_heterodyne_pulsar
DETECTOR=H1
CHANNEL=H1:LDAS-STRAIN
FKNEE=0.25

# sample rates (frames are 16384Hz)
SRATE1=16384
SRATE2=1
SRATE3=1/60

# create a pulsar par file
PSRNAME=J0000+0000
FREQ=245.678910
FDOT=-9.87654321e-12
RA=00:00:00.0
DEC=00:00:00.0
PEPOCH=53966.22281462963
PFILE=$PSRNAME.par
UNITS=TDB

if [ -f $PFILE ]; then
    rm -f $PFILE
fi

echo PSR $PSRNAME > $PFILE
echo F0 $FREQ >> $PFILE
echo F1 $FDOT >> $PFILE
echo RAJ $RA >> $PFILE
echo DECJ $DEC >> $PFILE
echo PEPOCH $PEPOCH >> $PFILE
echo UNITS $UNITS >> $PFILE

if [ $? != "0" ]; then
    echo Error writing parameter file!
    exit 2
fi

# make output directory
mkdir -p $DETECTOR
OUTDIR=$DETECTOR/$starttime-$endtime
mkdir -p $OUTDIR

# run the code in coarse heterodyne mode
EXECCODE=/usr/bin/lalapps_heterodyne_pulsar

echo Performing coarse heterodyne - mode 0 - and outputting to text file
$EXECCODE --heterodyne-flag 0 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE \
    --sample-rate $SRATE1 --resample-rate $SRATE2 --filter-knee $FKNEE \
    --data-file $cache --seg-file $segfile --channel $CHANNEL \
    --output-dir $OUTDIR --freq-factor 2

# set ephemeris files
EEPHEM=/usr/share/lalpulsar/earth00-19-DE405.dat.gz
SEPHEM=/usr/share/lalpulsar/sun00-19-DE405.dat.gz
TEPHEM=/usr/share/lalpulsar/tdb_2000-2019.dat.gz

COARSEFILE=$OUTDIR/coarsehet_${PSRNAME}_${DETECTOR}_${starttime}-${endtime}
echo $COARSEFILE

# run the code in fine heterodyne mode
$EXECCODE --ephem-earth-file $EEPHEM --ephem-sun-file $SEPHEM --ephem-time-file $TEPHEM \
    --heterodyne-flag 1 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE \
    --sample-rate $SRATE2 --resample-rate $SRATE3 --filter-knee $FKNEE \
    --data-file $COARSEFILE --output-dir $OUTDIR --channel $CHANNEL \
    --seg-file $segfile --freq-factor 2 --stddev-thresh 5 --verbose
where the following is the make_segs.sh script:
#!/bin/bash

# small script to use ligolw_segment_query and ligolw_print
# to get a science segment list for a given IFO for E14 and
# output it in segwizard format

# detector
echo $1
det=$1
stime=$2
etime=$3
outfile=$4

if [ $det == H1 ]; then
    segs=H1:DMT-SCIENCE:1
fi

if [ $det == V1 ]; then
    segs=V1:ITF_SCIENCEMODE
fi

if [ $det == L1 ]; then
    segs=L1:DMT-SCIENCE:1
fi

echo $segs

# get ascii segment list
/usr/bin/ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | \
    ligolw_print --table segment --column start_time --column end_time --delimiter " " > ${outfile}_tmp.txt
#ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | grep -v "\[" | ligolw_print --table segment --column start_time --column end_time --delimiter " " > $1segments_tmp.txt

count=1

# read in file and work out length of each segment
while read LINE; do
    i=1
    for j in $LINE; do
        if (( $i == 1 )); then
            tstart=$j
        fi
        if (( $i == 2 )); then
            tend=$j
        fi
        i=$((i+1))
    done

    # get duration of segment
    dur=$(echo $tend - $tstart | bc)

    # output data to file
    echo $count $tstart $tend $dur >> $outfile

    count=$((count+1))
done < ${outfile}_tmp.txt

rm ${outfile}_tmp.txt
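For reference, the driver script above invokes it as follows, with the H1 detector and the S6 start and end times set in that script:

# Produce a segwizard-format science segment list for H1
./make_segs.sh H1 931052708 932601620 H_segs.txt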