LIGO code benchmarking

PyCBC Inspiral

These are instructions for checking out and compiling an example LIGO executable and running it on the Raven system.

First, log on to Raven and save the following script somewhere in your home directory. It will download, build and install a basic system to start profiling; by default everything is installed into your scratch space:

#!/bin/bash
 
# Pull in required modules
module load gsl/1.15
module load fftw/3.3-avx
module load fftw/3.3-avx-float
module load libframe/8.04-2
module load libmetaio/8.4
module load python/2.6.6-ligo
module load swig/2.0.11
 
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
CONFIGOPTS="--enable-swig-python --disable-laldetchar --disable-lalburst --disable-lalstochastic --disable-gcc-flags --disable-debug"
 
mkdir -p $WORKDIR
cd $WORKDIR
 
# Download LALSuite
if [ ! -d lalsuite ]; then
  git clone git://ligo-vcs.phys.uwm.edu/lalsuite.git
fi
 
# Build LALSuite
cd lalsuite
./00boot
./configure --prefix $INSTALLDIR $CONFIGOPTS
make -j 4
make install
 
# Build GLUE
cd $WORKDIR/lalsuite/glue
python setup.py install --prefix=$INSTALLDIR
 
# Source LALSuite for building next parts
. $INSTALLDIR/etc/lalapps-user-env.sh
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
 
# Download PyCBC
cd $WORKDIR
if [ ! -d pycbc ]; then
  git clone git://ligo-vcs.phys.uwm.edu/pycbc.git
fi
 
# Build PyCBC
cd pycbc
python setup.py install --prefix=$INSTALLDIR
. $INSTALLDIR/etc/pycbc-user-env.sh
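
If you saved the script as, say, build_pycbc.sh (the filename is just an example), make it executable and run it from a login node:

chmod +x build_pycbc.sh
./build_pycbc.sh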

Make a directory where you'll do the runs and copy over the required files:

RUNDIR=/scratch/$USER/benchmarking/runs
mkdir $RUNDIR
cd $RUNDIR
cp /home/spxph/BenchmarkingFiles/* .

Then make a job submission script, PyCBC_Profile.sh, with the following contents:

#!/bin/bash
#PBS -S /bin/bash
#PBS -q serial
#PBS -l select=1:ncpus=1
#PBS -l walltime=2:00:00
#PROJECT=PR37
 
module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load fftw/3.3-sse2
module load fftw/3.3-sse2-float
module load python/2.6.6-ligo
 
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
 
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh
 
cd $RUNDIR
 
python -m cProfile -o PyCBC_Profile_Report `which pycbc_inspiral` \
    --trig-end-time 0 \
    --cluster-method template \
    --bank-file templatebank.xml \
    --strain-high-pass 30  \
    --approximant SPAtmplt \
    --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN \
    --snr-threshold 5.5 \
    --trig-start-time 0 \
    --gps-start-time 1026019572  \
    --chisq-bins 16 \
    --segment-length 256 \
    --low-frequency-cutoff 40.0 \
    --pad-data 8 \
    --sample-rate 4096 \
    --chisq-threshold 10.0 \
    --chisq-delta 0.2 \
    --user-tag FULL_DATA \
    --order 7   \
    --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --processing-scheme cpu \
    --psd-estimation median \
    --maximization-interval 30 \
    --segment-end-pad 64 \
    --segment-start-pad 64 \
    --psd-segment-stride 128 \
    --psd-inverse-length 16 \
    --psd-segment-length 256 \
    --output PYCBC_OUTPUT.xml.gz \
    --verbose

Now you should be ready to do the runs. The job is set up to loop through the supplied templatebank.xml file, which contains around 3,000 templates, each of which is processed independently against the data. If you want a quick test, use small_bank.xml instead, which contains only about 50 templates. To run the job, submit it to the PBS queue with qsub PyCBC_Profile.sh. When it has completed you can analyse the profiling output using Python:

import pstats

# Load the profile written by cProfile and print the ten entries
# with the largest cumulative time
p = pstats.Stats('PyCBC_Profile_Report')
p.sort_stats('cumulative').print_stats(10)

Using multiple cores

PyCBC can use multiple cores by supplying the options

    --processing-scheme cpu:N
    --fftw-threads-backend pthreads|openmp

where N is the number of cores to use, and the FFTW threading backend is either the openmp or pthreads model. The openmp backend does not currently work with the FFTW modules on Raven, but I have found that it does work with my own FFTW build. It is also necessary to bind each process to the correct CPU cores; I use taskset -c 0-$((N-1)) pycbc_inspiral, where N is the number of cores.
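
For example, a minimal sketch of a 4-core run, assuming $RUNOPTS holds the same pycbc_inspiral options as in the job script above (with the single-core --processing-scheme option removed):

# Run pycbc_inspiral with 4 FFTW threads, pinned to cores 0-3
N=4
taskset -c 0-$((N-1)) pycbc_inspiral $RUNOPTS \
    --processing-scheme cpu:$N \
    --fftw-threads-backend pthreads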

Using FFTW Wisdom

FFTW can search for the quickest way of computing FFTs of a given size and store this information ("wisdom") in a file for later use. PyCBC uses the following options to calculate and record the wisdom:

    --fftw-measure-level Level
    --fftw-output-float-wisdom-file FFTW_Float_Wisdom
    --fftw-output-double-wisdom-file FFTW_Double_Wisdom

where the measure levels correspond to

    0: FFTW_ESTIMATE,
    1: FFTW_MEASURE,
    2: FFTW_MEASURE|FFTW_PATIENT,
    3: FFTW_MEASURE|FFTW_PATIENT|FFTW_EXHAUSTIVE

Level 2 is recommended to find the best plan, but it can take several hours to run, especially if you are using threaded schemes.
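
For example, a sketch of generating the wisdom once and reusing it on later runs, again assuming $RUNOPTS holds the full set of pycbc_inspiral options (the wisdom filenames are arbitrary; the --fftw-input-*-wisdom-file options are the ones used in the benchmarking script below):

# First run: measure the FFT plans and write out the wisdom (slow at measure level 2)
pycbc_inspiral $RUNOPTS \
    --fftw-measure-level 2 \
    --fftw-output-float-wisdom-file FFTW_Float_Wisdom \
    --fftw-output-double-wisdom-file FFTW_Double_Wisdom

# Later runs: read the stored wisdom instead of re-measuring
pycbc_inspiral $RUNOPTS \
    --fftw-input-float-wisdom-file FFTW_Float_Wisdom \
    --fftw-input-double-wisdom-file FFTW_Double_Wisdom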

Results

The most comprehensive results I have so far are for measure level 2, for a range of core counts per job, on the Westmere (12 HT cores) and Sandy Bridge (16 HT cores) nodes, using either the openmp or pthreads FFTW threading model. Generally, the more cores a job uses the quicker it runs; however, for both types of node it is NOT quicker overall than running multiple copies of the process in parallel. These results were obtained with the script below, which starts multiple copies of the process and times how long they all take to finish.

Times are the total time in seconds for all 12 or 16 jobs on the node; a dash indicates that no measurement is available for that combination.

Total cores on node   Cores used per job   FFTW Pthreads   FFTW OpenMP
12                    1                    1857            1717
12                    2                    2072            1940
12                    3                    2328            2289
12                    4                    3096            -
12                    6                    3084            2436
12                    12                   8652            -
16                    1                    1372            1389
16                    2                    1952            1374
16                    4                    2404            1940
16                    8                    3840            2688
16                    16                   5120            -

The script will calculate the FFTW wisdom first if the wisdom files are not already present. It also pauses between starting jobs; this is necessary to prevent errors from the runtime compilation of inlined C code and other resource conflicts.

#!/bin/bash
#PBS -S /bin/bash
#PBS -l walltime=8:00:00
#PROJECT=PR37
 
module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load python/2.6.6-ligo
 
FFTW_BASE=/home/spxph/fftw/install
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$FFTW_BASE/lib
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$FFTW_BASE/lib/pkgconfig
 
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
FFTW_MEASURE_LEVEL=2
JOBPAUSE=10
 
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh
 
RUNOPTS="--trig-end-time 0 \
    --cluster-method template \
    --bank-file templatebank.xml \
    --strain-high-pass 30  \
    --approximant SPAtmplt \
    --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN \
    --snr-threshold 5.5 \
    --trig-start-time 0 \
    --gps-start-time 1026019572  \
    --chisq-bins 16 \
    --segment-length 256 \
    --low-frequency-cutoff 40.0 \
    --pad-data 8 \
    --sample-rate 4096 \
    --chisq-threshold 10.0 \
    --chisq-delta 0.2 \
    --user-tag FULL_DATA \
    --order 7   \
    --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --processing-scheme cpu:${nodes} \
    --psd-estimation median \
    --maximization-interval 30 \
    --segment-end-pad 64 \
    --segment-start-pad 64 \
    --psd-segment-stride 128 \
    --psd-inverse-length 16 \
    --psd-segment-length 256 \
    --fft-backends fftw \
    --fftw-threads-backend pthreads \
    --verbose"
 
cd $RUNDIR
 
HARDWARE=SandyBridge
MAXCPUS=16
echo $PBS_QUEUE
if [ "$PBS_QUEUE" == "serial" ]; then
    HARDWARE=Westmere
    MAXCPUS=12
fi
 
FFTW_WISDOM_FLOAT=FFTW_Wisdom_Float_${HARDWARE}_${nodes}
FFTW_WISDOM_DOUBLE=FFTW_Wisdom_Double_${HARDWARE}_${nodes}
 
if [[ ! -f $FFTW_WISDOM_FLOAT || ! -f $FFTW_WISDOM_DOUBLE ]]; then
 
  echo "Measure job using" 0-$(($nodes-1))
 
  taskset -c 0-$(($nodes-1)) pycbc_inspiral $RUNOPTS --bank-file small_bank.xml \
      --fftw-measure-level $FFTW_MEASURE_LEVEL \
      --fftw-output-float-wisdom-file $FFTW_WISDOM_FLOAT \
      --fftw-output-double-wisdom-file  $FFTW_WISDOM_DOUBLE \
      --output PYCBC_OUTPUT_${HARDWARE}_${nodes}_0.xml.gz 
fi
 
PLANTIME=$SECONDS
echo "Plan took" $PLANTIME
 
for ((i = 0; i < $(($MAXCPUS/${nodes})); i++)); do
 
    echo Job $i using $(($i*${nodes}))-$((($i+1)*${nodes} - 1))
 
    taskset -c $(($i*${nodes}))-$((($i+1)*${nodes} - 1)) pycbc_inspiral $RUNOPTS \
      --fftw-input-float-wisdom-file $FFTW_WISDOM_FLOAT \
      --fftw-input-double-wisdom-file $FFTW_WISDOM_DOUBLE \
      --output PYCBC_OUTPUT_${HARDWARE}_${nodes}_$i.xml.gz &
 
    sleep $JOBPAUSE
 
done
 
wait
 
echo "Finished in" $(($SECONDS-$PLANTIME))
echo Total time for $MAXCPUS using $nodes : $((($SECONDS-$PLANTIME)*$nodes))

The benchmark sweep was started using commands such as:

qsub -q serial -l select=1:ncpus=12 -v nodes=1 prof_options.sub 
qsub -q serial -l select=1:ncpus=12 -v nodes=2 prof_options.sub
....
qsub -q workq -l select=1:ncpus=16 -v nodes=1 prof_options.sub
qsub -q workq -l select=1:ncpus=16 -v nodes=2 prof_options.sub 
...
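
The full sweep can also be scripted; a sketch using the per-job core counts from the table above:

# Westmere nodes: serial queue, 12 cores per node
for n in 1 2 3 4 6 12; do
    qsub -q serial -l select=1:ncpus=12 -v nodes=$n prof_options.sub
done

# Sandy Bridge nodes: workq, 16 cores per node
for n in 1 2 4 8 16; do
    qsub -q workq -l select=1:ncpus=16 -v nodes=$n prof_options.sub
done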

CUDA

The PyCBC guide gives information on the dependencies and on installing PyCUDA, scikits.cuda, and Mako. Briefly, the steps I used were:

git clone http://git.tiker.net/trees/pycuda.git
cd pycuda
git submodule init
git submodule update
export PATH=$PATH:/usr/local/cuda-6.0/bin
./configure.py
python setup.py build
python setup.py install --user
easy_install  --prefix=/home/paul.hopkins/.local/ scikits.cuda
easy_install  --prefix=/home/paul.hopkins/.local/ Mako
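
Before launching a full run it is worth checking that the GPUs and the Python CUDA stack are visible; a quick sketch (these are generic checks, not part of the PyCBC instructions):

# List the GPUs on the node; the IDs reported here are the ones passed to --processing-device-id
nvidia-smi -L

# Confirm that PyCUDA, scikits.cuda and Mako import cleanly from this environment
python -c "import pycuda.autoinit, scikits.cuda, mako; print 'CUDA Python stack OK'"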

I then executed the same command as above, but using the CUDA processing scheme, optionally selecting the GPU ID, and removing all FFTW-specific options:

. ~/lalsuite/etc/lal-user-env.sh
. ~/lalsuite/etc/pycbc-user-env.sh
 
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.0/lib64
export PATH=$PATH:/usr/local/cuda-6.0/bin
 
RUNOPTS="--trig-end-time 0 --cluster-method template --bank-file templatebank.xml \
    --strain-high-pass 30 --approximant SPAtmplt --gps-end-time 1026021620 \
    --channel-name H1:FAKE-STRAIN --snr-threshold 5.5 --trig-start-time 0 \
    --gps-start-time 1026019572 --chisq-bins 16 --segment-length 256 \
    --low-frequency-cutoff 40.0 --pad-data 8 --sample-rate 4096 \
    --chisq-threshold 10.0 --chisq-delta 0.2 --user-tag FULL_DATA \
    --order 7 --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
    --psd-estimation median --maximization-interval 30 --segment-end-pad 64 \
    --segment-start-pad 64 --psd-segment-stride 128 \
    --psd-inverse-length 16 --psd-segment-length 256 \
    --output PYCBC_OUTPUT.xml.gz --verbose \
    --processing-scheme cuda \
    --processing-device-id 3"
 
pycbc_inspiral $RUNOPTS
 
echo "Finished in " $SECONDS

The device ID and other information about the GPUs can be obtained by running the command nvidia-smi. Using an NVIDIA Tesla K10 this run completed in 330 seconds.

LALInference Nest

Download the testinj.xml file (wget https://gravity.astro.cf.ac.uk/dokuwiki/_media/public/testinj.xml) and then run the lalinference_nest application:

source $INSTALLDIR/etc/lalinference-user-env.sh
time lalinference_nest --Neff 1000 \
                  --psdlength 255 \
                  --inj testinj.xml \
                  --V1-cache LALVirgo \
                  --nlive 1024 \
                  --V1-timeslide 0 \
                  --srate 1024 \
                  --event 3 \
                  --seglen 8 \
                  --L1-channel LALLIGO \
                  --H1-timeslide 0 \
                  --trigtime 894377300 \
                  --use-logdistance \
                  --psdstart 894377022 \
                  --H1-cache LALLIGO \
                  --progress \
                  --L1-timeslide 0 \
                  --H1-channel LALLIGO \
                  --V1-channel LALVirgo \
                  --outfile lalinferencenest-3-V1H1L1-894377300.0-0.dat \
                  --L1-cache LALLIGO \
                  --randomseed 1778223270 \
                  --dataseed 1237 \
                  --ifo V1 \
                  --ifo H1 \
                  --ifo L1 \
                  --Nmcmc 100

The runtime can be controlled using the --Nmcmc parameter. Times on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge), using --Nmcmc 100:

Turbo Boost On: 3500 MHz
real    15m17.527s
user    15m14.425s
sys     0m0.404s

Turbo Boost Off: 2700 MHz 
real    19m38.241s
user    19m35.629s
sys     0m0.540s

and for --Nmcmc 200:

Turbo Boost On
real    35m40.957s
user    20m12.832s
sys     0m0.334s

And on an ARCCA Sandy Bridge node:

real    23m21.891s
user    23m20.324s
sys     0m0.586s

Haswell node:

Turbo Boost Off
real    20m52.627s
user    20m52.242s
sys     0m0.435s

Turbo Boost Off: icc -O3 -xHost
real    12m43.876s
user    12m43.007s
sys     0m0.363s

and for 24: 13m09s - 13m53s

Heterodyne Pulsar

Using the lalapps_heterodyne_pulsar application and the following commands:

cp -R /home/spxph/AcceptanceTestsHeterodyne/* .
. $INSTALLDIR/etc/lalapps-user-env.sh
time lalapps_heterodyne_pulsar --heterodyne-flag 0 --ifo H1 --pulsar J0000+0000 --param-file J0000+0000.par \
                               --sample-rate 16384 --resample-rate 1 --filter-knee 0.25 \
                               --data-file H_cache.txt --seg-file H_segs.txt --channel H1:LDAS-STRAIN \
                               --output-dir H1/931052708-931453056 --freq-factor 2

Running on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge) for the GPS interval 931052708-931453056 gives:

Turbo Boost On: 3500 MHz
real    12m43.314s
user    12m0.647s
sys     0m34.930s

Turbo Boost Off: 2700 MHz
real 16m9.190s
user 15m26.190s
sys 0m35.840s

Raven Sandy Bridge node, for the same GPS interval (931052708-931453056):

real    19m17.915s
user    16m28.514s
sys     1m26.912s

Raven Haswell node:

real    16m12.350s
user    13m54.145s
sys     1m12.344s

This command was produced using the following script:

#!/bin/bash
set -e
 
# Test the heterodyne code on the Raven cluster
export S6_SEGMENT_SERVER=https://segdb.ligo.caltech.edu
 
# set the data start and end times (giving about 7 days of S6 H1 science mode data)
starttime=931052708
#endtime=931052828
endtime=932601620
 
obs=H
ftype=H1_LDAS_C02_L2
 
cache=${obs}_cache.txt
if [ ! -f $cache ]; then
   echo Getting data
  # use ligo_data_find to get the data frame cache file
  /usr/bin/ligo_data_find -o $obs -s $starttime -e $endtime -t $ftype -l --url-type file >> $cache
fi
 
segfile=${obs}_segs.txt
if [ ! -f $segfile ]; then
  echo Making segments
  # use make_segs.sh script to generate segment list (in format needed by lalapps_heterodyne_pulsar)
  ./make_segs.sh ${obs}1 $starttime $endtime $segfile
fi
 
# set up stuff for running lalapps_heterodyne_pulsar
DETECTOR=H1
CHANNEL=H1:LDAS-STRAIN
FKNEE=0.25
 
# sample rates (frames are 16384Hz)
SRATE1=16384
SRATE2=1
SRATE3=1/60
 
# create a pulsar par file
PSRNAME=J0000+0000
FREQ=245.678910
FDOT=-9.87654321e-12
RA=00:00:00.0
DEC=00:00:00.0
PEPOCH=53966.22281462963
PFILE=$PSRNAME.par
UNITS=TDB
 
if [ -f $PFILE ]; then
  rm -f $PFILE
fi
 
echo PSR    $PSRNAME > $PFILE
echo F0     $FREQ >> $PFILE
echo F1     $FDOT >> $PFILE
echo RAJ    $RA >> $PFILE
echo DECJ   $DEC >> $PFILE
echo PEPOCH $PEPOCH >> $PFILE
echo UNITS  $UNITS >> $PFILE
 
if [ $? != "0" ]; then
  echo Error writing parameter file!
  exit 2
fi
 
# make output directory
mkdir -p $DETECTOR
OUTDIR=$DETECTOR/$starttime-$endtime
mkdir -p $OUTDIR
 
# run the code in coarse heterodyne mode
EXECCODE=/usr/bin/lalapps_heterodyne_pulsar
echo Performing coarse heterodyne - mode 0 - and outputting to text file
$EXECCODE --heterodyne-flag 0 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE --sample-rate $SRATE1 --resample-rate $SRATE2 --filter-knee $FKNEE --data-file $cache --seg-file $segfile --channel $CHANNEL --output-dir $OUTDIR --freq-factor 2
 
# set ephemeris files
EEPHEM=/usr/share/lalpulsar/earth00-19-DE405.dat.gz
SEPHEM=/usr/share/lalpulsar/sun00-19-DE405.dat.gz
TEPHEM=/usr/share/lalpulsar/tdb_2000-2019.dat.gz
 
COARSEFILE=$OUTDIR/coarsehet_${PSRNAME}_${DETECTOR}_${starttime}-${endtime}
 
echo $COARSEFILE
 
# run the code in fine heterodyne mode
$EXECCODE --ephem-earth-file $EEPHEM --ephem-sun-file $SEPHEM --ephem-time-file $TEPHEM --heterodyne-flag 1 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE --sample-rate $SRATE2 --resample-rate $SRATE3 --filter-knee $FKNEE --data-file $COARSEFILE --output-dir $OUTDIR --channel $CHANNEL --seg-file $segfile --freq-factor 2 --stddev-thresh 5 --verbose

where the following is the make_segs.sh script:

#!/bin/bash
 
# small script to use ligolw_segment_query and ligolw_print
# to get a science segment list for a given IFO for E14 and
# output it in segwizard format
 
# detector
echo $1
det=$1
stime=$2
etime=$3
outfile=$4
 
if [ $det == H1 ]; then
  segs=H1:DMT-SCIENCE:1
fi
 
if [ $det == V1 ]; then
  segs=V1:ITF_SCIENCEMODE
fi
 
if [ $det == L1 ]; then
  segs=L1:DMT-SCIENCE:1
fi
 
echo $segs
 
# get ascii segment list
/usr/bin/ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | ligolw_print --table segment --column start_time --column end_time --delimiter " " > ${outfile}_tmp.txt
#ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | grep -v "\[" | ligolw_print --table segment --column start_time --column end_time --delimiter " " > $1segments_tmp.txt
 
count=1
 
# read in file and work out length of each segment
while read LINE; do
  i=1
  for j in $LINE; do
    if (( $i == 1 )); then
      tstart=$j
    fi
 
    if (( $i == 2 )); then
      tend=$j
    fi
    i=$((i+1))
  done
 
  # get duration of segment
  dur=$(echo $tend - $tstart | bc)
 
  # output data to file
  echo $count $tstart $tend $dur >> $outfile
 
  count=$((count+1))
done < ${outfile}_tmp.txt
 
rm ${outfile}_tmp.txt
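
For reference, the make_segs.sh call made by the driver script above (obs=H, the S6 start and end times, and the H_segs.txt segment file) expands to:

./make_segs.sh H1 931052708 932601620 H_segs.txt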