====== LIGO code benchmarking ======
===== PyCBC Inspiral =====
These are instructions for checking out and compiling an example LIGO executable and running it on the Raven system.
First, log on to Raven and save the following script somewhere in your home directory. It downloads, builds and installs a basic system ready for profiling; by default everything is installed under your scratch space:
#!/bin/bash
# Pull in required modules
module load gsl/1.15
module load fftw/3.3-avx
module load fftw/3.3-avx-float
module load libframe/8.04-2
module load libmetaio/8.4
module load python/2.6.6-ligo
module load swig/2.0.11
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
CONFIGOPTS="--enable-swig-python --disable-laldetchar --disable-lalburst --disable-lalstochastic --disable-gcc-flags --disable-debug"
mkdir -p $WORKDIR
cd $WORKDIR
# Download LALSuite
if [ ! -d lalsuite ]; then
git clone git://ligo-vcs.phys.uwm.edu/lalsuite.git
fi
# Build LALSuite
cd lalsuite
./00boot
./configure --prefix $INSTALLDIR $CONFIGOPTS
make -j 4
make install
# Build GLUE
cd $WORKDIR/lalsuite/glue
python setup.py install --prefix=$INSTALLDIR
# Source LALSuite for building next parts
. $INSTALLDIR/etc/lalapps-user-env.sh
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
# Download PyCBC
cd $WORKDIR
if [ ! -d pycbc ]; then
git clone git://ligo-vcs.phys.uwm.edu/pycbc.git
fi
# Build PyCBC
cd pycbc
python setup.py install --prefix=$INSTALLDIR
. $INSTALLDIR/etc/pycbc-user-env.sh
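Make the script executable and run it, keeping a log so that any build failures are easy to trace. The filename ''build_pycbc.sh'' below is just an example name for the script above:
# 'build_pycbc.sh' is an example name for the build script saved above
chmod +x build_pycbc.sh
./build_pycbc.sh 2>&1 | tee build_pycbc.log   # keep a log of the build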
Make a directory where you'll do the runs and copy over the required files:
RUNDIR=/scratch/$USER/benchmarking/runs
mkdir $RUNDIR
cd $RUNDIR
cp /home/spxph/BenchmarkingFiles/* .
Then make a job submission script, ''PyCBC_Profile.sh'' with the following text:
#!/bin/bash
#PBS -S /bin/bash
#PBS -q serial
#PBS -l select=1:ncpus=1
#PBS -l walltime=2:00:00
#PROJECT=PR37
module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load fftw/3.3-sse2
module load fftw/3.3-sse2-float
module load python/2.6.6-ligo
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh
cd $RUNDIR
python -m cProfile -o PyCBC_Profile_Report `which pycbc_inspiral` \
--trig-end-time 0 \
--cluster-method template \
--bank-file templatebank.xml \
--strain-high-pass 30 \
--approximant SPAtmplt \
--gps-end-time 1026021620 \
--channel-name H1:FAKE-STRAIN \
--snr-threshold 5.5 \
--trig-start-time 0 \
--gps-start-time 1026019572 \
--chisq-bins 16 \
--segment-length 256 \
--low-frequency-cutoff 40.0 \
--pad-data 8 \
--sample-rate 4096 \
--chisq-threshold 10.0 \
--chisq-delta 0.2 \
--user-tag FULL_DATA \
--order 7 \
--frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
--processing-scheme cpu \
--psd-estimation median \
--maximization-interval 30 \
--segment-end-pad 64 \
--pad-data 8 \
--segment-start-pad 64 \
--psd-segment-stride 128 \
--psd-inverse-length 16 \
--psd-segment-length 256 \
--chisq-delta 0.2 \
--output PYCBC_OUTPUT.xml.gz \
--verbose
Now you should be ready to do the runs. The job loops through the supplied ''templatebank.xml'' file, which contains around 3,000 templates, each of which is processed independently through the data. For a quick test you can use ''small_bank.xml'' instead, which contains only about 50 templates. To run the job, submit it to the PBS queue with ''qsub PyCBC_Profile.sh''. When it has completed you can analyse the profiling output using Python:
import pstats
p = pstats.Stats('PyCBC_Profile_Report')
p.sort_stats('cumulative').print_stats(10)
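After submitting with ''qsub'' as above, the job's progress can be checked with the standard PBS commands, for example:
qstat -u $USER     # list your queued and running jobs
qstat -f JOBID     # full details for a single job (replace JOBID with the id reported by qsub)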
==== Using multiple cores ====
PyCBC can use multiple cores by supplying the options
--processing-scheme cpu:N
--fftw-threads-backend pthreads|openmp
where ''N'' is the number of cores to use and the FFTW threading backend is either ''openmp'' or ''pthreads''. The ''openmp'' backend does not currently work with the FFTW modules on Raven, but I have found that it works with my own FFTW build. It is also necessary to bind each process to the correct CPU cores; I use ''taskset -c 0-M pycbc_inspiral'', where ''M'' is ''N-1'' (i.e. cores ''0'' to ''N-1'' for an ''N''-core job), as in the sketch below.
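For illustration, a 4-core run might look like the following sketch. The shell variable ''$OPTS'' is a placeholder for the remaining ''pycbc_inspiral'' options from ''PyCBC_Profile.sh'' above (without the single-core ''--processing-scheme cpu'' option):
N=4                                     # number of cores for this job
# bind the process to cores 0..N-1 and tell PyCBC/FFTW to use N threads;
# $OPTS is a placeholder for the other pycbc_inspiral options shown above
taskset -c 0-$(($N-1)) pycbc_inspiral $OPTS \
    --processing-scheme cpu:$N \
    --fft-backends fftw \
    --fftw-threads-backend pthreads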
==== Using FFTW Wisdom ====
FFTW can find the quickest methods for computing FFTs and store this information in a file for later use. PyCBC uses the following options to calculate and record the wisdom:
--fftw-measure-level Level
--fftw-output-float-wisdom-file FFTW_Float_Wisdom
--fftw-output-double-wisdom-file FFTW_Double_Wisdom
where the measure levels correspond to
0: FFTW_ESTIMATE,
1: FFTW_MEASURE,
2: FFTW_MEASURE|FFTW_PATIENT,
3: FFTW_MEASURE|FFTW_PATIENT|FFTW_EXHAUSTIVE
Level 2 is recommended to find the best plan, but it can take several hours to run, especially if you are using threaded schemes.
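As a sketch of the intended workflow (the wisdom filenames are arbitrary and ''$OPTS'' again stands for the usual ''pycbc_inspiral'' options), the wisdom is generated once against the small template bank and then read back for the full runs using the corresponding input-wisdom options:
# one-off planning run against the small bank to generate and save wisdom
pycbc_inspiral $OPTS --bank-file small_bank.xml \
    --fftw-measure-level 2 \
    --fftw-output-float-wisdom-file wisdom.float \
    --fftw-output-double-wisdom-file wisdom.double \
    --output PLANNING_RUN.xml.gz
# the full runs then load the stored wisdom instead of re-planning
pycbc_inspiral $OPTS \
    --fftw-input-float-wisdom-file wisdom.float \
    --fftw-input-double-wisdom-file wisdom.double \
    --output PYCBC_OUTPUT.xml.gz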
==== Results ====
The most comprehensive results I have so far are for measure level 2 over a range of core counts on the Westmere (12 HT cores) and Sandy Bridge (16 HT cores) nodes, using either the openmp or pthreads FFTW threading models. Generally, the more cores a single job uses, the quicker that job finishes. However, on both types of node this is NOT quicker overall than running multiple single-core copies of the process in parallel. These results were obtained with the script below, which starts multiple copies of the process and times how long they all take to finish.
^ Total cores on node ^ Cores used per job ^ Equivalent total time for 12 or 16 jobs (s) ^^
^ ^ ^ FFTW Pthreads ^ FFTW OpenMP ^
| 12 | 1 | 1857 | 1717 |
| 12 | 2 | 2072 | 1940 |
| 12 | 3 | 2328 | 2289 |
| 12 | 4 | 3096 | |
| 12 | 6 | 3084 | 2436 |
| 12 | 12 | 8652 | |
| 16 | 1 | 1372 | 1389 |
| 16 | 2 | 1952 | 1374 |
| 16 | 4 | 2404 | 1940 |
| 16 | 8 | 3840 | 2688 |
| 16 | 16 | 5120 | |
The script calculates the FFTW wisdom first if the wisdom files do not already exist. It also pauses between starting jobs; this is necessary to prevent errors from the runtime compilation of inlined C code, and other resource conflicts.
#!/bin/bash
#PBS -S /bin/bash
#PBS -l walltime=8:00:00
#PROJECT=PR37
module load compiler/gnu-4.6.2
export CFLAGS=$CFLAGS" -D__USE_XOPEN2K8"
module load python/2.6.6-ligo
FFTW_BASE=/home/spxph/fftw/install
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$FFTW_BASE/lib
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$FFTW_BASE/lib/pkgconfig
WORKDIR=/scratch/$USER/benchmarking
INSTALLDIR=$WORKDIR/install
RUNDIR=$WORKDIR/runs
FFTW_MEASURE_LEVEL=2
JOBPAUSE=10
. $INSTALLDIR/etc/lal-user-env.sh
. $INSTALLDIR/etc/glue-user-env.sh
. $INSTALLDIR/etc/pycbc-user-env.sh
RUNOPTS="--trig-end-time 0 \
--cluster-method template \
--bank-file templatebank.xml \
--strain-high-pass 30 \
--approximant SPAtmplt \
--gps-end-time 1026021620 \
--channel-name H1:FAKE-STRAIN \
--snr-threshold 5.5 \
--trig-start-time 0 \
--gps-start-time 1026019572 \
--chisq-bins 16 \
--segment-length 256 \
--low-frequency-cutoff 40.0 \
--pad-data 8 \
--sample-rate 4096 \
--chisq-threshold 10.0 \
--chisq-delta 0.2 \
--user-tag FULL_DATA \
--order 7 \
--frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
--processing-scheme cpu:${nodes} \
--psd-estimation median \
--maximization-interval 30 \
--segment-end-pad 64 \
--pad-data 8 \
--segment-start-pad 64 \
--psd-segment-stride 128 \
--psd-inverse-length 16 \
--psd-segment-length 256 \
--chisq-delta 0.2 \
--fft-backends fftw \
--fftw-threads-backend pthreads \
--verbose"
cd $RUNDIR
HARDWARE=SandyBridge
MAXCPUS=16
echo $PBS_QUEUE
if [ "$PBS_QUEUE" == "serial" ]; then
HARDWARE=Westmere
MAXCPUS=12
fi
FFTW_WISDOM_FLOAT=FFTW_Wisdom_Float_${HARDWARE}_${nodes}
FFTW_WISDOM_DOUBLE=FFTW_Wisdom_Double_${HARDWARE}_${nodes}
if [[ ! -f $FFTW_WISDOM_FLOAT || ! -f $FFTW_WISDOM_DOUBLE ]]; then
echo "Measure job using" 0-$(($nodes-1))
taskset -c 0-$(($nodes-1)) pycbc_inspiral $RUNOPTS --bank-file small_bank.xml \
--fftw-measure-level $FFTW_MEASURE_LEVEL \
--fftw-output-float-wisdom-file $FFTW_WISDOM_FLOAT \
--fftw-output-double-wisdom-file $FFTW_WISDOM_DOUBLE \
--output PYCBC_OUTPUT_${HARDWARE}_${nodes}_0.xml.gz
fi
PLANTIME=$SECONDS
echo "Plan took" $PLANTIME
for ((i = 0; i < $(($MAXCPUS/${nodes})); i++)); do
echo Job $i using $(($i*${nodes}))-$((($i+1)*${nodes} - 1))
taskset -c $(($i*${nodes}))-$((($i+1)*${nodes} - 1)) pycbc_inspiral $RUNOPTS \
--fftw-input-float-wisdom-file $FFTW_WISDOM_FLOAT \
--fftw-input-double-wisdom-file $FFTW_WISDOM_DOUBLE \
--output PYCBC_OUTPUT_${HARDWARE}_${nodes}_$i.xml.gz &
sleep $JOBPAUSE
done
wait
echo "Finished in" $(($SECONDS-$PLANTIME))
echo Total time for $MAXCPUS using $nodes : $((($SECONDS-$PLANTIME)*$nodes))
The benchmark jobs were submitted using these commands:
qsub -q serial -l select=1:ncpus=12 -v nodes=1 prof_options.sub
qsub -q serial -l select=1:ncpus=12 -v nodes=2 prof_options.sub
....
qsub -q workq -l select=1:ncpus=16 -v nodes=1 prof_options.sub
qsub -q workq -l select=1:ncpus=16 -v nodes=2 prof_options.sub
...
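Equivalently, the full set of submissions (one per cores-per-job value in the results table above) can be generated with a short loop:
# Westmere nodes (12 cores): cores-per-job values taken from the results table
for n in 1 2 3 4 6 12; do
    qsub -q serial -l select=1:ncpus=12 -v nodes=$n prof_options.sub
done
# Sandy Bridge nodes (16 cores)
for n in 1 2 4 8 16; do
    qsub -q workq -l select=1:ncpus=16 -v nodes=$n prof_options.sub
done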
==== CUDA ====
The [[https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/cuda_install.html|PyCBC guide]] gives information on dependencies and on installing PyCUDA, scikits.cuda, and Mako. Briefly, the steps I used were:
git clone http://git.tiker.net/trees/pycuda.git
cd pycuda
git submodule init
git submodule update
export PATH=$PATH:/usr/local/cuda-6.0/bin
./configure.py
python setup.py build
python setup.py install --user
easy_install --prefix=/home/paul.hopkins/.local/ scikits.cuda
easy_install --prefix=/home/paul.hopkins/.local/ Mako
I then executed the same command as above, but using the CUDA processing scheme, optionally selecting the GPU device id, and removing all FFTW-specific options:
. ~/lalsuite/etc/lal-user-env.sh
. ~/lalsuite/etc/pycbc-user-env.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.0/lib64
export PATH=$PATH:/usr/local/cuda-6.0/bin
RUNOPTS="--trig-end-time 0 --cluster-method template --bank-file templatebank.xml \
--strain-high-pass 30 --approximant SPAtmplt --gps-end-time 1026021620 \
--channel-name H1:FAKE-STRAIN --snr-threshold 5.5 --trig-start-time 0 \
--gps-start-time 1026019572 --chisq-bins 16 --segment-length 256 \
--low-frequency-cutoff 40.0 --pad-data 8 --sample-rate 4096 \
--chisq-threshold 10.0 --chisq-delta 0.2 --user-tag FULL_DATA \
--order 7 --frame-files H-H1_ER_C00_L2-1026019564-3812.gwf \
--psd-estimation median --maximization-interval 30 --segment-end-pad 64 \
--pad-data 8 --segment-start-pad 64 --psd-segment-stride 128 \
--psd-inverse-length 16 --psd-segment-length 256 --chisq-delta 0.2 \
--output PYCBC_OUTPUT.xml.gz --verbose \
--processing-scheme cuda \
--processing-device-id 3"
pycbc_inspiral $RUNOPTS
echo "Finished in " $SECONDS
The device id, and other information about the GPUs, can be obtained by running the command ''nvidia-smi''.
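For example, to list the available GPUs and their device ids before choosing a value for ''--processing-device-id'':
nvidia-smi -L     # one line per GPU, giving its device id and model
nvidia-smi        # fuller status: utilisation, memory and temperature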
Using an NVIDIA Tesla K10, this run completed in 330 seconds.
===== LALInference Nest =====
Download the ''testinj.xml'' {{:public:testinj.xml|file}} (''%%wget https://gravity.astro.cf.ac.uk/dokuwiki/_media/public/testinj.xml%%'')
and then run the ''lalinference_nest'' application:
source $INSTALLDIR/etc/lalinference-user-env.sh
time lalinference_nest --Neff 1000 \
--psdlength 255 \
--inj testinj.xml \
--V1-cache LALVirgo \
--nlive 1024 \
--V1-timeslide 0 \
--srate 1024 \
--event 3 \
--seglen 8 \
--L1-channel LALLIGO \
--H1-timeslide 0 \
--trigtime 894377300 \
--use-logdistance \
--psdstart 894377022 \
--H1-cache LALLIGO \
--progress \
--L1-timeslide 0 \
--H1-channel LALLIGO \
--V1-channel LALVirgo \
--outfile lalinferencenest-3-V1H1L1-894377300.0-0.dat \
--L1-cache LALLIGO \
--randomseed 1778223270 \
--dataseed 1237 \
--ifo V1 \
--ifo H1 \
--ifo L1 \
--Nmcmc 100
The runtime can be controlled using the ''--Nmcmc'' parameter. Times on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge):
Using ''--Nmcmc 100'':
Turbo Boost On: 3500 MHz
real 15m17.527s
user 15m14.425s
sys 0m0.404s
Turbo Boost Off: 2700 MHz
real 19m38.241s
user 19m35.629s
sys 0m0.540s
and for ''--Nmcmc 200'':
Turbo Boost On
real 35m40.957s
user 20m12.832s
sys 0m0.334s
And on an ARCCA Sandy Bridge node:
real 23m21.891s
user 23m20.324s
sys 0m0.586s
Haswell node:
Turbo Boost Off
real 20m52.627s
user 20m52.242s
sys 0m0.435s
Turbo Boost Off: icc -O3 -xHost
real 12m43.876s
user 12m43.007s
sys 0m0.363s
and for 24: 13m09s - 13m53s
===== Heterodyne Pulsar =====
The ''lalapps_heterodyne_pulsar'' application is benchmarked using the following commands:
cp -R /home/spxph/AcceptanceTestsHeterodyne/* .
. $INSTALLDIR/etc/lalapps-user-env.sh
time lalapps_heterodyne_pulsar --heterodyne-flag 0 --ifo H1 --pulsar J0000+0000 --param-file J0000+0000.par \
--sample-rate 16384 --resample-rate 1 --filter-knee 0.25 \
--data-file H_cache.txt --seg-file H_segs.txt --channel H1:LDAS-STRAIN \
--output-dir H1/931052708-931453056 --freq-factor 2
Running on an Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge) for the GPS interval ''931052708-931453056'' gives:
Turbo Boost On: 3500 MHz
real 12m43.314s
user 12m0.647s
sys 0m34.930s
Turbo Boost Off: 2700 MHz
real 16m9.190s
user 15m26.190s
sys 0m35.840s
Raven Sandy Bridge node, for the same interval ''931052708-931453056'':
real 19m17.915s
user 16m28.514s
sys 1m26.912s
Raven Haswell node:
real 16m12.350s
user 13m54.145s
sys 1m12.344s
This command was produced using the following script:
#!/bin/bash
set -e
# Test the heterodyne code on the Raven cluster
export S6_SEGMENT_SERVER=https://segdb.ligo.caltech.edu
# set the data start and end times (giving about 7 days of S6 H1 science mode data)
starttime=931052708
#endtime=931052828
endtime=932601620
obs=H
ftype=H1_LDAS_C02_L2
cache=${obs}_cache.txt
if [ ! -f $cache ]; then
echo Getting data
# use ligo_data_find to get the data frame cache file
/usr/bin/ligo_data_find -o $obs -s $starttime -e $endtime -t $ftype -l --url-type file >> $cache
fi
segfile=${obs}_segs.txt
if [ ! -f $segfile ]; then
echo Making segments
# use make_segs.sh script to generate segment list (in format needed by lalapps_heterodyne_pulsar)
./make_segs.sh ${obs}1 $starttime $endtime $segfile
fi
# set up stuff for running lalapps_heterodyne_pulsar
DETECTOR=H1
CHANNEL=H1:LDAS-STRAIN
FKNEE=0.25
# sample rates (frames are 16384Hz)
SRATE1=16384
SRATE2=1
SRATE3=1/60
# create a pulsar par file
PSRNAME=J0000+0000
FREQ=245.678910
FDOT=-9.87654321e-12
RA=00:00:00.0
DEC=00:00:00.0
PEPOCH=53966.22281462963
PFILE=$PSRNAME.par
UNITS=TDB
if [ -f $PFILE ]; then
rm -f $PFILE
fi
echo PSR $PSRNAME > $PFILE
echo F0 $FREQ >> $PFILE
echo F1 $FDOT >> $PFILE
echo RAJ $RA >> $PFILE
echo DECJ $DEC >> $PFILE
echo PEPOCH $PEPOCH >> $PFILE
echo UNITS $UNITS >> $PFILE
if [ $? != "0" ]; then
echo Error writing parameter file!
exit 2
fi
# make output directory
mkdir -p $DETECTOR
OUTDIR=$DETECTOR/$starttime-$endtime
mkdir -p $OUTDIR
# run the code in coarse heterodyne mode
EXECCODE=/usr/bin/lalapps_heterodyne_pulsar
echo Performing coarse heterodyne - mode 0 - and outputting to text file
$EXECCODE --heterodyne-flag 0 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE --sample-rate $SRATE1 --resample-rate $SRATE2 --filter-knee $FKNEE --data-file $cache --seg-file $segfile --channel $CHANNEL --output-dir $OUTDIR --freq-factor 2
# set ephemeris files
EEPHEM=/usr/share/lalpulsar/earth00-19-DE405.dat.gz
SEPHEM=/usr/share/lalpulsar/sun00-19-DE405.dat.gz
TEPHEM=/usr/share/lalpulsar/tdb_2000-2019.dat.gz
COARSEFILE=$OUTDIR/coarsehet_${PSRNAME}_${DETECTOR}_${starttime}-${endtime}
echo $COARSEFILE
# run the code in fine heterodyne mode
$EXECCODE --ephem-earth-file $EEPHEM --ephem-sun-file $SEPHEM --ephem-time-file $TEPHEM --heterodyne-flag 1 --ifo $DETECTOR --pulsar $PSRNAME --param-file $PFILE --sample-rate $SRATE2 --resample-rate $SRATE3 --filter-knee $FKNEE --data-file $COARSEFILE --output-dir $OUTDIR --channel $CHANNEL --seg-file $segfile --freq-factor 2 --stddev-thresh 5 --verbose
where the following is the ''make_segs.sh'' script:
#!/bin/bash
# small script to use ligolw_segment_query and ligolw_print
# to get a science segment list for a given IFO for E14 and
# output it in segwizard format
# detector
echo $1
det=$1
stime=$2
etime=$3
outfile=$4
if [ $det == H1 ]; then
segs=H1:DMT-SCIENCE:1
fi
if [ $det == V1 ]; then
segs=V1:ITF_SCIENCEMODE
fi
if [ $det == L1 ]; then
segs=L1:DMT-SCIENCE:1
fi
echo $segs
# get ascii segment list
/usr/bin/ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | ligolw_print --table segment --column start_time --column end_time --delimiter " " > ${outfile}_tmp.txt
#ligolw_segment_query -s $stime -e $etime -d --include-segments $segs --query-segments | grep -v "\[" | ligolw_print --table segment --column start_time --column end_time --delimiter " " > $1segments_tmp.txt
count=1
# read in file and work out length of each segment
while read LINE; do
i=1
for j in $LINE; do
if (( $i == 1 )); then
tstart=$j
fi
if (( $i == 2 )); then
tend=$j
fi
i=$((i+1))
done
# get duration of segment
dur=$(echo $tend - $tstart | bc)
# output data to file
echo $count $tstart $tend $dur >> $outfile
count=$((count+1))
done < ${outfile}_tmp.txt
rm ${outfile}_tmp.txt