Speeding up HaplotypeCaller


I've started to call variants on 200 diploid individuals with a genome size of 160Mb, using HaplotypeCaller and 20 cores. In about 10 hours it's covered the first 600,000 bases, so it looks like completion will take several months. I could potentially use up to 64 cores although this would inhibit other users.
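The "several months" estimate can be checked with a quick extrapolation. This is only a rough sketch: it assumes the rate seen so far (600,000 bases in 10 hours) stays constant across the whole 160 Mb, which real runs rarely do. The function name `estimate_hours` is made up for illustration.

```shell
# Extrapolate total runtime from the progress reported so far,
# assuming a constant processing rate (an assumption, not a measurement).
estimate_hours() {
    # $1 = genome size in bases, $2 = bases covered so far, $3 = hours elapsed
    awk -v g="$1" -v d="$2" -v h="$3" 'BEGIN { printf "%.0f\n", g / d * h }'
}

hours=$(estimate_hours 160000000 600000 10)
echo "$hours hours, about $((hours / 24)) days"
```

With the figures from the post this comes to roughly 2,667 hours, or about 111 days.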

My input parameters are as below. They are probably pretty loose. I plan on hard-filtering variants after this.

Do you have any advice on how to speed this up?



    --heterozygosity 0.01 \
    --indel_heterozygosity 0.001 \
    -stand_call_conf 31 \
    -stand_emit_conf 31 \
    -mbq 10 \
    -gt_mode DISCOVERY \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Please show the full command line. Are you running HC on all the bam files together?

  • Blue (Member)

    Hi Gera, yes, I'm running HC on all 222 bam files together.


    #$ -N HAPCALL
    #$ -pe openmp 20
    #$ -S /bin/sh
    #$ -cwd
    #$ -j y
    #$ -q bioinf.q

    . /etc/profile.d/modules.sh
    module load gatk/3.2.2 jre/1.7.0_25
    GenomeAnalysisTK -nct 20 --analysis_type HaplotypeCaller \
    --reference_sequence ../../reference_sequences/dmel/v6.0/dm6.fa \
    --input_file lhm_rg_bams.list \
    --heterozygosity 0.01 \
    --indel_heterozygosity 0.001 \
    -stand_call_conf 31 \
    -stand_emit_conf 31 \
    -mbq 10 \
    -gt_mode DISCOVERY \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    --out LHm_RG_raw.vcf

    And a bit from the log:

    Executing as [email protected] on Linux 2.6.32-358.14.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15.

  • pdexheimer (Member)

    If I recall correctly, I could only ever run HC over about 70 or 80 exomes (genome size ~50 Mb) at a time. This is one of the reasons the gVCF workflow is so great: it dodges this kind of issue altogether.

    It's been a couple of years, but I think my case at the time was ~150 exomes that I had to split into 2 batches in order to successfully run with HC, and it still took about a week. When GATK 3 came out with gVCFs, my calling time dropped to something like 8 hours.
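For readers unfamiliar with the gVCF workflow mentioned above, here is a minimal dry-run sketch of its two steps in GATK 3.x: one HaplotypeCaller job per BAM with `--emitRefConfidence GVCF`, then a single GenotypeGVCFs run over the resulting files. It only prints the commands. The reference path, intervals, `GenomeAnalysisTK` wrapper and final output name come from the command posted earlier in the thread. The helper names, the `<sample>.g.vcf` naming and the example BAM names are assumptions for illustration. The two `--variant_index_*` options are the ones early GATK 3 releases required alongside GVCF mode; check them against the documentation for your version.

```shell
#!/bin/sh
# Dry run: prints the GATK 3.x commands rather than executing them.
REF=../../reference_sequences/dmel/v6.0/dm6.fa   # reference path from the post
INTERVALS="-L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM"

# Step 1: per-sample HaplotypeCaller in GVCF mode.
# The <sample>.g.vcf output name is a hypothetical convention.
hc_gvcf_cmd() {
    bam=$1
    sample=$(basename "$bam" .bam)
    echo "GenomeAnalysisTK --analysis_type HaplotypeCaller" \
         "--reference_sequence $REF --input_file $bam" \
         "--emitRefConfidence GVCF" \
         "--variant_index_type LINEAR --variant_index_parameter 128000" \
         "$INTERVALS --out $sample.g.vcf"
}

# Step 2: joint genotyping across all per-sample gVCFs.
genotype_cmd() {
    cmd="GenomeAnalysisTK --analysis_type GenotypeGVCFs --reference_sequence $REF"
    for bam in "$@"; do
        cmd="$cmd --variant $(basename "$bam" .bam).g.vcf"
    done
    echo "$cmd --out LHm_RG_raw.vcf"
}

# Two hypothetical BAM names; in practice, loop over lhm_rg_bams.list.
hc_gvcf_cmd sampleA.bam
hc_gvcf_cmd sampleB.bam
genotype_cmd sampleA.bam sampleB.bam
```

Because each step 1 job handles a single sample, the jobs can be submitted to the cluster independently instead of as one 20-core run. The other options from the original command (heterozygosity, confidence thresholds, `-mbq`) are omitted here for brevity and would still need to be carried over.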

