HaplotypeCaller is extremely slow

Hi Everyone,

This is my first time running GATK and doing variant calling.
I have been looking around the internet for a while, but haven't been able to resolve my issue.

I ran the following command to do variant calling with HaplotypeCaller in GATK 4.0.1.1, in order to generate a single VCF from 1,011 samples:

~/third_party/gatk-4.0.1.1/gatk HaplotypeCaller -R ~/data/human_reference_genome/GRCh37_whole_genome.fa -I bams.list -L ~/data/human_genome_annotations/GRCh37_swissport_exome_plus_100.bed --output vcf/combined.vcf

where:

  • bams.list lists 1,011 BAM files, with a total size of 6.9 TB
  • GRCh37_swissport_exome_plus_100.bed contains the coordinates of all protein-coding exons (according to Swiss-Prot) plus 100 bp on each side.

The process has been running for more than 10 days now, but it has only reached position 25,617,351 on chr1. At that pace, I estimate it will take more than 3 years to finish...

What am I doing wrong here? Is it supposed to take that long? Is distributing the work across more computing resources the only way to make it run faster?

By the way, it seems that HaplotypeCaller 4.0.1.1 doesn't accept the num_cpu_threads_per_data_thread argument, which (according to GATK's documentation) is needed to run it multi-threaded. I am using a 40-core machine, so I think I could gain a lot from running with more threads.

Thanks a lot,

Nadav

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @nadavb
    Hi Nadav,

    Running HaplotypeCaller on more than 1000 samples will indeed take a long time. We have a GVCF workflow that should help. You can read about it here. Note that the article was written for GATK3, but all the steps still apply. GATK4 also has a new tool, GenomicsDBImport, that can be used instead of CombineGVCFs; you can read more about it here.

    -Sheila
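
For readers who cannot follow the links, here is a rough sketch of what the GVCF workflow mentioned above could look like with GATK4. It is not taken from the linked articles. The paths, the output directory, the sample map file and the helper function names are all placeholders chosen for illustration. The helpers only print the commands, so nothing runs until you execute the printed lines yourself. The three steps are: one HaplotypeCaller run per BAM in GVCF mode, then GenomicsDBImport to consolidate the GVCFs, then GenotypeGVCFs for joint genotyping.

```shell
#!/usr/bin/env bash
# Sketch only: every path and name below is a placeholder, not from the original post.
GATK="${GATK:-gatk}"                    # gatk launcher (assumed to be on PATH)
REF="${REF:-reference.fa}"              # reference FASTA
INTERVALS="${INTERVALS:-targets.bed}"   # exome target intervals
OUT_DIR="${OUT_DIR:-gvcf}"              # directory for per-sample GVCFs

# Step 1: print the HaplotypeCaller command for ONE BAM, in GVCF mode (-ERC GVCF).
# The output file is named after the BAM file name, which is an assumption here.
gvcf_cmd() {
  local bam="$1"
  local sample
  sample="$(basename "$bam" .bam)"
  printf '%s HaplotypeCaller -R %s -I %s -L %s -ERC GVCF -O %s/%s.g.vcf.gz\n' \
    "$GATK" "$REF" "$bam" "$INTERVALS" "$OUT_DIR" "$sample"
}

# Step 2: print the GenomicsDBImport command.
# $1 = tab-separated "sample<TAB>gvcf path" map file, $2 = interval, $3 = workspace dir.
# Some early 4.0.x releases may accept only one interval per run; check your version.
import_cmd() {
  printf '%s GenomicsDBImport --sample-name-map %s -L %s --genomicsdb-workspace-path %s\n' \
    "$GATK" "$1" "$2" "$3"
}

# Step 3: print the joint-genotyping command that reads from the workspace.
# $1 = workspace dir, $2 = output VCF.
genotype_cmd() {
  printf '%s GenotypeGVCFs -R %s -V gendb://%s -O %s\n' "$GATK" "$REF" "$1" "$2"
}

# Example: list one step-1 command per BAM if a bams.list file is present.
if [ -f bams.list ]; then
  while IFS= read -r bam; do
    if [ -n "$bam" ]; then gvcf_cmd "$bam"; fi
  done < bams.list
fi
```

Because every step-1 command is independent of the others, the printed lines can be handed to a job scheduler or run several at a time on a multi-core machine. That is where most of the speed-up over a single 1,011-sample HaplotypeCaller run would come from. Steps 2 and 3 can likewise be run per chromosome or per interval.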
