HaplotypeCaller is extremely slow
This is my first time running GATK and doing variant calling.
I have been looking around the internet for a while, but haven't been able to resolve my issue.
I ran the following command to call variants with HaplotypeCaller using GATK 18.104.22.168, in order to generate a single VCF from 1,011 samples:
~/third_party/gatk-22.214.171.124/gatk HaplotypeCaller -R ~/data/human_reference_genome/GRCh37_whole_genome.fa -I bams.list -L ~/data/human_genome_annotations/GRCh37_swissport_exome_plus_100.bed --output vcf/combined.vcf
- bams.list lists 1,011 BAM files, with a total size of 6.9 TB.
- GRCh37_swissport_exome_plus_100.bed contains the coordinates of all protein-coding exons (according to Swiss-Prot) plus 100 bp on each side.
The process has been running for more than 10 days now, but it has only reached position 25,617,351 on chr1. At that pace, I estimate it will take more than 3 years to finish...
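For reference, that estimate is just a linear extrapolation from the progress so far. The sketch below reproduces it, assuming a round genome length of about 3.1 billion bp (my own approximation, not something GATK reports) and assuming the target regions are spread evenly along the genome, which is certainly not exactly true for exons. The function name is made up for illustration.

```shell
# Rough linear extrapolation of total runtime from progress so far.
# ASSUMPTION: 3100000000 bp is a round figure for the genome length.
# ASSUMPTION: the work is spread evenly along the genome.
# Arguments: elapsed days, position reached (bp), total length (bp).
estimate_days() {
  # Integer division is good enough for an order-of-magnitude estimate.
  echo $(( $1 * $3 / $2 ))
}

estimate_days 10 25617351 3100000000   # prints 1210
```

1210 days is roughly 3.3 years, which is where the "more than 3 years" figure comes from.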
What am I doing wrong here? Is it supposed to take this long? Is distributing the work across more computing resources the only way to make it run faster?
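To be concrete about what I mean by distributing: something like the sketch below, which splits the BED file by chromosome and prints one HaplotypeCaller command per piece so that the pieces could be run as separate jobs. I have not tried this. The function names, the output directory and the per-chromosome VCF names are hypothetical, and I am assuming the resulting per-chromosome VCFs could be combined afterwards. The commands use only the arguments from my original command.

```shell
# Hypothetical helper: write one BED file per chromosome into an output dir.
# Arguments: input BED file, output directory.
split_bed_by_chrom() {
  mkdir -p "$2"
  # Column 1 of a BED line is the chromosome name; skip header/comment lines.
  awk -v dir="$2" \
    '!/^(track|browser|#)/ && NF >= 3 { print > (dir "/" $1 ".bed") }' "$1"
}

# Hypothetical helper: print (not run) one command per per-chromosome BED.
# Argument: directory produced by split_bed_by_chrom.
print_scatter_commands() {
  for part in "$1"/*.bed; do
    chrom=$(basename "$part" .bed)
    echo "gatk HaplotypeCaller -R ~/data/human_reference_genome/GRCh37_whole_genome.fa -I bams.list -L $part --output vcf/$chrom.vcf"
  done
}

# Example usage (paths as in my command above; bed_parts is a made-up name):
#   split_bed_by_chrom ~/data/human_genome_annotations/GRCh37_swissport_exome_plus_100.bed bed_parts
#   print_scatter_commands bed_parts
```

Is this the right direction, or is there a better approach for this many samples?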
By the way, it seems that HaplotypeCaller version 126.96.36.199 doesn't accept the argument num_cpu_threads_per_data_thread, which (according to GATK's documentation) is needed to run it multi-threaded. I am using a 40-core machine, so I think I could gain a lot from running with more threads.
Thanks a lot,