Joint genotyping of exomes is extremely slow (part of the germline HaplotypeCaller GVCF pipeline)
I am seeing a severe slowdown during the genotyping stage of the HaplotypeCaller GVCF command series. It is my understanding from the documentation that this step should be rather fast: "This step runs very fast and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem."
However, given 50-100 exomes, the command estimates several weeks until completion, despite being given 64 cores and 256 GB of RAM with unlimited disk space. This seems unrealistically long, especially since the purpose of the GVCF pipeline is that, once a pool of several hundred training exomes has been created, the pool can quickly be joint genotyped together with a new sample exome. As it stands, each time I have a new sample exome I would have to endure another multi-week joint genotyping step.
Can you please advise me as to why my command is taking so long? Any insight is much appreciated. Please find below a copy of my command:
time java -Djava.io.tmpdir=$temp_directory -Xmx192g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R /bundles/b37/human_g1k_v37.fasta \
    (list of all training exomes and the single sample exome goes here) \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -o genotyped.g.vcf -nt 60
# I deactivated the following step since it seems to be unnecessary
# --sample_ploidy 60 \
# (ploidy is set to number of samples per pool * individual sample ploidy)
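To make the placeholder line in the command concrete: in GATK 3, GenotypeGVCFs takes one --variant argument per input gVCF. The following is only a minimal sketch of how such a list could be assembled, not the exact script I ran. The directory and file names are hypothetical and are created as empty files just so the snippet runs, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical directory with one gVCF per exome (empty files, created only
# so this sketch is runnable without real data).
gvcf_dir="$(mktemp -d)"
touch "$gvcf_dir/training_01.g.vcf" \
      "$gvcf_dir/training_02.g.vcf" \
      "$gvcf_dir/new_sample.g.vcf"

# Build one --variant argument per gVCF file found in the directory.
variant_args=()
for gvcf in "$gvcf_dir"/*.g.vcf; do
    variant_args+=(--variant "$gvcf")
done

# Assemble the full command as an array so paths with spaces stay intact.
cmd=(java -Xmx192g -jar /root/Installation/gatk/GenomeAnalysisTK.jar
     -T GenotypeGVCFs
     -R /bundles/b37/human_g1k_v37.fasta
     "${variant_args[@]}"
     -o genotyped.g.vcf -nt 60)

# Print the command instead of running it (GATK is not assumed to be installed).
printf '%s ' "${cmd[@]}"
echo
```

Replacing the final printf with "${cmd[@]}" would run the assembled command for real.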