Joint genotyping exomes is extremely slow (part of the germline HaplotypeCaller GVCF pipeline)

I am experiencing an incredible slowdown during the genotyping stage of the HaplotypeCaller GVCF command series. My understanding from the documentation is that this step should be rather fast: "This step runs very fast and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem."

However, given 50-100 exomes, the command estimates several weeks until completion, despite being given 64 cores and 256 GB of RAM with unlimited disk space. I'm concerned because this seems unrealistically high, especially since the point of the GVCF pipeline is that, once a pool of several hundred training exomes has been created, it can be quickly joint-genotyped with each new sample exome. As things stand, each time I obtain a new sample exome I would have to endure another multi-week joint genotyping step.

Can you please advise me as to why my command is taking so long? Any insight is much appreciated. Please find below a copy of my command:

    time java -Djava.io.tmpdir=$temp_directory -Xmx192g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R /bundles/b37/human_g1k_v37.fasta \
    (list of -V arguments for all training exomes and the single sample exome goes here) \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -o genotyped.g.vcf -nt 60

    # I removed the following option since it seems to be unnecessary:
    # --sample_ploidy 60 \  # (ploidy is set to number of samples per pool * individual sample ploidy)


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Oof, your samples are 60-ploid? We've never tested that amount of load. The program is designed for the diploid case, and although we know it can handle reasonable non-diploid configurations (that's what the comments you cite are based on), I would expect performance to degrade considerably at such extreme ploidy. Consider the number of possible genotype combinations; this gets you into combinatorial territory that is deeply problematic if the code is not explicitly designed for that case. Don't get me wrong, it is possible to design for this case in a way that makes the problem tractable, but it's not something we've had the bandwidth to do (largely because diploid humans pay our bills).

    It is possible to limit the damage, upstream of joint genotyping, by limiting the number of genotypes we're willing to consider, but at the moment this functionality only exists for HaplotypeCaller, not for GenotypeGVCFs. We have a ticket to extend it, but again, bandwidth limitations stand in our way.

    For now my only workable recommendation is to artificially lower your ploidy; this will limit your sensitivity to events that are somewhat common within your pools (to an extent depending on the value you choose), but at least your analysis will run to completion in a reasonable amount of time. Unfortunately I can't give you an estimate of how much time that would be; you will need to run some empirical tests. Good luck, and apologies for the limitations.
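    To make the combinatorial point concrete: for a site with A alleles and ploidy P, the number of possible unphased genotypes is C(P + A - 1, A - 1). A quick back-of-the-envelope check (the `genotypes` helper below is a hypothetical illustration, not a GATK tool):

```shell
# Number of possible unphased genotypes for ploidy P over A alleles:
# C(P + A - 1, A - 1). Hypothetical helper for illustration only.
genotypes() {
  awk -v p="$1" -v a="$2" 'BEGIN {
    n = p + a - 1; k = a - 1; c = 1
    # multiplicative binomial formula, integer-exact at every step
    for (i = 1; i <= k; i++) c = c * (n - k + i) / i
    printf "%.0f\n", c
  }'
}

genotypes 2 2    # diploid, biallelic site: 3
genotypes 60 2   # 60-ploid, biallelic site: 61
genotypes 60 4   # 60-ploid, 4 alleles: 39711
```

    At ploidy 60, a single site with four alleles already has tens of thousands of candidate genotypes to evaluate, which is why runtime degrades so sharply.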

  • Thanks for the detailed response, Geraldine; however, it may have been predicated on a faulty ploidy setting on my part!

    I am working with human samples, which are diploid. However, my understanding from the documentation for the command was that I should set the ploidy to 2 (because diploid) * the number of samples in my pool (30) = 60. So for this example I have 30 individual human exomes, and therefore set my ploidy to 60. I suspected this might be a mistake, so I also tried simply leaving off the ploidy specification and using the default, with no change in processing time.
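    For reference, each of my per-sample GVCFs was generated with a HaplotypeCaller command along these lines (a simplified sketch, not my exact invocation; BAM names are placeholders):

```shell
# Per-sample GVCF generation (GATK3 syntax). Ploidy is left at the
# default of 2, since each exome comes from one diploid individual;
# --sample_ploidy is only meant for samples that are pools of individuals.
java -Xmx8g -jar /root/Installation/gatk/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /bundles/b37/human_g1k_v37.fasta \
    -I sample01.bam \
    --emitRefConfidence GVCF \
    -o sample01.g.vcf
```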

    Elsewhere in the forums I asked whether having an increased number of exomes similar to my experimental exome would improve the sensitivity of the pipeline, and the answer was that more exomes are better. Furthermore, some of the older Best Practices docs recommended having at least 30 control exomes as training samples to genotype along with the experimental sample. I am therefore assuming that users of GATK for human exome SNP calling will quite frequently process 30 or more exomes from the 1000 Genomes Project and combine them with their single experimental exome. Each time a new experimental exome is obtained, the joint genotyping step has to be repeated. My understanding is that this is why the GVCF standard is now being implemented: it should allow speedy joint genotyping instead of a lengthy reprocessing of all samples at once.

    Please correct me if the above is wrong! If it is accurate, do you have any ballpark estimates for expected processing times on scientific computing clusters for joint genotyping 30 + 1 exomes from the 1000 Genomes Project? I can't imagine that the several-week estimate I'm currently seeing is expected.

    Thanks so much for your assistance.

  • Hello again Geraldine,

    As a follow-up, I updated to the latest version of GATK, rebuilt the indices for all exome GVCFs, and allocated more memory and a larger temp folder; I can now genotype 25 training exomes + 1 sample exome in about 5 hours. However, when I increased to 120 training exomes + 1 sample exome, the command crashed with an out-of-memory error after 3 hours, while processing the first third of chromosome 1 (I had allocated 200 GB of memory). Any thoughts?

  • Sheila (Broad Institute; Member, Broadie, Moderator)


    Have you tried running CombineGVCFs on your GVCFs before inputting them into GenotypeGVCFs? You can try combining 10 GVCFs at a time, then input the combined GVCFs into GenotypeGVCFs.
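    A rough sketch of that batching approach, reusing the jar and reference paths from the original command (the exome and batch file names are placeholders):

```shell
# Combine ~10 GVCFs per batch, then joint-genotype the combined GVCFs.
GATK="java -Xmx32g -jar /root/Installation/gatk/GenomeAnalysisTK.jar"
REF=/bundles/b37/human_g1k_v37.fasta

# Batch 1 (repeat with the next ~10 GVCFs to make batch02.g.vcf, etc.)
$GATK -T CombineGVCFs -R $REF \
    -V exome01.g.vcf -V exome02.g.vcf -V exome10.g.vcf \
    -o batch01.g.vcf

# Joint-genotype the much smaller set of combined GVCFs
$GATK -T GenotypeGVCFs -R $REF \
    -V batch01.g.vcf -V batch02.g.vcf -V batch03.g.vcf \
    -o genotyped.vcf
```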

    In GATK4, there is a new tool called GenomicsDBImport, which will help to reduce the runtime as well. Have a look at this article for more information. Note: the tool is still being worked on and has some kinks, which are being looked into as we speak :smile:
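    For reference, the GATK4 workflow looks roughly like this (GATK4 command-line syntax differs from GATK3; the workspace path, interval list, and file names below are placeholders):

```shell
# Import the cohort GVCFs into a GenomicsDB workspace once...
gatk GenomicsDBImport \
    -V exome01.g.vcf.gz -V exome02.g.vcf.gz \
    --genomicsdb-workspace-path exome_db \
    -L intervals.list

# ...then joint-genotype directly against the workspace.
gatk GenotypeGVCFs \
    -R human_g1k_v37.fasta \
    -V gendb://exome_db \
    -O genotyped.vcf.gz
```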

