HaplotypeCaller is slower when restricted to intervals

Hi
I have been timing different steps in my DNA exome pipeline, and I was surprised to find that HaplotypeCaller slows down when restricted to a target area. I am using version 3.4-46, and these are the options and timings:
-T HaplotypeCaller -nct 16 -L SeqCap_EZ_Exome_v3_primary_targets.interval_list --interval_padding 100 -R ucsc.hg19.fasta -I 100441501280.sorted.markdup.realigned.recal.bam --genotyping_mode DISCOVERY --dbsnp dbsnp_138.hg19.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o VERSION_3.4-46_WITH_L.vcf.gz
15.3h

-T HaplotypeCaller -nct 16 -R ucsc.hg19.fasta -I 100441501280.sorted.markdup.realigned.recal.bam --genotyping_mode DISCOVERY --dbsnp dbsnp_138.hg19.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o VERSION_3.4-46_WITHOUT_L.vcf.gz
4.7h
Is this normal, or am I doing something wrong? Note that the input was realigned using the same target interval file.

Thanks.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @vang
    Hi,

    This is not normal. Using an intervals list should speed up the runtime. Can you tell me how you pre-processed your data? Did you follow the Best Practices?
    Can you confirm that the issue persists when you do not use -nct 16?

    Thanks,
    Sheila

  • vang (Member ✭✭)

    Thanks Sheila
    Yes, I have followed the Best Practices.

    I have tested with different settings, and it appears that the problem does indeed lie with the -nct setting. I do get a speedup using -L when running with -nct 1. However, combining -L and -nct 16 greatly increases the running time. I tested this both on local disk and over our InfiniBand network, with the same results. We are now looking into a GATK Queue solution.
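
The comparison described in this post (with and without -L, at different -nct values) can be sketched as a small timing harness. This is a minimal sketch, not the poster's actual script: `time_command` and `benchmark_grid` are hypothetical helper names, and the base command is assumed to be whatever HaplotypeCaller invocation you already use.

```python
import subprocess
import time


def time_command(argv):
    """Run one command to completion and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    return completed.returncode, time.perf_counter() - start


def benchmark_grid(base_argv, interval_file, thread_counts):
    """Build the with/without -L command for each -nct value.

    Hypothetical helper: base_argv is your existing HaplotypeCaller command
    without any -L or -nct arguments.
    """
    grid = {}
    for nct in thread_counts:
        threaded = list(base_argv) + ["-nct", str(nct)]
        grid[("with_L", nct)] = threaded + ["-L", interval_file]
        grid[("without_L", nct)] = threaded
    return grid
```

Running each command from the grid one at a time on the same node, with the same input file, keeps the timings comparable; each run would also need its own -o output path, which is left out here.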

  • shawpa (Member)

    I am also having a lot of issues with the run time of HaplotypeCaller. I don't know whether my issue is the same as the one above. I am running the most recent version of GATK. I have 20 exomes and, according to the Best Practices, I should generate a GVCF for each of them with HaplotypeCaller and then do joint genotyping. All of the solutions that I have read involve the -nct and interval options, and I am already using both: a BED file for the intervals and -nct 8. When I run with these options, it says it is going to take 14 days for 1 sample. Is that really how long it should take for 1 exome? Using Queue really doesn't seem viable to me, as I am not a strong programmer and have no idea how to use it. Can you suggest any other options for me? I have tried running without -nct and it says it will take much longer (as I would expect), so I am not sure that is the source of my problem.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    @shawpa What kind of computing infrastructure are you using?

  • shawpa (Member)

    Honestly I can't really say cause I don't exactly understand how all that works. I can at least tell you that I have up to 24 computing nodes on my cluster. Other people are running jobs so I can't use all 24 nodes. When I submit jobs, I tell it to use 8 if I use -nct 8 option. I don't know if that is the info you need.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    I see, that does help a little -- but it's hard to say what speed you can expect, except that that seems pretty slow to me, so my recommendation is to contact the IT support staff at your institution and ask if they can check that the server is running correctly. Sometimes nodes freeze up or slow down, and rebooting the server can fix that. If it turns out that the server is already running as fast as it can, then consider using data parallelization (e.g. running each chromosome as a separate job) to speed up processing. But ultimately it sounds like the infrastructure available to you is going to be fairly limiting. We're working on a cloud-based version of GATK that will help with that sort of problem, but unfortunately it's not ready yet. In the meantime you may just need to be patient. Good luck!
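
The per-chromosome data parallelization suggested above could look roughly like the following. This is a sketch under stated assumptions, not an official recipe: `per_contig_commands` and the output naming are hypothetical, the contig names follow the UCSC hg19 convention used by the reference files in this thread (chrM is left out), and `--interval_set_rule INTERSECTION` is included on the assumption that your GATK 3 version accepts it for combining a contig with a target list, so check it against the documentation for your version. The per-contig GVCFs still need to be combined afterwards, which is not shown.

```python
# UCSC-style hg19 contig names; adjust if your reference uses a different convention.
HG19_CONTIGS = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


def per_contig_commands(base_argv, sample, contigs=HG19_CONTIGS, targets=None):
    """Return one HaplotypeCaller argv per contig (hypothetical helper).

    base_argv is an existing command without -L, -o or -nct arguments.
    If targets is given, each job is restricted to the part of the target
    list that falls on its contig (assumed flag, see the note above).
    """
    commands = []
    for contig in contigs:
        argv = list(base_argv) + ["-L", contig]
        if targets is not None:
            argv += ["-L", targets, "--interval_set_rule", "INTERSECTION"]
        argv += ["-o", f"{sample}.{contig}.g.vcf"]
        commands.append(argv)
    return commands
```

Each returned command is an independent single-threaded job, so it can be handed to whatever scheduler the cluster provides instead of relying on -nct within one process.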

  • shawpa (Member)

    It does seem to be slowed significantly by adding the interval file (about 2x slower for me). If I ran my recalibration with the interval file, would it be okay not to use the interval file for the HaplotypeCaller step? In the end I could just filter the VCF for only variants in my targeted regions. I am actually only interested in SNPs (not indels). I looked in the HaplotypeCaller documentation but there doesn't seem to be an option to tell it to ignore indels. I know it is doing a lot less computational work, but UnifiedGenotyper is only taking 40 minutes.
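
The post-hoc filtering idea above (keep only SNPs inside the targeted regions) can be sketched as follows. This is an illustration only, under several assumptions: it is meant for a final genotyped VCF rather than a GVCF, all function names are hypothetical, and it says nothing about whether skipping -L during calling is advisable, which the thread does not settle. The one detail worth getting right is the coordinate systems: BED intervals are 0-based and half-open, while VCF positions are 1-based.

```python
import bisect


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}."""
    by_chrom = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        by_chrom.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    targets = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        targets[chrom] = merged
    return targets


def in_targets(targets, chrom, pos):
    """True if the 1-based VCF position falls inside a 0-based half-open BED interval."""
    intervals = targets.get(chrom, [])
    pos0 = pos - 1
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and pos0 < intervals[i][1]


def is_snp(ref, alt):
    """True if REF and every ALT allele are single uppercase bases."""
    bases = set("ACGT")
    return ref in bases and all(allele in bases for allele in alt.split(","))


def filter_vcf(lines, targets):
    """Yield header lines plus the records that are on-target SNPs."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if in_targets(targets, chrom, int(pos)) and is_snp(ref, alt):
            yield line
```

Merging the intervals up front lets the lookup use a single binary search per record, even if the BED file contains overlapping targets.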

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    Adding the interval file slows down the processing? That is really unexpected. The opposite should happen. Something sounds wrong. Where does the interval file come from?

  • shawpa (Member)

    It's a bed file of the regions targeted.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    Hmm, that sounds reasonable. Can you post your command line? And are you seeing the same runtimes if you try different samples?

  • vang (Member ✭✭)

    This sounds exactly like my problem (I started this thread). Could you try rerunning the same command with -nct 1? In my case the problem was the parallelization. I ended up using Queue.

  • shawpa (Member)

    I checked the largest (7.9Gb) and smallest (5.6Gb) files that I have. The estimated run times differ only by a couple of decimal points. The following command gives an initial estimate of 8.6 days. I let it run for a few minutes to see whether that time would drop drastically, since I understand it is an estimate. It really doesn't. There might be a slight issue with the computing nodes, so I submitted this command to a different node from the one I was referring to yesterday. Yesterday it was 14 days; today, on a different node, it is 8.6 days. That is a huge improvement but it is still going to take over 6 months to run all the samples this way. I just ran another test without the interval file. The estimate started at 28.5 hours but quickly went up, and seems to have leveled off at about 4.8 days. At the suggestion of @vang (the original poster), I tried without multi-threading and the time increased to 12.3 days. I don't know how to use Queue, so that is not really an option for me.

    java -Xmx10g -jar /shared/bin/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /mnt/DATA/Cores/hiseq2000/annie/reference/Homo_sapiens/UCSC/hg19/hg19.fa -I LG049.recal.bam -o LG049.g.vcf --dbsnp /mnt/DATA/Cores/hiseq2000/annie/reference/Homo_sapiens/UCSC/hg19/dbsnp_138.hg19.vcf --emitRefConfidence GVCF -nct 10 -L AmpliSeqExome.20141113.designed_plus100_merged.bed

    @Geraldine_VdAuwera, ideally how many cores would I need to use to get this under 8 hours per sample? That seems like a reasonable amount of time compared to this.

  • vang (Member ✭✭)

    As a reference, my 20 GB exome takes 14.9 hours in HaplotypeCaller without Queue.

    -T HaplotypeCaller -nct 16 -L SeqCap_EZ_Exome_v3_capture_targets.interval_list --interval_padding 100 -R ucsc.hg19.fasta -I 90044150132500/90044150132500.sorted.markdup.realigned.recal.bam --genotyping_mode DISCOVERY --dbsnp dbsnp_138.hg19.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o 90044150132500/90044150132500.sorted.markdup.realigned.recal.HaplotypeCaller_gVCF.vcf.gz
    

    Maybe your network or disk system is slow?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    I agree, this seems like an infrastructure problem. This is not something we can really help with, sorry.

  • shawpa (Member)

    @Geraldine_VdAuwera I am still trying to figure this out. You said it was an infrastructure problem. Can you please tell me what I would "ideally" need to devote to this to make it work? I used 24 processors per node and specified -nct 24, and after 24 hours it was only 2.2% finished. I am attaching my standard out file because it might contain some information that is useful to you. I have access to more nodes, but since the previous user said he only ran it on 16, I don't know why 24 would be slower. The other file is the standard out from running on 24 processors per node without specifying the -nct option. It ran for about 17 hours and got to 3.5%, so it is marginally faster than running with -nct. I am obviously doing something wrong and really need advice. The way this system is set up, I can specify how many nodes and how many processors per node to use. I used nodes=1:ppn=24 to run this. Would you say it is actually the first part of this (the number of nodes) that needs to be 16 in order for this to work? Your help is greatly appreciated.

    Issue · GitHub (filed by Sheila): issue number 247, state closed, closed by vdauwera.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    Hi there,

    The problem is that compute performance isn't just about the number of nodes you have access to. Other factors also matter such as what kind of nodes they are (speed and memory), where the data is stored relative to the compute, and how fast the disks are. This is not something we can go into detail about because it varies too much and we only have direct experience with the specific hardware/infrastructure we have in our institute. Scientific computing infrastructure is a complicated topic.

    We are currently doing some profiling in order to be able to give people an estimate of how long things should take with some standard configurations; but it's already clear in your case that your jobs seem to be running much more slowly than ours or other users who have provided feedback. Unfortunately, it's not something we can help you with directly. I would recommend consulting your IT team; or there are consultants who specialize in helping people with these kinds of problems.
