HaplotypeCaller speed

filippo Posts: 1 Member
edited September 2012 in Ask the GATK team

Hi,

How does the speed of the HaplotypeCaller usually compare to that of the UnifiedGenotyper?

When I try to use it, it seems to be about 90x slower.

These are the command lines I use:

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget

thanks!


Answers

  • hanalangoallen Posts: 2 Member

    I've noticed the same thing, and was wondering if this is expected or not.

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi there,

    Although those two command lines look essentially the same, they lead to two big differences in the work that's actually done by the HaplotypeCaller (HC) and the UnifiedGenotyper (UG), with the HC doing quite a bit more than the UG:

    1) By default, the Unified Genotyper calls only SNPs, while the Haplotype Caller calls both SNPs and indels. That alone causes a significant difference in speed.

    2) Downsampling currently does not work with the Haplotype Caller. The program is not complaining because the argument is given to the engine, so it is considered valid, but the downsampling is not communicated to the HC, so the HC actually calls on the entire set of reads. If you have high coverage, that will slow the run down significantly. This will be fixed in the future, but for now it is a caveat you need to take into account.
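    To make the timing comparison fairer on the first point, you can ask the UnifiedGenotyper to call indels as well with -glm BOTH. A sketch using the jar and file names from the commands above; the command is only echoed here rather than executed, so adapt it before running:

    ```shell
    # Build the UG command with indel calling enabled (-glm BOTH) so that it
    # does work comparable to the HaplotypeCaller. Paths match the example
    # commands above; the command is echoed rather than run.
    CMD="java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper \
         -glm BOTH -dcov 1000 -I file.bam -o file.vcf -L myTarget"
    echo "$CMD"
    ```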

    Fortunately there are some tweaks you can use to speed up the process; I'll ask the author of the HC to contribute some tips in this thread.

    Geraldine Van der Auwera, PhD

  • rpoplin Posts: 122 GATK Developer

    Hi there,

    The command-line argument that most determines the runtime of the HaplotypeCaller is -minPruning. This argument controls the amount of pruning performed on the local de-novo assembly graph that is generated for every variant region. The default value is -minPruning 1, which means events have to be seen by more than 1 read in order to remain in the graph. This is a very conservative default designed not to miss anything. By raising this threshold, fewer haplotypes need to be evaluated, which reduces the runtime. The optimal value for this parameter depends on your project design, so experimentation is required.
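    A sketch of how one might experiment, with placeholder file names; the loop echoes each command instead of running it, so you can submit the runs however you like:

    ```shell
    # Placeholder paths -- substitute your own jar, reference, BAM, and intervals.
    GATK_JAR=GenomeAnalysisTK.jar
    REF=hg19.fasta
    BAM=file.bam
    TARGETS=myTarget

    # Time a few pruning thresholds; higher values prune more of the assembly
    # graph and run faster, at some possible cost in sensitivity.
    for PRUNE in 1 2 4; do
      CMD="java -Xmx32g -jar $GATK_JAR -T HaplotypeCaller -R $REF -I $BAM \
           -L $TARGETS -minPruning $PRUNE -o hc_minPruning${PRUNE}.vcf"
      echo "$CMD"   # replace echo with an actual invocation (e.g. under 'time')
    done
    ```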

    I hope that helps,

  • mike Posts: 103 Member

    Hi,

    I have the same issue with Illumina exome-seq data; in fact, the HC is more than 80x slower than the UG for me (I call both SNPs and indels in both cases; for the UG I used -glm BOTH). The HC command I used is below:

    java -jar /PATH/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar
    -T HaplotypeCaller -R /PATH/hg19.fa
    -I /PATH/S1.bam -I /PATH/S2.bam -I ... --dbsnp /PATH/dbsnp_135.hg19.vcf -o /PATH/GATK_HTC_AllSamples_V2.vcf -stand_call_conf 50 -stand_emit_conf 10
    -L /PATH/myInterval.bed

    Does my command look OK to you? As you mentioned, high-coverage data slows things down quite a bit, which is what happened to me. My questions are: does the newly released version 2.2-4 add a downsampling feature that works for the HC? Also, for the -minPruning option, is there any general suggestion (e.g., -minPruning 2 or -minPruning 5), if my exome-seq data averages 60x coverage across about 20 samples from the same family? Any suggestion would be greatly appreciated.

    Thanks and best

    Mike

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Mike, the latest version has huge improvements in HC runtimes; check out details and other changes here:

    http://www.broadinstitute.org/gatk/guide/version-history

    Geraldine Van der Auwera, PhD

  • mike Posts: 103 Member

    Thanks for the info, Geraldine! Mike

  • mike Posts: 103 Member

    Hi, Geraldine:

    I did try the new version using command below:

    java -jar /PATH/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /PATH/hg19.fa -I /PATH/S1.bam -I /PATH/S2.bam -I ... --dbsnp /PATH/dbsnp_135.hg19.vcf -o /PATH/GATK_HTC_AllSamples_V2.vcf -stand_call_conf 50 -stand_emit_conf 10 -L /PATH/myInterval.bed

    It is still running right now, but on the same data the UnifiedGenotyper took only 4.6 hours, whereas the HaplotypeCaller has already taken 12 hours on chromosome 1 alone, based on the log file (it is still running and I'm not sure when it will finish). The link you referred me to seems to show a smaller difference between the HaplotypeCaller and UnifiedGenotyper in the newest version, but on my data the difference seems bigger. Any suggestion?

    Thanks

    Mike

  • rpoplin Posts: 122 GATK Developer

    Hi there,

    It looks like you aren't using the minPruning parameter that is mentioned in this thread. The default value is -minPruning 1 which means every event seen by more than 1 read remains in the assembly graph and has to be evaluated. This is a very conservative default designed not to miss anything.

    Cheers,

  • mike Posts: 103 Member

    Hi, Ryan:

    Thanks for the suggestion. Yes, I am aware of that trick, as mentioned in my initial post above, in which I asked for advice on using the -minPruning option: is there any general suggestion (e.g., -minPruning 2 or -minPruning 5), if my exome-seq data averages 60x coverage across about 20 samples from the same family? Any suggestion would be greatly appreciated.

    Also, the link Geraldine pointed to (http://www.broadinstitute.org/gatk/guide/version-history) shows a big difference in the plot between before and after the improvement; was that benchmarked with -minPruning raised as well, or with the default (-minPruning 1)?

    Thanks again for your advice!

    Best

    Mike

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    As Ryan said earlier:

    The optimal value for this parameter depends on your project design and so experimentation is required.

    Unfortunately we can't give you a number, you really do need to try out different values and see what works best for you.

    Use of minPruning still applies with the new version, yes.

    Geraldine Van der Auwera, PhD

  • mike Posts: 103 Member

    Thanks for the info, Geraldine! Mike

  • mike Posts: 103 Member

    Hi,

    I'm coming back to this thread because I still have a big runtime issue with the HaplotypeCaller. I have Illumina HiSeq data from 22 samples, and the command I used is below:

    java -jar /Path/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar
    -T HaplotypeCaller -R /Path/hg19.fa
    -I /Path/S1.bam -I /Path/S2.bam -I ...... --dbsnp /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -o /Path/GATK_HTC_AllSamples.vcf -stand_call_conf 50 -stand_emit_conf 10
    -minPruning 10 -L /Path/029720_D_BED_20101013.bed

    As you can see, I already used -minPruning 10 (the default is 1); however, the job has already run for about 100 hours, is still on chr2, and is expected to take another 3.9 weeks to finish. For the same set of samples, the UnifiedGenotyper took only 48 hours.

    Does my command look reasonable, and what else can I do to increase the speed? Shall I keep increasing -minPruning to 20 or more, or is there some other reasonable larger number here?

    Thanks again for your help. You can imagine how painful it is to wait so long for a result that now seems almost unreachable. My secondary thought is: given how much the HaplotypeCaller callset costs in time, how much benefit can be gained from it? Unfortunately, I have not yet been able to get the HaplotypeCaller callset, so I cannot compare its results with the UG callset.

    Thanks again for your help!

    Mike

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    One option -- if you are willing to experiment -- is to enable the experimental downsampler and set the dcov value to something like 10. This is likely to help with regions of excessive depth. If you do use this, please let us know how it performs; we intend to turn on the new downsampler by default in the upcoming GATK 2.3 release.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • mike Posts: 103 Member

    Hi, Mark:

    Thanks a lot for the suggestion!

    In fact I was wondering about that too, but I noticed that for the UG we can use -dcov 200 (per the documentation page), while for the HaplotypeCaller, as documented at the URL below, the example command doesn't even show it, which is why I eliminated it from my command:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html

    Shall I use -dcov as with the UG (-dcov 200), or is it better to start from what you suggested, -dcov 10?

    Thanks again for the suggestion, very anxious to try.

    Mike

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    The name of the argument is --enable_experimental_downsampling, and I'd use -dcov 10 because it means that no more than 10 reads starting at the exact same position will be included in the analyzed data. The HC is fundamentally different from the UG, so the downsampling has to use a new approach. Let me know how it goes. I suspect this will speed things up dramatically, but note there may be some bugs (please do report them; we are going live with this as the only option for GATK 2.3, so we need to sort out bugs now).

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • mike Posts: 103 Member

    Hi, Mark:

    Thanks for the info! I've already started running it, and it seems OK so far. Just to make sure, here is the command I used. (By the way, when I used -h to check, I could not see the option --enable_experimental_downsampling, but when I run it, it runs OK.)

    java -jar /opt/nasapps/stow/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Path/hg19.fa
    -I /Path/S1.bam -I /Path/S2.bam -I ..... --dbsnp /SeqIdx/gatkdb/bundle-1.5/hg19/dbsnp_135.hg19.vcf -o /Path/GATK_HTC_AllSamples_minPrun10.vcf -stand_call_conf 50 -stand_emit_conf 10
    -minPruning 10 --enable_experimental_downsampling -dcov 10 -L /Path/029720_D_BED_20101013.bed

    The output of the log file seems OK, as below:

    ...
    INFO 08:55:30,316 GenomeAnalysisEngine - Strictness is SILENT
    INFO 08:55:30,524 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 10 Using Experimental Downsampling
    INFO 08:55:30,530 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 08:55:30,626 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.09
    INFO 08:55:30,661 RMDTrackBuilder - Loading Tribble index from disk for file /SeqIdx/gatkdb/bundle-1.5/hg19/dbsnp_135.hg19.vcf
    WARN 08:55:30,846 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A
    INFO 08:55:49,855 GenomeAnalysisEngine - Processing 51542882 bp from intervals
    INFO 08:55:49,875 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 08:55:49,875 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining
    INFO 08:58:01,550 ProgressMeter - chr1:16527 1.90e+03 2.2 m 19.2 h 0.0% 14.0 w 14.0 w
    INFO 08:59:02,522 ProgressMeter - chr1:35604 1.90e+04 3.2 m 2.8 h 0.0% 4.5 w 4.5 w
    INFO 09:00:50,460 ProgressMeter - chr1:267011 2.80e+04 5.0 m 3.0 h 0.0% 3.2 w 3.2 w
    INFO 09:01:50,526 ProgressMeter - chr1:721274 4.40e+04 6.0 m 2.3 h 0.0% 15.1 d 15.1 d
    INFO 09:03:21,698 ProgressMeter - chr1:861203 4.77e+04 7.5 m 2.6 h 0.0% 16.5 d 16.5 d
    INFO 09:04:27,018 ProgressMeter - chr1:911757 1.47e+05 8.6 m 58.8 m 0.1% 10.6 d 10.6 d

    The estimated time quickly jumped from 14.0 weeks to only 10.6 days now. Looking promising!

    I hope my command is OK; let me know if my understanding is correct. Thanks a lot for your great help!

    Best

    Mike

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    @mike said: (By the way, when I used -h to check, I could not see the option --enable_experimental_downsampling, but when I run it, it runs OK)

    That's because experimental features are hidden from the public until we have tested them thoroughly and we know that they work well and won't blow up in people's faces...

    Geraldine Van der Auwera, PhD

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    I think everything looks good, and it's good to hear that things are progressing well. Let us know how the calls look. I suspect everything will be fine. Allowing 10 reads per start position for 100 bp reads is like allowing up to 1000x coverage at each position.
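    The 1000x figure is just reads-per-start-position times read length; a quick sketch of the arithmetic:

    ```shell
    # With -dcov 10, at most 10 reads may share any given start position.
    # A 100 bp read covers a given site if it starts at any of the 100
    # positions ending at or before that site, so per-site depth is bounded by:
    READS_PER_START=10
    READ_LENGTH=100
    MAX_DEPTH=$((READS_PER_START * READ_LENGTH))
    echo "up to ${MAX_DEPTH}x coverage at any one site"
    ```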

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • mike Posts: 103 Member

    Great! Thanks a lot for your help, Mark and Geraldine! I will get back to you after I check out the result, Thanks again, Mike

  • mike Posts: 103 Member

    Hi, Mark and Geraldine:

    This is the second time I am running the above command. I posted the first piece of the log earlier; one thing I've noticed now seems odd, as shown in the log below. The first few lines, from yesterday, say only 4-5 days are needed:

    INFO 08:55:49,875 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 08:55:49,875 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining
    INFO 08:58:01,550 ProgressMeter - chr1:16527 1.90e+03 2.2 m 19.2 h 0.0% 14.0 w 14.0 w
    INFO 08:59:02,522 ProgressMeter - chr1:35604 1.90e+04 3.2 m 2.8 h 0.0% 4.5 w 4.5 w
    INFO 09:00:50,460 ProgressMeter - chr1:267011 2.80e+04 5.0 m 3.0 h 0.0% 3.2 w 3.2 w
    INFO 09:01:50,526 ProgressMeter - chr1:721274 4.40e+04 6.0 m 2.3 h 0.0% 15.1 d 15.1 d
    INFO 09:03:21,698 ProgressMeter - chr1:861203 4.77e+04 7.5 m 2.6 h 0.0% 16.5 d 16.5 d
    INFO 09:04:27,018 ProgressMeter - chr1:911757 1.47e+05 8.6 m 58.8 m 0.1% 10.6 d 10.6 d
    INFO 09:05:27,097 ProgressMeter - chr1:1003133 2.28e+05 9.6 m 42.2 m 0.1% 8.0 d 8.0 d
    INFO 09:06:27,408 ProgressMeter - chr1:1148298 2.92e+05 10.6 m 36.4 m 0.1% 7.2 d 7.2 d
    INFO 09:07:28,753 ProgressMeter - chr1:1228831 3.45e+05 11.6 m 33.8 m 0.1% 6.7 d 6.7 d
    INFO 09:08:28,916 ProgressMeter - chr1:1309986 5.28e+05 12.7 m 24.0 m 0.2% 5.5 d 5.5 d
    INFO 09:09:31,528 ProgressMeter - chr1:1421064 5.81e+05 13.7 m 23.6 m 0.2% 5.4 d 5.4 d
    INFO 09:11:06,005 ProgressMeter - chr1:1550959 6.42e+05 15.3 m 23.8 m 0.2% 5.5 d 5.5 d
    INFO 09:12:09,510 ProgressMeter - chr1:1580370 7.07e+05 16.3 m 23.1 m 0.2% 5.4 d 5.4 d
    INFO 09:13:10,437 ProgressMeter - chr1:1597095 7.38e+05 17.3 m 23.5 m 0.2% 5.7 d 5.7 d
    INFO 09:14:35,881 ProgressMeter - chr1:1652884 7.86e+05 18.8 m 23.9 m 0.2% 5.7 d 5.7 d
    INFO 09:15:38,656 ProgressMeter - chr1:1656732 7.95e+05 19.8 m 24.9 m 0.2% 6.0 d 6.0 d
    INFO 09:16:38,658 ProgressMeter - chr1:1749231 8.25e+05 20.8 m 25.2 m 0.2% 6.0 d 6.0 d
    INFO 09:17:38,660 ProgressMeter - chr1:1959495 8.75e+05 21.8 m 24.9 m 0.3% 5.8 d 5.8 d
    INFO 09:18:38,661 ProgressMeter - chr1:2332368 9.46e+05 22.8 m 24.1 m 0.3% 5.5 d 5.5 d
    INFO 09:20:05,190 ProgressMeter - chr1:2457819 1.05e+06 24.3 m 23.0 m 0.3% 5.5 d 5.4 d
    INFO 09:21:08,328 ProgressMeter - chr1:2980577 1.12e+06 25.3 m 22.6 m 0.3% 5.3 d 5.3 d
    INFO 09:22:08,329 ProgressMeter - chr1:3431347 1.23e+06 26.3 m 21.4 m 0.4% 5.0 d 5.0 d
    INFO 09:23:08,830 ProgressMeter - chr1:3739617 1.33e+06 27.3 m 20.5 m 0.4% 4.8 d 4.8 d
    INFO 09:24:17,050 ProgressMeter - chr1:3812604 1.37e+06 28.5 m 20.7 m 0.4% 4.9 d 4.9 d
    INFO 09:25:17,050 ProgressMeter - chr1:6158685 1.43e+06 29.5 m 20.6 m 0.4% 4.8 d 4.8 d
    INFO 09:26:17,151 ProgressMeter - chr1:6259503 1.51e+06 30.5 m 20.2 m 0.4% 4.7 d 4.7 d
    INFO 09:27:17,153 ProgressMeter - chr1:6556446 1.70e+06 31.5 m 18.5 m 0.5% 4.5 d 4.4 d
    INFO 09:29:24,225 ProgressMeter - chr1:6694470 1.77e+06 33.6 m 19.0 m 0.5% 4.5 d 4.5 d
    ...

    Then, after a 1-day run (30 hours), it says it needs 12.8 more days to complete, and it is still on chr1 (see below):

    ...
    INFO 14:55:06,768 ProgressMeter - chr1:213134427 2.58e+07 30.0 h 69.9 m 8.9% 14.0 d 12.8 d
    INFO 14:56:06,852 ProgressMeter - chr1:213180394 2.58e+07 30.0 h 69.9 m 8.9% 14.0 d 12.8 d
    INFO 14:57:06,853 ProgressMeter - chr1:213434901 2.58e+07 30.0 h 69.9 m 8.9% 14.0 d 12.8 d
    INFO 14:58:12,617 ProgressMeter - chr1:214500861 2.58e+07 30.0 h 69.9 m 8.9% 14.0 d 12.8 d
    INFO 14:59:13,334 ProgressMeter - chr1:214551159 2.58e+07 30.1 h 69.9 m 8.9% 14.0 d 12.8 d

    Looks like still a long way to go. Any idea or suggestions?

    Thanks again

    Mike

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    Yes, this is what I'd expect. There's a lot of coverage in the centromere, and passing through the centromere inflates the estimated runtime for the whole genome.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • mike Posts: 103 Member

    Interesting point. Thanks for the info, Mike.

  • mike Posts: 103 Member

    Hi, Mark and Geraldine:

    Sorry, I may still need help from you with the HaplotypeCaller. I have two issues with the run I posted above. It finally stopped at chr2 with the following error message in the log file:

    INFO 16:22:19,504 ProgressMeter - chr2:88071405 3.47e+07 79.4 h 2.3 h 12.7% 3.7 w 3.3 w
    INFO 16:23:38,504 ProgressMeter - chr2:88071672 3.47e+07 79.5 h 2.3 h 12.7% 3.7 w 3.3 w
    INFO 16:35:00,657 ProgressMeter - chr2:88076001 3.47e+07 79.7 h 2.3 h 12.7% 3.7 w 3.3 w

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.2-4-g4a174fb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
    ERROR ------------------------------------------------------------------------------------------

    Problem 1: as you can see from the log, after 79.7 hours the run was still on chr2, with 3.3 weeks still expected, which is not feasible for a routine project like this. Is there any more aggressive way to speed it up?

    Problem 2: as the error message says, I have a memory issue. However, I am not sure how much memory I need here. I have only 17 samples/BAM files, and with the UnifiedGenotyper I never needed to specify more memory than the default. Any suggestion? Would -Xmx20g be OK? Each of my BAM files is around 10G after the phase I recalibration etc.

    Mike

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    For 1 -- the best thing is to scatter/gather your jobs. If you cut the genome into 100 pieces, this will finish in just a few days. As for 2, it's possible that more downsampling would be valuable, but this is likely a problem with the HC generating a region too big for it to handle. Ryan / Joel -- is there some upper limit on the size of an active region? It's probably necessary to put one into the engine...

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • mike Posts: 103 Member

    Hi, Mark:

    Thanks a lot for the input! When you say "scatter gather", you mean split the reads from my BAM files, e.g. by chromosome, into subsets of BAM files, then call the SNPs separately on each chromosome, and then combine the per-chromosome VCF files together, right? If I do that, I don't have to change any settings in my command? Sounds like a useful idea.

    Is this idea generally feasible and reasonable for all GATK steps? I know the VQSR step needs (or prefers) more variants to build its model. It sounds like GATK could help by implementing a way to split one big process into multiple parallel processes, handling the reads from each chromosome in parallel.

    Thanks again

    Mike

  • thibault Posts: 19 GATK Developer

    There is an upper limit on the active region size. It's set per walker, and the size is 300 bases for HaplotypeCaller.

    Scatter/Gather is the technique that Queue uses to split up large tasks in the way you're describing. If you create a Queue script which runs a walker with a scatter count of 100, it will split your inputs into 100 pieces (Scatter), run the walker on each piece in parallel, and combine the results into a single output file (Gather).
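    For those not using Queue, the same scatter/gather pattern can be sketched with standard shell tools; everything below (file names, chunk count, the echoed HaplotypeCaller command) is illustrative only:

    ```shell
    # Scatter: round-robin the target intervals into N chunk files, one per job.
    N=4
    BED=targets.bed
    # Toy interval list standing in for a real capture BED.
    printf 'chr1\t100\t200\nchr1\t500\t600\nchr2\t10\t90\nchr2\t300\t400\nchr3\t1\t50\n' > "$BED"
    awk -v n="$N" '{ print > ("targets.chunk" (NR-1) % n ".bed") }' "$BED"

    # One HaplotypeCaller job per chunk (echoed here; submit to your scheduler).
    i=0
    while [ "$i" -lt "$N" ]; do
      echo "java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa \
            -I sample.bam -L targets.chunk$i.bed -o chunk$i.vcf"
      i=$((i + 1))
    done
    # Gather: afterwards, concatenate the per-chunk VCFs back into one callset.
    ```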

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

  • mike Posts: 103 Member

    Hi, Joel:

    Thanks for the info about Queue. I haven't tried Queue yet; it seems very useful in cases like this. I will give it a shot.

    I don't quite follow what you mentioned about the upper limit on the active region size; how does that impact the HaplotypeCaller?

    Thanks again,

    Mike

  • rpoplin Posts: 122 GATK Developer

    It means that the work unit for the HaplotypeCaller is a region with a maximum size of 300 bp.

  • nikmal Posts: 23 Member
    edited April 2013

    I'm using GATK 2.4-9 and getting an insane estimated run time for HaplotypeCaller. I have 7 reduced BAM files, whole genomes (Illumina HiSeq, 30-40x), running with -minPruning 5 and -dcov 200. This is run on Ubuntu 12.04 on a PC with an Intel Xeon @ 6 x 3.47 GHz and 24 GB RAM. I'm also giving Java all available memory with -Xmx24G.

    After running for about 16 hours, I get an estimated run time of 18.1 weeks! I know that this is a demanding analysis, but that kind of time is just ridiculous.

    Is there anything I can do to speed this up and also be confident that I get as good results as possible? I mean, increasing -minPruning might decrease run time but this might also lead to a lower quality call set, right?

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi Nikmal,

    Ouch, that's a long time.

    Well, keep in mind that the value of minPruning you can afford to use (without excessive effect on call quality) is to a large extent relative to the coverage you expect throughout your data. If your coverage is very consistently above, say, 50, then you can comfortably increase minPruning to up to maybe 10 if you assume events that occur at 20% in a diploid are probably junk anyway. But if your coverage is not very consistent and you have alternating regions of high and low coverage, then you have to be more careful because in low coverage regions, a high value of minPruning may cost you real calls. So for this it's important to know how good your data is, because it tells you how robust it will be to harsher parameter values.
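    A rough way to quantify that trade-off (illustrative numbers only, matching the 50x / minPruning 10 example above):

    ```shell
    # At a given depth, a pruning threshold of minPruning reads roughly
    # corresponds to discarding events supported by fewer than
    # (minPruning / depth) of the reads at a site.
    DEPTH=50
    MIN_PRUNING=10
    PERCENT=$((100 * MIN_PRUNING / DEPTH))
    echo "minPruning $MIN_PRUNING at ${DEPTH}x prunes events below ~${PERCENT}% allele fraction"
    ```

    In a low-coverage region of, say, 15x, the same threshold would prune anything below ~67%, which is why high minPruning values can cost real calls where coverage dips.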

    If you're nervous about playing with minPruning, you can instead turn up the downsampling, which is more robust to coverage issues. Note that as explained earlier in this thread, the downsampling applied to the HC is different from regular downsampling, so you can use a much lower value, e.g. 10 instead of 200.

    Finally, you can also look into using scatter/gather to parallelize execution of the HC.

    Geraldine Van der Auwera, PhD

  • nikmal Posts: 23 Member

    Hi Geraldine,

    Thank you for explaining and thanks for the helpful tips! I'll play around with minPruning and dcov and hopefully get a shorter execution time. :)

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Glad to help, @nikmal. FYI, the impending 2.5 release is going to be quite a lot faster, and I am told 2.6 will be "blazingly fast" :)

    By the way, to nuance my earlier recommendations, try getting to the speed you need by adjusting the dcov first, since comparatively speaking it is safer than increasing minPruning, which is best used conservatively.

    Geraldine Van der Auwera, PhD

  • ameynert Posts: 5 Member

    The --enable-experimental-downsampling flag is no longer allowed in version 2.5.2 - is the HaplotypeCaller-specific downsampling now enabled through the usual -dcov/-dt parameters?

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Yes, that's correct -- details of how it works are in the version highlights for release 2.3 (or 2.4).

    Geraldine Van der Auwera, PhD

  • trgall Posts: 13 Member

    I have ~100 exomes (62 Mbp target) at ~75x coverage. Using 1024 cores, the UnifiedGenotyper took about 5 hours. Given the increased memory requirements of the HaplotypeCaller, I can only run 4 threads per 24 GB node instead of 24 threads, so runtimes will already be at least 6x longer than the UG. I am on a PBS Pro system, so I can't use Queue, but I ran some practice runs. I broke the 10 Mb of the HLA region into 40 parts, each covering about 15,000 bp of exons. It took 80 core-hours to run in total (~8 hr/Mbp), so I was expecting about 25,000 core-hours (~150 core-weeks) to process the entire dataset. On the cluster I was hoping to finish within 4 weeks.

    I am using -minPruning 4 and -dcov 2, and many of the threads are estimating between 4-12 hours/Mbp, but some are estimating several days/Mbp, with several weeks for their segment (and these are all <10 Mbp!). As I start 4 threads per node and can't use more nodes until all threads are finished, this will mean I am running many 24-core nodes with a single thread. Is there a way to tell the HaplotypeCaller to process up to a certain time, instead of a segment size? Otherwise I will be wasting many nodes and waiting many weeks for the job to finish.

  • Mark_DePristo Posts: 153 Administrator, GATK Developer

    This is really the current worst case for HC performance. Internally we use the HC for deep WGS projects at near production scale (such as <50 WGS samples), but many deep exomes really don't perform very well at all yet. Our target is to improve performance on exomes for 2.7. Until then, I'd recommend you stay with the UG for such projects.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • modi2020 Posts: 15 Member
    edited October 2013

    Dear Geraldine,

    I am using GATK v2.6-5 to call variants on three genomes. With two genomes it takes only 90 hours, but when I add the third it runs forever, and the estimated running time keeps increasing (it was up to 12 days before I stopped it). I tried to speed it up with the -dcov/-dt parameters, setting -dcov to 10, but it refused to run and insisted the value should be at least 200. When I resorted to -minPruning and set it to 7 to increase the speed, it ran much faster! I understand, though, that minPruning should be used with caution.

    My understanding is that minPruning 7 means events have to be seen by more than 7 reads in order to remain in the graph. Does that mean that, for a region to remain in the graph, it has to be present in 7 reads in each of the three genomes?

    My other question is: is minPruning 7 safe for coverages of 22x, 24x and 17.8x?

    I would appreciate your help with these questions.

    Thank you!

    @Geraldine_VdAuwera said: Glad to help, nikmal. FYI, the impending 2.5 release is going to be quite a lot faster, and I am told 2.6 will be "blazingly fast" :)

    By the way, to nuance my earlier recommendations, try getting to the speed you need by adjusting the dcov first, since comparatively speaking it is safer than increasing minPruning, which is best used conservatively.

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi @modi2020,

    The graph is constructed with data from all samples, so the pruning requirement is that the event has to be seen N times overall (not per sample). So that gives you a bit of margin when running on multiple samples. That said, at lower coverages you have fewer chances of seeing events, so you should be more cautious with minPruning. We haven't tested this systematically at different coverages so I can't give you a specific answer to your question; you may want to test a few different settings.

    Geraldine Van der Auwera, PhD

  • modi2020 Posts: 15 Member

    Great! I got it. Yeah, I guess I have to experiment with it a bit and see what works best.

    Thank you Geraldine

  • mike_boursnell Posts: 72 Member

    Hi Geraldine. Is there an up-to-date explanation somewhere of the current best practice for speeding up the HaplotypeCaller?

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi @mike_boursnell,

    No, we haven't made this a documentation point yet but perhaps this would be a good idea. I'll add it to my to-do list. In the meantime, the same recommendations apply as previously explained in this thread: minPruning and downsampling at the parameter level, and multithreading/scatter-gather parallelism at the execution level.

    Geraldine Van der Auwera, PhD

  • baesc Posts: 21 Member

    Hi GATK Team, I have a general question about the HaplotypeCaller. I've been testing how long various calling programs take (just wall-clock time -- I'm the only one using our cluster, so I figured the times should be comparable if I only run one job at a time). I compared two chromosomal lengths (5 Mb and 10 Mb) and various numbers of samples. At around 40 samples, the wall time required by the HaplotypeCaller seems to reach a peak (see attached file). Is there any reason for this?

    Thanks for your help! Chris

    Attachment: Figure_7_MultiSampleLength_Runtime.pdf (190K)
  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi @baesc,

    It looks like you're feeding in multiple samples at a time to HaplotypeCaller, which is the old/deprecated way of doing multisample analysis with GATK. We have a new workflow to do this that is much faster and has additional benefits like being able to add samples to your analysis cohort incrementally as sequences become available. Have a look at this doc: http://www.broadinstitute.org/gatk/guide/article?id=3893

    Geraldine Van der Auwera, PhD
