HaplotypeCaller (gvcf mode) on whole genome vs chromosome by chromosome

jmcclurejmcclure UMassMedMember, Broadie

I'm currently running my first real use of GATK. I was worried about running HaplotypeCaller on whole geneomes given some of the reports I've seen on these forums about how long it can take to run. In contrast, I was pleasantly surprised with the current GATK it is proceeding well (~7 day estimate on dog wgs). But it seems it could be much faster if I divided it up by chromosome with the -L flag.

I see that the advice is to not use the -L flag for whole genome analysis [1]. But the wording in that link seems open: it is not necessary, but if it would help efficiency it might be worthwhile.

I've found a related question on the forums here [2], but it seems the descrepancy discussed in that thread is suspected to be due to downsampling and not actually the result of a chromosome-by-chromosome use of HaplotypeCaller.

Again, I'm content with a ~7 day run time in order to take proper care of our data. I wouldn't want to sacrifice power or accuracy for a shorter runtime, but if there is really no trade-off, a chromosomal approach would be even better. So I'm curious if there is a downside to partitioning the HaplotypeCaller step by chromosome?

[1] http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals
[2] http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results

Best Answer


Sign In or Register to comment.