HaplotypeCaller interval list performance weirdness: BQSR involved?
I am seeing odd behaviour when running HaplotypeCaller (HC) with an interval list (the -L option). I have a dataset of 762 individuals, deeply sequenced over about 800 kb spread across ~800 target regions. My BAM files are currently split per chromosome, each file containing data for all samples, so the chromosome 1 BAM file is about 9 GB. To assess whether we would find variants at all, I previously (boldly) ran HC without the -L option, and all appeared well: chromosome 1 completed in approx. 29 hours for all samples, and VCFs were produced.
However, we discussed the possibility of false-positive calls outside the targeted regions and their possible effect on the GATK error models. I therefore supplied a list of targeted regions through -L in order to reduce false positives (I also repeated the BQSR and other upstream steps using the -L option).
This turned out to have a huge effect on the runtime: where the previous run (without the interval list) finished chromosome 1 in ~29 hours, the HC run with the -L option was still not finished after two weeks (and was not predicted to finish for another 12 days). Other settings were not changed, and both runs were performed on the same machine. This strikes me as weird, since the -L option should greatly reduce the number of bases that have to be traversed and should therefore make the local realignment by HC much faster. The only explanation I can think of right now is that the quality scores of the reads changed substantially as a consequence of using the -L option at BQSR.
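One sanity check on the claim that -L reduces the bases traversed is to measure how much sequence the interval list actually covers. Below is a minimal Python sketch, not a GATK tool. It assumes the list holds either GATK-style `contig:start-end` lines or Picard-style tab-separated lines (contig, start, end, ...), both 1-based and inclusive, with `@` header lines skipped. The function names `parse_interval` and `summarize` are made up for illustration, and BED files (0-based starts) are not handled.

```python
from collections import defaultdict


def parse_interval(line):
    """Return (contig, start, end) for one line, or None for blank/header lines.

    Assumes 1-based inclusive coordinates in either 'contig:start-end'
    (or 'contig:pos') form, or tab-separated 'contig<TAB>start<TAB>end...' form.
    """
    line = line.strip()
    if not line or line.startswith("@"):
        return None
    if "\t" in line:
        contig, start, end = line.split("\t")[:3]
    elif ":" in line:
        contig, _, span = line.rpartition(":")
        start, _, end = span.partition("-")
        if not end:  # single position such as '2:5'
            end = start
    else:
        raise ValueError("unsupported interval line: %r" % line)
    return contig, int(start), int(end)


def summarize(lines):
    """Per contig: raw interval count, count after merging, and bases covered."""
    by_contig = defaultdict(list)
    for line in lines:
        interval = parse_interval(line)
        if interval is not None:
            by_contig[interval[0]].append((interval[1], interval[2]))

    summary = {}
    for contig, intervals in by_contig.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            # Merge intervals that overlap or directly abut the previous one.
            if merged and start <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        summary[contig] = {
            "intervals": len(intervals),
            "merged": len(merged),
            "bases": sum(e - s + 1 for s, e in merged),
        }
    return summary
```

Calling `summarize(open("targets.interval_list"))` (hypothetical file name) and comparing the per-contig `bases` with the expected ~800 kb total would show whether the list is malformed or much larger than intended. It would not by itself explain a slowdown caused by changed base qualities.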
I am now trying to rerun HC in GVCF mode, with one HC command per sample, in the hope that this will improve performance.
Do you have any suggestions for how I could figure out what is going on here?
With kind regards,