HaplotypeCaller IntervalList Performance weirdness.. BQSR involved?

HarmJanWestraHarmJanWestra Broad instituteMember

Dear all,

I am experiencing weirdness when running the HaplotypeCaller using an intervallist (the -L option). I have a dataset of 762 individuals, deeply sequenced for about 800kb, spread over ~800 target regions. I've currently got BAM files split per chromosome, each file containing data for all samples. As such the chromosome 1 BAM file is about 9Gb. To assess whether we would find variants at all, I previously (boldly) ran the HC without the -L option, and all appeared well. Chromosome 1 completed in approx. 29 hours, for all samples, and VCFs were produced.

However, we discussed the possibility of false positive calls outside of the targeted regions, and their possible effect on the GATK error models, and as such, I ventured to include a list of targeted regions through the -L, in order to reduce false positives (I also repeated the BQSR, etc steps using the -L option).

However, I noticed this has a huge effect on the runtime: where the previous run (without the intervallist) finished chromosome 1 in ~29 hours, the HC run with the -L option was still not finished after two weeks (and was not predicted to finish for another 12 days). Other settings were not changed, and runs were performed on the same machine. This strikes me as weird, since the -L option should greatly reduce the number of bases that have to be traversed, and as such, should make the local realignment by HC much faster. The only thing I can think of right now is that the quality scores of the reads have greatly changed as a consequence of using the -L option at BQSR.

I am now trying to rerun the HC using the GVCF mode, using 1 HC command per sample, in the hope that this will improve the performance.

Do you have any suggestions how I could figure out what is going on here?

With kind regards,

Harm-Jan

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Harm-Jan,

    That is interestingly counterintuitive -- we would definitely expect the interval list to help speed things up, not slow them down, since they allow you to skip regions that are of no interest.

    You could plot the distribution of quals in both versions of your recalibrated bams to see if there is any significant difference. Though I think the crucial test would be to run HC both ways (with and without -L) on each input file and see where the differences lie.

  • HarmJanWestraHarmJanWestra Broad instituteMember
    edited January 2015

    Dear Geraldine,

    Thank you for your prompt answer. I am indeed very happy that I've run the HC without the -L option first. I will try to figure out what the base-score quality distribution is of the reads in the intervals before and after using the -L option. My guess is that the quality scores will generally be higher when adding the -L option when performing BQSR. As such there are more reads in each region, and consequently, more work for the HC. However, I assumed that the HC uses subsampling (default ~250 reads/sample), so that should not really matter as well, correct?

    Anyway, I'll report my findings here at a later moment.

    With kind regards,

    Harm-Jan

  • HarmJanWestraHarmJanWestra Broad instituteMember

    Dear Geraldine,

    I've done some tinkering around and finally found that the issue can be resolved by splitting the per chromosome BAM files further, into a file per sample per chromosome (and outputting to GVCF). Now, HC requires about a minute per sample for chromosome 1.

    It appears that for each sample, the contribution to the 9Gb multi-sample BAM file is only about 12 megabytes. I assume that when intervals are supplied, a different BAM file walker is used that does random accesses (one for each interval), as compared to a run without intervals, which would use more or less sequential queries on the BAM file. As such, HC spends most time looking for the 12mb worth of reads for each sample. This high amount of IO greatly slows things down. I just didn't realize the performance penalty would be that large (i.e. ~40x).

    I think it would be nice if this was part of the HC documentation page: it is possible to use multi-sample BAM files, although there is a large performance penalty when using HC.

    Hope this helps,

    Harm-Jan

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi HarmJan,

    That's not how the data is accessed by the program (the same walker is used whether or not you provide intervals -- if you don't, the program assigns some internally).

    The huge improvement you're seeing when you run per-sample in GVCF mode is because the computations that are performed on the data scale exponentially with the number of samples and total read depth. That is why we developed the per-sample GVCF-based workflow, in order to scale up the number of samples that could be analyzed jointly. That's also why it is part of our Best Practices recommendations, which are the docs we most encourage people to read before starting to use the tools...

  • HarmJanWestraHarmJanWestra Broad instituteMember

    I have read the documentation, and found that most tools ran within a proper timeframe when using a multi-sample BAM split over chromosome (as did my first run of the HC). If the HC runs with the same walker after providing an intervalList, I really have no idea why there is such a performance gap. I guess it doesn't really matter anymore.

    Thanks anyway for your help!

Sign In or Register to comment.