Joint calling of projects run on different exome-capture platforms

seru, Bergen, Member ✭✭
edited October 2014 in Ask the GATK team

Hi,

We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering whether, and how, GenotypeGVCFs can cope with GVCFs produced on two (or more) different exome capture platforms. Is joint calling of such "heterogeneous" samples advised, or should I rather genotype the small set of identically processed BAMs separately?

The problems I could foresee were:

  • differing targets: each batch of samples will have gaps in coverage because HaplotypeCaller was run restricted to that batch's targets (-L). If this causes problems for GenotypeGVCFs, the haplotypes could be called again on the superset of targets or, more simply, the genotyping could be restricted to the intersection.
  • the alignment algorithm is different (BWA BWTSW vs BWA MEM), so the mapping qualities could potentially differ (I haven't checked). Is this something GATK compensates for, as it does for base qualities?
  • the sequencer and chemistry, but here I hope that BQSR will help remove the variation.
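
For the first point, the intersection of two target sets can be computed directly from the BED files. The following is only a minimal sketch, assuming standard BED input (tab- or space-separated, 0-based half-open coordinates); the function names `parse_bed`, `merge_intervals` and `intersect_targets` are made up for this illustration and are not part of GATK.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals; input must be sorted."""
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def parse_bed(lines):
    """Read BED lines into {chrom: sorted, merged list of (start, end)}."""
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and the usual BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    return {chrom: merge_intervals(sorted(ivs)) for chrom, ivs in by_chrom.items()}


def intersect_targets(a, b):
    """Return the regions covered by both target sets, per chromosome."""
    shared = {}
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        hits = []
        # Two-pointer sweep over the two sorted interval lists.
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                hits.append((start, end))
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
        if hits:
            shared[chrom] = hits
    return shared


# Toy example with two hypothetical capture designs.
platform_a = parse_bed(["chr1\t100\t200", "chr1\t300\t400", "chr2\t0\t50"])
platform_b = parse_bed(["chr1\t150\t350", "chr3\t0\t10"])
shared_targets = intersect_targets(platform_a, platform_b)
```

The result could be written back out as a BED file and passed with -L, so that only positions targeted by both kits are genotyped.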

Any thoughts and comments appreciated,

Thanks,
Paweł

Answers

  • seru, Bergen, Member ✭✭

    Thank you @Sheila for the answer. So MAPQ is not recalibrated at all, and you confirm that this could be an issue. Since so far we have only had BWA-aligned data (different versions and algorithms, though), there is a chance that MAPQ is modelled the same way in both batches. I will check with them. Realignment is probably not an option (there are too many samples), so I would probably have to call them separately.
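
One quick way to check whether two batches differ in MAPQ is to tabulate the fifth SAM column for mapped reads in each batch and compare the distributions. This is a rough sketch only: it assumes tab-separated SAM text as input (for example the output of `samtools view`), and the function names `mapq_histogram` and `fraction_at_or_above` are hypothetical helpers written for this illustration.

```python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count MAPQ values (SAM column 5) over mapped reads."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped, MAPQ is not meaningful
        counts[mapq] += 1
    return counts


def fraction_at_or_above(counts, threshold):
    """Fraction of counted reads with MAPQ >= threshold (0.0 if no reads)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for q, n in counts.items() if q >= threshold) / total


# Toy records standing in for one batch.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
example_counts = mapq_histogram(example_sam)
example_fraction = fraction_at_or_above(example_counts, 20)
```

Running this over a sample of reads from each batch and comparing the histograms, or the fraction above whatever MAPQ cutoff matters downstream, would show whether the two aligners score reads on visibly different scales.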

    But continuing this thread on MAPQ differences between aligners, wouldn't this be a limitation for multisample calling at some point? I mean, combining large batches of samples has become very easy with the new way of calling combined GVCFs. But making sure they are all aligned with the same software (and subsequently preprocessed by GATK) may be quite a burden. I am wondering if it would be possible (I am sure someone on your team has thought of this before) to do a similar "recalibration" of MAPQ scores, as BQSR does for base qualities?

    Thanks,
    Paweł

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    Hi Pawel,

    You're correct that batch effects due to different mapping and preprocessing protocols are becoming a concern. The best way to handle this currently is to reprocess everything the same way, but that is not always feasible due to limited compute resources.

    The good news is that if you use a variant caller that performs local reassembly like HaplotypeCaller, the MAPQs are actually recalculated, at least for the reads that are located within ActiveRegions. So as long as the mappers get the reads in roughly the right area, the caller can compensate for any minor mapping/modeling differences. In light of that, recalibrating MAPQs is not necessary as long as you use a caller that reassembles reads (which is why we haven't put any resources into developing an "MQSR" process).

  • seru, Bergen, Member ✭✭

    Thank you Geraldine. This is good news indeed, as we use HC :)
    So in effect, differing MAPQ models between aligners should not affect the confidence in variant calls (MAPQ is taken into consideration here, AFAIK). But the batch effect could still manifest as differing coverage due to different mapping strategies, in cases where reads are not placed in "roughly the right area".

    Thank you again,
    Paweł

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    Yep, that's correct.
