Mixing Paired-End and Single-End Reads

Hi there,

I am calling SNVs from whole-genome sequencing data using the workflow outlined in the GATK Best Practices. My sample was sequenced in both single-end and (mostly) paired-end lanes, and I would like to use all of these data for SNV calling. Other threads suggest that this should not be a problem, but I was wondering whether there are any specific steps for which I should treat these reads separately (e.g. base recalibration).

Thanks in advance,


  • They should already be separated for base recalibration because they have different RG IDs (they do have different RG IDs, right?).

    I don't mix SE/PE data for variant calling because BWA assigns different MAPQs to their alignments: PE alignments get up to 60, while SE maxes out at 37. This means that when you filter your variants, the MQ annotation (root-mean-square MAPQ) will depend on both the mappability of the region and the proportion of SE:PE data, and will no longer be a useful metric. I can't offhand think of any non-MAPQ metrics that would be affected, so you may be able to get away with just leaving out MQ and MQ0. But those are pretty powerful discriminators, and I wouldn't bet on getting good results without them.
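
    To put a rough number on that, here is a minimal Python sketch of how the site-level MQ would shift with the SE:PE mix alone. It assumes every read at the site maps uniquely and receives the top MAPQ for its type (60 for PE, 37 for SE, the values quoted above), which is a simplification. The function names and the depth of 40 are made up for illustration and are not part of GATK or BWA.

```python
import math


def rms_mapq(mapqs):
    """Root-mean-square of a list of mapping qualities."""
    if not mapqs:
        raise ValueError("need at least one read")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


def mixed_site_mq(depth, se_fraction, pe_mapq=60, se_mapq=37):
    """RMS MAPQ at a site where a given fraction of the reads is single-end.

    Assumption: every read gets the maximum MAPQ for its type, i.e. the
    region is perfectly mappable and only the SE:PE mix varies.
    """
    n_se = round(depth * se_fraction)
    n_pe = depth - n_se
    return rms_mapq([se_mapq] * n_se + [pe_mapq] * n_pe)


if __name__ == "__main__":
    # Same (perfect) mappability throughout; only the SE share changes.
    for frac in (0.0, 0.25, 0.5, 1.0):
        print(f"SE fraction {frac:.2f}: MQ = {mixed_site_mq(40, frac):.1f}")
```

    Under these assumptions MQ slides from 60 down to 37 as the SE share grows, even though nothing about the region's mappability has changed, so a single MQ cutoff would mean different things at sites with different SE:PE coverage.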

