Question Regarding GATK Doc #6483

SkyWarriorSkyWarrior TurkeyMember ✭✭✭


I just happen to see a strange issue with this document.

In my own practice I always remove adapters at the very beginning (demultiplexing stage) and continue my analyses from fastq to uBAM to mapping and so on.

However recently I received some external data for analysis and realized that there is about 9 percent adapter contamination in fastqs. Looks like adapter cleanup is omitted in the demultiplexing stage.

As a personal preference I am against removing anything from fastq after demultiplexing stage and I am totally against trimmers since they tend to mess up with the order of reads and further complicate debugging of already established pipelines in production.

So I decided to give a try to MarkIlluminaAdapters option since that gives me the option to mark them and rescore them with QV2 therefore they won't interfere with my analyses. Looking at the document #6483 after marking illumina adapters step uBAM is streamed to BWA then streamed to MergeBamAlignment to create a clean bam however those marked adapters with QV2 are totally reverted to their original quality values (QV >30 for most!!!!) at that stage. So I am concerned about this.

Can anyone comment on that from GATK team why do we mark them if the original qualities will be restored anyway?

Am I missing something?

My current practice is to mark the adapters and convert uBAMxt to FASTQ with CLIPPING OPTION 2 and start mapping with this fastq and also generate a second uBAM with these new fastqs that contain the adapter sequences with QV2.

Am I understanding wrong?

Thanks for the help.


  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Any response?

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @SkyWarrior Apologies for the delay! Looking into this and will get back to you as soon as possible!

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    I was able to reach out to the team to get you an explanation for the question you had!

    In the tutorial doc, the MarkIlluminaAdapters stage is used to hard clip the adapter bases before alignment. Lowering the base quality scores was done to add additional visual indication of this clipping. BWA-MEM then ignores those bases in the alignment process, allowing the non-adapter parts of the reads to be aligned correctly. Subsequent to the merge, the original base quality scores are restored. However, the mapped location and mapping quality score are based on the non-adapter bases. In places where the marked bases don't line up with the reference, they will be soft-clipped.

    This is probably what's desired in most cases: the original calls from the sequencer about its confidence in bases is kept, but the adapter sequences are not used in alignment, and can be readily identified by looking at the soft-clips. However, if you want the marked adapter bases to retain their low base quality scores, you need only merge in the bam output from MarkIlluminaAdapters rather than the original unaligned bam.

    Please let us know if you have any follow-up questions!

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    The bamoutput of MarkIlluminaAdapters does not change the QV. Only SamToFastq has that option to change the QV of the adapter sequences so a new bam must be made from new fastqs made after the MarkIlluminaAdapters stage. Also although mapping qualities are based on softclipped bases there is still a potential pitfall that restored adapters may end up in HaplotypeCaller's reassembly causing false positive indels. Low QVs will probably be ignored or eliminated in that process so you will end up with less adapter caused indels.

    I think this was my main concern.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    I have an updated response from the gatk team that I think might help!

    You are correct that the base qualities can be changed in SamToFastq, not MarkIlluminaAdapters. This means you'd have to run FastqToSam and use its output as the unaligned bam for MergeBamAlignment. It's a bit of a pain because you'd have to first use grep to extract the /1 reads and the /2 reads because FastqToSam requires the two ends to be in separate fastq files. It should be mentioned though that BWA-MEM does not look at base quality scores ( so setting them to 2 will not improve the alignment. If you want to fully exclude these bases from downstream analysis, it is probably best to set the clipping action in MarkIlluminaAdapters to remove them entirely ("--CLIPPING_ACTION X").

    Alternatively you could run HaplotypeCaller with "--dont-use-soft-clipped-bases true". However, HaplotypeCaller internally ignores the ends of reads that overhang past the insert (as determined by the aligner), so provided the alignment is done correctly, HaplotypeCaller should be robust against this problem.

Sign In or Register to comment.