Possible fixes needed for GATK analysis pipelines for dealing with adapters

Hi,
Many past threads have discussed whether it is necessary to trim library adapters prior to or during the GATK pipeline. However, false-positive indel calls occur fairly often because HaplotypeCaller uses soft-clipped portions of reads that contain adapter sequence. So I think trimming adapters is appropriate, and it's hard to think of a reason not to trim them (though I'm open to hearing why this might not be a good idea).

Given that trimming adapters is likely beneficial with no downsides, I have attempted to incorporate it into the standard GATK pipelines. However, I have noticed a few issues that prevent this, and that would require input from the GATK team and possibly fixes to the tools:

1) A prior GATK tutorial gave a suggestion for how to incorporate adapter trimming, but this does not effectively do so: https://software.broadinstitute.org/gatk/documentation/article?id=6483
Specifically, the tutorial advises first running MarkIlluminaAdapters, then feeding its output to SamToFastq, piping that to bwa, and finally merging the alignments with the original unmapped BAM using MergeBamAlignment. However, this leaves the final BAM file with the original base qualities, not the qualities that were downgraded over the adapter bases marked by MarkIlluminaAdapters. Because HaplotypeCaller uses soft-clipped portions of reads (unless the --dontUseSoftClippedBases option is chosen, which would not be wise as it leads to loss of sensitivity for true indels), Illumina adapters are still used by HaplotypeCaller as if they were normal sequence.

So there is no easy way to get true adapter trimming such that HaplotypeCaller either sees adapter sequences as low-quality bases or does not see them at all. Simply changing the UNMAPPED_BAM option of MergeBamAlignment to the output of MarkIlluminaAdapters rather than the original unmapped BAM file won't help, because the output of MarkIlluminaAdapters still contains the original base qualities.
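
The effect described in point 1 can be illustrated with a toy model. This is only a sketch of the behavior as reported above, not Picard's implementation: the `Read` class and the `merge_alignment` function are hypothetical stand-ins, and the model assumes that the merge step takes alignment fields from the aligner output but bases and qualities from the unmapped BAM record.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record (not a real SAM/BAM API)."""
    name: str
    seq: str
    quals: List[int]
    cigar: Optional[str] = None  # None means the read is unaligned


def merge_alignment(unmapped: Read, aligned: Read) -> Read:
    """Toy merge: alignment comes from `aligned`, bases/qualities from `unmapped`."""
    if unmapped.name != aligned.name:
        raise ValueError("records must describe the same read")
    return replace(unmapped, cigar=aligned.cigar)


# Original unmapped read: 6 genomic bases followed by 4 adapter bases, all Q30.
original = Read("read1", "ACGTACAGAT", [30] * 10)

# What the aligner saw: adapter bases downgraded to Q2 in the FASTQ,
# and soft-clipped in the resulting alignment (6M4S).
aligned = Read("read1", "ACGTACAGAT", [30] * 6 + [2] * 4, cigar="6M4S")

merged = merge_alignment(original, aligned)
print(merged.cigar, merged.quals)
```

In this model the merged record keeps the 6M4S soft clip, but its adapter bases are back at Q30, which is the situation described above for HaplotypeCaller.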

2) The MergeBamAlignment tool has a CLIP_ADAPTERS option, but there is little documentation about how it works. If this were clarified, it might help address the above issue of how to get the final BAM file to have properly trimmed adapters. Even so, the documentation says it only soft-clips those bases, so it would still have no effect on HaplotypeCaller, which uses soft-clipped bases.

3) bwa itself outputs an XT tag, which unfortunately is the same tag name that MarkIlluminaAdapters uses. If these were different tag names, MergeBamAlignment could potentially be given an option to preserve the XT tag from the MarkIlluminaAdapters output BAM while still preserving the original base qualities, and HaplotypeCaller could be given an option to take the XT tag into account, either by treating those bases as low quality or as completely missing. This would probably be the ideal workflow, but it would require code changes to all of these tools: MarkIlluminaAdapters, MergeBamAlignment, and HaplotypeCaller.

The simple fix for now, then, is to use a different tool to definitively trim the FASTQ files prior to the GATK workflow.
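
To make concrete what trimming the FASTQ files up front means, here is a minimal sketch of 3' adapter trimming on a single read. It is deliberately naive (exact matches only) and is not how any particular trimming tool works; the function name `trim_3prime_adapter`, the `min_overlap` parameter, and the example adapter prefix are assumptions chosen for illustration.

```python
from typing import Tuple

# Example adapter prefix, used here only as an illustrative value.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"


def trim_3prime_adapter(seq: str, qual: str, adapter: str,
                        min_overlap: int = 3) -> Tuple[str, str]:
    """Remove an exact-match 3' adapter (complete, or partial at the read end)."""
    # Case 1: the complete adapter occurs in the read; cut at its first occurrence.
    cut = seq.find(adapter)
    if cut == -1:
        # Case 2: the read ends with a prefix of the adapter; try longest first.
        for overlap in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                cut = len(seq) - overlap
                break
    if cut == -1:
        return seq, qual  # no adapter found, read is returned unchanged
    return seq[:cut], qual[:cut]


# Read with a full adapter appended after 10 genomic bases.
full = trim_3prime_adapter("ACGTACGTAC" + EXAMPLE_ADAPTER, "I" * 23, EXAMPLE_ADAPTER)
# Read that runs into only the first 5 adapter bases.
partial = trim_3prime_adapter("ACGTACGTACAGATC", "I" * 15, EXAMPLE_ADAPTER)
print(full, partial)
```

Because the adapter bases are removed from the FASTQ itself, they can no longer reappear as soft-clipped sequence later in the pipeline. A real trimmer would also need to tolerate sequencing errors in the adapter, which this sketch does not attempt.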

Thanks for any feedback.

Answers

  • SkyWarrior, Turkey, Member ✭✭✭

    MarkIlluminaAdapters followed by SamToFastq can remove all the adapters that you need to get rid of without much hassle.

    Here are two samples that I worked on recently. You need to set the minimum read size after clipping close to the position where you observe adapter contamination within your reads. I kept it at 35 bases, and the final result is pretty clean of any adapter contamination.

    [Images: adapter content before and after clipping]

    As for the adapters keeping their original quality scores after merging under the Best Practices workflow, how you handle the quality scores is up to you. After marking adapter sequences with quality 2, I generate a new uBAM (from the new FASTQs produced by the SamToFastq step) and merge that new uBAM with my SAM file, so my adapters keep the quality value 2.

  • GER, Member

    Yes, that is another workaround. The important point is that the standard GATK pipeline does not do what you have described or what I have described. So most people don't realize that, in the standard GATK pipeline, the HaplotypeCaller step "sees" all the adapters as if they had not been trimmed. This is not crucial for WGS data, which usually has larger insert sizes, but for exome analysis it is important.

  • SkyWarrior, Turkey, Member ✭✭✭

    Best Practices are just recommendations, not something that anyone should follow blindly, as the devs have explained before, I suppose. Everyone has a different way of generating their data, from wet lab to sequencer to analytics platform.
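
The quality-2 marking that SkyWarrior describes in the first reply can be sketched in a few lines. This is an illustration under stated assumptions, not the tools' actual code: the helper name `mask_adapter_qualities` is hypothetical, the XT value is assumed to be the 1-based position where the adapter starts in the read, and qualities are assumed to be Phred+33 encoded as in a standard FASTQ file.

```python
def mask_adapter_qualities(qual: str, xt_start: int,
                           phred: int = 2, offset: int = 33) -> str:
    """Set qualities from 1-based position `xt_start` to the read end to `phred`."""
    if not 1 <= xt_start <= len(qual):
        raise ValueError("xt_start must lie within the read")
    masked_char = chr(phred + offset)  # Q2 is '#' in Phred+33
    keep = xt_start - 1                # number of non-adapter bases to leave as-is
    return qual[:keep] + masked_char * (len(qual) - keep)


# 10-base read whose adapter is assumed to start at position 7 (XT = 7).
masked = mask_adapter_qualities("IIIIIIIIII", 7)
print(masked)
```

A uBAM rebuilt from FASTQ records masked this way carries the Q2 values itself, so merging alignments against it would not bring back the original adapter qualities.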
