Possible fixes needed for GATK analysis pipelines for dealing with adapters
Many past threads have discussed whether it is necessary to trim library adapters prior to or during the GATK pipeline. However, false positive indel calls are not uncommon, because HaplotypeCaller uses soft-clipped portions of reads that contain adapter sequence. So I think trimming adapters is appropriate, and it's hard to think of a reason not to (though of course I'm open to hearing why it might not be a good idea).
Given that trimming adapters is likely beneficial with no obvious downsides, I have attempted to incorporate it into the standard GATK pipelines. However, I have run into a few issues that currently prevent this and that would need input, and possibly tool fixes, from the GATK team:
1) An earlier GATK tutorial suggests a way to incorporate adapter trimming, but it does not actually achieve this: https://software.broadinstitute.org/gatk/documentation/article?id=6483
Specifically, the tutorial advises first running MarkIlluminaAdapters, feeding its output to SamToFastq, piping that to bwa, and then merging the alignments with the original unmapped BAM using MergeBamAlignment. However, this leaves the final BAM file with the original base qualities, i.e. those from before the adapter bases marked by MarkIlluminaAdapters had their qualities downgraded. Because HaplotypeCaller uses soft-clipped portions of reads (unless the --dontUseSoftClippedBases option is chosen, which would not be wise as it reduces sensitivity for true indels), Illumina adapters are still used by HaplotypeCaller as if they were normal sequence.
So there is no easy way to get true adapter trimming, such that HaplotypeCaller sees adapter sequences either as low-quality bases or not at all. Simply pointing the UNMAPPED_BAM option of MergeBamAlignment at the output of MarkIlluminaAdapters rather than the original unmapped BAM file won't help, because the output of MarkIlluminaAdapters still contains the original base qualities.
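To make the two desired outcomes concrete, here is a minimal Python sketch of what "low quality" versus "completely trimmed" would mean for a single read. It is only an illustration, not how any GATK or Picard tool is implemented. It assumes the XT value is the 1-based position where the adapter starts in the read, and the function names and the low-quality value of 2 are my own choices for the example.

```python
def mask_adapter(seq, quals, xt, low_qual=2):
    """Keep all bases, but set qualities to low_qual from the adapter start onward.

    xt is assumed to be the 1-based position of the first adapter base.
    """
    if not 1 <= xt <= len(seq):
        raise ValueError("xt must be within the read")
    start = xt - 1  # convert to a 0-based index
    masked = list(quals[:start]) + [low_qual] * (len(quals) - start)
    return seq, masked


def trim_adapter(seq, quals, xt):
    """Remove the adapter bases and their qualities entirely."""
    if not 1 <= xt <= len(seq):
        raise ValueError("xt must be within the read")
    start = xt - 1
    return seq[:start], list(quals[:start])


if __name__ == "__main__":
    read = "ACGTAGAT"      # hypothetical read, adapter assumed to start at base 5
    quals = [30] * len(read)
    print(mask_adapter(read, quals, 5))
    print(trim_adapter(read, quals, 5))
```

Either result would keep adapter bases from being treated as normal sequence. The problem described above is that neither one survives into the final merged BAM.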
2) The MergeBamAlignment tool has a CLIP_ADAPTERS option, but there is very little documentation on how it works. If this were clarified, it might help address the issue above of getting the final BAM file to have properly trimmed adapters. That said, the documentation does say that it only soft-clips those bases, so it would still have no effect on HaplotypeCaller, which uses soft-clipped bases.
3) bwa itself outputs an XT tag, which unfortunately is the same tag name that MarkIlluminaAdapters uses. If the tag names were different, MergeBamAlignment could potentially be given an option to carry over the XT tag from the MarkIlluminaAdapters output BAM while still preserving the original base qualities. HaplotypeCaller could then be given an option to take the XT tag into account, either by treating those bases as low quality or by ignoring them entirely. This would probably be the ideal workflow, but it would require code changes to all three tools: MarkIlluminaAdapters, MergeBamAlignment, and HaplotypeCaller.
The simple fix for now, then, is to use a separate tool to trim adapters from the FASTQ files before they enter the GATK workflow.
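For anyone unsure what trimming at the FASTQ level involves, here is a toy Python sketch of 3' adapter removal on a single record. It only does exact matching, plus a check for a partial adapter at the read end, whereas dedicated trimming tools also handle mismatches, paired reads, and quality filtering, so treat it as an illustration rather than a replacement for one. The function names, the min_overlap default, and the adapter string in the example are assumptions; use the adapter sequence for your own library prep.

```python
def find_adapter_start(seq, adapter, min_overlap=3):
    """Return the 0-based index where the adapter begins, or len(seq) if none is found."""
    idx = seq.find(adapter)
    if idx != -1:
        return idx  # full adapter present inside the read
    # Otherwise look for a prefix of the adapter running off the 3' end of the read,
    # trying the longest possible overlap first.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return len(seq) - k
    return len(seq)


def trim_record(seq, qual, adapter, min_overlap=3):
    """Trim sequence and quality string of one FASTQ record at the adapter start."""
    cut = find_adapter_start(seq, adapter, min_overlap)
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter prefix; an assumption for this sketch
    seq = "ACGTACGT" + adapter + "TTT"
    qual = "I" * len(seq)
    print(trim_record(seq, qual, adapter))
```

Because the adapter bases are physically removed before alignment, they cannot reappear as soft-clipped sequence in the final BAM.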
Thanks for any feedback.