Is the adapter boundary set correctly for overlapping read pairs?

Hi,

I've been working on a GATK walker and during testing I noticed some discrepencies between the base counts from the ReadBackedPileup object I get from AlignmentContext.getBasePileup() and those I get from a previous incarnation of my tool based on htsjdk as well as those I can see when looking at the BAM file using IGV.

I've tracked this down to some overlapping read pairs where the insert size is smaller than the read length. From what I've read in the source code (LocusIteratorByState.dontIncludeReadInPileup and ReadUtils.isBaseInsideAdaptor methods), GATK does not include the non-overlapping portion of each read on because these are likely to be adapter.

However, the adapter boundary on the right of the overlapping portion is set two bases after the alignment end of the read aligning to the negative strand, i.e. one base beyond what I would expect. It means that if I obtain the pileup for a test BAM file containing just one of these overlapping read pairs, I get a depth of 2 for the overlapping portion of the reads and 1 for the base immediately after the overlap, 0 elsewhere.

Is this what was intended?

Thanks,
Matt

Tagged:

Issue · Github
by Sheila

Issue Number
968
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited June 2016

    image Hi @mde,

    I want to make sure I understand your question correctly. You are writing a walker based on htsjdk objects and what you see is as follows for an overlapping read pair. Clipping is happening in a staggered manner for the right end of the overlap as in the first diagram. And this does not happen for the left end. image

    Or is what is in the second diagram happening? Are your unexpected overhangs on both ends?

    But a previous version of your tool did not stagger the clipping. Rather it gave you what is happening in the third diagram. image

    You are asking if this difference is intentional. Is this correct? Also, can you point us to the lines of GATK source code you are referring to? Also, can you also clarify what you were using before versus after, e.g. were you using htsjdk's SamLocusIterator versus GATK's LocusWalker?

    Post edited by shlee on
  • mdemde Member

    Thanks shlee.

    Your first diagram is what I'm observing but I was expecting the behaviour represented by the third diagram.

    My previous version of the tool includes all of the aligned portion of a read in its base counts at each locus. One of the motivations for rewriting as a GATK walker was that I could take advantage of the rather handy ReadBackedPileup functionality and in particular the getOverlappingFragmentFilteredPileup method so I wouldn't be double-counting for overlapping read pairs.

    So the line of code in question is:

    https://github.com/broadgsa/gatk/blob/master/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/sam/ReadUtils.java#L218

    in the getAdaptorBoundary method and whether it should include the + 1 in
    return read.getAlignmentStart() + insertSize + 1;

    As an aside, I'm working with cancer data and the examples of these supposedly short insert overlapping read pairs do not appear to be cases of reading into adapter sequence. I think I would prefer it if there was a way to include the non-overlapping portions of these reads in the pileup, while still not double-counting for the overlapping portion, but I can't see any options for controlling the way GATK is working in this respect.

    Matt

  • shleeshlee CambridgeMember, Broadie, Moderator

    Can you clarify what you were using before versus after, e.g. were you using htsjdk's SamLocusIterator versus GATK's LocusWalker?

  • shleeshlee CambridgeMember, Broadie, Moderator

    @mde,

    For your 0/1/2 one-sided overhang, we will have someone look into it to determine if it's a bug.

    As for your question of these overhangs not being adapter, I suppose this comes down to GATK's original assumption and intention with this clipping action versus what is happening in your data:

    • Overhangs are mostly adapter, non-instances are rare.
    • Overhangs mostly represent something biological and adapter overhangs are rare.

    I'm trying to imagine what genomic events would give rise to your case and how frequent these events are over adapter overhangs. Can you send us these reads? You are implying that a proportion of reads for this locus are mapped in a way to create these overlaps, and the overlap is artifactual in that you do not observe adapter sequence, but you suspect something biological for the event(s) represented by your reads and the pile-up for the region. Can you explain to us why you think this is the case for your data?

    Also, I think you are saying that the manner in which you are writing your tool does not allow you to disable this clipping action. Our developer says that our tools have to opt-in to activate this clipping action. He is curious what type of walker you are writing (LocusWalker or ActiveRegionWalker) and whether you are building on existing code or writing it from scratch.

  • mdemde Member

    Hi shlee,

    The assumption of overhangs being adapter seems reasonable to me and I'm not that concerned about being able to disable the clipping action. But I don't think I can control this, it seems pretty much hardwired in LocusIteratorByState.lazyLoadNextAlignmentContext() which calls dontIncludeReadInPileup and ReadUtils.isBaseInsideAdaptor in turn to decide whether to include the read in the pileup. Perhaps I'm missing something?

    I'm writing a LocusWalker and testing on an ICGC benchmark dataset. It's from a cancer that I don't really know a lot about but for which there is some ground truth in terms of validated variants.

    The following read pair and greatly simplified locus walker show the behaviour I'm seeing.

    D00491:336:C8VN3ANXX:7:1101:7002:64885 81 chr20 59076018 60 101M = 59076060 -59 TATGTCCTGAGGTTATGAGGATGGTGATCAACTGAGGTTTTAGGAATGGGGATGGGATGAGGTTATGAGGATGGGGATGCCCTGAGGTTATGGGGACAGGG #C@BCCC@CDCDDDDCCECEEDBEB@DE?A=GHJIIIIGIJJIIHFEJJJIJJIJIJJJIJJJJJJJJJJIEJJJGIHFGJIIJJJJJHHHHHFFFFFCCC MC:Z:101M MD:Z:10PG:Z:MarkDuplicates RG:Z:MDE001 NM:i:0 MQ:i:52 AS:i:101 XS:i:32
    D00491:336:C8VN3ANXX:7:1101:7002:64885 161 chr20 59076060 52 101M = 59076018 59 GGATTGGGGATGGGATGAGGTTATGAGGACGGGGATGCACTGAGGTTATGGGGACAGGGATGCATTGAGGTTGTAGGGATGGAGATGTGCTGAGGTTATGG BC@FFFFFHDHHHJGHIIJJGHGIJGIJJIJJJIGHIIIHGJGHJ@EGGJEHHFFFEDD?6=A>;;>@CC59?B:ACB9<<@(<;>9@(:>:ACA@3?>ACC MC:Z:101M MD:Z:3A25T8C33A26A1 PG:Z:MarkDuplicates RG:Z:MDE001 NM:i:5 MQ:i:60 AS:i:80 XS:i:56

    @Downsample(by = DownsampleType.NONE)
    public class TestAdapterBoundaryPileup extends LocusWalker<Integer, Integer>
    {
    @Override
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context)
    {
    int depth = context.getBasePileup().depthOfCoverage();
    logger.info(context.getContig() + "\t" + context.getPosition() + "\t" + ref.getBase() + "\t" + depth);
    return depth;
    }

    @Override
    public Integer reduce(Integer value, Integer sum)
    {
        return sum + value;
    }
    
    @Override
    public Integer reduceInit()
    {
        return 0;
    }
    

    }

    Best
    Matt

  • shleeshlee CambridgeMember, Broadie, Moderator

    Matt, we haven't forgotten about you. I will get back to you after talking to our developer.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited June 2016

    Hi Matt (@mde),

    Our developer confirms you cannot control the clipping action either for GATK3 or GATK4. However, for GATK4, there are @Override options that allow you to control whether to include Ns (from the CIGAR) and/or include deletions in the pileup. These two controllable options are examples of how you could write what you need. Because these options are set up for GATK4, our developer recommends writing new options for GATK4.

    If you will go ahead with this, I'd like to add my two cents. Namely, instead of disabling all clipping of overhanging ends, perhaps you should consider fine-tuning the clipping action to be more discriminant. That is, set the clipping to pertain only to those sequences confirmed to be adapter sequence, e.g. that marked with the XT tag by MarkIlluminaAdapters. MarkIlluminaAdapters looks specifically for adapter sequence and marks the 5' start of the adapter sequence on the 3' end of the read. Tutorial#6483 illustrates its use in Section 2. This would exclude clipping for any remaining overhangs from overlapping read mates.

    image

    Thanks for the example reads. This is the alignment I see for the pair using hg38. Both reads align in the forward direction with an overlapping middle position. I assume the +1 clipping action applies to such unidirectional FF mates as well as the FR innie-mates. Can you clarify if this is correct?

    P.S. I mentioned the unidirectional reads from your ICGC data to a Cancer Program friend here at the Broad. After a few minutes of discussion, I realize you ought to characterize this phenomena in your dataset further. Before you go about writing up your new walker, you may find correlating these events to your validated variants prudent. To be more specific, your validation data should have been derived from an orthogonal technology that would not sample from the same library preparation as your short read data. Also, do you see the structural variants implied by your unidirectional reads supported by more than single pairs in a stack? Do you observe such events singly across the genome and are they restricted to mate pairs? Consider the possibility of pcr jumps that generate artifactual chimeric reads. Incomplete pcr products can hybridize to different templates to continue elongation. An analyst here at the Broad has found recurrent inversions on the insert/fragment length scale that they attribute to pcr jumps specific to ICGC data. If you are interested, I'm certain they would be happy to discuss with you and I can make an introduction.

    Post edited by shlee on
  • mdemde Member

    Thanks shlee, I'll certainly look out for the new options in GATK4. I've been away the past few days so apologies for not replying sooner.

    I've just used Picard SamToFastq to convert back to FASTQ and realigned using bwa mem to GRCh38. I still get the first read aligning in the reverse direction, so this pair is still RF for the latest reference sequence.

    I reverse complemented the first read to get your FF example and run the GATK pileup on that. In this case no clipping is performed so I get the read depth that corresponds to the coverage track in IGV, i.e. depth 1 for the non-overlapping portion and 2 for the overlap.

    Matt

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @mde,

    Sorry I am just noticing you've replied in this thread. Must have missed the email notification somehow. Our developer is looking into this currently.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited November 2016

    Hi Matt (@mde),

    We've decided the behavior is a bug and this is now being corrected in the code. Just to recapitulate, the bug in the code places the adapter boundary one position downstream from the correct position, such that the pileup at the position will have a coverage of 1 instead of 0. The fix that our developer has implemented removes this offset at the end.

    Thanks again for bringing this to our attention. If you have a moment, could you share with us the tool-chain you use to align your reads? It would be good to know how you end up with an RF alignment when we get an FF alignment.

Sign In or Register to comment.