Reads traversed during IndelRealigner

Hi GATK team.

I am running GATK version 2.3. I've run a few jobs (with different reference sequences and Illumina read datasets) and noticed the same puzzling phenomenon.
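For reference, the two commands are roughly of this shape (paths and file names below are placeholders, not my actual ones):

```shell
# Step 1: identify candidate intervals for realignment
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fasta \
    -I input.bam \
    -o targets.intervals

# Step 2: realign reads over those intervals
java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R reference.fasta \
    -I input.bam \
    -targetIntervals targets.intervals \
    -o realigned.bam
```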

When I run RealignerTargetCreator, I get the following output in the log...

96553 reads were filtered out during traversal out of 31487680 total (0.31%)
96137 reads (0.31% of total) failing MappingQualityZeroFilter
416 reads (0.00% of total) failing UnmappedReadFilter

Then, when I run IndelRealigner, I get the following puzzling output in the log...

0 reads were filtered out during traversal out of 20024 total (0.00%)

Previously, when I was using an earlier version of GATK (pre-2), the numbers of reads traversed during RealignerTargetCreator and IndelRealigner were about the same. Now, the IndelRealigner log always shows about 20000 reads traversed.

Should I be concerned about this?

Also, a quick question: will IndelRealigner still output a warning if the maxReadsForRealignment threshold is exceeded for an interval (and therefore the region is returned as-is instead of realigned)?

Thank you for your help!

Laura Williams

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Laura,

    I think that's a bug we fixed in 2.4. Can you give it a try with 2.4 and let me know if you're still having the problem?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    To answer your second question, yes, the IR will say "Not attempting realignment in interval X because there are too many reads."

  • laura13 · Member

    Hi Geraldine.

    Thanks for your reply.

    I downloaded version 2.4, and here are the results...

    Running RealignerTargetCreator gives the same output as version 2.3...

    96553 reads were filtered out during traversal out of 31487680 total (0.31%)
    96137 reads (0.31% of total) failing MappingQualityZeroFilter
    416 reads (0.00% of total) failing UnmappedReadFilter

    Running IndelRealigner now gives the following output in the log...

    0 reads were filtered out during traversal out of 200040 total (0.00%)

    There are more reads traversed during IndelRealigner in version 2.4 compared to 2.3, but there is still a huge discrepancy between reads traversed during RealignerTargetCreator and IndelRealigner.

    IndelRealigner should report all reads traversed and not just reads traversed in a particular interval, yes?

    I appreciate any help with this puzzling difference!

    Thanks,
    Laura Williams

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Laura,

    No, actually the number of reads traversed by IndelRealigner is just the reads that fall within the target intervals determined by the RTC. So that difference is expected.

  • laura13 · Member

    Hi Geraldine.

    Thanks for your reply.

    I also noticed that my intervals output file from RealignerTargetCreator is empty. Does this mean that GATK did not identify any regions needing realignment?

    If the input intervals file is empty, why did IndelRealigner run?

    Thanks for your help!
    Laura

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    edited April 2013

    Well, IndelRealigner will start up even if the intervals file is empty, it'll just see it has nothing to do and consider that it has finished without error. I'm not sure why it gives a non-zero number of reads traversed though, that's weird.
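    If you're scripting the pipeline, a minimal guard (placeholder file name) to skip the IndelRealigner step when the target list is empty would look like this:

```shell
# Only launch IndelRealigner when RTC actually produced targets.
# [ -s FILE ] is true when FILE exists and is non-empty.
if [ -s targets.intervals ]; then
    echo "realignment targets found; run IndelRealigner"
else
    echo "no targets; skipping IndelRealigner"
fi
```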

    As for the fact that RTC is outputting an empty target list, I'm a little surprised. Unless you have a very small dataset (or one that has already been realigned) it is unlikely that there would be nothing at all that needs realignment. Have you validated your input files?

  • laura13 · Member

    Hi Geraldine.

    I used Picard's ValidateSamFile to check my input bam file.

    It returned the following error messages...

    Mate negative strand flag does not match read negative strand flag of mate
    MAPQ should be 0 for unmapped read
    Mate unmapped flag does not match read unmapped flag of mate

    My BAM file was generated using BWA. I had to use Picard's AddOrReplaceReadGroups to add read group information to make the file compatible with GATK. My read dataset represents a single genotype, and it was generated on a single lane.
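    For context, the read-group step was a Picard call along these lines (old per-tool jar syntax; the read-group values here are placeholders, not my actual ones):

```shell
java -jar AddOrReplaceReadGroups.jar \
    INPUT=input.bam \
    OUTPUT=input.rg.bam \
    RGID=lane1 \
    RGLB=lib1 \
    RGPL=illumina \
    RGPU=unit1 \
    RGSM=sample1 \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true
```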

    Would these errors cause problems with GATK?

    A final note...coverage for this alignment is incredibly deep (~2500x). Overkill, I know... Would this impact the output from GATK?

    Thanks again for your help!

    Laura

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Are you getting a lot of these bad mate errors? RTC does apply some very stringent filters (see the tech doc for a complete list) -- but if the bulk of your data was getting filtered out you'd see it in the output log. Yet what you posted above doesn't suggest that's the case.

    That is a lot of depth you've got there (have you marked duplicates?), but it's still well below the default threshold at which IndelRealigner gives up on a region (past a certain depth, realignment becomes too expensive). And RTC doesn't have a maximum depth setting. But you could try setting -dcov to downsample and see if that helps.
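    For reference, downsampling at the RTC step would look something like this (just a sketch; the -dcov value here is an arbitrary illustration, not a recommendation):

```shell
# Downsample to ~250x coverage per locus before target creation
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fasta \
    -I input.bam \
    -dcov 250 \
    -o targets.intervals
```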

  • Kurt · Member ✭✭✭
    edited April 2013

    Hi, this is just a general comment about running ValidateSamFile.jar from Picard on a BAM file generated with BWA. The "MAPQ should be 0 for unmapped read" error is pretty common and might mask what your real error is (I think by default, ValidateSamFile.jar outputs the first 100 records and then stops). In any case, it might be a good idea to pass IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR when running it; from the Picard SourceForge FAQ page:

    Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"
    A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.

  • laura13 · Member

    Hi Kurt.

    Thanks for the tip.

    I did run ValidateSam with the three detected error types masked/ignored (and the default max output changed to 1000 records) just to make sure I wasn't missing any other errors. That run didn't pick up any additional errors, so it looks like those three are the only errors afflicting my BAM file.

    Good advice for people who might read this thread and use the ValidateSam tool.

    Laura

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Note that Kurt's advice is good if you are looking to uncover an underlying error, but users shouldn't automatically make Picard ignore those errors when validating files the first time. Issues like bad CIGAR strings can cause the GATK to crap out on some analyses.

  • Kurt · Member ✭✭✭

    Thanks, Geraldine. I totally agree, didn't mean to imply otherwise.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    No worries Kurt, just clarifying for people who might over-interpret the scope of the comment. Please do keep jumping in like this, it's very helpful.

  • Kurt · Member ✭✭✭

    Will do ;) Your group has created an amazing tool set and does a great job at support, and I'm more than happy to help in any way that I can!

  • laura13 · Member

    Hi Geraldine.

    Following up with your recommendations, I removed duplicates from my BAM file using Picard MarkDuplicates. Running RTC on this file (with no other changes from my previous command line) produced an intervals.intervals file with 5 regions flagged.
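    For reference, the duplicate-marking step was a standard Picard call along these lines (file names are placeholders):

```shell
java -jar MarkDuplicates.jar \
    INPUT=input.bam \
    OUTPUT=dedup.bam \
    METRICS_FILE=dup_metrics.txt \
    REMOVE_DUPLICATES=true \
    CREATE_INDEX=true
```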

    I then tried downsampling using the -dfrac flag.

    -dfrac 0.75 produced an intervals.intervals with 6 regions flagged.

    -dfrac 0.60 produced an intervals.intervals with 10 regions flagged.

    I did some searching to find recommendations on appropriate settings for downsampling, but didn't find much.

    Do you have any recommendations on what cutoff to use for downsampling with RTC?

    Thanks yet again!
    Laura

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    The thing is that we normally don't suggest using downsampling at this step because it's not usually necessary, but it sounds like it might be helping produce at least some target intervals.

    In principle you want to produce as many intervals as the RTC can identify as problematic, to avoid missing any regions that would give messy results if not realigned.

    At this point I would recommend visually inspecting the regions you have there (using IGV, for example) to see if they look legitimate. If so, you can run them through the realigner and inspect again after realignment to check that the results make sense. To be completely thorough you can then try producing different lists of intervals with different settings of dfrac, realign into separate files (using -L to restrict each run to the test intervals rather than re-emitting the whole data set each time), and compare the results for sanity.
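    A sketch of that comparison workflow (interval file names are hypothetical, one list per dfrac setting you tried):

```shell
# Realign against each candidate interval list separately,
# restricting output to those intervals with -L
for f in 0.60 0.75; do
    java -jar GenomeAnalysisTK.jar \
        -T IndelRealigner \
        -R reference.fasta \
        -I dedup.bam \
        -L targets_dfrac${f}.intervals \
        -targetIntervals targets_dfrac${f}.intervals \
        -o realigned_dfrac${f}.bam
done
```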

    I wish I could give you a more straightforward answer but since this situation is a little odd I think it's best to proceed cautiously. Let me know if you need me to clarify anything.
