Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Indel realigner introducing errors

Hi

I'm indel realigning with version 2.4-9 using generic commands such as:

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R /path/to/reference.fasta \ -I /path/to/input.bam \ -o /path/to/realigner.intervals

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R /path/to/reference.fasta \ -I /path/to/sample-level.bam \ -targetIntervals /path/to/realigner.intervals.from.rtc \ -o /path/to/realigned.bam \ -model USE_SW \ -LOD 0.4

In most cases this is working fine, but in a few cases it is introducing artefacts that subsequently cause the bam file to fail Picard's ValidateSamFile, and in a couple of cases it introduces errors that can't be fixed by CleanSam and/or FixMateInformation.

Here's an example of an error that can be fixed by CleanSam:

Before indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 25 100M = 471 0

After indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 35 91M1D9M = 471 0

Here's an example of an error that can't be fixed:

Before indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 37 100M = 649 0

After indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 47 91M2D9M0S = 649 0

Is this a known bug? Any chance of a fix?

Thanks!

Richard

Answers

  • Richard_PearsonRichard_Pearson Posts: 3Member

    BTW, also have the same issue with version 2.7-4

  • ebanksebanks Posts: 671GSA Member mod

    Hi Richard, good to hear from you.

    A few comments:

    1) We can definitely take a look and patch it up if you are able to upload a small bam snippet that easily reproduces the error(s). See the posting about uploading to our ftp for more details.

    2) I'm embarrassed to admit that I don't see the error in the first example. In the second example, is the problem that there's a trailing 0S? While that seems weird, is it technically an error? (In any case, we'll fix it of course - but I'm surprised that it fails validation)

    3) Keep in mind that using the Smith-Waterman capabilities of the realigner with such a low LOD is really dangerous. That LOD is more appropriate when realigner only around known indels. Please see the best practices for more details.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Richard_PearsonRichard_Pearson Posts: 3Member

    Hi Eric, regarding your points:

    1) OK, I've now uploaded a tarball (GATK_0S_bug_reproducible_example.tar.gz). See the script indel_realignment_bug_for_Eric_Banks.sh in here for details.

    2) The first example causes errors such as the following with ValidateSamFile

    ERROR: Record 80, Read name HS24_08564:7:2311:4630:19372#87, CIGAR M operator maps off end of reference

    I think it probably is the trailing 0S that is causing CleanSam problems in the second example. It seems like Alec Wysoker thinks this is fixable in Picard so that CleanSam should be able to clean up such problems (http://sourceforge.net/mailarchive/forum.php?thread_name=90E74C31-3DEC-4773-BFC9-45AEA9209F87@broadinstitute.org&forum_name=samtools-help), but would still be good to ensure GATK isn't introducing such errors in the first place.

    3) Thanks for this. I'll look at the best practices again and reconsider

    Best wishes and thanks for your help, Richard

  • ebanksebanks Posts: 671GSA Member mod

    Hi Richard,

    I've just created a patch for the first case (no longer realigning reads off the end of contigs). Hopefully that will go into our nightly builds starting tomorrow. However, I am not able to reproduce your other bug (adding a 0S operator) with the current version of the GATK; hopefully that means it's already fixed.

    Thanks for the report!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.