
Apparent loss of reads during realignment

MattB

Hi, I have 32 exomes in a merged BAM file, with read groups to identify each exome and id, sample, library, and platform all set (I hope) correctly. What I can't understand is why there is a discrepancy between the number of reads that RealignerTargetCreator and IndelRealigner report traversing in their logs:

From my RealignerTargetCreator run I got:

INFO 18:24:17,286 ProgressMeter - Total runtime 5394.91 secs, 89.92 min, 1.50 hours
INFO 18:24:17,286 MicroScheduler - 389,565,880 reads were filtered out during traversal out of 2,752,553,629 total (14.15%)
INFO 18:24:17,287 MicroScheduler - -> 24573133 reads (0.89% of total) failing BadMateFilter
INFO 18:24:17,287 MicroScheduler - -> 229871090 reads (8.35% of total) failing DuplicateReadFilter
INFO 18:24:17,287 MicroScheduler - -> 135121273 reads (4.91% of total) failing MappingQualityZeroFilter
INFO 18:24:17,298 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter

versus this from IndelRealigner:

INFO 11:47:21,707 MicroScheduler - 0 reads were filtered out during traversal out of 200,100 total (0.00%)

Why has IndelRealigner traversed so few of the reads that RealignerTargetCreator reported?

My command for IndelRealigner was:

java -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner --maxReadsInMemory 1000000 --maxReadsForRealignment 1000000 -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf -I Merged_dedup.bam -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -targetIntervals forIndelRealigner.intervals -o Merged_dedup_realigned.bam

Answers

  • Geraldine_VdAuwera (GATK Developer)

    Hi Matt,

    Keep in mind that when you run the RTCreator, it traverses the entire dataset, so it looks at pretty much all the reads. In contrast, the IRealigner only traverses the targets created by RTC, so it only looks at the subset of reads that are contained within those intervals. Makes sense?

    EDIT: This is not actually right, I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.


    Geraldine Van der Auwera, PhD

  • MattB

    Thanks, I see: the 200,100 is down to the number of reads in the realignment targets. The next output I'm trying to figure out is from BaseRecalibrator, where the MicroScheduler is reporting a traversal of a similar number:

    INFO 22:00:49,216 BaseRecalibrator - Processed: 2370766246 reads
    INFO 22:00:49,216 ProgressMeter - done 2.37e+09 6.9 h 10.0 s 100.0% 6.9 h 0.0 s
    INFO 22:00:49,216 ProgressMeter - Total runtime 24957.86 secs, 415.96 min, 6.93 hours
    INFO 22:00:49,216 MicroScheduler - 28283 reads were filtered out during traversal out of 228442 total (12.38%)
    INFO 22:00:49,217 MicroScheduler - -> 22674 reads (9.93% of total) failing DuplicateReadFilter
    INFO 22:00:49,217 MicroScheduler - -> 5609 reads (2.46% of total) failing MappingQualityZeroFilter

    In the next stage, PrintReads, I'm getting the same 200,100 value back, and I'm concerned something has gone wrong.

    INFO 15:02:08,627 ProgressMeter - Total runtime 99031.08 secs, 1650.52 min, 27.51 hours
    INFO 15:02:08,653 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)

  • Geraldine_VdAuwera (GATK Developer)

    I see -- that is a little odd, but I think this might be a problem of the counting not working properly, while the tools themselves ran correctly. I think we've seen and fixed a similar issue in the past -- can you tell me what version you're using?


  • MattB

    Interesting, I was just wondering if I'd maxed out some internal counter. I'm using what I think is the current release, 2.4-9-g532efad.

    I did try to count the number of reads in my BAM file using samtools, as an attempt to get a handle on things, but it reported some huge negative value (I'm assuming I maxed that out too). I'll try with Picard.

  • MattB

    I've just looked over my RealignerTargetCreator log and I found this line at the top:

    INFO 17:50:22,143 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

    I've not explicitly asked it to do any downsampling, could this be the cause of the issue?

  • Geraldine_VdAuwera (GATK Developer)

    FYI you can use the CountReads walker (in GATK) to count reads as well.

    The downsampling is done by default to avoid getting bogged down in regions of huge coverage (which are usually not informative). I wouldn't expect it to make that huge a difference in the reported read counts, going from 2 billion reads down to 200K traversed. I'm still leaning towards thinking the traversal counter is buggy, maxing out somehow.

    Getting accurate read counts should help shed some light on this.

    You may also want to try running again with the 2.5 version, which we released a couple of days ago. It's almost all bug fixes in this one.


  • MattB

    I didn't know 2.5 was out; I'll give that a try and get back to you with CountReads walker counts too!

  • MattB

    OK, so with 2.5-2-gf57256b and IndelRealigner I'm still getting the figure of 200,100 from the MicroScheduler. I guess this could be down to the number of intervals, but it's suspiciously the same as the number I got out of PrintReads with the previous version. I'm counting the number of reads in my various files now, and will post that when I have the info.

    My input for RealignerTargetCreator is a 201G BAM file, and my output from IndelRealigner is 202G, so in principle it looks to have at least output all the previous data plus an additional bit. I'm doing the BaseRecalibrator stage now, and will follow up with PrintReads output when I have it.

    INFO 15:37:16,054 ReadShardBalancer$1 - Loading BAM index data for next contig
    INFO 15:37:24,220 ProgressMeter - done 2.78e+09 87.0 h 112.0 s 100.0% 87.0 h 0.0 s
    INFO 15:37:24,221 ProgressMeter - Total runtime 313358.19 secs, 5222.64 min, 87.04 hours
    INFO 15:37:24,334 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 15:37:27,030 GATKRunReport - Uploaded run statistics report to AWS S3

    Which is following a run of RealignerTargetCreator which had the following output.

    INFO 21:51:11,193 ProgressMeter - done 3.14e+09 92.3 m 1.0 s 100.0% 92.3 m 0.0 s
    INFO 21:51:11,193 ProgressMeter - Total runtime 5539.06 secs, 92.32 min, 1.54 hours
    INFO 21:51:11,194 MicroScheduler - 389580510 reads were filtered out during traversal out of 2752562860 total (14.15%)
    INFO 21:51:11,194 MicroScheduler - -> 24572210 reads (0.89% of total) failing BadMateFilter
    INFO 21:51:11,194 MicroScheduler - -> 229879914 reads (8.35% of total) failing DuplicateReadFilter
    INFO 21:51:11,194 MicroScheduler - -> 135128002 reads (4.91% of total) failing MappingQualityZeroFilter
    INFO 21:51:11,195 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter
    INFO 21:51:14,441 GATKRunReport - Uploaded run statistics report to AWS S3

  • Geraldine_VdAuwera (GATK Developer)

    Hmm, that does sound like the counter is buggy. We've got other reports of a different symptom as well, also of buggy counter results. Let me know what results you get and I'll see how we can look into this.


  • MattB

    Thanks, I should have more counts for you by the end of the day.

  • MattB

    OK, I've run CountReads on the merged, deduplicated input file I used at the beginning for RealignerTargetCreator (which had itself correctly reported the number of reads).

    I used this command:

    java -Xmx128g -jar ~/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T CountReads -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -I Merged_dedup.bam -nct 8

    And the output I get here also appears to be broken. The large negative number in the traversal result is interesting too, as this looks like an overflow of a 32-bit integer: my ~2.7 billion reads overflow past the 32-bit maximum of ~2.1 billion and wrap around into large negative territory, roughly -1.5 billion. As before, the MicroScheduler is reporting a traversal of 200,100.

    INFO 13:29:48,459 Walker - [REDUCE RESULT] Traversal result is: -1519325197
    INFO 13:29:48,460 ProgressMeter - done 2.78e+09 3.2 h 4.0 s 100.0% 3.2 h 0.0 s
    INFO 13:29:48,460 ProgressMeter - Total runtime 11408.46 secs, 190.14 min, 3.17 hours
    INFO 13:29:48,563 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 13:29:50,421 GATKRunReport - Uploaded run statistics report to AWS S3

    Just double checked and I'm using 64 bit Java.
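    That wraparound arithmetic can be checked directly: folding a 64-bit count into signed 32-bit range, the way a Java (int) cast would, reproduces the REDUCE RESULT above. A minimal Python sketch; note the 2,775,642,099 input is a hypothetical true count inferred from the logged -1519325197 and the ProgressMeter's 2.78e+09, not a value reported anywhere in the logs:

```python
# Wrap an arbitrary integer into signed 32-bit range, as a Java (int) cast would.
def to_int32(n):
    n &= 0xFFFFFFFF                      # keep only the low 32 bits
    return n - (1 << 32) if n >= (1 << 31) else n

# Hypothetical true read count (~2.78e+09, consistent with the ProgressMeter line):
print(to_int32(2_775_642_099))           # prints -1519325197, as in the log
```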

  • Geraldine_VdAuwera (GATK Developer)

    Alright, we're going to have to look into this more closely. We'll want a snippet of your file to debug the PrintReads / IR issue -- but preferably not the whole bam file. Can you see if you can select out a region where PrintReads and RTC output different numbers? If you can't then we may just have to look at the whole thing, but we'd rather avoid that if possible.

    Also, to correct a statement I made early on, I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.


  • MattB

    Thanks for your help. I can try to cut out a snippet, but my guess is that it's the sheer number of reads in total that is the issue. So I've just tried the whole of chr1 in RTC, which gave:

    INFO 16:23:19,911 ProgressMeter - done 2.49e+08 9.4 m 2.0 s 100.0% 9.4 m 0.0 s
    INFO 16:23:19,912 ProgressMeter - Total runtime 563.61 secs, 9.39 min, 0.16 hours
    INFO 16:23:19,912 MicroScheduler - 46560609 reads were filtered out during traversal out of 287085351 total (16.22%)
    INFO 16:23:19,912 MicroScheduler - -> 2195996 reads (0.76% of total) failing BadMateFilter
    INFO 16:23:19,912 MicroScheduler - -> 23464896 reads (8.17% of total) failing DuplicateReadFilter
    INFO 16:23:19,913 MicroScheduler - -> 20899717 reads (7.28% of total) failing MappingQualityZeroFilter
    INFO 16:23:21,295 GATKRunReport - Uploaded run statistics report to AWS S3

    I'm now doing IR on this bit too, using the intervals from the above step. Looks like the result is ~8 hours away; I'll get back to you when it's done. My guess is that the traversal stats will be fine, but it's worth a shot before I send you the full file.

  • MattB

    Thanks, that puts my data paranoia to rest! :) Just as long as I'm not maxing out anything else somewhere that's a 32-bit integer (assuming that was the bug)! Was that the issue with the counts?

  • ebanks (GATK Developer)

    Yes, a long being cast to an integer.
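    That narrowing cast keeps only the low 32 bits of the 64-bit value and reinterprets them as signed. A minimal Python sketch of the same reinterpretation via `struct`; as above, the 2,775,642,099 figure is a hypothetical count inferred from the logged value, not one reported directly:

```python
import struct

# Hypothetical 64-bit accumulator value (~2.78e+09), inferred from the log above.
true_count = 2_775_642_099

# Pack the low 32 bits as unsigned, then reinterpret them as a signed 32-bit int,
# which is effectively what Java's narrowing (int) cast of a long does.
wrapped, = struct.unpack('<i', struct.pack('<I', true_count & 0xFFFFFFFF))
print(wrapped)  # prints -1519325197, matching the "Traversal result" in the log
```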

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • MattB

    Just to follow up on this, you might be interested to know that the final lines in the PrintReads output (for the same bunch of exomes post BQSR) also had a rather odd traversal result:

    INFO 05:02:08,403 ReadShardBalancer$1 - Loading BAM index data for next contig
    INFO 05:02:09,459 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761
    INFO 05:02:10,373 ProgressMeter - done 2.78e+09 19.1 h 24.0 s 100.0% 19.1 h 0.0 s
    INFO 05:02:10,373 ProgressMeter - Total runtime 68642.62 secs, 1144.04 min, 19.07 hours
    INFO 05:02:10,374 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 05:02:11,954 GATKRunReport - Uploaded run statistics report to AWS S3

    Not sure why it's a REDUCE RESULT, since I've not specified any reduction, but the "org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761" might be useful.

  • Geraldine_VdAuwera (GATK Developer)

    Oh, "reduce result" just refers to the fact that GATK uses a map/reduce framework to process data; you can safely ignore that.

