

Apparent loss of reads during realignment

MattB · Member · Posts: 20

Hi, I have 32 exomes in a merged BAM file, with read groups to identify each exome (ID, sample, library and platform all set correctly, I hope). What I can't understand is why there is a discrepancy between the logs generated by RealignerTargetCreator and IndelRealigner in the number of reads traversed:

From my RealignerTargetCreator run I got:

INFO  18:24:17,286 ProgressMeter - Total runtime 5394.91 secs, 89.92 min, 1.50 hours
INFO  18:24:17,286 MicroScheduler - 389,565,880 reads were filtered out during traversal out of 2,752,553,629 total (14.15%)
INFO  18:24:17,287 MicroScheduler -   -> 24573133 reads (0.89% of total) failing BadMateFilter
INFO  18:24:17,287 MicroScheduler -   -> 229871090 reads (8.35% of total) failing DuplicateReadFilter
INFO  18:24:17,287 MicroScheduler -   -> 135121273 reads (4.91% of total) failing MappingQualityZeroFilter
INFO  18:24:17,298 MicroScheduler -   -> 384 reads (0.00% of total) failing UnmappedReadFilter

versus this from IndelRealigner:

INFO 11:47:21,707 MicroScheduler - 0 reads were filtered out during traversal out of 200,100 total (0.00%)

Why have so few of the reads that RealignerTargetCreator reported been traversed by IndelRealigner?

My command for IndelRealigner was:

java -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner \
    --maxReadsInMemory 1000000 --maxReadsForRealignment 1000000 \
    -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf \
    -I Merged_dedup.bam -R /data/GATK_bundle/hg19/ucsc.hg19.fasta \
    -targetIntervals forIndelRealigner.intervals -o Merged_dedup_realigned.bam
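For reference, the matching RealignerTargetCreator step isn't shown in the thread; assuming it used the same reference, known-indels files, and input BAM as the IndelRealigner command above, it would have looked roughly like this (a reconstruction, not the poster's actual command):

```shell
# Hypothetical reconstruction of the RealignerTargetCreator run; paths and
# -known files are assumed to mirror the IndelRealigner command above.
java -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R /data/GATK_bundle/hg19/ucsc.hg19.fasta \
    -I Merged_dedup.bam \
    -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf \
    -o forIndelRealigner.intervals
```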


edited May 2013

Hi Matt,

Keep in mind that when you run the RTCreator, it traverses the entire dataset, so it looks at pretty much all the reads. In contrast, the IRealigner only traverses the targets created by RTC, so it only looks at the subset of reads that are contained within those intervals. Makes sense?

EDIT: This is not actually right, I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

MattB · Member · Posts: 20
edited May 2013

Thanks, I see: the 200,100 is down to the number of target reads for realignment. The next output I'm trying to figure out is from BaseRecalibrator; here the MicroScheduler is reporting a traversal of a similarly small number:

INFO  22:00:49,216 BaseRecalibrator - Processed: 2370766246 reads
INFO  22:00:49,216 ProgressMeter -            done        2.37e+09    6.9 h       10.0 s    100.0%         6.9 h     0.0 s
INFO  22:00:49,216 ProgressMeter - Total runtime 24957.86 secs, 415.96 min, 6.93 hours
INFO  22:00:49,216 MicroScheduler - 28283 reads were filtered out during traversal out of 228442 total (12.38%)
INFO  22:00:49,217 MicroScheduler -   -> 22674 reads (9.93% of total) failing DuplicateReadFilter
INFO  22:00:49,217 MicroScheduler -   -> 5609 reads (2.46% of total) failing MappingQualityZeroFilter

In the next stage, PrintReads, I'm getting the same 200,100 value back and I'm concerned something has gone wrong.

INFO  15:02:08,627 ProgressMeter - Total runtime 99031.08 secs, 1650.52 min, 27.51 hours
INFO  15:02:08,653 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
Post edited by MattB on

I see -- that is a little odd, but I think this may be a problem with the counting itself rather than with the tools, which seem to have run correctly. I think we've seen and fixed a similar issue in the past -- can you tell me what version you're using?

Geraldine Van der Auwera, PhD

MattB · Member · Posts: 20

Interesting, I was just wondering if I'd maxed out some internal counter. I'm using what I think is the current release, 2.4-9-g532efad.

I did try to count the number of reads in my BAM file using samtools, as an attempt to get a handle on things, but it reported a huge negative value (I'm assuming I maxed that out too). I'll try with Picard.
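For a cross-check that doesn't rely on any single tool's counter, two alternative counting routes with samtools (assuming samtools is on the PATH and the BAM is coordinate-sorted and indexed; the file name is taken from the commands above):

```shell
# Count all records by streaming the whole file (slow on a ~200G BAM)
samtools view -c Merged_dedup.bam

# Or sum mapped + unmapped counts per contig from the index (fast);
# idxstats columns are: contig, length, #mapped, #unmapped
samtools idxstats Merged_dedup.bam | awk '{ total += $3 + $4 } END { print total }'
```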

MattB · Member · Posts: 20

I've just looked over my RealignerTargetCreator log and I found this line at the top:

INFO 17:50:22,143 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

I've not explicitly asked it to do any downsampling; could this be the cause of the issue?

FYI you can use the CountReads walker (in GATK) to count reads as well.

The downsampling is done by default to avoid getting bogged down in regions of huge coverage (which are usually not informative). I wouldn't expect it to make that huge a difference in the reported read counts -- going from 2Bn reads to 200K traversed. I'm still leaning towards thinking the traversal counter is buggy and maxing out somehow.
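If you want to rule downsampling out entirely, the GATK engine accepts an override; something like the following (-dt / --downsampling_type is a standard engine argument in GATK 2.x, shown here on a hypothetical RealignerTargetCreator run with paths from earlier in the thread) should disable it:

```shell
# -dt NONE turns off the default BY_SAMPLE downsampling to 1000x coverage;
# everything else is assumed to be as in the earlier commands.
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R /data/GATK_bundle/hg19/ucsc.hg19.fasta \
    -I Merged_dedup.bam \
    -dt NONE \
    -o forIndelRealigner.intervals
```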

Getting accurate read counts should help shed some light on this.

You may also want to try running again with the 2.5 version, which we released a couple of days ago. It's almost all bug fixes in this one.

Geraldine Van der Auwera, PhD

MattB · Member · Posts: 20

I did not know 2.5 was out; I'll give that a try and get back to you with CountReads walker counts too!

MattB · Member · Posts: 20

OK, so with 2.5-2-gf57256b, using IndelRealigner I'm still getting the figure of 200,100 from the MicroScheduler. I guess this could be down to the number of intervals, but it's suspiciously the same number I got out of PrintReads with the previous version. I'm counting the number of reads in my various files now and will post that when I have the info.

My input for RealignerTargetCreator is a 201G BAM file, and my output from IndelRealigner is 202G, so in principle it looks to have at least written out all the previous data plus an additional bit. I'm doing the BaseRecalibrator stage now and will follow up with PrintReads output when I have it.

INFO  15:37:16,054 ReadShardBalancer1 - Loading BAM index data for next contig
INFO  15:37:24,220 ProgressMeter -            done        2.78e+09   87.0 h      112.0 s    100.0%        87.0 h     0.0 s
INFO  15:37:24,221 ProgressMeter - Total runtime 313358.19 secs, 5222.64 min, 87.04 hours
INFO  15:37:24,334 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
INFO  15:37:27,030 GATKRunReport - Uploaded run statistics report to AWS S3

This follows a run of RealignerTargetCreator which had the following output:

INFO  21:51:11,193 ProgressMeter -            done        3.14e+09   92.3 m        1.0 s    100.0%        92.3 m     0.0 s
INFO  21:51:11,193 ProgressMeter - Total runtime 5539.06 secs, 92.32 min, 1.54 hours
INFO  21:51:11,194 MicroScheduler - 389580510 reads were filtered out during traversal out of 2752562860 total (14.15%)
INFO  21:51:11,194 MicroScheduler -   -> 24572210 reads (0.89% of total) failing BadMateFilter
INFO  21:51:11,194 MicroScheduler -   -> 229879914 reads (8.35% of total) failing DuplicateReadFilter
INFO  21:51:11,194 MicroScheduler -   -> 135128002 reads (4.91% of total) failing MappingQualityZeroFilter
INFO  21:51:11,195 MicroScheduler -   -> 384 reads (0.00% of total) failing UnmappedReadFilter
INFO  21:51:14,441 GATKRunReport - Uploaded run statistics report to AWS S3

Geraldine_VdAuwera · Administrator, GATK Developer · Posts: 7,428

Hmm, that does sound like the counter is buggy. We've got other reports of a different symptom as well, also of buggy counter results. Let me know what results you get and I'll see how we can look into this.

Geraldine Van der Auwera, PhD

MattB · Member · Posts: 20

Thanks, should have more counts for you by the end of the day.

MattB · Member · Posts: 20
edited May 2013

OK, I've run CountReads on the merged, de-duplicated input file I used at the beginning for RealignerTargetCreator (which had itself correctly reported the number of reads).
I used this command:

java -Xmx128g -jar ~/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T CountReads \
    -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -I Merged_dedup.bam -nct 8

And the output I get here also appears to be broken. The large negative number in the traversal result is interesting too, as this looks like a 32-bit integer overflow: my 2.7 billion overflows the max 32-bit int to -2.1 billion, then you add the remainder of my 2.7 billion (i.e. 600 million) and you get -1.5 billion. As before, the MicroScheduler is reporting a traversal of 200100.

INFO  13:29:48,459 Walker - [REDUCE RESULT] Traversal result is: -1519325197
INFO  13:29:48,460 ProgressMeter -            done        2.78e+09    3.2 h        4.0 s    100.0%         3.2 h     0.0 s
INFO  13:29:48,460 ProgressMeter - Total runtime 11408.46 secs, 190.14 min, 3.17 hours
INFO  13:29:48,563 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
INFO  13:29:50,421 GATKRunReport - Uploaded run statistics report to AWS S3

Just double checked and I'm using 64-bit Java.

Post edited by MattB on

Geraldine_VdAuwera · Administrator, GATK Developer · Posts: 7,428

Alright, we're going to have to look into this more closely. We'll want a snippet of your file to debug the PrintReads / IR issue -- but preferably not the whole BAM file. Can you see if you can select out a region where PrintReads and RTC output different numbers? If you can't, then we may just have to look at the whole thing, but we'd rather avoid that if possible.

Also, to correct a statement I made early on: I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.

Geraldine Van der Auwera, PhD

MattB · Member · Posts: 20
edited May 2013

Thanks for your help. I can try to cut out a snippet, but my guess is that it's the sheer number of reads in total that is the issue.
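As an aside, the wraparound arithmetic described above can be checked directly: adding 2^32 to the reported negative traversal result recovers a read count that matches the ProgressMeter's 2.78e+09. A quick sketch in bash (whose arithmetic is 64-bit, so the overflow is undone explicitly):

```shell
reported=-1519325197                       # broken [REDUCE RESULT] value from the log
true_count=$(( reported + 4294967296 ))    # add 2^32 to undo the 32-bit wraparound
echo "$true_count"                         # prints 2775642099, i.e. ~2.78e9 reads
```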
So I've just tried the whole of chr1 in RTC, which gave:

INFO  16:23:19,911 ProgressMeter -            done        2.49e+08    9.4 m        2.0 s    100.0%         9.4 m     0.0 s
INFO  16:23:19,912 ProgressMeter - Total runtime 563.61 secs, 9.39 min, 0.16 hours
INFO  16:23:19,912 MicroScheduler - 46560609 reads were filtered out during traversal out of 287085351 total (16.22%)
INFO  16:23:19,912 MicroScheduler -   -> 2195996 reads (0.76% of total) failing BadMateFilter
INFO  16:23:19,912 MicroScheduler -   -> 23464896 reads (8.17% of total) failing DuplicateReadFilter
INFO  16:23:19,913 MicroScheduler -   -> 20899717 reads (7.28% of total) failing MappingQualityZeroFilter
INFO  16:23:21,295 GATKRunReport - Uploaded run statistics report to AWS S3

I'm now doing IR on this bit too, using the intervals from the above step. Looks like the result is ~8 hours away; I'll get back to you when it's done. My guess is that the traversal stats will be fine, but it's worth a shot before I send you the full file.

Post edited by MattB on

MattB · Member · Posts: 20
edited May 2013

Thanks, that puts my data paranoia at rest! Just as long as I'm not maxing out anything else somewhere which is a 32-bit integer (assuming that was the bug)! Was that the issue with the counts?

Post edited by MattB on

Eric Banks · Broad Institute · Administrator, GATK Developer · Posts: 684

Yes, a long being cast to an integer.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

MattB · Member · Posts: 20

Just to follow up on this, you might be interested to know that the final lines in the PrintReads output (for the same bunch of exomes, post-BQSR) also had a rather odd traversal result:

INFO  05:02:08,403 ReadShardBalancer1 - Loading BAM index data for next contig
INFO  05:02:09,459 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761
INFO  05:02:10,373 ProgressMeter -            done        2.78e+09   19.1 h       24.0 s    100.0%        19.1 h     0.0 s
INFO  05:02:10,373 ProgressMeter - Total runtime 68642.62 secs, 1144.04 min, 19.07 hours
INFO  05:02:10,374 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
INFO  05:02:11,954 GATKRunReport - Uploaded run statistics report to AWS S3

Not sure why it's a REDUCE RESULT, since I've not specified any reduction, but the "org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761" might be useful.