
Apparent loss of reads during realignment

MattB (Newcastle) Member

Hi, I have 32 exomes in a merged BAM file, with read groups (id, sample, library and platform all set, I hope, correctly) to identify each exome. What I can't understand is why there is a discrepancy between the number of reads traversed as reported in the logs of RealignerTargetCreator vs IndelRealigner:

From my RealignerTargetCreator run I got:

INFO 18:24:17,286 ProgressMeter - Total runtime 5394.91 secs, 89.92 min, 1.50 hours
INFO 18:24:17,286 MicroScheduler - 389,565,880 reads were filtered out during traversal out of 2,752,553,629 total (14.15%)
INFO 18:24:17,287 MicroScheduler - -> 24573133 reads (0.89% of total) failing BadMateFilter
INFO 18:24:17,287 MicroScheduler - -> 229871090 reads (8.35% of total) failing DuplicateReadFilter
INFO 18:24:17,287 MicroScheduler - -> 135121273 reads (4.91% of total) failing MappingQualityZeroFilter
INFO 18:24:17,298 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter

Vs this from IndelRealigner

INFO 11:47:21,707 MicroScheduler - 0 reads were filtered out during traversal out of 200,100 total (0.00%)

Why have so few of the reads RealignerTargetCreator reported been traversed by IndelRealigner?

My command for IndelRealigner was:

java -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner \
    --maxReadsInMemory 1000000 --maxReadsForRealignment 1000000 \
    -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf \
    -I Merged_dedup.bam -R /data/GATK_bundle/hg19/ucsc.hg19.fasta \
    -targetIntervals forIndelRealigner.intervals -o Merged_dedup_realigned.bam


Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie
    edited May 2013

    Hi Matt,

    Keep in mind that when you run the RTCreator, it traverses the entire dataset, so it looks at pretty much all the reads. In contrast, the IRealigner only traverses the targets created by RTC, so it only looks at the subset of reads that are contained within those intervals. Makes sense?

    EDIT: This is not actually right, I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.

  • MattB (Newcastle) Member
    edited May 2013

    Thanks, I see, i.e. the 200,100 is down to the number of target reads for realignment. The next output I'm trying to figure out is from BaseRecalibrator; here the MicroScheduler is reporting a traversal of a similar number:

    INFO 22:00:49,216 BaseRecalibrator - Processed: 2370766246 reads
    INFO 22:00:49,216 ProgressMeter - done 2.37e+09 6.9 h 10.0 s 100.0% 6.9 h 0.0 s
    INFO 22:00:49,216 ProgressMeter - Total runtime 24957.86 secs, 415.96 min, 6.93 hours
    INFO 22:00:49,216 MicroScheduler - 28283 reads were filtered out during traversal out of 228442 total (12.38%)
    INFO 22:00:49,217 MicroScheduler - -> 22674 reads (9.93% of total) failing DuplicateReadFilter
    INFO 22:00:49,217 MicroScheduler - -> 5609 reads (2.46% of total) failing MappingQualityZeroFilter

    In the next stage, PrintReads, I'm getting the same 200,100 value back and I'm concerned something has gone wrong.

    INFO 15:02:08,627 ProgressMeter - Total runtime 99031.08 secs, 1650.52 min, 27.51 hours
    INFO 15:02:08,653 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    I see -- that is a little odd, but I think this might be a problem of the counting not working properly, while the tools themselves ran correctly. I think we've seen and fixed a similar issue in the past -- can you tell me what version you're using?

  • MattB (Newcastle) Member

    Interesting, I was just wondering if I'd maxed out some internal counter. I'm using what I think is the current release, 2.4-9-g532efad.

    I did try to count the number of reads in my BAM file using samtools to get a handle on things, but it reported some huge negative value (I'm assuming I maxed that out too); I'll try with Picard.

  • MattB (Newcastle) Member

    I've just looked over my RealignerTargetCreator log and I found this line at the top:

    INFO 17:50:22,143 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

    I've not explicitly asked it to do any downsampling, could this be the cause of the issue?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    FYI you can use the CountReads walker (in GATK) to count reads as well.

    The downsampling is done by default to avoid getting bogged down in regions of huge coverage (which are usually not informative). I wouldn't expect it to make that huge a difference in the reported read counts, going from 2Bn reads to 200K traversed. I'm still leaning towards thinking the traversal counter is buggy, maxing out somehow.

    Getting accurate read counts should help shed some light on this.

    You may also want to try running again with the 2.5 version, which we released a couple of days ago. It's almost all bug fixes in this one.

  • MattB (Newcastle) Member

    Did not know 2.5 was now out; I'll give that a try and get back to you with CountReads walker counts too!

  • MattB (Newcastle) Member

    OK, so with 2.5-2-gf57256b, using IndelRealigner I'm still getting the figure of 200,100 from the MicroScheduler. I guess this could be down to the number of intervals, but it's suspiciously the same as the number I got out of PrintReads with the previous version. I'm counting the number of reads in my various files now and will post that when I have the info.

    My input for RealignerTargetCreator is a 201G BAM file, and my output from IndelRealigner is 202G, so in principle it looks to have at least output all the previous data with an additional bit. I'm doing the BaseRecalibrator stage now and will follow up with PrintReads output when I have it.

    INFO 15:37:16,054 ReadShardBalancer$1 - Loading BAM index data for next contig
    INFO 15:37:24,220 ProgressMeter - done 2.78e+09 87.0 h 112.0 s 100.0% 87.0 h 0.0 s
    INFO 15:37:24,221 ProgressMeter - Total runtime 313358.19 secs, 5222.64 min, 87.04 hours
    INFO 15:37:24,334 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 15:37:27,030 GATKRunReport - Uploaded run statistics report to AWS S3

    This follows a run of RealignerTargetCreator that had the following output:

    INFO 21:51:11,193 ProgressMeter - done 3.14e+09 92.3 m 1.0 s 100.0% 92.3 m 0.0 s
    INFO 21:51:11,193 ProgressMeter - Total runtime 5539.06 secs, 92.32 min, 1.54 hours
    INFO 21:51:11,194 MicroScheduler - 389580510 reads were filtered out during traversal out of 2752562860 total (14.15%)
    INFO 21:51:11,194 MicroScheduler - -> 24572210 reads (0.89% of total) failing BadMateFilter
    INFO 21:51:11,194 MicroScheduler - -> 229879914 reads (8.35% of total) failing DuplicateReadFilter
    INFO 21:51:11,194 MicroScheduler - -> 135128002 reads (4.91% of total) failing MappingQualityZeroFilter
    INFO 21:51:11,195 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter
    INFO 21:51:14,441 GATKRunReport - Uploaded run statistics report to AWS S3

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Hmm, that does sound like the counter is buggy. We've had other reports with a different symptom, also pointing to buggy counter results. Let me know what results you get and I'll see how we can look into this.

  • MattB (Newcastle) Member

    Thanks, I should have more counts for you by the end of the day.

  • MattB (Newcastle) Member
    edited May 2013

    OK, I've run CountReads on the merged, deduplicated input file I used at the beginning for RealignerTargetCreator (which had itself correctly reported the number of reads).

    I used this command:

    java -Xmx128g -jar ~/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T CountReads \
        -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -I Merged_dedup.bam -nct 8

    And the output I get here also appears to be broken. The large negative number in the traversal result is interesting too, as this looks like a 32-bit integer overflow: my ~2.7 billion reads wrap past the 32-bit maximum of ~2.1 billion, so the counter flips to about -2.1 billion, and adding the remaining ~600 million brings it to about -1.5 billion. As before, the MicroScheduler is reporting a traversal of 200100.

    INFO 13:29:48,459 Walker - [REDUCE RESULT] Traversal result is: -1519325197
    INFO 13:29:48,460 ProgressMeter - done 2.78e+09 3.2 h 4.0 s 100.0% 3.2 h 0.0 s
    INFO 13:29:48,460 ProgressMeter - Total runtime 11408.46 secs, 190.14 min, 3.17 hours
    INFO 13:29:48,563 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 13:29:50,421 GATKRunReport - Uploaded run statistics report to AWS S3

    Just double checked and I'm using 64 bit Java.
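The wraparound described above can be reproduced in a few lines of Java (a minimal sketch, not GATK code; the read count is chosen so the wrapped value matches the CountReads log above):

```java
public class TraversalOverflow {
    public static void main(String[] args) {
        // ~2.78e9 reads, as in the log above
        long totalReads = 2_775_642_099L;

        // Narrowing a long to an int keeps only the low 32 bits,
        // so any count past 2,147,483,647 wraps around to a negative value.
        int wrapped = (int) totalReads;

        System.out.println(wrapped);  // prints -1519325197, as in the CountReads log
    }
}
```

Keeping the accumulator as a long end to end avoids the wraparound entirely.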

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Alright, we're going to have to look into this more closely. We'll want a snippet of your file to debug the PrintReads / IR issue -- but preferably not the whole bam file. Can you see if you can select out a region where PrintReads and RTC output different numbers? If you can't then we may just have to look at the whole thing, but we'd rather avoid that if possible.

    Also, to correct a statement I made early on, I should have said that the IndelRealigner only acts on reads within the targets created by RTC, but it should in fact count all reads in the file.

  • MattB (Newcastle) Member
    edited May 2013

    Thanks for your help. I can try to cut out a snippet, but my guess is that it's the sheer number of reads in total that is the issue. So I've just tried the whole of chr1 in RTC, which gave:

    INFO 16:23:19,911 ProgressMeter - done 2.49e+08 9.4 m 2.0 s 100.0% 9.4 m 0.0 s
    INFO 16:23:19,912 ProgressMeter - Total runtime 563.61 secs, 9.39 min, 0.16 hours
    INFO 16:23:19,912 MicroScheduler - 46560609 reads were filtered out during traversal out of 287085351 total (16.22%)
    INFO 16:23:19,912 MicroScheduler - -> 2195996 reads (0.76% of total) failing BadMateFilter
    INFO 16:23:19,912 MicroScheduler - -> 23464896 reads (8.17% of total) failing DuplicateReadFilter
    INFO 16:23:19,913 MicroScheduler - -> 20899717 reads (7.28% of total) failing MappingQualityZeroFilter
    INFO 16:23:21,295 GATKRunReport - Uploaded run statistics report to AWS S3

    I'm now doing IR on this bit too, using the intervals from the above step. Looks like the result is ~8 hours away; I'll get back to you when it's done. My guess is that the traversal stats will be fine, but it's worth a shot before I send you the full file.

  • MattB (Newcastle) Member
    edited May 2013

    Thanks, that puts my data paranoia at rest! :) Just as long as I'm not maxing out anything else somewhere that is a 32-bit integer (assuming that was the bug)! Was that the issue with the counts?

  • ebanks (Broad Institute) Member, Broadie, Dev

    Yes, a long being cast to an integer.
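A minimal sketch of that bug pattern and its fix (hypothetical variable names, not the actual GATK source; the total is taken from the RTC log above):

```java
public class ReadCounterFix {
    public static void main(String[] args) {
        long nReads = 2_752_562_860L;  // read total reported by RealignerTargetCreator

        // Bug pattern: the long count is cast to an int somewhere along the way,
        // which keeps only the low 32 bits and goes negative past ~2.1Bn.
        int buggyCount = (int) nReads;

        // Fix: keep the accumulator as a long end to end.
        long fixedCount = nReads;

        System.out.println(buggyCount);  // negative: overflowed
        System.out.println(fixedCount);  // correct total
    }
}
```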

  • MattB (Newcastle) Member

    Just to follow up on this: you might be interested to know that the final lines in the PrintReads output (for the same set of exomes, post-BQSR) also had a rather odd traversal result:

    INFO 05:02:08,403 ReadShardBalancer$1 - Loading BAM index data for next contig
    INFO 05:02:09,459 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761
    INFO 05:02:10,373 ProgressMeter - done 2.78e+09 19.1 h 24.0 s 100.0% 19.1 h 0.0 s
    INFO 05:02:10,373 ProgressMeter - Total runtime 68642.62 secs, 1144.04 min, 19.07 hours
    INFO 05:02:10,374 MicroScheduler - 0 reads were filtered out during traversal out of 200100 total (0.00%)
    INFO 05:02:11,954 GATKRunReport - Uploaded run statistics report to AWS S3

    Not sure why it's a REDUCE RESULT, since I've not specified any reduction, but the "org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@65196761" might be useful.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Oh, "reduce result" just refers to the fact that GATK uses a map/reduce framework to process data; you can safely ignore that.
