
I don't understand: what is "filtering out during traversal"?

pallavi (Posts: 13, Member)
edited April 2013 in Ask the team

When we run BaseRecalibrator, i.e. the first phase of BQSR, we get the following information in the logs at the end of the process:

INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads
INFO 02:09:56,514 ProgressMeter - done 1.06e+09 5.4 d 7.3 m 100.0% 5.4 d 0.0 s
INFO 02:09:56,514 ProgressMeter - Total runtime 464091.25 secs, 7734.85 min, 128.91 hours
INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total (55.89%)
INFO 02:09:56,514 MicroScheduler - -> 155 reads (0.34% of total) failing DuplicateReadFilter
INFO 02:09:56,515 MicroScheduler - -> 25196 reads (55.55% of total) failing MappingQualityZeroFilter
INFO 02:10:00,308 GATKRunReport - Uploaded run statistics report to AWS S3

What does this phrase "filtered out during traversal" mean? Does it have some specific significance?


Answers

  • biscuit13161 (Posts: 9, Member)

    Following on from this, can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" from Pallavi's example) differs so much from the total number of reads ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total")? Do these numbers relate to different sets of reads?

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Hmm, that's a good observation. There shouldn't be such a huge difference. In some older versions of GATK, we had problems with the counter not reporting the right total numbers. That may be what's happening here. Without seeing the version number and command line, it's difficult to say more.

    Geraldine Van der Auwera, PhD

  • biscuit13161 (Posts: 9, Member)

    Hi Geraldine,

    Obviously I don't know about Pallavi, but I do know I'm observing this with v2.5.2.

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Ah, I see. Can you please post your command line and the relevant console output?

    Geraldine Van der Auwera, PhD

  • pallavi (Posts: 13, Member)

    @biscuit13161 Yes, I have observed it too and was thinking of asking about it specifically, so it's good you raised it here. Actually, I posted this question with that numbers discrepancy in mind; it is quite confusing, which is why I first asked whether "filtered out during traversal" means something specific or not. @Geraldine: I got this log from v2.3-9.

  • pallavi (Posts: 13, Member)

    @Geraldine: Here is the command line which I used:

    java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -ql 95 -I 100gb_whole_genome_run/final_100_gb.bam -R /storage/avadisngs/GATK/hg19.fa -knownSites /storage/avadisngs/GATK/chR20.vcf -o 100gb_whole_genome_run/recal_data_exome_2.3.grp

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Ok, we'll look into this. Sorry about the confusion.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Okay, it looks like we have a bug in the MicroScheduler counts. The good news is that this doesn't actually affect the processing of your data, so you can safely ignore it for now.

    Geraldine Van der Auwera, PhD

  • ebanks (Posts: 671, GSA Member, mod)

    This is fixed in our internal version and will be available in the nightly builds from now on.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • casch (Posts: 14, Member)

    Hi - I'm having a similar issue. I'm running GATK v2.5-2 and with BaseRecalibrator I get the following message:

    INFO 17:50:42,743 MicroScheduler - 321635 reads were filtered out during traversal out of 521639 total (61.66%)
    INFO 17:50:42,744 MicroScheduler - -> 2526 reads (0.48% of total) failing DuplicateReadFilter
    INFO 17:50:42,744 MicroScheduler - -> 319109 reads (61.17% of total) failing MappingQualityZeroFilter
    INFO 17:50:45,202 GATKRunReport - Uploaded run statistics report to AWS S3

    (So 61% of reads failed the mapping quality zero filter, which is quite alarming.)

    In the step just preceding this one (local realignment), the percentage was much lower for the same sample:

    INFO 22:39:49,340 MicroScheduler - 144903903 reads were filtered out during traversal out of 1833725533 total (7.90%)
    INFO 22:39:49,340 MicroScheduler - -> 14349297 reads (0.78% of total) failing BadMateFilter
    INFO 22:39:49,341 MicroScheduler - -> 23871004 reads (1.30% of total) failing DuplicateReadFilter
    INFO 22:39:49,341 MicroScheduler - -> 106683602 reads (5.82% of total) failing MappingQualityZeroFilter
    INFO 22:39:51,923 GATKRunReport - Uploaded run statistics report to AWS S3

    And again for the SNP calling with UnifiedGenotyper, following the recalibration, the values are low. Also, if I use an older version of GATK with CountCovariates for the recalibration, the values are also low (<10%) on the same sample.

    From the previous posts it seems it's just a problem in the counter and it does not affect the data itself?

    I just want to make 100% sure this is correct: should I just ignore this high value of filtered reads (which I only get during this one step with BaseRecalibrator)?

    /thanks

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    It's very probable that the inconsistency you're seeing is due to that bug, yes. But to be 100% sure you can run a simple tool like CountReads from the latest GATK version (2.7, in which the bug is fixed) on your data with -rf MappingQualityZero. This should give you a definitive bug-free count of the reads that fail that filter.
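
    For reference, here is a minimal sketch of that check. The file names (reference.fasta, my_sample.bam) are placeholders; point them at your own reference and BAM:

        java -jar GenomeAnalysisTK.jar \
            -T CountReads \
            -R reference.fasta \
            -I my_sample.bam \
            -rf MappingQualityZero

    The filtering summary printed at the end of the run then tells you how many reads failed the MappingQualityZero filter.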

    Geraldine Van der Auwera, PhD

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    When I run RealignerTargetCreator, I get the following in the INFO output: 8,418,115 reads were filtered out during the traversal out of approximately 28,536,526 total reads (29.50%). Does this mean that 29.5% of my reads are removed from the produced .bam file and only 20,122,411 reads will be used for the next steps (i.e. IndelRealigner, BaseRecalibrator, etc.)?

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    RealignerTargetCreator doesn't produce a bam file, it only produces a list of target intervals. So all the reads will still be available for the next step. But depending what the filtering reason is, some reads may be filtered out again by IndelRealigner in the next step. If that happens those reads will indeed not be written in the new realigned bam file.
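
    To make the two steps concrete, here is a rough sketch (file names such as reference.fasta, my_sample.bam and realign_targets.intervals are placeholders):

        # Step 1: RealignerTargetCreator only writes a list of intervals, not a bam
        java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
            -R reference.fasta -I my_sample.bam \
            -o realign_targets.intervals

        # Step 2: IndelRealigner reads the full bam again, applies its own read filters,
        # and writes the realigned bam
        java -jar GenomeAnalysisTK.jar -T IndelRealigner \
            -R reference.fasta -I my_sample.bam \
            -targetIntervals realign_targets.intervals \
            -o my_sample.realigned.bam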

    Which filters are your reads failing? 29% is a lot, you may need to do some QC on your data.

    Geraldine Van der Auwera, PhD

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    Here is the complete INFO output:

    INFO 13:57:59,158 ProgressMeter - Total runtime 2422.06 secs, 40.37 min, 0.67 hours
    INFO 13:57:59,158 MicroScheduler - 8418115 reads were filtered out during the traversal out of approximately 28536526 total reads (29.50%)
    INFO 13:57:59,158 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 13:57:59,159 MicroScheduler - -> 599908 reads (2.10% of total) failing BadMateFilter
    INFO 13:57:59,159 MicroScheduler - -> 4159068 reads (14.57% of total) failing DuplicateReadFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 13:57:59,160 MicroScheduler - -> 1041927 reads (3.65% of total) failing MappingQualityZeroFilter
    INFO 13:57:59,160 MicroScheduler - -> 2617212 reads (9.17% of total) failing NotPrimaryAlignmentFilter
    INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter
    INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 13:58:01,688 GATKRunReport - Uploaded run statistics report to AWS S3

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Well, the 14% duplicates can't be helped at this point; the 9% not-primary alignments plus 3% mapping quality zero suggest the alignment isn't stellar...

    Geraldine Van der Auwera, PhD

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    Is it because my deep-sequencing reads come from exome sequencing?

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    I can't help you further with this, sorry, as it's out of the scope of support we provide. It may not even be a very big deal (I do not have expertise in assessing alignment results), but if you have concerns you should discuss this with your sequence data provider. You can ask about standards of quality and recommended procedures for best alignment results.

    Geraldine Van der Auwera, PhD

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    @Geraldine_VdAuwera said: I can't help you further with this, sorry, as it's out of the scope of support we provide...

    Geraldine,

    Maybe the following is out of the scope of support you provide, but I just want to add that I used bwa mem for mapping, and samtools idxstats shows me that 28,374,381 reads out of 28,398,097 (99.9%) are mapped to the genome.

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    28,374,381 reads out of 28,398,097 (99.9%) are mapped to the genome.

    Sure, but based on the read filtering 14% of those are duplicate reads (which are not informative) and 12% either have mapping quality zero (which means they're mapped but with zero confidence, so are also useless) or are aligned in more than one place (which is effectively pretty much equivalent to zero confidence mapping). That adds up to a lot of data that is unusable even if it is mapped. And maybe that's fine (I am not the right person to say what the acceptable amount of "waste data" should be) but it is important to be aware of how much usable data you have, because someone is paying for the data production, and they should get their money's worth.
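
    If you want a quick, independent look at where that unusable fraction comes from, something like the following samtools commands would do it (my_sample.bam is a placeholder for your file):

        samtools view -c -f 0x400 my_sample.bam        # reads flagged as duplicates
        samtools view -c -f 0x100 my_sample.bam        # secondary (not primary) alignments
        samtools view -c -F 0x4 my_sample.bam          # all mapped reads
        samtools view -c -F 0x4 -q 1 my_sample.bam     # mapped reads with MAPQ >= 1

    The difference between the last two counts is the number of mapped reads with mapping quality zero.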

    Geraldine Van der Auwera, PhD

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    OK. Again, my next question is maybe out of scope, but when we run MarkDuplicates, a .metrics file is produced, and in that file I got a 0.17% duplication. Can you tell me what the difference is between this number and the 14% duplicate reads obtained from the RealignerTargetCreator INFO output?

  • Geraldine_VdAuwera (Posts: 5,235, Administrator, GSA Member, admin)

    Unfortunately I'm not familiar with the calculations that go into the MarkDuplicates metrics file. I can tell you that the GATK simply uses the flags added by MarkDuplicates; it doesn't do any additional evaluation of what might or might not be duplicates.

    Geraldine Van der Auwera, PhD

  • pdexheimer (Posts: 297, Member, GSA Collaborator ✭✭✭)

    The value reported by Picard in the PERCENT_DUPLICATION is actually the fraction of duplicate reads, not a percentage. That is, a value of 0.17 in that column means 17%, not 0.17%. My suspicion is that you're calling RTC on a subset of the bam (using the -L parameter), which is why the numbers don't match exactly.

    My experience with exomes is that your numbers, while certainly not good, are also not unusual. It takes a very determined effort on the part of the lab doing the library capture to keep that duplicate rate down, and a lot of labs don't seem to know that it's important and/or don't put the effort in. For reference, I start complaining at the lab here when the dupe rate gets much above 5%, but I've seen data from external labs where it was as high as 20 or 25%. All of these are assuming PE sequencing - with single reads, all assumptions go out the window.

    My MQ0 numbers seem to be in the 1-2% range, so 3% doesn't raise any alarm bells for me. And I can't really speak to the NotPrimaryAlignment numbers - those are due to bwa mem splitting up the reads, and I don't use mem yet.
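
    As a quick sanity check on the Picard side, you can pull the metrics line straight out of the MarkDuplicates output, assuming a standard metrics file (dup_metrics.txt is a placeholder name):

        grep -A 1 '^LIBRARY' dup_metrics.txt

    The PERCENT_DUPLICATION column in the line that follows the header is the fraction, so 0.17 there means 17% duplicates.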

  • frankib (Sherbrooke, Canada; Posts: 41, Member)

    @pdexheimer said: The value reported by Picard in the PERCENT_DUPLICATION is actually the fraction of duplicate reads, not a percentage. That is, a value of 0.17 in that column means 17%, not 0.17%...

    Thank you for your interesting answer. What I understand is that the duplication rate is high because of the wet-lab portion, resulting in some loss of data available for analysis. But will the variant calling be affected by a 14% duplication rate?

  • pdexheimer (Posts: 297, Member, GSA Collaborator ✭✭✭)

    Your guess is as good as mine. If the duplicated reads are random, then it shouldn't (provided you still have enough coverage after removing them). But if the high dupe rate is because of an underlying bias in one of the steps, who knows?
