
I don't understand: what is "filtering out during traversal"?

pallavi Posts: 16 Member
edited April 2013 in Ask the GATK team

When we run BaseRecalibrator, i.e. the first phase of BQSR, the log ends with information like the following:

INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads
INFO 02:09:56,514 ProgressMeter - done 1.06e+09 5.4 d 7.3 m 100.0% 5.4 d 0.0 s
INFO 02:09:56,514 ProgressMeter - Total runtime 464091.25 secs, 7734.85 min, 128.91 hours
INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total (55.89%)
INFO 02:09:56,514 MicroScheduler - -> 155 reads (0.34% of total) failing DuplicateReadFilter
INFO 02:09:56,515 MicroScheduler - -> 25196 reads (55.55% of total) failing MappingQualityZeroFilter
INFO 02:10:00,308 GATKRunReport - Uploaded run statistics report to AWS S3

What does this phrase "filtered out during traversal" mean? Does it have some specific significance?



Answers

  • biscuit13161 Posts: 13 Member

    Following on from this, can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" from Pallavi's example) differs so much from the total reported by the filter summary ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total")? Do these numbers relate to different sets of reads?

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Hmm, that's a good observation. There shouldn't be such a huge difference. In some older versions of GATK we had problems with the counter not reporting the right totals; that may be what's happening here. Without seeing the version number and command line it's difficult to say more.

    Geraldine Van der Auwera, PhD

  • biscuit13161 Posts: 13 Member

    Hi Geraldine,

    Obviously I don't know about Pallavi, but I do know I'm observing this with v2.5.2.

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Ah, I see. Can you please post your command line and the relevant console output?

    Geraldine Van der Auwera, PhD

  • pallavi Posts: 16 Member

    @biscuit13161 Yes, I have observed it too and was planning to ask about it specifically -- good that you raised it here. In fact, I posted this question with that numbers discrepancy in mind, which is why I first asked whether "filtering out during traversal" means something specific.
    @Geraldine: This log is from v2.3-9.

  • pallavi Posts: 16 Member

    @Geraldine: Here is the command line I used:
    java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -ql 95 \
        -I 100gb_whole_genome_run/final_100_gb.bam \
        -R /storage/avadisngs/GATK/hg19.fa \
        -knownSites /storage/avadisngs/GATK/chR20.vcf \
        -o 100gb_whole_genome_run/recal_data_exome_2.3.grp

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Ok, we'll look into this. Sorry about the confusion.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Okay, it looks like we have a bug in the MicroScheduler counts. The good news is that this doesn't actually affect the processing of your data, so you can safely ignore it for now.

    Geraldine Van der Auwera, PhD

  • ebanks Broad Institute Posts: 684 Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member admin

    This is fixed in our internal version and will be available in the nightly builds from now on.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • casch Posts: 14 Member

    Hi - I'm having a similar issue. I'm running GATK v2.5-2 and with BaseRecalibrator I get the following message:

    INFO 17:50:42,743 MicroScheduler - 321635 reads were filtered out during traversal out of 521639 total (61.66%)
    INFO 17:50:42,744 MicroScheduler - -> 2526 reads (0.48% of total) failing DuplicateReadFilter
    INFO 17:50:42,744 MicroScheduler - -> 319109 reads (61.17% of total) failing MappingQualityZeroFilter
    INFO 17:50:45,202 GATKRunReport - Uploaded run statistics report to AWS S3

    (So 61% of reads failed the MappingQualityZero filter, which is quite alarming.)

    In the immediately preceding step (local realignment), the rate was much lower for the same sample:

    INFO 22:39:49,340 MicroScheduler - 144903903 reads were filtered out during traversal out of 1833725533 total (7.90%)
    INFO 22:39:49,340 MicroScheduler - -> 14349297 reads (0.78% of total) failing BadMateFilter
    INFO 22:39:49,341 MicroScheduler - -> 23871004 reads (1.30% of total) failing DuplicateReadFilter
    INFO 22:39:49,341 MicroScheduler - -> 106683602 reads (5.82% of total) failing MappingQualityZeroFilter
    INFO 22:39:51,923 GATKRunReport - Uploaded run statistics report to AWS S3

    And again, for the SNP calling with UnifiedGenotyper following the recalibration, the values are low. Also, if I use an older version of GATK with CountCovariates for the recalibration, the values are low (<10%) on the same sample.

    From the previous posts it seems it's just a problem in the counter and does not affect the data itself?

    I just want to make 100% sure: is this correct, and should I just ignore this high value for MappingQualityZero reads (which I only get during this one step, with BaseRecalibrator)?

    /thanks

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    It's very probable that the inconsistency you're seeing is due to that bug, yes. But to be 100% sure you can run a simple tool like CountReads from the latest GATK version (2.7, in which the bug is fixed) on your data with -rf MappingQualityZero. This should give you a definitive bug-free count of the reads that fail that filter.

    Geraldine Van der Auwera, PhD
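
    For reference, a verification run along those lines could look like this (a sketch only; the jar, reference, and BAM paths are placeholders):

    java -jar GenomeAnalysisTK.jar \
        -T CountReads \
        -R reference.fa \
        -I sample.bam \
        -rf MappingQualityZero
    # The MicroScheduler filter summary printed at the end of the run reports how many
    # reads failed MappingQualityZeroFilter, independently of BaseRecalibrator.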

  • frankib Sherbrooke, Canada Posts: 41 Member

    When I run RealignerTargetCreator, I get the following in the INFO output: 8 418 115 reads were filtered out during the traversal out of approximately 28 536 526 total reads (29.50%). Does this mean that 29.5% of my reads are removed from the produced .bam file and that only 20 122 411 reads will be used for the next steps (i.e. IndelRealigner, BaseRecalibrator, etc.)?

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    RealignerTargetCreator doesn't produce a bam file; it only produces a list of target intervals, so all the reads will still be available for the next step. But depending on what the filtering reason is, some reads may be filtered out again by IndelRealigner in the next step. If that happens, those reads will indeed not be written to the new realigned bam file.

    Which filters are your reads failing? 29% is a lot, you may need to do some QC on your data.

    Geraldine Van der Auwera, PhD
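
    To make the division of labor concrete, here is a minimal sketch of the two-step workflow (all paths are placeholders):

    # Step 1: RealignerTargetCreator only writes an interval list; no BAM is produced
    # and no reads are removed from the input.
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
        -R reference.fa -I sample.bam -o targets.intervals

    # Step 2: IndelRealigner reads the full original BAM again and writes a new one;
    # its own (much smaller) set of read filters applies at this step.
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
        -R reference.fa -I sample.bam \
        -targetIntervals targets.intervals -o realigned.bam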

  • frankib Sherbrooke, Canada Posts: 41 Member

    Here is the complete INFO output:
    INFO 13:57:59,158 ProgressMeter - Total runtime 2422.06 secs, 40.37 min, 0.67 hours
    INFO 13:57:59,158 MicroScheduler - 8418115 reads were filtered out during the traversal out of approximately 28536526 total reads (29.50%)
    INFO 13:57:59,158 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 13:57:59,159 MicroScheduler - -> 599908 reads (2.10% of total) failing BadMateFilter
    INFO 13:57:59,159 MicroScheduler - -> 4159068 reads (14.57% of total) failing DuplicateReadFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 13:57:59,160 MicroScheduler - -> 1041927 reads (3.65% of total) failing MappingQualityZeroFilter
    INFO 13:57:59,160 MicroScheduler - -> 2617212 reads (9.17% of total) failing NotPrimaryAlignmentFilter
    INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter
    INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 13:58:01,688 GATKRunReport - Uploaded run statistics report to AWS S3

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Well, the 14% duplicates can't be helped at this point; the 9% not-primary alignments plus 3% mapping quality zero suggest the alignment isn't stellar...

    Geraldine Van der Auwera, PhD

  • frankib Sherbrooke, Canada Posts: 41 Member

    Could it be because my deep-sequencing reads are from exome sequencing?

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    I can't help you further with this, sorry, as it's out of the scope of the support we provide. It may not even be a very big deal (I don't have expertise in assessing alignment results), but if you have concerns you should discuss this with your sequence data provider. You can ask about standards of quality and recommended procedures for best alignment results.

    Geraldine Van der Auwera, PhD

  • frankib Sherbrooke, Canada Posts: 41 Member

    OK. I see. Thank you.

  • frankib Sherbrooke, Canada Posts: 41 Member

    @Geraldine_VdAuwera said:
    I can't help you further with this, sorry, as it's out of the scope of the support we provide. It may not even be a very big deal (I don't have expertise in assessing alignment results), but if you have concerns you should discuss this with your sequence data provider. You can ask about standards of quality and recommended procedures for best alignment results.

    Geraldine,

    Maybe the following is out of scope, but I just want to add that I used bwa mem for mapping, and samtools idxstats shows me that 28 374 381 reads out of 28 398 097 (99.9%) are mapped to the genome.
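
    For reference, that kind of tally can be reproduced along these lines (the BAM name is a placeholder); note that idxstats counts every mapped record, including secondary alignments and duplicates:

    # idxstats columns: reference name, reference length, # mapped reads, # unmapped reads
    samtools idxstats sample.bam \
        | awk '{m += $3; u += $4} END {printf "mapped: %d, unmapped: %d (%.1f%% mapped)\n", m, u, 100 * m / (m + u)}'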

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    28 374 381 reads out of 28 398 097 (99.9%) are mapped to the genome.

    Sure, but based on the read filtering, 14% of those are duplicate reads (which are not informative), and 12% either have mapping quality zero (meaning they're mapped but with zero confidence, so they're also useless) or are aligned in more than one place (which is effectively equivalent to zero-confidence mapping). That adds up to a lot of data that is unusable even if it is mapped. And maybe that's fine (I am not the right person to say what an acceptable amount of "waste data" is), but it is important to be aware of how much usable data you have, because someone is paying for the data production and they should get their money's worth.

    Geraldine Van der Auwera, PhD

  • frankib Sherbrooke, Canada Posts: 41 Member

    OK.
    Again, my next question is maybe out of scope, but when we run MarkDuplicates, a .metrics file is produced, and in that file I got a 0.17% duplication. Can you tell me what the difference is between this number and the 14% duplicate reads reported in the RealignerTargetCreator INFO output?

  • Geraldine_VdAuwera Posts: 7,759 Administrator, GATK Dev admin

    Unfortunately I'm not familiar with the calculations that go into the MarkDuplicates metrics file. I can tell you that the GATK simply uses the flags added by MarkDuplicates; it doesn't do any additional evaluation of what might or might not be duplicates.

    Geraldine Van der Auwera, PhD

  • pdexheimer Posts: 463 Member, GATK Dev, DSDE Member mod

    The value reported by Picard in the PERCENT_DUPLICATION column is actually the fraction of duplicate reads, not a percentage. That is, a value of 0.17 in that column means 17%, not 0.17%. My suspicion is that you're calling RTC on a subset of the bam (using the -L parameter), which is why the numbers don't match exactly.

    My experience with exomes is that your numbers, while certainly not good, are also not unusual. It takes a very determined effort on the part of the lab doing the library capture to keep that duplicate rate down, and a lot of labs don't seem to know that it's important and/or don't put the effort in. For reference, I start complaining at the lab here when the dupe rate gets much above 5%, but I've seen data from external labs where it was as high as 20 or 25%. All of these are assuming PE sequencing - with single reads, all assumptions go out the window.

    My MQ0 numbers seem to be in the 1-2% range, so 3% doesn't raise any alarm bells for me. And I can't really speak to the NotPrimaryAlignment numbers - those are due to bwa mem splitting up the reads, and I don't use mem yet.
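
    As an illustration of how to read that column, an extraction along these lines should work (the file name is a placeholder, and it assumes the standard Picard metrics layout: a "## METRICS" marker line, then a header row, then one data row per library):

    # Convert Picard's PERCENT_DUPLICATION fraction into an actual percentage.
    grep -A 2 '^## METRICS' sample.dup_metrics.txt \
        | awk -F'\t' 'NR == 2 {for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i}
                      NR == 3 {printf "%.1f%% duplication\n", $col * 100}'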

  • frankib Sherbrooke, Canada Posts: 41 Member

    @pdexheimer said:
    The value reported by Picard in the PERCENT_DUPLICATION column is actually the fraction of duplicate reads, not a percentage. That is, a value of 0.17 in that column means 17%, not 0.17%. My suspicion is that you're calling RTC on a subset of the bam (using the -L parameter), which is why the numbers don't match exactly.

    My experience with exomes is that your numbers, while certainly not good, are also not unusual. It takes a very determined effort on the part of the lab doing the library capture to keep that duplicate rate down, and a lot of labs don't seem to know that it's important and/or don't put the effort in. For reference, I start complaining at the lab here when the dupe rate gets much above 5%, but I've seen data from external labs where it was as high as 20 or 25%. All of these are assuming PE sequencing - with single reads, all assumptions go out the window.

    My MQ0 numbers seem to be in the 1-2% range, so 3% doesn't raise any alarm bells for me. And I can't really speak to the NotPrimaryAlignment numbers - those are due to bwa mem splitting up the reads, and I don't use mem yet.

    Thank you for your interesting answer.
    What I understand is that the duplication rate is high because of the wet-lab portion, resulting in some loss of the data available for analysis. But will the variant calling be affected by a 14% duplication rate?

  • pdexheimer Posts: 463 Member, GATK Dev, DSDE Member mod

    Your guess is as good as mine. If the duplicated reads are random, then it shouldn't (provided you still have enough coverage after removing them). But if the high dupe rate is because of an underlying bias in one of the steps, who knows?

  • Sophia Posts: 35 Member

    I have a slightly different question related to this topic:

    When running the RealignerTargetCreator, I am getting a list of ReadFilters and the number of reads that would be filtered out for each of them, e.g.:

    INFO  01:25:08,541 MicroScheduler - 515248653 reads were filtered out during the traversal out of approximately 695979089 total reads (74.03%) 
    INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
    INFO  01:25:08,542 MicroScheduler -   -> 1439083 reads (0.21% of total) failing BadMateFilter 
    INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter 
    INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
    INFO  01:25:08,543 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    INFO  01:25:08,543 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
    INFO  01:25:08,543 MicroScheduler -   -> 505509320 reads (72.63% of total) failing MappingQualityZeroFilter 
    INFO  01:25:08,543 MicroScheduler -   -> 8300250 reads (1.19% of total) failing NotPrimaryAlignmentFilter 
    INFO  01:25:08,544 MicroScheduler -   -> 0 reads (0.00% of total) failing Platform454Filter 
    INFO  01:25:08,544 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
    

    I understand that in this step, no bam file is created.

    However, the next step is IndelRealigner, which both uses the same input bam file again and produces an output file; here, the ReadFilter information at this step is:

    INFO  07:53:44,144 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 691695998 total reads (0.00%) 
    INFO  07:53:44,144 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    

    Moreover, I observed that in the bam files generated by IndelRealigner, the reads that were criticized by RealignerTargetCreator (e.g. MappingQualityZero) are still there, which means they will be used for downstream analysis steps. Is this expected? Am I therefore supposed to filter those reads out myself before proceeding?

    I am using GATK v3.1-1-g07a4bf8.

    Many thanks for any clarifications you may have regarding this issue.

    PS: I know this is an extreme example of the number of reads filtered out; this is because we are testing a new aligner.

  • Sheila Broad Institute Posts: 1,241 Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @Sophia

    Hi,

    When GATK filters reads, it does not discard them but simply marks them as filtered. So the reads that "were criticized" originally will still be present, but in any further downstream analysis they will be treated as filtered and will not be used. There is no need to filter them out yourself.

    The filter information is there to let you know if an alarming number of reads are of poor quality, so that you can recheck your data; but since you know what the issue is (a new aligner), you're fine to continue.

    -Sheila
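
    If you did nevertheless want a physically filtered BAM (not necessary, per the above), something along these lines should work (a sketch; paths are placeholders):

    # PrintReads applies the named engine-level read filter and writes only the passing reads.
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R reference.fa -I sample.bam \
        -rf MappingQualityZero -o filtered.bam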

  • Sophia Posts: 35 Member

    Thanks, @Sheila, that solves my issue.
