
I don't understand what "filtering out during traversal" means

Posts: 16 · Member
edited April 2013

When we run BaseRecalibrator (i.e. the first phase of BQSR), we get the following information in the logs at the end of the process:

INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads
INFO 02:09:56,514 ProgressMeter - done 1.06e+09 5.4 d 7.3 m 100.0% 5.4 d 0.0 s
INFO 02:09:56,514 ProgressMeter - Total runtime 464091.25 secs, 7734.85 min, 128.91 hours
INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total (55.89%)
INFO 02:09:56,514 MicroScheduler - -> 155 reads (0.34% of total) failing DuplicateReadFilter
INFO 02:09:56,515 MicroScheduler - -> 25196 reads (55.55% of total) failing MappingQualityZeroFilter
INFO 02:10:00,308 GATKRunReport - Uploaded run statistics report to AWS S3

What does this phrase "filtered out during traversal" mean? Does it have some specific significance?
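For reference, the per-filter "->" lines in a MicroScheduler summary should add up to the reported filtered total. A small Python sketch (not part of GATK) that recomputes the counts from the log above:

```python
import re

# Example MicroScheduler output, copied from the log above
log = """\
INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total (55.89%)
INFO 02:09:56,514 MicroScheduler - -> 155 reads (0.34% of total) failing DuplicateReadFilter
INFO 02:09:56,515 MicroScheduler - -> 25196 reads (55.55% of total) failing MappingQualityZeroFilter
"""

# Total filtered and grand total from the summary line
filtered, total = map(int, re.search(
    r"(\d+) reads were filtered out during traversal out of (\d+) total", log).groups())

# Per-filter counts from the "->" lines
per_filter = {name: int(n) for n, name in
              re.findall(r"-> (\d+) reads \([\d.]+% of total\) failing (\w+)", log)}

# The per-filter counts should sum to the reported filtered total
assert sum(per_filter.values()) == filtered  # 155 + 25196 == 25351
print(f"{filtered}/{total} filtered = {100 * filtered / total:.2f}%")  # → 25351/45356 filtered = 55.89%
```

Note that the summary is internally consistent; the puzzling part raised below is that the 45356 "total" does not match the 1059663038 reads reported as Processed.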

• Qatar · Posts: 13 · Member

Following on from this, can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" from Pallavi's example) differs so much from the total number of reads ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total")? Do these numbers relate to different sets of reads?

Hmm, that's a good observation. There shouldn't be such a huge difference. In some older versions of GATK we had problems with the counter not reporting the right totals. That may be what's happening here. Without seeing the version number and command line it's difficult to say more.

Geraldine Van der Auwera, PhD

• Qatar · Posts: 13 · Member

Hi Geraldine,

Obviously I don't know about Pallavi, but I do know I'm observing this with v2.5.2.

Ah, I see. Can you please post your command line and the relevant console output?

Geraldine Van der Auwera, PhD

• Posts: 16 · Member

@biscuit13161 Yes, I have observed it too. I was thinking of asking about it specifically; good that you asked it here. In fact, I raised this issue with the numbers discrepancy in mind. It is quite confusing, which is why I first asked whether "filtering out during traversal" means something specific.
@Geraldine: I got this log from v2.3-9.

• Posts: 16 · Member

@Geraldine: Here is the command line which I used:
java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -ql 95 -I 100gb_whole_genome_run/final_100_gb.bam -R /storage/avadisngs/GATK/hg19.fa -knownSites /storage/avadisngs/GATK/chR20.vcf -o 100gb_whole_genome_run/recal_data_exome_2.3.grp

Ok, we'll look into this. Sorry about the confusion.

Geraldine Van der Auwera, PhD

Okay, it looks like we have a bug in the MicroScheduler counts. The good news is that this doesn't actually affect the processing of your data, so you can safely ignore it for now.

Geraldine Van der Auwera, PhD

This is fixed in our internal version and will be available in the nightly builds from now on.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 14 · Member

Hi - I'm having a similar issue. I'm running GATK v2.5-2 and with BaseRecalibrator I get the following message:

INFO 17:50:42,743 MicroScheduler - 321635 reads were filtered out during traversal out of 521639 total (61.66%)
INFO 17:50:42,744 MicroScheduler - -> 2526 reads (0.48% of total) failing DuplicateReadFilter
INFO 17:50:42,744 MicroScheduler - -> 319109 reads (61.17% of total) failing MappingQualityZeroFilter
INFO 17:50:45,202 GATKRunReport - Uploaded run statistics report to AWS S3

(So 61% failed the MappingQualityZero filter, which is quite alarming.)

In the immediately preceding step (local realignment), the rate was much lower for the same sample:

INFO 22:39:49,340 MicroScheduler - 144903903 reads were filtered out during traversal out of 1833725533 total (7.90%)
INFO 22:39:49,340 MicroScheduler - -> 14349297 reads (0.78% of total) failing BadMateFilter
INFO 22:39:49,341 MicroScheduler - -> 23871004 reads (1.30% of total) failing DuplicateReadFilter
INFO 22:39:49,341 MicroScheduler - -> 106683602 reads (5.82% of total) failing MappingQualityZeroFilter
INFO 22:39:51,923 GATKRunReport - Uploaded run statistics report to AWS S3

And again for the SNP calling with UnifiedGenotyper, following the recalibration, the values are low again. Also, if I use an older version of GATK with CountCovariates for the recalibration, the values are low (<10%) on the same sample.

From the previous posts it seems it's just a problem in the counter and it does not affect the data itself?

I just want to make 100% sure: is this correct, and should I just ignore this high value for MQ0 reads (which I only get during this one step with BaseRecalibrator)?

/thanks

It's very probable that the inconsistency you're seeing is due to that bug, yes. But to be 100% sure you can run a simple tool like CountReads from the latest GATK version (2.7, in which the bug is fixed) on your data with -rf MappingQualityZero. This should give you a definitive bug-free count of the reads that fail that filter.

Geraldine Van der Auwera, PhD
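If running another GATK version isn't convenient, the MQ0 count can also be cross-checked outside GATK by reading the MAPQ field (the fifth column) of each SAM record, e.g. from `samtools view` output. A minimal Python sketch; the records below are fabricated for illustration:

```python
# Count reads with mapping quality zero from SAM-formatted text,
# e.g. the output of `samtools view sample.bam`.
# The example records below are made up for illustration.
sam_lines = [
    "read1\t0\tchr20\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr20\t200\t0\t50M\t*\t0\t0\tACGT\tIIII",   # MAPQ 0
    "read3\t16\tchr20\t300\t0\t50M\t*\t0\t0\tACGT\tIIII",  # MAPQ 0
]

def count_mq0(lines):
    """Count records whose MAPQ (5th SAM column) is zero."""
    return sum(1 for line in lines
               if not line.startswith("@")           # skip header lines
               and int(line.split("\t")[4]) == 0)    # MAPQ is column 5

print(count_mq0(sam_lines))  # → 2
```

This counts only on MAPQ, so it approximates the MappingQualityZeroFilter alone, which is exactly what the `-rf MappingQualityZero` suggestion above isolates.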

When I ran RealignerTargetCreator, I got the following in the INFO output: 8 418 115 reads were filtered out during the traversal out of approximately 28 536 526 total reads (29.50%). Does this mean that 29.5% of my reads are removed from the produced .bam file and only 20 122 411 reads will be used for the next steps (i.e. IndelRealigner, BaseRecalibrator, etc.)?

RealignerTargetCreator doesn't produce a bam file, it only produces a list of target intervals. So all the reads will still be available for the next step. But depending on what the filtering reason is, some reads may be filtered out again by IndelRealigner in the next step. If that happens, those reads will indeed not be written to the new realigned bam file.

Which filters are your reads failing? 29% is a lot, you may need to do some QC on your data.

Geraldine Van der Auwera, PhD

Here is the complete INFO output:
INFO 13:57:59,158 ProgressMeter - Total runtime 2422.06 secs, 40.37 min, 0.67 hours
INFO 13:57:59,158 MicroScheduler - 8418115 reads were filtered out during the traversal out of approximately 28536526 total reads (29.50%)
INFO 13:57:59,158 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 13:57:59,159 MicroScheduler - -> 599908 reads (2.10% of total) failing BadMateFilter
INFO 13:57:59,159 MicroScheduler - -> 4159068 reads (14.57% of total) failing DuplicateReadFilter
INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 13:57:59,159 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 13:57:59,160 MicroScheduler - -> 1041927 reads (3.65% of total) failing MappingQualityZeroFilter
INFO 13:57:59,160 MicroScheduler - -> 2617212 reads (9.17% of total) failing NotPrimaryAlignmentFilter
INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter
INFO 13:57:59,160 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 13:58:01,688 GATKRunReport - Uploaded run statistics report to AWS S3

Well, the 14% duplicates can't be helped at this point; the 9% not-primary alignments plus 3% mapping quality zero suggest the alignment isn't stellar...

Geraldine Van der Auwera, PhD

Could it be because my deep sequencing reads are from exome sequencing?

I can't help you further with this, sorry, as it's out of the scope of support we provide. It may even not be a very big deal (I do not have expertise in assessing alignment results), but if you have concerns you should discuss this with your sequence data provider. You can ask about standards of quality and recommended procedures for best alignment results.

Geraldine Van der Auwera, PhD

OK. I see. Thank you.

@Geraldine_VdAuwera said:
I can't help you further with this, sorry, as it's out of the scope of support we provide. [...]

Geraldine,

Maybe the following is out of the scope you provide, but I just want to add that I used bwa mem for mapping, and samtools idxstats shows me that 28 374 381 reads out of 28 398 097 (99.9%) are mapped to the genome.

28 374 381 reads out of 28 398 097 (99.9%) are mapped to the genome.

Sure, but based on the read filtering, 14% of those are duplicate reads (which are not informative) and 12% either have mapping quality zero (which means they're mapped but with zero confidence, so are also useless) or are aligned in more than one place (which is effectively equivalent to zero-confidence mapping). That adds up to a lot of data that is unusable even if it is mapped. And maybe that's fine (I am not the right person to say what the acceptable amount of "waste data" should be), but it is important to be aware of how much usable data you have, because someone is paying for the data production, and they should get their money's worth.

Geraldine Van der Auwera, PhD
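The back-of-the-envelope arithmetic behind that estimate, using the per-filter percentages from the RealignerTargetCreator log earlier in the thread (a sketch for illustration only):

```python
# Fractions of total reads failing each filter, taken from the log above
duplicates  = 0.1457  # DuplicateReadFilter
mq_zero     = 0.0365  # MappingQualityZeroFilter
not_primary = 0.0917  # NotPrimaryAlignmentFilter
bad_mate    = 0.0210  # BadMateFilter

# Duplicates, MQ0, and non-primary alignments carry little or no information
unusable = duplicates + mq_zero + not_primary
print(f"~{unusable:.0%} of reads are mapped but uninformative")  # → ~27%

# So even with 99.9% of reads nominally mapped, only about
# 1 - 0.295 = 70.5% pass all filters and contribute to the analysis.
```

The point is that "mapped" (from idxstats) and "usable" (after read filtering) can differ by a wide margin.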

OK.
Again, my next question is maybe out of scope, but when we run MarkDuplicates, a .metrics file is produced, and in that file I got a 0.17% duplication. Can you tell me what the difference is between this number and the 14% duplicate reads reported by RealignerTargetCreator?

Unfortunately I'm not familiar with the calculations that go into the MarkDuplicates metrics file. I can tell you that the GATK simply uses the flags added by MarkDuplicates; it doesn't do any additional evaluation of what might or might not be duplicates.

Geraldine Van der Auwera, PhD

• Posts: 543 · Member, Dev ✭✭✭✭

The value reported by Picard in the PERCENT_DUPLICATION is actually the fraction of duplicate reads, not a percentage. That is, a value of 0.17 in that column means 17%, not 0.17%. My suspicion is that you're calling RTC on a subset of the bam (using the -L parameter), which is why the numbers don't match exactly.

My experience with exomes is that your numbers, while certainly not good, are also not unusual. It takes a very determined effort on the part of the lab doing the library capture to keep that duplicate rate down, and a lot of labs don't seem to know that it's important and/or don't put the effort in. For reference, I start complaining at the lab here when the dupe rate gets much above 5%, but I've seen data from external labs where it was as high as 20 or 25%. All of these are assuming PE sequencing - with single reads, all assumptions go out the window.

My MQ0 numbers seem to be in the 1-2% range, so 3% doesn't raise any alarm bells for me. And I can't really speak to the NotPrimaryAlignment numbers - those are due to bwa mem splitting up the reads, and I don't use mem yet.
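The fraction-vs-percent reading of PERCENT_DUPLICATION can be checked directly against a MarkDuplicates metrics file. A minimal Python sketch; the layout follows the standard Picard metrics format, but the column subset and values here are invented for illustration:

```python
# Parse PERCENT_DUPLICATION from a Picard MarkDuplicates metrics file.
# The "##" header lines and tab-separated header/data rows follow the
# standard Picard metrics layout; these values are made up.
metrics = """\
## METRICS CLASS\tpicard.sam.DuplicationMetrics
LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION
lib1\t10000000\t1700000\t0.17
"""

lines = [l for l in metrics.splitlines() if l and not l.startswith("##")]
header, values = lines[0].split("\t"), lines[1].split("\t")
row = dict(zip(header, values))

frac = float(row["PERCENT_DUPLICATION"])
print(f"PERCENT_DUPLICATION = {frac} -> {frac:.0%} duplication")
# → PERCENT_DUPLICATION = 0.17 -> 17% duplication
```

So a metrics value of 0.17 is a fraction, i.e. 17% duplication, which is in the same ballpark as the 14% reported by the DuplicateReadFilter, not 0.17%.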

@pdexheimer said:
The value reported by Picard in the PERCENT_DUPLICATION is actually the fraction of duplicate reads, not a percentage. [...]

What I understand is that the duplication rate is high because of the wet-lab portion, resulting in some loss of data available for analysis. But will the variant calling be affected by a 14% duplication rate?

• Posts: 543 · Member, Dev ✭✭✭✭

Your guess is as good as mine. If the duplicated reads are random, then it shouldn't (provided you still have enough coverage after removing them). But if the high dupe rate is because of an underlying bias in one of the steps, who knows?

• Posts: 49 · Member

I have a slightly different question related to this topic:

When running the RealignerTargetCreator, I am getting a list of ReadFilters and the number of reads that would be filtered out for each of them, e.g.:

INFO  01:25:08,541 MicroScheduler - 515248653 reads were filtered out during the traversal out of approximately 695979089 total reads (74.03%)
INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  01:25:08,542 MicroScheduler -   -> 1439083 reads (0.21% of total) failing BadMateFilter
INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  01:25:08,542 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  01:25:08,543 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  01:25:08,543 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  01:25:08,543 MicroScheduler -   -> 505509320 reads (72.63% of total) failing MappingQualityZeroFilter
INFO  01:25:08,543 MicroScheduler -   -> 8300250 reads (1.19% of total) failing NotPrimaryAlignmentFilter
INFO  01:25:08,544 MicroScheduler -   -> 0 reads (0.00% of total) failing Platform454Filter
INFO  01:25:08,544 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter


I understand that in this step, no bam file is created.

However, the next step is IndelRealigner, and this one both uses the same input bam file again and produces an output file. Here, the information regarding the ReadFilters at this step is:

INFO  07:53:44,144 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 691695998 total reads (0.00%)
INFO  07:53:44,144 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter


Moreover, I observed that in the bam files generated by IndelRealigner, the reads that were criticized by RealignerTargetCreator (e.g. MappingQualityZero) are still there, which means they will be used for downstream analysis steps. Is this expected? Am I therefore supposed to filter those reads out myself before proceeding?

I am using GATK v3.1-1-g07a4bf8.

Many thanks for any clarifications you may have regarding this issue.

PS: I know this is an extreme example of the number of reads filtered out; this is because we are testing a new aligner.

@Sophia‌

Hi,

When GATK filters reads, it does not discard them but simply marks them as a filtered read. So, the reads that "were criticized" originally will still be present, but in any further downstream analysis, they will be marked as filtered and will not be used. There is no need to filter them out yourself.

The filter information is there to let you know if an alarming number of reads are of poor quality, in which case you may want to recheck your data; but since you know what the issue is (a new aligner), you will be fine to continue.

-Sheila
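The on-the-fly filtering described above works roughly like this: during traversal, each tool inspects the SAM flags and MAPQ of every read and skips the ones that fail, without altering the bam file. A rough Python approximation of a few of the filters seen in this thread, based on the standard SAM flag bits (a sketch of the idea, not GATK's actual code):

```python
# Standard SAM flag bits (from the SAM specification)
UNMAPPED    = 0x4
NOT_PRIMARY = 0x100
DUPLICATE   = 0x400

def passes_filters(flag, mapq):
    """Approximate a few GATK read filters. Reads failing any filter
    are skipped during traversal but remain untouched in the bam."""
    if flag & DUPLICATE:      # DuplicateReadFilter
        return False
    if flag & NOT_PRIMARY:    # NotPrimaryAlignmentFilter
        return False
    if flag & UNMAPPED:       # UnmappedReadFilter
        return False
    if mapq == 0:             # MappingQualityZeroFilter
        return False
    return True

# A duplicate-marked read (flag 1024) is skipped; a clean MAPQ-60 read is used
print(passes_filters(1024, 60), passes_filters(0, 60))  # → False True
```

Since the decision is recomputed from the flags each time, the same read can be "filtered" by one tool and still be present in that tool's output bam for the next tool to evaluate.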

• Posts: 49 · Member

thanks, @Sheila , that solves my issues.

• Posts: 17 · Member

Hello,

I am following the GATK Best Practices guidelines to call SNPs on whole chicken genomes. Currently, the indel realignment output says about 15% failed the MappingQualityZeroFilter, while BQSR says about 26% failed. I will eventually run CountReads (I'm currently having trouble with the cluster I run my jobs on) to verify the number of non-unique alignments. But from this thread it seems like even 15% means this is a rather poor alignment. I am wondering if there is anything I should do at the trimming or bwa step to reduce this error. Thanks!

• Angela
• Posts: 543 · Member, Dev ✭✭✭✭

Context is important. 15% is bad for a duplication rate, but it might not be so bad for MQ0. MQ0 is really driven by whether the reads are unique genome-wide, so it will differ between, for instance, an exome, which is strongly enriched for non-repetitive regions, and a whole genome, which is not.

• Posts: 17 · Member

Hi @pdexheimer, thank you for your response. CountReads ended up giving me 26%; would that be considered too high for whole genome? What is an acceptable percentage generally? And are there any resources I can refer to? Thanks!

• Posts: 543 · Member, Dev ✭✭✭✭

I have no idea. For back-of-the-envelope purposes, I see that ~85% of the human genome has been found to be uniquely mappable with 75bp reads, so assuming that you're using slightly longer reads and that they're paired, I would expect a smaller number. But there are probably other considerations than just the repeat content of the genome; it could be that 26% is reasonable.