If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Picard / GATK read counts varying between programs
I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd.
I have paired end reads, my protocol until now is (in brief, omitting options etc.)
bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.
I have 85767226 paired end = 171534452 sequences in fastQ file
BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.
Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.
so nearly 6 million reads seem to miss.
CreateTargets MicroScheduler reports
35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter
so nearly 5 million reads seem to miss
The Realigner MicroScheduler reports
0 reads were filtered out during traversal out of 171551640 total (0.00%)
which appears a miracle to me since
1) there are even more reads now than input sequences,
2) all those crappy reads reported by CreateTargets do not appear.
From Base recalibration MicroScheduler, I get
41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter
..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.
I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?
Thanks for any comments!