The frontline support team will be unavailable to answer questions until May27th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!
Picard / GATK read counts varying between programs
I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd.
I have paired end reads, my protocol until now is (in brief, omitting options etc.)
bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.
I have 85767226 paired end = 171534452 sequences in fastQ file
BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.
Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.
so nearly 6 million reads seem to miss.
CreateTargets MicroScheduler reports
35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter
so nearly 5 million reads seem to miss
The Realigner MicroScheduler reports
0 reads were filtered out during traversal out of 171551640 total (0.00%)
which appears a miracle to me since
1) there are even more reads now than input sequences,
2) all those crappy reads reported by CreateTargets do not appear.
From Base recalibration MicroScheduler, I get
41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter
..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.
I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?
Thanks for any comments!