It looks like you're new here. If you want to get involved, click one of these buttons!
stephlo
Posts: 1Member ✭
Hi all,
I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)
bwa aln R1.fastq bwa aln R2.fastq bwa sampe R1.sai R2.sai picard/CleanSam.jar picard/SortSam.jar picard/MarkDuplicates.jar picard/AddOrReplaceReadGroups.jar picard/BuildBamIndex.jar GATK -T RealignerTargetCreator -known dbsnp.vcf GATK -T IndelRealigner -known dbsnp.vcf GATK -T BaseRecalibrator -knownSites dbsnp.vcf GATK -T PrintReads
A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.
I have 85767226 paired end = 171534452 sequences in fastQ file
BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.
MarkDuplicates reports:
Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.
so nearly 6 million reads seem to miss.
CreateTargets MicroScheduler reports
35915555 reads were filtered out during traversal out of 166579875 total (21.56%) -> 428072 reads (0.26% of total) failing BadMateFilter -> 16077607 reads (9.65% of total) failing DuplicateReadFilter -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter
so nearly 5 million reads seem to miss
The Realigner MicroScheduler reports
0 reads were filtered out during traversal out of 171551640 total (0.00%)
which appears a miracle to me since 1) there are even more reads now than input sequences, 2) all those crappy reads reported by CreateTargets do not appear.
From Base recalibration MicroScheduler, I get
41397379 reads were filtered out during traversal out of 171703265 total (24.11%) -> 16010068 reads (9.32% of total) failing DuplicateReadFilter -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter
..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.
I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?
Thanks for any comments!
Geraldine_VdAuwera
Posts: 2,239 admin
Hi there,
I can't comment on the numbers you get with non-GATK tools, except to say that those numbers look reasonable.
Regarding the variation in general with GATK, some of what you see is because different tools apply different filters by default. For example, the second "miracle" of the realigner is because Realigner doesn't apply any filters (they're not relevant to what it does). Those crappy reads are there, but the realigner doesn't mind them, as opposed to the target creator, which can't handle them and therefore must filter them out before doing anything.
Regarding the specific point of why some tools report traversing more reads than are even supposed to be in the file, that can be due to some reads (especially the good, non-filtered ones) getting double-counted towards the total because they overlap interval boundaries, or if you're using multithreading. This is quite normal and isn't a cause for concern as to the operation of the tools. As long as the numbers are roughly similar everything should be okay.
I hope this clarifies the matter?
Answers
Dear Geraldine,
thanks a lot - you relieved my headaches :)
Best, Stephan
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •