
Picard / GATK read counts varying between programs

stephlo (Posts: 1, Member)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd.
I have paired-end reads; my protocol so far is (in brief, omitting options, etc.):

bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
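Spelled out, the pipeline above looks roughly like the following sketch. The file names, reference, and option values here are placeholders and typical defaults, not my exact command lines:

```shell
# Sketch of the pipeline; paths, read-group fields, and option values are placeholders.
bwa aln ref.fa R1.fastq > R1.sai
bwa aln ref.fa R2.fastq > R2.sai
bwa sampe ref.fa R1.sai R2.sai R1.fastq R2.fastq > aln.sam

java -jar picard/CleanSam.jar INPUT=aln.sam OUTPUT=clean.sam
java -jar picard/SortSam.jar INPUT=clean.sam OUTPUT=sorted.bam SORT_ORDER=coordinate
java -jar picard/MarkDuplicates.jar INPUT=sorted.bam OUTPUT=dedup.bam METRICS_FILE=dup.metrics
java -jar picard/AddOrReplaceReadGroups.jar INPUT=dedup.bam OUTPUT=rg.bam \
    RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1
java -jar picard/BuildBamIndex.jar INPUT=rg.bam

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I rg.bam \
    -known dbsnp.vcf -o targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I rg.bam \
    -targetIntervals targets.intervals -known dbsnp.vcf -o realigned.bam
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fa -I realigned.bam \
    -knownSites dbsnp.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fa -I realigned.bam \
    -BQSR recal.table -o recal.bam
```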

A closer look at the output of the above toolchain revealed changes in read counts that I did not quite understand.

I have 85767226 read pairs = 171534452 sequences in the FASTQ files.

BWA reports this number, and the cleaned SAM file has 171534452 alignments, as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to be missing.
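To make that concrete, the difference between the cleaned-SAM alignment count and the records MarkDuplicates reports reading works out as (plain shell arithmetic on the numbers quoted above):

```shell
# Alignments in the cleaned SAM vs. records read by MarkDuplicates
total=171534452     # alignments after CleanSam
markdup=165619516   # "Read ... records" from MarkDuplicates
echo $(( total - markdup ))   # prints 5914936, i.e. ~5.9 million unaccounted for
```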

The CreateTargets MicroScheduler reports:

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to be missing.

The Realigner MicroScheduler reports:

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which seems like a miracle to me, since
1) there are now even more reads than input sequences, and
2) none of the problematic reads reported by CreateTargets appear.

From the BaseRecalibrator MicroScheduler, I get:

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

So my reads have multiplied even further, yet, e.g., the duplicate reads reappear in roughly the same numbers.

I find these varying counts a little confusing -- can someone please give me a hint on the logic behind these numbers? And does the protocol look sensible?

Thanks for any comments!
