The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

Picard / GATK read counts varying between programs

stephlostephlo Posts: 1Member

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd.
I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to miss.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to miss

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me since
1) there are even more reads now than input sequences,
2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?

Thanks for any comments!

Best Answer

Answers

  • stephlostephlo Posts: 1Member

    Dear Geraldine,

    thanks a lot - you relieved my headaches :)

    Best,
    Stephan

Sign In or Register to comment.