Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Why do the GATK read counts not agree with samtools flagstat?

igorigor Posts: 31Member
edited July 2013 in Ask the GATK team

I just noticed something odd about GATK read counts. Using a tiny test data set, I generated a BAM file with marked duplicates.

This is the output for samtools flagstat:

40000 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 duplicates
38615 + 0 mapped (96.54%:-nan%)
40000 + 0 paired in sequencing
20000 + 0 read1
20000 + 0 read2
37764 + 0 properly paired (94.41%:-nan%)
38284 + 0 with itself and mate mapped
331 + 0 singletons (0.83%:-nan%)
76 + 0 with mate mapped to a different chr
54 + 0 with mate mapped to a different chr (mapQ>=5)

This is what I get as part of GATK info stats when running RealignerTargetCreator:

INFO  14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%) 
INFO  14:42:05,816 MicroScheduler -   -> 84 reads (0.03% of total) failing BadMateFilter 
INFO  14:42:05,816 MicroScheduler -   -> 1014 reads (0.37% of total) failing DuplicateReadFilter 
INFO  14:42:05,816 MicroScheduler -   -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter 

This is what I get as part of GATK info stats when running DepthOfCoverage (on the orignal BAM, not after realignment):

INFO  15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%) 
INFO  15:03:17,818 MicroScheduler -   -> 1205 reads (0.39% of total) failing DuplicateReadFilter 
INFO  15:03:17,818 MicroScheduler -   -> 1615 reads (0.52% of total) failing UnmappedReadFilter 

Why are all of these so different? Why are there much more total reads and duplicate reads for GATK stats?

Post edited by igor on

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    I can't speak for samtools' flagstat since it is not our tool. Regarding the differences you're seeing between the GATK tools, keep in mind that they apply different read filters by default -- what you're showing here has only DuplicateReadFilter in common. Also, RTC gets downsampled data by default whereas DoC does not. That might explain why RTC reports slightly fewer DuplicateReads than DoC.

    Geraldine Van der Auwera, PhD

  • igorigor Posts: 31Member

    @Geraldine_VdAuwera said: I can't speak for samtools' flagstat since it is not our tool. Regarding the differences you're seeing between the GATK tools, keep in mind that they apply different read filters by default -- what you're showing here has only DuplicateReadFilter in common. Also, RTC gets downsampled data by default whereas DoC does not. That might explain why RTC reports slightly fewer DuplicateReads than DoC.

    The samtools flagstat data agrees with the original FASTQ file, which had 40,000 (20,000 x 2) reads. How does GATK get 276,045 or 309,863 total reads? That is a lot more reads than I started with.

    Also, why is the total number different between RealignerTargetCreator and DepthOfCoverage? The total is supposed to be before any filtering, right?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    The totals are before filtering, but after downsampling.

    Geraldine Van der Auwera, PhD

  • igorigor Posts: 31Member

    @Geraldine_VdAuwera said: The totals are before filtering, but after downsampling.

    I understand how that would lead to fewer reads, but how does that result in more than 5 times the number of reads? If I input 40,000 reads, I should not get more than 40,000 reads regardless of any filtering.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    Fair point -- sorry I didn't pay attention to that part of your question. It could be that the totals are not being reported properly. Can you tell me what version you're using? And could you please upload a test file that replicates the error, so we can debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • igorigor Posts: 31Member
    edited July 2013

    @Geraldine_VdAuwera said: Fair point -- sorry I didn't pay attention to that part of your question. It could be that the totals are not being reported properly. Can you tell me what version you're using? And could you please upload a test file that replicates the error, so we can debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Actually, I think this issue was already resolved. I previously used GATK 2.4. With 2.6, the output is different. I should mention that the resulting output files (the most important part) are the same, just the read count stats are different.

    Using GATK 2.4:

    INFO  16:58:27,473 MicroScheduler - 4203 reads were filtered out during traversal out of 451350 total (0.93%) 
    INFO  16:58:27,473 MicroScheduler -   -> 1789 reads (0.40% of total) failing DuplicateReadFilter 
    INFO  16:58:27,473 MicroScheduler -   -> 2414 reads (0.53% of total) failing UnmappedReadFilter 
    

    Using GATK 2.6:

    INFO  16:32:10,111 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12882 total reads (0.00%) 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
    INFO  16:32:10,112 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
    

    The totals appear reasonable now.

    GATK is definitely outputting more filtering info now, so there was definitely some work done on this section. Can you confirm this? At this point, I just want to make sure there is nothing weird about my setup.

    Post edited by igor on
  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    I remember seeing some forum posts about inaccurate read counts sometime recently, and the release notes for 2.6 include

    The engine now produces much more accurate read counts for Read traversals.

    So I think you're right - it's been resolved

Sign In or Register to comment.