



Why do the GATK read counts not agree with samtools flagstat?

igor (New York; Posts: 54)
edited July 2013 in Ask the GATK team

I just noticed something odd about GATK read counts. Using a tiny test data set, I generated a BAM file with marked duplicates.

This is the output for samtools flagstat:

40000 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 duplicates
38615 + 0 mapped (96.54%:-nan%)
40000 + 0 paired in sequencing
20000 + 0 read1
20000 + 0 read2
37764 + 0 properly paired (94.41%:-nan%)
38284 + 0 with itself and mate mapped
331 + 0 singletons (0.83%:-nan%)
76 + 0 with mate mapped to a different chr
54 + 0 with mate mapped to a different chr (mapQ>=5)
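The flagstat numbers above can be cross-checked against each other. A quick consistency check in Python, with the counts hard-coded from the output above:

```python
# Cross-check the samtools flagstat counts (hard-coded from the output above).
total = 40000            # QC-passed reads
mapped = 38615
read1 = 20000
read2 = 20000
both_mapped = 38284      # "with itself and mate mapped"
singletons = 331

# read1 + read2 accounts for every read in this fully paired library
assert read1 + read2 == total

# every mapped read either has a mapped mate or is a singleton
assert both_mapped + singletons == mapped

# reproduce the mapped percentage that flagstat prints (96.54%)
assert abs(100 * mapped / total - 96.54) <= 0.005
print("flagstat counts are internally consistent")
```

So the 40,000-read total from flagstat is self-consistent; the question is where GATK's much larger totals come from.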

This is what I get as part of GATK info stats when running RealignerTargetCreator:

INFO  14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%) 
INFO  14:42:05,816 MicroScheduler -   -> 84 reads (0.03% of total) failing BadMateFilter 
INFO  14:42:05,816 MicroScheduler -   -> 1014 reads (0.37% of total) failing DuplicateReadFilter 
INFO  14:42:05,816 MicroScheduler -   -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter 

This is what I get as part of GATK info stats when running DepthOfCoverage (on the original BAM, not after realignment):

INFO  15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%) 
INFO  15:03:17,818 MicroScheduler -   -> 1205 reads (0.39% of total) failing DuplicateReadFilter 
INFO  15:03:17,818 MicroScheduler -   -> 1615 reads (0.52% of total) failing UnmappedReadFilter 

Why are all of these so different? Why are there so many more total reads and duplicate reads in the GATK stats?
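Note that the filter arithmetic inside each MicroScheduler summary does add up; it is only the reported totals that exceed the input read count. A small Python check, with the counts hard-coded from the two logs above:

```python
def check_summary(total, filtered_counts):
    """Sum the per-filter counts and reproduce the printed
    overall filtered percentage for one MicroScheduler summary."""
    filtered = sum(filtered_counts)
    pct = round(100 * filtered / total, 2)
    return filtered, pct

# RealignerTargetCreator summary: BadMate + Duplicate + MappingQualityZero
rtc_filtered, rtc_pct = check_summary(276045, [84, 1014, 4077])
assert rtc_filtered == 5175 and rtc_pct == 1.87

# DepthOfCoverage summary: Duplicate + Unmapped
doc_filtered, doc_pct = check_summary(309863, [1205, 1615])
assert doc_filtered == 2820 and doc_pct == 0.91
print("per-filter counts match the reported totals and percentages")
```

Both summaries are internally consistent, which points the anomaly at the "total" denominators themselves.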

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Posts: 11,743; admin)

    I can't speak for samtools' flagstat since it is not our tool. Regarding the differences you're seeing between the GATK tools, keep in mind that they apply different read filters by default -- what you're showing here has only DuplicateReadFilter in common. Also, RTC gets downsampled data by default whereas DoC does not. That might explain why RTC reports slightly fewer DuplicateReads than DoC.

    Geraldine Van der Auwera, PhD

  • igor (New York; Posts: 54)

    @Geraldine_VdAuwera said:
    I can't speak for samtools' flagstat since it is not our tool. Regarding the differences you're seeing between the GATK tools, keep in mind that they apply different read filters by default -- what you're showing here has only DuplicateReadFilter in common. Also, RTC gets downsampled data by default whereas DoC does not. That might explain why RTC reports slightly fewer DuplicateReads than DoC.

    The samtools flagstat data agrees with the original FASTQ file, which had 40,000 (20,000 x 2) reads. How does GATK get 276,045 or 309,863 total reads? That is a lot more reads than I started with.

    Also, why is the total number different between RealignerTargetCreator and DepthOfCoverage? The total is supposed to be before any filtering, right?

  • Geraldine_VdAuwera (Cambridge, MA; Posts: 11,743; admin)

    The totals are before filtering, but after downsampling.

    Geraldine Van der Auwera, PhD

  • igor (New York; Posts: 54)

    @Geraldine_VdAuwera said:
    The totals are before filtering, but after downsampling.

    I understand how that would lead to fewer reads, but how does that result in more than 5 times the number of reads? If I input 40,000 reads, I should not get more than 40,000 reads regardless of any filtering.

  • Geraldine_VdAuwera (Cambridge, MA; Posts: 11,743; admin)

    Fair point -- sorry I didn't pay attention to that part of your question. It could be that the totals are not being reported properly. Can you tell me what version you're using? And could you please upload a test file that replicates the error, so we can debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • igor (New York; Posts: 54)
    edited July 2013

    @Geraldine_VdAuwera said:
    Fair point -- sorry I didn't pay attention to that part of your question. It could be that the totals are not being reported properly. Can you tell me what version you're using? And could you please upload a test file that replicates the error, so we can debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Actually, I think this issue was already resolved. I previously used GATK 2.4. With 2.6, the output is different. I should mention that the resulting output files (the most important part) are the same, just the read count stats are different.

    Using GATK 2.4:

    INFO  16:58:27,473 MicroScheduler - 4203 reads were filtered out during traversal out of 451350 total (0.93%) 
    INFO  16:58:27,473 MicroScheduler -   -> 1789 reads (0.40% of total) failing DuplicateReadFilter 
    INFO  16:58:27,473 MicroScheduler -   -> 2414 reads (0.53% of total) failing UnmappedReadFilter 
    

    Using GATK 2.6:

    INFO  16:32:10,111 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12882 total reads (0.00%) 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    INFO  16:32:10,111 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
    INFO  16:32:10,112 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
    

    The totals appear reasonable now.

    GATK is clearly outputting more filtering info now, so some work was done on this section. Can you confirm this? At this point, I just want to make sure there is nothing weird about my setup.

  • pdexheimer (Posts: 544)

    I remember seeing some forum posts about inaccurate read counts sometime recently, and the release notes for 2.6 include

    The engine now produces much more accurate read counts for Read traversals.

    So I think you're right - it's been resolved
