How is the total number of reads calculated in UnifiedGenotyper?

xhzhao Member
edited August 2012 in Ask the GATK team

Dear GATK team,

UnifiedGenotyper's output includes a line reporting the total number of reads the tool processed. For example:
"INFO 01:11:47,861 TraversalEngine - 2710581 reads were filtered out during traversal out of 77522806 total (3.50%)"

Could you please explain how the number 77522806 is calculated? In our example, the input BAM file contains about 1,500,000 more reads than this, so some reads were apparently removed before this total was counted. It would be great if you could tell us which filters are applied to the input reads.

Thanks a lot for your time!

Sharon

Answers

  • ebanks (Broad Institute) Member, Broadie, Dev

    Hi Sharon,

    If you look at the logging output lines following the one you posted, they should enumerate exactly which filters were applied and how many reads were removed by each one.

  • xhzhao Member

    Dear Eric,

    Thanks a lot for the quick reply!

    I understand the lines following the one I posted explain how many reads were removed by which filter. My question is more related to the "total" number of reads in the first line. I will try to explain my question here:

    1. the BAM file that I input to GATK has 79108128 reads.

    2. When I use this bam file as input to GATK's UnifiedGenotyper, the logging information is as follows:

    INFO 01:11:47,861 TraversalEngine - 2710581 reads were filtered out during traversal out of 77522806 total (3.50%)
    INFO 01:11:47,861 TraversalEngine - -> 1877634 reads (2.42% of total) failing BadMateFilter
    INFO 01:11:47,861 TraversalEngine - -> 832947 reads (1.07% of total) failing UnmappedReadFilter

    I understand that the number of reads filtered out is 2710581 (= 1877634 + 832947). My question is about the total, 77522806. This number differs from the number of reads in the BAM file (79108128) that I used as input to GATK. Are other QC filters applied to the input BAM file before the two filters listed in the log? Any comments or suggestions are highly appreciated.

    Thanks a lot,

    Sharon
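The arithmetic in this post can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the variable names are made up for illustration and are not part of GATK.

```python
# Numbers copied from the post and its log lines.
bam_reads = 79108128        # reads in the input BAM, as reported by the poster
traversal_total = 77522806  # "total" in the TraversalEngine log line
bad_mate = 1877634          # reads failing BadMateFilter
unmapped = 832947           # reads failing UnmappedReadFilter

# Reads itemized in the log, and reads missing from the log's total.
itemized = bad_mate + unmapped
unaccounted = bam_reads - traversal_total


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


print("itemized filtered reads:", itemized)
print("reads not in the log total:", unaccounted)
print("itemized as % of log total:", pct(itemized, traversal_total))
```

The itemized sum matches the log, so the open question is only the gap between the BAM read count and the logged total.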

  • ebanks (Broad Institute) Member, Broadie, Dev

    What command line did you use to run the genotyper?

  • xhzhao Member

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I aln_realigned.bam -R refgenome.fa -o snp_indel.vcf -S SILENT -dt NONE -glm BOTH -A AlleleBalance -A DepthOfCoverage

  • Mark_DePristo (Broad Institute) Member, admin

    The standard read filters for LocusWalkers are:

    @ReadFilters({UnmappedReadFilter.class,NotPrimaryAlignmentFilter.class,DuplicateReadFilter.class,FailsVendorQualityCheckFilter.class})

    These remove unmapped reads, non-primary alignments, duplicates, and reads failing vendor QC. The engine applies these filters to all LocusWalkers, and they aren't itemized in the filtering summary. The reads they remove may account for the difference from your BAM read count. If not, and you can identify an exact location in your BAM file where quality reads are being dropped, we'd be happy to debug the situation. But I consider this issue closed now unless you specifically identify a bug.
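One way to test this explanation is to tally the SAM FLAG bits in the BAM file. The sketch below is not GATK code: the function name `tally_flags` is hypothetical, the mapping from each filter to a single FLAG bit is an assumption based on the filter names, and the order in which the real engine evaluates the filters is also assumed. Each read is counted once, under the first filter it would fail.

```python
from collections import Counter

# Standard SAM FLAG bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100  # "not primary alignment"
QC_FAIL = 0x200    # fails platform/vendor quality checks
DUPLICATE = 0x400

# Assumed filter-to-bit mapping, checked in the order of the annotation above.
CHECKS = [
    ("UnmappedReadFilter", UNMAPPED),
    ("NotPrimaryAlignmentFilter", SECONDARY),
    ("DuplicateReadFilter", DUPLICATE),
    ("FailsVendorQualityCheckFilter", QC_FAIL),
]


def tally_flags(flags):
    """Count reads by the first filter they would fail, or as 'passed'."""
    counts = Counter()
    for flag in flags:
        for name, bit in CHECKS:
            if flag & bit:
                counts[name] += 1
                break
        else:
            # No filter bit was set on this read.
            counts["passed"] += 1
    return counts


# Made-up FLAG values for illustration, not taken from the poster's BAM.
example = [99, 147, 4, 1024, 1123, 256, 512, 77]
print(tally_flags(example))
```

The FLAG values for a real file could be taken from the second column of `samtools view` output. Whether the resulting counts add up to the missing reads depends on the file and is not confirmed by this thread.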
