BaseCoverageDistribution

blueskypy Member ✭✭
edited June 2013 in Ask the GATK team

In the output grp file,

#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
       0  2859049   2932784
       1   856997    837791
       2   288587    276253
       3    95618     91703

what's the meaning of the three columns?

Thanks,

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I'll have the tool author (@Carneiro) confirm, but as I recall:

    1. The first column, Coverage, is the depth of coverage corresponding to the bin (i.e., the first line is the set of loci that are covered by 0 reads, the second is the set covered by 1 read, etc.);

    2. The second, Count, is the number of loci in the bin (without any filtering);

    3. The third, Filtered, is the number of loci in the bin after applying quality filtering to exclude bad reads.

  • blueskypy Member ✭✭
    edited June 2013

    Hi Geraldine,
    Thanks for the quick response! Two questions:

    1. If the coverage is 2, can I interpret it as the so-called 2x coverage?

    2. By your explanation, Count should be greater than or equal to Filtered; why is Count < Filtered at coverage 0?

  • blueskypy Member ✭✭

    Well, I found Count < Filtered at some other coverages as well. For example:
    23 7910 8119
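
To check every bin rather than spotting them by eye, a minimal sketch along these lines could list the rows where Filtered exceeds Count. It assumes the report holds a single three-column table laid out as in the question; the function names `parse_gatk_table` and `rows_where_filtered_exceeds_count` are hypothetical helpers, not part of GATK, and the sample text simply combines the rows posted in this thread.

```python
def parse_gatk_table(text):
    """Parse a single-table GATKReport into (coverage, count, filtered) tuples.

    Assumption: whitespace-separated integer columns, as in the posted output.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#:' metadata lines, and the column header.
        if not line or line.startswith("#") or line.startswith("Coverage"):
            continue
        coverage, count, filtered = (int(field) for field in line.split())
        rows.append((coverage, count, filtered))
    return rows


def rows_where_filtered_exceeds_count(rows):
    """Return the bins whose Filtered value is larger than their Count value."""
    return [row for row in rows if row[2] > row[1]]


# Sample text: the four rows from the question plus the coverage-23 row above.
REPORT = """\
#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
       0  2859049   2932784
       1   856997    837791
       2   288587    276253
       3    95618     91703
      23     7910      8119
"""

rows = parse_gatk_table(REPORT)
for coverage, count, filtered in rows_where_filtered_exceeds_count(rows):
    print(coverage, count, filtered)
```

For a real file, the contents of the .grp output could be read into a string and passed to `parse_gatk_table` in place of `REPORT`.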

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    1. The 2x-style expression is a convention to express the overall coverage of a dataset, so I'm not sure it's appropriate to use it in this context. If you do use it, make sure to communicate clearly what you mean, to avoid any unfortunate misunderstandings.

    2. Hmm, I may have misremembered. Based on the tech doc, it looks like the third column might be the count of filtered reads (not including good reads). I'll ask @Carneiro to confirm, but I reckon that makes sense. If so, there are an awful lot of low-quality reads in your data, at least based on the low-value bins you posted...

  • blueskypy Member ✭✭

    Thanks so much, Geraldine! Please confirm!

  • Hi Geraldine. Did you get a chance to confirm @blueskypy's second question, i.e., whether the Filtered column includes the good reads? I just ran this tool, and for most rows the values in the third column are higher.

  • Thanks for the clarification @Geraldine_VdAuwera.
