BaseCoverageDistribution

blueskypy Member ✭✭
edited June 2013 in Ask the GATK team

In the output grp file,

#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
       0  2859049   2932784
       1   856997    837791
       2   288587    276253
       3    95618     91703

what's the meaning of the three columns?

Thanks,

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I'll have the tool author (@Carneiro) confirm, but as I recall:

    1. The first column, Coverage, is the depth of coverage corresponding to the bin (i.e., the first line is the set of loci that are covered by 0 reads, the second is the set covered by 1 read, etc.);

    2. The second, Count, is the number of loci in the bin (without any filtering);

    3. The third, Filtered, is the number of loci in the bin after applying quality filtering to exclude bad reads.
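For anyone who wants to inspect these columns programmatically, here is a minimal sketch of reading the table into Python. It assumes the report contains a single table laid out as in the question: metadata lines starting with `#:`, then one whitespace-separated header row, then integer rows. The function name `parse_gatk_table` is hypothetical and not part of GATK.

```python
def parse_gatk_table(text):
    """Parse a single-table GATKReport into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#:" metadata lines at the top of the report.
        if not line or line.startswith("#:"):
            continue
        fields = line.split()
        if header is None:
            # The first non-metadata line holds the column names.
            header = fields
            continue
        rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


# The excerpt posted in the question above.
report = """\
#:GATKReport.v1.1:1
#:GATKTable:3:880:%s:%s:%s:;
#:GATKTable:BaseCoverageDistribution:A simplified GATK table report
Coverage  Count    Filtered
       0  2859049   2932784
       1   856997    837791
       2   288587    276253
       3    95618     91703
"""

rows = parse_gatk_table(report)
for row in rows:
    print(row)
```

Each row then carries the bin depth under `Coverage` and the two locus counts under `Count` and `Filtered`.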

  • blueskypy Member ✭✭
    edited June 2013

    hi, Geraldine,
    Thanks for the quick response! Two questions:

    1. If the coverage is 2, can I interpret it as the so-called 2x coverage?

    2. By your explanation, Count should be greater than or equal to Filtered; so why is Count < Filtered at coverage 0?

  • blueskypy Member ✭✭

    Well, I found Count < Filtered at some other coverages as well. For example:
    23 7910 8119

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    1. The 2x-style expression is a convention to express the overall coverage of a dataset, so I'm not sure it's appropriate to use it in this context. If you do use it, make sure to communicate clearly what you mean, to avoid any unfortunate misunderstandings.

    2. Hmm, I may have misremembered. Based on the tech doc it looks like it might be the count of filtered reads (not including good reads). I'll ask @Carneiro to confirm, but I reckon that makes sense. But if so, there are an awful lot of low-quality reads in your data, at least based on the low-value bins you posted...
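The inconsistency discussed here is easy to check mechanically: if Filtered were a subset of Count, no bin could have Filtered > Count. The sketch below runs that check on the five rows quoted in this thread. The helper name `bins_where_filtered_exceeds_count` is made up for illustration, and the check only tests the subset interpretation; it does not settle what the Filtered column actually means.

```python
# (Coverage, Count, Filtered) rows quoted in this thread.
rows = [
    (0, 2859049, 2932784),
    (1, 856997, 837791),
    (2, 288587, 276253),
    (3, 95618, 91703),
    (23, 7910, 8119),
]


def bins_where_filtered_exceeds_count(rows):
    """Return the coverage bins whose Filtered value is larger than Count."""
    return [coverage for coverage, count, filtered in rows if filtered > count]


exceeding = bins_where_filtered_exceeds_count(rows)
print(exceeding)  # → [0, 23]
```

Any non-empty result means the "Filtered is what remains of Count after filtering" reading cannot hold for those bins.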

  • blueskypy Member ✭✭

    Thanks so much, Geraldine! Please confirm!

  • Hi Geraldine. Did you get a chance to confirm @blueskypy's second question, i.e. whether the Filtered column includes the good reads? I just ran this tool, and for most rows the values in the third column are higher.

  • Thanks for the clarification @Geraldine_VdAuwera.