Using depth of coverage metrics for variant evaluation

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited July 2015 in Methods and Algorithms

Overview

This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.

The metrics involved are the following:

  • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
  • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.


Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

  • DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
  • Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.
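
To look at the two sample-level annotations side by side, the sample column of a VCF record can be split on the FORMAT keys. The following is a minimal Python sketch, not a GATK tool; the function name `sample_depths` and the example values are hypothetical. Since AD and the sample-level DP are counted differently, as described above, the sum of AD need not equal DP.

```python
def sample_depths(format_field, sample_field):
    """Return (AD, DP) for one sample column of a VCF record.

    AD is a list of per-allele counts (REF first); DP is an int.
    Either is None when the annotation is absent or missing (".").
    """
    # Pair each FORMAT key with the value at the same position.
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad_raw = fields.get("AD", ".")
    dp_raw = fields.get("DP", ".")
    ad_parts = ad_raw.split(",")
    ad = None if "." in ad_parts else [int(x) for x in ad_parts]
    dp = None if dp_raw == "." else int(dp_raw)
    return ad, dp


# Hypothetical sample column: the AD values sum to 21 while DP is 20.
ad, dp = sample_depths("GT:AD:DP:GQ:PL", "0/1:12,9:20:99:255,0,301")
print(ad, dp, sum(ad) - dp)
```

Returning None for a missing annotation lets downstream code skip records where AD or DP is absent instead of failing on them.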

Note that because the AD includes reads and bases that were filtered by the caller (and, in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
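
As an illustration of how the PLs, rather than the AD, drive the call, the sketch below picks the genotype with the lowest PL for a diploid, biallelic site and derives a GQ-like value as the gap to the second-lowest PL, capped at 99. The function name `call_from_pls`, the genotype ordering (0/0, 0/1, 1/1) and the example PL values are assumptions made for illustration; this is not the caller's actual code.

```python
def call_from_pls(pls, genotypes=("0/0", "0/1", "1/1")):
    """Pick the most likely genotype from phred-scaled likelihoods.

    Lower PL means more likely. Returns (genotype, gq), where gq is the
    difference between the two lowest PLs, capped at 99.
    """
    if len(pls) != len(genotypes):
        raise ValueError("expected one PL per genotype")
    best = min(range(len(pls)), key=lambda i: pls[i])
    ordered = sorted(pls)
    gq = min(ordered[1] - ordered[0], 99)
    return genotypes[best], gq


# Hypothetical PLs for a sample whose AD is skewed toward ALT:
# the heterozygous genotype still has the lowest PL.
print(call_from_pls([280, 0, 25]))
```

A skewed AD can therefore coexist with a heterozygous call; the GQ-like value indicates how much better the chosen genotype is than the next best one.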


TO BE CONTINUED...

Comments

  • I would like to make a feature request for a filtered depth of coverage of each allele per sample. I work with Plasmodium samples, which are typically a mixture of an unknown number of haploid strains in unknown proportions. I think I prefer to call these in haploid mode (ploidy 1), so the GT is then a reflection of the likely "majority call". However, I would also like to estimate the fractional proportions of each allele in each sample at each variant site. At present I am using the unfiltered allele depths contained in AD to do this. However, I'm thinking this would perhaps be more accurate if using the filtered depths, using the same filtering as applied when creating the sample-level DP. Would it be possible to include a new sample-level annotation (perhaps FAD?) that would give this filtered depth of coverage for each allele in each sample? For each sample the sum of FAD would be equal to the DP for that sample.

    Issue · Github (by Sheila): issue number 372, state closed, closed by chandrans.
  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Richard_Pearson
    Hi,

    We do have an annotation called StrandAlleleCountsBySample that does what you are asking. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandAlleleCountsBySample.php You will have to do the extra step of adding the counts of reads that support the alleles on the forward and reverse strands.

    -Sheila

  • Thanks Sheila, and apologies for my late response, I've only just seen your reply. I did see StrandAlleleCountsBySample, but assumed this was the same as DepthPerAlleleBySample in that it would return unfiltered counts, partly based on the fact that in the example given the values for SAC (1,0,3,15,4,8) add up to give the values in AD (1,18,12). Please could you confirm that StrandAlleleCountsBySample does indeed give filtered counts whereas DepthPerAlleleBySample gives unfiltered counts? If this is the case, it might be good to update the documentation for StrandAlleleCountsBySample to make this explicit.

    Many thanks! Richard

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Richard, recommending SAC was my idea but on second thought I may be wrong. I need to check a few things and get back to you.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Richard_Pearson
    Hi Richard,

    No, SAC does not give filtered counts. It is unfiltered like AD. Let me see if I can put this in as a feature request. Unfortunately, our developers are quite busy right now and won't be able to get to this very soon. However, we are very happy to look at a patch you submit :smile:
    http://gatkforums.broadinstitute.org/gatk/discussion/1267/how-can-i-submit-a-patch-to-the-gatk-codebase

    -Sheila

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Richard_Pearson
    Hi again Richard,

    It seems there may be some hope for this. One of the developers is working on something similar to what you are asking. I'm not sure when it will be available, however.

    -Sheila

  • Sounds positive, thanks both!

  • mbxat1, Nottingham (Member)

    Hi,

    Should I be worried about read length when calculating depth of coverage with GATK? I have samples of the same species, some sequenced with a read length of 100 bp and others with 150 bp. I am using GATK version 3.4, and the mean depth is lower than I expected for the samples sequenced with 150 bp reads.

    Thank you in anticipation of your response

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @mbxat1
    Hi,

    Can you give us some more details about the differences? How much of a difference is there? The major issue I can think of is that VQSR runs on the assumption that the sample annotations are all distributed in the same way. So, if your depths are different between the samples, that can cause some issues.

    -Sheila

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    @mbxat1, the length of reads is not taken into account by DepthOfCoverage. The tool simply looks at how many bases cover each position.

    If you're getting surprising results for the mean depth, you need to look at the distribution of coverage. Unevenness of coverage could affect your ability to call variants confidently. The tool produces a histogram file that can be useful in interpreting this.
  • isaac_joseph, SF Bay Area (Member)

    Greetings. I'm wondering why AD might be missing from the FORMAT field for a very small minority of variants (1 out of ≈15,000) after running HaplotypeCaller. Any insight? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @isaac_joseph
    Hi,

    This is a known issue. You can keep track of it here.

    -Sheila

  • tytolin (Member)

    Hello, GATK

    I'm interested in filtering by allele depth in a VCF file containing multiple samples.
    In theory, if a sample is heterozygous at a SNP site where the depth is 10, I would expect to see an allele depth of 5,5. However, in my multi-sample VCF, some samples are called heterozygous but have an allele depth of 2,10. Should I change those samples to homozygous alternate in the VCF file?
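
Two pieces of arithmetic come up in the comments above: summing the per-strand SAC counts to get per-allele counts, and turning per-allele counts into allele fractions. The following is a minimal Python sketch with hypothetical function names. It assumes SAC lists forward and reverse counts for each allele in turn, which is consistent with the example values quoted in the thread, and it works on unfiltered counts, since both AD and SAC are reported as unfiltered.

```python
def sac_to_allele_counts(sac):
    """Sum forward and reverse strand counts for each allele.

    Assumes SAC is ordered as (fwd, rev) pairs, one pair per allele.
    """
    if len(sac) % 2 != 0:
        raise ValueError("SAC must hold a forward and reverse count per allele")
    return [sac[i] + sac[i + 1] for i in range(0, len(sac), 2)]


def allele_fractions(counts):
    """Return each allele's share of the total count (None if no reads)."""
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]


# Values quoted in the thread: SAC (1,0,3,15,4,8) sums to AD (1,18,12).
counts = sac_to_allele_counts([1, 0, 3, 15, 4, 8])
print(counts)

# Allele balance for the 2,10 example from the last comment.
print(allele_fractions([2, 10]))
```

An allele fraction far from 0.5 does not by itself change a heterozygous genotype call, because the call is determined by the genotype likelihoods (PLs) and not by the AD.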
