Recommended/required sequencing depth for standard diploid, whole-genome, germline variant calling

Hello,

I was searching the GATK Best Practices for a recommended or required sequencing depth for standard whole-genome germline variant calling on a diploid species (humans, specifically). I know the general word of mouth is at least 30x, but can you clarify GATK's recommendations or requirements?

Specific questions:
1. I know this is not easy to pin down, but what is the approximate required depth for UnifiedGenotyper (GATK v3.x)? How about the recommended depth?
2. Similarly, what are the required and recommended depths for HaplotypeCaller?

I'm trying to determine the approximate coverage at which you would not trust a specific variant. I realize this is very context specific, but a general recommendation would be super helpful.

Answers

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @zugzug,

    When developing for germline WGS data, our developers typically use ~30x WGS data. This should be sufficient to call germline variants. This number represents an average depth across some percentage of the genome. Sequencing data from the Broad Genomics Platform also guarantees high library complexity, i.e. 30x coverage of independent observations that exclude duplicate evidence. To estimate library complexity, you can use MarkDuplicates.
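
    For example, a typical MarkDuplicates run looks something like this (file names here are placeholders, not from this thread):

    ```bash
    # Mark duplicates and write duplication metrics; the metrics file
    # includes PERCENT_DUPLICATION and ESTIMATED_LIBRARY_SIZE.
    java -jar picard.jar MarkDuplicates \
        I=sample.bam \
        O=sample.markdup.bam \
        M=sample.duplicate_metrics.txt
    ```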

    Our Best Practices now recommend you use HaplotypeCaller for short variant calling. Unless your data requires a pileup caller, please upgrade from UnifiedGenotyper to HaplotypeCaller.
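
    For reference, a minimal GATK4 HaplotypeCaller invocation looks like this (paths are placeholders; add -ERC GVCF if you plan to joint-call a cohort downstream):

    ```bash
    # Call germline short variants (SNPs and indels) on a single sample.
    gatk HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        -O sample.vcf.gz
    ```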

    I hope this is helpful.

  • zugzug Member ✭✭

    Thank you for your response.

    Suppose we're looking at position X in the genome and we want to classify (True/False) whether there was enough coverage to detect a heterozygous point mutation. What threshold would that be? i.e., are you saying that regions with <30x coverage are ignored? I suspect that's not the case.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @zugzug,

    @zugzug said:
    are you saying that regions with <30x coverage are ignored? I suspect that's not the case.

    Not at all! I'm saying you should get good breadth of coverage for calls across genomic regions. There is variation in coverage depth across the reference, and given ~30x average coverage, even regions with lower depth should still have enough reads for calling. There are a number of Picard tools for QC evaluation. For WGS data, these include CollectWgsMetrics and EstimateLibraryComplexity.

    Come to think of it, I believe EstimateLibraryComplexity is a standalone tool now. You can find documentation at https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php
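
    Something along these lines should work (file names are placeholders):

    ```bash
    # Estimate library complexity directly from the read sequences.
    java -jar picard.jar EstimateLibraryComplexity \
        I=sample.bam \
        O=sample.library_complexity_metrics.txt
    ```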

  • zugzug Member ✭✭
    edited September 2018

    Perhaps the simplest way to phrase my question is: at what minimum coverage would you generally trust that HaplotypeCaller can call a heterozygous point mutation (at a given position)?

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    @zugzug, that depends on your base qualities and whether you are considering just SNPs or also indels. As you know, HaplotypeCaller takes base qualities into account in its calculations, and when reassembling reads for a region, only reads considered informative count towards scoring and genotyping. I've been told that even two informative reads with good base quality at the variant site, e.g. Q30, are sufficient to call a heterozygous variant. If you need absolute cutoffs, the math behind our tools is given in whitepapers here.
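
    For intuition, here is a back-of-the-envelope likelihood comparison for two alt-supporting reads at Q30. This is a simplified pileup model, not HaplotypeCaller's actual PairHMM over reassembled haplotypes, so treat it as a sketch:

    ```latex
    % Per-base error rate at Q30: \epsilon = 10^{-30/10} = 10^{-3}.
    % Two reads, both showing the alternate allele:
    P(D \mid \text{hom-ref}) = \left(\frac{\epsilon}{3}\right)^{2} \approx 1.1 \times 10^{-7}
    P(D \mid \text{het}) \approx \left(\frac{1}{2}\right)^{2} = 0.25
    % The likelihood ratio is roughly 2 \times 10^{6} in favor of het,
    % before genotype priors are applied.
    ```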

  • zugzug Member ✭✭
    edited September 2018

    Thank you @shlee. I know this gets super complicated since HC is HMM-based, especially if you consider joint calling, GC-rich regions, etc., but it would be nice if GATK performed a study to empirically estimate lower limits.

    e.g., if I'm curious about mutations in a region with 15x coverage, can I trust there are zero variants in the region (assuming no calls), or should I assume we didn't have enough coverage? Know what I mean? I'm working on a project where I need to set a general threshold for "if coverage < X, assume we didn't have enough coverage". I was originally going with a very strict threshold where coverage < 5 is considered insufficient, but I suspect the threshold should actually be much higher (e.g., 15).
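
    As a rough sanity check (idealized binomial sampling, ignoring mapping and base-quality issues): at 15x over a true heterozygous site, the chance of sampling fewer than 2 alt reads is tiny, so in principle raw coverage is rarely the limiting factor there:

    ```latex
    % X = number of alt-supporting reads at a true het site at depth 15:
    % X \sim \mathrm{Binomial}(15, 0.5)
    P(X \le 1) = \frac{\binom{15}{0} + \binom{15}{1}}{2^{15}} = \frac{16}{32768} \approx 4.9 \times 10^{-4}
    \Rightarrow \quad P(X \ge 2) \approx 0.9995
    ```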

    Any opinion on that?

    Really appreciate your help.

  • zugzug Member ✭✭
    Accepted Answer

    @shlee, thank you. The poster is exactly what I was looking for. I'm surprised that there is only ~50% sensitivity with ~25x coverage for mutations where AF = 0.4, and only ~75% with ~50x coverage. I'd say mean coverage of 30x is generally insufficient.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Glad to hear the poster answers your question. I've asked the developers if they'd like to elaborate on Figure 3.

  • fleharty Member, Broadie, Dev ✭✭

    @zugzug said:
    @shlee, thank you. The poster is exactly what I was looking for. I'm surprised that there is only ~50% sensitivity with ~25x coverage for mutations where AF = 0.4, and only ~75% with ~50x coverage. I'd say mean coverage of 30x is generally insufficient.

    Zugzug,

    You make a very good point here. Sensitivity at AF = 0.4 and depth = 25 should be much higher than 50%. With base qualities typical of Illumina sequencing, we should see sensitivities above 90%. I don't have a good explanation offhand for why the sensitivities are so low in that regime, though I suspect it has something to do with how the 20 HapMap samples were pooled together.

    You may consider using CollectWgsMetrics. There is a metric called THEORETICAL_SENSITIVITY. It can be used to calculate sensitivities across a spectrum of allele fractions for a particular BAM file. By default, HET_SNP_SENSITIVITY is computed for AF = 0.5. Note that this tool only considers the depth and base quality distributions; it does not take GC content or mappability into account.
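
    Roughly like this (file names are placeholders, and argument names may differ slightly between Picard versions, so check the tool's help output):

    ```bash
    # Collect WGS metrics plus theoretical sensitivity at chosen allele
    # fractions; ALLELE_FRACTION can be given multiple times.
    java -jar picard.jar CollectWgsMetrics \
        I=sample.bam \
        O=sample.wgs_metrics.txt \
        R=reference.fasta \
        INCLUDE_BQ_HISTOGRAM=true \
        THEORETICAL_SENSITIVITY_OUTPUT=sample.theoretical_sensitivity.txt \
        ALLELE_FRACTION=0.05 \
        ALLELE_FRACTION=0.1 \
        ALLELE_FRACTION=0.4
    ```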

    Thanks,

    Mark

  • igor New York Member ✭✭

    Thank you for sharing that poster. Those are very interesting findings.

    I had another question about Figure 3. Do you know why the sensitivity starts dropping substantially between 150x and 200x coverage? That does not seem like very high coverage, especially for AF = 0.05.

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Hi @igor,

    The lower-depth sites could very possibly have lower sensitivity because of low mapping quality, just like the high-depth sites. This kind of behavior would not be expected in exomes, but could very well happen in whole genomes because of hard-to-map regions. These variants make up a small percentage of all the called sites (<5%), as indicated by the error bars. More on Figure 3 is explained above.

    I hope this helps.

    Regards
    Bhanu

  • igor New York Member ✭✭

    Do you know what the mean coverage for these samples was?

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    I am not sure but I can find out and get back to you.
