
GATK 3.6 MIN_DP much lower than DP

I'm analysing a set of bacterial isolates, some of which are (almost) identical to the reference, and some of which are very different. Despite the fact that the identical isolates have good coverage (80x), I end up filtering out a lot of the SNPs for the identical isolates due to lack of depth (cutoff of 10). I was wondering if this is due to the way the g.vcf files are used.

Below is a typical part of the g.vcf file for one of the identical isolates

#CHROM  POS ID  REF ALT         QUAL FILTER INFO        FORMAT              sample1

REF     1   .   A   <NON_REF>   .   .       END=296981  GT:DP:GQ:MIN_DP:PL  0:82:99:8:0,252

The average depth (DP) is 82, but the lowest depth (MIN_DP) in that 300 kb region is 8. If any of the other samples in the same analysis have a SNP in this region, what will the DP be for sample1 for that SNP? Will it be 82 or 8?

If it is 8, every SNP in that region for sample1 will be hard-filtered, even though the actual coverage in that region (and most likely at that SNP) is a lot higher. How can I prevent discarding all that data for samples that are highly similar to the reference used?

I'm using 3.6-44-ge7d1cd2


Answers

  • shlee (Cambridge; Member, Broadie)

    Hi @Redmar_van_den_Berg,

    Great to hear you are using GATK towards bacterial research.

    The GVCF record you show indicates high confidence that there is no variation at that site for sample1, and the 82x coverage is for sample1; the DP in a sample column applies exclusively to that sample. If you then run GenotypeGVCFs on the GVCFs from your many samples, you get a cohort callset. In this case, the INFO field will contain additional annotations with similar metrics but for the cohort, e.g. AF and DP.
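
    For reference, joint genotyping across your samples would look something along these lines (the file names here are only placeholders for your own data):

        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            --variant sample1.g.vcf \
            --variant sample2.g.vcf \
            -o cohort.vcf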

    It's my understanding that these metrics are for the site in question (CHROM, POS), so they apply to a narrow range. Reassembly and pairHMM take in read information, so supporting evidence from a wider genomic range factors in indirectly.

    If any of the other samples in the same analysis have a SNP in this region, what will the DP be for sample1 for that SNP? Will it be 82 or 8?

    So the sample-level DP should not change for the REF allele. Any counts for alternate alleles that surface for the site after cohort-level analysis will apply only to samples that are genotyped with that allele. The 82-or-8 distinction should show up in the INFO field, e.g. for REF=A ALT=X, DP would be 82,8.

    Historically, because population genomics has focused on discovering commonly variant germline sites, filtering has also focused on site-level filtering. These days, more folks are getting interested in allele-specific annotations and filtering. GATK has options for allele-specific analyses; these are prefixed with AS_, for allele-specific. So I should think you can use a depth cutoff that is allele-specific in filtering your callset.
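
    If I remember correctly, in GATK 3.x you request those annotations as an extra annotation group when generating the GVCFs, roughly like so (please double-check the exact group name against the 3.6 documentation):

        java -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            -ERC GVCF \
            -G Standard -G AS_Standard \
            -o sample1.g.vcf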

    I may have digressed a bit. I hope this is helpful. Let me know if it is otherwise.

  • Hi @shlee, thank you for your response. I've looked more closely at our data based on your response, and I think I've found the issue.

    We filter on sample-level DP for our analysis, meaning that if a sample has a depth < 10 for a variant, we set the FT annotation for that sample. Because of this, big HOM_REF blocks in the g.vcf file for a sample are an issue.
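
    For reference, the kind of per-sample filter I mean can be applied with GATK's genotype-level filters, something along these lines (a sketch; our actual pipeline may differ slightly):

        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R reference.fasta \
            -V cohort.vcf \
            --genotypeFilterExpression "DP < 10" \
            --genotypeFilterName "G_dp10" \
            -o cohort.filtered.vcf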

    From the g.vcf file for sample1, this HOM_REF block spans basically the entire 300kb contig

    GT:DP:GQ:MIN_DP:PL
    0:82:99:8:0,252
    

    From the final vcf file, the entry for sample1

    GT:AD:DP:GQ:PL
    0:8,0:8:99:0,252
    

    repeated hundreds of times. The reason is that other samples have variants in that contig, but sample1 only has one entry in its g.vcf file. So for every variant on that contig, GATK assigns the MIN_DP from the g.vcf file as the DP for sample1. However, we hard-filter our variants on a DP of at least 10, so sample1 will be filtered for every variant in that contig. In the case above, we filter hundreds of variants for sample1 based on the MIN_DP, even though the actual DP at most of these positions is much higher.

    So the question is: is there a way to work around this, or is it just a mistake to hard filter on the DP of each sample individually?

    Issue · GitHub #2663 (closed), filed by shlee, closed by vdauwera
  • shlee (Cambridge; Member, Broadie)

    @Redmar_van_den_Berg,

    What you describe for sample1 sounds like a bug. There should not be such high confidence in genotypes across regions with comparatively low coverage, and with alternate alleles present, that the entire contig constitutes a single high-GQ GVCF block, which then somehow results in low-coverage ref calls at numerous loci for sample1. Would you mind submitting a bug report following the instructions in https://software.broadinstitute.org/gatk/guide/article?id=1894?


    In the meantime, here are some other thoughts that come to mind. I'd like to consult with the rest of the team as well, but they are currently away at a workshop, so please take what I say now tentatively.

    HaplotypeCaller can emit at base-pair resolution with -ERC BP_RESOLUTION, at the expense of storage. Perhaps for sample1, a base-pair resolution GVCF will give a more sane result at the cohort-VCF level.
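
    That would look roughly like this (reference, BAM and output names are placeholders; -ploidy 1 to match your haploid samples):

        java -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            -ploidy 1 \
            -ERC BP_RESOLUTION \
            -o sample1.bp.g.vcf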

    Block sizes can be tweaked with the --GVCFGQBands parameter, which is based on GQ scores. However, given that sample1's entire contig has high GQ despite low coverage in certain regions, the latter won't help unless the analysis is tweaked to recalibrate the distribution of that score.
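
    For completeness, custom GQ bands are specified by repeating the argument, e.g. something like the following (the band values here are arbitrary examples), though as noted above I don't expect this to help in your particular case:

        java -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            -ploidy 1 \
            -ERC GVCF \
            --GVCFGQBands 10 --GVCFGQBands 20 --GVCFGQBands 60 \
            -o sample1.g.vcf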

    What I'm hearing is that you want to differentiate calls for a site with low coverage. You would prefer this call to be inconclusive, i.e. ./., for the locus and sample with low coverage, but not for other samples and not at the expense of other loci with >10 coverage. This sounds like an issue that could be handled upfront, before costly HaplotypeCaller reassembly.

    For example, you could calculate depth of coverage and create an intervals list using coverage information to include only regions with coverage > 10. Then use this intervals list with -L with HaplotypeCaller. Any regions with < 10 coverage would be excluded per sample from calling and result in a ./. call.
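
    Here's a rough sketch of one way to build such a list from per-base depth, using samtools and awk (file names and the threshold are placeholders, and bedtools could do the merging step just as well):

        # per-base depth for sample1: chrom, pos, depth on each line
        samtools depth -a sample1.bam > sample1.depth.txt

        # keep positions with depth >= 10 and merge runs of adjacent positions into intervals
        awk '$3 >= 10' sample1.depth.txt \
          | awk '{
              if ($1 == chrom && $2 == end + 1) { end = $2 }
              else {
                if (chrom != "") print chrom ":" start "-" end
                chrom = $1; start = $2; end = $2
              }
            } END { if (chrom != "") print chrom ":" start "-" end }' \
          > sample1.intervals.list

        # restrict calling for sample1 to those intervals
        java -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            -ploidy 1 \
            -ERC GVCF \
            -L sample1.intervals.list \
            -o sample1.g.vcf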

    Currently you are using the ERC GVCF mode of HaplotypeCaller and producing the final VCF with GenotypeGVCFs. However, another option for you (until we sort the seeming bug out) is to joint call directly on your cohort samples using HaplotypeCaller. That is, supply HaplotypeCaller with each of your sample BAMs at once and do not use an ERC mode.
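
    That would look roughly like this (all BAMs supplied together, with no -ERC argument, so HaplotypeCaller emits a joint VCF directly):

        java -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            -I sample2.bam \
            -I sample3.bam \
            -ploidy 1 \
            -o cohort.vcf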


    Again, could you please submit a bug report so we can reproduce the odd behavior on our end? This kind of use case (nonhuman, e.g. ploidy 1) is of interest to us, and I'd like to make sure we understand the bug you are describing correctly by examining the data directly. Thanks.

  • @shlee

    Thank you for your reply, I've uploaded the requested files to the ftp site (gvcf-bug.tar.gz).

    I have re-run my analysis with the -ERC BP_RESOLUTION setting, and this does resolve the issue of filtering out a lot of HOM_REF positions. However, in the process, a lot more SNPs are identified in all samples, which also make it through our sample-level hard filtering (DP < 10, fraction of reads that agree with the call < 0.9). Some samples go from 412 to 485 SNPs, but others go from 17 to 1308, which is a big difference. Is this the expected behavior of using BP_RESOLUTION?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @Redmar_van_den_Berg, the behavior you observe with the MIN_DP and DP business is the inevitable tradeoff in how the hom-ref blocks are collapsed. Given that, yes, it is problematic to filter samples individually by DP (which we never recommend doing); the better filter criterion is GQ, since it is a measure of confidence that is better calibrated.

    It is somewhat expected that you would get more false positive calls with BP_RESOLUTION; I'd recommend looking at the distribution of GQs, as my instinct is that you'll be able to filter them out on that basis.
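
    One quick way to look at the GQ distribution is to dump the genotype fields to a table, for instance with something along these lines, and then histogram the GQ column:

        java -jar GenomeAnalysisTK.jar \
            -T VariantsToTable \
            -R reference.fasta \
            -V cohort.vcf \
            -F CHROM -F POS \
            -GF GT -GF DP -GF GQ \
            -o cohort.gq_table.txt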

    For the record (@shlee), I don't find it surprising to have a high-confidence call from 8 reads in a haploid organism. If it were diploid data it would indeed be cause for concern, because of the greater potential for wobble in the allele ratio. But in a haploid it's much easier to be confident at lower coverage.

  • @Geraldine_VdAuwera Thanks for your reply.

    We are working on a haploid organism, so we cannot do BQSR. Would you still suggest we use per-sample GQ as the filter criterion instead of DP? I've taken a quick peek at a VCF file, and it looks like the GQ is already maxed out with just 3 reads, which is surprising to me.

    GT:AD:DP:GQ:PL:FT
    1:0,1:1:45:45,0:G_dp10
    1:0,2:2:90:90,0:G_dp10
    0:3,0:3:99:0,111:G_dp10
    1:0,3:3:99:225,0:G_dp10
    
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I assume you mean you cannot do VQSR for that reason.

    I would say GQ is still the best annotation for per-sample filtering on the reliability of the call, at least as a primary filter. Keep in mind that GQ captures not just the number of reads, but also the quality of the base calls.

    See how the two last calls in your example have the exact same ADs, but the PL of the second most likely genotype is different, indicating that the qualities of the supporting bases (which go into calculating the PLs) are different -- lower on average in the ref call. The PLs are both above 99 (which is where we cap the GQ value), so that difference is not reflected in the GQs, but if there were a real problem at the level of the base calls, you would see the GQ drop.

    You can also see how in the other two calls the ADs are mixed, which in a haploid case is a big red flag (at least one of the alleles observed is an artifact), and the GQs are correspondingly lower. Take the first call, where based on the ADs alone it's basically a coin toss; the GQ is not good, though also not as terrible as you'd expect for a coin toss situation -- but there is probably a better quality for the base that is showing the ALT allele than for the one showing the ref allele.

    Now, that being said, those are awfully low coverages, and it wouldn't be unreasonable to be more stringent on the GQ filtering when DP is below some threshold.
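
    For example, a combined genotype-level filter expression along these lines would apply a stricter GQ cutoff only where coverage is low (the thresholds here are purely illustrative; you'd want to pick values based on your own GQ distribution):

        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R reference.fasta \
            -V cohort.vcf \
            --genotypeFilterExpression "DP < 10 && GQ < 90" \
            --genotypeFilterName "lowDP_lowGQ" \
            -o cohort.filtered.vcf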

  • We don't do VQSR or BQSR, since both require a set of known variants, as far as I know. Since we focus on real-time food surveillance, we see a lot of different serotypes of every bacterial species, so we sadly cannot bootstrap our own set.

    I'm confused about what you mean regarding the mixed ADs; the data I pasted on 5 December doesn't show any mixed alleles, does it?

    We do filter on mixed ADs, since even when the AD ratio is close to 50%, GQ is still maxed. My interpretation is that GATK still sees more evidence for the alt than the ref, so it thinks the alt is much more likely to be correct.
    In my experience, regions that contain mixed ADs tend to be mapping artifacts, either duplications relative to the reference, or things like bacteriophages that are included in the reference while our samples carry a distantly related phage. If you look at those regions in IGV, they are a complete mess.

    GT:AD:DP:GQ:PL:FT
    1:45,140:185:99:2867,0:G_readfrac0.9
    1:21,46:67:99:902,0:G_readfrac0.9
    1:25,29:54:99:270,0:G_readfrac0.9
    

    Issue · GitHub #2778 (open), filed by Sheila
  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Redmar_van_den_Berg
    Hi,

    I will ask Geraldine to get back to you.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @Redmar_van_den_Berg Well, you could still bootstrap a set of known variants for BQSR; it's not as finicky as VQSR. But OK, I get what you mean.

    I'm also confused by what I wrote -- my eyes must have played a mean trick on me because now I'm not seeing any mixed ADs. Very sorry about that.

    I agree with your interpretation of the actual mixed sites you're showing. Ultimately GATK isn't "aware" of the idiosyncrasies of bacterial genomes, so you end up having to compensate -- but it sounds like you're doing all the right things at this point. There may be some prospects in the medium term for tailoring the calling/filtering a bit more to cater to this sort of issue. You might see some chatter about deep learning; that's one of the approaches that could potentially help here.
