Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Odd distribution of Coverage for GATK HaplotypeCaller Variants

aeonsimaeonsim Member ✭✭✭
edited June 2014 in Ask the GATK team

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off.

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour?
If so why does it occur?
If not any idea what is going on here, is it a bug in the variant caller or the depth statistics?
Do you see the same thing in other datasets?

Thanks!

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @aeonsim‌

    Hi,

    Does the graph look the same with the filtered variants?

    -Sheila

  • aeonsimaeonsim Member ✭✭✭

    @Sheila‌ This is for a non-human species so VQSR isn't that reliable, as we lack the large sets of proven variants for the model.

    Secondly I'm not certain filtering would make much difference? If we consider the variant sites to be a random sampling from the genome their distribution should closely reflect that of the underlying distribution of coverage. We would expect a slight bias towards low coverage sites (~ < 6x) where the low coverage allows errors to be mistaken as variants. Instead we have a bias towards variants at approximately 8x less than the mean depth.

    If filtering did help fix this then it would seem like there would be a significant problem with the HaplotypeCaller, if it's modelling of variants results in an excess of false positives at reasonably high coverage (16x, or 20x for the GVCF graphs).

    Also we do not see this bimodal distribution when using other variant callers (Platypus for example) where the distribution of coverage at variant sites is uni-modal and closely follows the model I described above (see graph of coverage for gatk, platy & bedtools).

    Finally we are also seeing this bi-model distribution in the gVCF files (again with the second peak at ~8x less than mean coverage) which suggests to me that something odd is going on. As, as I understand it the genomeVCFs should be giving a near complete sampling of the whole genome (in small windows) and thus I'd certainly expect it to show the same distribution as the genome depth of coverage (see NL & BE graph).

    Any thoughts would be appreciated, if you really do think filtering would change this we can have another go at using VQSR, but it seems to me it would make more sense to quickly check this out in a HaplotypeCaller VCF for another species (only takes a few minutes) which we don't have and thus work out if this problem is unique to us, or a property of the current HC.

  • pdexheimerpdexheimer Member ✭✭✭✭

    Where are these bimodal depths coming from? Are they from the output of HaplotypeCaller (the gVCF file), or the output of GenotypeGVCFs (the VCF file)? Note in particular that the depths listed in a gVCF are (by default) aggregated over contiguous regions, so I would expect those to look different.

    Also, depending on which DP annotation you're looking at, it might be a filtered depth - so this could be indicative of something as simple as a low-quality swath of reads on one of the flowcells

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @aeonsim,

    VQSR might help "fix" this if it's a matter of false positives due to some kind of internal bias in the HC, but I'm not convinced that's the case. Let's see if we can track down the origin of this bimodal effect.

    So, here are a few follow-up questions:

    1. To clarify what the first plot shows, do I understand correctly that Platy NR and GATK var DP are the coverage distributions of variant sites, while Bedtools cov is the coverage of all sites in the genome? Or is that also derived from a callset? If it was over the entire genome I'd expect the area under the curve to be much larger, or did you apply some normalization?

    2. Can you tell me what are the units of the axes?

    3. In the GVCF plot, are those banded GVCFs (-ERC GVCF) or BP_RESOLUTION?

    4. The GATK engine applies a bunch of filters + downsampling to the data before it even gets to the HC. Have you looked at what the coverage over the genome looks like if you apply the same filters and downsampling?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @pdexheimer‌, I find it very reassuring when we seem to be thinking along the same lines :)

  • aeonsimaeonsim Member ✭✭✭
    edited June 2014

    @pdexheimer‌ @Geraldine_VdAuwera‌

    Ok I'll go through exactly what we're doing to get these plots:

    First for all GATK VCF files (GATK HC gVCF, VCF) we are using the DP field (or NR for Platy) from the per sample information columns (ie the GT:AD:DP:PL). When using bedtools we are using the genomecov tool.

    For each of the Individuals (3 different ones in the various graphs) we have sequenced a single Paired end library (PCR-Free prep) on Illumina HiSeq's to a depth of ~24x (2 lanes worth of sequence, different times, flowcells and a random mix of 2 HiSeqs). BQSR was run on each lane separately (GATK 2.7.4) and the resulting bams were processed using GATK 3.1-1 HaplotypeCaller to generate a single banded gVCF for the combined coverage (-ERC GVCF). The GVCFs were then used as part of a population of ~150 individuals with GenotypeGVCFs to generate a population VCF file.

    In the first graph in my second post (one with 3 lines, labelled Bedtools cov, Platy NR & GATK var DP ) one individual (NL288) was selected and the DP score for every site in the GATK HC population VCF (0/0,0/1,1/1,1/2 ...) was taken and the DP was recorded into one of 52 bins (0-51+), the % was then calculated as the number of sites in each bin over the total number of sites in all bins.
    This was then done for a Platypus population VCF containing the same samples but using the NR value (Number of Reads at site) as they do not report DP.

    Finally bedtools genomecov was run against the samples BAM file (Combined 2 lanes) and the frequency for the coverage range 0-51 (% of times each sites in the complete genome at X coverage) was used.

    These three were then plotted together Y-axis is % of Genome/Sites and X-axis is the coverage bin, it's a histogram plotted as a line graph to make it easier to read, the same file as a histogram is attached if that makes it easier along with a labelled version of the plot.

    The second plot (the GVCF one) that was attached to the second post showed the same thing. However instead of using the normal VCF file the gVCF files for two different individuals (BE11 & NL11) were used. Again each had two lanes of sequence (~24x total) and they are from completely different flow cells from the sample in the first graph (NL288) and were processed in exactly the same manner as the first sample same pipeline same software, but with a 3-4 month difference in sequencing time (NL288 was sequenced December 2013, the other two April/May 2014).

    Though I've not shown it here this has been observed in several other individuals as well (around 5 all up), all of which were sequenced in house by the same technician using the same batch of kits and same prep method, over a 8month period with flowcells being sequenced on one of 2 instruments, there are no obvious issues seen in the HiSeq sequencing report, in Fastqc, Picard MultipleMetrics or the BQSR plots.

    Thanks it will be good to work out what exactly is happening here.

    Note the area under the curve for the graph adds up to:
    GATK 100.00%
    BED 99.25%
    Platy 99.29%
    The difference is due to slightly discarding sites over 51x for bed and platy and counting them in 51+ for GATK.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for the wealth of additional info! I think what we need here to complete the picture is the coverage plot for a BP_RESOLUTION gVCF of the same individual (because it will have been submitted to the same filtering, but will show all reference as well as variant sites). And if possible, show this faceted or decomposed by chromosome / contig.

  • pdexheimerpdexheimer Member ✭✭✭✭

    I don't know anything about Platypus, but I can think of two other things to look at (in addition to what Geraldine suggested, if nothing else we can keep you busy): stratify by variant type (SNV vs indel), and try to identify a handful of variants that look like they disagree between the callers and see if anything about them pops out

  • aeonsimaeonsim Member ✭✭✭

    Well the base pair resolution gVCF is going to take a while (need cluster time) but in the mean time here is what happens when I split NL288 into SNPs & Indels, pretty much the same pattern of a second peak at ~16x (8x from what should be the main peak at the mean of 24).

  • aeonsimaeonsim Member ✭✭✭

    Ok so here are the BP resolution plots for the individual in question (NL288), looking at them I see no evidence for this effect due to any of GATK's internal filters that would have been running on the data when compiling these stats.

    Note ChrX is at approximately half the usual coverage as expected for this male, how ever at ~140MB our of 2700MB it's unlikely to be responsible for the second peak we see in the variants or the overall distorted shape seen for the primary peak of the variant sites and it does not cause a peak in the whole genome plot from the BP res gVCF, excel sheet contains the raw numbers should the graph be too difficult to follow.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, thanks for these plots. Is it me or does this look like the distributions are all unimodal per contig, and it's the combination that ends up looking bimodal? Or am I reading this wrong?

  • aeonsimaeonsim Member ✭✭✭
    edited June 2014

    I don't think so, the ChrX which is the left most peak only contributes 5% of the sites to the genome.

    If we look at the graphs attached here we have the GATK BP-RES plot which shows a nice clean distribution across the complete genome, the GATK Var sites which gives the bimodal distribution & the Platypus Var sites which gives a left shifted unimodal distribution for the same individual. It seems that the combination of the varying coverage between the Chrs (including ChrX) should give a left shifted distribution similar to that in Platypus rather than the dip between the two peaks we see in GATK.

    If the variation in the Chrs + ChrX when we effectively sample ~20million sites (The variants) is enough to generate a bimodal distribution why would we not see something similar when looking at all the sites? Unless we are getting a noticeable excess of variants called on ChrX or in low covered regions of the genome. It really looks like we are calling fewer variants in regions of 16-24x coverage than I would expect.

    Still an easy way to check this will be to take a Female individual from our population and see if they show the same distribution, I'll get back with that.

  • aeonsimaeonsim Member ✭✭✭
    edited July 2014

    So looking at a dam where ChrX has similar coverage to the rest of the Chromosomes we still see the peak. In this case I've actually split the variants by Chromosome as well and we see the peak within every chromosome as well as in the overall whole genome (all 20M sites). I think that that is pretty convincing that something odd is going on here, when we can see it the whole genome level, within chromosome, and within class (SNP/Indel) but not in other pipelines. For some reason GATK HC seems to be calling an excess of variants in all chromosomes at around 8x below the mean across multiple samples of both genders. Or possible it's calling a small excess of variants at the lower mark but not enough variants between 16-24x coverage for a mean of 24x.

    Have you checked for this in the human samples you have access to? Accompanying graph shows half a dozen chromosomes including X for a female with 24x average depth of coverage. Other chromosomes look very similar to those shown and the second peak clearly separates from the main peak in all chromosomes, while ChrX as at similar levels of coverage as the other chromosomes.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, you've convinced me this is real. I'll put this into our issue tracker to figure out what's happening here. Would you be able to share with us a bam file with one chromosome's worth of data, the corresponding GVCF, and the corresponding reference?

  • aeonsimaeonsim Member ✭✭✭

    Ok I'll upload the files next week when I've got some time. Does the FTP work with command line clients yet (there was a bug last time I tried)? As it'd be simpler to upload directly from our cluster rather than having to transfer to a local machine to use a GUI.

    Are you seeing this in your own bam/gVCF files?

  • bredesonbredeson Member ✭✭
    edited February 2015

    Hey @Geraldine_VdAuwera,

    I'm also experiencing this artifact in my banded GVCF output with GATK v3.3-0. Interestingly, it also causes a significant (and unnecessary) loss of reads, cutting my depth approximately in half. The aligned data are 2x250 Illumina reads from the 3rd-generation inbred reference accession. Please see the attached histograms.

    The panel on the left shows the distribution from the banded GVCF sample DP field, and the panel on the right shows the sample DP field extracted from the BP_RESOLUTION GVCF file.

    Is there a fix for this to be expected in the near future?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bredeson I discussed this with the devs some time ago (I'm surprised we don't seem to have followed up in this thread) and if I recall correctly they told me this is expected, not a cause for concern. I'll have to look up the exact explanation they gave me. Ping me if I don't get back to you by tomorrow; I just got back from vacation and have a lot to catch up on.

  • bredesonbredeson Member ✭✭

    Hey @Geraldine_VdAuwera,

    I'm pinging you, as requested. I'm currently working around this issue by outputting in BP_RESOLUTION, as some of our analyses rely on having accurate depth estimates (applying depth thresholds and calculating the callable loci as determined by HC, for instance). It was quite surprising (and unexpected to the user) to observe this artifact. Out of curiosity, do the devs have an explanation for why the artifact exists and why so much more data is thrown out?

    Many thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, it turns out that I was thinking of a different issue of apparent artifact caused by banded GVCFs (at the VQSR step) that is not a cause for concern. So much for me not taking the time to re-read the full thread.

    It seems that we never actually figured this one out. Would you be able to share your data, @bredeson?

  • bredesonbredeson Member ✭✭
    edited March 2015

    Hey @Geraldine_VdAuwera,

    I've uploaded a chromosome's worth of data and its reference as a tarball to the ftp. The data are 2 libs of 250+200 Illumina reads, ~125x depth, and were used in the de novo assembly of the reference sequence:

    -rw-r--r-- 1 depristo wga 3020950817 Mar 3 03:01 oddCoverage.tar.gz

    Hope that helps!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bredeson I'm sorry we haven't been able to get to this at all, we just have too much going on right now. I'm afraid we're going to have to classify this as an unexplained quirk for now, as I can't commit to having anyone look at this in the foreseeable future. Apologies for leaving you hanging this long.

Sign In or Register to comment.