Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Which part of mym QD plot is the homozygous peak?

ianwianw Bangor, UKMember

Hi there,

Just a quick question, which I think may be of use to people with similarly...squiffy...plots! I've plotted out QD values v Density so as to inform the hard filtering process but I'm having difficulty discerning the expected peaks for heterozygous calls and homozygous calls, as described at https://software.broadinstitute.org/gatk/guide/article?id=6925. As you can see from the attached plot, there is a peak at the lower values (or is it a shoulder?), a tiny bump and then a major peak, but then just a shoulder on the other side of the peak. As filtering effectively (and stringently) is key to my study, I'd like to know what each peak and shoulder represents before I take the plunge if anyone can make an educated guess, please?

Many thanks,

Ian

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    It's hard to say with certainty, but it looks to me like the peaky shoulder on the left (below QD=10) is your lump of probable false positives; then you have a proper peak for hets around QD=15, then an indistinct shouldery hump of hom-vars on the right around QD=30.

    There can be various reasons for why your hom-vars aren't coming out as clearly as ours. One of them is that we show plots made with calls from a trio of related individuals. If your plot includes many individuals that are not directly related, it may be that you have some blurring of the lines because of the relative amounts of het and homvars you see for each variant -- that causes wider distributions for each and so the peaks end up getting merged. Technical parameters of the sequencing could also affect this, eg more variability in the quality metrics at different sites.

    If you need very stringent filtering you could push the QD threshold to 7 or 8, based on the plot. Depends on how much of the FP shoulder/peak you can erode with other annotations.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Accepted Answer

    It's hard to say with certainty, but it looks to me like the peaky shoulder on the left (below QD=10) is your lump of probable false positives; then you have a proper peak for hets around QD=15, then an indistinct shouldery hump of hom-vars on the right around QD=30.

    There can be various reasons for why your hom-vars aren't coming out as clearly as ours. One of them is that we show plots made with calls from a trio of related individuals. If your plot includes many individuals that are not directly related, it may be that you have some blurring of the lines because of the relative amounts of het and homvars you see for each variant -- that causes wider distributions for each and so the peaks end up getting merged. Technical parameters of the sequencing could also affect this, eg more variability in the quality metrics at different sites.

    If you need very stringent filtering you could push the QD threshold to 7 or 8, based on the plot. Depends on how much of the FP shoulder/peak you can erode with other annotations.

  • ianwianw Bangor, UKMember

    That's really helpful, thank you. I wondered if the leftmost peak/shoulder might have been false positives. Good to get a second opinion on it and to know the reason behind the strange plot.

  • ABoursABours Member

    Hi,

    I'm facing a similar question. However, I first want to clarify.

    -) The peaks of the QD don't have to be around 12 and 32 as stated in tutorial 11069 (https://software.broadinstitute.org/gatk/documentation/article?id=11069)? The homozygous peak should just be around twice the heterozygous peak, or not?

    -) My QD plot only shows 1 distinct peak around 17 (see attachment, please ignore the diagonal line in the graph). We suspect ourselves that this peak is the heterozygote peak, due to the nature of the samples we have. However if one always should have two peaks in a QD-plot, is it possible that we have so little homozygotes (they just drown in the shoulder around (2*17=34)?

    I hope you can clarify and help us (I will have a look at the Qual and DP density graphs in case something maybe explained in there).

    Best and thanks,
    Andrea

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Hi Andrea,

    Correct, QD does not have to be 12 or 32. Your peaks will be dependent on your data (the tutorial uses a family trio). Geraldine's post explains some of the reasons the peaks can vary. Since QD is based on the quals of reads that support the variant (not the reference reads) heterozygous variants for diploids have ~1/2 the QD of those we see in the homozygous form so homs will have twice as many reads, as you've said. This is observed more strongly for one or a few related samples; once you have more, the effect is diluted because you'll observe a mix of hets and homs for each site (mix dependent on population diversity, etc.). What type of samples are you working with?

  • ABoursABours Member

    Hi @Tiffany_at_Broad,

    (sorry for the delay)

    I'm working with bird populations (none of theme are related individuals), spread across Europe. It is known for these birds that they have low levels of population diversity. Furthermore, due to the use of different sequencing platforms the readdepth per individual was quite variable. So, you suggest that what I'm seeing is a mix of hets and homs? Should I therefor not move the filter at all?

    I'm including DP and QUAL graphs in case this might aid you.

    Best,

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Hi @ABours , Sorry for the delay. I think you could probably move the filter a bit at the ends to be more stringent. I'll see if someone else from our group has any thoughts.

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Ok, the guidance I got, and sorry we can't be more specific, is that you can probably move the filter a little unless you care more about sensitivity over specificity. We are assuming you don't have the second bump because you combined a bunch of samples in that plot with various sequencing depths.

Sign In or Register to comment.