We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

#### Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

# Which part of mym QD plot is the homozygous peak?

Bangor, UKMember

Hi there,

Just a quick question, which I think may be of use to people with similarly...squiffy...plots! I've plotted out QD values v Density so as to inform the hard filtering process but I'm having difficulty discerning the expected peaks for heterozygous calls and homozygous calls, as described at https://software.broadinstitute.org/gatk/guide/article?id=6925. As you can see from the attached plot, there is a peak at the lower values (or is it a shoulder?), a tiny bump and then a major peak, but then just a shoulder on the other side of the peak. As filtering effectively (and stringently) is key to my study, I'd like to know what each peak and shoulder represents before I take the plunge if anyone can make an educated guess, please?

Many thanks,

Ian

Tagged:

It's hard to say with certainty, but it looks to me like the peaky shoulder on the left (below QD=10) is your lump of probable false positives; then you have a proper peak for hets around QD=15, then an indistinct shouldery hump of hom-vars on the right around QD=30.

There can be various reasons for why your hom-vars aren't coming out as clearly as ours. One of them is that we show plots made with calls from a trio of related individuals. If your plot includes many individuals that are not directly related, it may be that you have some blurring of the lines because of the relative amounts of het and homvars you see for each variant -- that causes wider distributions for each and so the peaks end up getting merged. Technical parameters of the sequencing could also affect this, eg more variability in the quality metrics at different sites.

If you need very stringent filtering you could push the QD threshold to 7 or 8, based on the plot. Depends on how much of the FP shoulder/peak you can erode with other annotations.

It's hard to say with certainty, but it looks to me like the peaky shoulder on the left (below QD=10) is your lump of probable false positives; then you have a proper peak for hets around QD=15, then an indistinct shouldery hump of hom-vars on the right around QD=30.

There can be various reasons for why your hom-vars aren't coming out as clearly as ours. One of them is that we show plots made with calls from a trio of related individuals. If your plot includes many individuals that are not directly related, it may be that you have some blurring of the lines because of the relative amounts of het and homvars you see for each variant -- that causes wider distributions for each and so the peaks end up getting merged. Technical parameters of the sequencing could also affect this, eg more variability in the quality metrics at different sites.

If you need very stringent filtering you could push the QD threshold to 7 or 8, based on the plot. Depends on how much of the FP shoulder/peak you can erode with other annotations.

• Bangor, UKMember

That's really helpful, thank you. I wondered if the leftmost peak/shoulder might have been false positives. Good to get a second opinion on it and to know the reason behind the strange plot.

• Member

Hi,

I'm facing a similar question. However, I first want to clarify.

-) The peaks of the QD don't have to be around 12 and 32 as stated in tutorial 11069 (https://software.broadinstitute.org/gatk/documentation/article?id=11069)? The homozygous peak should just be around twice the heterozygous peak, or not?

-) My QD plot only shows 1 distinct peak around 17 (see attachment, please ignore the diagonal line in the graph). We suspect ourselves that this peak is the heterozygote peak, due to the nature of the samples we have. However if one always should have two peaks in a QD-plot, is it possible that we have so little homozygotes (they just drown in the shoulder around (2*17=34)?

I hope you can clarify and help us (I will have a look at the Qual and DP density graphs in case something maybe explained in there).

Best and thanks,
Andrea

Hi Andrea,

Correct, QD does not have to be 12 or 32. Your peaks will be dependent on your data (the tutorial uses a family trio). Geraldine's post explains some of the reasons the peaks can vary. Since QD is based on the quals of reads that support the variant (not the reference reads) heterozygous variants for diploids have ~1/2 the QD of those we see in the homozygous form so homs will have twice as many reads, as you've said. This is observed more strongly for one or a few related samples; once you have more, the effect is diluted because you'll observe a mix of hets and homs for each site (mix dependent on population diversity, etc.). What type of samples are you working with?

• Member

(sorry for the delay)

I'm working with bird populations (none of theme are related individuals), spread across Europe. It is known for these birds that they have low levels of population diversity. Furthermore, due to the use of different sequencing platforms the readdepth per individual was quite variable. So, you suggest that what I'm seeing is a mix of hets and homs? Should I therefor not move the filter at all?

I'm including DP and QUAL graphs in case this might aid you.

Best,