
# Distribution of RGQ scores

Member

I work with non-human genomes and commonly need confidence scores at reference sites, so I was happy to see the inclusion of the RGQ score in the FORMAT field output by GenotypeGVCFs. However, I am a little confused about what this score means and how it is calculated. Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1 Mbp. A few things jumped out that I was hoping you could explain:

(1) There are two peaks of GQ and RGQ scores: one at 99, which is obviously just the highest confidence score, and another at exactly GQ/RGQ = 45. You can see this in the GQ/RGQ distribution below, where I've excluded the sites with RGQ/GQ = 0 or 99 (RGQ = blue, GQ = red). Is there some reason why so many GT calls == 45?

(2) There are very few GQ = 0 calls and ~96% are GQ = 99, but for RGQ, ~42% == 0 and 54% == 99. Is there any explanation for why so many RGQ scores == 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.

#### Issue · Github September 2015 by Sheila

Issue #168 · State: closed · Closed by vdauwera

@rwness
Hi,

The GQ and RGQ values are calculated by subtracting the most likely genotype's PL from the second most likely genotype's PL. Because the most likely genotype's PL is always 0 (the PLs are normalized), GQ and RGQ are simply the second most likely genotype's PL. Have a look at these documents for more information on PL and GQ: https://www.broadinstitute.org/gatk/guide/article?id=5913
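To make that calculation concrete: with normalized PLs (the most likely genotype has PL = 0), the GQ reduces to the second-smallest PL, and the VCF caps the reported value at 99. A minimal sketch in plain Python (the function name is mine, not part of GATK):

```python
def gq_from_pls(pls):
    """GQ = PL of second most likely genotype minus PL of most likely.

    PLs are normalized so the most likely genotype has PL == 0, which
    means this reduces to the second-smallest PL.  The value reported
    in the VCF is capped at 99.
    """
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], 99)

# Example: PLs for genotypes 0/0, 0/1, 1/1 at a hom-ref site
print(gq_from_pls([0, 45, 600]))    # -> 45
print(gq_from_pls([0, 120, 1500]))  # -> 99 (capped)
```

Note that PLs of [0, 0, 30] give GQ = 0: two genotypes are equally likely, so there is no confidence in the call, which is relevant to the all-zero-PL sites Sheila mentions below.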

1) You are correct that the highest GQ possible is 99. As for the peak at 45, I am not sure why that is happening; it may simply be an artifact of your data. Again, have a look at how the PL is calculated. Also, please keep in mind the RGQ is calculated from the reference blocks. The reference block GQ is equivalent to the lowest GQ of the sites in the block.

2) Because we only call a variant at a site we are confident in, the GQ is not likely to be low. However, because the RGQ is calculated from reference blocks (and is the lowest GQ of the block), it is more likely to be lower. Also, if a site has all 0 PLs, it is going to be called 0/0 because we cannot be sure there is a variant there (and the default is 0/0).
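The reference-block behavior described above can be sketched the same way: the RGQ reported for a GVCF reference block is the minimum per-site GQ within it (this is a simplification that ignores GVCF banding details):

```python
def block_rgq(site_gqs):
    """RGQ for a reference block: the lowest per-site GQ in the block.

    A single uncertain site (e.g. all-zero PLs, so GQ == 0) drags the
    whole block's reported RGQ down to 0, even if the neighbouring
    sites in the block are highly confident.
    """
    return min(site_gqs)

print(block_rgq([99, 99, 45, 99]))  # -> 45
print(block_rgq([99, 0, 99]))       # -> 0
```

This minimum-over-block behavior is one reason low RGQ values are more common than low GQ values.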

-Sheila

• Member
edited January 2016

Hi Sheila - I have been working on this problem again recently and I think I know why I am getting a huge excess of PLs that equal 0 and 45.
In my data there are reads that overlap (fragment length = 180 bp, so the paired reads overlap by ~20 bp). I think this means that the quality of the bases in the overlapping portion is capped at 45 if the two reads concur on the base, and set to 0 if they disagree. If the lowest-quality base at a site somehow constrains the MQ of the read or the PL value for the site, it could be reducing the PL to exactly 45 or 0.

It would be extremely useful to know how and when hard constraints are placed on the quality of bases, reads, sites, and genotype calls. For example, if two bases overlap and don't agree, what does that mean for the MQ of that site, of that read, of those bases, and the genotype call of that individual?

It would also be nice to be able to tweak these parameters, because they seem to be applied differently to reference and non-reference sites.

Post edited by rwness on

@rwness
Hi,

The way HaplotypeCaller deals with paired reads when they overlap is documented here. The base quality is either Q20 or Q0 depending on whether the overlapping reads agree or disagree. This thread might help too.
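The rule Sheila describes can be sketched as follows. This is a simplification based only on the description above ("Q20 or Q0"), not the actual HaplotypeCaller implementation, and the function name is mine:

```python
def adjust_overlap_qual(base1, qual1, base2, qual2):
    """Simplified overlapping mate-pair rule as described in this thread:
    if the two reads agree on the base, the quality is set to a fixed
    Q20; if they disagree, it is zeroed so the base carries no evidence.
    Returns the (base, quality) pair used for the first read's copy.
    """
    if base1 == base2:
        return base1, 20   # agreement: fixed, capped quality
    return base1, 0        # disagreement: effectively masked

print(adjust_overlap_qual('A', 35, 'A', 30))  # -> ('A', 20)
print(adjust_overlap_qual('A', 35, 'C', 30))  # -> ('A', 0)
```

Under this rule, a site covered only by an agreeing overlapping pair contributes a bounded amount of evidence regardless of the original base qualities, which is consistent with the sharp peaks in the GQ/RGQ histogram.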

I don't think you can tweak those base qualities.

Can you please post the GQ vs RGQ graphs again?

Thanks,
Sheila

• Member

I understand the capping of base qualities of overlapping reads, but what I can't understand is how this base quality could then force the RGQ and GQ to be exactly 45 or 90. Do the overlapping reads have a capped MQ? Is the data being downsampled to create this bias? I know that in other GATK walkers there are 'hard' decisions made about quality information; for example, there were implementations of UnifiedGenotyper where the BQ of a base was capped at the MQ of the read it came from, and that filter would create oddities in the distribution of BQ. So what I am wondering is: are any filters like that being applied to my data that would cause so many sites to end up in the 45/90 RGQ/GQ bins of this histogram?

The other problem is that nearly 40% of reference sites have an RGQ of 0, which seems incredibly low. Could this also be caused by capping?

The graph is a histogram of RGQ and GQ scores - excluding sites with 0 or 99. Red is reference sites, blue is variant. The X-axis is RGQ or GQ and the Y-axis is proportion (excluding 0's and 99's)
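For anyone wanting to reproduce this kind of histogram, the score tallies can be pulled straight from the VCF. A minimal sketch with a deliberately simplified single-sample parser (a real callset should go through a proper VCF library such as pysam; the inline records are made-up examples):

```python
from collections import Counter

def collect_conf_scores(vcf_lines):
    """Tally GQ (variant sites) and RGQ (reference sites) from VCF records.

    Simplified single-sample parser: finds GQ/RGQ by position in the
    FORMAT column.  Header lines (starting with '#') are skipped.
    """
    counts = {"GQ": Counter(), "RGQ": Counter()}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        fmt_keys = fields[8].split(":")
        sample = fields[9].split(":")
        for key in ("GQ", "RGQ"):
            if key in fmt_keys:
                counts[key][int(sample[fmt_keys.index(key)])] += 1
    return counts

# Hypothetical records: one variant site, two reference sites
records = [
    "chr1\t100\t.\tA\tT\t50\t.\t.\tGT:GQ:PL\t0/1:99:99,0,120",
    "chr1\t101\t.\tC\t.\t.\t.\t.\tGT:RGQ\t0/0:45",
    "chr1\t102\t.\tG\t.\t.\t.\t.\tGT:RGQ\t0/0:0",
]
counts = collect_conf_scores(records)
print(counts["GQ"][99], counts["RGQ"][45], counts["RGQ"][0])  # -> 1 1 1
```

The resulting Counters can be normalized (excluding the 0 and 99 bins, as in the plot described above) and fed to any plotting library.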

#### Issue · Github February 2016 by Sheila

Issue #551 · State: closed · Closed by vdauwera

> filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls

By definition you need to filter invariant calls and variant calls differently/separately, so this is kind of a non-sequitur... What are you planning to do with the final callset(s)?

• Member

The data will be used for a variety of population genetics analyses. The filtering needs to be such that the stringency is equal across variant and invariant sites to avoid bias. In a lot of analyses it's just as important to know how many and which sites are invariant, but with 40% of RGQ == 0 and these weird distributions, it's hard to know how to filter this intelligently. As I said in my other comments, I am just trying to find out what sorts of hard caps or filters are placed on the data through these calculations. Something is constraining RGQ == 0 when nearby variant sites have GQ == 99.

I understand the desire to have comparable stringency on both types of sites but due to how these scores are calculated you really shouldn't consider the numbers as being functionally equivalent. Any downstream analysis should take into account that confidence for variant sites and invariant sites are not directly comparable. It is a technical limitation that is inconvenient but at present, inescapable.
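One pragmatic way to apply comparable (if not strictly equivalent) stringency is a shared numeric cutoff on GQ for variant sites and RGQ for invariant sites. A sketch, with a hypothetical threshold of 20; as noted above, the two scores are not statistically interchangeable, so this is a convention rather than a guarantee:

```python
def passes_confidence(is_variant, gq, rgq, threshold=20):
    """Apply the same numeric cutoff to variant GQ and invariant RGQ.

    The two scores are calculated differently (RGQ is a block minimum),
    so a shared threshold is a pragmatic choice, not an exact match in
    stringency.  Missing scores fail the filter.
    """
    score = gq if is_variant else rgq
    return score is not None and score >= threshold

print(passes_confidence(True, gq=99, rgq=None))   # -> True
print(passes_confidence(False, gq=None, rgq=0))   # -> False
print(passes_confidence(False, gq=None, rgq=45))  # -> True
```

With ~40% of reference sites at RGQ == 0, any positive threshold discards them all, which is exactly the bias concern raised earlier in the thread.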

That being said, we are currently investigating some reports of unusually large numbers of RGQ 0 calls around GQ 99 calls, and I wonder if your observation might be related to that. There isn't an intentional constraint that would cause this so I suspect something screwy may be going on.

• Member

Thanks @Geraldine_VdAuwera - I never assumed the two values were equivalent - which was why I was examining their distributions in the first place.

By constraints I mean things like capped base quality for overlapping reads, the RGQ for a site being constrained to the lowest RGQ in the block, base quality being capped at the MQ of the read it was derived from, etc. I know that along the way these decisions have been made, and I am assuming that my data have an idiosyncrasy that is resulting in a disproportionate number of sites with RGQ of exactly 45 and 90 (and ~40% with RGQ of 0). I will keep looking into it to try to figure this out and let you know if I find anything.

Fair enough, I'll ask around on my end but let me know if you find anything.

• Member

What are the guidelines on filtering non-variant sites?

#### Issue · Github May 2016 by Sheila

Issue #852 · State: closed · Closed by vdauwera

We don't have any guidelines on this, sorry. Consider looking at RGQ, which gives you the degree of confidence in the non-variant call.

• Member

Hi,
I also had a question about filtering non-variant sites in a non-model organism. Is there a general RGQ threshold that you would suggest?