Apparent true SNPs with allele imbalance pass all hard filters except QD

Hi,

I have been working on fine-tuning hard filtering for a set of 100 samples where we are capturing a very small target (500 kb).

Through this fine-tuning it has become clear to me that QD is probably the strongest filter. However, it has a cost: if one sets the limit at QD < 2, one filters out a non-negligible number of SNPs for which there is strong evidence that they are true.

I have taken a look at these SNPs to try to see why they are failing the QD filter.

Almost all of these SNPs are in heterozygous genotypes and show strong allele imbalance, that is, the alternative allele is supported by quite a bit less than 50% of the covering bases. This explains why they fail QD: they have decent (though not great) QUAL because of the allele imbalance, but they have excellent coverage and are therefore heavily penalised by the depth correction. However, all the other filters (FS, HaplotypeScore, ReadPosRankSum, BaseQRankSum, MQRankSum, etc.) are passed. In addition, several of these SNPs are in curated databases of variants. Finally, some of these SNPs occur in multiple samples, and the imbalance between alleles is similar in different samples at the same site.
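To make the QD arithmetic concrete, here is a minimal sketch (the QUAL and depth numbers are invented for illustration; QD is essentially the site QUAL divided by the unfiltered depth over the samples carrying the variant):

```python
# Minimal sketch: why a confident but allele-imbalanced het fails QD < 2.
# All QUAL and depth values below are invented for illustration only.

def approx_qd(qual, informative_depth):
    """Rough QD: site QUAL divided by the depth over variant samples."""
    return qual / informative_depth

# Balanced het at 100x coverage: ~50 ALT reads drive a strong QUAL.
print(approx_qd(qual=1500.0, informative_depth=100))  # 15.0 -> passes QD < 2

# Imbalanced het at 200x coverage: only ~20% ALT reads, so QUAL is merely
# decent, while the large depth divisor drags QD below the cutoff.
print(approx_qd(qual=350.0, informative_depth=200))   # 1.75 -> fails QD < 2
```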

In summary, these variants appear to be true and are characterised by allele imbalance. But what is causing this allele imbalance? How come we sequence more of one allele than the other?

Answers

  • ebanks (Broad Institute)

    Hi Tim,

    The QD annotation was created for just this type of situation. Are you certain these are real events and not due to misalignments or sequencing errors? What is the transition/transversion ratio of these variants relative to the ones that pass QD (see the sketch at the end of this reply)? I'll bet you'll find that it's much worse. Remember that these types of errors also have a habit of making it into those curated DBs of known variation.

    That being said, this is also why we developed the VQSR, a statistical filtering approach designed to model error in multiple dimensions simultaneously. Unfortunately you cannot use this approach with your small target though. But if you believe that these are real events and it's just a symptom of your data, then by all means you should adjust the hard filter threshold. The recommended best practices are intended to be used in the general case and we expect every analyst to determine which thresholds are ideal for their own data type. Blindly copying the thresholds will not always lead to the best results.
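
    Here is a minimal sketch of that Ti/Tv check (the file names are placeholders, and the VCF parsing is deliberately naive, handling only simple biallelic SNP records):

    ```python
    # Sketch: compare the transition/transversion ratio of QD-failing vs
    # QD-passing call sets. True SNP sets tend toward a Ti/Tv of ~2-3
    # (higher in coding targets); random errors push it toward ~0.5.

    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def titv(vcf_path):
        """Count transitions and transversions among biallelic SNPs."""
        ti = tv = 0
        with open(vcf_path) as vcf:
            for line in vcf:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                ref, alt = fields[3], fields[4]
                if len(ref) != 1 or len(alt) != 1:  # skip indels, multiallelics
                    continue
                if (ref, alt) in TRANSITIONS:
                    ti += 1
                else:
                    tv += 1
        return ti, tv, (ti / tv if tv else float("inf"))

    # Placeholder file names: one VCF of QD-failing sites, one of QD-passing sites.
    print(titv("qd_failing.vcf"))
    print(titv("qd_passing.vcf"))
    ```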

  • TimHughes

    Hi Eric,

    Thanks for your feedback. I will check the Ti/Tv ratios.

    I understand what you mean when you say QD was created for just this type of situation. But just to clarify, as I think I may be onto something interesting in this dataset: I have several hundred variants with QD < 2.0, and several of these are variant sites in curated databases. Because of this I did not apply QD but used pretty much all the other filters. Very few variants survived this filtering battery. A significant fraction of even the known variants was eliminated (when I inspected them, they clearly failed filters such as FS or BaseQRankSum). But about half of the known variants survived, while almost none of the unknown variants did, and none of these survivors would have passed QD.

    You suggest that they may be sequencing or alignment errors (and many of the known variants clearly are), but I do find it interesting that there is this small fraction that survives pretty much the entire battery of filters excluding QD (HaplotypeScore, InbreedingCoeff, FS, BaseQRankSum, MQRankSum, ReadPosRankSum, etc.); a sketch of this QD-free battery follows below.
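
    As a minimal sketch, such a QD-free battery might look like this (the thresholds echo the recommended SNP hard-filtering defaults of the time, except BaseQRankSum, whose bound is a placeholder; the INFO parsing is deliberately naive):

    ```python
    # Sketch: apply a battery of hard filters, deliberately omitting QD.
    # A record passes if every annotation that is present stays in bounds.

    THRESHOLDS = {
        "FS": lambda v: v <= 60.0,
        "MQRankSum": lambda v: v >= -12.5,
        "ReadPosRankSum": lambda v: v >= -8.0,
        "HaplotypeScore": lambda v: v <= 13.0,
        "BaseQRankSum": lambda v: v >= -8.0,  # placeholder bound, tune per dataset
    }

    def parse_info(info_field):
        """Parse a VCF INFO string into {key: float} for numeric annotations."""
        out = {}
        for item in info_field.split(";"):
            key, sep, value = item.partition("=")
            if sep:
                try:
                    out[key] = float(value)
                except ValueError:
                    pass  # non-numeric annotations are ignored
        return out

    def passes_battery(info_field):
        info = parse_info(info_field)
        return all(ok(info[key]) for key, ok in THRESHOLDS.items() if key in info)

    # QD is present in the INFO but intentionally not checked.
    print(passes_battery("AC=2;FS=3.1;ReadPosRankSum=-0.45;QD=1.7"))   # True
    print(passes_battery("AC=2;FS=75.0;ReadPosRankSum=-0.45;QD=1.7"))  # False
    ```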

  • TimHughes

    I computed the Ti/Tv ratio for all variants surviving the battery of other filters (with QD < 2), stratified by known/unknown: I get a Ti/Tv of about 30/12 (~2.5) for the known variants but 25/30 (~0.83) for the unknown ones. So your prediction is correct for the unknown variants, but there are also clear TP calls at QD < 2, as corroborated by the Ti/Tv ratio and the annotation (amongst the known I have 7 in HapMap).

    This may be a result specific to my dataset, but I would suggest that others using hard filtering think hard about setting the QD cutoff below 1 (as long as a good set of the other filters is in place) if sensitivity is highly valued over specificity.
