We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Excessive heterozygosity in a dataset filtered by VQSLOD


TL;DR: I have lots of excess het even after filtering (I used VQSLOD and then filtered by InbreedingCoeff too). Why, and what do I do? See questions at end.


I have a WES+ dataset of about 1.2 million bp from 202 individuals (mixed ancestries from a coastal American city, one sibling pair but others unrelated; 120ish Europeans, the rest Asian/African/admixed/etc) that I filtered with VQSLOD.

After filtering, I find that a very large number of SNPs have extreme values of heterozygosity. For example, some SNPs had 0/1 for every individual except one. The odds of this happening by random chance are... slim.

I then filtered by InbreedingCoeff to try to limit the amount of heterozygosity (and I used -0.55 instead of -0.8 as a threshold based on the distribution of hets). I noticed that some SNPs with very low heterozygosity (like all 0's and one het) were marked as having InbreedingCoeff<-0.8. Approximately 2000 of the InbreedingCoeff<-0.55 SNPs had <20 het alleles (not because of missing data) out of 7000 SNPs with InbreedingCoeff<-0.55.

After filtering by InbreedingCoeff, I still see a very large number of heterozygous SNPs. Approximately 10,000 have more than 140 heterozygous alleles, which should never happen in this dataset.

It seems pretty clear that these SNPs (1-2% of my dataset) are low quality reads, but I don't understand why the standard filters aren't picking them up.

As a related aside, after VQSLOD I also found two of the individuals in my dataset have excessive het. Should I re-run VQSLOD after removing them?

So my main questions for you are:

1. What is the right way to properly filter out SNPs with excessive heterozygosity? Filtering by VQSLOD left SNPs with 201 hets and 1 hom. Filtering by InbreedingCoeff leaves tens of thousands of SNPs that are still excessively het.

2. Why might InbreedingCoeff<-0.8 (or <-0.55) filter out so many SNPs that don't appear outbred and have few heterozygous alleles? Is it noticing something about the sample quality perhaps?

Thank you for your time,



  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    One reason could be the genome that you use may be a biased one. Some reference genomes without unlocalized contigs end up contaminating your actual WES loci with reads from pseudogenes or pseudosimilar sequences coming from else where in the genome. Also there may be contamination coming from retrotranspositions along the genome which we have no idea how often or where they actually insert themselves. Some of those insertions are pretty common for some of the genes such as PRKRA, MFF etc...

    I would strongly suggest you to use hg38 with alt mapping option on during BWA stage. This may reduce some of the artifacts that you are facing.

    Also I noticed that your population includes African and ad-mixed ones. These may end up messing your inbreeding estimations. It may be also a wise decission to check these excessive het loci to see if they end up some kind of heterozygosity region for a particular population or else. You may want to compare your data with 1000G phase 3. It may give you big hints whether the variant you observe is a real one or artifact based on the population that your samples belong.

    These were just my 2 cents.

  • Thanks for your comment SkyWarrior.

    The data was initially mapped (a couple years ago) with hg17, then lifted over to hg38 for GATK, so you might be right about these sorts of contamination issues. And I could see how admixture might mess up inbreeding estimations. But I still don't understand why het SNPs would be missed by the filters. Seems like something with 150 or 201 hets (out of 202 individuals) should be flagged regardless.

    Most of our population is European, but I agree that checking some reference data might be useful. If it's excessively het in the reference then it's probably a difficult region to sequence. If not then maybe it's a problem with our population or alignment.

    But yeah I still don't quite understand why filtering doesn't remove SNPs with excessive het, maybe it wasn't designed to...
  • I looked into the 1000 genomes data quickly (https://doi.org/10.12688/wellcomeopenres.15126.1) and it looks fairly consistent. I've only looked at CHR1, and the 1000G data includes only biallelic SNPs.

    The distribution of HWE p-values (plink-calculated) in the 1000G dataset is pretty extreme. 10% of SNPs have p-values < .01 and 80% of SNPs have p-values > 0.99. However, if I filter to snps that are overly heterozygous (O(HET)>E(HET)), then only 0.2% of SNPs have p-values < 0.01. In other words, overly het SNPs are generally rare.

    Then I compared to my dataset.

    There are 82,321 SNPs overlapping between the datasets (1000G and mine).

    In the 1000G data that overlaps mine, 1.3% of het SNPs have p-values < 0.01. So the regions in my dataset seem to be enriched for het SNPs in the 1000G data.

    Then I focused on the het SNPs with low p-values in my dataset, p<5e-8.

    In 1000 genomes there are 134 such SNPs. 101 of them are het. 31 of those have p<0.01 (31%).

    So the het SNPs with very low p-values in my dataset are highly enriched to have p<0.01 in 1000G.

    I think this means my results are generally consistent with 1000G: SNPs that are extremely heterozygous in my dataset are much more likely than random chance to be extremely heterozygous in the 1000G data.

    I guess this means it's safe to filter these out.

    Still don't have answers to my initial questions: (1) why don't VQSLOD or InbreedingCoeff filter out these SNPs with extremely high het; (2) Why does inbreedingcoeff filter out SNPs with very low het?

Sign In or Register to comment.