ExcessHet filtering in cohorts with family members

AndreasZ, Sydney, Member
edited May 2 in Ask the GATK team

This post mentions that the first step in Best Practice VQSR filtering involves hard filtering on ExcessHet. The post also states that "ExcessHet filtering applies only to callsets with a large number of samples, e.g. hundreds of unrelated samples."
I have a cohort of 170 samples from 70 families. Each family consists of a trio, a duo or a singleton, depending on who I could recruit. Some of these families are consanguineous. Each family has a different undiagnosed rare disease, and I am trying to find the causal variant in each family. All 170 samples were aligned and variant-called following GATK Best Practices. No pedigree file was used during variant calling. In a sample VCF from this cohort there are 4.9 million variants, 124'00 of which are flagged as 'ExcessHet'. My question is: can I treat the variants flagged as 'ExcessHet' as likely false positives, or does the fact that my cohort consists of multiple unrelated families make the 'ExcessHet' filter unreliable?
I don't really understand why ExcessHet filtering is needed in the VQSR workflow, so if you could explain what ExcessHet does in this context, this would be great!



  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator


    This annotation estimates the probability that the called samples exhibit excess heterozygosity with respect to the null hypothesis that the samples are unrelated. The higher the score, the higher the chance that the variant is a technical artifact or that there is consanguinity among the samples. In contrast to Inbreeding Coefficient, there is no minimum number of samples for this annotation. If samples are known to be related, a pedigree file can be provided so that the calculation is performed only on founders and offspring are excluded.
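To make the idea of "excess heterozygosity under a null of unrelated samples" concrete, here is a minimal Python sketch of a one-sided exact Hardy-Weinberg test on genotype counts, reported on a Phred scale. It is an illustration only and is not GATK's implementation: the function names `het_probabilities` and `excess_het_phred` are made up for this example, and the plain one-sided tail sum is an assumption, so scores from the real annotation may differ.

```python
import math


def het_probabilities(n_samples, n_minor):
    """Probability of each possible heterozygote count under Hardy-Weinberg
    equilibrium, conditional on the sample size and minor-allele count.
    Illustrative helper, not part of GATK."""
    n_major = 2 * n_samples - n_minor
    # Terms that do not depend on the heterozygote count.
    log_const = (math.lgamma(n_samples + 1) + math.lgamma(n_minor + 1)
                 + math.lgamma(n_major + 1) - math.lgamma(2 * n_samples + 1))
    probs = {}
    # The het count must have the same parity as the minor-allele count.
    for het in range(n_minor % 2, min(n_minor, n_major) + 1, 2):
        hom_minor = (n_minor - het) // 2
        hom_major = (n_major - het) // 2
        log_p = (log_const + het * math.log(2)
                 - math.lgamma(hom_minor + 1)
                 - math.lgamma(het + 1)
                 - math.lgamma(hom_major + 1))
        probs[het] = math.exp(log_p)
    return probs


def excess_het_phred(hom_ref, het, hom_var):
    """Phred-scaled one-sided p-value for observing at least `het`
    heterozygotes. Assumption: a plain tail sum, no mid-p correction."""
    n_samples = hom_ref + het + hom_var
    n_minor = min(2 * hom_ref + het, 2 * hom_var + het)
    probs = het_probabilities(n_samples, n_minor)
    total = sum(probs.values())
    p_value = sum(p for h, p in probs.items() if h >= het) / total
    # Guard against log10(0) for extremely small p-values.
    return -10 * math.log10(max(p_value, 1e-300))
```

Calling `excess_het_phred(0, 100, 0)` (every sample heterozygous) gives a very large score, while `excess_het_phred(25, 50, 25)` (genotypes close to Hardy-Weinberg proportions) gives a score close to zero. Related samples break the independence that this calculation assumes, which is why restricting it to founders matters.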

  • imneuro, Member


    Thanks for the suggestion. I am looking for suggestions for a family study too.

    Suppose the pedigree file describes a family with missing founders. For example, subjects 3 and 4 in the pictured pedigree are not sequenced.

    Would the calculation be performed on the set (1,2,9,10) or on the set (1,2,9,10,5,6,7,8)? Or maybe on some other set I haven't thought of, because 1 and 9 are brother and sister?

  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator
    edited June 10

    Hi @imneuro

    I am not sure I quite understand the question. Would you please elaborate on what you are trying to achieve here and what your question is?

  • imneuro, Member

    Hi @bhanuGandham

    Let me try to elaborate a little more. Rather than a random sample from the population, I have family data. I relabeled the picture to illustrate the family relationships in an example pedigree file that I plan to use in the GenotypeGVCFs step.

    Let's define a circle as a female and a square as a male. A "/" on top of a circle or square means the subject has passed away and we have no data for them. A horizontal line is a marriage relationship. A vertical line is a parent-child relationship. Each row is a generation, so there are 4 generations in the picture. The subject ID is labeled under each circle or square.

    The founders of this family are subjects 23 and 24. They have daughter 1, son 9 and son 3. Son 3 married 4 and they have 4 kids (5,6,7,8). However, subjects 23, 24, 3 and 4 are deceased and we have no data for them. Subjects 22,2,10,4,19,17 are considered random samples from the population, and the rest of the group are related.

    The combine.g.vcf contains data for subjects 1,2,9,10,11,19,12,13,14,15,16,17,18,19,5,6,7,8,20,21. A pedigree file that contains all the relationships of everyone (alive and deceased) is supplied to GenotypeGVCFs. My question is: who will be used as founders when data is missing (22,23,3,4) from the combine.g.vcf file?

  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator
    edited June 24

    Hey @imneuro

    Subjects 1,2,9,10,5,6,7,8 will be used as founders, because the tool's algorithm essentially treats samples with <1 parent as founders.
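One way to read this reply is that a sample counts as a founder when none of its listed parents is present among the sequenced samples. The Python sketch below applies that reading to a partial, illustrative reconstruction of the pedigree described above. It is an assumption about the rule, not GATK's code: the helper names `parse_ped` and `select_founders`, the family ID `F1`, the sex codes of subjects 5 to 8, and the child `C1` are all hypothetical.

```python
from typing import NamedTuple


class PedRow(NamedTuple):
    family: str
    sample: str
    father: str  # "0" means no father listed
    mother: str  # "0" means no mother listed


def parse_ped(text):
    """Parse whitespace-separated PED lines, keeping the first four columns
    (family ID, sample ID, father ID, mother ID)."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append(PedRow(*fields[:4]))
    return rows


def select_founders(rows, sequenced):
    """Assumed rule: a sequenced sample is a founder if neither listed
    parent is itself among the sequenced samples."""
    seq = set(sequenced)
    return [r.sample for r in rows
            if r.sample in seq
            and r.father not in seq
            and r.mother not in seq]


# Partial pedigree based on the description in this thread.
# "C1" is a hypothetical child of 2 and 1, added to show a non-founder.
PED_TEXT = """
F1 23 0 0 1 0
F1 24 0 0 2 0
F1 1 23 24 2 0
F1 2 0 0 1 0
F1 9 23 24 1 0
F1 10 0 0 2 0
F1 3 23 24 1 0
F1 4 0 0 2 0
F1 5 3 4 0 0
F1 6 3 4 0 0
F1 7 3 4 0 0
F1 8 3 4 0 0
F1 C1 2 1 0 0
"""

SEQUENCED = ["1", "2", "9", "10", "5", "6", "7", "8", "C1"]
founders = select_founders(parse_ped(PED_TEXT), SEQUENCED)
```

Under this reading, subjects 5 to 8 count as founders because their parents 3 and 4 have no data, while the hypothetical `C1` is excluded because both of its parents are sequenced. A stricter rule that requires both parent columns to be "0" would keep only subjects 2 and 10 from this list, so it is worth checking which rule your GATK version applies.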

    PS: Check out Terra for end-to-end GATK pipelining solutions, and let us know which additional pipelines would make using GATK easier for you! For more details on whether this is the right fit for you, check out our blog page.
