
calculating the fraction of a cohort sharing the same germline variants


I am trying to calculate the percentage of samples in my cohort that share identical variants.

Being new to genomics, I followed the GATK pipeline for germline (and somatic) variant discovery using the following approach:

(1) Germline variant calling on the normal sample (HaplotypeCaller)

(2) Joint genotyping (GenotypeGVCFs)

(3) Hard filtering (SelectVariants & VariantFiltration)

  • SNPs:
    --filterExpression "QD < 2.0" --filterName "low_QD" \
    --filterExpression "FS > 60.0" --filterName "high_FS_snp" \
    --filterExpression "MQ < 40.0" --filterName "low_MQ" \
    --filterExpression "MQRankSum < -12.5" --filterName "low_MQRankSum" \
    --filterExpression "ReadPosRankSum < -8.0" --filterName "low_ReadPosRankSum_snp" \

  • INDELs:
    --filterExpression "QD < 2.0" --filterName "low_QD" \
    --filterExpression "FS > 200.0" --filterName "high_FS_indel" \
    --filterExpression "ReadPosRankSum < -20.0" --filterName "low_ReadPosRankSum_indel" \

(4) Genotype Refinement (CalculateGenotypePosteriors & VariantFiltration)

/Q1: I have a pair of Normal and Tumor samples from the same individual, and I run HaplotypeCaller on the Normal sample only.
Should I also use the tumor sample?

/Q2: To count the proportion of individuals in my cohort that share the same mutation, I look at the final vcf file produced by the pipeline and count a sample only if all of the following conditions are fulfilled:
. GT tag of the sample is not "./." (i.e. discarding samples without enough information to support a genotype determination, I guess)
. Alt allele count in AD of the sample is >0
. FT tag of the sample is "PASS" (although I am not sure this is accurate since, for a reason I do not really understand, sometimes the FT of a sample is "FAIL" while the site-level FILTER field is "PASS")
Do these selection criteria make sense?

Sharing your expertise would be greatly appreciated.

Thanks !



  • Sheila (Broad Institute; Member, Broadie)


    First things first. What exactly is your end goal? What did you do with the tumor samples? You should be using MuTect2 for matched tumor/normal sample analysis. Have a look at the Best Practices for information on how to call somatic variants.


  • Hi,

    My end goal is to calculate, from the joint refined vcf file, the number of samples that contain the same GERMLINE variant.

    For somatic variant discovery, I used the normal and tumor samples from the same individual.
    For germline variant discovery (approach described above), I used only the normal sample.

    My questions are related to GERMLINE variant discovery only:
    QUESTION #1: Is it right to use only the normal sample (with HaplotypeCaller) for germline variant discovery, or should I also include the tumor sample?

    QUESTION #2: Let's consider a population of 3 individuals. My final vcf file, obtained after joint genotyping, filtering and genotype refinement, looks as follows (full header omitted here):

    1 14741 . C A 73.15 PASS BlaBlaBla GT:AD:DP:FT:GQ:PL:PP 0/1:7,5:12:PASS:99:101,0,154:101,0,154 ./.:0,0:0:PASS 0/0:2,0:2:lowGQ:6:0,6,60:0,6,60

    To consider a sample as having the germline variant described in the first 5 fields, the following conditions must be met:
    . GT tag of a sample is not "./."
    . Alt allele count in AD of a sample is >0
    . FT tag of a sample is "PASS"

    In the example above, the number of samples containing the variant is thus 2 (1st and 3rd samples; i.e. 66.66% of the population).

    Is it correct?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @user31888, sorry for the late response. For your first question, you're doing the right thing: you should indeed use only the normal samples for producing germline calls.

    For your second question, you made a mistake: the 3rd sample in your example does not have the variant, since its genotype is hom-ref (0/0). More than just the ADs goes into the genotype determination, but even if ADs were all we used, the 3rd sample has 2 REF reads and zero ALT reads, so there is no basis for considering it variant. So the key criterion here is that the GT should include the ALT allele (or one of them if there are several).
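
As a rough illustration of that criterion, here is a minimal Python sketch that counts the samples whose GT includes a non-reference allele, using the example record quoted earlier in this thread. It is not part of GATK. The function names (`sample_carries_alt`, `carrier_fraction`) are hypothetical, the record is split on whitespace because it was pasted with spaces (a real VCF is tab-delimited), and keeping the sample-level FT check and using all samples as the denominator are assumptions carried over from the original question.

```python
def sample_carries_alt(format_keys, sample_field, require_ft_pass=True):
    """Return True if the sample's called genotype includes an ALT allele."""
    # Trailing FORMAT values may be dropped (e.g. "./.:0,0:0:PASS"); zip tolerates that.
    values = dict(zip(format_keys, sample_field.split(":")))
    gt = values.get("GT", "./.")
    # Treat phased ("|") and unphased ("/") genotypes the same way.
    alleles = gt.replace("|", "/").split("/")
    # "." is a no-call and "0" is the reference allele; anything else is an ALT index.
    has_alt = any(allele not in (".", "0") for allele in alleles)
    if not has_alt:
        return False
    if require_ft_pass:
        # Assumption: a missing FT value is treated as passing.
        return values.get("FT", "PASS") in ("PASS", ".")
    return True


def carrier_fraction(vcf_line, require_ft_pass=True):
    """Return (number of carriers, number of samples) for one VCF record."""
    fields = vcf_line.split()
    format_keys = fields[8].split(":")   # 9th column is FORMAT
    samples = fields[9:]                 # sample columns follow FORMAT
    carriers = sum(
        sample_carries_alt(format_keys, s, require_ft_pass) for s in samples
    )
    return carriers, len(samples)


# The example record from the question above.
line = (
    "1 14741 . C A 73.15 PASS BlaBlaBla GT:AD:DP:FT:GQ:PL:PP "
    "0/1:7,5:12:PASS:99:101,0,154:101,0,154 ./.:0,0:0:PASS "
    "0/0:2,0:2:lowGQ:6:0,6,60:0,6,60"
)
carriers, total = carrier_fraction(line)
print(f"{carriers} of {total} samples carry the ALT allele ({carriers / total:.1%})")
```

On this record only the first sample (0/1) is counted: the second is a no-call and the third is hom-ref. Whether no-call samples should stay in the denominator depends on the question being asked, so that choice is left to the reader.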

  • Sorry, you are totally right about the example I gave above (only the 1st sample is counted).

    Thanks Geraldine !
