Is 70k variants enough for WES trios?

Hi there,

I am using GATK for trio analysis of WES data, where each family has a proband and both parents. I would like to run a joint analysis of all 50 families that I have, but I cannot get more than about 70 thousand variants.
How can I fix this?
My target is the Agilent clinical exome (about 50 million bp) and the coverage of my samples is very high (100-200X).
Can you help me locate the mistake in my analysis?


Best Answer


  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @Vergilius,

    I think 70K variants is reasonable for a 50 Mbp target. The rule of thumb is 1 SNP per 1 kbp and 1 indel per 8 kbp (see the back-of-the-envelope check after the references below). If you want more exact numbers, I recommend you peruse the 1000 Genomes Project publications:

    • doi:10.1038/nature09534
    • doi:10.1038/nature11632
    • doi:10.1038/nature15393
    • doi:10.1038/nature15394
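
    As a rough sanity check of that rule of thumb (the 50 Mbp figure is your target size; the per-bp rates are only approximations):

        # Back-of-the-envelope expected variant counts for a 50 Mbp target,
        # assuming ~1 SNP per 1 kbp and ~1 indel per 8 kbp (rough rates only)
        target_bp=50000000
        echo "expected SNPs:   $((target_bp / 1000))"   # ~50,000
        echo "expected indels: $((target_bp / 8000))"   # ~6,250

    By that crude estimate, ~70K variants over 50 Mbp is within the expected range.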
  • Vergilius (Italy) Member

    Thanks shlee for the quick response.
    My problem is that when I run joint genotyping on the 75 exomes, the VariantRecalibrator step stops with this message:
    ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example)

    After GenotypeGVCFs I have 82k SNPs but only 109 indels.
    Do you think there could be a mistake in the analysis? If this depends on the low number of indels, how can I get more?

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭
    edited July 2017


    Please see this thread and this thread for explanations of why the tool errors out, along with possible solutions; a sketch of the fix suggested in the error message follows below.
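
    For reference, the --minNumBadVariants argument mentioned in the error message belongs to VariantRecalibrator. A minimal sketch of an indel-mode invocation with it raised (GATK 3.x syntax; the file names are placeholders, and the resource/annotation choices are just the usual Best Practices suggestions):

        # Indel VQSR with a larger negative-model training floor.
        # ref.fasta, joint_genotyped.vcf and the Mills resource are placeholders.
        java -jar GenomeAnalysisTK.jar \
            -T VariantRecalibrator \
            -R ref.fasta \
            -input joint_genotyped.vcf \
            -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.vcf \
            -an QD -an FS -an ReadPosRankSum \
            -mode INDEL \
            --maxGaussians 4 \
            --minNumBadVariants 5000 \
            -recalFile recalibrate_INDEL.recal \
            -tranchesFile recalibrate_INDEL.tranches

    Note that with only ~109 indels the model may still fail no matter how you set the flags; the real fix is more training data.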

    I would add two things, tentatively, as I am filling in for the team on the forum while they are away at a workshop. First, make sure your callset includes low-QUAL variant sites, e.g. lower -stand_call_conf from its default of 10 to 0 so that even more bad variants are available to train on (sketched below). Second, I hear that folks should now be able to leverage public data, e.g. ExAC or gnomAD. That is, perhaps you can merge your calls with a publicly available callset such as ExAC to reach the number of calls that VQSR needs.
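
    A minimal sketch of the first suggestion (GATK 3.x syntax; file names are placeholders):

        # Re-run joint genotyping, keeping even low-confidence sites.
        # combined.g.vcf stands in for your merged gVCF input(s).
        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R ref.fasta \
            -V combined.g.vcf \
            -stand_call_conf 0 \
            -o joint_genotyped.vcf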

    I'll tag @Sheila and @Geraldine_VdAuwera for follow-up.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @Vergilius The number of indels does seem low to me. But to comment any further on whether there is a mistake in your analysis, we would need to know what you actually did, i.e. how you aligned and processed the data, how you did the variant calling, and so on. It would also help to know more about what type of data you are working with.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Also, please tell us what version of the tools you used.

  • Vergilius (Italy) Member

    I re-started the whole analysis from scratch, using the default parameters and pipeline from the Best Practices.
    Now I have 174,533 SNPs and 1,761 indels. I used the default stand_call_conf for GenotypeGVCFs.
    I also tried VQSR and the genotype refinement workflow, and the final VCF at the end has the same number of SNPs and indels. Is this correct?
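
    (If I understand VQSR correctly, ApplyRecalibration only annotates the FILTER column rather than removing records, so the totals staying the same would be expected; what changes is how many sites are marked PASS. A quick way to count only passing sites, with the file name as a placeholder:)

        # Count records whose FILTER column is PASS in the final VCF.
        grep -v '^#' final.vcf | awk -F'\t' '$7 == "PASS"' | wc -l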

    Finally, I was wondering how do you analyze this big VCF containing results for many samples. I guess one wants to analyze one sample per time. What is the fastest way to fetch the relevant data for a single sample from it?
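
    (One way to do this, sketched with placeholder file and sample names, GATK 3.x syntax: SelectVariants can subset a joint VCF to a single sample.)

        # Subset the joint VCF to one sample, dropping sites where
        # that sample carries no variant allele.
        java -jar GenomeAnalysisTK.jar \
            -T SelectVariants \
            -R ref.fasta \
            -V joint_genotyped.vcf \
            -sn SAMPLE_ID \
            --excludeNonVariants \
            -o SAMPLE_ID.vcf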

  • Vergilius (Italy) Member

    By the way, I am working with GATK 3.6
