Are 70k variants for WES trios enough?

Hi there,

I am using GATK for trio analysis of WES data, where I have a proband and both parents. I would like to run a joint analysis of all 50 families that I have, but I cannot get more than about 70 thousand variants.
How can I fix this?
My target is the clinical exome kit from Agilent (about 50 million bp) and the coverage of my samples is very high (100-200X).
Can you help me locate where the mistake in my analysis might be?

Thanks,
Francesco

Answers

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    Hi @Vergilius,

    I think 70K variants is reasonable for 50 Mbp of genomic territory. The rule of thumb is 1 SNP per 1 kbp and 1 indel per 8 kbp (a quick back-of-the-envelope check follows the references below). If you want more exact numbers, I recommend you peruse the 1000 Genomes Project publications:

    • doi:10.1038/nature09534
    • doi:10.1038/nature11632
    • doi:10.1038/nature15393
    • doi:10.1038/nature15394
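
    As a rough illustration of that rule of thumb applied to a ~50 Mbp capture target (the numbers here are only a sketch, assuming uniform variant rates across the region):

        # Rough expectation from 1 SNP per 1 kbp and 1 indel per 8 kbp over ~50 Mbp.
        TARGET_BP=50000000
        echo "Expected SNPs:   $(( TARGET_BP / 1000 ))"   # ~50,000
        echo "Expected indels: $(( TARGET_BP / 8000 ))"   # ~6,250

    A joint callset over many samples will typically sit above this per-sample expectation, since it is the union of variant sites across the cohort.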
  • Vergilius · Italy · Member

    Thanks Shlee for the quick response.
    My problem is that I am trying to run joint genotyping on 75 exomes, and the VariantRecalibrator step stops with this message:

        ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example)

    If I look at the numbers after GenotypeGVCFs, I have 82k SNPs but only 109 indels.
    Do you think there could be a mistake in the analysis? If this depends on the low number of indels, how can I get more?
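
    For a quick sanity check on those counts, here is a minimal tally straight from the joint-genotyped VCF (a rough sketch: the file name joint_calls.vcf is a placeholder, the VCF is assumed uncompressed, and multiallelic records land in the "other" bucket):

        # Crude SNP vs. indel/other tally (column 4 = REF, column 5 = ALT).
        # Sites where REF and ALT are both single bases count as SNPs; everything
        # else (indels, MNPs, multiallelic records) is counted as "other".
        grep -v '^#' joint_calls.vcf \
          | awk '{ if (length($4) == 1 && length($5) == 1) snp++; else other++ }
                 END { printf "SNP-like sites:    %d\nindel/other sites: %d\n", snp, other }'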

    Issue · GitHub (filed by shlee): issue #2286, state closed, closed by chandrans
  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭
    edited July 2017

    @Vergilius,

    Please see this thread and this thread for explanations of why the tool errors and possible solutions.

    I would add two things. And here I am being tentative, as I am filling in for the team on the forum while they are away at a workshop. First, perhaps you can make sure your callset includes low-QUAL variant sites, e.g. lower the default -stand_call_conf 10 to 0 to ensure inclusion of even more bad variants (a command sketch follows below). Second, I'm hearing that folks should now be able to leverage public data, e.g. ExAC or gnomAD. That is, perhaps you can merge your calls with a publicly available callset, e.g. ExAC, to reach the threshold of calls that enables VQSR.
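
    To make the first suggestion concrete, a minimal GATK 3.x sketch of the joint-genotyping call with the threshold lowered (file names are placeholders; only the -stand_call_conf value differs from a default run):

        # Hypothetical GenotypeGVCFs invocation keeping low-confidence sites (QUAL threshold 0)
        # so that more "bad" variants are available for VQSR's negative training model.
        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            -V cohort.combined.g.vcf \
            -stand_call_conf 0 \
            -o joint_calls.vcf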

    I'll tag @Sheila and @Geraldine_VdAuwera for followup.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @Vergilius The number of indels does seem low to me. But to comment any further on whether there is a mistake in your analysis, we would need to know what you actually did, i.e. how you aligned and processed the data, how you did the variant calling, and so on. It would also be helpful to know more about what type of data you are working with.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Also, please tell us what version of the tools you used.

  • Vergilius · Italy · Member

    I restarted the whole analysis from scratch, using the default parameters and the pipeline from the Best Practices.
    Now I have 174,533 SNPs and 1,761 indels. I used the default stand_call_conf for GenotypeGVCFs.
    I also tried VQSR and the genotype refinement workflow, and the final VCF at the end has the same number of SNPs and indels. Is this correct?
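
    A note that may help interpret this: ApplyRecalibration writes the VQSR result into the FILTER column rather than removing records, so the total record count is expected to stay the same; it is the PASS subset that changes. A quick way to check (a sketch, assuming an uncompressed VCF; refined.vcf is a placeholder name):

        # Count PASS vs. filtered records; VQSR marks failing sites in FILTER (column 7)
        # instead of deleting them, so totals match the pre-VQSR callset.
        grep -v '^#' refined.vcf \
          | awk '{ if ($7 == "PASS" || $7 == ".") pass++; else filtered++ }
                 END { printf "PASS/unfiltered: %d\nfiltered: %d\n", pass, filtered }'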

    Finally, I was wondering how you analyze this big VCF containing results for many samples. I guess one wants to analyze one sample at a time. What is the fastest way to fetch the relevant data for a single sample from it?
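
    One common option for that last question (a sketch only, not necessarily the fastest route; the sample ID and file names are hypothetical) is GATK SelectVariants with -sn, dropping sites that are non-variant in the chosen sample:

        # Hypothetical GATK 3.6 command: pull a single sample out of the multi-sample VCF.
        # --excludeNonVariants drops sites where this sample carries no variant call.
        java -jar GenomeAnalysisTK.jar \
            -T SelectVariants \
            -R reference.fasta \
            -V joint_calls.refined.vcf \
            -sn PROBAND_FAM01 \
            --excludeNonVariants \
            -o PROBAND_FAM01.vcf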

  • Vergilius · Italy · Member

    By the way, I am working with GATK 3.6.
