VQSR or hard filters

A.Iris (Uppsala, Member)

Hi GATK team!
I was reading through the VQSR documentation. At one point it states that "Whole exome call sets work well, but anything smaller than that scale might run into difficulties." We used VQSR successfully for a 32 Mb array, and now I have a 20 Mb array.
I was wondering whether this array is too small for the model. Should I use hard filters instead?

Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @A.Iris
    Hi,

    For VQSR, we recommend using either 30 whole-exome samples or 1 whole genome. How many samples do you have in your dataset?

    -Sheila

  • A.Iris (Uppsala, Member)

    Hi,
    Thanks for your response.
    I have 315 samples in total (cases + controls).

  • Sheila (Broad Institute; Member, Broadie admin)

    @A.Iris
    Hi,

    You should be able to use VQSR with 315 samples.

    Good luck!

    -Sheila

  • pdexheimer (Member ✭✭✭✭)

    @A.Iris -

    It's all about getting overlap with your training sites. I would think that 20 Mb would be large enough, though it's in the range where I would start getting concerned. But the more important metric is the total number of variants and how well they overlap with the training sites. It would certainly be possible to come up with 20 Mb of capture that misses the training sites completely (iirc, they're concentrated in coding regions). Similarly, you could have whole exomes but not enough variation in your data to get sufficient overlap; that's the basis of the '30 individuals' guideline Sheila mentioned.

    In the end, all you can do is try. My intuition is that 20 Mb x 315 people should be fine, but it's definitely worth checking the output plots and running VariantEval over the result to make sure everything looks good.

  • A.Iris (Uppsala, Member)

    Thank you both for your responses!
    I will try running it and see what happens. I am not sure I will be able to interpret the output plots, but I will give it a go.
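
For anyone who wants a rough feel for the training-site overlap discussed above before running VQSR, here is a minimal Python sketch. It is not part of GATK and does not reproduce what VQSR does internally. The function names (`read_sites`, `training_overlap`) and the tiny example records are hypothetical, and matching on chromosome and position only (ignoring alleles) is a simplifying assumption. The thread gives no numeric threshold for "enough" overlap, so the sketch only reports counts.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def read_sites(vcf_lines: Iterable[str]) -> Set[Site]:
    """Collect (chromosome, position) pairs from VCF-formatted text lines."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines (they start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.rstrip("\n").split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def training_overlap(callset: Set[Site], training: Set[Site]) -> Tuple[int, float]:
    """Return the number of call-set sites found in the training set,
    and that number as a fraction of the call set."""
    shared = callset & training
    fraction = len(shared) / len(callset) if callset else 0.0
    return len(shared), fraction


# Made-up records for illustration only; real input would be the lines
# of your call-set VCF and of a training resource VCF.
callset_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t75\t.\tG\tA",
    "chr2\t900\t.\tT\tC",
]
training_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr2\t75\t.\tG\tA",
    "chr3\t10\t.\tC\tG",
]

n_shared, frac = training_overlap(read_sites(callset_lines), read_sites(training_lines))
print(f"{n_shared} call-set sites overlap the training set ({frac:.0%} of calls)")
```

A very small overlap count would be a hint that the capture misses most training sites, which is the failure case described in the thread. The VQSR output plots and VariantEval remain the actual checks suggested above.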
