Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Input for VQSR from Mergevcfs

Dear GATK staff,

i have 28 vcf files from 31 humans exome data, the output after GenotypeGVCF according to the targeted gene intervals shows very little variant in each vcf files , less than 200 or 60 which seems might create a less reliable Gaussian model in VQSR. Should i use Mergevcfs to combine the 31 vcf files into a single file before piped them into VQSR?

Best Answer

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @wlai

    As mentioned in the tool docs for VQSR:

    The input variants to be recalibrated. These variant calls must be annotated with the annotations that will be used for modeling. If the calls come from multiple samples, they must have been obtained by joint calling the samples, either directly (running HaplotypeCaller on all samples together) or via the GVCF workflow (HaplotypeCaller with -ERC GVCF per-sample then GenotypeGVCFs on the resulting gVCFs) which is more scalable.

  • wlaiwlai Member
    edited April 4
    @bhanuGandham
    hope i didnt misunderstand what you have said, do u mean i should do HaplotypeCaller again ?
    So far i m processing according to best practices, my steps is calling on single sample till HaplotypeCaller, then produced a flat gvcf file using SelectVariant for each sample,
    CatVariant to combined them into one gvcf file, followed by jointcalling steps(GenotypeGVCFs). Let me know if i had misunderstood anything, thank you for the reply.
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @wlai

    produced a flat gvcf file using SelectVariant for each sample, CatVariant to combined them into one gvcf file,

    I am not sure what you mean by this.

    1) To use GenotypeGVCFs, the input samples must possess genotype likelihoods produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION options. 2) Then use CombineGVCFs or GenomicsDBImport to combine the single sample gvcfs produced by Haplotypecaller 3) followed by jointcalling steps(GenotypeGVCFs)

  • wlaiwlai Member
    hi @bhanuGandham , thanks for the reply, i have different attempts, and there is one of mine followed the exact steps listed above, i realised the output of genomicsdbimport produced very few variants in each file which coressponding to the gene interval that i wanted, which makes them impossible to be apply into VQSR. Wondering can i use select variant to combine variants of different gene intervals into a single field and piped into VQSR? Or can i combine the variants to make up to 50 000 variants before jointcalling steps
  • wlaiwlai Member
    edited April 5

    @bhanuGandham

    To clarify again, my workflow is referring to this tutorial.
    https://gatkforums.broadinstitute.org/gatk/discussion/10061/using-genomicsdbimport-to-consolidate-gvcfs-for-input-to-genotypegvcfs-in-gatk4

    After running GenomicsDBimport according to multiple gene intervals of interest, i have multiple database labelled with different gene names. Then, i realised there is not many variants present in each database while GenotypeGVCFs can only take one database in each time.

    gatk-launch GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf

    I thought this can be solved by combining all the variants that has been called, so i used Selectvariant to extract all the variants from different database and MergeVCFs to combine all variants into one gvcf file, before GenotypeGVCFS. I hope this clarify.

  • wlaiwlai Member

    @bhanuGandham
    Is there a minimum cut-off for number of variants? Can you recommend?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @wlai

    You should have atleast 30 WES or 1WGS for VQSR. But as you mentioned there are very few variants even with 31 exome samples, then maybe you should go for hard filtering or take a look at CNN tolls for variant filtering. take a look at these docs:
    https://software.broadinstitute.org/gatk/blog?id=23457
    https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_vqsr_CNNScoreVariants.php

  • wlaiwlai Member

    @bhanuGandham
    thanks for the cnn recommendations, i thought the overall of 50k variants are enough for VQSR

Sign In or Register to comment.