Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our forum members to be more involved, jump in and help out your fellow researchers with their questions. The GATK forum is a community forum, and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
Input for VQSR from MergeVcfs

Dear GATK staff,
I have 28 VCF files from exome data of 31 humans. After running GenotypeGVCFs over the targeted gene intervals, each VCF file contains very few variants (fewer than 200, or even fewer than 60), which seems likely to produce a less reliable Gaussian model in VQSR. Should I use MergeVcfs to combine these VCF files into a single file before passing it to VQSR?
Best Answer
bhanuGandham Cambridge MA admin
Because you have so few variants, and VQSR is meant to be used on larger cohorts, you should probably look into hard filtering instead. Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=23216
Answers
Hi @wlai
As mentioned in the tool docs for VQSR:
I hope I didn't misunderstand what you said. Do you mean I should run HaplotypeCaller again?
So far I have been following the Best Practices. I call variants on each sample individually up to HaplotypeCaller, then produce a flat GVCF file for each sample using SelectVariants, then use CatVariants to combine them into one GVCF file, followed by the joint-calling step (GenotypeGVCFs). Let me know if I have misunderstood anything. Thank you for the reply.
@wlai
I am not sure what you mean by this.
1) To use GenotypeGVCFs, the input samples must possess genotype likelihoods produced by HaplotypeCaller with the -ERC GVCF or -ERC BP_RESOLUTION option.
2) Then use CombineGVCFs or GenomicsDBImport to combine the single-sample GVCFs produced by HaplotypeCaller.
3) Follow this with the joint-calling step (GenotypeGVCFs).
@bhanuGandham
To clarify again, my workflow follows this tutorial:
https://gatkforums.broadinstitute.org/gatk/discussion/10061/using-genomicsdbimport-to-consolidate-gvcfs-for-input-to-genotypegvcfs-in-gatk4
After running GenomicsDBImport on multiple gene intervals of interest, I have multiple databases, each labelled with a different gene name. I then realised that each database contains few variants, and that GenotypeGVCFs can only take one database at a time:
gatk-launch GenotypeGVCFs \
-R data/ref/ref.fasta \
-V gendb://my_database \
-G StandardAnnotation -newQual \
-O test_output.vcf
I thought this could be solved by combining all the variants that had been called, so I used SelectVariants to extract the variants from the different databases and MergeVcfs to combine them into one GVCF file before running GenotypeGVCFs. I hope this clarifies things.
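One way to picture the per-gene situation described above is to joint-genotype each database separately and then merge the resulting VCFs. This is only a sketch under stated assumptions: the workspace names (geneA_db and so on) and output file names are hypothetical, every workspace is assumed to contain the same set of samples (MergeVcfs expects that), and the release-style `gatk` wrapper is assumed instead of the `gatk-launch` name used in the tutorial. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: joint-genotype each per-gene GenomicsDB workspace, then merge the
# per-gene VCFs. Workspace names and output names are hypothetical.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

REF=data/ref/ref.fasta
GENE_DBS="geneA_db geneB_db geneC_db"   # hypothetical: one workspace per gene

genotype_and_merge() {
  set --   # reset positional parameters; they collect the -I arguments
  for db in $GENE_DBS; do
    # GenotypeGVCFs takes a single workspace per invocation.
    run gatk GenotypeGVCFs -R "$REF" -V "gendb://$db" -O "$db.vcf.gz"
    set -- "$@" -I "$db.vcf.gz"
  done
  # Assumes every per-gene VCF carries the same samples.
  run gatk MergeVcfs "$@" -O cohort.vcf.gz
}

genotype_and_merge
```

Run as-is, this prints three GenotypeGVCFs commands followed by one MergeVcfs command, which makes it easy to review the command lines before executing anything.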
@wlai
Because you have so few variants, and VQSR is meant to be used on larger cohorts, you should probably look into hard filtering instead. Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=23216
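For readers unfamiliar with hard filtering, a minimal sketch of what it can look like with GATK4 follows. The input and output file names are hypothetical, and the thresholds are the commonly cited generic starting points, not values tuned for this dataset; the linked doc remains the reference. Commands are only printed unless DRY_RUN=0 is set, and the printed form drops the quotes around the filter expressions.

```shell
#!/bin/sh
# Sketch: split SNPs and indels, hard-filter each, then merge them again.
# File names are hypothetical; thresholds are generic starting points.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

IN=cohort.vcf.gz   # hypothetical joint-called VCF

hard_filter() {
  run gatk SelectVariants -V "$IN" -select-type SNP -O snps.vcf.gz
  run gatk VariantFiltration -V snps.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "MQ < 40.0" --filter-name "MQ40" \
    -filter "SOR > 3.0" --filter-name "SOR3" \
    -O snps_filtered.vcf.gz

  run gatk SelectVariants -V "$IN" -select-type INDEL -O indels.vcf.gz
  run gatk VariantFiltration -V indels.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "FS > 200.0" --filter-name "FS200" \
    -O indels_filtered.vcf.gz

  # Both inputs come from the same cohort, so the sample sets match.
  run gatk MergeVcfs -I snps_filtered.vcf.gz -I indels_filtered.vcf.gz \
    -O cohort_filtered.vcf.gz
}

hard_filter
```

VariantFiltration marks failing records in the FILTER column and does not delete them, so the filtered file keeps every site for later inspection.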
@bhanuGandham
Is there a minimum cut-off for the number of variants? Can you recommend one?
@wlai
You should have at least 30 WES samples or 1 WGS sample for VQSR. But since, as you mentioned, there are very few variants even with 31 exome samples, maybe you should go for hard filtering or take a look at the CNN tools for variant filtering. Take a look at these docs:
https://software.broadinstitute.org/gatk/blog?id=23457
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_vqsr_CNNScoreVariants.php
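As a rough illustration of the CNN route, the two-step pattern from the tool docs is to score variants with CNNScoreVariants and then filter on that score with FilterVariantTranches. Everything specific here is an assumption: the file names, the resource VCFs, and the tranche values are placeholders, and CNNScoreVariants additionally needs the GATK Python environment (for example the GATK Docker image). Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: score variants with the default 1D CNN model, then filter by
# tranche. File names, resources and tranche values are placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

REF=ref.fasta      # hypothetical reference
IN=cohort.vcf.gz   # hypothetical joint-called VCF

score_and_filter() {
  # Adds a CNN_1D annotation to each variant.
  run gatk CNNScoreVariants -R "$REF" -V "$IN" -O cohort.cnn.vcf.gz
  # Uses known-sites resources to turn the score into tranche filters.
  run gatk FilterVariantTranches -V cohort.cnn.vcf.gz \
    --resource hapmap.vcf.gz --resource mills.vcf.gz \
    --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 \
    -O cohort.cnn_filtered.vcf.gz
}

score_and_filter
```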
@bhanuGandham
Thanks for the CNN recommendations. I thought the overall total of 50k variants would be enough for VQSR.
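Since the discussion hinges on how many variants each file actually holds, a quick count per VCF can help before choosing between VQSR and hard filtering. The sketch below simply counts non-header records; the helper name count_variants is made up for illustration, and multi-allelic sites are counted as one record each.

```shell
#!/bin/sh
# Sketch: count variant records (non-header lines) in plain or gzipped VCFs.
set -eu

count_variants() {
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  case "$1" in
    *.gz) gzip -dc "$1" | grep -c '^[^#]' || true ;;
    *)    grep -c '^[^#]' "$1" || true ;;
  esac
}

# Usage: sh count_variants.sh geneA.vcf geneB.vcf.gz ...
for f in "$@"; do
  printf '%s\t%s\n' "$f" "$(count_variants "$f")"
done
```

Summing the per-file counts gives the total that a merged VCF would contain, which is the number that matters when judging whether VQSR has enough data to train on.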