What data set is better for VQSR?

Lisa0508 (Ann Arbor, MI; Member)
edited October 2015 in Ask the GATK team

Hi all,
I found that hard filtering couldn't remove most of the false-positive variants in my exome VCF files. I would like to try VQSR by adding more VCF files to my own data to reach the lower limit of at least 30 samples. I have read some posts about VQSR, but before I start, four questions still confuse me. It was mentioned that BAM files, rather than VCF files, from the 1000 Genomes Project should be added and called with HaplotypeCaller (HC) to generate the VCF files.

  1. But how can I know which reference version the 1000 Genomes Project used to generate those BAM files? If it is not consistent with hg19 or b37, the downstream IndelRealigner and HC steps will report an error and quit. So I think I would prefer to use data from other individuals. I actually have joint VCF files from other families (three individuals per family). The capture region was the same for all samples. One thing that worries me is that they are not all of the same ethnicity: some are from the Middle East, some from Europe.

  2. So how will that affect the accuracy of VQSR?

  3. After I get the recalibrated joint VCF file, can I use SelectVariants to extract a VCF file for each individual? Because each individual suffers from a different disease, I have to analyze each VCF file separately.

  4. We want to find mutations within 20 bp upstream and downstream of the exon regions. So I didn't use the -L option when running HC, because I was afraid of missing some regions. All my VCF files cover all genomic regions. Is it all right to join all of these VCF files together for VQSR?

Thank you very much,


Best Answers


  • Lisa0508 (Ann Arbor, MI; Member)

    Your answer helped a lot! Thank you very much. I have finished VQSR on my joint VCF built from 37 gVCF samples. I used -genotypeMergeOptions REQUIRE_UNIQUE so that each individual was kept as a separate sample rather than merged. (1) I am not sure whether I used the right argument. Then I recalibrated the joint VCF file with the --maxGaussians 4 argument, but I removed the InbreedingCoeff annotation because my VCF file did not contain it. (2) Will this reduce the accuracy of VQSR? The tranches plots looked rather odd; please see the attachment (recalibrate SNP). I guess this is due to noise from off-target regions, as you mentioned. I regret not using the -L and -ip options when running HC, but there seemed to be no -L or -ip option in CombineGVCFs and GenotypeGVCFs, and running HC again would take a lot of time. (3) Is there any way to restrict the calls to the exon intervals without running HC again? My last question concerns three other samples from the SOLiD platform. Their sequencing quality and read depth were much lower than those of my current 37 Illumina samples (although the capture regions were the same). (4) Can these VCF files be joined with my current 37 Illumina VCF files to run VQSR? I guess most of the variants in the SOLiD VCF files would then be filtered out?
    Thank you again,

  • Sheila (Broad Institute; Member, Broadie)

    Hi Lisa,

    You need to run GenotypeGVCFs on your GVCFs to get the final multi-sample VCF. Can you confirm that you used GenotypeGVCFs to get the final multi-sample VCF? It sounds like you used CombineVariants. You can use the -L option with GenotypeGVCFs. -L is an argument that is available to all GATK tools.
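
As a rough illustration of the suggestion above, the sketch below assembles a GenotypeGVCFs command line that restricts output with -L and pads the intervals with -ip. It assumes GATK 3.x-style arguments (-T, -R, --variant, -o); the jar name, file names, and the helper function itself are hypothetical placeholders, so check the arguments against the documentation for the GATK version in use before running anything.

```python
from typing import List, Optional


def build_genotype_gvcfs_cmd(
    reference: str,
    gvcfs: List[str],
    output: str,
    intervals: Optional[str] = None,
    padding: int = 0,
    gatk_jar: str = "GenomeAnalysisTK.jar",  # hypothetical path to the GATK 3.x jar
) -> List[str]:
    """Assemble a GATK 3.x-style GenotypeGVCFs command as an argument list.

    Nothing is executed here; the list can be passed to subprocess.run
    once the paths and arguments have been checked.
    """
    if not gvcfs:
        raise ValueError("at least one gVCF is required")
    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        # One --variant argument per input gVCF.
        cmd += ["--variant", gvcf]
    if intervals is not None:
        # -L restricts processing to the given intervals (e.g. the capture targets).
        cmd += ["-L", intervals]
        if padding > 0:
            # -ip pads each interval on both sides by the given number of bases.
            cmd += ["-ip", str(padding)]
    cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    # All file names below are placeholders.
    example = build_genotype_gvcfs_cmd(
        reference="reference.fasta",
        gvcfs=["sample1.g.vcf", "sample2.g.vcf"],
        output="joint.vcf",
        intervals="exome_targets.interval_list",
        padding=20,
    )
    print(" ".join(example))
```

Building the command as a list rather than a single string avoids shell-quoting problems if any path contains spaces.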


  • Lisa0508 (Ann Arbor, MI; Member)

    Thank you for your answer. I used GenotypeGVCFs to jointly genotype the gVCF files belonging to the same family. Then I used CombineVariants to combine the joint VCF files from the different families. That is how I got the combined VCF file for VQSR. Yes, I had wanted to use CombineGVCFs followed by GenotypeGVCFs, but then I thought that I didn't want variants that are homozygous reference in one family but heterozygous in other families. So I did it in a 'strange' way to get the combined VCF file. I have now used the -L and -ip options in CombineVariants and got the recalibrated VCF file restricted to the exome region. After VQSR, I extracted the VCF by family and ran the trio exome analysis. Maybe my thinking is wrong? Please tell me if I am wrong!
    Thank you very much,
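
To make the effect of combining -L with -ip 20 more concrete, here is a minimal conceptual sketch of padding exon intervals by 20 bp on each side and merging any that then overlap or abut. It is an illustration only, not GATK's own implementation: the function name and the tuple layout are assumptions, coordinates are taken to be 1-based and inclusive, and padded ends are not clipped to contig lengths.

```python
from typing import List, Tuple

# (contig, start, end), assumed 1-based and inclusive
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: List[Interval], padding: int) -> List[Interval]:
    """Pad each interval on both sides, then merge overlapping or abutting ones."""
    # Pad, clamping the start at position 1; sort by contig name, then start.
    padded = sorted(
        (contig, max(1, start - padding), end + padding)
        for contig, start, end in intervals
    )
    merged: List[Interval] = []
    for contig, start, end in padded:
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            # Overlaps or touches the previous interval: extend it.
            prev_contig, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_end, end))
        else:
            merged.append((contig, start, end))
    return merged


if __name__ == "__main__":
    # Made-up exon coordinates, for illustration only.
    exons = [("1", 100, 200), ("1", 230, 300), ("2", 10, 50)]
    for interval in pad_and_merge(exons, 20):
        print(interval)
```

With 20 bp of padding, two exons that lie 30 bp apart become a single region, which is why padded target lists often contain fewer intervals than the original exon list.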
