Holiday Notice:
The Frontline Support team will be offline February 18 for President's Day but will be back February 19th. Thank you for your patience as we get to all of your questions!

Is it necessary to process 1000 genome data for exome variant calling training?

jtshrevejtshreve Member
edited April 2017 in Ask the GATK team

I have an independently sequenced human exomes with 100x coverage. I would like to call variants using the GATK best practices guidelines, and have been following the guide to do so. However, I am confused about using 1000 genome data to create training files to improve the accuracy of my variant calling.

I remember before the GVCF best practices were written, the previous guide suggested processing ~35 exomes from the 1000 genome project to be used as a training data set. Therefore, as an experiment, I am using my 50 exomes (from the 1000 genome project) and have created GVCF files which I then combined and genotyped into a single "total.vcf" file. Now, I will run VQSR using this "total.vcf" as input and the training resources listed in the documentation. I believe this will leverage both the 50 exome combination and the resources training sets and I will get a highly filtered set of SNPs from my sequenced exome as output. I will then run SelectVariants with my 1 experimental exome's sample name to extract just those high quality SNPs that pertain to my experimental exome.

(EDIT: I am referring the the documentation I found here: "Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.")

My questions are as follows:

1) Am I correct in my understanding that calling variants in numerous exomes from the 1000 genome project to create a training data set is good practice with the goal of achieving the best possible variant calling results for my single exome of interest?

2) If so, will my training set produce better results the larger it is (meaning using all ~3,500 exomes from the 1000 genome project will create the best possible training set)?

3) If more is better, is there are resource somewhere of all ~3,500 exomes already processed into GVCFs, or should I do that myself?

Thank you for your help as I learn more about exome sequencing!


Best Answer


  • jtshrevejtshreve Member

    Does anyone have any thoughts?

  • jtshrevejtshreve Member
    Understood, thank you for the helpful response!

    As a follow up question, what is the best way for me to evaluate the accuracy of my exome SNP calling pipeline, which uses GATK? I have downloaded the SNP set from genome in a bottle (NA12878) and the primary sequencing data, but I'm getting poor results. For example, if I send the GIAB reads through my pipeline without VQSR, I find 90% true positive SNPs, but with lots of false positives. After VQSR, I only find 60% true positives, but less false positives. I understand the trade offs of sensitivity, but am more interested in how the GATK team suggests testing pipelines for accuracy rather than getting assistance troubleshooting my methods.

    Can you please point me in the direction of any resources that are recommended for determining variant calling integrity / accuracy? My pipeline simply follows the GATK best practices from start to finish for exome sequencing, and am therefore assuming my evaluation of the resulting output is insufficient, not the pipeline itself.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, those numbers seem much lower than I would expect if you're running everything on the GIAB NA12878.

    We have some tutorials (see the hands-on materials from the last workshop, linked on the Presentations page) that describe basic evaluation methods, tools and pitfalls (eg it's important to distinguish site-level call concordance and sample-level genotype concordance) but it's not an out of the box solution for pipeline evaluation.

    What our methods development team uses is mostly composed of in-house resources for evaluation that aren't published, including a fair amount of custom scripts and datasets that I can't share with you at this time (though that might change in the future).

    For a solid public resource, check out the bcbio project. I believe they publish the details of how they do their evaluations. I think if you start here you'll find what you need.

  • jtshrevejtshreve Member

    Understood, thank you for the information. As a follow up, would you mind if I ask what sort of numbers Broad gets when applying GATK on the GIAB NA12878? Ballpark true positives and false positives would be quite helpful.

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    Sorry for the delay. I am checking with some of the developers. One of us will get back to you asap.


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    Again, sorry for the delay. This site will give you some ballpark figures of what people expect to see for sensitivity and specificity.


Sign In or Register to comment.