Is it necessary to process 1000 Genomes data for exome variant calling training?
I have an independently sequenced human exome with 100x coverage. I would like to call variants using the GATK Best Practices guidelines, and have been following the guide to do so. However, I am confused about using 1000 Genomes data to create training files to improve the accuracy of my variant calling.
I remember that before the GVCF best practices were written, the previous guide suggested processing ~35 exomes from the 1000 Genomes Project to be used as a training data set. Therefore, as an experiment, I am using 50 exomes from the 1000 Genomes Project and have created GVCF files, which I then combined and genotyped into a single "total.vcf" file. Now I will run VQSR using this "total.vcf" as input, together with the training resources listed in the documentation. I believe this will leverage both the 50-exome combination and the resource training sets, and that I will get a highly filtered set of SNPs from my sequenced exome as output. I will then run SelectVariants with my one experimental exome's sample name to extract just those high-quality SNPs that pertain to my experimental exome.
(EDIT: I am referring to the documentation I found here: "Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.")
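To make the workflow above concrete, here is a rough sketch of how I am assembling the commands. It assumes GATK 3-style invocations (`-T HaplotypeCaller`, `-T GenotypeGVCFs`, `-T SelectVariants`); the jar path, reference name, BAM/GVCF file names, and sample name are all hypothetical placeholders, and the script only builds and prints the command lines rather than running them. The VQSR step is left as a comment because its resource arguments depend on the files listed in the documentation.

```python
import shlex
from typing import List

# Assumption: GATK 3.x jar in the working directory (hypothetical path).
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def haplotype_caller_gvcf(ref: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-sample call in reference-confidence (GVCF) mode."""
    return GATK + ["-T", "HaplotypeCaller", "-R", ref, "-I", bam,
                   "--emitRefConfidence", "GVCF", "-o", out_gvcf]


def genotype_gvcfs(ref: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Joint genotyping over all per-sample GVCFs."""
    cmd = GATK + ["-T", "GenotypeGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-o", out_vcf]


def select_sample(ref: str, vcf: str, sample: str, out_vcf: str) -> List[str]:
    """Extract a single sample's records from the joint call set."""
    return GATK + ["-T", "SelectVariants", "-R", ref, "-V", vcf,
                   "-sn", sample, "-o", out_vcf]


def build_pipeline(ref: str, bams: List[str], my_sample: str) -> List[List[str]]:
    """Build the command list: one GVCF per BAM, joint genotyping, sample extraction."""
    gvcfs = [bam.rsplit(".", 1)[0] + ".g.vcf" for bam in bams]
    cmds = [haplotype_caller_gvcf(ref, b, g) for b, g in zip(bams, gvcfs)]
    cmds.append(genotype_gvcfs(ref, gvcfs, "total.vcf"))
    # VQSR (VariantRecalibrator, then ApplyRecalibration) would run on
    # total.vcf here; omitted because the resource arguments vary.
    # In a real run, the recalibrated VCF would replace total.vcf below.
    cmds.append(select_sample(ref, "total.vcf", my_sample, "my_exome.vcf"))
    return cmds


if __name__ == "__main__":
    # Hypothetical inputs: 50 public exome BAMs plus my own exome.
    bams = ["kg_%02d.bam" % i for i in range(1, 51)] + ["my_exome.bam"]
    for cmd in build_pipeline("reference.fasta", bams, "MY_SAMPLE"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Note that in this sketch my own exome's GVCF is genotyped jointly with the 1000 Genomes GVCFs, which is how I read the quoted documentation; SelectVariants can only extract my sample if it is part of the joint call set.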
My questions are as follows:
1) Am I correct in my understanding that calling variants in numerous exomes from the 1000 Genomes Project to create a training data set is good practice, with the goal of achieving the best possible variant calling results for my single exome of interest?
2) If so, will my training set produce better results the larger it is (meaning that using all ~3,500 exomes from the 1000 Genomes Project would create the best possible training set)?
3) If more is better, is there a resource somewhere with all ~3,500 exomes already processed into GVCFs, or should I do that myself?
Thank you for your help as I learn more about exome sequencing!