Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Missing sites in vcf file
I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:
whereas the initial set of sites in my final vcf file is:
First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.
I also got this warning 3 times when running GenotypeGVCFs:
WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used
How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).
FYI, this is the command I used for GenotypeGVCFs:
java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22
(only running this for chr22)
Thank you in advance,