If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
SelectVariants -selectType SNP
I have called variants from whole genome Illumina data for each of seven individuals separately using GATK's HaplotypeCaller v.3.5 (with default options) and have then jointly genotyped both variant and non-variant sites across all individuals using GATK's GenotypeGVCFs v.3.5 (with the -allSites flag). Next, I wanted to select only the SNPs from this data set (using -T SelectVariants -selectType SNP) but I am a little puzzled by the output.
For example, listed in the output file are sites that state both a reference (C) and alternative allele (G) but the alternative allele count (AC) is 0, as is the alternative allele frequency (AF), and all individuals have been genotyped as homozygous for the reference allele (0/0):
chr1 1220 . C G . PASS AC=0;AF=0.00;AN=12;DP=48;ExcessHet=3.01;MQ=42.51 GT:AD:DP:RGQ 0/0:31,6:11:27 0/0:18,1:4:12 0/0:16,2:3:12 ./.:0,0:0:0 0/0:8,2:3:9 0/0:63,10:21:36 0/0:14,0:6:18
Is this variant listed because there were reads observed that supported the alternative allele (e.g., 6 reads for the first individual)? There also seems to be a large discrepancy between the number of reads observed (e.g., 31,6 reads for the first individual) and the approximated read depth (DP=11) - does this mean that the majority of the reads were excluded because they either had a MQ=255 or bad mates?
I have also seen variant sites that look like this
chr1 2051 . T C 544.99 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=-1.880e+00;ClippingRankSum=0.627;DP=78;ExcessHet=3.0103;FS=1.971;MLEAC=1;MLEAF=0.071;MQ=63.00;MQRankSum=-2.580e-01;QD=24.77;ReadPosRankSum=-9.950e-01;SOR=0.654 GT:AD:DP:GQ:PGT:PID:PL 0/1:7,17:22:99:0|1:2046_G_A:577,0,204 0/0:4,1:5:0:.:.:0,0,81 0/0:10,0:7:18:.:.:0,18,270 0/0:13,0:11:24:.:.:0,24,360 0/0:8,0:7:21:.:.:0,21,191 0/0:22,0:20:45:.:.:0,45,675 0/0:4,0:4:12:.:.:0,12,106
chr1 1915 . A G 193.78 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=-3.241e+00;ClippingRankSum=-7.660e01;DP=89;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.071;MQ=64.74;MQRankSum=-7.660e-01;QD=12.92;ReadPosRankSum=-2.950e-01;SOR=1.230 GT:AD:DP:GQ:PGT:PID:PL 0/1:12,14:15:99:0|1:1885_A_ACT:225,0,360 0/0:6,2:7:0:.:.:0,0,88 0/0:2,0:1:3:.:.:0,3,27 0/0:12,0:9:24:.:.:0,24,360 0/0:8,0:8:21:.:.:0,21,315 0/0:47,0:37:99:.:.:0,100,971 0/0:3,0:3:9:.:.:0,9,82
What exactly does the "G_A" or "A_ACT" in the phasing ID (PID) mean?
Thank you very much for your help in advance!