Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
VCF to fasta format question
I'm trying to take a multi-sample vcf and convert it to an aligned fasta file for phylogenetic analysis, including all indels and invariant sites. Previous questions similar to this suggested isolating individual samples using SelectVariants, and then FastaAlternateReferenceMaker to make the fasta. I'm having trouble with this because for GATK 3.8, FastaAlternateReferenceMaker doesn't appear to accept '*' alternate alleles, and for GATK 4.0, it doesn't seem like FastaAlternateReferenceMaker exists. Also, the choice of always output the alternate allele for heterozygous sites doesn't work for my downstream analyses.
To get around these issues, what I've done so far is use GATK 3.8 to generate a vcf using GenotypeGVCF including invariant sites. Then use SelectVariants to pull out a single sample,
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -R /home/gowens/bin/ref/HanXRQr1.0-20151230.fa -T SelectVariants --variant HanXRQChr01_000000001.tmp.vcf.gz -o HanXRQChr01_000000001.subset -sn GO1_GB057 -trimAlternates
I was planning on scripting a parser to convert it to fasta (against the gatk recommendations), but I'm having trouble interpreting some sites.
HanXRQChr01 12430 . T . 16.80 . AN=2;DP=1 GT:AD:DP:RGQ 0/0:1,0:1:3
HanXRQChr01 12431 . GA . 138.70 . AN=2;BaseQRankSum=1.09;ClippingRankSum=0.00;DP=1;ExcessHet=3.0103;FS=0.000;MQ=60.00;MQRankSum=0.00;QD=15.41;ReadPosRankSum=0.020;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0
HanXRQChr01 12432 . A . 138.70 . AN=2;DP=1;ExcessHet=3.0103;FS=0.000;QD=15.41;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0
My interpretation would be that it is the true state is TGA, but the maybe its TGAA?
HanXRQChr01 17651 . TCGGCAA . 298.45 . AN=2;DP=3;ExcessHet=0.4576;FS=0.000;MQ=56.12;QD=30.57;SOR=0.941 GT:AD:DP:GQ:PL 0/0:3:3:3:0
HanXRQChr01 17652 . C . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.65;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17653 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=35.84;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17654 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.94;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
In this case, there seems to be a difference in confidence depending on which row I interpret. Looking at the first row, it calls the reference allele for a 7 basepair stretch, but subsequent rows don't have any variant call. Should I interpret this as TCGG or TNNN or NNNN?
HanXRQChr01 56181 . ATTT A,AT 921.63 . AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=0.681;ClippingRankSum=0.00;DP=11;ExcessHet=3.0103;FS=0.000;MQ=58.70;MQRankSum=0.00;QD=25.74;ReadPosRankSum=0.475;SOR=0.321 GT:AD:DP:GQ:PL 1/2:0,9,2:11:43:462,73,43,378,0,
HanXRQChr01 56182 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=31.15;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56183 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.40;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56184 . T * 724.25 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.97;SOR=0.287 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56185 . T . 12.99 . AN=0;DP=17 GT:AD:DP ./.:17,0:17
HanXRQChr01 56186 . T . 23.07 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:4
HanXRQChr01 56187 . T . 20.94 . AN=2;DP=16 GT:AD:DP:RGQ 0/0:16,0:16:1
HanXRQChr01 56188 . T . 26.81 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:14
Would the two haplotypes be: A-NTTT/ATNTTT?