We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

VCF to fasta format question

I'm trying to take a multi-sample vcf and convert it to an aligned fasta file for phylogenetic analysis, including all indels and invariant sites. Previous questions similar to this suggested isolating individual samples using SelectVariants, and then FastaAlternateReferenceMaker to make the fasta. I'm having trouble with this because for GATK 3.8, FastaAlternateReferenceMaker doesn't appear to accept '*' alternate alleles, and for GATK 4.0, it doesn't seem like FastaAlternateReferenceMaker exists. Also, the choice of always output the alternate allele for heterozygous sites doesn't work for my downstream analyses.

To get around these issues, what I've done so far is use GATK 3.8 to generate a vcf using GenotypeGVCF including invariant sites. Then use SelectVariants to pull out a single sample,

java -jar $EBROOTGATK/GenomeAnalysisTK.jar -R /home/gowens/bin/ref/HanXRQr1.0-20151230.fa -T SelectVariants --variant HanXRQChr01_000000001.tmp.vcf.gz -o HanXRQChr01_000000001.subset -sn GO1_GB057 -trimAlternates

I was planning on scripting a parser to convert it to fasta (against the gatk recommendations), but I'm having trouble interpreting some sites.

Example 1:

HanXRQChr01 12430 . T . 16.80 . AN=2;DP=1 GT:AD:DP:RGQ 0/0:1,0:1:3
HanXRQChr01 12431 . GA . 138.70 . AN=2;BaseQRankSum=1.09;ClippingRankSum=0.00;DP=1;ExcessHet=3.0103;FS=0.000;MQ=60.00;MQRankSum=0.00;QD=15.41;ReadPosRankSum=0.020;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0
HanXRQChr01 12432 . A . 138.70 . AN=2;DP=1;ExcessHet=3.0103;FS=0.000;QD=15.41;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0

My interpretation would be that it is the true state is TGA, but the maybe its TGAA?

Example 2:

HanXRQChr01 17651 . TCGGCAA . 298.45 . AN=2;DP=3;ExcessHet=0.4576;FS=0.000;MQ=56.12;QD=30.57;SOR=0.941 GT:AD:DP:GQ:PL 0/0:3:3:3:0
HanXRQChr01 17652 . C . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.65;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17653 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=35.84;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17654 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.94;SOR=0.941 GT:AD:DP:PL ./.:2:2:0

In this case, there seems to be a difference in confidence depending on which row I interpret. Looking at the first row, it calls the reference allele for a 7 basepair stretch, but subsequent rows don't have any variant call. Should I interpret this as TCGG or TNNN or NNNN?

Example 3:

HanXRQChr01 56181 . ATTT A,AT 921.63 . AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=0.681;ClippingRankSum=0.00;DP=11;ExcessHet=3.0103;FS=0.000;MQ=58.70;MQRankSum=0.00;QD=25.74;ReadPosRankSum=0.475;SOR=0.321 GT:AD:DP:GQ:PL 1/2:0,9,2:11:43:462,73,43,378,0,
HanXRQChr01 56182 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=31.15;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56183 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.40;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56184 . T * 724.25 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.97;SOR=0.287 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56185 . T . 12.99 . AN=0;DP=17 GT:AD:DP ./.:17,0:17
HanXRQChr01 56186 . T . 23.07 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:4
HanXRQChr01 56187 . T . 20.94 . AN=2;DP=16 GT:AD:DP:RGQ 0/0:16,0:16:1
HanXRQChr01 56188 . T . 26.81 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:14

Would the two haplotypes be: A-NTTT/ATNTTT?



  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    Sorry for the delay.

    My interpretation would be that it is the true state is TGA, but the maybe its TGAA?

    It is indeed TGA. It looks like one of the samples had a deletion at the GA site, but since your chosen sample is hom-ref, the tool still keeps the ref from the original VCF.

    Example 2:

    HanXRQChr01 17651 . TCGGCAA . 298.45 . AN=2;DP=3;ExcessHet=0.4576;FS=0.000;MQ=56.12;QD=30.57;SOR=0.941 GT:AD:DP:GQ:PL 0/0:3:3:3:0

    Honestly, this looks weird. I am not sure why the AD only has 1 value. Can you post the GVCF record for this site? Notice the GQ is 3. That is quite low, so I would not have much confidence in the site call.

    Would the two haplotypes be: A-NTTT/ATNTTT



Sign In or Register to comment.