We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
VCF file and allele frequency

Hello All,
I am using I am using GATK RNA-seq variant pipeline for finding muttaion/vatiants on the list of gene given in teh follwoing command line
java-1.7 -jar -Xincgc -Xmx1586M GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller --filter_reads_with_N_cigar -R human_genome37_gatk.fa -D dbsnp_137.hg19.vcf -I sample_split.bam -o sample.vcf -L mylist.intervals
And the resulting VCF files has for variants AF either 100 % or 50 % . It would be great if anyone would explain me what does AF means in INFO column from VCF file.
I AF means allele frequency or it has to be calculated separately from VCF file and if so how can I do it..???
Thank you
Best Answers
-
Geraldine_VdAuwera Cambridge, MA admin
The AF annotated by our tools is the theoretical allele frequency that corresponds to the genotype call made by the tool. If you want to calculate the empirical allele frequency you can use the values from the AD field. One way to do this is to extract the values using the VariantsToTable tool to produce a tab-delimited file, which you can then process with your preferred statistical tools.
-
Geraldine_VdAuwera Cambridge, MA admin
To extract genotype level fields like AD, you need to use the
-GF
argument. -
Alva Switzerland ✭✭
Ok, so here GT:AD:DP:GQ:PL 1/1:0,342:342:99:14410,1029,0 So here it is 342/342 is Variant Allele frequency rite..?
-
Sheila Broad Institute admin
@charlesbaudo
Hi Charles,Sorry for the late response.
1) We don't have hard recommendations for dealing with mitochondrial data. We also have not done testing to determine if 100 is a good starting ploidy. I would recommend looking at the literature to see what ploidy is usually used. From there, it is up to you to experiment and determine what is best for your data.
2) In your case, since you are using ploidy 100, you can simply use the AF as stated in the VCF. For your second example, notice 8 copies contain the alternate allele. 8 out of 100 gives you 8% allele frequency, and the AF given is 0.080. This convenience is probably why lots of people start with ploidy 100
-Sheila
Answers
The AF annotated by our tools is the theoretical allele frequency that corresponds to the genotype call made by the tool. If you want to calculate the empirical allele frequency you can use the values from the AD field. One way to do this is to extract the values using the VariantsToTable tool to produce a tab-delimited file, which you can then process with your preferred statistical tools.
Ok, so i used the flowing command line ,
java-1.7 -jar -Xincgc -Xmx1586M GenomeAnalysisTK-3.2-2.jar -R /human_genome37_gatk.fa -T VariantsToTable -V sample.vcf -F CHROM -F POS -F ID -F QUAL -F AD -o sample.table
to retrieve the field.. But its throwing error as,
ERROR
ERROR MESSAGE: Missing field AD in vc variant at [VC variant @ chr1:14653 Q33.77 of type=SNP alleles=[C*, T] attr={AC=1, AF=0.500, AN=2, BaseQRankSum=0.720, ClippingRankSum=-1.380, DP=7, FS=0.000, MLEAC=1, MLEAF=0.500, MQ=60.00, MQ0=0, MQRankSum=-0.720, QD=4.82, ReadPosRankSum=1.380} GT=GT:AD:DP:GQ:PL 0/1:4,2:6:62:62,0,108
But without AD as -F value its giving output table but I need AD..
So one more question from there..AD is a comma separated value, as for example.. in the following variant,
chr1 564598 . A G 14381.77 . AC=2;AF=1.00;AN=2;DP=342;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.56 GT:AD:DP:GQ:PL 1/1:0,342:342:99:14410,1029,0
AD here is 0,342 : so the how is empirical Allele frequency calculated or which count is alternate varinat count and which is reference variant count.. .Could you please explain with this example...?
Thank you
Or putting the question in another way
I ahve AD as GT:AD:DP:GQ:PL 1/1:2,366:368:99:14774,1067,0 in the column
And for VAF teh formulate is variant reads / total reads and my question is which is variant read and which is ref read
To extract genotype level fields like AD, you need to use the
-GF
argument.yeah I tried,
with GF option and now I have the output file, but I have hard time understanding what is QUAL and AC columns means how one can infer Empirical Variant Allele frequency from these information .?
the command line I used successfully,
java-1.7 -jar -Xincgc -Xmx1586MGenomeAnalysisTK-3.2-2.jar -R human_genome37_gatk.fa -T VariantsToTable -V sample.vcf -F CHROM -F POS -F ID -F QUAL -GF -F AC -o sample.table
and here is what i have as output
head sample.table
CHROM POS ID QUAL AC
chr1 564598 rs6594028 13998.77 2
chr1 564654 rs147404388 13921.77 2
chr1 564862 rs1988726 646.77 2
chr1 564868 rs1832728 646.77 2
And from here how can I infer VAF (Variant allele frequency) I mean which values I need to use for that.
Are you looking for allele frequency in each sample, or overall in the population?
Each sample
Then you simply divide AD of each allele by DP.
Ok, so here GT:AD:DP:GQ:PL 1/1:0,342:342:99:14410,1029,0 So here it is 342/342 is Variant Allele frequency rite..?
@Alva
Yes, that is correct.
Hi @Sheila and @Geraldine_VdAuwera, i seem to have two separate DP valeues for one line, below is the output for one variant:
1 680 rs794548358 C T 181.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.540;ClippingRankSum=-0.666;DB;DP=51;ExcessHet=3.0103;FS=10.533;MLEAC=1;MLEAF=0.500;MQ=44.74;MQRankSum=-1.972;QD=3.87;ReadPosRankSum=2.073;SOR=3.215 GT:AD:DP:GQ:PL 0/1:36,11:47:99:210,0,1046
It says DP=51, but then later GT:AD:DP:GQ:PL 0/1:36,11:47:99:210,0,1046 lists DP=47, this only happens for some variants. Do you know why I am getting two different DP values and which one I should use? Also, why does allele frequency have to be AD/DP, if AD is 5, 35, can I also do 5/(35+5)? Thanks!
@angezou
Hi,
The difference in the two DPs has to do with filtering. Have a look at this page and this page for more information.
The AD calculation is explained here.
Also, one last article to help you!
https://www.broadinstitute.org/gatk/guide/article?id=6005
I hope these help.
-Sheila
Could you explain why the "theoretical allele frequency" is not equal to AC/AN?
VCF reference document only says "use this when estimated from primary data, not called genotypes" to explain AF.
@WANGxiaoji
Hi,
Indeed the AF is equal to the AC/AN in GATK output. We differ a little from the VCF spec in that case.
-Sheila
Hello,
I'm attempting to estimate mitochondrial heteroplasmy using HaplotypeCaller on a single sample and I have a few questions.
1) After perusing the forums it appears people have suggested using a ploidy of 100. Do you still believe this is a good recommendation?
2) Can you please help me interpret the output to determine the alternate allele frequency for an individual site? I'm having difficulty following the above conversation and interpreting these results. I would like to make a statement similar to, "The alternate allele frequency for a given site ranges from 2-11%".
Here is my command line:
java -Xmx16g -jar GenomeAnalysisTK.jar -R /Volumes/Toshiba_NGS/Malassezia_sequences/TomDawson/MF1878.fasta -I /Volumes/Toshiba_NGS/Malassezia_sequences/TomDawson/1878final.bam -stand_call_conf 30 -stand_emit_conf 10 -ploidy 100 -o /Volumes/Toshiba_NGS/Malassezia_sequences/TomDawson/1878.raw.vcf -T HaplotypeCaller -minPruning 3 -nct 16
and an output for two variants:
1:
mito 5092 . A T 35.17 my_snp_filter AC=1;AF=0.010;AN=100;BaseQRankSum=-0.187;ClippingRankSum=0.000;DP=1475;FS=0.000;MLEAC=1;MLEAF=0.010;MQ=60.00;MQRankSum=0.000;QD=0.02;ReadPosRankSum=-0.567;SOR=3.598 GT:AD:DP:GQ 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:1470,5:1475:53
2:
mito 27040 . T A 1561.98 my_snp_filter AC=8;AF=0.080;AN=100;BaseQRankSum=-9.864;ClippingRankSum=0.000;DP=1245;FS=93.623;MLEAC=8;MLEAF=0.080;MQ=44.95;MQRankSum=22.792;QD=1.52;ReadPosRankSum=-11.137;SOR=0.000 GT:AD:DP:GQ 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1:929,98:1027:1
Thanks,
Charles
Issue · Github
by Sheila
@charlesbaudo
Hi Charles,
Sorry for the late response.
1) We don't have hard recommendations for dealing with mitochondrial data. We also have not done testing to determine if 100 is a good starting ploidy. I would recommend looking at the literature to see what ploidy is usually used. From there, it is up to you to experiment and determine what is best for your data.
2) In your case, since you are using ploidy 100, you can simply use the AF as stated in the VCF. For your second example, notice 8 copies contain the alternate allele. 8 out of 100 gives you 8% allele frequency, and the AF given is 0.080. This convenience is probably why lots of people start with ploidy 100
-Sheila
Is there a suggested method for combining this with the 'SelectVariants' tool? I want to filter my .vcf output from HaplotypeCaller based on this recalculated frequency value, along with several other values.
@steve1
Hi,
I am not sure what you mean by "combining this with the 'SelectVariants' tool"?
Perhaps this thread will help.
-Sheila
@Sheila
is 342/(0+342) more accurate or just AD/DP, there is some uninformative reads in AD , and in the vcf header, said there can be some reads dropped in DP, so is there a accurate method to calculate this?
and in muetct2 vcf, should I just select the AF value as alt variant frequency?
@Sheila @Geraldine_VdAuwera
Hi,
Thanks for the explaination above. I would like to ask if AC stands for "Allele count in genotypes, for each ALT allele, in the same order as listed", why there are some variants corresponding AC value is 0?
In other words, why call a variant in this sample if we haven't found one single ALT allele in the reads of this sample?
BTW, If AC does not stand for this meaning(given AN stands for total number of alleles), I'm not sure I quite follow the meaning of ACs.
Hi @Yangyxt
Can you please post the command used, version of gatk and the variant records you are referring to.
I used 4.1.0.0 gatk. The command is just calling HaplotypeCaller, CombineGVCFs and GenotypeGVCFs with -I/-V, -O, -R arguments.
Here I paste a screenshot of the record for your reference:

@Yangyxt
I am sorry I don't see AC=0 in the variant records you posted here. You are right that AC is the allele counts for alt alleles and AN refers to total number of allele.