CombineVariants: AD field not updated when merging variants with different REF and ALT alleles

anders_kvist (Member, Posts: 3)
edited October 2012 in Ask the team

When I run CombineVariants on two VCF files that have variants at the same position but with different REF alleles and different sets of ALT alleles, the AD fields of the genotypes are not updated to reflect the changes in the ALT field. The REF and ALT fields and the GT field of each genotype are all updated correctly. For example, combining

3   10128965    rs71052293  CTT CT,C,CTTT   19936.43    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  0/2:115,0,33,12:230:6.96:980,1237,2795,0,946,1900,7,679,467,817 3/1:97,13,20,16:229:99:804,221,832,581,176,3047,521,0,1653,1595

and

3   10128965    rs71052293  CT  C,CTT,CTTT  14280.61    PASS    AC=1,1,1;AF=0.25,0.25,0.25;AN=4 GT:AD:DP:GQ:PL  2/1:110,20,33,18:237:1.90:850,289,1027,457,0,1487,147,877,2,1858    0/3:80,48,5,29:209:99:1835,875,977,2101,1119,3322,0,142,331,462

gives

3   10128965    rs71052293  CTT CT,C,CTTT,CTTTT 19936.43    PASS    AC=2,1,2,1;AF=0.250,0.125,0.250,0.125;AN=8;set=Intersection GT:AD:DP:GQ 0/2:115,0,33,12:230:7   3/1:97,13,20,16:229:99  3/1:110,20,33,18:237:2  0/4:80,48,5,29:209:99

There are five alleles (one REF and four ALT) but only four AD values for each genotype.
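
The mismatch can be checked mechanically. The following is a minimal sketch, not part of GATK: the function name `ad_length_mismatches` is hypothetical, and it assumes a single whitespace-separated VCF data line with the standard column order (REF in column 4, ALT in column 5, FORMAT in column 9, samples after that).

```python
# The merged record from the output above, as one whitespace-separated line.
MERGED_RECORD = (
    "3 10128965 rs71052293 CTT CT,C,CTTT,CTTTT 19936.43 PASS "
    "AC=2,1,2,1;AF=0.250,0.125,0.250,0.125;AN=8;set=Intersection GT:AD:DP:GQ "
    "0/2:115,0,33,12:230:7 3/1:97,13,20,16:229:99 "
    "3/1:110,20,33,18:237:2 0/4:80,48,5,29:209:99"
)


def ad_length_mismatches(vcf_line):
    """Return (sample_index, n_ad_values, n_alleles) for every sample whose
    AD field does not have one value per allele (REF plus all ALTs)."""
    fields = vcf_line.split()
    n_alleles = 1 + len(fields[4].split(","))  # REF + ALT alleles
    format_keys = fields[8].split(":")
    if "AD" not in format_keys:
        return []
    ad_index = format_keys.index("AD")

    mismatches = []
    for sample_index, sample in enumerate(fields[9:]):
        values = sample.split(":")
        # Skip samples where AD is absent or explicitly missing.
        if ad_index >= len(values) or values[ad_index] == ".":
            continue
        n_ad = len(values[ad_index].split(","))
        if n_ad != n_alleles:
            mismatches.append((sample_index, n_ad, n_alleles))
    return mismatches


mismatches = ad_length_mismatches(MERGED_RECORD)
```

For the merged record, every one of the four samples is reported with four AD values against five alleles.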

My command line:

java -jar -Xmx4g GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V test_input1.vcf -V test_input2.vcf -o test_combined.vcf

Is this a known limitation or a bug?


Answers

  • anders_kvist (Member, Posts: 3)

    Thanks for your quick reply. I appreciate the hard work that goes into developing and supporting GATK; it is an excellent and invaluable set of tools. I completely understand that you cannot act on all comments and requests from users. Still, I would like to offer a couple of reflections on this issue. Maybe they will be useful, should you decide to develop the functionality of CombineVariants further in the future:

    • After the merge with CombineVariants, the allelic depth values in the AD fields are no longer in the same order as the ref and alt alleles and it is impossible to know which allele each value refers to. The description for the AD field remains: "Allelic depths for the ref and alt alleles in the order listed", but is no longer true. Would it not be better to remove the AD values completely than to keep values that are in the wrong order and hence unusable?

    • Updating the AD field in CombineVariants would only require shuffling the allelic depth values and adding zeroes to match the updated REF and ALT fields. The added runtime would be negligible. Rerunning VariantAnnotator to add AD requires reading through the bam files of all samples (since the link between AD values and allele is lost from the vcf), which for large vcfs with many samples can result in a very long runtime. For my small test case with four samples and a short target interval of ~4MB, the run time was ~30 min.

  • Mark_DePristo (Administrator, GSA Member, Posts: 153)

    As I understand it, the AD field should be filtered out when alleles are merged at a site. If it isn't, that's a bug in the code. Also note that it's not possible to compute AD once new alleles have been added, because the count for an added allele shouldn't be zero. At best it should be ".", but that's hard for us to do.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • Geraldine_VdAuwera (Administrator, GSA Member, Posts: 5,213)

    Thanks for your comments, Anders -- you're correct that this is a problem. We're putting this on our list to check & fix.

    Geraldine Van der Auwera, PhD

  • anders_kvist (Member, Posts: 3)

    Of course you are right, computing AD for the added alleles in CombineVariants wouldn't work. It would require the original bams for the counts. As you say, filtering out AD or putting in a . is probably the best alternative. Thanks for pointing that out.
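
To illustrate the reordering discussed in this thread, here is a minimal sketch of how AD values could be moved into the merged allele order, with "." for alleles the source record never listed. This is not CombineVariants code: the function name `remap_ad` is hypothetical, and the sketch assumes that the merged REF is the source REF extended with extra trailing bases and that every source allele is padded with those same bases, which is consistent with the example records in the question. Symbolic alleles are not handled.

```python
def remap_ad(src_ref, src_alts, ad, merged_ref, merged_alts):
    """Reorder per-allele depths from a source record into merged allele order.

    Alleles that exist only in the merged record get "." (unknown), not 0,
    because their depth cannot be recovered from the VCF alone.
    """
    if not merged_ref.startswith(src_ref):
        raise ValueError("merged REF must extend the source REF")
    if len(ad) != 1 + len(src_alts):
        raise ValueError("AD must have one value per source allele")

    # Pad the source alleles with the bases that were appended to the REF
    # (assumption: this is how the alleles were reconciled during merging).
    suffix = merged_ref[len(src_ref):]
    padded = [allele + suffix for allele in [src_ref] + list(src_alts)]
    depth_by_allele = dict(zip(padded, ad))

    merged_alleles = [merged_ref] + list(merged_alts)
    return [depth_by_allele.get(allele, ".") for allele in merged_alleles]


MERGED_REF = "CTT"
MERGED_ALTS = ["CT", "C", "CTTT", "CTTTT"]

# First sample of the first input record (REF CTT, ALT CT,C,CTTT).
from_first = remap_ad("CTT", ["CT", "C", "CTTT"], [115, 0, 33, 12],
                      MERGED_REF, MERGED_ALTS)

# First sample of the second input record (REF CT, ALT C,CTT,CTTT).
from_second = remap_ad("CT", ["C", "CTT", "CTTT"], [110, 20, 33, 18],
                       MERGED_REF, MERGED_ALTS)
```

Under these assumptions, the sample from the second record changes from 110,20,33,18 to 110,20,.,33,18, which lines up with its merged genotype 3/1.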
