Why does GenotypeGVCFs output extra AD values that mess up the order of the values?
I have found several sites in my VCF files where GenotypeGVCFs appears to output the
AD records in the wrong order, since the
AD values do not match either the call or the order of values in the
PL field. The vast majority of sites are correct, but sometimes they look like this (
INFO column removed for readability):
POS ID REF ALT QUAL FILTER FORMAT Sample1 Sample2 349 . G A 210669.47 . GT:AD:DP:GQ:PL 1:3,0,238:241:99:9052,0 1:0,0,59:59:99:2699,0 910 . T A,G 177554.11 . GT:AD:DP:GQ:PL 2:0,0,0,76:76:99:3600,3600,0 1:0,90,0,0:90:99:4321,0,4321 1948 . T G,A,C 209510.27 . GT:AD:DP:GQ:PL 1:0,234,0,0,0:234:99:10934,0,10934,10934 2:0,0,0,57,0:57:99:2745,2745,0,2745
- In all cases I've found so far, there's more
ADentries than possible haplotypes or
PLentries (including the extreme case of having 5 AD values at site 1948 above)
- Not all sites are affected, and not even all samples at an affected site (see site 910 and 1948 above, although this could just be due to where exactly the 'extra'
ADvalue gets added)
- There does not seem to be a pattern as to where the extra entry gets added - at several sites there are still too many
ADfields, but the order is not wrong, so the extra entry must be at the end
- I did check the input VCF files (produced using HaplotypeCaller with
--emitRefConfidence BP_RESOLUTION), and the order/number of
ADfields is fine there
I haven't been able to find anything special about the affected sites, so it's not at all clear what is happing. The command line arguments used were:
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -V Sample1_raw.g.vcf -V Sample2_raw.g.vcf -R TanzaniaDog_RV2772_KF155002.1.fasta -o AllSamples_raw.gvcf --annotateNDA --includeNonVariantSites -stand_emit_conf 10 -stand_call_conf 30
I did try specifying ploidy in the above command (
--sample_ploidy 1), but this does not solve the problem.