Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Why does GenotypeGVCFs output extra AD values that mess up the order of the values?

NardusNardus GlasgowMember

I have found several sites in my VCF files where GenotypeGVCFs appears to output the AD records in the wrong order, since the AD values do not match either the call or the order of values in the PL field. The vast majority of sites are correct, but sometimes they look like this (INFO column removed for readability):

POS     ID  REF ALT QUAL    FILTER  FORMAT  Sample1 Sample2
349     .   G   A   210669.47   .   GT:AD:DP:GQ:PL  1:3,0,238:241:99:9052,0 1:0,0,59:59:99:2699,0
910     .   T   A,G 177554.11   .   GT:AD:DP:GQ:PL  2:0,0,0,76:76:99:3600,3600,0    1:0,90,0,0:90:99:4321,0,4321
1948    .   T   G,A,C   209510.27   .   GT:AD:DP:GQ:PL  1:0,234,0,0,0:234:99:10934,0,10934,10934    2:0,0,0,57,0:57:99:2745,2745,0,2745

Some observations:

  • In all cases I've found so far, there's more AD entries than possible haplotypes or PL entries (including the extreme case of having 5 AD values at site 1948 above)
  • Not all sites are affected, and not even all samples at an affected site (see site 910 and 1948 above, although this could just be due to where exactly the 'extra' AD value gets added)
  • There does not seem to be a pattern as to where the extra entry gets added - at several sites there are still too many AD fields, but the order is not wrong, so the extra entry must be at the end
  • I did check the input VCF files (produced using HaplotypeCaller with --emitRefConfidence BP_RESOLUTION), and the order/number of AD fields is fine there

I haven't been able to find anything special about the affected sites, so it's not at all clear what is happing. The command line arguments used were:

java -jar GenomeAnalysisTK.jar   
        -T GenotypeGVCFs
        -V Sample1_raw.g.vcf -V Sample2_raw.g.vcf
        -R TanzaniaDog_RV2772_KF155002.1.fasta  
        -o AllSamples_raw.gvcf  
        -stand_emit_conf 10  
        -stand_call_conf 30

I did try specifying ploidy in the above command (--sample_ploidy 1), but this does not solve the problem.



Sign In or Register to comment.