Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Why does GenotypeGVCFs output extra AD values that mess up the order of the values?

NardusNardus GlasgowMember

I have found several sites in my VCF files where GenotypeGVCFs appears to output the AD records in the wrong order, since the AD values do not match either the call or the order of values in the PL field. The vast majority of sites are correct, but sometimes they look like this (INFO column removed for readability):

POS     ID  REF ALT QUAL    FILTER  FORMAT  Sample1 Sample2
349     .   G   A   210669.47   .   GT:AD:DP:GQ:PL  1:3,0,238:241:99:9052,0 1:0,0,59:59:99:2699,0
910     .   T   A,G 177554.11   .   GT:AD:DP:GQ:PL  2:0,0,0,76:76:99:3600,3600,0    1:0,90,0,0:90:99:4321,0,4321
1948    .   T   G,A,C   209510.27   .   GT:AD:DP:GQ:PL  1:0,234,0,0,0:234:99:10934,0,10934,10934    2:0,0,0,57,0:57:99:2745,2745,0,2745

Some observations:

  • In all cases I've found so far, there's more AD entries than possible haplotypes or PL entries (including the extreme case of having 5 AD values at site 1948 above)
  • Not all sites are affected, and not even all samples at an affected site (see site 910 and 1948 above, although this could just be due to where exactly the 'extra' AD value gets added)
  • There does not seem to be a pattern as to where the extra entry gets added - at several sites there are still too many AD fields, but the order is not wrong, so the extra entry must be at the end
  • I did check the input VCF files (produced using HaplotypeCaller with --emitRefConfidence BP_RESOLUTION), and the order/number of AD fields is fine there

I haven't been able to find anything special about the affected sites, so it's not at all clear what is happing. The command line arguments used were:

java -jar GenomeAnalysisTK.jar   
        -T GenotypeGVCFs
        -V Sample1_raw.g.vcf -V Sample2_raw.g.vcf
        -R TanzaniaDog_RV2772_KF155002.1.fasta  
        -o AllSamples_raw.gvcf  
        --annotateNDA  
        --includeNonVariantSites  
        -stand_emit_conf 10  
        -stand_call_conf 30

I did try specifying ploidy in the above command (--sample_ploidy 1), but this does not solve the problem.

Tagged:

Comments

Sign In or Register to comment.