Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

VariantAnnotator: "--annotation Coverage" updating INFO DP but not sample-level (FORMAT) DP

dpoznikdpoznik StanfordMember

I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP.

I have run VariantAnnotator as follows.

inVCFfn=../vcf/raw.vcf
outVCFfn=../vcf/raw.annotated.vcf

gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar
ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta
dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf

java -Xmx10g \
  -jar $gatk \
  -T VariantAnnotator \
  -R $ref \
  -L $inVCFfn \
  --variant $inVCFfn \
  --dbsnp $dbsnp \
  --out $outVCFfn \
  --annotation Coverage \
  --annotation DepthPerAlleleBySample \
  --annotation MappingQualityZero \
  [-I fileName.1.bam … -I fileName.810.bam]

The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example:

$ awk '($2==7157834) { print $9, "|", $53 }' raw.vcf
GT:AD:DP:GQ:PL | 0:4,0:4:99:0,135

$ awk '($2==7157834) { print $9, "|", $53 }' raw.annotated.vcf
GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135

(Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something?

Thanks!
David

P.S. The INFO field DP and MQ0 are indeed updated as expected:

$ awk '($2==7157834) { print $8 }' raw.vcf
AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259

$ awk '($2==7157834) { print $8 }' raw.annotated.vcf
AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi David,

    No offense, but this is a very bad idea. By design, the call annotations in the VCF reflect the data that the caller/genotyper actually used in their calculation. Changing them undermines your ability to troubleshoot, analyze and further process calling metrics. If you want the VCF to include information about how much data was originally present in the file, I would suggest adding that as separate annotations, without overwriting the call metrics.

  • dpoznikdpoznik StanfordMember

    Hi Geraldine,

    Thanks for your reply. I didn't plan to delete the original file. Rather, I wanted to generate a parallel file that includes a summary of the available data. For my work, it is helpful to know whether a site resides in a region of inflated read depth or inflated MQ0 proportion, regardless of which reads were evaluated by the genotype caller. Similarly, for genotype look-ups, it's often helpful to know how many reads support a given call, regardless of how many the caller considered.

    If you want the VCF to include information about how much data was originally present in the file, I would suggest adding that as separate annotations, without overwriting the call metrics.

    Could you recommend an efficient means of doing so (ideally with a correct total-data FORMAT DP equivalent)? I'd considered running DepthOfCoverage, processing the results and merging with the VCF. This would yield total-data equivalents of INFO DP, FORMAT AD, and FORMAT DP (but not INFO MQ0), but this approach seemed a less efficient means to the same end. Is there a way to get VariantAnnotator to add fields with a non-default identifier rather than overwrite?

    Thanks for your help!

  • dpoznikdpoznik StanfordMember

    OK, no problem. Thanks for the info!

Sign In or Register to comment.