To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Best strategy to "fix" the Haplotype Caller - GenotypeGVCF "missing DP field" bug??

Hi,

I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated vcf files where the DP field is present in the format description, it IS present in the Homozygous REF samples, but IS MISSING in any Heterozygous or HomoALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis trying on the differential depth of coverage in different regions, and if there is a DP bias between called/not-called samples.

I have thousands of files and I've spent almost 1 year generating all these callings, so redoing the callings is not an option.

What would be the best/fastest strategy to either fix my final vcfs with the DP data present in all intermediate gvcf files (preferably) or, at least, extracting this data for all snps and samples?

Thanks in advance,

Txema

PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Txema,

    As I see it you have two options: use the sum of ADs as a proxy for "usable depth", or run a DepthOfCoverage analysis on your samples. Both have strengths and weaknesses, and address slightly different aspects of the coverage question.

    1. Use the sum of ADs
      The advantage here is that you already have the data in your VCF. The downside is that the AD doesn't include all of the reads that were seen, just the ones corresponding to the alleles that were called. But this is a pretty good proxy if you're just looking to see if you have significant discrepancy in the amount of useable data available between samples.

    2. Run a DepthOfCoverage analysis
      This will collect coverage statistics per-sample and allow you to evaluate whether there is any substantial bias in the amount of sequence produced between samples. The caveat is that it will take some time to run on thousands of samples (though you can parallellize massively, and optimize by just running on the sites you're looking for), and you will be collecting coverage from the original alignments, as opposed to the alignments that result from the graph assembly process performed internally by HaplotypeCaller. The latter is the bigger potential problem: there will be some sites where the coverage stats you get are inconsistent with the AD stats of the call. But the number of sites showing a big difference should be reasonably small and mostly affect indels. And you can mitigate the problem by averaging the coverage stats over some region around each site of interest. This will be less precise per-site but should enable you to determine whether there is substantial difference in the total amount of data.

  • The DepthOfCoverage is completely out of scope. That would imply re-downloading ~120TB of bamfiles.

    The sum of ADs seems promising, but it would be cool to be able to see how many non-REF-non-ALT alleles were seen. Specially sample-by-sample to check if any of them has an excess of "error" calls.

    Do you have any recommendation on what tool to use for rewriting the vcfs without messing them up?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You could grab the ADs from the GVCFs to get the full picture if that's what you want. VariantAnnotator can annotate from an external resource, but off the top of my head I don't remember it doing sample annotations, only site-level. I could be wrong though, feel free to check the tech doc.

  • shawpashawpa Member

    I am running into a similar issue as the above user. I just noticed that I am only getting DP output at homozygous reference sites. I get a "." at heterozygous or homozygous alternate sites. The issue with this is that all my downstream analysis programs are unable to filter by depth because of this "missing" output. When I ran haplotype caller with default annotations, I didn't have this problem. I am now running haplotype caller with an additional annotation for allele balance by sample. I saw in some other threads that this missing DP is a known issue and that possibly the bug has been fixed. The DP annotation is in my vcf header and also appears in the format field. Is there some way to force it to recalculate it from the AD field?

  • I don't know if anyone from GATK staff has seen my recent post. The only solution I have come up with is to 1) run the haplotype caller with only the default annotations 2) filter genotypes for minimum depth 3) apply the additional annotation (AlleleBalanceBySample).

    I really don't want to have to do this as I calculate that it will take about 3 weeks to rerun haplotype caller on all of my samples due to computing limitations. If there is no "quicker fix" then please let me know.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @shawpa
    Hi,

    Is this happening in GATK 4.0.0.0? Can you post some example sites with and without the DP field (using default and non-default annotations)? I may need you to submit a bug report.

    Thanks,
    Sheila

  • Because this is an ongoing study, I am still using version 3.4. I know the bug was reported before. Is there a version where this is fixed?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @shawpa
    Hi,

    It looks like the fix should be in version 3.8. Can you test that version and let us know if it is fixed?

    Thanks,
    Sheila

  • It will take me some time to test. If v 3.8 works, I still need to rerun HaplotypeCaller with version 3.8 or can I use the existing g.vcfs from HaplotypeCaller (v3.4) and just run GenotypeGVCFs?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @shawpa
    Hi,

    Unfortunately, you will need to re-run with version 3.8 as there were some changes to GenotypeGVCFs in the later versions that may cause issues with old GVCFs.

    -Sheila

  • Would you recommend upgrading to v3.8 or v4.0? I would like to still be able to use the bam files that I generated previously or else it is going to take a very long time to start from scratch.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @shawpa

    Hi,

    If you are going to upgrade, it would be best to upgrade to 4.0 :smiley:

    -Sheila

  • I don't think I can upgrade to 4.0 at this time because it is missing an annotation that I need (AlleleBalanceBySample). I cannot find the download link for version 3.8. The link on the main page takes you to the 4.0 download and 3.8 is not under the archived versions.

    Issue · Github
    by Sheila

    Issue Number
    2832
    State
    open
    Last Updated
    Assignee
    Array
    Milestone
    Array
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @shawpa
    Hi,

    Sorry about that. The 3.8 version should be available soon.

    -Sheila

Sign In or Register to comment.