Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

NaN LOD in VQSR, despite having enough variants

emilyeemilye Member
edited February 22 in Ask the GATK team
I am working with data from 122 human whole exomes, captured using SeqCap EZ Prime Exome. My software versions are GATK 3.8.0 and java 1.8.0_131.

I have followed the Best Practices guidelines with minimal problems, until the VQSR step.

My command is:

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R hg38.fa \
-input cohort_122exomes_RG.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
--maxGaussians 4 \
-MQCap 70 \
-recalFile cohort.122.SNP.RG.recal \
-tranchesFile cohort.122.SNP.RG.tranches \
-rscriptFile cohort.122.SNP.RG.plots.R
```
This fails with the error:

```
NFO 11:49:00,122 HelpFormatter - Date/Time: 2019/02/22 11:49:00
INFO 11:49:00,123 HelpFormatter - ----------------------------------------------------------------------------------
INFO 11:49:00,123 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/groups/dpetrov/emily/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 11:49:00,523 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 11:49:00,524 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 11:49:00,526 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:49:00,714 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 11:49:02,004 IndexDictionaryUtils - Track hapmap doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 11:49:02,006 IndexDictionaryUtils - Track omni doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 11:49:02,006 IndexDictionaryUtils - Track 1000G doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 11:49:02,007 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 11:49:02,223 GenomeAnalysisEngine - Preparing for traversal
INFO 11:49:02,233 GenomeAnalysisEngine - Done preparing for traversal
INFO 11:49:02,234 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 11:49:02,235 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 11:49:02,235 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 11:49:02,243 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 11:49:02,244 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
INFO 11:49:02,245 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
INFO 11:49:02,245 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
INFO 11:49:32,242 ProgressMeter - chr1:84585140 1985205.0 30.0 s 15.0 s 2.6% 19.0 m 18.5 m
INFO 11:50:02,247 ProgressMeter - chr1:239316238 4976154.0 60.0 s 12.0 s 7.5% 13.4 m 12.4 m
INFO 11:50:32,250 ProgressMeter - chr10:96261463 7533983.0 90.0 s 11.0 s 10.8% 13.9 m 12.4 m
INFO 11:51:02,253 ProgressMeter - chr11:71721489 1.0172477E7 120.0 s 11.0 s 14.2% 14.1 m 12.1 m
INFO 11:51:32,257 ProgressMeter - chr12:33536501 1.2527983E7 2.5 m 11.0 s 17.2% 14.5 m 12.0 m
INFO 11:52:02,260 ProgressMeter - chr13:25810675 1.5049274E7 3.0 m 11.0 s 21.1% 14.2 m 11.2 m
INFO 11:52:32,263 ProgressMeter - chr14:54712265 1.7984479E7 3.5 m 11.0 s 25.6% 13.7 m 10.2 m
INFO 11:53:02,266 ProgressMeter - chr15:95184077 2.1027381E7 4.0 m 11.0 s 30.2% 13.2 m 9.2 m
INFO 11:53:32,283 ProgressMeter - chr17:35257508 2.4179459E7 4.5 m 11.0 s 34.4% 13.1 m 8.6 m
INFO 11:54:02,286 ProgressMeter - chr18:76260691 2.7021397E7 5.0 m 11.0 s 38.3% 13.1 m 8.1 m
INFO 11:54:32,291 ProgressMeter - chr2:46427744 2.9926631E7 5.5 m 11.0 s 41.7% 13.2 m 7.7 m
INFO 11:55:02,310 ProgressMeter - chr2:174800045 3.2780685E7 6.0 m 10.0 s 45.7% 13.1 m 7.1 m
INFO 11:55:32,313 ProgressMeter - chr20:37932367 3.5194156E7 6.5 m 11.0 s 49.0% 13.3 m 6.8 m
INFO 11:56:02,325 ProgressMeter - chr22:35283328 3.7378337E7 7.0 m 11.0 s 52.4% 13.4 m 6.4 m
INFO 11:56:32,330 ProgressMeter - chr3:97070639 4.0104008E7 7.5 m 11.0 s 55.9% 13.4 m 5.9 m
INFO 11:57:02,336 ProgressMeter - chr4:17375498 4.2946028E7 8.0 m 11.0 s 59.6% 13.4 m 5.4 m
INFO 11:57:32,338 ProgressMeter - chr4:136430795 4.5676503E7 8.5 m 11.0 s 63.3% 13.4 m 4.9 m
INFO 11:58:02,340 ProgressMeter - chr5:64118609 4.8478411E7 9.0 m 11.0 s 67.0% 13.4 m 4.4 m
INFO 11:58:32,360 ProgressMeter - chr6:13714259 5.1519913E7 9.5 m 11.0 s 71.1% 13.4 m 3.9 m
INFO 11:59:02,362 ProgressMeter - chr6:145420855 5.4598912E7 10.0 m 10.0 s 75.2% 13.3 m 3.3 m
INFO 11:59:32,366 ProgressMeter - chr7:96067618 5.7570251E7 10.5 m 10.0 s 79.0% 13.3 m 2.8 m
INFO 12:00:02,372 ProgressMeter - chr8:53485460 6.0511916E7 11.0 m 10.0 s 82.6% 13.3 m 2.3 m
INFO 12:00:32,374 ProgressMeter - chr9:22847736 6.3256918E7 11.5 m 10.0 s 86.2% 13.3 m 110.0 s
INFO 12:01:02,377 ProgressMeter - chrX:31991270 6.6001101E7 12.0 m 10.0 s 94.4% 12.7 m 43.0 s
INFO 12:01:22,875 VariantDataManager - QD: mean = 13.58 standard deviation = 5.26
INFO 12:01:22,979 VariantDataManager - MQ: mean = 1.77 standard deviation = 0.24
INFO 12:01:23,041 VariantDataManager - MQRankSum: mean = -0.01 standard deviation = 0.18
INFO 12:01:23,090 VariantDataManager - ReadPosRankSum: mean = 0.16 standard deviation = 0.65
INFO 12:01:23,138 VariantDataManager - FS: mean = 1.52 standard deviation = 2.71
INFO 12:01:23,185 VariantDataManager - SOR: mean = 0.79 standard deviation = 0.43
INFO 12:01:23,232 VariantDataManager - InbreedingCoeff: mean = 0.03 standard deviation = 0.11
INFO 12:01:23,663 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, SOR, MQRankSum, ReadPosRankSum, InbreedingCoeff]
INFO 12:01:23,695 VariantDataManager - Training with 308650 variants after standard deviation thresholding.
INFO 12:01:23,699 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 12:01:32,379 ProgressMeter - chrY:56886603 6.8052642E7 12.5 m 11.0 s 100.0% 12.5 m 0.0 s
INFO 12:01:40,636 VariantRecalibratorEngine - Finished iteration 0.
INFO 12:01:48,822 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.91918
INFO 12:01:56,876 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.40332
INFO 12:02:02,381 ProgressMeter - chrY:56886603 6.8052642E7 13.0 m 11.0 s 100.0% 13.0 m 0.0 s
INFO 12:02:03,623 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.28258
INFO 12:02:10,453 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.10852
INFO 12:02:17,411 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.04969
INFO 12:02:24,213 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.02834
INFO 12:02:32,384 ProgressMeter - chrY:56886603 6.8052642E7 13.5 m 11.0 s 100.0% 13.5 m 0.0 s
INFO 12:02:32,594 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00995
INFO 12:02:40,375 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00343
INFO 12:02:47,116 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00174
INFO 12:02:47,117 VariantRecalibratorEngine - Convergence after 45 iterations!
INFO 12:02:48,041 VariantRecalibratorEngine - Evaluating full set of 715516 variants...
WARN 12:02:49,212 VariantRecalibratorEngine - Evaluate datum returned a NaN.
INFO 12:02:49,241 VariantDataManager - Selected worst 8357 scoring variants --> variants with LOD <= -5.0000.
INFO 12:02:49,242 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 12:02:49,321 VariantRecalibratorEngine - Finished iteration 0.
INFO 12:02:49,354 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.02743
INFO 12:02:49,388 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.04119
INFO 12:02:49,422 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.00956
INFO 12:02:49,456 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.00153
INFO 12:02:49,457 VariantRecalibratorEngine - Convergence after 20 iterations!
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider lowering the maximum number of Gaussians allowed for use in the model (via --maxGaussians 4, for exam
ple).
##### ERROR ------------------------------------------------------------------------------------------
```

From browsing the forum, I see that this problem is most commonly caused by having too few variants. But I have lots of variants in my .vcf -- over 700,000 SNPs and 100,000 indels.

I have successfully run VQSR on this dataset before, resulting in a LOT of false positives in my tranche plots. I thought it might have been because I improperly specified the read groups, so I started the pipeline over after the alignment step (adding RG to the bams, then proceeding as before).

Notably, this time around, my cluster was having downtime and wasn't always completely writing output files. I thought I had fixed individual problem files by re-running individual steps, but it is possible something went wrong somewhere earlier on. I am at a loss for where to begin to look (and what this error actually means).

Any suggestions would be much appreciated! Thanks so much.

Comments

  • emilyeemilye Member
    edited February 22
  • emilyeemilye Member
    edited February 22
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @emilye,

    Is it possible your variant profiles are more uniform than usual or are more varied than can be captured by the --maxGaussians 4 setting? I suggest trying different numbers of Gaussians. I see you are using 4 for SNPs calls, which is lower than normal. Can you please try the default and also numbers between?

  • emilyeemilye Member
    Hi @shlee ,

    Thanks so much for your response! I added --maxGaussians 4 at the prompt of the error message (before, it ran on the default, and gave the same error). I can certainly try more options, but I'm also wondering if there's a way to tell what the actual problem is. For example, is there a way to assess what the variant profiles are?
  • emilyeemilye Member
    I'm worried, for instance, that not all the samples have the same annotations. "Clustering with this few variants and these annotations is unsafe" doesn't give me many clues on where to look.
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @emilye,

    Here's the most recent VQSR/hard-filtering tutorial, for reference and in case you need to move to a hard-filtering approach: https://software.broadinstitute.org/gatk/documentation/article?id=23216.

    What I suggest is to examine the annotations profile of your callset. The article https://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations explains what to expect from annotation profiles. Towards getting to the plots, we have some hands-on tutorials from our workshops that show you how to use an R script either in a Jupyter notebook or in RStudio. All of our workshop materials are publically available and you can find a link at https://software.broadinstitute.org/gatk/documentation/presentations.

  • emilyeemilye Member
    @shlee,

    Thanks for the links. I used VariantsToTable to extract the annotations from my .vcf, and noticed a couple things.

    (1) The distributions of QD, MQ, MQRankSum, ReadPosRankSum, FS, and SOR all look quite similar to the tutorial, with the exception of QD. It looks like a lot of variants have quite low QD (see attachment).

    (1) MQRankSum and ReadPosRankSum have 'NA' values for ~15K variants each. This is still a small fraction of total variants, but the other annotations have only 0-184 missing values.

    Since I still have my old .vcf (with the wrong read groups) that DID work in VariantRecalibrator, I extracted the annotations from that .vcf too. Those distributions look exactly the same as in the new .vcf that failed.

    This suggests to me that the annotations aren't the issue, although I welcome your feedback!
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @emilye, for your abnormal QD plot, you will want to figure out whether qual is low or if depth is high. A QUAL vs. Depth scatter plot should be helpful here. Also, did you gate the callset with your exome target intervals list? I do believe you will want to exclude from your callset variants that fall outside the targets, i.e. on the edges of read pileups.

  • emilyeemilye Member
    Hi @shlee -- It looks like high depth is more likely the culprit -- a few images at different scales are attached. Do you think this is impacting VQSR somehow?

    I did indeed use '-L PrimeExome.intervals -ip 100' to gate the callset in the BaseRecalibrator and HaplotypeCaller steps (but not GenotypeGVCFs; I can try this)
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited February 27

    Hi @emilye,

    No need to gate GenotypeGVCFs as it only iterates over the intervals in the GVCFs.

    Your DP vs. QUAL plots show a positive correlation between the two as expected. Do these profiles look pronouncedly different from your other callsets that filter fine? How strongly does the evidence point to high depths being a culprit?

    I am just thinking plots like the ones at http://yulearning.blogspot.com/2014/11/einsteins-most-famous-equation-is-emc2.html or in:

    would be most helpful to get a sense of the clustering pattern. This would be done I believe with the VQSLOD annotation. In lieu of this, you can just try further lowering the Gaussians to 3 and then to 2 and see what happens.

    The only other thing I can think of is to try out the new qual model, which with GATK4.1.0.0 replaces the old qual model. It is also available in GATK3.8. See --useNewAFCalculator/
    -newQual here.

    I'm afraid this is as much as I can think of at the moment. The next step is to consult with a developer, which I'd be happy to do for you.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @emilye, I heard back from the develop I had pinged. Here is what they say:

    Remove --InbreedingCoeff from the command line (it's not used in our production) and also try removing --MQCapoption and potentially MQ from annotations altogether. One thing to try also is to increase the --maxGaussians values, as it could lead to a better model fit.

  • emilyeemilye Member
    Hi @shlee,

    Thanks so much for following up on this and responding multiple times. I really appreciate your time!

    For resolution: I went back a step, re-ran HaplotypeCaller on my samples again, and then the VQSR steps worked. I think something in the *g.vcf files was corrupted because of my cluster issues. Nonetheless, going through the annotations carefully helped me learn more about my data. Thanks again!
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I'm glad to hear everything is working as expected now @emilye!

Sign In or Register to comment.