Error in trying to use VariantRecalibrator

Hi,

I'm trying to use the VariantRecalibrator on SNP calls from ~160 yeast genomes. Because there are no resources for known SNPs, I had to create my own. As all the genomes are of yeast that derived in an evolution experiment, they have a (relatively small) common set of differences relative to the reference. As coverage varies between strains, and across the genome, not all common SNPs get called in all strains, so I generated one vcf file with SNPs that were called in 25% or more of the strains - manual examination of these suggests that they are all real (I called this truth.vcf - it has ~120 SNPs) - and one file of SNPs that were called in between 10-25% of strains (I called this train.vcf - it only has ~25 SNPs). I then try to do variant recalibration as:

gatk-4.0.11.0/gatk VariantRecalibrator -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V called_genotypes.vcf.gz --resource truth,known=false,training=true,truth=true,prior=15.0:truth.vcf --resource train,known=false,training=true,truth=false,prior=10.0:train.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -O output.recal --tranches-file output.tranches --rscript-file output.plots.R --max-gaussians=4

but got the following output:

Using GATK jar /Volumes/Promise_Pegasus/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Volumes/Promise_Pegasus/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar VariantRecalibrator -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V called_genotypes.vcf.gz --resource truth,known=false,training=true,truth=true,prior=15.0:truth.vcf --resource train,known=false,training=true,truth=false,prior=10.0:train.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -O output.recal --tranches-file output.tranches --rscript-file output.plots.R --max-gaussians=4
18:23:35.917 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Volumes/Promise_Pegasus/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
18:23:37.594 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.594 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.0.11.0
18:23:37.595 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
18:23:37.595 INFO VariantRecalibrator - Executing as [email protected] on Mac OS X v10.13.3 x86_64
18:23:37.595 INFO VariantRecalibrator - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_51-b16
18:23:37.595 INFO VariantRecalibrator - Start Date/Time: January 9, 2019 6:23:35 PM PST
18:23:37.595 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.595 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.596 INFO VariantRecalibrator - HTSJDK Version: 2.16.1
18:23:37.596 INFO VariantRecalibrator - Picard Version: 2.18.13
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:23:37.596 INFO VariantRecalibrator - Deflater: IntelDeflater
18:23:37.596 INFO VariantRecalibrator - Inflater: IntelInflater
18:23:37.596 INFO VariantRecalibrator - GCS max retries/reopens: 20
18:23:37.596 INFO VariantRecalibrator - Requester pays: disabled
18:23:37.596 INFO VariantRecalibrator - Initializing engine
18:23:38.024 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/truth.vcf
18:23:38.040 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/train.vcf
18:23:38.047 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/called_genotypes.vcf.gz
18:23:38.082 INFO VariantRecalibrator - Done initializing engine
18:23:38.086 INFO TrainingSet - Found truth track: Known = false Training = true Truth = true Prior = Q15.0
18:23:38.086 INFO TrainingSet - Found train track: Known = false Training = true Truth = false Prior = Q10.0
18:23:38.092 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF.
18:23:38.107 INFO ProgressMeter - Starting traversal
18:23:38.108 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
18:23:38.347 INFO VariantRecalibrator - No variants filtered by: AllowAllVariantsVariantFilter
18:23:38.348 INFO ProgressMeter - XI:67773 0.0 1598 399500.0
18:23:38.348 INFO ProgressMeter - Traversal complete. Processed 1598 total variants in 0.0 minutes.
18:23:38.349 INFO VariantDataManager - QD: mean = 31.03 standard deviation = 2.13
18:23:38.349 INFO VariantDataManager - MQ: mean = 56.26 standard deviation = 5.91
18:23:38.349 INFO VariantDataManager - MQRankSum: mean = 0.17 standard deviation = 0.65
18:23:38.350 INFO VariantDataManager - ReadPosRankSum: mean = 0.22 standard deviation = 0.89
18:23:38.350 INFO VariantDataManager - FS: mean = 0.53 standard deviation = 3.12
18:23:38.351 INFO VariantDataManager - SOR: mean = 1.03 standard deviation = 0.77
18:23:38.351 INFO VariantDataManager - DP: mean = 3260.79 standard deviation = 10238.34
18:23:38.358 INFO VariantDataManager - Annotation order is: [DP, MQ, QD, FS, ReadPosRankSum, MQRankSum, SOR]
18:23:38.358 INFO VariantDataManager - Training with 112 variants after standard deviation thresholding.
18:23:38.358 WARN VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
18:23:38.361 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
18:23:38.394 INFO VariantRecalibratorEngine - Finished iteration 0.
18:23:38.412 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.04664
18:23:38.419 INFO VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.05034
18:23:38.421 INFO VariantRecalibratorEngine - Convergence after 11 iterations!
18:23:38.424 WARN VariantRecalibratorEngine - Model could not pre-compute denominators.
18:23:38.424 INFO VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
18:23:38.427 INFO VariantRecalibrator - Shutting down engine
[January 9, 2019 6:23:38 PM PST] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=1202192384
java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:656)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:968)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

I don't understand what the "No data found" error at the end means, but it results in me having no SNPs in the output.recal file (should that be a vcf file?) , and I get no .R file from which to generate any plots.

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Gavin_Sherlock

    For non-human organisms it is often not possible to do this. Here is a document that might help understand this error: https://software.broadinstitute.org/gatk/documentation/article?id=11097

  • Gavin_SherlockGavin_Sherlock StanfordMember

    Ah - does that mean that joint calling won't make any difference than individual sample variant calling? Or do some of the variant metrics have different values due to the joint calling taking more data into account?

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Gavin_Sherlock

    I am looking into this question at the moment. Can you possibly provide more context for what the goal is?

  • Gavin_SherlockGavin_Sherlock StanfordMember

    Hi,

    The goal is simply to take advantage of having 200 genomes to get better calling, especially over the parts of genomes that might have lower coverage for some samples. If I look at some SNPs called in some strains and not in others, by and large you can see in IGV that all the strains have the SNPs, but that some don't have enough read at the SNP site. My (possibly erroneous) thought was that joint calling would help with these cases, as look at those sites over all genomes would make it clear that there was always a SNP present. It's a pain to look at manually for every possibly SNP in every possible sample, so I was hoping that joint calling would make this a lot easier.

    Thanks,
    Gavin

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @Gavin_Sherlock

    The VariantRecalibrator requires a large amount of sites and a well-curated genome to operate correctly, but there is a workaround using hard filtering. This is similar to the answer to your other post on the forum, so I think by researching this method, you can solve both of these issues.

Sign In or Register to comment.