
VCF validation...

willpitchers, Michigan State U, Member

Dear All,

We have a pile of full-genome sequences from fish genomes, and we're trying to call variants so that we can look for P-G associations. Given the genetic distance between some of our samples, we're trying to genotype both using HaplotypeCaller (genotype & call), and using HaplotypeCaller followed by GenotypeGVCFs . We hope that comparing the two approaches will help us work out which is doing the better job.

My problem is that I have to parallelise in order to get my jobs to run on a shared computing cluster. I've been doing this by running HaplotypeCaller & GenotypeGVCFs on ~1 Mbp 'slices' of the genome, specified using the '-L' intervals flag, then 'sticking' together a full-genome VCF file using the CatVariants tool. However, I'm seeing error messages like the one below when I then try to use the VariantsToBinaryPed tool to convert our VCFs for analysis using PLINK.

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 24252683: there are 62 genotypes while the header requires that 63 genotypes be present for all records at Scaffold167:15

I can track down the 'slice' on which the genotype is missing, remake that file, then re-run CatVariants to rebuild the big VCF, but then I get another similar error when I get back to the VariantsToBinaryPed step. Since slicing the genome gives me ~2000 VCF files to concatenate with CatVariants – and I've been iterating this error-discovery process for a few days now – I'd really like to be able to programmatically discover errors in the VCFs, but the ValidateVariants tool does not seem to catch them (I suspect that it might not be able to, since we have no a priori 'dbsnp' file to provide).

Can anyone suggest another way to flag these errors, or perhaps to make VariantsToBinaryPed more error-tolerant?

Many thanks!
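
Editor's note: the specific failure in the error message above (a record with fewer genotype columns than the header declares) can be detected without GATK by counting tab-separated fields. The following is a minimal sketch, not a GATK feature; the function names `open_vcf` and `find_bad_records` are hypothetical, and it assumes uncompressed or gzip-compressed text VCFs with the standard nine fixed columns (CHROM through FORMAT) before the sample columns.

```python
import gzip
from typing import List, Tuple

FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def open_vcf(path: str):
    """Open a plain-text or gzip-compressed VCF for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def find_bad_records(path: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, genotypes_found, genotypes_expected) for every
    record whose genotype count differs from the #CHROM header line."""
    expected = None
    bad = []
    with open_vcf(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line or line.startswith("##"):
                continue  # skip blank lines and meta-information lines
            fields = line.split("\t")
            if line.startswith("#CHROM"):
                # The header line fixes how many samples every record needs.
                expected = len(fields) - FIXED_COLUMNS
                continue
            if expected is None:
                raise ValueError(f"{path}: record found before #CHROM header")
            found = len(fields) - FIXED_COLUMNS
            if found != expected:
                bad.append((line_number, found, expected))
    return bad
```

Running `find_bad_records` over each slice before concatenation would report every short record in one pass, including a final line truncated by a job that was killed mid-write, rather than one error per VariantsToBinaryPed attempt.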

Answers

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin
    Hi there, if I were you my priority would be to discover why these errors are happening and fix the cause, rather than trying to patch up the results. It sounds like some of your scatter jobs are failing randomly. How are you parallelizing execution? Do you have a way to check for success at that stage?
  • willpitchers, Michigan State U, Member

    Hey @Geraldine_VdAuwera – I agree completely! We'd much rather have a pipeline that works or at the least tells me when it fails...
    I've tried to use the ValidateVariants tool: my script reads an interval from a list (passed with the intervals flag, e.g. -L Scaffold6:5000001-5750215), runs HaplotypeCaller to generate the .g.vcf for that interval, then runs ValidateVariants on the output file.

    However, because we're working in a non-model system, we do not have a reliable dbsnp file, which limits what ValidateVariants can check for (unless I've misunderstood) and the --validateGVCF flag throws an error message... has this feature been added recently? (the cluster that we're running on has GATK/3.5.0 installed).

    Will

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin
    Yes, GVCF validation was added recently -- in 3.6 if memory serves.
  • willpitchers, Michigan State U, Member

    Thanks for the hint about the version, @Geraldine_VdAuwera. I am trying to get version 3.6 up and running now.

    I have tried running the 3.6 version of ValidateVariants twice on the same set of files; each time it reports "There are no warn messages" when I run without the -gvcf flag, but with it I get the following error:

    ```
    ##### ERROR --

    ERROR stack trace

    java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.variantutils.ValidateVariants.onTraversalDone(ValidateVariants.java:255)
    at org.broadinstitute.gatk.tools.walkers.variantutils.ValidateVariants.onTraversalDone(ValidateVariants.java:126)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Code exception (see stack trace for error itself)
    ERROR ------------------------------------------------------------------------------------------
    ```

    Is this a known issue?

    W.

  • Sheila, Broad Institute, Member, Broadie ✭✭✭✭✭

    @willpitchers
    Hi W,

    What is the exact command you are running to get the error?

    Thanks,
    Sheila

  • willpitchers, Michigan State U, Member

    Hey @Sheila

    java -Xmx30g -jar /mnt/home/pitchers/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref} -V APA_6675/APA_6675_10_2016_slice_100.g.vcf --warnOnErrors --validationTypeToExclude ALLELES

    ...works fine, but:

    java -Xmx30g -jar /mnt/home/pitchers/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref} -V APA_6675/APA_6675_10_2016_slice_100.g.vcf --warnOnErrors --validationTypeToExclude ALLELES -gvcf

    ...returns:

    ```
    ##### ERROR stack trace
    java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.variantutils.ValidateVariants.onTraversalDone(ValidateVariants.java:255)
    at org.broadinstitute.gatk.tools.walkers.variantutils.ValidateVariants.onTraversalDone(ValidateVariants.java:126)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209)
    ```

    thanks, Will

    Issue · Github, by Sheila: Issue Number 1375, State closed, Closed By ronlevine
  • Sheila, Broad Institute, Member, Broadie ✭✭✭✭✭

    @willpitchers
    Hi Will,

    Hmm. I just tested this myself and got the same error message. Let me talk to the team and get back to you.

    -Sheila
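
Editor's note: on the earlier question in this thread about checking for success at the scatter stage, one possible approach is to wrap each per-slice command and treat a slice as done only if the process exits with status 0 and its output file exists and is non-empty. This is a sketch under assumptions: `run_and_check` and `failed_slices` are hypothetical names, the command for each slice is whatever the pipeline already runs, and it assumes the tool signals failure through a non-zero exit status. A job killed by the cluster scheduler may still need to be detected through the scheduler's own logs.

```python
import os
import subprocess
from typing import List, Sequence, Tuple

# (slice name, command to run, path of the file the command should produce)
Job = Tuple[str, Sequence[str], str]


def run_and_check(command: Sequence[str], output_path: str) -> bool:
    """Run one scatter job and report whether it looks successful:
    exit status 0 and a non-empty output file."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if result.returncode != 0:
        return False
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0


def failed_slices(jobs: Sequence[Job]) -> List[str]:
    """Run every job in order and return the names of slices to redo."""
    return [
        name
        for name, command, output_path in jobs
        if not run_and_check(command, output_path)
    ]
```

Collecting the failed slice names in one list makes it possible to rerun only those intervals before concatenating, instead of discovering them one at a time downstream.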
