Error with VariantAnnotator

Hi,
I am having a problem with VariantAnnotator.

I am running the command:
java1.7 -jar /usr/local/packages/GATK3/GenomeAnalysisTK.jar \
-R /projects4/ruth/Burkholderia/cutadapt/cenocepacia/B_cenocepacia_J2315.fasta \
-T VariantAnnotator \
-I N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted.RG.bam \
-o output2.vcf \
-V N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf \
-A AlleleBalance \
-A BaseCounts \
-A Coverage \
-A FisherStrand \
-A GenotypeSummaries \
-A LowMQ \
-A RMSMappingQuality \
-A AlleleBalanceBySample

The .vcf file was made using GATK HaplotypeCaller.

The error I get is:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0
at org.broadinstitute.gatk.tools.walkers.annotator.AlleleBalanceBySample.annotateWithPileup(AlleleBalanceBySample.java:127)
at org.broadinstitute.gatk.tools.walkers.annotator.AlleleBalanceBySample.annotate(AlleleBalanceBySample.java:113)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateGenotypes(VariantAnnotatorEngine.java:420)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:216)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:312)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:85)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: 0
ERROR ------------------------------------------------------------------------------------------

Interestingly, the error does not occur when I run exactly the same command with a vcf file created using samtools (and the same bam file).

I have installed the most up-to-date version of GATK (3.4-46) in case that was causing the error, but that does not seem to be the answer.

Any suggestions about what is causing this error would be greatly appreciated.

Thanks,

Ruth

Comments

  • Sheila, Broad Institute (Member, Broadie, Moderator)
    edited November 2015

    @Ruth
    Hi Ruth,

    I just ran the same command on some of my own test files and it ran with no error. Can you please try validating your bam file with ValidateSamFile and your VCF file with ValidateVariants?

    ValidateSamFile: http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile
    ValidateVariants: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php

    Thanks,
    Sheila

  • Hi,
    Thanks for getting back to me.

    You are right, the vcf file does fail validation - with the error:

    ERROR MESSAGE: File /projects4/ruth/Burkholderia/cutadapt/cenocepacia/bowtie/N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf fails strict validation: one or more of the ALT allele(s) for the record at position B_cenocepacia_J2315.fasta:5487 are not observed at all in the sample genotypes

    However, I am a bit confused about this, as I created the vcf file using GATK and the command:
    gatk3 \
    -T HaplotypeCaller \
    -R /projects4/ruth/Burkholderia/cutadapt/cenocepacia/B_cenocepacia_J2315.fasta \
    -I N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted.RG.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -o N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf

    This appeared to run without any error, so why would this vcf fail validation?

    Thanks

    Ruth

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Ah, that's actually a development oversight -- ValidateVariants was not updated to interpret the NON_REF allele correctly in gVCFs.

    So this validation error does not explain the problem you're seeing. What happens if you run the gVCF through GenotypeGVCFs then run VariantAnnotator on that?

    Ultimately, the gVCF file is not meant to be used as an end product, so it is possible that VariantAnnotator is also choking on it for some reason that we don't yet understand. It would be especially helpful if you could narrow down the error to a particular record or subset of records. If the problem is linked to the gVCF format, running on just a tiny region should still reproduce the error.
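
    A region-restricted rerun could be sketched roughly as below. This is only an illustration: the paths are copied from the original post, while the function name `annotate_region`, the output name `region_test.vcf`, the plain `java` launcher, and the interval passed to `-L` are placeholders chosen for the example. The function prints the command instead of running it, so it can be checked first.

    ```shell
    # Paths copied from the original post; adjust to your own setup.
    GATK_JAR=/usr/local/packages/GATK3/GenomeAnalysisTK.jar
    REF=/projects4/ruth/Burkholderia/cutadapt/cenocepacia/B_cenocepacia_J2315.fasta
    BAM=N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted.RG.bam
    GVCF=N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf

    # Print a VariantAnnotator command restricted to one interval (-L) and to
    # the single annotation named in the stack trace (AlleleBalanceBySample).
    # region_test.vcf is a hypothetical output name.
    # Remove "echo" to actually execute the command.
    annotate_region() {
        interval=$1
        echo java -jar "$GATK_JAR" \
            -T VariantAnnotator \
            -R "$REF" \
            -I "$BAM" \
            -V "$GVCF" \
            -A AlleleBalanceBySample \
            -L "$interval" \
            -o region_test.vcf
    }
    ```

    Calling `annotate_region "CONTIG_NAME:1-10000"` (with a real contig name from the reference in place of the placeholder) and then halving the interval each time the error still appears should eventually isolate the offending record.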

  • Hi,

    Yes, it seems to work OK if I run GenotypeGVCFs first. I supplied a single --variant argument (my vcf file).
    However, running this loses a lot of information, as my vcf file now has far fewer lines (57985 as opposed to 554164).

    Is there any way to fix this that does not involve shortening my vcf file?

    Thanks,

    Ruth

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @Ruth, the final VCF is expected to be much shorter than the GVCF file -- this is completely normal and not a cause for concern. If this does not make sense to you, please read the explanation of the workflow that is given in the Best Practices documentation.

    You can have GenotypeGVCFs emit all sites including non-variant sites if that's what you want (see the tool documentation for available arguments), but it will take up a lot more space and may not be useful, depending on what you want to analyze.
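
    As a rough sketch, assuming GATK 3.x, where to the best of my knowledge the relevant argument is `--includeNonVariantSites` (short form `-allSites`), the call might look like the following. Please confirm the argument name against the GenotypeGVCFs documentation for your version. The paths are taken from the commands earlier in this thread; the function name `genotype_all_sites` and the output file name are made up for the example. The function prints the command for review rather than running it.

    ```shell
    # Paths copied from earlier posts in this thread; adjust as needed.
    GATK_JAR=/usr/local/packages/GATK3/GenomeAnalysisTK.jar
    REF=/projects4/ruth/Burkholderia/cutadapt/cenocepacia/B_cenocepacia_J2315.fasta
    GVCF=N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf

    # Print a GenotypeGVCFs command that also emits non-variant sites.
    # $1 is the output VCF name (hypothetical; pick your own).
    # Remove "echo" to actually execute the command.
    genotype_all_sites() {
        echo java -jar "$GATK_JAR" \
            -T GenotypeGVCFs \
            -R "$REF" \
            --variant "$GVCF" \
            --includeNonVariantSites \
            -o "$1"
    }
    ```

    For example, `genotype_all_sites all_sites.vcf` prints the full command line. Expect the output to have roughly one record per reference position, so it will be much larger than a variants-only VCF.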

  • Hi,

    My aim is to generate a complete genome sequence for each of my samples. I don't want to assume the reference sequence for all positions that are not SNVs, so I wanted to generate a gVCF file, so that I can then run the VariantAnnotator on every position, so I know which non-SNV positions are low confidence as well as SNV positions. Then I plan to run FastaAlternateReferenceMaker with a SNP mask on all sites that fail specific filters identified by the VariantAnnotator, whether they are SNVs or not.

    Thank you for pointing me to the best practices, as by reading through, I have realised that perhaps HaplotypeCaller with the option --emitRefConfidence BP_RESOLUTION is more appropriate for my question. However, this produces the same error when I put the output into VariantAnnotator.

    Do you know of any way I can annotate a vcf file with positions for the whole genome?

    Thanks,

    Ruth

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Ah, I see, that makes sense. Then what you need is definitely to run GenotypeGVCFs with -allSites output (be sure to check the argument name, I may have misspelled it). Non-variant sites will include the RGQ annotation which gives you an estimate of reference genotype confidence that you can use to filter and then mask out low-confidence ref sites.
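
    The filtering step could be sketched along these lines. The assumptions are: a single-sample VCF (sample in column 10), RGQ reported as a per-sample FORMAT field, and a cutoff of 20 that is purely illustrative and not a recommended value. The function name `low_rgq_sites` is made up for the example. It only lists the low-confidence positions; converting that list into whatever mask format your downstream tool accepts is left open.

    ```shell
    # Read a single-sample VCF on stdin and print "contig<TAB>position<TAB>RGQ"
    # for every record whose RGQ is below the given threshold.
    # $1 is the minimum acceptable RGQ (defaults to 20, an arbitrary example).
    low_rgq_sites() {
        awk -F'\t' -v min_rgq="${1:-20}" '
            /^#/ { next }                      # skip header lines
            {
                n = split($9, keys, ":")       # FORMAT keys, e.g. GT:AD:DP:RGQ
                split($10, vals, ":")          # matching values for the sample
                for (i = 1; i <= n; i++) {
                    if (keys[i] == "RGQ") {
                        # Records without RGQ (variant sites) never reach here.
                        if (vals[i] != "." && vals[i] + 0 < min_rgq + 0) {
                            printf "%s\t%s\t%s\n", $1, $2, vals[i]
                        }
                        break
                    }
                }
            }'
    }
    ```

    Usage would be something like `low_rgq_sites 20 < all_sites.vcf > low_confidence_ref_sites.txt`, where both file names are placeholders. Variant records carry GQ rather than RGQ, so they are skipped here and need their own filters.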

  • That is exactly what I need. Thank you very much for your help.
    Ruth
