To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

VQSR Runtime Error

Hi,

I am trying to build a variant truth set for Macaque based on validated dbsnp entries to train VQSR with. This has required a lot of formatting and I'm wondering if I can get some advice.

I have written a script that writes entries from dbsnp into a VCF file. My file currently looks like this:

fileformat=VCFv4.1

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

INFO=Ferguson Generated from bed file: Cyno_ferguson_70.bed and altref file: cyno_ferguson_70_mapped.refalt

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ferguson

16 32245403 rs55621530 A T 200 . NONE GT:AD 1/1:1000,1000
16 32245243 rs55621531 T C 200 . NONE GT:AD 1/1:1000,1000
16 32245478 rs55621532 C G 200 . NONE GT:AD 1/1:1000,1000
16 32245023 rs55621533 G A 200 . NONE GT:AD 1/1:1000,1000
16 32244999 rs55621534 A G 200 . NONE GT:AD 1/1:1000,1000
16 32245437 rs55621535 A G 200 . NONE GT:AD 1/1:1000,1000
16 32245436 rs55621536 T C 200 . NONE GT:AD 1/1:1000,1000
16 30670323 rs55621537 T C 200 . NONE GT:AD 1/1:1000,1000
16 30670352 rs55621538 T C 200 . NONE GT:AD 1/1:1000,1000
16 30670373 rs55621539 G A 200 . NONE GT:AD 1/1:1000,1000
16 30670417 rs55621540 G A 200 . NONE GT:AD 1/1:1000,1000
16 30670488 rs55621541 C T 200 . NONE GT:AD 1/1:1000,1000
16 30670556 rs55621542 C A 200 . NONE GT:AD 1/1:1000,1000
16 30670565 rs55621543 A C 200 . NONE GT:AD 1/1:1000,1000
16 30670620 rs55621544 A G 200 . NONE GT:AD 1/1:1000,1000
16 30670682 rs55621545 T G 200 . NONE GT:AD 1/1:1000,1000
16 30670690 rs55621546 A G 200 . NONE GT:AD 1/1:1000,1000
16 30670715 rs55621547 T G 200 . NONE GT:AD 1/1:1000,1000
16 30670722 rs55621548 C T 200 . NONE GT:AD 1/1:1000,1000
2 101214264 rs55621549 A T 200 . NONE GT:AD 1/1:1000,1000
2 101214002 rs55621550 G A 200 . NONE GT:AD 1/1:1000,1000
2 101213888 rs55621551 G A 200 . NONE GT:AD 1/1:1000,1000
2 100913098 rs55621552 G A 200 . NONE GT:AD 1/1:1000,1000
2 100912772 rs55621553 T C 200 . NONE GT:AD 1/1:1000,1000
2 100912982 rs55621554 C T 200 . NONE GT:AD 1/1:1000,1000
14 32210259 rs55621555 G A 200 . NONE GT:AD 1/1:1000,1000
14 32210253 rs55621556 T C 200 . NONE GT:AD 1/1:1000,1000
14 32210248 rs55621557 A T 200 . NONE GT:AD 1/1:1000,1000
14 32210191 rs55621558 G A 200 . NONE GT:AD 1/1:1000,1000
14 32210022 rs55621559 A G 200 . NONE GT:AD 1/1:1000,1000
14 32210015 rs55621560 G T 200 . NONE GT:AD 1/1:1000,1000
6 149748328 rs55621561 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748319 rs55621562 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748305 rs55621563 C T 200 . NONE GT:AD 1/1:1000,1000
6 149748257 rs55621564 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748230 rs55621565 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748196 rs55621566 T C 200 . NONE GT:AD 1/1:1000,1000
6 149748173 rs55621567 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748104 rs55621568 G A 200 . NONE GT:AD 1/1:1000,1000
6 149748006 rs55621569 T G 200 . NONE GT:AD 1/1:1000,1000
9 43016196 rs55621575 T C 200 . NONE GT:AD 1/1:1000,1000
9 43016042 rs55621576 C T 200 . NONE GT:AD 1/1:1000,1000
9 43015988 rs55621577 T C 200 . NONE GT:AD 1/1:1000,1000
11 67194545 rs55621581 G A 200 . NONE GT:AD 1/1:1000,1000
11 67194426 rs55621582 C T 200 . NONE GT:AD 1/1:1000,1000
11 67194347 rs55621583 T C 200 . NONE GT:AD 1/1:1000,1000
11 67194301 rs55621584 T C 200 . NONE GT:AD 1/1:1000,1000
11 67194269 rs55621585 G A 200 . NONE GT:AD 1/1:1000,1000
11 120994383 rs55621586 C T 200 . NONE GT:AD 1/1:1000,1000
11 120994351 rs55621587 A G 200 . NONE GT:AD 1/1:1000,1000
11 120994301 rs55621588 G A 200 . NONE GT:AD 1/1:1000,1000
11 120994267 rs55621589 C T 200 . NONE GT:AD 1/1:1000,1000
11 120994191 rs55621590 C T 200 . NONE GT:AD 1/1:1000,1000
11 120994162 rs55621591 C T 200 . NONE GT:AD 1/1:1000,1000
11 120994159 rs55621592 G A 200 . NONE GT:AD 1/1:1000,1000
15 19724502 rs55621593 G A 200 . NONE GT:AD 1/1:1000,1000
15 19724413 rs55621594 G A 200 . NONE GT:AD 1/1:1000,1000
15 19724387 rs55621595 C A 200 . NONE GT:AD 1/1:1000,1000
15 19724374 rs55621596 A C 200 . NONE GT:AD 1/1:1000,1000
15 19724220 rs55621597 T A 200 . NONE GT:AD 1/1:1000,1000
15 19724213 rs55621598 G C 200 . NONE GT:AD 1/1:1000,1000
15 19724092 rs55621599 C T 200 . NONE GT:AD 1/1:1000,1000

When I try to run VQSR using this vcf as a truth set, I get thrown an error. I am using a nightly build of GATK because of a previous issue with htsjdk I ran into in the past, the error looks similar to this past issue (http://gatkforums.broadinstitute.org/discussion/6128/gatk-selectvariants-runtime-error#latest)

Command: :
java -Xmx8g -jar ~/GenomeAnalysisTK-nightly-2015-12-17-gec72eac/GenomeAnalysisTK.jar \
-T VariantRecalibrator -R ../../../174\ WGS:WES/0.174k\ macaque\ genome\ project/ref/mafa5/mafa5.fa \
-input ../../../174\ WGS:WES/0.174k\ macaque\ genome\ project/01-unfiltered\ variants/16557.all.exons.mafa5.ann.vcf \
-resource:ferguson,known=false,training=true,truth=true,prior=15.0 16797.ferguson.mafa5.mcm.validatedvcf \
-an QD -an MQ -an MQRankSum \
-mode SNP \
-recalFile test.recal \
-tranchesFile test.tranches \
-rscriptFile test.plots.R

Error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
at htsjdk.variant.vcf.VCFCompoundHeaderLine.(VCFCompoundHeaderLine.java:168)
at htsjdk.variant.vcf.VCFInfoHeaderLine.(VCFInfoHeaderLine.java:48)
at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:199)
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:413)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:401)
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:441)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:327)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:264)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:153)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:828)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2015-12-17-gec72eac):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

Do you know why this is happening? When I run VQSR against another truth set that was called using haplotypecaller and genotypegvcfs I don't run into this problem which is why I believe it has something to do with the format of my dbsnp VCF. I'm hoping to understand how to format VCFs properly to run with VQSR in order to build this set out with more validated entries from dbsnp. Any advice would be greatly appreciated.

Thank you,
Trent

Tagged:

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @trentprall
    Hi Trent,

    There is a tool called VariantsToVCF that might help you. I think you have a bed file so it should work.

    If the VariantsToVCF tool does not work with your truth file, you can try adding in the sequence contigs from your reference. It looks like that is missing in your header.

    -Sheila

  • trentpralltrentprall MadisonMember

    Thanks for the responce Sheila. I appended sequence contigs to the vcf and sorted it in order. I am still thrown the same error. I also cannot get VariantsToVCF to work on Bed files downloaded from dbsnp. Can you give me an example bed file that will be properly converted to VCF by VariantsToVCF?

  • It's hard to tell because of the forum formatting, but the INFO line in the header of your VCF looks malformed. It should look ver similar to the FORMAT lines, with an ID, Number, Type, and Description

  • trentpralltrentprall MadisonMember

    @pdexheimer @Shelia
    I appended sequence contigs to the header and added a correctly formatted info line as well. I then sorted to reads using: java -jar ~/picard.jar which sorts according to the sequence header.

    My new vcf looks like this:

    fileformat=VCFv4.2

    FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

    FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

    INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

    contig=<ID=1,length=227556264>

    contig=<ID=2,length=192460366>

    contig=<ID=3,length=192294377>

    contig=<ID=4,length=170955103>

    contig=<ID=5,length=189454096>

    contig=<ID=6,length=181584905>

    contig=<ID=7,length=171882078>

    contig=<ID=8,length=146850525>

    contig=<ID=9,length=133195287>

    contig=<ID=10,length=96509753>

    contig=<ID=11,length=137757926>

    contig=<ID=12,length=132586672>

    contig=<ID=13,length=111193037>

    contig=<ID=14,length=130733371>

    contig=<ID=15,length=112612857>

    contig=<ID=16,length=80997621>

    contig=<ID=17,length=96864807>

    contig=<ID=18,length=75711847>

    contig=<ID=19,length=59248254>

    contig=<ID=20,length=78541002>

    contig=<ID=MT,length=16575>

    contig=<ID=X,length=152835861>

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ferguson

    2 100912772 rs55621553 T C 200 . AC=2 GT:AD 1/1:1000,1000
    2 100912982 rs55621554 C T 200 . AC=2 GT:AD 1/1:1000,1000
    2 100913098 rs55621552 G A 200 . AC=2 GT:AD 1/1:1000,1000
    2 101213888 rs55621551 G A 200 . AC=2 GT:AD 1/1:1000,1000
    2 101214002 rs55621550 G A 200 . AC=2 GT:AD 1/1:1000,1000
    2 101214264 rs55621549 A T 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748006 rs55621569 T G 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748104 rs55621568 G A 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748173 rs55621567 G A 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748196 rs55621566 T C 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748230 rs55621565 G A 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748257 rs55621564 G A 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748305 rs55621563 C T 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748319 rs55621562 G A 200 . AC=2 GT:AD 1/1:1000,1000
    6 149748328 rs55621561 G A 200 . AC=2 GT:AD 1/1:1000,1000
    9 43015988 rs55621577 T C 200 . AC=2 GT:AD 1/1:1000,1000
    9 43016042 rs55621576 C T 200 . AC=2 GT:AD 1/1:1000,1000
    9 43016196 rs55621575 T C 200 . AC=2 GT:AD 1/1:1000,1000
    11 67194269 rs55621585 G A 200 . AC=2 GT:AD 1/1:1000,1000
    11 67194301 rs55621584 T C 200 . AC=2 GT:AD 1/1:1000,1000
    11 67194347 rs55621583 T C 200 . AC=2 GT:AD 1/1:1000,1000
    11 67194426 rs55621582 C T 200 . AC=2 GT:AD 1/1:1000,1000
    11 67194545 rs55621581 G A 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994159 rs55621592 G A 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994162 rs55621591 C T 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994191 rs55621590 C T 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994267 rs55621589 C T 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994301 rs55621588 G A 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994351 rs55621587 A G 200 . AC=2 GT:AD 1/1:1000,1000
    11 120994383 rs55621586 C T 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210015 rs55621560 G T 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210022 rs55621559 A G 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210191 rs55621558 G A 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210248 rs55621557 A T 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210253 rs55621556 T C 200 . AC=2 GT:AD 1/1:1000,1000
    14 32210259 rs55621555 G A 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724092 rs55621599 C T 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724213 rs55621598 G C 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724220 rs55621597 T A 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724374 rs55621596 A C 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724387 rs55621595 C A 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724413 rs55621594 G A 200 . AC=2 GT:AD 1/1:1000,1000
    15 19724502 rs55621593 G A 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670323 rs55621537 T C 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670352 rs55621538 T C 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670373 rs55621539 G A 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670417 rs55621540 G A 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670488 rs55621541 C T 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670556 rs55621542 C A 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670565 rs55621543 A C 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670620 rs55621544 A G 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670682 rs55621545 T G 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670690 rs55621546 A G 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670715 rs55621547 T G 200 . AC=2 GT:AD 1/1:1000,1000
    16 30670722 rs55621548 C T 200 . AC=2 GT:AD 1/1:1000,1000
    16 32244999 rs55621534 A G 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245023 rs55621533 G A 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245243 rs55621531 T C 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245403 rs55621530 A T 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245436 rs55621536 T C 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245437 rs55621535 A G 200 . AC=2 GT:AD 1/1:1000,1000
    16 32245478 rs55621532 C G 200 . AC=2 GT:AD 1/1:1000,1000

    And I'm thrown this ERROR:

    ERROR MESSAGE: Input files fergusen and reference have incompatible contigs: The contig order in fergusen and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR fergusen contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 3, 4, 5, 6, 7, 8, 9, MT, X]
    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, MT, X]

    How is this possible, Ive sorted the contigs in ascending order as specified in the header. Additionally, I went back and sorted the using a sequence dictionary of the same reference I'm calling VQSR against. What is happening??

    Thanks again,

    Trent

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Trent, try deleting the vcf index file. It might not have been updated following the modifications you made.

  • trentpralltrentprall MadisonMember

    @Geraldine_VdAuwera I don't have a vcf index file for this VCF, could that be the issue?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No, GATK should regenerate an index for you automatically.

    How did you sort the vcf?

  • trentpralltrentprall MadisonMember

    @Geraldine_VdAuwera I sorted the VCF using Picard's SortVcf, both using a reference dictionary for mafa5 (same reference called in the VQSR command and generated by picard) and sorting based on the sequence contigs in the header. In both cases the same error is thrown.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    We think this might be a Picard bug as someone else has reported a similar problem. @Sheila will be in touch shortly to request a formal bug report (test files so we can reproduce and troubleshoot the problem locally).

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @trentprall
    Hi Trent,

    Indeed, another user has reported an issue similar to yours. We suspect it is related to Picard's SortVcf. Would you submit a bug report so we can debug locally? Instructions are here.

    Thanks,
    Sheila

  • trentpralltrentprall MadisonMember

    @Shelia @Geraldine_VdAuwera I will submit a bug report for future convenience.
    I eventually was able to get GATK to accept the VCF by sorting the positions using python rather than Picard and using a nightly build of GATK.jar which allows the non-human reference to be used as discovered in my previous issue.

Sign In or Register to comment.