CombineGVCFs error in GATK 4.0.1.2

prasundutta87prasundutta87 EdinburghMember
edited March 2018 in Ask the GATK team

Hi,

I am aware that some people have faced this error, but their reports are from older versions of GATK, and I am not sure whether they apply to the version I am using (4.0.1.2 with Java 1.8.0_74). I am getting these errors:

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:190)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:162)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:227)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:251)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.lambda$new$0(MultiVariantDataSource.java:89)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.<init>(MultiVariantDataSource.java:88)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:71)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:47)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:558)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:48)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181)
... 16 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at htsjdk.tribble.index.interval.IntervalTree.insert(IntervalTree.java:57)
at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:223)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:404)
at htsjdk.tribble.index.interval.IntervalTreeIndex.<init>(IntervalTreeIndex.java:53)
at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:162)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:227)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:251)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.lambda$new$0(MultiVariantDataSource.java:89)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource$$Lambda$59/1292784864.accept(Unknown Source)
... 12 more

The command I am running is:

java -Xmx200g -jar /exports/eddie3_homes_local/s0928794/tools/gatk-package-4.0.1.2-local.jar CombineGVCFs -R GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa --variant All_gvcfs.list -O combined_81.g.vcf.gz

The All_gvcfs.list contains absolute paths to 81 GVCF files of varied sizes (24-106 GB) generated by HaplotypeCaller in GATK 4.0.1.2. For example:

/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Lodi_female_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Pandharpuri_female_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Lodi_male_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Bhadawari_male_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Surti-214_10x.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Jaffrabadi-548_10x.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Bhadhwari-B254_10x.g.vcf

.....Total 81 GVCFs

I tested many Java heap sizes. I started from 8G, but not all files were being read by VCFCodec. When I gave 200G, it read all of them, but the above error came just as the traversal was about to start.

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Can you try GenomicsDBImport instead of CombineGVCFs? It is faster and much more reliable than CombineGVCFs. I shared 2 Perl scripts to ease the procedure for all contigs here

  • prasundutta87prasundutta87 EdinburghMember

    Thanks @SkyWarrior for sharing the post..it was helpful..let me try it out..I had checked GenomicsDBImport but then I came across the caveat that it can work with only one contig at a time..I have approx. 367000 contigs (from a draft assembly)..the script may be helpful..
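
The one-contig-at-a-time approach mentioned above could be sketched roughly as follows. This is not the shared Perl script; it is a minimal dry-run sketch that only prints one GenomicsDBImport command per contig. The function name `print_import_commands`, the `GATK_JAR` variable, and the workspace naming scheme are hypothetical, and it assumes the reference has a `.fai` index (contig name in the first column) and that the GVCF paths contain no whitespace.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one GenomicsDBImport command per contig.
# Hypothetical names (not from the thread): GATK_JAR, print_import_commands,
# and the "<out_dir>/<contig>" workspace layout.

GATK_JAR="${GATK_JAR:-gatk-package-4.0.1.2-local.jar}"

print_import_commands() {
    local fai="$1" gvcf_list="$2" out_dir="$3"
    local variant_args="" gvcf contig

    # Build one "-V <path>" argument per non-empty line of the list file.
    while IFS= read -r gvcf || [ -n "$gvcf" ]; do
        if [ -n "$gvcf" ]; then
            variant_args="$variant_args -V $gvcf"
        fi
    done < "$gvcf_list"

    # The first column of a .fai file holds the contig names.
    cut -f1 "$fai" | while IFS= read -r contig || [ -n "$contig" ]; do
        if [ -n "$contig" ]; then
            echo "java -jar $GATK_JAR GenomicsDBImport$variant_args" \
                 "--genomicsdb-workspace-path $out_dir/$contig -L $contig"
        fi
    done
}
```

A call such as `print_import_commands GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa.fai All_gvcfs.list genomicsdb > import_commands.sh` would write the commands to a file for review. With roughly 367000 contigs that is one command (and one workspace) per contig, so it is probably worth inspecting the output before running any of it.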

  • prasundutta87prasundutta87 EdinburghMember
    edited March 2018

    Just a minor question, you have added a reference sequence in the import script..but in the example script of GenomicsDBImport, -R is not used..is there any reason for this difference, as in does adding the reference sequence make a difference in the program output?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @prasundutta87
    Hi,

    First, thanks @SkyWarrior for sharing the scripts. I am not sure if the reference makes a difference in GenomicsDBImport, but I have only run it without a reference. I would think adding the reference would add more time to load it, but perhaps @SkyWarrior has more insight.

    As for your CombineGVCFs error, that usually arises from bad indices. Can you try re-generating the GVCF indices?

    -Sheila

  • prasundutta87prasundutta87 EdinburghMember
    edited March 2018

    Thanks @Sheila ..bad indices? HaplotypeCaller generated them automatically after creating GVCFs..how do I create new indices of the GVCF files? Will tabix index do?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    They are tabix indices. Just use tabix -p vcf filename.g.vcf.gz

    I didn't pay attention to the reference scripts. But -R is optional for GenomicsDBImport, so I added it. I will check whether that matters at all for the speed. BTW, I forgot to add the -new-qual switch to the script that I shared, but normally I use -new-qual, which uses a new algorithm to calculate AFs.
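
Since the GVCFs listed in the question are uncompressed `.g.vcf` files and tabix works on bgzip-compressed input, the tabix command above would need a bgzip step first. A minimal sketch of doing both for every file in a list, assuming `bgzip` and `tabix` from htslib are on the PATH; the function name `compress_and_index` and the `DRY_RUN` switch are hypothetical helpers for illustration:

```shell
#!/usr/bin/env bash
# Sketch: bgzip-compress each GVCF named in a list file, then tabix-index it.
# Hypothetical names: compress_and_index, DRY_RUN (set to 1 to only print commands).

compress_and_index() {
    local gvcf_list="$1" gvcf
    while IFS= read -r gvcf || [ -n "$gvcf" ]; do
        # Skip blank lines in the list.
        [ -n "$gvcf" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bgzip -c $gvcf > $gvcf.gz"
            echo "tabix -p vcf $gvcf.gz"
        else
            # -c writes to stdout, so the original file is left untouched.
            if ! bgzip -c "$gvcf" > "$gvcf.gz"; then
                return 1
            fi
            # Writes the index next to the file as <name>.g.vcf.gz.tbi.
            tabix -p vcf "$gvcf.gz" || return 1
        fi
    done < "$gvcf_list"
}
```

Running `DRY_RUN=1 compress_and_index All_gvcfs.list` first shows what would be executed. Because `bgzip -c` keeps the uncompressed original, disk usage temporarily grows until the originals are removed by hand.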

  • prasundutta87prasundutta87 EdinburghMember

    So I believe I need to delete the existing .idx files, make .tbi files..hopefully, it won't hamper any downstream tool usage..Is it necessary to rename .tbi to .idx?

    Thanks for mentioning -new-qual..

  • prasundutta87prasundutta87 EdinburghMember
    edited March 2018

    Thanks for the tips @SkyWarrior..I will go with bgzip and tabix indexing..hopefully things will progress from there..space is indeed important for me currently..will delete the .idx files as well..since all my 81 .idx files are present, I believe I don't need to run ValidateVariants on each GVCF file to check their integrity..
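
The presence of an `.idx` file does not by itself show that the index or the GVCF is intact. A quick sanity check before deleting anything could look like the sketch below. It only reports indices that are missing, empty, or older than their GVCF, and it cannot detect a truncated or corrupt file. The function name `check_indices` and the report labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag GVCFs whose .idx file is missing, empty, or older than the GVCF.
# Hypothetical names: check_indices and the MISSING_OR_EMPTY / STALE labels.
# This is only a file-level check; it does not validate file contents.

check_indices() {
    local gvcf_list="$1" gvcf idx status=0
    while IFS= read -r gvcf || [ -n "$gvcf" ]; do
        # Skip blank lines in the list.
        [ -n "$gvcf" ] || continue
        idx="$gvcf.idx"
        if [ ! -s "$idx" ]; then
            echo "MISSING_OR_EMPTY $idx"
            status=1
        elif [ "$gvcf" -nt "$idx" ]; then
            # The GVCF was modified after its index was written.
            echo "STALE $idx"
            status=1
        fi
    done < "$gvcf_list"
    return "$status"
}
```

`check_indices All_gvcfs.list` prints nothing and returns 0 when every listed file has a non-empty index that is at least as new as the GVCF.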
