GenotypeGVCFs GATK 4.1.0.0 error bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.

cmt · Seattle, WA · Member

Hello!

I am trying to run GenotypeGVCFs on a set of 12 pooled-seq whole-genome samples (50 fish per pool, modeled with a ploidy of 20 in GATK). Unfortunately, I have had recurrent Java memory issues (I think!) and one error that I can't figure out. I originally posted a version of this question at gatk/discussion/13430 but didn't pursue it because I thought it was tied to other problems I was having with GenomicsDBImport. Since then I have tried a couple of things, but I still get the error.

Here is a bit about my analysis so far. I have pre-processed BAM files (following GATK best practices) for each of my 12 pools. I'm working on a non-model organism, so I don't have a fully assembled genome to align my reads to, just 24 linkage groups and 8,626 scaffolds. The number of scaffolds has presented some issues, so at times I have split the pipeline into two tracks, keeping the linkage groups together and the scaffolds together.

I start with HaplotypeCaller, scatter/gathering over the linkage groups and then using -L to treat the scaffolds as intervals. Next, MergeGVCFs combines all the GVCFs into one GVCF file per pool. I then use LeftAlignAndTrimVariants to left-align and trim the variants (I was having a problem where variants with a huge number of PLs were preventing me from importing into GenomicsDB; this seems to have solved that). Finally, I use GenomicsDBImport to build one GenomicsDB workspace per linkage group, with all my pools included, and CombineGVCFs to combine all the GVCFs for the scaffolds.
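In outline, the per-pool pipeline above looks roughly like this (a simplified sketch, not my exact commands; file names, paths, and the pool list are placeholders):

```shell
# Sketch of the per-pool pipeline described above (names are placeholders).
set -e

# Call variants per pool at pool ploidy, scattered over linkage
# groups (and scaffold lists passed via -L).
gatk HaplotypeCaller \
  -R ref.fasta \
  -I pool1.bam \
  -ploidy 20 \
  -ERC GVCF \
  -L LG01 \
  -O pool1.LG01.g.vcf.gz

# Gather the per-interval GVCFs into one GVCF per pool.
gatk MergeGVCFs \
  -I pool1.LG01.g.vcf.gz -I pool1.LG02.g.vcf.gz \
  -O pool1.g.vcf.gz

# Left-align and trim; this worked around the oversized-PL import problem.
gatk LeftAlignAndTrimVariants \
  -R ref.fasta \
  -V pool1.g.vcf.gz \
  -O pool1.trimmed.g.vcf.gz

# One GenomicsDB workspace per linkage group, all 12 pools included.
gatk GenomicsDBImport \
  -V pool1.trimmed.g.vcf.gz -V pool2.trimmed.g.vcf.gz \
  --genomicsdb-workspace-path genomicsdb_LG01 \
  -L LG01
```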

This is where I run into trouble. I have made a GenomicsDB for each of my 24 linkage groups without issue, but to run GenotypeGVCFs on the GenomicsDB and get genotypes I have to split each linkage group into intervals. I can get some intervals of just under ~2 million base pairs to genotype with this command and runtime block:

  command <<<
    set -e

    tar -xf ${workspace_tar}
    WORKSPACE=$( basename ${workspace_tar} .tar)

    "/gatk/gatk" --java-options "-Xmx125g -Xms125g" \
     GenotypeGVCFs \
     -R ${ref_fasta} \
     -O ${output_vcf_filename} \
     --disable-read-filter NotDuplicateReadFilter \
     -V gendb://$WORKSPACE \
     --verbosity ERROR \
     -L ${interval}

  >>>
  runtime {
    docker: "broadinstitute/gatk:4.1.0.0"
    memory: "150 GB"
    cpu: "4"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptible
  }

But some intervals fail with the following error, which I have been treating as an out-of-memory error, continuously bumping up -Xmx, -Xms, total memory, and disk space (up to 500 GB), though I'm not sure which of those would help. I just feel desperate to get them genotyped.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb8e3875edf, pid=19, tid=0x00007fdefaaed700
#
# JRE version: OpenJDK Runtime Environment (8.0_191-b12) (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.191-b12 mixed mode linux-amd64 )
# Problematic frame:
# C  [libtiledbgenomicsdb571603236969900509.so+0x17dedf]  void VariantOperations::reorder_field_based_on_genotype_index<int>(std::vector<int, std::allocator<int> > const&, unsigned long, MergedAllelesIdxLUT<true, true> const&, unsigned int, bool, bool, unsigned int, RemappedDataWrapperBase&, std::vector<unsigned long, std::allocator<unsigned long> >&, int, std::vector<int, std::allocator<int> > const&, unsigned long, std::vector<int, std::allocator<int> >&)+0x6f
#
# Core dump written. Default location: /cromwell_root/core or core.19
#
# An error report file with more information is saved as:
# /cromwell_root/hs_err_pid19.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Following the examples in this thread (gatk/discussion/comment/43887), I have also tried adding -XX:+UseSerialGC, using --use-jdk-deflater true and --use-jdk-inflater true, and upgrading to the most recent GATK (4.1.1.0). None of those helped.

Do you know why I might be getting this error? Should I continue to increase the memory/disk space, reduce the size of the interval, or is there something else that could be wrong?

The other error I get is perhaps more cryptic:

java: /home/vagrant/GenomicsDB/dependencies/htslib/vcf.c:3641: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx125g -Xms125g -jar /gatk/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /cromwell_root/uw-hauser-pcod-gatk-tests/gadMor2.fasta -O LG01.3.Genotyped.vcf.gz --disable-read-filter NotDuplicateReadFilter -V gendb://genomicsdb_LGs --verbosity ERROR -L LG01:6000001-8000000

That one seems to be tied to a specific location in the genome. Although the run on interval LG01:6000001-8000000 failed, it produced part of a VCF file. I split the original interval into two smaller ones to see whether avoiding the region where the VCF file stopped would let it finish: LG01:6000001-7134350 and LG01:7150000-8000000. The first interval worked fine, but the second gave me the same error and no VCF file.
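To automate this bisection instead of picking sub-intervals by hand, I've been considering a small helper that splits a failing interval into equal pieces, each of which can then be re-run through GenotypeGVCFs. The function below is my own sketch, not anything from GATK:

```shell
#!/usr/bin/env bash
# split_interval: break a chrom:start-end interval into N roughly equal
# sub-intervals, printed one per line. Purely illustrative; names and the
# splitting scheme are my own.
split_interval() {
  local interval=$1 n=$2
  local chrom=${interval%%:*}
  local range=${interval#*:}
  local start=${range%-*} end=${range#*-}
  # Ceiling division so the pieces cover the whole interval.
  local span=$(( (end - start + 1 + n - 1) / n ))
  local s=$start e
  while (( s <= end )); do
    e=$(( s + span - 1 ))
    (( e > end )) && e=$end
    echo "${chrom}:${s}-${e}"
    s=$(( e + 1 ))
  done
}

# Example: split the failing interval into four ~500 kb pieces.
split_interval "LG01:6000001-8000000" 4
```

Each printed sub-interval could then be passed to GenotypeGVCFs via -L, narrowing in on the problem region without manual arithmetic.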

Do you have any ideas about what might be happening here, and what the assertion `nps && nps*line->n_sample==n' failed means? I don't need to genotype every location; if some sites are too messy, I am OK with dropping them. I just want to make sure that I am not inadvertently introducing errors into my data, or ignoring warning flags for issues I need to address in the analysis. Also, with so many linkage groups split into intervals, it is tough to maintain a pipeline that requires manual tinkering whenever an interval fails.

Thank you so much for any advice you have!

Answers

  • SChaluvadi · Member, Broadie, Moderator

    @cmt I am going to take a closer look and get back to you soon!

  • AdelaideR · Member, Admin

    @cmt

    I am inclined to think that you are correct about needing more memory for the scattered-interval genotyping in GenotypeGVCFs: the tool needs enough memory to hold the merged records while genotyping each interval. By splitting the genotyping by scaffold, you may be running into trouble on the very long scaffolds. Have you noticed a pattern between the length of the scaffold and whether it fails with a memory error?

    Also, there is a note in the latest documentation:

    The amount of temporary disk storage required by GenomicsDBImport may exceed what is available in the default location: `/tmp`. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.
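    For example, the flag can be added to the existing GenotypeGVCFs call (a sketch; the temp path is a placeholder you would point at a disk with plenty of free space):

    ```shell
    gatk --java-options "-Xmx125g" GenotypeGVCFs \
      -R ref.fasta \
      -V gendb://genomicsdb_LG01 \
      -O LG01.Genotyped.vcf.gz \
      --tmp-dir /mnt/scratch/tmp \
      -L LG01:6000001-8000000
    ```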
    

    I would try to up the temporary disk storage before increasing the memory allocation.

    I am not sure what is going on with the cryptic error; I would definitely try to fix the memory issue first.

    However, I did find a bcftools issue here that might be part of the ploidy issue.

    Maybe applying the bcftools tag2tag plugin's --pl-to-gl option on that section of the chromosome could help? It seems this error may pop up when the VCF stream contains PLs but no GPs. But I am just guessing.
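    Something along these lines (untested on your data; file names are placeholders, and tag2tag ships as a bcftools plugin):

    ```shell
    # Convert PL tags to GL in the problem region's VCF; -r drops the
    # source PL tag after conversion.
    bcftools +tag2tag input.vcf.gz -Oz -o output.vcf.gz -- -r --pl-to-gl
    ```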

    If you want to share your VCF file that is giving the error with me, please send it along.
