Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)

cmtcmt Seattle WAMember

Hello!

I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.

I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20 I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter the amount of memory I allowed the task, it failed every time.

But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.

I am stumped by the error I got for many of the linkage groups:

***********************************************************************

A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz

***********************************************************************

What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging the MNPs or ignoring them. I was thinking of removing the MNPs using SelectVariants, before importing the GVCFs into GenomicsDB but how do you get SelectVariants to output a GVCF, which is needed for Joint Genotyping.

What would you recommend I do to get past this MNP hurdle?

Best Answer

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @cmt

    This could be due to multiple variants in multiple lines of the same locus. Would you please try to collapse the variants in different lines into one using vcftools and trying again?

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham !

    Thank you for your suggestion. I have been looking into using VCFtools, and I know that it is not a program you troubleshoot, but can you give me any more information for what I am trying to do?

    I have read through the manual, but nothing there or in through google searches is helping me understand what to do with vcftools to collapse the variants that are on different lines into one line.

    Is there another way to do this that is more straight forward? Also, do you know why this happens? Is there something I can do at an earlier stage of GATK to prevent it?

    Thanks!

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham

    I tried a couple things to try to alter the gvcf files in bctools, using --multiallelics and --remove-duplicates under norm, but they just reduced the file size dramatically (5.0 Gb down to 3.3 Gb) and according to the program output there were no changes:

    Lines total/split/realigned/skipped: 130135928/0/0/0

    So I did what I should have done right away, and took a look at the locations that were flagged in the standard error log of the GBImports:

    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG12:1179503 in VCF /1f0ddd38-9aa0-4e5e-8c42-a355f4ee76f9/HaplotypeCallerGVCF_halfScatter_GATK4/55ebd0b5-00b0-4756-b0f3-8cd74ba10cf5/call-MergeGVCFs/301498.merged.matefixed.sorted.markeddups.recal.g.vcf.gz
    

    This is a screenshot of the line that has the issue- the PL field has the max 10,626 values separated by commas, you can see how long the line is on the right side of the text editor.

    How do I deal with that issue? I think they are probably valid considering the variation at that location in the gvcf file:

    LG23 2475332 . GAAGATATTTA AAAGATATTTA,CAAGATATTTA,TAAGATATTTA,<NON_REF> 0.66 . AS_RAW_BaseQRankSum=||||;AS_RAW_MQ=0.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=||||;AS_RAW_ReadPosRankSum=||||;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=49;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=27580,49 [...]

    Is there any way to remove these locations from the gvcf so I may proceed with GenomicsDBImport and GenotypeGVCFs? Is there a filter that I am missing, or an argument in the HaplotypeCaller step?

    And is it strange that I only have one location in each of 6 of my 12 pools that has this problem?

    Thank you so much for your help!

  • cmtcmt Seattle WAMember
    edited March 1

    Ok, I have looked at all my flagged sites and they all have 10626 PLs.
    Here are other examples:

    LG20    5518501 .   ATTTTT  ATTTT,A,ATTTTTT,<NON_REF>   0.71    .   AS_RAW_BaseQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_MQ=1600.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_ReadPosRankSum=|NaN|NaN|NaN|NaN;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=9;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=5414,9 GT:AD:DP:GQ:PL:SB   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:1 [...]
    
    LG20    5518500 .   GATTTTTTTTTA    AATTTTTTTTTA,CATTTTTTTTTA,TATTTTTTTTTA,<NON_REF>    0.81    .   AS_RAW_BaseQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_MQ=1600.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_ReadPosRankSum=||||;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=9;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=5414,9 GT:AD:DP:GQ:PL:SB   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0: [...]
    

    These regions are obviously very messy, my species is known for having a lot of mini and micro satellites in the genome and I'm not interested in resolving them- how do I remove or avoid the areas that are getting flagged by GBImports? Can I use SelectVariants to subset only SNPs before GBImports?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 2

    Hi @cmt

    Yes. You can use SelectVariants to select only for SNPs from a sample by using this option: -selectType SNP.
    Let me know if this solution works for you. If not then we can look into it further.

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham

    Thanks for the suggestion, but that did not work- I think it has something to do with the NON-REF allele? I wish I had made a better note of where I read that. There were no errors when I ran SelectVariants on my gvcf file, but I got this in my stderr.log:

    18:35:28.667 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    18:42:40.474 INFO  ProgressMeter -             unmapped              7.2                     0              0.0
    18:42:40.475 INFO  ProgressMeter - Traversal complete. Processed 0 total variants in 7.2 minutes.
    18:42:40.524 INFO  SelectVariants - Shutting down engine
    [March 6, 2019 6:42:40 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 7.26 minutes.
    Runtime.totalMemory()=211730432
    Using GATK jar /gatk/gatk-package-4.0.12.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3000m -jar /gatk/gatk-package-4.0.12.0-local.jar SelectVariants -R /cromwell_root/uw-hauser-pcod-gatk-tests/gadMor2.fasta --variant /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/e179e377-dbf0-4831-9494-45894cf99945/HaplotypeCallerGVCF_halfScatter_GATK4/e5a94374-27e2-4d97-b661-00d01b45b620/call-MergeGVCFs/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz --output SelectVariants.vcf.gz --select-type-to-include SNP
    
    

    I moved on to try a couple of different things on my gvcf files, but I haven't landed on a solution yet.

    The first was to use bcftools to normalize the indels, but so far nothing I have tried with that program has solved the issue.

    Splitting the multiallelic sites:

    bcftools norm --multiallelics -any --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_splitMNPany_out.g.vcf.gz

    I get this error, which I googled and see other people with multiple ploidy get using this command:

    Error at LG01:211, the tag PL has wrong number of fields

    Joining the biallelic sites in to multiallelic sites didn't change the fields according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0), but the files were significantly smaller (5GB to 3.3 GB):

    bcftools norm --multiallelics +both --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_multiBoth_out.g.vcf.gz

    Tried to remove duplicate records, but bcftools didn't make any changes according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0) but it changed the file size(5GB to 3.3 GB) :

    bcftools norm --rm-dup both --output-type z ./301498.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301498.bcftools_RMVdups_out.g.vcf.gz

    I tried GATK's LeftAlignAndTrimVariants, where I split multiallelics:

    gatk --java-options "-Xmx${command_mem}m" LeftAlignAndTrimVariants \ -R ${ref_fasta} \ --variant ${raw_vcf} \ --output "${output_name}.${vcf_suffix}" \ --split-multi-allelics true \ --dont-trim-alleles true \ --disable-read-filter NotDuplicateReadFilter \ --max-indel-length 500

    But using the gvcf output of that in GDBImports gave me the following error at every linkage group, and when I originally ran the GDBImports on my gvcf files from HaplotypeCaller it did not fail at every linkage group. That gives me the impression that I don't need to split up my multiallelics, I just need some way to reduce the number of PLs for some variants:

    20:56:52.695 INFO GenomicsDBImport - Importing batch 1 with 2 samples terminate called after throwing an instance of 'VCF2TileDBException' what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 210 ] current cell: [ 0, 210 ]. The most likely cause is unexpected data in the input file: (a) A VCF file has two lines with the same genomic position (b) An unsorted CSV file (c) Malformed VCF file (or malformed index) See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data

    I still have a couple things I want to try. Right now I'm left aligning but not trimming or splitting the multiallelics up using GATK's LeftAlignAndTrimVariants.

    I also want to try the bcftools normalize command again, but with the argument --do-not-normalize, which I saw recommended here: gatkforums.broadinstitute.org/gatk/discussion/23462/gatk-v4-1-0-0-validatevariants-gvcf-mode-error-non-in-v4-0-11-0

    I also haven't tried the output of the bcftools files in the GenomicsDBImport because I didn't trust the files with the reduced file size with no explanation by the program for what had changed in the gvcf files. I have thought about running them through the GDBImport, but I don't think I could trust the files without knowing why they shrunk so with bcftools.

    My last hope is to manually edit the vcf files to remove the 10,626 PLs and see if that works, though I am still no closer to understanding why I have that many PLs (the default cap, right?) at just a couple locations, and why those are the locations that prevent my using GenomicsDBImport.

    Do you have any ideas for what might help?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 18

    Hi @cmt

    I asked the dev team to look into this and this is what they had to say:

    Can you please try to collapse the variants in different lines into one using using bcftools as shown in this doc: https://github.com/GenomicsDB/GenomicsDB/wiki/Useful-external-tools#useful-bcftools-commands.

    This is a know error and can be fixed by this method.

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham!

    I did that, but it was not helpful; I got the same errors. The variants in my gvcf files were not split on different lines, the extreme number of PL values for some variants seems to have been the main issue (or a symptom of an issue I don't understand). What ended up working was LeftAlignAndTrimVariants, with just the left align and the trim arguments, which reduced the size of the MNPs.

    I can ask this in another thread, but does CatVariants work in GATK4?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 18

    Hi @cmt

    Thank you for sharing your solution.

    No, CatVariants does not work in GATK4. It is a deprecated tool. You can however use gathervfcs tool for the same purpose.

  • cmtcmt Seattle WAMember
  • alanhoylealanhoyle UNC Lineberger Member

    I am running into this issue trying to build a PON to use with Mutect2 in gatk 4.1.1. I called the variants with gatk Mutect2 -I normal.bam --output sample1.tumor_only.vcf.gz --reference GTRCh38.d1.vd1.fa, but the vcf that it created appears to have this type of variant inside it.

    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr5:21285319 in VCF sample1.tumor_only.vcf.gz

    The line in question in the vcf (REF/ALT, and genomic location obfuscated/changed) is:

    chr5 21285319 . TT CT,CC . . DP=421;ECNT=3;MBQ=32,32,20;MFRL=183,186,172;MMQ=60,60,60;MPOS=18,12;POPAF=7.30,7.30;TLOD=309.97,4.23 GT:AD:AF:DP:F1R2:F2R1:SB 0/1/2:214,164,14:0.435,0.021:392:54,54,4:97,78,8:116,98,95,83

  • manolismanolis Member ✭✭

    Hi @alanhoyle,

    your issue maybe is related to my last issue... maybe this help

    Best

  • alanhoylealanhoyle UNC Lineberger Member

    That seems to be the case . Thanks !

  • jogeibjogeib GermanyMember

    Hi @bhanuGandham,
    I am runnning into the same problems as @cmt while trying to import g.VCF files from pooled samples (ploidy 20) to GenomicsDB via gatk4.1.1.0 GenomicsDBImport. However, LeftAlignAndTrimVariants did not solve the problem, so I tried to dig deeper into the issue. My MNPs were created due to the --max-alternate-alleles and --max-genotype-count thresholds, as they removed shorter possible variants and the original indel positions then appeared to be MNPs.

    default thresholds (I trimmed the very long list of 1771 PL valus):

    chr1    115074661       .       AT      GT,TT,<NON_REF> 811.54  .       AS_RAW_BaseQRankSum=|1.3,1|-0.1,1|NaN;AS_RAW_MQ=21600.00|61200.00|21600.00|0.00;AS_RAW_MQRankSum=|0.0,1|0.0,1|NaN;AS_
    RAW_ReadPosRankSum=|0.1,1|0.5,1|NaN;AS_SB_TABLE=4,2|5,12|2,4|0,0;BaseQRankSum=1.040;DP=32;MLEAC=10,4,0;MLEAF=0.500,0.200,0.00;MQRankSum=0.000;RAW_MQandDP=115200,32;ReadPosRankSum=0.269   GT
    :AD:DP:GQ:PL:SB       0/0/0/0/1/1/1/1/1/1/1/1/1/1/1/1/2/2/2/2:6,17,6,0:29:0:777,368,299,258,...
    

    increased --max-alternate-alleles and --max-genotype-count (contains much more PL values (10626) now):

    chr1    115074661       .       AT      A,GT,TT,<NON_REF>       996.20  .       AS_RAW_BaseQRankSum=||||;AS_RAW_MQ=0.00|21600.00|57600.00|21600.00|0.00;AS_RAW_MQRankSum=||||;AS_RAW_ReadPosR
    ankSum=||||;AS_SB_TABLE=0,0|4,2|5,11|2,4|0,0;DP=32;MLEAC=4,10,4,0;MLEAF=0.200,0.500,0.200,0.00;RAW_MQandDP=115200,32       GT:AD:DP:GQ:PL:SB       1/1/1/1/2/2/2/2/2/2/2/2/2/2/2/2/3/3/3/3:0,
    6,16,6,0:28:0:955,839,822,813,809,806,...
    

    However, using LeftAlignAndTrimVariants did cut out the T allele and by this the complete position 115074662, resulting in a non valid g.VCF file and leading to further errors:

    chr1    115074661       .       A       G,T,<NON_REF>   811.54  .       AS_RAW_BaseQRankSum=|1.3,1|-0.1,1|NaN;AS_RAW_MQ=21600.00|61200.00|21600.00|0.00;AS_RAW_MQRankSum=|0.0,1|0.0,1|NaN;AS_ ...
    chr1    115074663       .       T       <NON_REF>       .       .       END=115074672   GT:DP:GQ:MIN_DP:PL      0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:32:6:31:0,6,12,19,26,34,42,51,60,70,8 ...
    

    I also tried to run the according steps with version 4.1.2.0, but the MNP support for GenomicsDB seems not to be included yet. Is there alternatively the possibility to somehow use a gVCF mode for LeftAlignAndTrimVariants, which keeps the trimed reference alleles in the g.VCF?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jogeib

    There is a known error with GenomicsDBImport when provided with gvcfs containing MNPs. One way around this is to exclude MNPs by using the option --max-mnp-distance 0 in the variant calling step. Let me know if this helps resolve the error.

  • jogeibjogeib GermanyMember

    Hi @bhanuGandham
    thanks for your fast answer. For HaplotypeCaller, --max-mnp-distance is set to 0 by default:

    --max-mnp-distance
     -mnp-dist  0   Two or more phased substitutions separated by this distance or less are merged into MNPs. WARNING: When used in GVCF mode, resulting GVCFs cannot be joint-genotyped.
    

    However, I also crosschecked by manually setting the option to zero, but that did not change the output. It really seems to be a problem of --max-alternate-alleles and --max-genotype-count as described above. That's why I asked for a possibility to correct it afterwards.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @jogeib @bhanuGandham If -max-mnp-distance is 0 and this problem still occurs it seems like a possible issue with HaplotypeCaller. For example, one of @jogeib's calls was AT GT,TT which isn't a MNP but an incorrect representation of a SNP A G,T. We fixed an issue like this in Mutect2 a while back. The problem was that the variant representation wasn't being trimmed after low-quality alleles were dropped. The fix was, essentially, to run LeftAlignAndTrimVariants within the Mutect2 genotyping, which is totally negligible in terms of runtime. Here's the PR for the M2 fix: https://github.com/broadinstitute/gatk/pull/3509.

  • jogeibjogeib GermanyMember

    @davidben @bhanuGandham It seems as using CombineGVCFs instead of GenomicsDB is solving my problem for now, but is it likely that the bug will be removed in the next version of gatk?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jogeib

    I don't know if the fix will be in the next version but I know this is being worked on, and will be posted here once fixed and released.

  • jejacobs23jejacobs23 Portland, ORMember

    I am having the same issue. I am working with several tumor samples and am in the process of creating the PON using the new 3-step process described here.

    When I attempt to import the .vcf.gz files using GenomicsDBImport, I get the following error message:

    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr1:10247 in VCF /home/exacloud/lustre1/jjacobs/data/osteo/SJOS010_G/pre-PON.vcf.gz

    It seems that the tool, "VariantsToAllelicPrimitives" would solve this, but this tool is no longer available in GATK. Was this tool replaced by anything that does the same job in GATK4?

    I suppose I could go back to the tumor-only mode step and set --max-mnp-distance to 0 but this would be a significant time commitment to redo all of my samples.

  • jogeibjogeib GermanyMember

    Hi @bhanuGandham,
    as the bug fix seems to need some more time, I tried to get around the problem by increasing the --max-genotype-count threshold and thus avoiding this false representations. However, I then run into another errror message:

    java: /home/vagrant/GenomicsDB.v1.0.0/dependencies/htslib/vcf.c:3641: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
    

    I was able to track it down to a small interval (chr16:2057610-2057615) which shows lots of overlapping INDELs (see attached file which contains the subsetted g.vcfs), but could not identify any other strange things. Do you have any clues about what the problem could be there?

    test.txt 249.2K
  • cmtcmt Seattle WAMember

    Hi @jogeib - I got the same error, which drove me mad. I had to run run it split up into intervals that avoided those areas, which I made by trial and error. They were not the type of variation I am looking for in my project, so that worked for me, but it was a total pain.

    I would love a solution for future use!

  • jogeibjogeib GermanyMember

    @jogeib said:
    Hi @bhanuGandham,
    as the bug fix seems to need some more time, I tried to get around the problem by increasing the --max-genotype-count threshold and thus avoiding this false representations. However, I then run into another errror message:

    java: /home/vagrant/GenomicsDB.v1.0.0/dependencies/htslib/vcf.c:3641: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
    

    I was able to track it down to a small interval (chr16:2057610-2057615) which shows lots of overlapping INDELs (see attached file which contains the subsetted g.vcfs), but could not identify any other strange things. Do you have any clues about what the problem could be there?

    Sorry, I forgot to mention that it happened in genotypeGVCFs. So genomicsDBImport worked, but genotypeGVCFs stopped at this interval.

  • jejacobs23jejacobs23 Portland, ORMember

    I wanted to give an update. Since the GATK 4.1.2.0 3-step process for creating a PONs didn't seem quite ready for primetime, I went back to the GATK 4.0.12.0 version and created my PONs using the old 2-step process described here.

    This worked well and I was able to implement the PON with the rest of the somatic mutation-calling workflow

    My question is: what is the difference between the old 2-step process and the new 3-step process for creating a PON? Are there advantages to using the new process?

  • bshifawbshifaw Member, Broadie, Moderator admin

    @jejacobs23, VariantsToAllelicPrimitives is not available in GATK4 and it doesn't look like there are any similar tools. The differences between for the process of creating pon is mentioned in the tutorial here under the "A step-by-step guide to the new Mutect2 Panel of Normals Workflow" section.

    @jogeib Looks like cmt has come accrose this error message before (here), would it be possible to work around these trouble areas like cmt mentioned or is it required for your project?

    Hi @jogeib - I got the same error, which drove me mad. I had to run run it split up into intervals that avoided those areas, which I made by trial and error. They were not the type of variation I am looking for in my project, so that worked for me, but it was a total pain.

    Its difficult to tell whats wrong with provided sample, would be more useful with IGV images and/or bug report level information.

  • jogeibjogeib GermanyMember

    @bshifaw It should be possible. However, I uploaded data snippets (bams and gVCFs for chr16:2057000-2058200, log file and command file, reference genome; bugReportGATK_jogeib.tar.gz) to your ftp server so that you are able to check out the problem. The detailed region where the error occured was chr16:2057610-2057615. Don't mind the specific memory, annotation group and maxGenotype settings. I played around with them, but the bug occured with every setting. Could you plese inform me if you can track down the problem? I would be interested in what it was about.

  • cmtcmt Seattle WAMember

    Hi @jogeib @bshifaw,

    I'm having the same issue again and this time can't find a work around.

    I'm trying to do Joint Genotyping on a set of 12 GVCFs from pooled samples (50 individuals per pool, 12 pools total, using a ploidy of 20 and GATK 4.1.2.0). The GVCFs were generated without incident from HaplotypeCaller and GenomicsDBImport combined all 12 GVCFs into one GenomicsDB per chromosome, but have not been successful in genotyping the pools from the GenomicsDB. Because of the computational load and memory (apparently) needed to genotype my pooled samples, I split the chromosomes into even intervals of 50k base pairs and scattered over them, but about a quarter of the scatter tasks fail with the error:

    bcf_update_format: Assertion 'nps && nps*line_>n_sample==n' failed

    (Full disclosure, I'm getting other errors too, but they all seem to do with memory limitations or corruption on specific intervals, though I might be posting about that soon!)

    In earlier runs through gatk with this data set I got the bcf_update_format error rarely enough that I could exclude that region from my analysis. Now I get the error so often that I can't afford to skip over those regions. I am using the same bam files as the original input for this run through gatk, but I changed a couple parameters between that previous run and this one to make the analysis more applicable to my pooled data. The biggest thing is that I removed the bqsr step from the pre-processing of the bam files, which means I have a lot more reads per base pair that are making into the genotyping stage. I tried to reduce the computational load of that by changing the max_genotype_count from 10,000 to 1771 (which I hope is appropriate for a ploidy of 20 where I don't care about capturing rare variants?)

    A couple of things that I have tried:

    Previously (here), @AdelaideR suggested: Maybe applying the bcftools tag2tag --pl-to-gl option on that section of chromosome could help? It seems this error may pop up if the VCF file stream contains PLs but no GPs. But I am just guessing.  I tried that bcftools plugin on a gvcf , but it literally just turned the PLs into GLs, (removing the PLs) and doesn't GenotypeGVCFs rely on the PLs? So this was a 1/2 try, I didn't actually try to genotype after.

    ValidateVariants, to make sure my GVCF files don't have something weird. They all passed with no warnings.

    I'm also trying a run with --use-jdk-inflater and --use-jdk-deflater but now I'm just grasping at straws.

    Do you have any ideas for what might help?

    Thank you so much!

  • bshifawbshifaw Member, Broadie, Moderator admin

    @cmt

    Thanks for the detailed notes on the error, I'll check with dev team and see if they have any suggestions.

  • bshifawbshifaw Member, Broadie, Moderator admin
    edited July 21

    @cmt

    The team believe this could be related to a GenomicsDB bug, although they can't be certain without a test file (for GenomicsDB in this case), the command line being used, and the stacktrace. It would help if you pass those along.

    As a side test, could you please retry GenomicsDB using GATKs nightly builds, bug fixes are currently in the works for this tool and the nightly builds might already have a fix for this error.

    Post edited by bshifaw on
  • cmtcmt Seattle WAMember

    Hi @bshifaw thanks for following up :)

    I won't be able to get to trying out the nightly build until Monday, but I have bundled a bunch of files up for y'all to take a look at. I included the code I have run and the error files, but because my data set is so large I only focused on a little region of one linkage group that is pretty typical of the rest of the data. The error files I included were only for that region, though I have run the whole genome's worth of intervals and gotten the same errors elsewhere. I also made gvcf files that y'all can test with that are just that region. I followed the instructions in Article #1894 and uploaded the zipped folder called BUGreportGATK.zip. Sorry for the generic name!

    Let me know if I can send anything else and thanks so much for taking a look at this for me.

  • cmtcmt Seattle WAMember

    hi @bshifaw

    Have y'all have a chance to take a look at the files that I uploaded? I tried the nightly last week with a little test set of 10 intervals that give me both errors:
    bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.

    and

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007fd21db5aec9, pid=17, tid=0x00007ff15983a700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_191-b12) (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12)
    # Java VM: OpenJDK 64-Bit Server VM (25.191-b12 mixed mode linux-amd64 )
    # Problematic frame:
    # C  [libtiledbgenomicsdb1636105264538757432.so+0x17dec9]  void VariantOperations::reorder_field_based_on_genotype_index<int>(std::vector<int, std::allocator<int> > const&, unsigned long, MergedAllelesIdxLUT<true, true> const&, unsigned int, bool, bool, unsigned int, RemappedDataWrapperBase&, std::vector<unsigned long, std::allocator<unsigned long> >&, int, std::vector<int, std::allocator<int> > const&, unsigned long, std::vector<int, std::allocator<int> >&)+0x59
    #
    # Core dump written. Default location: /cromwell_root/core or core.17
    [...]
    

    and they still gave me the same errors with the nightly build. I hope y'all have had more success troubleshooting!

    In the mean time I'm running GenotypeGVCFs with intervals of 10k base pairs and having a success rate of 63-68% of the intervals working. Coming up with techniques to identify and track down the vcf file paths for the intervals that worked has been tricky but I think I have some working methods. I'm hoping y'all with have some suggestions for the cause of the errors I'm seeing.

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @cmt

    @bshifaw will get back to you shortly.

Sign In or Register to comment.