We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)

Hello!
I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.
I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20 I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter the amount of memory I allowed the task, it failed every time.
But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.
I am stumped by the error I got for many of the linkage groups:
*********************************************************************** A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz ***********************************************************************
What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging the MNPs or ignoring them. I was thinking of removing the MNPs using SelectVariants, before importing the GVCFs into GenomicsDB but how do you get SelectVariants to output a GVCF, which is needed for Joint Genotyping.
What would you recommend I do to get past this MNP hurdle?
Best Answers
-
bshifaw admin
Hi @cmt
The files were sent to the dev team and here is their recent response
Yes, this is a bug in GenomicsDB where a large ploidy + number of genotypes is leading to math overflow. We are working on a workaround - basically to catch the overflow and stop enumerating genotypes for now.
-
bshifaw admin
Hi @cmt ,
Update: The fix for the bug will be in the next gatk release, here is link the issue ticket
-
Tiffany_at_Broad Cambridge, MA admin
Hi @cmt !
Expect this fix Beri linked to to be in the next release (it is not in 4.1.4.0). With that particular fix, you shouldn't have to run the HaplotypeCaller step again.
Answers
Hi @cmt
This could be due to multiple variants in multiple lines of the same locus. Would you please try to collapse the variants in different lines into one using vcftools and trying again?
Hi @bhanuGandham !
Thank you for your suggestion. I have been looking into using VCFtools, and I know that it is not a program you troubleshoot, but can you give me any more information for what I am trying to do?
I have read through the manual, but nothing there or in through google searches is helping me understand what to do with vcftools to collapse the variants that are on different lines into one line.
Is there another way to do this that is more straight forward? Also, do you know why this happens? Is there something I can do at an earlier stage of GATK to prevent it?
Thanks!
Hi @bhanuGandham
I tried a couple things to try to alter the gvcf files in bctools, using --multiallelics and --remove-duplicates under norm, but they just reduced the file size dramatically (5.0 Gb down to 3.3 Gb) and according to the program output there were no changes:
Lines total/split/realigned/skipped: 130135928/0/0/0
So I did what I should have done right away, and took a look at the locations that were flagged in the standard error log of the GBImports:
This is a screenshot of the line that has the issue- the PL field has the max 10,626 values separated by commas, you can see how long the line is on the right side of the text editor.
How do I deal with that issue? I think they are probably valid considering the variation at that location in the gvcf file:
LG23 2475332 . GAAGATATTTA AAAGATATTTA,CAAGATATTTA,TAAGATATTTA,<NON_REF> 0.66 . AS_RAW_BaseQRankSum=||||;AS_RAW_MQ=0.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=||||;AS_RAW_ReadPosRankSum=||||;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=49;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=27580,49 [...]
Is there any way to remove these locations from the gvcf so I may proceed with GenomicsDBImport and GenotypeGVCFs? Is there a filter that I am missing, or an argument in the HaplotypeCaller step?
And is it strange that I only have one location in each of 6 of my 12 pools that has this problem?
Thank you so much for your help!
Ok, I have looked at all my flagged sites and they all have 10626 PLs.
Here are other examples:
These regions are obviously very messy, my species is known for having a lot of mini and micro satellites in the genome and I'm not interested in resolving them- how do I remove or avoid the areas that are getting flagged by GBImports? Can I use SelectVariants to subset only SNPs before GBImports?
Thanks!
Hi @cmt
Yes. You can use SelectVariants to select only for SNPs from a sample by using this option:
-selectType SNP
.Let me know if this solution works for you. If not then we can look into it further.
Hi @bhanuGandham
Thanks for the suggestion, but that did not work- I think it has something to do with the NON-REF allele? I wish I had made a better note of where I read that. There were no errors when I ran SelectVariants on my gvcf file, but I got this in my stderr.log:
I moved on to try a couple of different things on my gvcf files, but I haven't landed on a solution yet.
The first was to use bcftools to normalize the indels, but so far nothing I have tried with that program has solved the issue.
Splitting the multiallelic sites:
bcftools norm --multiallelics -any --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_splitMNPany_out.g.vcf.gz
I get this error, which I googled and see other people with multiple ploidy get using this command:
Error at LG01:211, the tag PL has wrong number of fields
Joining the biallelic sites in to multiallelic sites didn't change the fields according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0), but the files were significantly smaller (5GB to 3.3 GB):
bcftools norm --multiallelics +both --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_multiBoth_out.g.vcf.gz
Tried to remove duplicate records, but bcftools didn't make any changes according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0) but it changed the file size(5GB to 3.3 GB) :
bcftools norm --rm-dup both --output-type z ./301498.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301498.bcftools_RMVdups_out.g.vcf.gz
I tried GATK's LeftAlignAndTrimVariants, where I split multiallelics:
gatk --java-options "-Xmx${command_mem}m" LeftAlignAndTrimVariants \ -R ${ref_fasta} \ --variant ${raw_vcf} \ --output "${output_name}.${vcf_suffix}" \ --split-multi-allelics true \ --dont-trim-alleles true \ --disable-read-filter NotDuplicateReadFilter \ --max-indel-length 500
But using the gvcf output of that in GDBImports gave me the following error at every linkage group, and when I originally ran the GDBImports on my gvcf files from HaplotypeCaller it did not fail at every linkage group. That gives me the impression that I don't need to split up my multiallelics, I just need some way to reduce the number of PLs for some variants:
20:56:52.695 INFO GenomicsDBImport - Importing batch 1 with 2 samples terminate called after throwing an instance of 'VCF2TileDBException' what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 210 ] current cell: [ 0, 210 ]. The most likely cause is unexpected data in the input file: (a) A VCF file has two lines with the same genomic position (b) An unsorted CSV file (c) Malformed VCF file (or malformed index) See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data
I still have a couple things I want to try. Right now I'm left aligning but not trimming or splitting the multiallelics up using GATK's LeftAlignAndTrimVariants.
I also want to try the bcftools normalize command again, but with the argument --do-not-normalize, which I saw recommended here: gatkforums.broadinstitute.org/gatk/discussion/23462/gatk-v4-1-0-0-validatevariants-gvcf-mode-error-non-in-v4-0-11-0
I also haven't tried the output of the bcftools files in the GenomicsDBImport because I didn't trust the files with the reduced file size with no explanation by the program for what had changed in the gvcf files. I have thought about running them through the GDBImport, but I don't think I could trust the files without knowing why they shrunk so with bcftools.
My last hope is to manually edit the vcf files to remove the 10,626 PLs and see if that works, though I am still no closer to understanding why I have that many PLs (the default cap, right?) at just a couple locations, and why those are the locations that prevent my using GenomicsDBImport.
Do you have any ideas for what might help?
Hi @cmt
I asked the dev team to look into this and this is what they had to say:
This is a know error and can be fixed by this method.
Hi @bhanuGandham!
I did that, but it was not helpful; I got the same errors. The variants in my gvcf files were not split on different lines, the extreme number of PL values for some variants seems to have been the main issue (or a symptom of an issue I don't understand). What ended up working was LeftAlignAndTrimVariants, with just the left align and the trim arguments, which reduced the size of the MNPs.
I can ask this in another thread, but does CatVariants work in GATK4?
Thanks!
Hi @cmt
Thank you for sharing your solution.
No, CatVariants does not work in GATK4. It is a deprecated tool. You can however use gathervfcs tool for the same purpose.
Thank you @bhanuGandham !
I am running into this issue trying to build a PON to use with Mutect2 in gatk 4.1.1. I called the variants with
gatk Mutect2 -I normal.bam --output sample1.tumor_only.vcf.gz --reference GTRCh38.d1.vd1.fa
, but the vcf that it created appears to have this type of variant inside it.A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr5:21285319 in VCF sample1.tumor_only.vcf.gz
The line in question in the vcf (REF/ALT, and genomic location obfuscated/changed) is:
chr5 21285319 . TT CT,CC . . DP=421;ECNT=3;MBQ=32,32,20;MFRL=183,186,172;MMQ=60,60,60;MPOS=18,12;POPAF=7.30,7.30;TLOD=309.97,4.23 GT:AD:AF:DP:F1R2:F2R1:SB 0/1/2:214,164,14:0.435,0.021:392:54,54,4:97,78,8:116,98,95,83
Hi @alanhoyle,
your issue maybe is related to my last issue... maybe this help
Best
That seems to be the case . Thanks !
Hi @bhanuGandham,
I am runnning into the same problems as @cmt while trying to import g.VCF files from pooled samples (ploidy 20) to GenomicsDB via gatk4.1.1.0 GenomicsDBImport. However, LeftAlignAndTrimVariants did not solve the problem, so I tried to dig deeper into the issue. My MNPs were created due to the --max-alternate-alleles and --max-genotype-count thresholds, as they removed shorter possible variants and the original indel positions then appeared to be MNPs.
default thresholds (I trimmed the very long list of 1771 PL valus):
increased --max-alternate-alleles and --max-genotype-count (contains much more PL values (10626) now):
However, using LeftAlignAndTrimVariants did cut out the T allele and by this the complete position 115074662, resulting in a non valid g.VCF file and leading to further errors:
I also tried to run the according steps with version 4.1.2.0, but the MNP support for GenomicsDB seems not to be included yet. Is there alternatively the possibility to somehow use a gVCF mode for LeftAlignAndTrimVariants, which keeps the trimed reference alleles in the g.VCF?
Hi @jogeib
There is a known error with GenomicsDBImport when provided with gvcfs containing MNPs. One way around this is to exclude MNPs by using the option
--max-mnp-distance 0
in the variant calling step. Let me know if this helps resolve the error.Hi @bhanuGandham
thanks for your fast answer. For HaplotypeCaller,
--max-mnp-distance
is set to 0 by default:However, I also crosschecked by manually setting the option to zero, but that did not change the output. It really seems to be a problem of
--max-alternate-alleles
and--max-genotype-count
as described above. That's why I asked for a possibility to correct it afterwards.@jogeib @bhanuGandham If
-max-mnp-distance
is 0 and this problem still occurs it seems like a possible issue withHaplotypeCaller
. For example, one of @jogeib's calls wasAT GT,TT
which isn't a MNP but an incorrect representation of a SNPA G,T
. We fixed an issue like this inMutect2
a while back. The problem was that the variant representation wasn't being trimmed after low-quality alleles were dropped. The fix was, essentially, to runLeftAlignAndTrimVariants
within theMutect2
genotyping, which is totally negligible in terms of runtime. Here's the PR for the M2 fix: https://github.com/broadinstitute/gatk/pull/3509.@davidben @bhanuGandham It seems as using CombineGVCFs instead of GenomicsDB is solving my problem for now, but is it likely that the bug will be removed in the next version of gatk?
Hi @jogeib
I don't know if the fix will be in the next version but I know this is being worked on, and will be posted here once fixed and released.
I am having the same issue. I am working with several tumor samples and am in the process of creating the PON using the new 3-step process described here.
When I attempt to import the .vcf.gz files using GenomicsDBImport, I get the following error message:
A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr1:10247 in VCF /home/exacloud/lustre1/jjacobs/data/osteo/SJOS010_G/pre-PON.vcf.gz
It seems that the tool, "VariantsToAllelicPrimitives" would solve this, but this tool is no longer available in GATK. Was this tool replaced by anything that does the same job in GATK4?
I suppose I could go back to the tumor-only mode step and set
--max-mnp-distance
to 0 but this would be a significant time commitment to redo all of my samples.Hi @bhanuGandham,
as the bug fix seems to need some more time, I tried to get around the problem by increasing the
--max-genotype-count
threshold and thus avoiding this false representations. However, I then run into another errror message:I was able to track it down to a small interval (chr16:2057610-2057615) which shows lots of overlapping INDELs (see attached file which contains the subsetted g.vcfs), but could not identify any other strange things. Do you have any clues about what the problem could be there?
Hi @jogeib - I got the same error, which drove me mad. I had to run run it split up into intervals that avoided those areas, which I made by trial and error. They were not the type of variation I am looking for in my project, so that worked for me, but it was a total pain.
I would love a solution for future use!
Sorry, I forgot to mention that it happened in genotypeGVCFs. So genomicsDBImport worked, but genotypeGVCFs stopped at this interval.
I wanted to give an update. Since the GATK 4.1.2.0 3-step process for creating a PONs didn't seem quite ready for primetime, I went back to the GATK 4.0.12.0 version and created my PONs using the old 2-step process described here.
This worked well and I was able to implement the PON with the rest of the somatic mutation-calling workflow
My question is: what is the difference between the old 2-step process and the new 3-step process for creating a PON? Are there advantages to using the new process?
@jejacobs23,
VariantsToAllelicPrimitives
is not available in GATK4 and it doesn't look like there are any similar tools. The differences between for the process of creating pon is mentioned in the tutorial here under the "A step-by-step guide to the new Mutect2 Panel of Normals Workflow" section.@jogeib Looks like cmt has come accrose this error message before (here), would it be possible to work around these trouble areas like cmt mentioned or is it required for your project?
Its difficult to tell whats wrong with provided sample, would be more useful with IGV images and/or bug report level information.
@bshifaw It should be possible. However, I uploaded data snippets (bams and gVCFs for chr16:2057000-2058200, log file and command file, reference genome; bugReportGATK_jogeib.tar.gz) to your ftp server so that you are able to check out the problem. The detailed region where the error occured was chr16:2057610-2057615. Don't mind the specific memory, annotation group and maxGenotype settings. I played around with them, but the bug occured with every setting. Could you plese inform me if you can track down the problem? I would be interested in what it was about.
Hi @jogeib @bshifaw,
I'm having the same issue again and this time can't find a work around.
I'm trying to do Joint Genotyping on a set of 12 GVCFs from pooled samples (50 individuals per pool, 12 pools total, using a ploidy of 20 and GATK 4.1.2.0). The GVCFs were generated without incident from HaplotypeCaller and GenomicsDBImport combined all 12 GVCFs into one GenomicsDB per chromosome, but have not been successful in genotyping the pools from the GenomicsDB. Because of the computational load and memory (apparently) needed to genotype my pooled samples, I split the chromosomes into even intervals of 50k base pairs and scattered over them, but about a quarter of the scatter tasks fail with the error:
bcf_update_format: Assertion 'nps && nps*line_>n_sample==n' failed
(Full disclosure, I'm getting other errors too, but they all seem to do with memory limitations or corruption on specific intervals, though I might be posting about that soon!)
In earlier runs through gatk with this data set I got the bcf_update_format error rarely enough that I could exclude that region from my analysis. Now I get the error so often that I can't afford to skip over those regions. I am using the same bam files as the original input for this run through gatk, but I changed a couple parameters between that previous run and this one to make the analysis more applicable to my pooled data. The biggest thing is that I removed the bqsr step from the pre-processing of the bam files, which means I have a lot more reads per base pair that are making into the genotyping stage. I tried to reduce the computational load of that by changing the max_genotype_count from 10,000 to 1771 (which I hope is appropriate for a ploidy of 20 where I don't care about capturing rare variants?)
A couple of things that I have tried:
Previously (here), @AdelaideR suggested: Maybe applying the bcftools tag2tag --pl-to-gl option on that section of chromosome could help? It seems this error may pop up if the VCF file stream contains PLs but no GPs. But I am just guessing. I tried that bcftools plugin on a gvcf , but it literally just turned the PLs into GLs, (removing the PLs) and doesn't GenotypeGVCFs rely on the PLs? So this was a 1/2 try, I didn't actually try to genotype after.
ValidateVariants, to make sure my GVCF files don't have something weird. They all passed with no warnings.
I'm also trying a run with --use-jdk-inflater and --use-jdk-deflater but now I'm just grasping at straws.
Do you have any ideas for what might help?
Thank you so much!
@cmt
Thanks for the detailed notes on the error, I'll check with dev team and see if they have any suggestions.
@cmt
The team believe this could be related to a GenomicsDB bug, although they can't be certain without a test file (for GenomicsDB in this case), the command line being used, and the stacktrace. It would help if you pass those along.
As a side test, could you please retry GenomicsDB using GATKs nightly builds, bug fixes are currently in the works for this tool and the nightly builds might already have a fix for this error.
Hi @bshifaw thanks for following up
I won't be able to get to trying out the nightly build until Monday, but I have bundled a bunch of files up for y'all to take a look at. I included the code I have run and the error files, but because my data set is so large I only focused on a little region of one linkage group that is pretty typical of the rest of the data. The error files I included were only for that region, though I have run the whole genome's worth of intervals and gotten the same errors elsewhere. I also made gvcf files that y'all can test with that are just that region. I followed the instructions in Article #1894 and uploaded the zipped folder called BUGreportGATK.zip. Sorry for the generic name!
Let me know if I can send anything else and thanks so much for taking a look at this for me.
hi @bshifaw
Have y'all have a chance to take a look at the files that I uploaded? I tried the nightly last week with a little test set of 10 intervals that give me both errors:
bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
and
and they still gave me the same errors with the nightly build. I hope y'all have had more success troubleshooting!
In the mean time I'm running GenotypeGVCFs with intervals of 10k base pairs and having a success rate of 63-68% of the intervals working. Coming up with techniques to identify and track down the vcf file paths for the intervals that worked has been tricky but I think I have some working methods. I'm hoping y'all with have some suggestions for the cause of the errors I'm seeing.
Thanks!
Hi @cmt
@bshifaw will get back to you shortly.
Hi @cmt
The files were sent to the dev team and here is their recent response
Hi @cmt ,
Update: The fix for the bug will be in the next gatk release, here is link the issue ticket
Hi @bshifaw and thank you!
I was able to work though the GATK pipeline using v 4.1.2.0 by scattering over successively smaller chunks of my genome, but I still ended up having to skip between 10-20% of each Linkage Group because of the errors described above.
If I were to try to capture that missing 10-20% of the genome by genotyping my pools again using the newest version of GATK with the bug fix, 4.1.4.0, I know I have to go back to at least GenomicsDB, but should I start over with HaplotypeCaller again?
Thanks!
Hi @cmt !
Expect this fix Beri linked to to be in the next release (it is not in 4.1.4.0). With that particular fix, you shouldn't have to run the HaplotypeCaller step again.