Attention:
The frontline support team will be unavailable to answer questions on April 15th and 17th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)

cmtcmt Seattle WAMember

Hello!

I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.

I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20 I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter the amount of memory I allowed the task, it failed every time.

But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.

I am stumped by the error I got for many of the linkage groups:

***********************************************************************

A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz

***********************************************************************

What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging the MNPs or ignoring them. I was thinking of removing the MNPs using SelectVariants, before importing the GVCFs into GenomicsDB but how do you get SelectVariants to output a GVCF, which is needed for Joint Genotyping.

What would you recommend I do to get past this MNP hurdle?

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @cmt

    This could be due to multiple variants in multiple lines of the same locus. Would you please try to collapse the variants in different lines into one using vcftools and trying again?

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham !

    Thank you for your suggestion. I have been looking into using VCFtools, and I know that it is not a program you troubleshoot, but can you give me any more information for what I am trying to do?

    I have read through the manual, but nothing there or in through google searches is helping me understand what to do with vcftools to collapse the variants that are on different lines into one line.

    Is there another way to do this that is more straight forward? Also, do you know why this happens? Is there something I can do at an earlier stage of GATK to prevent it?

    Thanks!

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham

    I tried a couple things to try to alter the gvcf files in bctools, using --multiallelics and --remove-duplicates under norm, but they just reduced the file size dramatically (5.0 Gb down to 3.3 Gb) and according to the program output there were no changes:

    Lines total/split/realigned/skipped: 130135928/0/0/0

    So I did what I should have done right away, and took a look at the locations that were flagged in the standard error log of the GBImports:

    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG12:1179503 in VCF /1f0ddd38-9aa0-4e5e-8c42-a355f4ee76f9/HaplotypeCallerGVCF_halfScatter_GATK4/55ebd0b5-00b0-4756-b0f3-8cd74ba10cf5/call-MergeGVCFs/301498.merged.matefixed.sorted.markeddups.recal.g.vcf.gz
    

    This is a screenshot of the line that has the issue- the PL field has the max 10,626 values separated by commas, you can see how long the line is on the right side of the text editor.

    How do I deal with that issue? I think they are probably valid considering the variation at that location in the gvcf file:

    LG23 2475332 . GAAGATATTTA AAAGATATTTA,CAAGATATTTA,TAAGATATTTA,<NON_REF> 0.66 . AS_RAW_BaseQRankSum=||||;AS_RAW_MQ=0.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=||||;AS_RAW_ReadPosRankSum=||||;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=49;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=27580,49 [...]

    Is there any way to remove these locations from the gvcf so I may proceed with GenomicsDBImport and GenotypeGVCFs? Is there a filter that I am missing, or an argument in the HaplotypeCaller step?

    And is it strange that I only have one location in each of 6 of my 12 pools that has this problem?

    Thank you so much for your help!

  • cmtcmt Seattle WAMember
    edited March 1

    Ok, I have looked at all my flagged sites and they all have 10626 PLs.
    Here are other examples:

    LG20    5518501 .   ATTTTT  ATTTT,A,ATTTTTT,<NON_REF>   0.71    .   AS_RAW_BaseQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_MQ=1600.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_ReadPosRankSum=|NaN|NaN|NaN|NaN;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=9;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=5414,9 GT:AD:DP:GQ:PL:SB   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:1 [...]
    
    LG20    5518500 .   GATTTTTTTTTA    AATTTTTTTTTA,CATTTTTTTTTA,TATTTTTTTTTA,<NON_REF>    0.81    .   AS_RAW_BaseQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_MQ=1600.00|0.00|0.00|0.00|0.00;AS_RAW_MQRankSum=|NaN|NaN|NaN|NaN;AS_RAW_ReadPosRankSum=||||;AS_SB_TABLE=0,0|0,0|0,0|0,0|0,0;DP=9;MLEAC=0,0,0,0;MLEAF=0.00,0.00,0.00,0.00;RAW_MQandDP=5414,9 GT:AD:DP:GQ:PL:SB   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0: [...]
    

    These regions are obviously very messy, my species is known for having a lot of mini and micro satellites in the genome and I'm not interested in resolving them- how do I remove or avoid the areas that are getting flagged by GBImports? Can I use SelectVariants to subset only SNPs before GBImports?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 2

    Hi @cmt

    Yes. You can use SelectVariants to select only for SNPs from a sample by using this option: -selectType SNP.
    Let me know if this solution works for you. If not then we can look into it further.

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham

    Thanks for the suggestion, but that did not work- I think it has something to do with the NON-REF allele? I wish I had made a better note of where I read that. There were no errors when I ran SelectVariants on my gvcf file, but I got this in my stderr.log:

    18:35:28.667 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    18:42:40.474 INFO  ProgressMeter -             unmapped              7.2                     0              0.0
    18:42:40.475 INFO  ProgressMeter - Traversal complete. Processed 0 total variants in 7.2 minutes.
    18:42:40.524 INFO  SelectVariants - Shutting down engine
    [March 6, 2019 6:42:40 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 7.26 minutes.
    Runtime.totalMemory()=211730432
    Using GATK jar /gatk/gatk-package-4.0.12.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3000m -jar /gatk/gatk-package-4.0.12.0-local.jar SelectVariants -R /cromwell_root/uw-hauser-pcod-gatk-tests/gadMor2.fasta --variant /cromwell_root/fc-6ebc1ade-bf3e-4d51-a270-eebefb480ded/e179e377-dbf0-4831-9494-45894cf99945/HaplotypeCallerGVCF_halfScatter_GATK4/e5a94374-27e2-4d97-b661-00d01b45b620/call-MergeGVCFs/301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz --output SelectVariants.vcf.gz --select-type-to-include SNP
    
    

    I moved on to try a couple of different things on my gvcf files, but I haven't landed on a solution yet.

    The first was to use bcftools to normalize the indels, but so far nothing I have tried with that program has solved the issue.

    Splitting the multiallelic sites:

    bcftools norm --multiallelics -any --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_splitMNPany_out.g.vcf.gz

    I get this error, which I googled and see other people with multiple ploidy get using this command:

    Error at LG01:211, the tag PL has wrong number of fields

    Joining the biallelic sites in to multiallelic sites didn't change the fields according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0), but the files were significantly smaller (5GB to 3.3 GB):

    bcftools norm --multiallelics +both --output-type z ./301496.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301496.bcftools_multiBoth_out.g.vcf.gz

    Tried to remove duplicate records, but bcftools didn't make any changes according to the output (Lines total/split/realigned/skipped: 122656841/0/0/0) but it changed the file size(5GB to 3.3 GB) :

    bcftools norm --rm-dup both --output-type z ./301498.merged.matefixed.sorted.markeddups.recal.g.vcf.gz >./301498.bcftools_RMVdups_out.g.vcf.gz

    I tried GATK's LeftAlignAndTrimVariants, where I split multiallelics:

    gatk --java-options "-Xmx${command_mem}m" LeftAlignAndTrimVariants \ -R ${ref_fasta} \ --variant ${raw_vcf} \ --output "${output_name}.${vcf_suffix}" \ --split-multi-allelics true \ --dont-trim-alleles true \ --disable-read-filter NotDuplicateReadFilter \ --max-indel-length 500

    But using the gvcf output of that in GDBImports gave me the following error at every linkage group, and when I originally ran the GDBImports on my gvcf files from HaplotypeCaller it did not fail at every linkage group. That gives me the impression that I don't need to split up my multiallelics, I just need some way to reduce the number of PLs for some variants:

    20:56:52.695 INFO GenomicsDBImport - Importing batch 1 with 2 samples terminate called after throwing an instance of 'VCF2TileDBException' what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 210 ] current cell: [ 0, 210 ]. The most likely cause is unexpected data in the input file: (a) A VCF file has two lines with the same genomic position (b) An unsorted CSV file (c) Malformed VCF file (or malformed index) See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data

    I still have a couple things I want to try. Right now I'm left aligning but not trimming or splitting the multiallelics up using GATK's LeftAlignAndTrimVariants.

    I also want to try the bcftools normalize command again, but with the argument --do-not-normalize, which I saw recommended here: gatkforums.broadinstitute.org/gatk/discussion/23462/gatk-v4-1-0-0-validatevariants-gvcf-mode-error-non-in-v4-0-11-0

    I also haven't tried the output of the bcftools files in the GenomicsDBImport because I didn't trust the files with the reduced file size with no explanation by the program for what had changed in the gvcf files. I have thought about running them through the GDBImport, but I don't think I could trust the files without knowing why they shrunk so with bcftools.

    My last hope is to manually edit the vcf files to remove the 10,626 PLs and see if that works, though I am still no closer to understanding why I have that many PLs (the default cap, right?) at just a couple locations, and why those are the locations that prevent my using GenomicsDBImport.

    Do you have any ideas for what might help?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 18

    Hi @cmt

    I asked the dev team to look into this and this is what they had to say:

    Can you please try to collapse the variants in different lines into one using using bcftools as shown in this doc: https://github.com/GenomicsDB/GenomicsDB/wiki/Useful-external-tools#useful-bcftools-commands.

    This is a know error and can be fixed by this method.

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham!

    I did that, but it was not helpful; I got the same errors. The variants in my gvcf files were not split on different lines, the extreme number of PL values for some variants seems to have been the main issue (or a symptom of an issue I don't understand). What ended up working was LeftAlignAndTrimVariants, with just the left align and the trim arguments, which reduced the size of the MNPs.

    I can ask this in another thread, but does CatVariants work in GATK4?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 18

    Hi @cmt

    Thank you for sharing your solution.

    No, CatVariants does not work in GATK4. It is a deprecated tool. You can however use gathervfcs tool for the same purpose.

  • cmtcmt Seattle WAMember
Sign In or Register to comment.