M2 and GDBI for PON: [E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at chr1:16949

manolismanolis Member ✭✭✭

GATK 4.1.1.0, local linux server

Hi,

I ran some WES normal samples:

${gatk} Mutect2 \
-R ${hg38} \
-I "${sample}.bam" \
-O "${sample}.vcf.gz" \
-L ${interval} \
-ip 5 \
--max-mnp-distance 0

and then GenomicsDBImport:

${gatk} GenomicsDBImport \
-R ${hg38} \
-V "${sample1}.vcf.gz" \
-V "${sample2}.vcf.gz" \
--batch-size 1 --reader-threads 1 \
--genomicsdb-workspace-path "GDBI_pon" \
-L chr1

Here is the error:

13:18:45.329 INFO  GenomicsDBImport - Done initializing engine
13:18:45.517 INFO  GenomicsDBImport - Vid Map JSON file will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/vidmap.json
13:18:45.517 INFO  GenomicsDBImport - Callset Map JSON file will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/callset.json
13:18:45.517 INFO  GenomicsDBImport - Complete VCF Header will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/vcfheader.vcf
13:18:45.517 INFO  GenomicsDBImport - Importing to array - /home/manolis/prove/GDBI_pon/GDBI_pon/genomicsdb_array
13:18:45.517 INFO  ProgressMeter - Starting traversal
13:18:45.517 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
13:18:45.820 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
[E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at chr1:14653
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe721be816b, pid=12942, tid=0x00007fe7801f7700
#
# JRE version: OpenJDK Runtime Environment (8.0_152-b12) (build 1.8.0_152-release-1056-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.152-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb8166440819035845683.so+0x35416b]  bcf_unpack+0x36b
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/manolis/prove/GDBI_pon/hs_err_pid12942.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Here are the AF header line of the vcf.gz and the variant:

##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">

chr1    14653   .   C   T   .   .   DP=13;ECNT=2;MBQ=20,30;MFRL=212,211;MMQ=43,33;MPOS=40;POPAF=7.30;TLOD=10.18 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:9,4:0.333:13:6,2:3,1:0|1:14653_C_T:14653:5,4,3,1

Here is the VCF validation:

${gatk} ValidateVariants \
-R ${hg38} \
-V "${sample1}.vcf.gz" \
-L ${interval} \
-ip 5

No warnings at all ...

When I process the "${sample1}.vcf.gz" with:

bcftools annotate -x FORMAT/AF "${sample1}.vcf.gz" -O z -o "${sample1}_noAF.vcf.gz"

and then run GenomicsDBImport, I do not get any error ...
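
Since removing FORMAT/AF makes the import succeed, one quick sanity check is to confirm that the AF values in the file really are plain decimal numbers. The sketch below is only an illustration: `check_af_values` is a hypothetical helper (not part of GATK or bcftools) that reads an uncompressed VCF on standard input and prints the position and value of any AF entry that does not look like a number.

```shell
# Hypothetical helper: report FORMAT/AF entries that are not plain decimal numbers.
# Reads an uncompressed VCF on stdin and prints "CHROM:POS<TAB>value" per bad entry.
check_af_values() {
  LC_ALL=C awk -F'\t' '
    /^#/ { next }                        # skip header lines
    NF >= 10 {
      n = split($9, keys, ":")           # FORMAT keys, e.g. GT:AD:AF:DP
      idx = 0
      for (i = 1; i <= n; i++) if (keys[i] == "AF") idx = i
      if (idx == 0) next                 # record has no AF key
      for (s = 10; s <= NF; s++) {       # every sample column
        split($s, vals, ":")
        m = split(vals[idx], afs, ",")   # one AF value per alternate allele
        for (j = 1; j <= m; j++)
          if (afs[j] != "." && afs[j] !~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/)
            print $1 ":" $2 "\t" afs[j]
      }
    }'
}
```

For a compressed file, something like `gzip -dc "${sample1}.vcf.gz" | check_af_values` should work. No output suggests the AF values themselves are well formed, which points away from the file and toward the environment in which it is parsed.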

Any suggestion please?
Many thanks

Answers

  • manolismanolis Member ✭✭✭

    Hi, fixed. It seems the problem was related to one of the hosts of the cluster. Sorry for bothering you.

    Best

  • Hi, manolis !
    Tell me, please, how exactly did you solve this problem?
    Many thanks

  • jpfloridojpflorido SevilleMember

    Hi manolis,

    I'm having exactly the same issue with my PoN creation. Supposedly the AF field is correct and all my VCFs (using only 3 for test purposes) passed the ValidateVariants test. I also use the --max-mnp-distance=0 option in Mutect2 to avoid the known bug in the GenomicsDBImport tool. But I still get the same "Invalid character '.' in 'AF' FORMAT field at ..." and "A fatal error has been detected by the Java Runtime Environment" errors.

    Would you mind letting me know what your host problem was and how you fixed it? Just in case the same is happening here...

    Thanks in advance!

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    Any suggestion about this ^ @manolis ?

    Thanks!

  • manolismanolis Member ✭✭✭
    edited July 2019

    Our "solution" is totally crazy and we still cannot explain why this is happening! We have a Linux cluster with 6 hosts.
    When I run GDBI for PON creation (GATK v4.1.1.0) during the day, it does not work, even if there are no jobs on any of the hosts!
    When I run it late at night, it works.

    For now we cannot explain this behavior :o:/ We are waiting for an answer from our server support.

    Best

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    @jpflorido said:
    Hi manolis,

    I'm having exactly the same issue with my PoN creation. Supposedly the AF field is correct and all my VCFs (using only 3 for test purposes) passed the ValidateVariants test. I also use the --max-mnp-distance=0 option in Mutect2 to avoid the known bug in the GenomicsDBImport tool. But I still get the same "Invalid character '.' in 'AF' FORMAT field at ..." and "A fatal error has been detected by the Java Runtime Environment" errors.

    Would you mind letting me know what your host problem was and how you fixed it? Just in case the same is happening here...

    Thanks in advance!

    Thank you for your answer manolis!

    Is there someone else from the GATK team who can advise on this? I am quite sure my Mutect2 outputs were generated correctly and the AF field seems right to me, but maybe I am wrong.

    Just to refresh, we (@jpflorido and I) are trying to create a panel with 3 exome samples, but it fails when putting the VCFs together with the GenomicsDBImport tool.

    I have also tried building the latest version of GATK directly from the repository in case this is something that has been fixed recently, but the same error occurs. I can share whatever you need.

    Thanks in advance,
    Francisco

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited July 2019

    Hi @fmortuno

    Can you please try running GenomicsDBImport with --max-mnp-distance 0, as shown in this tutorial https://software.broadinstitute.org/gatk/documentation/article?id=24057, and see if that resolves the error?

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    Thank you very much for your answer @bhanuGandham !!!

    However, maybe I am missing something but I cannot see the option max-mnp-distance for GenomicsDBImport in GATK v4.1.2.0 so I get the error:

    max-mnp-distance is not a recognized option

    I checked the tutorial for that tool and that version but still cannot see the option. I already used that option in Mutect2, where it is available, but got the same error at the GenomicsDBImport step. Any other suggestions?

    Thanks in advance

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    Any thoughts here @bhanuGandham or anyone else? Thanks!!!!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fmortuno

    I am looking into this and will get back to you shortly.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited July 2019

    Hi @fmortuno

    There was an error in the documentation: --max-mnp-distance 0 should only be used in the Mutect2 command and not in the GenomicsDBImport step. You were right about that.

    It's a long thread and I do not see your error log in it. Would you please post the exact command you are using and the error you are seeing? Thank you. This will help the dev team debug the issue.

    Sorry for the delay in getting back to you; we have been facing large volumes of questions recently.

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member
    edited July 2019

    Thanks @bhanuGandham !!!

    The error is essentially the same as the one initially posted in this thread, which is why I asked here. Let me show you the specific logs I got:

    Command (gatk v4.1.2.0):

    gatk GenomicsDBImport \
              --genomicsdb-workspace-path pon_db \
              --R hs37d5.fa \
              -V <sample1>.vcf.gz \
              -V <sample2>.vcf.gz \
              -V <sample3>.vcf.gz \
              -L 0000-scattered.interval_list
    

    Error:

    13:07:27.084 INFO  GenomicsDBImport - Done initializing engine
    13:07:27.378 INFO  GenomicsDBImport - Vid Map JSON file will be written to /mnt/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/tmp/PoN/f42346dd-4a81-424e-984e-73e5b43d4eab/call-CreatePanel/shard-0/execution/pon_db/vidmap.json
    13:07:27.379 INFO  GenomicsDBImport - Callset Map JSON file will be written to /mnt/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/tmp/PoN/f42346dd-4a81-424e-984e-73e5b43d4eab/call-CreatePanel/shard-0/execution/pon_db/callset.json
    13:07:27.379 INFO  GenomicsDBImport - Complete VCF Header will be written to /mnt/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/tmp/PoN/f42346dd-4a81-424e-984e-73e5b43d4eab/call-CreatePanel/shard-0/execution/pon_db/vcfheader.vcf
    13:07:27.379 INFO  GenomicsDBImport - Importing to array - /mnt/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/tmp/PoN/f42346dd-4a81-424e-984e-73e5b43d4eab/call-CreatePanel/shard-0/execution/pon_db/genomicsdb_array
    13:07:27.379 INFO  ProgressMeter - Starting traversal
    13:07:27.379 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
    13:07:27.928 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
    [E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at 1:13079
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00002b1be876416b, pid=29197, tid=0x00002b1bb7fa0700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtiledbgenomicsdb434897115576972739.so+0x35416b]  bcf_unpack+0x36b
    #
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /mnt/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/tmp/PoN/f42346dd-4a81-424e-984e-73e5b43d4eab/call-CreatePanel/shard-0/execution/hs_err_pid29197.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    

    The interval file was generated with SplitIntervals. As we mentioned before, the three VCFs were generated with Mutect2 using the --max-mnp-distance 0 option and they were validated by ValidateVariants without errors or warnings. I am quite sure the AF field format is correct, but if I filter out the AF field from my VCFs, the GenomicsDBImport command runs without errors. Here is an example of the AF header and a variant from one VCF:

    ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
    ...
    1   13116   .   T   G   .   haplotype;map_qual  CONTQ=93;DP=23;ECNT=2;GERMQ=36;MBQ=20,37;MFRL=267,254;MMQ=27,24;MPOS=59;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=37.57    GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:13,10:0.440:23:6,6:4,4:0|1:13116_T_G:13116:6,7,5,5
    

    Any suggestion about what could be going on? I can share any other logs or detail you may need.

    Thanks again,
    Francisco

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fmortuno and @manolis

    Our dev team is looking into this right now. Would you please share your input files with us so we can recreate the error and debug it?
    Please find the details on how to share your data here: https://software.broadinstitute.org/gatk/guide/article?id=1894

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    Thank you @bhanuGandham !

    I just uploaded my input files, logs and command line to the FTP as suggested in the article. The name of the compressed file is AF_error_GDBI_for_PoN.tar.gz. I restricted the three VCFs to the MT chromosome only, to make the error easier to reproduce.

    Please, if possible, confirm you got the shared file correctly in your FTP and let me know when you have more information about the error.

    Thanks again,
    Francisco.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fmortuno

    We have shared your data with the developers who are trying to recreate the error. We will get back to you shortly.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited August 2019

    Hi @fmortuno

    We are unable to recreate this error on our end using the command and files you provided to us. Those commands worked just fine on our end.
    I am not sure why you are seeing this error. Your logs indicated you were using 4.0.9.0. Could you please try the latest version, GATK v4.1.3.0, and see if the error persists?

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member
    edited August 2019

    Hi @bhanuGandham

    Sorry for the late response. I have tried the newest version v4.1.3.0 but error still persists:

    gatk GenomicsDBImport -R hs37d5.fa \
                          --genomicsdb-workspace-path pon_db \
                          -V sample1.MT.nn.vcf.gz \
                          -V sample2.MT.nn.vcf.gz \
                          -V sample3.MT.nn.vcf.gz -L MT
    

    Error Log (for v4.1.3.0):

    Using GATK jar /home/fmortuno/tools/gatk/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/fmortuno/tools/gatk/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar GenomicsDBImport -R /data/lustre/scratch/CBRA/data/indexed_genomes/bwa/hs37d5/hs37d5.fa --genomicsdb-workspace-path pon_db -V sample1.MT.nn.vcf.gz -V sample2.MT.nn.vcf.gz -V sample3.MT.nn.vcf.gz -L MT
    09:28:48.196 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/fmortuno/tools/gatk/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 23, 2019 9:28:49 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    09:28:49.834 INFO  GenomicsDBImport - ------------------------------------------------------------
    09:28:49.834 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.3.0
    09:28:49.834 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    09:28:49.835 INFO  GenomicsDBImport - Initializing engine
    09:28:50.101 INFO  IntervalArgumentCollection - Processing 16569 bp from intervals
    09:28:50.134 INFO  GenomicsDBImport - Done initializing engine
    09:28:50.333 INFO  GenomicsDBImport - Vid Map JSON file will be written to /data/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/AF_error_GDBI_for_PoN/pon_db/vidmap.json
    09:28:50.333 INFO  GenomicsDBImport - Callset Map JSON file will be written to /data/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/AF_error_GDBI_for_PoN/pon_db/callset.json
    09:28:50.333 INFO  GenomicsDBImport - Complete VCF Header will be written to /data/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/AF_error_GDBI_for_PoN/pon_db/vcfheader.vcf
    09:28:50.333 INFO  GenomicsDBImport - Importing to array - /data/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/AF_error_GDBI_for_PoN/pon_db/genomicsdb_array
    09:28:50.333 INFO  ProgressMeter - Starting traversal
    09:28:50.333 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
    09:28:50.443 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
    [E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at MT:73
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007f565c8b7dfb, pid=25817, tid=0x00007f564fdff700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_191-b12) (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12)
    # Java VM: OpenJDK 64-Bit Server VM (25.191-b12 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtiledbgenomicsdb6724707253584796459.so+0x3cbdfb]  bcf_unpack+0x36b
    #
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /data/lustre/scratch/CBRA/projects/lung_cancer_sas/PoN/AF_error_GDBI_for_PoN/hs_err_pid25817.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    

    Any other suggestions? If you cannot reproduce the error, I understand there is something wrong on my end, but I tried running on different machines and always got the same error. I need to fix this sooner rather than later, but I have no idea what could be going on.

    Thanks!

  • manolismanolis Member ✭✭✭
    edited August 2019

    Hi, I still have the same problem (GATK v4.1.1.0) and I believe that in my case too it is related to our server/host.

    We do not know why we can run GDBI only during the night and not during the day (I know it seems a crazy situation)...

    @fmortuno, were you the only one logged in to the server during your tests? Did you try later at night without other users logged in?

    Thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited August 2019

    Hi @manolis and @fmortuno

    This is a weird situation because we are unable to recreate the error, yet both of you have reported the same error. To figure out what might be common in the way both of you are processing the data, would you please answer the following questions:

    1) Are you using a shared file system?
    2) Did you use Docker?
    3) The native error report is usually persisted as a hs_err_pid.log file. Would it be possible to provide us with that file? This is usually found in the directory from where gatk was invoked. But, it is configurable by the system and/or user, so the best way is to grab the filename from standard output. Also, it will be useful if you could set "ulimit -c unlimited" before running gatk.
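
The two requests in point 3 can be sketched in shell as follows. This is only an illustration: `find_hs_err_log` is a hypothetical helper, and it assumes GATK's standard error was saved to a file (for example with `2> err.log`).

```shell
# Allow core dumps in this shell and its child processes.
# This can fail if the administrator has set a lower hard limit.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core dump limit" >&2

# Hypothetical helper: print the hs_err_pid*.log path reported in a saved JVM crash message.
find_hs_err_log() {
  sed -n 's|^[[:space:]]*#[[:space:]]*\(/.*hs_err_pid[0-9]*\.log\)[[:space:]]*$|\1|p' "$1"
}
```

After a crashed run, `find_hs_err_log err.log` should print the path that follows the "An error report file with more information is saved as:" line.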

  • manolismanolis Member ✭✭✭

    Hi @bhanuGandham

    1) yes
    2) no, I converted the WDL pipelines to a bash pipeline
    3) I have to check

    Thanks

  • mlatharamlathara USAMember

    Hi @manolis @fmortuno

    I'm a developer working on GenomicsDB. Here are a couple more things that might help us figure this out.

    1) Can you try giving the import more memory? That is, something like:

    gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport <rest of your options>
    

    Sometimes a lack of memory can cause weird errors, so I'm hoping explicitly giving 4g should be enough for the example vcfs you provide. (this, of course, assumes you have more than 4g available)

    2) Can you convert your compressed vcfs to uncompressed and try importing those? You can use bgzip or bcftools (for instance) to uncompress. And (for instance) GATK's IndexFeatureFile tool to index the resulting vcf files. Then import those and let us know if you still see these errors.
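
A rough sketch of suggestion 2, under a few assumptions: `decompress_vcf` is a hypothetical helper, plain `gzip -dc` is used because bgzip output is gzip-compatible, and the IndexFeatureFile call is left as a comment because its input flag appears to differ between GATK releases (check `gatk IndexFeatureFile --help` for yours).

```shell
# Hypothetical helper: write an uncompressed copy of a .vcf.gz next to the original
# and print the name of the new file.
decompress_vcf() {
  vcf_gz="$1"
  vcf="${vcf_gz%.gz}"                  # strip the trailing .gz
  gzip -dc "$vcf_gz" > "$vcf" && printf '%s\n' "$vcf"
}

# Afterwards, index the uncompressed file, for example:
#   gatk IndexFeatureFile -I sample1.vcf    # older releases take -F instead of -I
```

The uncompressed files can then be passed to GenomicsDBImport with `-V` in place of the `.vcf.gz` inputs.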

    Thanks.

  • manolismanolis Member ✭✭✭

    Hi @mlathara

    1) Same problem with "Xmx4g -Xms4g" (see hs_err_pid10807.log file)

    2) Still I have the same problem (see hs_err_pid11327.log file)

    Thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manolis and @fmortuno

    Because we are unable to recreate this issue on our end, we are not quite sure what more we can do. We will keep an eye out to see if other users come up with a similar issue, which might give us more clues.

    However, if this is a big blocker for you, then we will try to investigate a little further. Please provide the following information and we will see what we can find:
    1) Your runtime environment details
    2) A core dump: please enable core dumps and provide that to us too.

  • fmortunofmortuno Clinical Bioinformatics Area, FPS, Seville (Spain)Member

    Hi @bhanuGandham, thank you for the support. I totally understand; it seems to be some weird incompatibility with the system. It didn't work even when I tried in two different environments.

    However, I finally tried using the GATK v4.1.3.0 docker and it worked that way. I think I can go and create my PoN using the docker.

    Thanks again!

  • Hi!

    I faced the same issue. Found that removing reader_threads solved the problem. Working with GATK 4.1.2.0 without WDL.

    Cheers,

  • manolismanolis Member ✭✭✭

    Thanks @JoanGibert! I will try it (I'm using bash) and give you feedback.

    Best

  • isaienceisaience ParisMember
    @manolis I had the same problem, were you able to solve it?
  • isaienceisaience ParisMember
    For future readers: I was also able to solve the problem by running GATK through Docker. I leave the command here in case it is useful for you:

    sudo docker run -v "$(pwd)":"$(pwd)" -w "$(pwd)" -i -t broadinstitute/gatk gatk GenomicsDBImport -R GATK/Reference/GRCh38.d1.vd1.fa -L Cleaned_bqsr.bams/SRR5273612_SRR5273621_realign_target.intervals --genomicsdb-workspace-path PON/pon_db --merge-input-intervals true -V SRR5273610.vcf.gz -V SRR5273611.vcf.gz
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @fmortuno @JoanGibert @manolis @isaience

    Thank you for the updates. This will be very useful for other community members. GATK team is grateful for your assistance.

  • henahena FinlandMember

    Hi,

    I'm getting the same error as above

    [E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at

    I do have enough memory (max is 8g at the moment, though I tried with 35 as well) and I am not using the reader_threads option. The VCF files were generated with Mutect2. I also tried the suggestion of uncompressing the VCF files, indexing the uncompressed files and importing those, but that did not work. I'm using GATK v4.1.2.0 and testing with two VCF files (the original data set would have ~150).

    As a side note, the download page for GATK is offering me v3.8-0 and not v4. I assume v4 would have a later version than the one I'm using, which might help with the issue.

    Regards,

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @hena

    Please provide the exact command you are using and the entire error log.

  • henahena FinlandMember

    The command is (I'm using the jar directly as the cluster environment has its own Java installation directory):

    /apps/java/jdk1.8.0_77/bin/java -jar /fs/vault/pipelines/common/external/gatk/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenomicsDBImport -R /fs/vault/pipelines/vcp/data_files/ensembl/73/Homo_sapiens.GRCh37.73.dna.chr.fa -L all.bed  --genomicsdb-workspace-path test_db -V 00005.vcf -V 00039.vcf 2> err.log
    
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @hena

    Have you tried the solutions provided by other users in this thread above?

  • henahena FinlandMember

    As much as I could, and none seemed to help. I can't run Docker images, so I haven't tried that.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @hena

    Can you try to run it with the GATK launch script? We don't recommend using the jar directly; the launch script sets all the options and settings properly.

  • henahena FinlandMember

    Running through gatk command didn't help.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited December 2019

    Hi @hena

    We usually use GenomicsDBImport for sample sizes in the 1000s. Because you have a smaller sample size, please try to use CombineGVCFs instead. That might be a better solution here.

  • henahena FinlandMember
    edited January 3

    I think I figured it out. The Finnish decimal separator is a comma ','. The environment variables include multiple LC_ values, such as LC_NUMERIC=fi_FI.UTF-8. If the VCF parsing library uses those values to decide that a valid float should look like 0,5 instead of 0.5, then this kind of error could arise. So I tested this with the following command:
    _JAVA_OPTIONS="-Xmx5g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" LC_ALL=C /fs/vault/pipelines/common/external/gatk/gatk-4.1.4.1/gatk GenomicsDBImport -R /fs/vault/pipelines/vcp/data_files/ensembl/73/Homo_sapiens.GRCh37.73.dna.chr.fa -L all.bed --genomicsdb-workspace-path test_db -V 00005.vcf -V 00039.vcf 2> err.log

    I think this worked fine: at least it didn't report a crash, and the database directory has entries for all chromosomes. If this is added to the launch script, though, I'd like to see a way to define the Java path so that different installed Javas can be used. The full log is attached.
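
To make the locale check and the workaround above easier to repeat, here is a small sketch. Both function names are hypothetical; `locale decimal_point` asks which decimal separator the current settings select, and `env LC_ALL=C` forces the C locale for a single command only.

```shell
# Print the decimal separator selected by the current locale settings.
# A comma here would be consistent with the parsing problem described above.
decimal_separator() {
  locale decimal_point
}

# Hypothetical wrapper: run one command with the C locale forced,
# leaving the locale of the calling shell untouched.
with_c_locale() {
  env LC_ALL=C "$@"
}
```

With these defined, `with_c_locale gatk GenomicsDBImport <rest of your options>` would be equivalent to prefixing the command with `LC_ALL=C` as in the test above.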

    As a side question: if I have exome data, should the target be full chromosomes or just the exome target? Does it matter?

  • wdecosterwdecoster University of AntwerpMember

    For the record, I get the same error while planning to generate a PoN for 16 samples.

    Based on the suggestion from hena, I looked at the locale settings:
    $ echo $LC_NUMERIC
    nl_NL.UTF-8

    Changing that:
    LC_NUMERIC=en_US.UTF-8

    seems to fix the issue.
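
One caveat, sketched below with a hypothetical helper: in POSIX locale handling, a non-empty LC_ALL takes precedence over LC_NUMERIC, which in turn takes precedence over LANG. Setting LC_NUMERIC as above therefore only has an effect when LC_ALL is unset or empty.

```shell
# Hypothetical helper: report which locale name governs number formatting,
# following the POSIX precedence LC_ALL > LC_NUMERIC > LANG.
effective_numeric_locale() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_NUMERIC:-}" ]; then
    printf '%s\n' "$LC_NUMERIC"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf 'C\n'   # assumption: with nothing set, systems fall back to the C/POSIX locale
  fi
}
```

If this prints something like nl_NL.UTF-8 or fi_FI.UTF-8 just before the import, the decimal separator is likely a comma and the workarounds in this thread are worth trying.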