Problem parsing interval list file to GenomicsDBImport

YatrosYatros Seattle, WA, USAMember
edited December 2017 in Ask the GATK team

Hello,

I'm trying to combine 6 GVCF files into a single VCF file using GenomicsDBImport + GenotypeGVCFs with GATK4. I'm still using GATK4.beta5, since I'm not able to find the link to GATK4.beta6.

My target.interval file looks like this:
1:8022745-8023035
1:8025283-8025585
1:8029304-8029564

If I run the following command:

gatk-launch GenomicsDBImport \
-V sample_001.gvcf.gz \
-V sample_002.gvcf.gz \
-V sample_003.gvcf.gz \
-V sample_004.gvcf.gz \
-V sample_005.gvcf.gz \
-V sample_006.gvcf.gz \
--genomicsDBWorkspace testDB \
--L 1:8022745-8023035

GATK will work and generate a database with genotypes for this specific interval.

However, if I run the following script:

            #!/bin/bash
            while read in; do \
            gatk-launch GenomicsDBImport \
            -V sample_001.gvcf.gz \
            -V sample_002.gvcf.gz \
            -V sample_003.gvcf.gz \
            -V sample_004.gvcf.gz \
            -V sample_005.gvcf.gz \
            -V sample_006.gvcf.gz \
            --genomicsDBWorkspace testDB \
            --L "$in"; done < target.interval

GATK runs, but I get the following error for every single interval and nothing is added to the database:

: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035

: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585

I understand that the software is recognizing each line of my file correctly but, for some reason, it cannot parse the start and end position of each interval.

I have already looked for this error in the forum, but I haven't found any posts that answer my question.

I also tried to use a regular bed file as input file with the following format:
1 8022745 8023035
1 8025283 8025585
1 8029304 8029564

In that case the error was the following one:

A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig '1  8022745 8023035' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

Can anybody help me with this issue?

Thank you very much,

Here you have the entire error message for the first two intervals:

09:50:28.057 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte                                                                     l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:27 AM PST] GenomicsDBImport  --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian                                                                     t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1                                                                     7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5                                                                     0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063                                                                       --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa                                                                     meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency                                                                      SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian                                                                     tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref                                                                     etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla                                                                     ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:27 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta                                                                     .5
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:28.444 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:28.444 INFO  GenomicsDBImport - Deflater: IntelDeflater
09:50:28.444 INFO  GenomicsDBImport - Inflater: IntelInflater
09:50:28.444 INFO  GenomicsDBImport - GCS max retries/reopens: 20
09:50:28.444 INFO  GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree                                                                     /dr_all_nio_fixes
09:50:28.444 INFO  GenomicsDBImport - Initializing engine
09:50:38.228 INFO  GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:38 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.17 minutes.
Runtime.totalMemory()=1368915968
***********************************************************************

: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
09:50:47.419 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte                                                                     l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:47 AM PST] GenomicsDBImport  --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian                                                                     t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1                                                                     7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5                                                                     0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063                                                                       --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa                                                                     meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency                                                                      SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian                                                                     tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref                                                                     etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla                                                                     ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:47 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta                                                                     .5
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:47.976 INFO  GenomicsDBImport - Deflater: IntelDeflater
09:50:47.976 INFO  GenomicsDBImport - Inflater: IntelInflater
09:50:47.976 INFO  GenomicsDBImport - GCS max retries/reopens: 20
09:50:47.976 INFO  GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree                                                                     /dr_all_nio_fixes
09:50:47.976 INFO  GenomicsDBImport - Initializing engine
09:50:53.175 INFO  GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:53 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=1389887488
***********************************************************************

: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585

***********************************************************************

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Yatros
    Hi,

    You can download beta 6 here.

    Right now GenomicsDBImport only accepts 1 interval at a time. I think the team is working on it accepting more than 1 interval in the future. Have a look at this article as well.

    -Sheila

  • YatrosYatros Seattle, WA, USAMember

    Thank you very much,

    I don't think a bash script it is going to solve this issue.

    I'm working on a WDL+CROMWELL script to get this task done.

    I'll post it, once I've got it working appropriately.

    Best,

    Yatros

  • YatrosYatros Seattle, WA, USAMember

    Hello,

    I have tried to handle this issue with a WDL + CROMWELL script, but I am still facing a problem. I will detail my input files and my script to see if you can help me solve the problem.

    I am running the following command:
    java -Dconfig.file=application.conf -jar $CROMWELL run my.wdl --inputs my.json

    My application.conf looks like this:

    backend {
      providers {
        Local {
          config {
              concurrent-job-limit = 25
            filesystems {
              local {
                localization: [
                  "soft-link", "hard-link", "copy"
                ]
                caching {
                  duplication-strategy: [
                    "soft-link"
                  ]
                }
              }
            }
          }
        }
      }
    }
    

    my.json file looks like this:

    {
    "Merge_gvcfs.human_g1k_v37_fa":"/mnt/ref/human_g1k_v37.fasta",
     "Merge_gvcfs.human_g1k_v37_fai":"/mnt/ref/human_g1k_v37.fasta.fai",
     "Merge_gvcfs.human_g1k_v37_dict":"/mnt/ref/human_g1k_v37.dict",
     "Merge_gvcfs.wdir":"/mnt/mywdir/",
     "Merge_gvcfs.target_interval":"/mnt/mywdir/Test.interval"
    }
    

    my.wdl script is the following one:

        workflow Merge_gvcfs {
    
        File human_g1k_v37_decoy_fa
        File human_g1k_v37_decoy_fai
        File human_g1k_v37_decoy_dict
        String wdir
        File target_interval
        Array[String] intervals = read_lines(target_interval)
        Array[File] in_files = merge_gvcfs.out
    
    
        # list all gvcf.gz files in wdir
        call get_file_names {
            input:
                wdir = wdir
        }
    
    
        # Run GenomicsDBImport for each interval in the intervals file
        scatter(inputs_row in intervals) {
            String interval = inputs_row
            call merge_gvcfs {
                    input:
                        wdir = wdir,
                        interval = interval,
                        file_names = get_file_names.file_names
            }
        }   
    
        call generate_vcf {
            input:
                human_g1k_v37_decoy_fa = human_g1k_v37_decoy_fa,
                in_files = in_files
        }
    
    }
    
    
        task get_file_names {
        String wdir
        command <<<
            readlink -f ${wdir}*.gvcf.gz>file_names.txt
            >>> 
            output {
                File file_names = "file_names.txt"
            }
        }   
    
        task merge_gvcfs{
        String interval
        String wdir
        File file_names
        Array[String] samples = read_lines(file_names)
                command {
                java -jar $GATK GenomicsDBImport \
                -V ${sep= ' -V ' samples} \
                --genomicsDBWorkspace Test \
                --intervals ${interval}
            }
            output {
            File out = "Test"
            }
            }
    
        task generate_vcf {
            File human_g1k_v37_decoy_fa
            Array[File] in_files
            command {
                java -jar $GATK GenotypeGVCFs \
                -R ${human_g1k_v37_decoy_fa} \
                -V gendb://${in_files} \
                -G StandardAnnotation -newQual \
                -O Test.vcf 
            }
        }
    

    The error that I am getting with this script is the following one:
    Error attempting to Execute java.lang.UnsupportedOperationException: Expression 'WdlExpression(<string:79:20 identifier "aW5fZmlsZXM=">)' evaluated to an Array but no 'sep' was specified

    If I specified the 'sep' option I would have several input values for the GenotypeGVCFs -V option and I would get the following error:

    A USER ERROR has occurred: Argument '[V, variant]' cannot be specified more than once.

    My main problem is that the output of the merge_gvcfs task is an array of files and GenotypeGVCFs -V option allows only a single input value. Even if I flattened the merge_gvcfs array into a string, I would still have several input values for the generate_vcf task.

    Is there any way to combine into a single folder the multiple output folders from the scatter function that loops through the intervals file list? This way I would be able to provide a single input value for the GenotypeGVCFs command.

    Thank you very much for your help,

    Best,

    Yatros

  • YatrosYatros Seattle, WA, USAMember

    Hello,

    I know that this solution is not elegant, nor solves the issue of adding the GenomicsDBImport command to a WDL/CROMWELL pipeline, but for now, this is what works best for me:

    #!/bin/bash
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8022745-8023035
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8025283-8025585
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8029304-8029564
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8030853-8031123
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8037611-8037898
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8044853-8045214
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11073684-11074123
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11076800-11077164
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11078689-11079031
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11080385-11080757
    java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11082080-11083405
    

    [...]

  • YatrosYatros Seattle, WA, USAMember

    Not even this works!!!!! It crashes after a while (252 intervals to be precise) with the following error:

    [December 19, 2017 3:26:26 PM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta.6
    15:26:26.445 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
    15:26:26.445 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:26:26.445 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
    15:26:26.445 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:26:26.445 INFO  GenomicsDBImport - Deflater: IntelDeflater
    15:26:26.445 INFO  GenomicsDBImport - Inflater: IntelInflater
    15:26:26.445 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    15:26:26.445 INFO  GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    15:26:26.445 INFO  GenomicsDBImport - Initializing engine
    15:26:27.196 INFO  IntervalArgumentCollection - Processing 282 bp from intervals
    15:26:27.198 INFO  GenomicsDBImport - Done initializing engine
    15:26:27.199 INFO  GenomicsDBImport - Vid Map JSON file will be written to SAmerican_panel/vidmap.json
    15:26:27.199 INFO  GenomicsDBImport - Callset Map JSON file will be written to SAmerican_panel/callset.json
    15:26:27.199 INFO  GenomicsDBImport - Importing to array - SAmerican_panel/genomicsdb_array
    15:26:27.209 INFO  GenomicsDBImport - Shutting down engine
    [December 19, 2017 3:26:27 PM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=2038431744
    Exception in thread "main" java.lang.ExceptionInInitializerError
            at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.onTraversalStart(GenomicsDBImport.java:390)
            at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:836)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
            at org.broadinstitute.hellbender.Main.main(Main.java:239)
    Caused by: com.intel.genomicsdb.GenomicsDBException: Could not load genomicsdb native library
            at com.intel.genomicsdb.GenomicsDBImporter.<clinit>(GenomicsDBImporter.java:71)
            ... 8 more
    15:26:34.554 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.6/gatk-package-4.beta.6-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [December 19, 2017 3:26:34 PM PST] GenomicsDBImport  --genomicsDBWorkspace SAmerican_panel --variant sample_001 --variant sample_002 --variant sample_003 --variant sample_004 --variant sample_005 --variant sample_006 --intervals 11:2192814-2193116  --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNameMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPrefetchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    
  • YatrosYatros Seattle, WA, USAMember

    Thank you very much @shlee ,

    That was really helpful,

    Best,

    Yatros

Sign In or Register to comment.