GenomicsDBImport "Failed to create reader" error with updated joint-discovery-gatk4-local.wdl pipeline

Yatros, Seattle, WA, USA (Member)

Hello,

I'm experiencing a problem similar to the one reported in this post: https://gatkforums.broadinstitute.org/wdl/discussion/12382/genomicsdbimport-failed-to-create-reader#latest

I am using the updated joint-discovery-gatk4-local.wdl pipeline. I'm trying to merge several samples with GenomicsDBImport + GenotypeGVCFs, but I always get the "A USER ERROR has occurred: Failed to create reader from file" error:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-179/execution/tmp.3FPwiI
23:42:45.044 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:42:45.383 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.384 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.6.0
23:42:45.384 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
23:42:45.385 INFO  GenomicsDBImport - Executing as [email protected] on Linux v4.4.0-133-generic amd64
23:42:45.386 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11
23:42:45.387 INFO  GenomicsDBImport - Start Date/Time: September 4, 2018 11:42:44 PM UTC
23:42:45.387 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.388 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.389 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
23:42:45.390 INFO  GenomicsDBImport - Picard Version: 2.18.7
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:42:45.391 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:42:45.391 INFO  GenomicsDBImport - Deflater: IntelDeflater
23:42:45.391 INFO  GenomicsDBImport - Inflater: IntelInflater
23:42:45.391 INFO  GenomicsDBImport - GCS max retries/reopens: 20
23:42:45.392 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
23:42:45.392 INFO  GenomicsDBImport - Initializing engine
23:42:45.467 INFO  GenomicsDBImport - Shutting down engine
[September 4, 2018 11:42:45 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=4116185088
***********************************************************************

A USER ERROR has occurred: Failed to create reader from file:///cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-179/inputs/mnt/ND27/genomicsDBimport/ND27.sample_map

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /gatk/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Xms4g -jar /gatk/gatk-package-4.0.6.0-local.jar GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L 9:35060697-35061276 --sample-name-map inputs.list --reader-threads 5 -ip 500

As far as I understand, the error originates in the following Python code:

python << CODE
    gvcfs = ['${sep="','" input_gvcfs}']
    sample_names = ['${sep="','" sample_names}']

    if len(gvcfs)!= len(sample_names):
      exit(1)

    with open("inputs.list", "w") as fi:
      for i in range(len(gvcfs)):
        fi.write(sample_names[i] + "\t" + gvcfs[i] + "\n") 

CODE

Instead of getting an inputs.list file with the sample names in the first column and their file paths in the second, I get a file with the following content:

/mnt/ND27/genomicsDBimport/ND27.samples /cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-1/inputs/mnt/ND27/genomicsDBimport/ND27.sample_map

The file contains the paths to the two original 'samples' and 'sample_map' files. Can somebody explain why Python behaves this way? What should I change to get an inputs.list file with the following format:

Sample Path_to_sample_location

If I create the inputs.list file manually and run the script that the pipeline generated, the genomicsdb folder is created correctly and I get the expected results.
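
A note on the behaviour described above: the `${sep="','" ...}` placeholders are filled in by WDL before Python runs, so the Python block only pairs up the strings it was handed and never opens any file. The following is a minimal sketch of that pairing logic outside WDL; the function name `render_inputs_list` and the two-sample example are made up for illustration and are not part of the pipeline.

```python
def render_inputs_list(sample_names, gvcfs):
    """Mimic the pipeline's Python block: one 'name<TAB>path' line per pair.

    The strings are written verbatim; nothing is read from disk.
    """
    if len(gvcfs) != len(sample_names):
        raise ValueError("sample_names and input_gvcfs differ in length")
    return "".join(f"{name}\t{path}\n" for name, path in zip(sample_names, gvcfs))


# One-element arrays that hold paths to list files: the output is a single
# line made of the two paths, which is the content seen in inputs.list.
wrong = render_inputs_list(
    ["/mnt/ND27/genomicsDBimport/ND27.samples"],
    ["/mnt/ND27/genomicsDBimport/ND27.sample_map"],
)

# Arrays that hold the actual values (hypothetical samples): one line each.
right = render_inputs_list(
    ["Sample_01", "Sample_02"],
    ["/mnt/ND27/GVCFs/Sample_01.gvcf.gz", "/mnt/ND27/GVCFs/Sample_02.gvcf.gz"],
)

print(wrong, end="")
print(right, end="")
```

In the real run the second path appears under /cromwell-executions/ because Cromwell copies or links file inputs into the call directory, but the pairing itself works the same way.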

Thank you very much,

Best Regards,

Yatros

Best Answer

  • Yatros, Seattle, WA, USA
    Accepted Answer

    Hello,

    In case someone is struggling with a similar issue, here is what I was doing wrong. I was supplying the sample names, the GVCF paths and the GVCF index paths as three separate list files, instead of putting the values themselves in the JSON file, which is what the pipeline uses to create the inputs.list file.

    My initial JSON file looked like this:

    [...]
    "JointGenotyping.sample_names": ["/mnt/ND27/genomicsDBimport/ND27.samples"],
    "JointGenotyping.input_gvcfs": ["/mnt/ND27/genomicsDBimport/ND27.sample_map"],
    "JointGenotyping.input_gvcfs_indices": ["/mnt/ND27/genomicsDBimport/ND27.sample_map_indices"],
    [...]
    

    This is completely wrong! The sample names and the paths need to be given directly in the JSON file, as follows:

    [...]
    "JointGenotyping.sample_names": ["Sample_01", "Sample_02", "Sample_03"],
    "JointGenotyping.input_gvcfs": ["/mnt/ND27/GVCFs/Sample_01.gvcf.gz", "/mnt/ND27/GVCFs/Sample_02.gvcf.gz", "/mnt/ND27/GVCFs/Sample_03.gvcf.gz"],
    "JointGenotyping.input_gvcfs_indices": ["/mnt/ND27/GVCFs/Sample_01.gvcf.gz.tbi", "/mnt/ND27/GVCFs/Sample_02.gvcf.gz.tbi", "/mnt/ND27/GVCFs/Sample_03.gvcf.gz.tbi"],
    [...]
    

    This generates the correct inputs.list file rather than a file that contains only path locations.

    This post can be closed by the administrators.

    Thanks,

    Yatros
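
For anyone who already has a sample map on disk, the three JSON arrays can be derived from it instead of being typed by hand. This is only a sketch under stated assumptions: the map is assumed to hold one tab-separated "sample name, GVCF path" pair per line, each index is assumed to sit next to its GVCF with a ".tbi" suffix (as in the example above), and the helper name `inputs_from_sample_map` is hypothetical.

```python
import json


def inputs_from_sample_map(sample_map_text, index_suffix=".tbi"):
    """Turn 'name<TAB>path' lines into the three JointGenotyping input arrays."""
    names, gvcfs = [], []
    for line in sample_map_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, path = line.split("\t")
        names.append(name)
        gvcfs.append(path)
    return {
        "JointGenotyping.sample_names": names,
        "JointGenotyping.input_gvcfs": gvcfs,
        # Assumption: the index is the GVCF path plus index_suffix.
        "JointGenotyping.input_gvcfs_indices": [p + index_suffix for p in gvcfs],
    }


# Inline stand-in for the contents of a sample map file.
sample_map = (
    "Sample_01\t/mnt/ND27/GVCFs/Sample_01.gvcf.gz\n"
    "Sample_02\t/mnt/ND27/GVCFs/Sample_02.gvcf.gz\n"
)

fragment = inputs_from_sample_map(sample_map)
print(json.dumps(fragment, indent=2))
```

The printed keys can be merged into the existing inputs JSON next to the other pipeline settings.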
