GenomicsDBimport failed to create reader error with updated joint-discovery-gatk4-local.wdl pipeline

Yatros, Seattle, WA, USA, Member ✭✭

Hello,

I'm experiencing a similar problem to the one reported in this post https://gatkforums.broadinstitute.org/wdl/discussion/12382/genomicsdbimport-failed-to-create-reader#latest

I am using the updated joint-discovery-gatk4-local.wdl pipeline. I'm trying to merge several samples with GenomicsDBImport + GenotypeGVCFs, but I always get the error "A USER ERROR has occurred: Failed to create reader from file".

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-179/execution/tmp.3FPwiI
23:42:45.044 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:42:45.383 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.384 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.6.0
23:42:45.384 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
23:42:45.385 INFO  GenomicsDBImport - Executing as [email protected] on Linux v4.4.0-133-generic amd64
23:42:45.386 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11
23:42:45.387 INFO  GenomicsDBImport - Start Date/Time: September 4, 2018 11:42:44 PM UTC
23:42:45.387 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.388 INFO  GenomicsDBImport - ------------------------------------------------------------
23:42:45.389 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
23:42:45.390 INFO  GenomicsDBImport - Picard Version: 2.18.7
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:42:45.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:42:45.391 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:42:45.391 INFO  GenomicsDBImport - Deflater: IntelDeflater
23:42:45.391 INFO  GenomicsDBImport - Inflater: IntelInflater
23:42:45.391 INFO  GenomicsDBImport - GCS max retries/reopens: 20
23:42:45.392 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
23:42:45.392 INFO  GenomicsDBImport - Initializing engine
23:42:45.467 INFO  GenomicsDBImport - Shutting down engine
[September 4, 2018 11:42:45 PM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=4116185088
***********************************************************************

A USER ERROR has occurred: Failed to create reader from file:///cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-179/inputs/mnt/ND27/genomicsDBimport/ND27.sample_map

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /gatk/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Xms4g -jar /gatk/gatk-package-4.0.6.0-local.jar GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L 9:35060697-35061276 --sample-name-map inputs.list --reader-threads 5 -ip 500

As far as I understand, the error originates in the following Python code:

python << CODE
    gvcfs = ['${sep="','" input_gvcfs}']
    sample_names = ['${sep="','" sample_names}']

    if len(gvcfs)!= len(sample_names):
      exit(1)

    with open("inputs.list", "w") as fi:
      for i in range(len(gvcfs)):
        fi.write(sample_names[i] + "\t" + gvcfs[i] + "\n") 

CODE

Instead of getting an inputs.list file with the sample names in the first column and the paths to their files in the second, I get a file with the following content:

/mnt/ND27/genomicsDBimport/ND27.samples /cromwell-executions/JointGenotyping/21af0807-e11c-435f-bc3c-befffcaefc9c/call-ImportGVCFs/shard-1/inputs/mnt/ND27/genomicsDBimport/ND27.sample_map

The file contains the paths to the two original 'samples' and 'sample_map' files. Can somebody explain to me why Python behaves this way? What should I change to get an inputs.list file in the following format:

Sample Path_to_sample_location

If I create the inputs.list file manually and run the script that was generated during the pipeline run, the genomicsdb folder is created correctly and I get the expected results.

Thank you very much,

Best Regards,

Yatros

Best Answer

  • Yatros, Seattle, WA, USA ✭✭
    Accepted Answer

    Hello,

    Just in case someone is struggling with a similar issue, here is what I was doing wrong. I was trying to pass the sample names, the gVCF paths and the gVCF index paths through three separate files, instead of listing those values directly in the JSON file so that the inputs.list file can be built from them.

    My initial JSON file looked like this:

    [...]
    "JointGenotyping.sample_names": ["/mnt/ND27/genomicsDBimport/ND27.samples"],
    "JointGenotyping.input_gvcfs": ["/mnt/ND27/genomicsDBimport/ND27.sample_map"],
    "JointGenotyping.input_gvcfs_indices": ["/mnt/ND27/genomicsDBimport/ND27.sample_map_indices"],
    [...]
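
To see why that JSON leads to the one-line inputs.list from the question, here is a minimal sketch that mimics the Python snippet in the task. The WDL `${sep=...}` interpolation is replaced by plain Python lists, and `build_inputs_list` is a hypothetical helper written for illustration, not part of the pipeline. With one-element arrays the loop runs exactly once, so the two file paths are written as if they were a sample name and a gVCF path. In the real run the second path also showed up under cromwell-executions, presumably because Cromwell localizes file inputs while the sample names are passed as plain strings.

```python
def build_inputs_list(sample_names, gvcfs):
    """Return the text that the task's Python snippet would write to inputs.list.

    Illustrative stand-in: the real task receives these lists through WDL
    interpolation rather than as function arguments.
    """
    if len(gvcfs) != len(sample_names):
        raise ValueError("sample_names and gvcfs must have the same length")
    # One tab-separated line per (sample name, gVCF path) pair.
    return "".join(name + "\t" + path + "\n"
                   for name, path in zip(sample_names, gvcfs))


# The arrays from the initial JSON: one element each, and both are file paths.
wrong = build_inputs_list(
    ["/mnt/ND27/genomicsDBimport/ND27.samples"],
    ["/mnt/ND27/genomicsDBimport/ND27.sample_map"],
)
print(wrong, end="")
```

GenomicsDBImport then tries to open ND27.sample_map as if it were a gVCF, which matches the "Failed to create reader" message in the log.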
    

    This is completely wrong! The sample names and the paths must instead be listed directly as arrays, like this:

    [...]
    "JointGenotyping.sample_names": ["Sample_01", "Sample_02", "Sample_03"],
    "JointGenotyping.input_gvcfs": ["/mnt/ND27/GVCFs/Sample_01.gvcf.gz", "/mnt/ND27/GVCFs/Sample_02.gvcf.gz", "/mnt/ND27/GVCFs/Sample_03.gvcf.gz"],
    "JointGenotyping.input_gvcfs_indices": ["/mnt/ND27/GVCFs/Sample_01.gvcf.gz.tbi", "/mnt/ND27/GVCFs/Sample_02.gvcf.gz.tbi", "/mnt/ND27/GVCFs/Sample_03.gvcf.gz.tbi"],
    [...]
    

    This generates the correct inputs.list file, rather than a file containing only path locations.
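
If you already have a sample map file and do not want to type the arrays by hand, something like the following sketch could produce the three JSON entries. It assumes the map is tab-separated with the sample name in the first column and the gVCF path in the second, and that each index sits next to its gVCF with a `.tbi` suffix; the contents of ND27.sample_map are not shown in this thread, so treat both as assumptions. `sample_map_to_inputs` is a hypothetical helper name.

```python
import json


def sample_map_to_inputs(sample_map_text, index_suffix=".tbi"):
    """Turn 'name<TAB>path' lines into the three JointGenotyping input arrays."""
    names, gvcfs = [], []
    for line in sample_map_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, path = line.split("\t")
        names.append(name)
        gvcfs.append(path)
    return {
        "JointGenotyping.sample_names": names,
        "JointGenotyping.input_gvcfs": gvcfs,
        # Assumption: the index path is the gVCF path plus ".tbi".
        "JointGenotyping.input_gvcfs_indices": [p + index_suffix for p in gvcfs],
    }


# Example map text using the sample names and paths from this post.
example_map = (
    "Sample_01\t/mnt/ND27/GVCFs/Sample_01.gvcf.gz\n"
    "Sample_02\t/mnt/ND27/GVCFs/Sample_02.gvcf.gz\n"
)
inputs = sample_map_to_inputs(example_map)
print(json.dumps(inputs, indent=2))
```

The printed entries can then be merged into the rest of the JSON inputs file for the workflow.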

    This post can be closed by the administrators.

    Thanks,

    Yatros
