Error running the GATK WDL best practices pipeline

orodeh Mountain View, CA, Member

Hello,
I am running the GATK WDL best practices pipeline with the sample data from the Google Cloud (https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl). The pipeline hits an error close to the end, at the HaplotypeCaller scatter.

The failing command is being run like this:
INFO 01:07:19,794 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:07:19,797 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 01:07:19,797 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 01:07:19,797 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 01:07:19,800 HelpFormatter - Program Args: -T HaplotypeCaller -R /home/dnanexus/execution/inputs/Homo_sapiens_assembly38.fasta -o NA12878.vcf.gz -I /home/dnanexus/execution/inputs/NA12878.bam -L /home/dnanexus/execution/inputs/scattered.interval_list -ERC GVCF --max_alternate_alleles 3 -variant_index_parameter 128000 -variant_index_type LINEAR -contamination 0 --read_filter OverclippedRead

The error is:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: Updating last read seen in the traversal with read H06HDADXX130110:2:1101:10101:44450 2/2 250b aligned read. with span chr1:59003339-59003543 but this occurs before the previously seen read chr1:117920599-117920848
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.rememberLastReadLocation(TraverseActiveRegions.java:426)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:354)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

Runs with other interval lists succeed, but this one fails. The interval list is attached. Any ideas would be appreciated.

Thanks,
Ohad.

Answers

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie

    Hi @orodeh, sorry for the delay. Further to @shlee's observations, can you clarify whether you're running with a custom interval list? That seems to be the case based on your above statement:

    Runs with other interval lists succeed, but this one fails

    but we need to be sure because that makes a difference for troubleshooting.

  • orodeh Mountain View, CA, Member

    @shlee, Sorry for the long delay. I thought I was going to get an e-mail from the forum, but never did.

    I am running the analysis in the cloud. Together with Mike Lin, I am developing a compiler that translates the WDL workflow into a DNAnexus workflow. The inputs are the samples provided by your team (https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.inputs.json).

    The java version, etc., are all as specified by the WDL pipeline
    (https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl).

    To answer @Geraldine_VdAuwera's question, the other interval lists are the other instances of the scatter loop from
    https://github.com/broadinstitute/wdl/blame/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl#L735.

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie
    So which interval is the "this one" that fails, then?
  • orodeh Mountain View, CA, Member

    I believe it is the second interval file (temp_0002_of_50/scattered.interval_list), and the interval itself is:
    chr1 59000000 117999999

    I now think that the bam file is, indeed, not properly sorted. I will know more tomorrow.

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie

    Oh, I thought you meant you were running on custom intervals. The pipeline should be deterministic and ran fine the last time we tested it, but I'll kick off another run now to make sure. If that verifies, the only explanation I see is that something is getting lost in translation to your workflow format.

  • orodeh Mountain View, CA, Member

    This turned out to be a translation problem.

    There were supposed to be 19 distinct BAM files, but when they were downloaded to a cloud instance, they all received the same filesystem path. This caused the GatherBamFiles step (java -jar /usr/gitc/picard.jar GatherBamFiles) to merge 19 identical files, instead of non-overlapping, coordinate-sorted BAMs.
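    For readers who hit the same traversal exception: the invariant GATK enforces can be sketched in a few lines of plain Python. The read tuples below are made up for illustration; this is not the GATK or htsjdk API. Gathering identical shards instead of disjoint, coordinate-sorted ones produces exactly the backwards position jump the exception reports.

```python
# Sketch of the coordinate-order invariant GATK enforces during traversal.
# Reads are modeled as (contig, start) tuples; hypothetical, not the real API.
def check_coordinate_sorted(reads):
    """Raise if any read starts before the previously seen read on a contig."""
    last = None
    for contig, start in reads:
        if last is not None and contig == last[0] and start < last[1]:
            raise ValueError(
                "read at %s:%d occurs before previously seen read %s:%d"
                % (contig, start, last[0], last[1]))
        last = (contig, start)

# Disjoint, sorted shards concatenate cleanly:
check_coordinate_sorted([("chr1", 100), ("chr1", 200), ("chr2", 50)])

# Concatenating identical shards makes positions jump backwards, as in the error:
try:
    check_coordinate_sorted([("chr1", 117920599), ("chr1", 59003339)])
except ValueError as e:
    print("caught:", e)
```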

  • shlee Cambridge, Member, Broadie, Moderator

    Ouch @orodeh. Glad you figured that out.

  • orodeh Mountain View, CA, Member

    I am now trying to run the pipeline "as is". I am using Ubuntu 16.04, with Docker, and all the files downloaded from Google Cloud (inputs file modified accordingly). I am running Cromwell like this:

    java -jar cromwell-25.jar run PublicPairedSingleSampleWf_160927.wdl PublicPairedSingleSampleWf_160927.inputs.json

    I am hitting this error:

    /cromwell-executions/PairedEndSingleSampleWorkflow/7e304bfc-4d52-4102-a909-637ed5adc2b3/call-SamToFastqAndBwaMem/shard-2/execution/script: line 20: genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.unmerged.bam: No such file or directory
    tee: genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log: No such file or directory
    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell-executions/PairedEndSingleSampleWorkflow/7e304bfc-4d52-4102-a909-637ed5adc2b3/call-SamToFastqAndBwaMem/shard-2/execution/tmp.CJzO2O
    [Fri Mar 10 22:46:05 UTC 2017] picard.sam.SamToFastq INPUT=/cromwell-executions/PairedEndSingleSampleWorkflow/7e304bfc-4d52-4102-a909-637ed5adc2b3/call-SamToFastqAndBwaMem/shard-2/inputs/home/orodeh/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.bam FASTQ=/dev/stdout INTERLEAVE=true INCLUDE_NON_PF_READS=true OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Fri Mar 10 22:46:05 UTC 2017] Executing as [email protected] on Linux 4.4.0-66-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-1~bpo8+1-b14; Picard version: 2.5.0-gradle-SNAPSHOT
    [M::bwa_idx_load_from_disk] read 3171 ALT contigs
    [W::main_mem] when '-p' is in use, the second query file is ignored.
    [Fri Mar 10 22:46:07 UTC 2017] picard.sam.SamToFastq done. Elapsed time: 0.03 minutes.
    Runtime.totalMemory()=1931476992
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMException: Error in writing fastq file /dev/stdout
    at htsjdk.samtools.fastq.BasicFastqWriter.write(BasicFastqWriter.java:68)
    at picard.sam.SamToFastq.writeRecord(SamToFastq.java:350)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:195)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    I tried working around this in several ways, without success. It looks like the problem is that Picard can't write to /dev/stdout. Since this is the Broad docker image being used, I cannot influence the software.

    Any ideas?
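    For context on this symptom: Picard's "Error in writing fastq file /dev/stdout" in a pipeline typically means the downstream consumer of the pipe died (here, bwa and tee failing because their relative output directory does not exist), so the writer sees a broken pipe and is blamed for a fault that occurred downstream. A minimal reproduction in plain Python (the toy commands are made up, not the pipeline's):

```python
import subprocess
import sys

# When the downstream end of a pipeline exits early, the upstream writer
# gets EPIPE -- the same failure Picard reports for /dev/stdout.
producer_code = r"""
import os, sys
try:
    for _ in range(100000):
        os.write(1, b"x" * 80)   # keep writing into the pipe
except BrokenPipeError:
    sys.exit(141)                # 128 + SIGPIPE, the conventional status
"""

producer = subprocess.Popen([sys.executable, "-c", producer_code],
                            stdout=subprocess.PIPE)
# Consumer dies immediately, e.g. because its output path does not exist.
consumer = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"],
                            stdin=producer.stdout)
producer.stdout.close()   # leave the consumer as the only reader
consumer.wait()
producer.wait()
print("producer exit:", producer.returncode)  # non-zero: the writer failed,
                                              # but the real fault is downstream
```

So the fix is usually to look at what the pipe feeds into, not at Picard itself.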

  • EADG Kiel, Member
    edited March 2017

    Hi @orodeh,

    Could it be that WDL fails to locate your input files? You can write a little test.wdl like this:

    # Test Picard inputs
    task TestPicardInputs {
      File input_bam
      String output_basename

      command {
        java -Xmx3000m -jar /usr/gitc/picard.jar \
          SamToFastq \
          INPUT=${input_bam} \
          FASTQ=${output_basename}.fastq \
          INTERLEAVE=true \
          NON_PF=true
      }
      runtime {
        docker: "broadinstitute/genomes-in-the-cloud:2.2.4-1469632282"
      }
      output {
        File output_fastq = "${output_basename}.fastq"
      }
    }

    workflow PairedEndSingleSampleWorkflow {
      Array[File] flowcell_unmapped_bams
      String unmapped_bam_suffix

      scatter (unmapped_bam in flowcell_unmapped_bams) {

        String sub_strip_path = "gs://.*/" # Manipulate to match your filesystem
        String sub_strip_unmapped = unmapped_bam_suffix + "$"

        call TestPicardInputs {
          input:
            input_bam = unmapped_bam,
            output_basename = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "") + ".unmerged"
        }
      }
      output {
      }
    }


    This test.wdl should work with your inputs.json. You can then check cromwell-executions/.../call-TestPicardInputs/ to see whether the fastq is created, or look at the stderr. (Hopefully there are no typos; I wrote this WDL from memory ;)

    You can also modify the Dockerfile from Broad, add your own software, and build a customized Docker image.

    Greetings, EADG

  • shlee Cambridge, Member, Broadie, Moderator

    Hi @orodeh ,

    The following indicates that the workflow cannot locate the data:

    genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.unmerged.bam: No such file or directory
    tee: genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log: No such file or directory

    You should be able to use the inputs JSON we provide at https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.inputs.json. The data are publicly available. Then you can narrow down whether your inputs file paths are causing this error.

  • orodeh Mountain View, CA, Member

    The issue is that the inputs.json file does not have files with that name. It does have a file named:
    data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.bam
    but the "unmerged" one is not there:
    data/dna/wgs/hiseq2500/NA12878/H06JUADXX130110.1.ATCACGAT.20k_reads.unmerged.bam

    The unmerged file, I believe, is generated by the SamToFastqAndBwaMem task, in this scatter call:
    https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl#L573

  • orodeh Mountain View, CA, Member

    At the bottom of the stack, after Cromwell calls docker, which executes a shell script, the following shell command fails:

    java -Xmx3000m -jar /usr/gitc/picard.jar \
    SamToFastq \
    INPUT=/cromwell-executions/PairedEndSingleSampleWorkflow/610155cd-e94f-4db2-bd9e-9b2322fc412f/call-SamToFastqAndBwaMem/shard-0/inputs/home/orodeh/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.bam \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true | \
    /usr/gitc/bwa mem -K 100000000 -p -v 3 -t 16 $bash_ref_fasta /dev/stdin - 2> >(tee genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log >&2) | \
    samtools view -1 - > genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bam && \
    grep -m1 "read .* ALT contigs" genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log | \
    grep -v "read 0 ALT contigs"

    The input file exists, but something goes wrong, and the outputs are not generated. This caused the rest of the pipeline to fail. Perhaps that provides more debugging information.

    Issue · GitHub, filed by shlee: #1815, state closed, closed by vdauwera
  • shlee Cambridge, Member, Broadie, Moderator

    Hi @orodeh,

    The part of the WDL workflow that adds an "unmerged" label is:

    # Map reads to reference
    call SamToFastqAndBwaMem {
      input:
        input_bam = unmapped_bam,
        bwa_commandline = bwa_commandline,
        output_bam_basename = sub(sub(unmapped_bam, sub_strip_path, ""), sub_strip_unmapped, "") + ".unmerged",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        ref_alt = ref_alt,
        ref_bwt = ref_bwt,
        ref_amb = ref_amb,
        ref_ann = ref_ann,
        ref_pac = ref_pac,
        ref_sa = ref_sa,
        disk_size = flowcell_medium_disk,
        preemptible_tries = preemptible_tries
    }

    I see that you are using Cromwell v25, which is the latest release from last week. However, we wrote these scripts using a previous version of Cromwell. Let me check with our developers if the new version could be the cause of the error.

  • Ruchi Member, Broadie, Moderator, Dev

    Hey @orodeh, I'm going to try and recreate the same error on the latest 25 Cromwell release. Can you send a link to the exact inputs/wdl you used?

  • Ruchi Member, Broadie, Moderator, Dev

    Also @orodeh would you mind sharing the stderr file for the failed task?

  • orodeh Mountain View, CA, Member

    Right. I am using the original JSON file, except that I downloaded all the files so they are local. Please see the attached inputs.json file.

  • orodeh Mountain View, CA, Member

    @Ruchi, I already shut the cloud instance down, as it is expensive. However, the failing command is pasted in an earlier post:

    java -Xmx3000m -jar /usr/gitc/picard.jar \
    SamToFastq \
    INPUT=/cromwell-executions/PairedEndSingleSampleWorkflow/610155cd-e94f-4db2-bd9e-9b2322fc412f/call-SamToFastqAndBwaMem/shard-0/inputs/home/orodeh/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.bam \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true | \
    /usr/gitc/bwa mem -K 100000000 -p -v 3 -t 16 $bash_ref_fasta /dev/stdin - 2> >(tee genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log >&2) | \
    samtools view -1 - > genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bam && \
    grep -m1 "read .* ALT contigs" genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/H06HDADXX130110.1.ATCACGAT.20k_reads.unmerged.bwa.stderr.log | \
    grep -v "read 0 ALT contigs"

    I hope that is sufficient for now.

  • orodeh Mountain View, CA, Member

    @Geraldine_VdAuwera, the dxWDL compiler now runs the entire GATK pipeline. I am now trying to validate the results by comparing to Cromwell.

  • orodeh Mountain View, CA, Member

    @Geraldine_VdAuwera, I forgot to mention that I made one change to the pipeline: I rewrote a forward reference between calls. The compiler I have can't deal with that at the moment, and I would rather avoid the complexity if possible.

    In the original workflow, the GatherBqsrReports call is made in line 685, but is referenced in line 674.

    Call is made: https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl#L685

    Call is referenced: https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.wdl#L674
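    Resolving such forward references generally comes down to ordering calls by their data dependencies rather than by their textual position. A minimal sketch of that idea using Python's standard-library topological sorter (the dependency graph below is hypothetical, and this is not how dxWDL is actually implemented):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency graph: each call maps to the set of calls whose
# outputs it consumes, regardless of where they appear in the source file.
deps = {
    "ApplyBQSR": {"GatherBqsrReports"},      # textually references a later call
    "GatherBqsrReports": {"BaseRecalibrator"},
    "BaseRecalibrator": set(),
}

# static_order() yields each call only after everything it depends on.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

    With an ordering like this in hand, a compiler can emit the calls dependency-first and the forward reference disappears.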

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie

    Ah, good point @EADG. We should add that in the docs -- this WDL assumes you're running from GCS inputs and needs to be adapted for other execution environments.

  • orodeh Mountain View, CA, Member

    @EADG, thank you, that was the problem. After replacing all the cases where sub(sub(...)) was used in the original WDL file with a new variable, Cromwell started working.

    The workflow has progressed almost to the end and is now failing in the HaplotypeCaller scatter phase. I believe one of the Java processes runs out of memory. The machine has 120 GB of RAM and 16 CPUs, all of which are used. @Geraldine_VdAuwera, is there a way to limit the number of concurrently executing tasks?
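    For anyone else hitting the same failure: the reason the original sub(sub(...)) misbehaves on local files can be seen directly. The pattern "gs://.*/" strips the directory part of a GCS URI, but it matches nothing in a local path, so the directory components survive into the output basename and the task tries to write into directories that don't exist. Illustrated with Python's re.sub, which behaves like WDL's sub() for this pattern (the paths below are made up):

```python
import re

def strip_path(path):
    # Mirrors the WDL expression: sub(unmapped_bam, "gs://.*/", "")
    return re.sub(r"gs://.*/", "", path)

# On a GCS URI the greedy match removes everything up to the last slash:
print(strip_path("gs://bucket/test-data/sample.bam"))   # sample.bam

# On a local path the pattern does not match, so nothing is stripped and the
# relative directories end up in the task's output file name:
print(strip_path("/home/user/test-data/sample.bam"))    # path left unchanged
```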

  • orodeh Mountain View, CA, Member

    @Geraldine_VdAuwera As of last week, the dxWDL compiler generates a DNAnexus workflow that produces results semantically equivalent to Cromwell's. The remaining differences in the result files stem from slight nondeterminism in the tools; for example, the current directory is recorded in some outputs, and it differs between target machines.

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie
    Ah, good to know, thanks!