Does GATK 4 support multiple bam files as input?

In the command line help message, it says

--input,-I:String BAM/SAM/CRAM file containing reads This argument must be specified at least once.

However, when we actually pass multiple input files, it fails with

org.broadinstitute.hellbender.exceptions.UserException: Sorry, we only support a single reads input for spark tools for now.

On the other hand, if we pass the folder containing all the partial BAM files as the input, it works. Could you explain how this is supposed to work at present? We are using GATK 4 master branch, commit b82b5b6c5cbef8973b373edfb314cf42bea5eb1a, with Spark 2.0.2.

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @blaok, different tools have different requirements. Some tools allow multiple -I inputs, but some do not. Which tool are you trying to run and what is your command line?

  • blaok (Member)

    Hi Geraldine,

    Thanks for following up on my question. We are trying to run ReadsPipelineSpark and HaplotypeCallerSpark. Our command line looks like this:

    gatk-launch \
        ReadsPipelineSpark \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00000.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00001.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00002.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00003.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00004.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00005.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00006.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00007.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00008.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00009.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00010.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00011.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00012.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00013.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00014.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00015.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00016.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00017.bam \
        -I hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam/part-r-00018.bam \
        -R hdfs://ip-172-31-2-45:9000/genome/ref/human_g1k_v37.2bit \
        -O ~/get/ERR000097.after-bqsr.bam \
        --knownSites ~/get/dbsnp_138.b37.excluding_sites_after_129.vcf \
        --shardedOutput false \
        --emit_original_quals \
        --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES \
        -- \
        --sparkRunner SPARK \
        --driver-memory 60G \
        --executor-memory 60G \
        --executor-cores 16 \
        --num-executors 2 \
        --sparkMaster spark://ip-172-31-78-182:7077
    

    and we get an error like this:

     ***********************************************************************
    
    A USER ERROR has occurred: Sorry, we only support a single reads input for spark tools for now.
    
    ***********************************************************************
    org.broadinstitute.hellbender.exceptions.UserException: Sorry, we only support a single reads input for spark tools for now.
            at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:376)
            at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:361)
            at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:351)
            at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
            at org.broadinstitute.hellbender.Main.main(Main.java:230)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
            at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
            at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    

    HaplotypeCallerSpark fails in the same way.
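
As a rough sketch of the directory form mentioned in the original question, the script below passes the folder of part-r-*.bam shards as the single -I value instead of one -I per shard. The paths and flags are copied from the command above; that a Spark tool reads every shard in the folder is an assumption based only on the behavior reported in this thread. The script is a dry run: it prints the command rather than launching it.

```shell
# Folder holding the part-r-*.bam shards (path taken from the command above).
INPUT_DIR="hdfs://ip-172-31-2-45:9000/user/blaok/ERR000097.sorted.bam"

# Build the argument list with a single -I pointing at the folder.
set -- gatk-launch ReadsPipelineSpark \
    -I "$INPUT_DIR" \
    -R hdfs://ip-172-31-2-45:9000/genome/ref/human_g1k_v37.2bit \
    -O "$HOME/get/ERR000097.after-bqsr.bam" \
    --knownSites "$HOME/get/dbsnp_138.b37.excluding_sites_after_129.vcf" \
    -- \
    --sparkRunner SPARK \
    --sparkMaster spark://ip-172-31-78-182:7077

# Count the -I arguments; the Spark tools in this thread reject more than one.
n_inputs=0
for arg in "$@"; do
    if [ "$arg" = "-I" ]; then
        n_inputs=$((n_inputs + 1))
    fi
done
echo "reads inputs: $n_inputs"

# Dry run: print the command that would be launched.
echo "$@"
```

To launch the job for real, run the printed command on a machine where gatk-launch and the Spark cluster are available.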

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @blaok
    Hi,

    Let me confirm with the developers whether the Spark tools accept more than one input BAM.

    -Sheila

    Issue · GitHub, by Sheila: Issue Number 2208; State: closed; Closed By: chandrans
  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @blaok
    Hi again,

    I have confirmation that the Spark tools and pipelines (which are all still experimental at this point) are restricted to a single reads input, at least for now.

    -Sheila

  • ilasek

    Hi Sheila,

    We are using HaplotypeCallerSpark in GATK4 and we would be very much interested in processing multiple bam files, too. This functionality is crucial for our pipeline.

    Do you think processing of multiple bam files will be possible on HaplotypeCallerSpark anytime soon? Is it on the roadmap?

    What is the plan for moving HaplotypeCallerSpark from beta version to an official production version?

    Many thanks for your help, much appreciated!!!

    Ivo

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ilasek
    Hi Ivo,

    It seems there are plans for this to be done in the second quarter of this year. You can keep track of the issue here.

    -Sheila
