Can the GATK Best Practices Pipeline on Google Cloud Platform be used on FASTQ inputs?

claudiadast Member
edited July 2018 in Ask the GATK team

I read the documentation on this pipeline (https://cloud.google.com/genomics/docs/tutorials/gatk) and saw that its input is unaligned BAMs. Is there a way to use the pipeline for input FASTQs?


  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭


    No, you need to create uBAMs first, then input them. The workflow cannot accept FASTQ files.


  • mikedamour Member
    edited August 2018

    I am running the GATK Best Practices pipeline on Google Cloud Platform (GCP). I have run the tutorials with some success.

    My 1.4 TB of tumor/normal .fastq files were delivered to me in a GCP bucket. As a further test, and to meet the uBAM requirement, I downloaded a pair of mated .fastq files (PE reads), converted them to one uBAM, uploaded it back to the cloud, and ran the GATK pipeline as described.

    It worked fine, with the exception of the Side Issue below.

    Main Issue: Could you please publish the command to invoke Picard FastqToSam on GCP so we can do the conversion in the cloud? I have little experience with GCP, but I feel certain the existing GATK docker containers must support this. Please demonstrate. I am not requesting any modification to the GATK Best Practices WDL, just the separate Picard tool invocation.

    (Side Issue: After running for 12 hours on just one uBAM file, the run failed because the call-ValidateCram step rejected BGISEQ as a platform.

    ERROR: Read name I103_CL100070399_L1, The platform (PL) attribute (BGISEQ) + was not one of the valid values for read group

    I now know how to lie to it next time and say Illumina. But we really need that easy fix to the ValidateSamFile bug first reported in December elsewhere on this forum).

    Thank you, Mike D'
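
For reference, the standalone Picard invocation asked about above might look like the following. This is a minimal sketch, not an official Broad command: it only prints the command (a dry run), `fastq_to_sam_cmd` is a hypothetical helper, the file names, sample name, and read group name are placeholders, and `picard.jar` is assumed to be in the working directory of whatever machine or container runs it. PLATFORM is set to ILLUMINA only because the post reports that BGISEQ was rejected by validation.

```shell
#!/bin/sh
# Print (do not run) a Picard FastqToSam command for one paired-end FASTQ set.
# Arguments: FASTQ 1, FASTQ 2, output uBAM, sample name, read group name.
# All values passed below are placeholders, not real data.
fastq_to_sam_cmd() {
  printf 'java -jar picard.jar FastqToSam FASTQ=%s FASTQ2=%s OUTPUT=%s SAMPLE_NAME=%s READ_GROUP_NAME=%s PLATFORM=%s\n' \
    "$1" "$2" "$3" "$4" "$5" "ILLUMINA"
}

fastq_to_sam_cmd sample_R1.fq.gz sample_R2.fq.gz sample.unmapped.bam SAMPLE1 RG1
```

Removing the `printf` wrapper and running the printed command directly would perform the conversion wherever Java and Picard are available.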

  • mikedamour Member
    edited September 2018

    Convert Fastqs to Unmapped BAM on GCP

    Further to the above, I see that the Broadies are ahead of me, as expected. I have located the paired-fastq-to-unmapped-bam.wdl script at https://github.com/gatk-workflows/seq-format-conversion. Thank you.

    Unlike the GATK Best Practices tutorial, there is no explicit demonstration command line (or I couldn't find it?) for the script, so I ran a very similar command on six files (three PE sets).

    Cloned https://github.com/gatk-workflows/seq-format-conversion.git to local.

    Modified paired-fastq-to-unmapped-bam.inputs.json to include the gs:// paths of the FASTQ files, the output uBAM list name, and the appropriate read group name entries.

    # Run command from directory - ~/Genomics/GATK/openwdl/wdl/runners/cromwell_on_google
    gcloud alpha genomics pipelines run \
      --pipeline-file wdl_runner/wdl_pipeline.yaml \
      --zones us-central1-f \
      --memory 5 \
      --logging "${GATK_OUTPUT_DIR}/logging" \
      --inputs-from-file WDL="${GATK_GOOGLE_DIR}/paired-fastq-to-unmapped-bam.wdl" \
      --inputs-from-file WORKFLOW_INPUTS="${GATK_GOOGLE_DIR}/paired-fastq-to-unmapped-bam.inputs.json" \
      --inputs-from-file WORKFLOW_OPTIONS="${GATK_GOOGLE_DIR}/generic.google-papi.options.json" \
      --inputs WORKSPACE="${GATK_OUTPUT_DIR}/workspace" \
      --inputs OUTPUTS="${GATK_OUTPUT_DIR}/outputs"

    Each of the 3 PE sets was 2 x ~9GB files, each set resulting in 1 x ~25GB .unmapped.bam file (resulting total of 3 x ~25GB files).

    Conversion ran on GCP in 1 hour 40 minutes. It appears that the script causes only two sets (shards) to be converted at a time, with the third set completing well after the first two. A maximum of 5 vCPU instances were used, with a maximum instance/cpu/utilization of about 2.0 and an average well below that. vCPU quotas are set at 10,000, so vCPU usage is throttled by the script/program, not by quotas. (Log files available, if needed.)

    (Graph of instance/cpu/utilization not reproduced here. The metric stays below 2.1 throughout the run, which lasted from 1:22 AM to 3:02 AM; disregard the earlier portion of the graph.)

    I have 48 normal .fastqs (24 PE sets) and 64 tumor .fastqs (32 PE sets) giving 90x coverage for the two samples. The above data predict that the tumor .fastqs may take ~27 wall-clock hours to convert to .ubam unless manually parallelized, if possible. (Of course, normal .fastqs will be run separately in parallel and may take ~20 hours.)
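
The post does not say how the ~27 and ~20 hour figures were derived. One reading that reproduces them, offered purely as an assumption, is that each PE set takes about 50 minutes (treating the 1:40 test run as two back-to-back rounds) and that sets are converted one after another. `estimate_hours` below is a hypothetical helper for that arithmetic.

```shell
#!/bin/sh
# Estimate serial wall-clock hours for converting N paired-end sets.
# Arguments: number of PE sets, assumed minutes per set.
# Result is rounded to the nearest whole hour using integer arithmetic.
estimate_hours() {
  echo $(( ($1 * $2 + 30) / 60 ))
}

# Assumption: ~50 minutes per PE set, sets converted one at a time.
echo "tumor (32 PE sets):  ~$(estimate_hours 32 50) hours"
echo "normal (24 PE sets): ~$(estimate_hours 24 50) hours"
```

If two sets really do run concurrently, as the test run suggested, the wall-clock time would be roughly half of these figures.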

    I will follow with results.

    Best, Mike D'

  • I should mention that what I referred to as .fastq files above are .fq.gz files.

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    Hi Mike,

    It looks like you figured everything out except for the Read group issue. Have a look at this thread.


  • Further to "Convert Fastqs to Unmapped BAM on GCP" above, I was able to use https://github.com/gatk-workflows/seq-format-conversion to create the .unmapped.bam files for the large datasets (48 normal .fq.gz and 64 tumor .fq.gz) in only 2 hours. I divided the files into batches of 16 .fq.gz files (8 PE sets) each, giving seven ${GATK_GOOGLE_DIR}/paired-fastq-to-unmapped-bam.inputs.json files. Seven separate paired-fastq-to-unmapped-bam.wdl runs completed the conversion, using many CPU instances.

    Thanks, Mike D'
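
The seven-run approach described above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the commands actually used: it reuses the gcloud invocation from earlier in the thread, prints one submission per batch instead of executing it, and the numbered inputs file names (`paired-fastq-to-unmapped-bam.inputs.N.json`) and per-batch output directories (`batchN`) are hypothetical choices made to keep the runs from overwriting each other.

```shell
#!/bin/sh
# Dry run: print one gcloud submission line per inputs file.
# Argument: number of batches (seven in the post).
# ${GATK_GOOGLE_DIR} and ${GATK_OUTPUT_DIR} are left unexpanded in the output
# so the printed commands can be reviewed before being run.
print_submissions() {
  i=1
  while [ "$i" -le "$1" ]; do
    out="\${GATK_OUTPUT_DIR}/batch${i}"
    inputs="\${GATK_GOOGLE_DIR}/paired-fastq-to-unmapped-bam.inputs.${i}.json"
    echo "gcloud alpha genomics pipelines run" \
      "--pipeline-file wdl_runner/wdl_pipeline.yaml" \
      "--zones us-central1-f --memory 5" \
      "--logging ${out}/logging" \
      "--inputs-from-file WDL=\${GATK_GOOGLE_DIR}/paired-fastq-to-unmapped-bam.wdl" \
      "--inputs-from-file WORKFLOW_INPUTS=${inputs}" \
      "--inputs-from-file WORKFLOW_OPTIONS=\${GATK_GOOGLE_DIR}/generic.google-papi.options.json" \
      "--inputs WORKSPACE=${out}/workspace" \
      "--inputs OUTPUTS=${out}/outputs"
    i=$((i + 1))
  done
}

print_submissions 7
```

Each printed line is a complete command; piping the output to `sh` from the cromwell_on_google directory, with both variables exported, would submit the runs.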
