Error running "gatk/PreProcessingForVariantDiscovery_GATK4" on FireCloud

bigbadbo (Member, Broadie)

Dear GATK4 team,

This is Bo from the Broad Institute. I am running "gatk/PreProcessingForVariantDiscovery_GATK4" on my data in FireCloud and got the following error message:

Workflow failed
causedBy:
message: Unable to complete JES Api Request
causedBy:
message: Pipeline 11178155052716254973: Unable to evaluate parameters: parameter "PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.known_indels_sites_VCFs-0" has invalid value: ["gs://broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.vcf.gz"]

Could you let us know what happened?

Thanks,
Bo

Answers

  • Sheila, Broad Institute (Member, Broadie, admin)

    @bigbadbo
    Hi Bo,

    I just moved your question to the FireCloud forum where @KateN can help you.

    -Sheila

  • KateN, Cambridge, MA (Member, Broadie, Moderator, admin)

    Hi Bo,

    Could you share your workspace in FireCloud with [email protected]? I'd like to take a look, as it appears we have a couple of versions of that pipeline. It could be that you simply have an out-of-date one, or there could be something else wrong with the way that particular variable was set up.

  • bigbadbo (Member, Broadie)

    Hi @KateN , Sure thing. We just shared the workspace. It is called regev-ludwig/Ribo-seq

  • KateN, Cambridge, MA (Member, Broadie, Moderator, admin)

    Looking through the workspace, you do have that particular variable declared correctly. The only difference between yours and the published method is that you have the root entity type set to participant. Try switching it to sample and re-running the workflow.

    If that doesn't work, my next instinct is that you should upgrade to the latest version of the method. The one you are referencing uses GATK4 beta, and we have since published methods for GATK4 after it was launched.

    Since it appears you are using the b37 resources, I would recommend importing this method's configuration. It is one of our featured methods, and it is a newer version of the one you were using. Try running your workflow with that (setting the root entity type to sample as well), and let me know if it works.

  • bigbadbo (Member, Broadie)

    Hi @KateN,

    Thanks!

    Could you let us know if gatk/mutect2-gatk4 is up-to-date?

    Thanks,
    Bo

  • bigbadbo (Member, Broadie)

    Hi @KateN ,

    I have rerun gatk/PreProcessingForVariantDiscovery_GATK4_MC with the sample data model. It failed again.

    Could you check whether the workspace attributes under 'regev-ludwig/Ribo-seq' are correct?

    Thanks,
    Bo

  • KateN, Cambridge, MA (Member, Broadie, Moderator, admin)

    Ah, I think I see the issue now. The workflow is having trouble with the input for known_indels_sites_VCFs. In your Method Config, it cites the workspace attribute workspace.known_indels_array. In the workspace attributes section, you can click "edit" and it will show a dropdown indicating that the attribute was determined to be a String type rather than a List of Strings. If you change it to List of Strings, it should work.

    To answer your other question, gatk/mutect2-gatk4 is up to date. The best way to find our most up-to-date methods is by looking at the Featured Methods section of the Methods Repository, here.
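
The String versus List of Strings distinction above can be illustrated with a minimal Python sketch. This is not FireCloud's own code: `as_string_list` is a hypothetical helper, and the bucket path in the usage note is a placeholder. It only shows why a single string that happens to contain `["..."]` is not the same thing as a list holding one string.

```python
import json


def as_string_list(value):
    """Coerce an attribute value into a list of strings (hypothetical helper)."""
    if isinstance(value, list):
        # Already a real list: just make sure every element is a string.
        return [str(item) for item in value]
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        # A single string that merely looks like a list: decode it as a JSON array.
        return [str(item) for item in json.loads(text)]
    # A plain string: wrap it in a one-element list.
    return [text]
```

For example, `as_string_list('["gs://bucket/a.vcf.gz"]')` and `as_string_list("gs://bucket/a.vcf.gz")` both return a one-element list, whereas passing the bracketed text along unchanged as a String would hand the workflow a value it cannot use as an array of files.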

  • bigbadbo (Member, Broadie)

    @KateN, I have changed the type from "String" to "List of strings". Unfortunately, it still did not work. Now I am trying gatk/pre-processing-b37-gatk4 and hopefully it will work.

    By the way, it seems that gatk/pre-processing-b37-gatk4 produces CRAM format files but gatk/mutect2-gatk4 requires BAM files. Any comments?

    Thanks.

  • KateN, Cambridge, MA (Member, Broadie, Moderator, admin)

    After speaking with my colleagues, I have a couple of options for you. First off, Mutect2 does require a BAM input, as it cannot take a CRAM right now. To get your BAM input, either take the intermediate BAM that the pre-processing pipeline produces, or convert the CRAM to BAM using this method.
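
For readers who want to do the CRAM-to-BAM conversion outside FireCloud, here is a minimal sketch that builds a `samtools view` command in Python. It is not necessarily what the linked method does internally; the file names are placeholders, and the reference FASTA must be the one the CRAM was written against.

```python
import shlex


def cram_to_bam_command(cram_path, reference_fasta, bam_path):
    """Build a samtools command line that decodes a CRAM into a BAM."""
    return [
        "samtools", "view",
        "-b",                   # write BAM output
        "-T", reference_fasta,  # reference FASTA needed to decode the CRAM
        "-o", bam_path,         # output file
        cram_path,              # input CRAM
    ]


# Placeholder file names, for illustration only.
cmd = cram_to_bam_command("sample.cram", "Homo_sapiens_assembly19.fasta", "sample.bam")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed.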

  • bigbadbo (Member, Broadie)

    @KateN, it seems that once I removed the ["..."] wrapper from the attribute value, the pipeline ran to completion. Thanks!

    By the way, I used 'vdauwera/BamToUnmappedRGBams' to extract unmapped BAM files. Could you confirm if this WDL is up-to-date?

    Thanks,
    Bo

  • bigbadbo (Member, Broadie)

    @KateN , I have some questions related to setting gatk/mutect2-gatk4's parameters:

    1) For Mutect2's 'intervals' parameter, could I set it to workspace.scattered_calling_intervals_list?

    2) I do not want to apply the contamination filter. In this case, I should leave 'variants_for_contamination' as blank. Am I correct?

    3) Where could I get the 'gnomad' file for b37? This is used for Mutect2's 'gnomad' option.

    Thanks.
    Bo

  • bshifaw (Member, Broadie, Moderator, admin)
    edited June 2018

    Hi @bigbadbo

    1) For Mutect2's 'intervals' parameter, could I set it to workspace.scattered_calling_intervals_list?
    There is a featured workspace for the Mutect2 WDL that demonstrates which resources are available and where they are located in Google Cloud. Check the workspace attributes for the resource files used in the example. SNV workspace

    2) I do not want to apply the contamination filter. In this case, I should leave 'variants_for_contamination' as blank. Am I correct?
    The WDL contains a conditional statement that runs the CalculateContamination task only if variants_for_contamination is defined. If it is not defined, the WDL performs basic filtering without the contamination table.

    3) Where could I get the 'gnomad' file for b37? This is used for Mutect2's 'gnomad' option.
    See answer in question 1.

    Also, see the following method for a more recent WDL that converts BAM to unmapped BAM.
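
The conditional described in answer 2 can be sketched in Python as follows. This is an illustration of the control flow only, not the WDL itself: `plan_filtering_steps` and the two filter labels are hypothetical names; only `variants_for_contamination` and `CalculateContamination` come from the workflow.

```python
from typing import List, Optional


def plan_filtering_steps(variants_for_contamination: Optional[str]) -> List[str]:
    """Return the steps that would run, given an optional contamination resource."""
    steps = []
    if variants_for_contamination is not None:
        # Input is defined: estimate contamination, then filter using that table.
        steps.append("CalculateContamination")
        steps.append("filter_with_contamination_table")
    else:
        # Input left undefined: basic filtering only.
        steps.append("filter_without_contamination_table")
    return steps
```

Calling `plan_filtering_steps(None)` yields only the basic filtering step, which mirrors leaving the input blank in the method configuration.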

  • bigbadbo (Member, Broadie)

    Hi @bshifaw ,

    Thanks a lot for your help!

    I have looked at the SNV workspace, but I'm not sure if 'gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list' is the interval list I should use. First, I do not have permission to view this file. Second, my data are whole genome sequencing data rather than whole exome sequencing data.

    Any suggestions?

    Thanks,
    Bo

  • bshifaw (Member, Broadie, Moderator, admin)
    edited June 2018

    @bigbadbo ,

    After consulting with the developer of the workflow, you have two options.
    1) Run WGS sequences by leaving the intervals variable blank.
    2) Use the b37 WGS intervals as Tiffany suggested, but note that these intervals were "determined empirically using HaplotypeCaller a few years ago. It's possible that they include some regions that were pathological with 76-bp reads but are no longer problematic. The total size of all excluded regions is very small, so it's only a minor worry." Using the b37 intervals will save you some runtime but lower sensitivity by 0.1%.

  • bigbadbo (Member, Broadie)

    Hi @bshifaw , if I leave the intervals blank, will the pipeline automatically calculate intervals for me? Or will it skip the scatter?

    Thanks,
    Bo

  • bshifaw (Member, Broadie, Moderator, admin)

    Intervals will be generated by the SplitIntervals task whether or not the intervals variable is set.

  • bigbadbo (Member, Broadie)

    @bshifaw , thanks a lot for your help!

    I have one more question about the gnomad file listed in your SNV workspace.

    ‎gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf‎

    Is the above VCF file only for exome data, or can we also use it for our genome data?

    Thanks!
    Bo

  • bshifaw (Member, Broadie, Moderator, admin)

    Yes, that's fine to run with genome data. For more background on gnomAD (Genome Aggregation Database), please visit the gnomAD browser.

  • bigbadbo (Member, Broadie)

    Thanks a lot!

  • bigbadbo (Member, Broadie)
    edited July 2018

    @bshifaw @Tiffany_at_Broad @KateN,

    Thanks a lot for your help. I have almost finished this analysis.

    When I ran Mutect2 on the normal-tumor pair, I encountered this error for call #50 of Mutect2.M2:

    "htsjdk.samtools.FileTruncatedException: Premature end of file: /f4653fc5-f4a9-4f7e-ab53-1725b639043f/PairedEndSingleSampleWorkflow/e25a51f7-139d-40b7-8a5a-e8b5a2417ef1/call-GatherBamFiles/example_tumor.bam"

    However, example_tumor.bam was generated by gatk/pre-processing-b37-gatk4, which finished successfully. So I do not think this BAM file is truncated.

    To make sure it was not a random GCP problem, I reran Mutect2. This time I got a similar but different error message at call #50 of Mutect2.M2:

    "htsjdk.samtools.FileTruncatedException: Premature end of file: /27d5041a-60b4-4be2-b8e2-185547225d35/PairedEndSingleSampleWorkflow/08661c64-3b4b-44d1-a10d-8cd5e6dabf0a/call-GatherBamFiles/example_fibroblast.bam"

    Again, I do not think "example_fibroblast.bam" is truncated.

    Any suggestions?

    Thanks,
    Bo
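
One way to test the "not truncated" assumption on a local copy of a BAM is to check for the 28-byte BGZF end-of-file block that a complete BAM ends with. The sketch below is a minimal check under that assumption; `has_bgzf_eof_marker` is a hypothetical helper name. A missing marker indicates truncation, while a present marker does not by itself prove the rest of the file is intact.

```python
import os

# The empty BGZF block that terminates a complete BAM file (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof_marker(path):
    """Return True if the file at `path` ends with the BGZF EOF block."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() < len(BGZF_EOF):
            return False  # too short to contain the marker
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

Running this on the copy of the file that the failing task actually read, rather than on the original output in the bucket, would help tell a truncated source file apart from a problem while the file was being copied to the task.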

  • bshifaw (Member, Broadie, Moderator, admin)

    @bigbadbo , in the long run it is neater to keep different questions in separate forum posts. Please post your latest question in a new forum post.

    Thanks
