Error running "gatk/PreProcessingForVariantDiscovery_GATK4" on FireCloud

Dear GATK4 team,
This is Bo from the Broad Institute. I am running "gatk/PreProcessingForVariantDiscovery_GATK4" on my data in FireCloud and got the following error message:
Workflow failed
causedBy:
message: Unable to complete JES Api Request
causedBy:
message: Pipeline 11178155052716254973: Unable to evaluate parameters: parameter "PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.known_indels_sites_VCFs-0" has invalid value: ["gs://broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.vcf.gz"]
Could you let us know what happened?
Thanks,
Bo
Best Answers
-
KateN Cambridge, MA admin
When you changed it to "List of strings" did it update the formatting of the indels array to no longer show the ["..."] wrapping the entries? I had hoped that it would, but if it didn't you may need to remove those characters.
For your other question, I will consult one of my colleagues. I'm not sure if you would need to convert CRAM to BAM, or if Mutect2 has been written to accept either input type.
-
Tiffany_at_Broad Cambridge, MA admin
Hi @bigbadbo I've asked a colleague to take a look at your questions since he puts together & updates the gatk pipelines in our Featured workspaces.
Answers
@bigbadbo
Hi Bo,
I just moved your question to the FireCloud forum where @KateN can help you.
-Sheila
Hi Bo,
Could you share your workspace in FireCloud with [email protected]? I'd like to take a look, as it appears we have a couple versions of that pipeline. It could be that you simply have an out-of-date one, or there could be something else wrong with the way that particular variable was set up.
Hi @KateN , Sure thing. We just shared the workspace. It is called regev-ludwig/Ribo-seq
Looking through the workspace, you do have that particular variable declared correctly. The only difference between yours and the published method is that you have the root entity type set to "participant". Try switching it to "sample", and re-run the workflow.
If that doesn't work, my next instinct is that you should upgrade to the latest version of the method. The one you are referencing uses GATK4 beta, and we have since published methods for GATK4 after it was launched.
Since it appears you are using the b37 resources, I would recommend importing this method's configuration. It is one of our featured methods, and it is a newer version of the one you were using. Try running your workflow with that (setting the root entity type to sample as well), and let me know if it works.
Hi @KateN,
Thanks!
Could you let us know if gatk/mutect2-gatk4 is up-to-date?
Thanks,
Bo
Hi @KateN ,
I have rerun gatk/PreProcessingForVariantDiscovery_GATK4_MC with the sample data model. It failed again.
Could you check whether the workspace attributes under 'regev-ludwig/Ribo-seq' are correct?
Thanks,
Bo
Ah, I think I see the issue now. The workflow is having trouble with the input for known_indels_sites_VCFs. In your Method Config, it cites the workspace attribute workspace.known_indels_array. In the workspace attributes section, you can click "edit" and it'll show a dropdown that says it was determined to be a String type, rather than List of Strings. If you change it to List of Strings, it should work.
To answer your other question, gatk/mutect2-gatk4 is up to date. The best way to find our most up-to-date methods is by looking at the Featured Methods section of the Methods Repository, here.
@KateN, I have changed the type from "String" to "List of strings". Unfortunately, it still did not work. Now I am trying gatk/pre-processing-b37-gatk4 and hopefully it will work.
By the way, it seems that gatk/pre-processing-b37-gatk4 produces CRAM format files but gatk/mutect2-gatk4 requires BAM files. Any comments?
Thanks.
When you changed it to "List of strings" did it update the formatting of the indels array to no longer show the ["..."] wrapping the entries? I had hoped that it would, but if it didn't you may need to remove those characters.
For your other question, I will consult one of my colleagues. I'm not sure if you would need to convert CRAM to BAM, or if Mutect2 has been written to accept either input type.
After having spoken with my colleagues, I have a couple options for you. First off, Mutect2 does require a BAM input, as it cannot take a CRAM right now. The way to get your BAM input is to either take the intermediate BAM that the pre-processing pipeline produces, or convert the CRAM to BAM using this method.
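For anyone who prefers to do the conversion outside FireCloud, here is a minimal sketch of the CRAM-to-BAM step using samtools. This is not the method linked above; it assumes samtools is installed locally and that you have the same reference FASTA the CRAM was written against. The file names in the usage note are hypothetical.

```python
import subprocess


def cram_to_bam_commands(cram_path, reference_fasta, bam_path):
    """Build the samtools commands for a CRAM-to-BAM conversion.

    Assumption: samtools is on PATH and reference_fasta is the reference
    that was used when the CRAM was created.
    """
    # -b writes BAM output, -T names the reference needed to decode the CRAM.
    convert = ["samtools", "view", "-b", "-T", reference_fasta,
               "-o", bam_path, cram_path]
    # Index the new BAM so downstream tools can do region queries.
    index = ["samtools", "index", bam_path]
    return [convert, index]


def run_commands(commands):
    """Run each command in order and stop on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Usage would look something like run_commands(cram_to_bam_commands("sample.cram", "reference.fasta", "sample.bam")). Taking the intermediate BAM from the pre-processing pipeline, as suggested above, avoids the conversion entirely.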
@KateN, it seems that if I removed ["..."], then the pipeline could run to finish. Thanks!
By the way, I used 'vdauwera/BamToUnmappedRGBams' to extract unmapped BAM files. Could you confirm if this WDL is up-to-date?
Thanks,
Bo
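To illustrate why the ["..."] wrapping mattered: an attribute that was entered as a string containing array syntax is a different value from a real list of paths, and each list element has to be a bare gs:// path. The sketch below is only an illustration of that difference in plain Python. The function name is hypothetical and this is not how FireCloud parses attributes internally.

```python
import json


def normalize_path_list(raw):
    """Return a plain list of gs:// paths from a string or list attribute.

    Hypothetical helper: it unwraps any entry that still looks like a
    JSON array literal, e.g. '["gs://bucket/file.vcf.gz"]'.
    """
    if isinstance(raw, str):
        raw = [raw]
    paths = []
    for item in raw:
        text = item.strip()
        if text.startswith("["):
            # The entry still carries the ["..."] wrapping, so parse it.
            paths.extend(json.loads(text))
        else:
            paths.append(text)
    bad = [p for p in paths if not str(p).startswith("gs://")]
    if bad:
        raise ValueError("not a gs:// path: %r" % bad)
    return paths
```

With this helper, both the wrapped string and a one-element list holding the wrapped string come out as a list containing just the bare path, which is the form the workflow input expected once the extra characters were removed.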
@KateN , I have some questions related to setting gatk/mutect2-gatk4's parameters:
1) For Mutect2's 'intervals' parameter, could I set it to workspace.scattered_calling_intervals_list?
2) I do not want to apply the contamination filter. In this case, I should leave 'variants_for_contamination' as blank. Am I correct?
3) Where could I get the 'gnomad' file for b37? This is used for Mutect2's 'gnomad' option.
Thanks.
Bo
Hi @bigbadbo I've asked a colleague to take a look at your questions since he puts together & updates the gatk pipelines in our Featured workspaces.
Hi @bigbadbo
1) For Mutect2's 'intervals' parameter, could I set it to workspace.scattered_calling_intervals_list?
There is a featured workspace for the Mutect2 WDL which demonstrates which resources are available and where they are located in Google Cloud. Check the workspace attributes for the resource files used in the example. SNV workspace
2) I do not want to apply the contamination filter. In this case, I should leave 'variants_for_contamination' as blank. Am I correct?
The WDL contains a conditional statement that runs the CalculateContamination task only if variants_for_contamination is defined. If it is not defined, the WDL will perform basic filtering without the contamination table.
3) Where could I get the 'gnomad' file for b37? This is used for Mutect2's 'gnomad' option.
See answer in question 1.
Also view the following method for a more recent wdl to convert bam to unmapped bam
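The conditional described in answer 2 can be pictured with a small Python sketch. This is not the WDL itself. The step labels are illustrative, and treating a blank value the same as an undefined one is an assumption about how a blank input is passed through.

```python
from typing import List, Optional


def plan_filtering_steps(variants_for_contamination: Optional[str]) -> List[str]:
    """List the steps that would run for a given optional input.

    Assumption: None or an empty string both mean "not defined".
    """
    steps = []
    if variants_for_contamination:
        # Input is defined: estimate contamination, then filter with the table.
        steps.append("CalculateContamination")
        steps.append("filter with contamination table")
    else:
        # Input is not defined: skip the contamination task entirely.
        steps.append("basic filtering without contamination table")
    return steps
```

Calling plan_filtering_steps(None) returns only the basic filtering step, which matches the behaviour described above for a blank 'variants_for_contamination'.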
@Tiffany_at_Broad , thanks a lot!
Hi @bshifaw ,
Thanks a lot for your help!
I have looked at the SNV workspace. But I'm not sure if 'gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list' is the interval list I should use. First, I do not have the permission to view this file. Second, my data are whole genome sequencing data instead of whole exome sequencing data.
Any suggestions?
Thanks,
Bo
What about using the b37 wgs intervals list here: https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?project=broad-dsde-outreach&organizationId=548622027621
@bigbadbo ,
After consulting with the developer of the workflow you have two options.
1) Run WGS sequences by leaving the intervals variable blank.
2) Use the b37 wgs intervals as Tiffany suggested, but note that these intervals were "determined empirically using HaplotypeCaller a few years ago. It's possible that they include some regions that were pathological with 76-bp reads but are no longer problematic. The total size of all excluded regions is very small, so it's only a minor worry." Using the b37 intervals will save you some runtime but lower sensitivity by 0.1%.
Hi @bshifaw , if I leave the intervals blank, will the pipeline automatically calculate intervals for me? Or will it not do a scatter?
Thanks,
Bo
Intervals will be generated by the SplitIntervals task with or without the intervals variable set.
@bshifaw , thanks a lot for your help!
I have one more question about the gnomad file listed in your SNV workspace.
gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf
Is the above vcf file only for exome data? Or we can also use it for our genome data?
Thanks!
Bo
Yes, that's fine to run with genome data. For more background on gnomAD (Genome Aggregation Database), please visit the gnomAD browser.
Thanks a lot!
@bshifaw @Tiffany_at_Broad @KateN,
Thanks a lot for your help. I've almost finished this analysis.
When I run mutect2 on the normal-tumor pair, I encountered this error for call #50 of Mutect2.M2:
"htsjdk.samtools.FileTruncatedException: Premature end of file: /f4653fc5-f4a9-4f7e-ab53-1725b639043f/PairedEndSingleSampleWorkflow/e25a51f7-139d-40b7-8a5a-e8b5a2417ef1/call-GatherBamFiles/example_tumor.bam"
However, example_tumor.bam was generated by gatk/pre-processing-b37-gatk4, which finished successfully. So I do not think this BAM file is truncated.
To make sure it is not a random GCP problem, I reran mutect2. This time I got a similar but different error message at call #50 of Mutect2.M2:
"htsjdk.samtools.FileTruncatedException: Premature end of file: /27d5041a-60b4-4be2-b8e2-185547225d35/PairedEndSingleSampleWorkflow/08661c64-3b4b-44d1-a10d-8cd5e6dabf0a/call-GatherBamFiles/example_fibroblast.bam"
Again, I do not think "example_fibroblast.bam" is truncated.
Any suggestions?
Thanks,
Bo
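One quick sanity check for a suspected truncated BAM is to look for the 28-byte BGZF end-of-file marker that the SAM/BAM specification defines for the end of every complete BAM file. The sketch below does only that. A file that passes can still be damaged elsewhere, and the exception above could also come from how the file was copied to the worker, so treat this as a first check and not a diagnosis.

```python
import os

# Empty BGZF block that terminates a complete BAM file (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Read only the last 28 bytes instead of the whole file.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

Running has_bgzf_eof on a local copy of the BAM (the path is up to you) tells you whether the final block is present. If it returns False, the file was cut short at some point between being written and being read.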
@bigbadbo , in the long run it is neater to keep different questions in separate forum posts. Please post your latest question in a new forum post.
Thanks
@bshifaw , sure thing.