Best way to get a definitive list of shards that succeeded in a scatter gather workflow that failed

cmt, Seattle WA, Member
edited July 17 in Ask the GATK team


I am using GATKv on Google Cloud to joint-genotype 12 pooled samples (each pool contains 50 individual fish; I used a ploidy of 20 in the analysis). I don't have a complete reference genome for my species, so I am using the fairly good genome of a sister species, which has 24 linkage groups and more than 8,000 scaffolds.

Because of the structure of my data, I have split the genotyping into two parts. I am running the linkage groups (which can use GenomicsDB and then GenotypeGVCFs) separately from the scaffolds. For the scaffolds, I am combining them with CombineGVCFs before running GenotypeGVCFs. With 8,000+ scaffolds, I'm trying a hierarchical scatter/gather approach to combining the GVCF files. I combined groups of scaffolds first. Some scaffolds did not combine as part of a group, so I am now scattering over those scaffolds individually, about 3,000 at a time.

Some scaffolds are troublesome and will not combine. There are a variety of errors, and I do want to solve the ones that I can (out-of-memory errors are pretty easy, for example). Most of the scaffolds combine with no problems (out of 3,000 scaffolds, about 100 failed), but when one or more of the scatter tasks fails, the whole run fails and no output is listed for the scaffolds that worked. To move forward, I have to figure out the best way to know which scatter tasks/scaffolds failed and which completed.

I have tried a couple of things.
1. Sort through the "failures" list on the summary page of the job and pull out the shard numbers.
2. Search the "calls" page for the words "done", "failed", and "retry failure", and count each to compare against the other methods listed here.
3. Use gsutil to search recursively through all the folders in the Google bucket for paths that end in .tbi. At the time that seemed like a decent indication that the scatter had worked, but I no longer think so. (Is there something else that would tell me a scaffold worked?)
4. Use a Python script to pull all the shard numbers associated with the phrase "Status change from Running to Success" from the workflow log for the job.
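
For reference, the log-parsing approach in item 4 might look roughly like the sketch below. It is only an illustration: the log line layout (a call named `<workflow>.<task>:<shard>:<attempt>` right before the status message) and the names `SHARD_STATUS` and `shards_with_status` are assumptions made for this example, not something taken from my actual logs, so the pattern would need adjusting to whatever the real lines look like.

```python
import re

# ASSUMPTION (hypothetical layout, not copied from a real log): each status line
# names the call as "<workflow>.<task>:<shard index>:<attempt>" just before the
# message, for example:
#   ... [UUID(1a2b3c4d)CombineScaffolds.CombineGVCFs:42:1]: Status change from Running to Success
SHARD_STATUS = re.compile(
    r"(?P<call>[\w.]+):(?P<shard>\d+):(?P<attempt>\d+)\]?:?\s*"
    r"Status change from (?P<old>\w+) to (?P<new>\w+)"
)


def shards_with_status(log_lines, status="Success"):
    """Return sorted shard indices whose log line reports a change to `status`."""
    shards = set()
    for line in log_lines:
        match = SHARD_STATUS.search(line)
        if match and match.group("new") == status:
            shards.add(int(match.group("shard")))
    return sorted(shards)


if __name__ == "__main__":
    # Made-up lines in the assumed layout, only to show the call.
    example = [
        "2019-07-17 12:00:00,000 INFO - [UUID(1a2b3c4d)CombineScaffolds.CombineGVCFs:42:1]: "
        "Status change from Running to Success",
        "2019-07-17 12:00:05,000 INFO - [UUID(1a2b3c4d)CombineScaffolds.CombineGVCFs:7:1]: "
        "Status change from Running to Failed",
    ]
    print(shards_with_status(example))
```

A list built this way only reflects what the log says about the backend job status, so it is worth cross-checking against the other methods rather than treating it as definitive.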

All of those give me different answers for which shards, and how many, finished combining successfully. The workflow log seemed like the best solution, except that it reported as successful some shards that were listed as failed on the summary page. When I went to those shards' stderr files, they had OutOfMemory errors. **The "failures" list on the summary page now seems like the best indication of what has actually failed. Is that right?**

I think that going forward I will probably copy the failures list from the summary page to a text file and parse it for the shard numbers to get the shards that failed. To get the paths that worked, I can use my gsutil recursive search to pull out all the paths that end in .tbi (it does seem to get me closer to the true list of shards that worked than searching for .g.vcf), then remove any shards that failed.
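
A minimal sketch of that plan, in case it helps anyone suggest something better. Two things here are assumptions on my part rather than confirmed facts: that each failure message names the call as `<workflow>.<task>:<shard>:<attempt>`, and that each scatter's outputs sit under a `shard-<index>` folder in the bucket. The function names and the example paths are hypothetical.

```python
import re

# ASSUMPTION: failure messages name the call as "<workflow>.<task>:<shard>:<attempt>".
FAILED_CALL = re.compile(r"[A-Za-z_]\w*(?:\.\w+)+:(\d+):\d+")
# ASSUMPTION: each scatter's outputs sit under a ".../shard-<index>/..." folder.
SHARD_DIR = re.compile(r"/shard-(\d+)/")


def failed_shards(failures_text):
    """Collect the shard indices mentioned in the pasted failures list."""
    return {int(m.group(1)) for m in FAILED_CALL.finditer(failures_text)}


def successful_paths(paths, failed):
    """Map shard index -> .tbi path for shards that are not in `failed`.

    If a shard has several matching paths, the last one listed wins.
    """
    kept = {}
    for path in paths:
        path = path.strip()
        match = SHARD_DIR.search(path)
        if match is None or not path.endswith(".tbi"):
            continue
        shard = int(match.group(1))
        if shard not in failed:
            kept[shard] = path
    return kept


if __name__ == "__main__":
    # Made-up inputs: text pasted from the summary page and the lines saved
    # from a recursive gsutil listing of the run's bucket folder.
    failures = "Task CombineScaffolds.CombineGVCFs:17:1 failed.\n"
    listing = [
        "gs://example-bucket/run/call-CombineGVCFs/shard-16/out.g.vcf.gz.tbi",
        "gs://example-bucket/run/call-CombineGVCFs/shard-17/out.g.vcf.gz.tbi",
    ]
    for shard, path in sorted(successful_paths(listing, failed_shards(failures)).items()):
        print(shard, path)
```

Keying the result by shard index makes it easy to compare the count against the number of scaffolds scattered and to spot shards that appear in neither list.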

Is there a simpler solution? Am I missing something obvious?


