Recommendation on performance using scatter/gather

Laurent (Member, Broadie; Posts: 43)
edited April 2013 in Ask the GATK team

Dear all,

I am currently running the HaplotypeCaller on 300 large BAM files on our cluster, and I decided to chunk the genome into 3 Mb bins so that they can be processed in a decent time. However, I'm experiencing very long runtimes as more and more jobs get scheduled to run in parallel on the same files. Looking at the GATK options, I saw these two that I thought could help, and I was wondering what the recommendations are for using them:
--num_bam_file_handles
--read_buffer_size

More precisely, does --num_bam_file_handles increase processing time by a lot? And what is the default value for --read_buffer_size?

Thanks a lot,
Laurent
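
For readers following along, the 3 Mb binning described above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the function name `make_bins` and the contig names and lengths are hypothetical, and the output uses the `contig:start-stop` interval strings (1-based, inclusive) that can be passed to `-L`.

```python
def make_bins(contig_lengths, bin_size=3_000_000):
    """Split each contig into consecutive bins of at most bin_size bases.

    Returns interval strings of the form contig:start-end,
    with 1-based inclusive coordinates.
    """
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, bin_size):
            # The last bin on a contig is truncated at the contig end.
            end = min(start + bin_size - 1, length)
            intervals.append(f"{contig}:{start}-{end}")
    return intervals


if __name__ == "__main__":
    # Hypothetical contig lengths; real values would come from the
    # reference sequence dictionary (.dict) or the .fai index.
    for interval in make_bins({"chrA": 7_000_000, "chrB": 2_500_000}):
        print(interval)
```

Each resulting interval can then be submitted as one scatter job, with the per-interval outputs gathered afterwards.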

Answers

  • Laurent (Member, Broadie; Posts: 43)

    Hi Mark,

    Thanks for the explanation! I will dig further into the problem, but at the moment what I am reporting is only an observation: my runtimes get extremely high when running in parallel. I have observed similar problems in the past when running other walkers with scatter/gather on our cluster. I'll try extracting the regions with PrintReads to a local scratch disk beforehand and let you know if this helps! I'll also use the nightly build to benefit from the latest improvements.
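
A minimal sketch of the PrintReads pre-extraction idea mentioned above, assuming the GATK 3.x command-line conventions (`-T`, `-R`, `-I`, `-L`, `-o`). The helper name `print_reads_command` and every path below are hypothetical, and the arguments should be checked against the documentation for the GATK version in use. The function only builds the argument list; it does not run anything.

```python
import os


def print_reads_command(gatk_jar, reference, bam, interval, scratch_dir):
    """Build a GATK 3.x PrintReads command that writes the reads
    overlapping one interval to a BAM file on local scratch."""
    # Derive an output file name from the interval, e.g. chr1_1-3000000.bam
    out_bam = os.path.join(scratch_dir, interval.replace(":", "_") + ".bam")
    return [
        "java", "-jar", gatk_jar,
        "-T", "PrintReads",
        "-R", reference,
        "-I", bam,
        "-L", interval,
        "-o", out_bam,
    ]


if __name__ == "__main__":
    # All paths here are placeholders for illustration only.
    cmd = print_reads_command(
        "GenomeAnalysisTK.jar", "ref.fasta", "sample.bam",
        "chr1:1-3000000", "/scratch",
    )
    print(" ".join(cmd))
    # To execute: subprocess.run(cmd, check=True)
```

The downstream HaplotypeCaller job for that interval would then read the small local BAM rather than the shared file, which is the effect the poster is hoping will reduce contention.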
