UG calling on a large number of samples (>2,000): IO issues
I am trying to run UG (UnifiedGenotyper) across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues, apart from having to get the limit on the number of open files (ulimit) increased.
I got a further increase in ulimit to allow me to run UG on the larger set. However, with 2,200 input samples our IO is being pushed over the edge. I have two questions:
- Does UG open all of the input BAM files at the same time? It seems so, since a ulimit of 2048 was not sufficient for 2,200 input files.
- Is there a way to optimise this, possibly by getting UG to open files sequentially, or do they all have to be open at the same time? I suspect this will become more of a problem as the available datasets grow.
I would appreciate any advice on getting this to run on a dataset of this size.
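
For reference, here is a minimal sketch of how the open-files limit could be compared against the number of inputs before launching a job. It only reads the limit of the current process through Python's standard `resource` module and says nothing about how UG itself handles files. The function names, the `handles_per_input` value (one descriptor per BAM) and the `reserved` margin for other files are assumptions made for illustration, not measured values.

```python
import resource


def descriptors_needed(n_inputs, handles_per_input=1, reserved=64):
    """Rough estimate of file descriptors required.

    handles_per_input and reserved are illustrative assumptions,
    not values taken from GATK.
    """
    return n_inputs * handles_per_input + reserved


def limit_is_enough(soft_limit, needed):
    """True if the soft open-files limit covers the estimated need."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= needed


if __name__ == "__main__":
    # Soft and hard limits for the current process (what `ulimit -n` reports).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = descriptors_needed(2200)
    print(f"soft limit: {soft}, hard limit: {hard}, estimated need: {needed}")
    print("enough" if limit_is_enough(soft, needed) else "not enough")
```

Under these assumptions, 2,200 inputs already exceed a limit of 2048 with one descriptor each, which is consistent with the failure described above.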