UG calling on large number of samples (>2,000) : IO issues
I am trying to run UG across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues (apart from having to get the limit on no. of open files (ulimit) increased).
I got another increase in ulimit to allow me to run UG on the larger set. However, our IO is being pushed over the edge with the 2,200 input samples. I have two questions:
- does UG open all of the input bam files at the same time? It seems like it, since a ulimit of 2048 was not sufficient for 2,200 input files.
- is there a way to optimise this, possibly by getting UG to open files sequentially - or do they have to be all open at the same time? I suspect this will become more of a problem as the size of the datasets available increases.
Would appreciate any advice you would have on getting this to run on this size of data.