Many simultaneous GATKs taking down Lustre storage
We have a cluster pointing at Lustre scratch storage on a DDN device. Over the last couple of weeks, we've seen a whole bunch of nodes go down when people were running hundreds of simultaneous GATK jobs reading from and writing to that scratch system. It has occurred with at least two users in different groups running newish versions of GATK (possibly 2.6-5 or 2.7-4, but we're still analyzing things, and it looks like one user had a nightly build from January 20). Some of the runs used -nct 12 and others had no -nct option. Obviously this generates a lot of I/O, but we haven't seen similar crashes from other programs doing heavy I/O on our system. When small batches of these same jobs are rerun, they finish OK, so it's probably not a function of the particular input data.
Unfortunately, a search for "lustre" on the forum just turns up a bunch of messages containing /lustre file paths. Can anyone tell me whether there are reports of this kind of problem with GATK?
Harvard Medical School