Many simultaneous GATKs taking down Lustre storage

Hi,

We have a cluster pointing at a Lustre scratch storage system on a DDN device. Over the last couple of weeks, we've noticed a whole bunch of nodes going down when people were running hundreds of simultaneous GATK jobs reading from and writing to that scratch system. It occurred with at least two users in different groups running newish versions of GATK (possibly 2.6-5 or 2.7-4, but we're still analyzing things, and it looks like one user had a nightly build from January 20). It looks like some of the runs used -nct 12 and others had no -nct option. Obviously this would cause lots of I/O, but we haven't seen similar crashes from other programs doing heavy I/O on our system. When small batches of these same jobs are rerun, they seem to finish OK, so it's probably not a function of the particular input data.

Unfortunately, a search for "lustre" on the forum just turns up a bunch of messages containing /lustre file paths. So can anyone tell me whether there have been reports of this kind of error with GATK?

Thanks,

-Amir Karger
Research Computing
Harvard Medical School


Answers

  • ebanks (Broad Institute) · Member, Broadie, Dev ✭✭✭✭

    Hi Amir,

    This is Eric Banks. Long time no see.

    We don't have any experience with Lustre here, so we'll need to rely on external users for feedback. I can say that we use Isilon-class storage for our big GATK calling needs and have never really had a problem with it going down.

    A couple of thoughts:

    1. Do you know how many simultaneous reads/writes were occurring when the nodes started to fail? It would be good to diagnose the limits of that storage system. We created a private tool called "IOCrusher" to help us test the limits of our local hardware (a rough sketch of that kind of test follows this list).

    2. I wonder whether you'd see the same behavior with many simultaneous jobs where none of them use the -nct option.
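
    To give a sense of the kind of test I mean in point 1, here is a minimal sketch in Python of many processes hammering a shared filesystem at once and reporting how long it takes. To be clear, this is not IOCrusher (that tool is internal and not shown here), and the target path, file size, and worker count are placeholders you would tune for your own system.

        # Minimal parallel I/O stress sketch (illustrative only, not IOCrusher).
        # Each worker writes a large file to the filesystem under test, fsyncs it,
        # re-reads it, and deletes it; the parent reports the aggregate result.
        import os
        import time
        from multiprocessing import Pool

        TARGET_DIR = "/lustre/scratch/iotest"   # placeholder -- point at the filesystem under test
        FILE_SIZE_MB = 1024                     # data written (and re-read) per worker
        BLOCK = b"\0" * (1 << 20)               # 1 MiB write block

        def worker(i):
            path = os.path.join(TARGET_DIR, "stress_%d.dat" % i)
            start = time.time()
            with open(path, "wb") as f:
                for _ in range(FILE_SIZE_MB):
                    f.write(BLOCK)
                f.flush()
                os.fsync(f.fileno())
            with open(path, "rb") as f:
                while f.read(1 << 20):
                    pass
            os.remove(path)
            return time.time() - start

        if __name__ == "__main__":
            n_workers = 64                      # ramp this up to probe the storage's limits
            os.makedirs(TARGET_DIR, exist_ok=True)
            t0 = time.time()
            with Pool(n_workers) as pool:
                per_worker = pool.map(worker, range(n_workers))
            total_mb = 2 * FILE_SIZE_MB * n_workers
            print("%d workers moved ~%d MB in %.1f s (slowest worker: %.1f s)"
                  % (n_workers, total_mb, time.time() - t0, max(per_worker)))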

    Anyway, good luck. Please let us know if you find anything interesting.

  • Hi, Eric!

    Our Lustre system tends to have much faster read/write performance than our Isilon storage, so we encourage people to use it and then transfer just the results they need to Isilon.

    I don't know what the I/O load was at the time. It looks like hundreds of simultaneous runs of HaplotypeCaller, UnifiedGenotyper, or DepthOfCoverage managed to bring things down. The HaplotypeCaller run was using -nct 12 while reserving only one core, which can't have been good (see the P.S. below). The other runs had no -nct option.

    -Amir
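
    P.S. For reference, a minimal sketch of keeping -nct in line with the cores the scheduler actually reserved. It assumes a SLURM-style SLURM_CPUS_PER_TASK environment variable; the helper name, jar path, and file names are just placeholders, and other schedulers expose the reserved core count differently.

        # Hypothetical helper: build a classic GATK 2.x HaplotypeCaller command whose
        # -nct value never exceeds the number of cores the scheduler reserved.
        import os
        import shlex

        def haplotypecaller_cmd(bam, ref, out_vcf, requested_nct=12):
            # SLURM exposes the reserved core count as SLURM_CPUS_PER_TASK;
            # adjust the variable name for your scheduler.
            reserved = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
            nct = min(requested_nct, reserved)
            return ("java -jar GenomeAnalysisTK.jar -T HaplotypeCaller"
                    " -R %s -I %s -o %s -nct %d"
                    % (shlex.quote(ref), shlex.quote(bam), shlex.quote(out_vcf), nct))

        if __name__ == "__main__":
            # Inspect the command before handing it to your job script.
            print(haplotypecaller_cmd("sample.bam", "ref.fasta", "sample.vcf"))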

  • TechnicalVault (Cambridge, UK) · Member ✭✭✭

    Also, which versions of the Lustre client and server are you using? The Sanger makes quite heavy use of Lustre and I've not seen GATK take down any of our nodes, though that may be because of how we use it.
