How many genomes do you run UnifiedGenotyper on?

I'm running UnifiedGenotyper on a large number of post-processed BAMs, which of course makes it run slower than it does on individual genomes. There's an accuracy advantage to calling many genomes jointly, but the returns diminish as n increases.

My question is: when you have a very large number of 4x full genomes (not exomes) available, say several thousand, at what point is the advantage of including another genome in a single variant-calling run outweighed by the disadvantages: longer runtime, higher chance of failure, a gigantic VCF output file, and so on? Where is your cutoff point? Of course it depends on the speed and reliability of your computing system, but experience from any system would be useful. Do you yourself call variants on at most 50 BAMs at a time? 500? 5000?

The G1K methodology doesn't say where they set their run-size cutoff, but both they and UK10K seem to distribute their VCFs at about a thousand genomes per file. I'm tempted to take that as a hint, but I want to ask the smart people of the community first.

Answers

  • redzengenoist Posts: 27 Member

    Cool @ami, thanks for a quick reply.

    So, 60k for exomes, and 2k for 4x wgs, no problem. And to be clear, 2k @4x is the highest you did.

    No indication that you wanted to batch, for 2k @4x?

    In my own case, system-wide "blips" (nothing to do with GATK; it's the compute cluster) occasionally sabotage processes that run for a long time. That is why I've been tempted to sacrifice a bit of accuracy for higher process reliability. I also want to be able to parallelize better. But I guess you don't have this problem; my cluster hardware is sort of dated. ^_^

  • Geraldine_VdAuwera Posts: 6,192 Administrator, GATK Developer admin

    Does your cluster support Queue? Using Queue to scatter-gather your jobs would solve your issue with system blips. It can chop up your UG jobs into small intervals so the processes don't run as long, combine the results appropriately, and will auto-manage re-running any jobs that do fail.

    Geraldine Van der Auwera, PhD
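
To make the scatter-gather idea above concrete, here is a minimal Python sketch of the general pattern. It is not Queue's API and makes no claim about how Queue works internally: `scatter_gather` and `run_job` are hypothetical names, with `run_job` standing in for whatever launches one UnifiedGenotyper job on one interval. Each interval is retried a few times on its own, so a cluster blip costs one small job rather than the whole run.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(intervals, run_job, max_attempts=3, workers=4):
    """Run run_job on every interval in parallel, retrying failures.

    run_job is a hypothetical callable (assumption): it takes one interval
    and returns that interval's result, raising an exception on failure.
    """

    def run_with_retry(interval):
        last_error = None
        for _ in range(max_attempts):
            try:
                return run_job(interval)
            except Exception as err:  # e.g. a transient cluster failure
                last_error = err
        raise RuntimeError(
            f"{interval} failed after {max_attempts} attempts"
        ) from last_error

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the gathered
        # list lines up with the original interval list.
        return list(pool.map(run_with_retry, intervals))
```

Because only the failed interval is re-run, shorter intervals mean less work lost per failure, at the price of more jobs to schedule and merge.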

  • ami Posts: 35 GATK Developer mod

    Just as a technical note: we do run it all in parallel. For the 26K exomes, we ran joint calling on each chromosome separately and scattered each chromosome into 1000 jobs (this works because UG can process each locus separately). You can do this relatively easily with Queue (there is a workshop next month that will teach how to do so). As to your question: we are sure that running 2k @4x in batches is not the best (or even a good) way to run it.
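
As a rough illustration of scattering one chromosome into many jobs, the sketch below splits a contig into contiguous, non-overlapping intervals written as `contig:start-end` strings. The function name and the contig length in the usage note are made up for illustration, and this is only an assumption about how such a split could be done, not a description of what Queue does.

```python
def split_into_intervals(contig, length, n_jobs):
    """Split the 1-based inclusive range [1, length] into contiguous intervals.

    Returns at most n_jobs strings of the form "contig:start-end".
    Interval sizes differ by at most one base.
    """
    if length < 1 or n_jobs < 1:
        raise ValueError("length and n_jobs must be positive")
    n_jobs = min(n_jobs, length)  # never create empty intervals
    base, extra = divmod(length, n_jobs)
    intervals = []
    start = 1
    for i in range(n_jobs):
        # The first `extra` intervals absorb the remainder, one base each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        intervals.append(f"{contig}:{start}-{end}")
        start = end + 1
    return intervals
```

For example, `split_into_intervals("20", 1_000_000, 1000)` gives 1000 intervals of 1000 bases each, starting with `20:1-1000`. Since each locus is called independently, the per-interval outputs can be concatenated in interval order afterwards.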

  • redzengenoist Posts: 27 Member

    I've been using the local queueing system, but it's probably less smart about parallelization than Queue (I believe it parallelizes by thread, but not by node).

    OK, I'm going to parallelize with Queue and then report back to you both on the performance improvement.
