Cromwell is very slow while running RealignerTargetCreator

dannykwells (San Francisco, Member)

Hi everyone,

I am trying to develop a simple example pipeline that uses WDL to call somatic variants in a paired tumor/normal sample. In this pipeline I am employing RealignerTargetCreator (since I plan to use Mutect1 as one method to call SNVs), and, as documented in this thread (http://gatkforums.broadinstitute.org/wdl/discussion/6800/known-sites-for-indel-realignment-and-bqsr-in-hg38-bundle), I am inputting Mills_and_1000G_gold_standard.indels.hg38.vcf and 1000G_phase1.snps.high_confidence.hg38.vcf.gz as known sites.

My call looks like this:
command {
  java -Xmx8g -jar /task/GenomeAnalysisTK-3.4-g3c929b0.jar \
    -T RealignerTargetCreator \
    -R ${ReferenceGenome} \
    -nt 3 \
    -known ${indelRef1} \
    -known ${indelRef2} \
    -I ${inputFile} \
    -o ${sampleName}_${type}_indelints.intervals \
    --allow_potentially_misencoded_quality_scores
}

If I just run this command in the terminal (changing paths, etc.), it completes in 5-10 minutes (on 200 reads). When I run it in the context of my WDL pipeline using Cromwell, it just hangs and does not complete (I left it overnight with no progress). Interestingly, during this time my CPU usage also drops to essentially zero.

Do you have any guidance about what could be going on here?

Comments

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @dannykwells, what kind of infrastructure are you running on? And can you tell me why you are using the --allow_potentially_misencoded_quality_scores argument?

  • dannykwells (San Francisco, Member)

    Hi @Geraldine_VdAuwera!

    1. I am developing locally (4-core Mac) with the plan to push to GCP. For now I am running the pipeline on only 200 reads; I will run entire datasets in the cloud once the pipeline is working.

    2. The --allow_potentially_misencoded_quality_scores argument is one my collaborator uses. Are there any known bad behaviors, or is it generally not a desired argument? I had never seen it used before, personally.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @dannykwells,

    Try providing genome intervals using -L to speed up execution. With the old references this wasn't really necessary, but hg38 has a large number of ALT contigs and I suspect that may be what's causing the slowdown. You can either simply pass a list of canonical contigs or use the genomic intervals that we provide in the resource bundle. See the recent tutorial on hg38 for more details about how the pipeline deals with the ALTs. To be clear, I'm not 100% sure that's what's causing it, but it's worth a shot.

    That argument overrides a safeguard that has to do with base quality encodings. It should only be used on a case by case basis after checking that it is justified. Definitely not something you want to apply by default...
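
A minimal sketch of the "list of canonical contigs" option suggested above. It assumes the hg38 reference uses UCSC-style contig names (chr1 through chr22, chrX, chrY, chrM) and that GATK is given the intervals as a plain-text file with one contig name per line. The file name canonical_contigs.list is hypothetical, and whether to include chrM is a choice, not something stated in the thread.

```shell
#!/bin/sh
# Write the canonical hg38 contig names, one per line, to an intervals list.
# The output file name is an assumption made for this example.
out=canonical_contigs.list

{
  # Autosomes chr1..chr22
  for i in $(seq 1 22); do
    echo "chr$i"
  done
  # Sex chromosomes and mitochondrial contig
  echo "chrX"
  echo "chrY"
  echo "chrM"
} > "$out"
```

Adding `-L canonical_contigs.list` to the RealignerTargetCreator command should then restrict the traversal to those contigs and skip the ALT contigs. In a WDL task the file would need to be declared as a File input so that Cromwell localizes it alongside the other inputs.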

  • dannykwells (San Francisco, Member)

    Hi @Geraldine_VdAuwera, I'll give this a try, but to be honest, my current hypothesis is that this is a Cromwell issue (since the command completes in a reasonable amount of time when run outside of Cromwell). I've also tried running this command with hg19 (from UCSC), and it did not complete either.

    Do you have any idea why Cromwell might be grumpy at the above command?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, fair point about it working when you run Cromwell-free. What happens if you remove -nt? And what runtime are you specifying?

  • dannykwells (San Francisco, Member)

    Hi @Geraldine_VdAuwera, our current hypothesis is that Docker is crashing during this step, and that it's not really a Cromwell issue. I've been pushing on other things, but when I learn more I'll post it here.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Thanks @dannykwells, let us know if there's anything we can do to help.