Elapsed time of the CNVDiscoveryPipeline

@bhandsaker
Hi Bob, why is the CNVDiscoveryPipeline so time-consuming? I am testing a single WGS sample (about 30x). It has been running for about 4 days and is still running. This is my script for the CNVDiscoveryPipeline:

#!/bin/bash

# If you adapt this script for your own use, you will need to set these two variables based on your environment.

# SV_DIR is the installation directory for SVToolkit - it must be an exported environment variable.

# SV_TMPDIR is a directory for writing temp files, which may be large if you have a large data set.

export SV_DIR=/work/SoftW/svtoolkit
SV_TMPDIR=2016006L-3-1/tmpdir_CNVDiscovry

runDir=2016006L-3-1
inputFile=/work1/wsh/4.test/1.perl/1.pipetest/WGS/2016006L-3-1.dedupped.bam
sites=2016006L-3-1.discovery.vcf
genotypes=2016006L-3-1.genotypes.vcf

# These executables must be on your path.

which java > /dev/null || exit 1
which Rscript > /dev/null || exit 1
which samtools > /dev/null || exit 1

# For SVAltAlign, you must use the version of bwa compatible with Genome STRiP.

export PATH=${SV_DIR}/bwa:${PATH}
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

mkdir -p ${runDir}/logs || exit 1
mkdir -p ${runDir}/metadata || exit 1

java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile conf/genstrip_parameters.txt \
-R /work/wsh/0.Pipeline/TargetSeq/Genome_STRiP_ref/Homo_sapiens_assembly19.fasta \
-I ${inputFile} \
-md ${runDir}/metadata \
-runDirectory ${runDir} \
-jobLogDir ${runDir}/logs \
-intervalList /work/wsh/0.Pipeline/TargetSeq/Genome_STRiP_ref/Homo_sapiens_assembly19.interval.list \
-genderMapFile /work1/wsh/4.test/1.perl/1.pipetest/WGS/2016006L-3-1_gender.map \
-jobRunner Shell \
--disableJobReport \
-tempDir ${SV_TMPDIR} \
-gatkJobRunner Shell \
-retry 10 \
-tilingWindowSize 1000 \
-tilingWindowOverlap 500 \
-maximumReferenceGapLength 1000 \
-boundaryPrecision 100 \
-minimumRefinedLength 500 \
-genotypingParallelRecords 500 \
-run

Could you help me check whether there are any mistakes in my script? Thank you very much.

Answers

  • bhandsaker (Member, Broadie, Moderator)

    The pipeline is designed to run on multiple samples (generally 20 to 30 or more, but for best results batches of 100 or so are preferable). I'm not sure what the expected behavior would be on one sample, but it's possible this is causing it to run for an excessively long time.

  • bhandsaker (Member, Broadie, Moderator)

    If you want to run a small test, it is better to use multiple samples but run on a small interval (e.g. using -intervalList with a single small interval).
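
As a rough illustration of the suggestion above, the following sketch prepares the two inputs for a small multi-sample test. The directory name, BAM paths, and the interval itself are hypothetical placeholders. It also assumes that -I accepts a text file with a .list extension containing one BAM path per line, following the usual GATK convention, and that the interval file holds one contig:start-end entry per line. Check both assumptions against the Genome STRiP documentation for your version.

```shell
#!/bin/bash
# Sketch only: prepare inputs for a small multi-sample test run.
# All file names and the interval below are hypothetical placeholders.

testDir=cnv_small_test
mkdir -p "${testDir}" || exit 1

# One BAM path per line. Replace these with your own samples
# (the answer above suggests 20 to 30 or more).
cat > "${testDir}/bams.list" <<'EOF'
/path/to/sample01.bam
/path/to/sample02.bam
/path/to/sample03.bam
EOF

# A single small interval in contig:start-end notation.
# Contig name "20" assumes Homo_sapiens_assembly19 naming (no "chr" prefix).
echo "20:1000000-2000000" > "${testDir}/small.interval.list"

# In the original command, these two arguments would then become:
echo "-I ${testDir}/bams.list"
echo "-intervalList ${testDir}/small.interval.list"
```

The rest of the command line from the question would stay unchanged. Only the input list and the interval list are swapped, so the test covers many samples over a small region instead of one sample over the whole genome.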
