Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CNV disco hanging at stage 12

Will_GilksWill_Gilks University of Sussex, UKMember ✭✭
edited May 2016 in GenomeSTRiP

Hi @cwhelan @bhandsaker ,

I’m running CNV discovery on 200 Drosophila whole genomes (30X, 140Mb each). The run goes for 3 minutes, and then hangs at stage 12, with the last log entry being:

INFO 18:46:48,044 DrmaaJobRunner - Submitted job id: 9188590 INFO 18:46:48,045 QGraph - 16832 Pend, 1870 Run, 0 Fail, 3 Done

I've attached my code and log. My sysadmin has set the server so jobs can be submitted automatically by genomestrip. I’ve set the code to do each of the major Drosophila scaffolds separately, using e.g "-L chr2L" because, i) There are a lot of small scaffolds which tend to cause problems, and ii) The "specify intervals" command doesn’t seem to work in this case.

The run hangs for at least 12 hours at stage 12. The created metadata and cnv_sentinel folders are empty. The stage 11 directory is populated with folders for each of the ~2000 scaffolds (including unwanted ones), and the stage 12 directory is empty.

I'm going to try this on smaller regions, and fiddle with a the window sizes a bit, but otherwise I really don't know where the problem is. My custom mask files and configuration data worked for the deletion pipeline. Any suggestions?

Sincerely,

William Gilks

Best Answers

Answers

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker

    Thanks. It worked with -intervalList test_intervals.list in which the .list file contained only "chr2L".

    The hanging at stage 12 is resolved. The only indication that the job was running at this stage was increase tmpdir/ file sizes, so I guess there wasn't really a problem, it was just me being impatient/paranoid.

    I've pasted in the working code in case it's useful for anyone, or anyone has suggestions for improvement.

    I've also attached the depth-distribution plots from some duplications. Some seem a bit imperfect.

    #!/bin/sh
    #$ -N log_gStrip_CNVs2
    #$ -pe openmp 10
    #$ -S /bin/sh
    #$ -cwd
    #$ -j y
    #$ -q bioinf.q
    
    ## DESCRIPTION
    ## Genomic structural variation identification and genotyping.
    ## Drosophila melanogaster LHm hemiclones and one Berkeley Reference Genome individual.
    ## Genomestrip CNV discovery pipeline.
    ## WGS-NGS data, Sussex Uni SGE.
    ## Issues with specifying intervals, and with looping through major chromosomes.
    
    
    
    . /etc/profile.d/modules.sh
    
    ## State date & time, set timer, state environment, full log, terminate on error.
        echo "`date +%y/%m/%d_%H:%M:%S`"
        START=$(date +%s.%N)
        uname -sa
        set -ex
        find . -type f -exec ls -lt "{}" \;
    
    
        module load sge
        module load genomestrip/2.0
        module load jre/1.7.0_25
    
    
    
    
    ## Set input variables
    
        SV_TMPDIR=./tmpdir
        SV_DIR=/cm/shared/apps/svtoolkit/2.0.1602/
        bams=../reference_data/cnv_lhm_RG_bams.list
        ref_seq=../reference_data/dm6.fa
        my_config=../reference_data/params_dm6_gstrip.txt
        my_meta=../preprocessing_metadata
        genome_mask=../reference_data/dm6.svmask.fasta
        depth_mask=../reference_data/dm6.rdmask.bed
        ploidy=../reference_data/ploidy_dm6.map
        gender_map=../reference_data/gstrip_lhm_rg_gender.map
    
    
    
            which java > /dev/null || exit 1
            which Rscript > /dev/null || exit 1
            which samtools > /dev/null || exit 1
    
                export PATH=${SV_DIR}/bwa:${PATH}
                export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
    
        mx="-Xmx4g"
        classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
    
        java -cp ${classpath} ${mx} -jar ${SV_DIR}/lib/SVToolkit.jar
    
    
    
    ## CNV discovery
    
    #     mkdir chr2L_cnvs
    #     runDir=chr2L_cnvs
    #     mkdir -p ${runDir}/logs || exit 1
    #     mkdir -p ${runDir}/metadata || exit 1
    
            mkdir -p logs || exit 1
            mkdir -p metadata || exit 1
    
            java -cp ${classpath} ${mx} org.broadinstitute.gatk.queue.QCommandLine \
                 -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
                 -S ${SV_DIR}/qscript/SVQScript.q \
                 -jobRunner Drmaa \
                    -gatkJobRunner Drmaa \
                        -jobNative '-V -pe openmp 10 -q bioinf.q' \
                        -cp ${classpath} \
                        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
                        -configFile ${my_config} \
                            -R ${ref_seq} \
                            -I ${bams} \
                                -genderMapFile ${gender_map} \
                                -runDirectory . \
                                -md ${my_meta} \
                                -jobLogDir logs \
                                -genomeMaskFile ${genome_mask} \
                                -readDepthMaskFile ${depth_mask} \
                                -ploidyMapFile ${ploidy} \
                                -intervalList test_intervals.list \
                        -tempDir ${SV_TMPDIR} \
                    -disableGATKTraversal \
                    -minimumRefinedLength 500 \
                    -tilingWindowSize 1000 \
                    -tilingWindowOverlap 500 \
                    -maximumReferenceGapLength 1000 \
                    -boundaryPrecision 100 \
                    -debug true \
                    --verbose true \
                    -run \
                    || exit 1
    
    #                             -runDirectory ${runDir} \
    #                             -L chr2L \
    
    
            module unload genomestrip/2.0
            module unload jre/1.7.0_25
            module unload sge
            uname -a
            ls -la
    
    
    
    # for i in chr2L chr2R chr3L chr3R chrX chr4
    #     do
    # done;
    
    # -intervalList ${intervals} \
    ## attmepting to specify only the useful chromosomes is met with invalid java sequence.
    ## now need to generate qc data, filter, merge with deletion results, phase with haplotyper caller results.
    ## http://www.broadinstitute.org/software/genomestrip/org_broadinstitute_sv_qscript_CNVDiscoveryPipeline.html#--intervalList
    ## http://www.sussex.ac.uk/lifesci/morrowlab/
    ## Sussex University computer cluster http://www.sussex.ac.uk/its/services/research/highperformance
    ## @wpgilks
    
    
        END=$(date +%s.%N)
        Duration=$(echo "$END - $START" | bc)
        echo "`date +%y/%m/%d_%H:%M:%S`"
        echo $Duration
        exit
    ###
    ###
    ###
    
  • bhandsakerbhandsaker Member, Broadie ✭✭✭✭

    Thanks for sharing the plots!

    The data looks a little noisy, but these are also quite short sites (depending on your sequencing depth).

    A couple of thoughts/suggestions:

    First, did you purposely disable parity correction with, e.g., -P depth.parityCorrectionThreshold:some-value for some reason? Some of your plots do not seem to be in HWE (e.g. mode copy number of 3).

    Second, it is possible that you have a few samples with a more erratic read depth signal. Not all samples perform equally well in read depth analysis. How many samples were filtered out in stage 5? Can you share the file from cnv_output/cnv_stage5/eval/VariantsPerSample.report.pdf ?

    Finally, the CNV pipeline outputs raw calls with some very light default filtering. More filtering is required to get a good call set. Two good strategies for filtering are to use either (a) length of CNV segment or (b) the GSCNQUAL annotation. GSCNQUAL is a quality score that does a pretty good job of tracking with false discovery rate, although I can't recommend a specific threshold. Higher values are better.

    As a point of comparison, on 25x-30x human data, we are finding we can call deletions with the CNV pipeline down to about 1kb - 1.5kb with a low (1-2%) FDR and duplications down to about 2.5kb-3kb with a low (3-5%) FDR. The raw calls will contain deletions down to 1kb and duplications down to 2kb by default.

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    Hi @bhandsaker

    I chose to plot a selection of interesting looking events, mostly duplications, so it's not an unbiased selection.

    I haven't purposely disabled parity correction. To be honest, I haven't heard of it. Hardy-Weinberg Equilibrium is expected to be broken because the flies have a mother from the inbred reference genome strain, but have different fathers from an outbred population.

    I don't know about the sample filtering in stage 5. I deleted the output from the last run, as it was only a test on one chromosome. I'll keep those reports in future. I don't remember seeing any sample exclusion but I do remember seeing elevated counts of singleton events in some of them.

    DNA from all flies was purified, prepared, sequenced and mapped using identical procedures. The depth of coverage varies really from 20-40X. I'm running all chromosomes now. Nothing's moved for 24hrs. I'm tempted to restart, doing a separate run for each chromosome. I think my multi-threading variables might be wrong as well.

    Filtering recommendations noted.

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    Hi @bhandsaker

    Using -intervalsList for the five major scaffolds/chromosomes, the CNV calling pipeline completed in about 3 days. I've attached the variants-per-sample pdf report from stage 5 evaluation. There seems to be a few samples with elevated numbers of CNVs, which I suppose are probably artefacts.

    Will.

Sign In or Register to comment.