Invalid intervals error

lucasllucasl Planet EarthMember
edited March 19 in GenomeSTRiP

Hi, I'm currently working with Genomestrip (version 3.7.GS-r1748-0-g74bfe0b) to do some structural variant calling on some BAMs that were aligned to hg38. Naturally, I used the resources available from the hg38 prerelease bundle for Genomestrip available here: ftp://ftp.broadinstitute.org/pub/svtoolkit/hg38_pre_release/

So far, whenever I use Genomestrip with hg38, I get several invalid interval errors. These appears to be legitimate intervals, so I'm not sure how Genomestrip is failing on these. Here are the intervals it fails on:

chr22:24317068-24322453
chrY:10022097-10027150
chr21:20211411-20216694

And here's a stack trace that's representative of these errors:

##### ERROR --
##### ERROR stack trace 
java.lang.RuntimeException: Invalid interval: chrY:16566323-16571422
        at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.<init>(ReadCountFileReader.java:415)
        at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCountDataIteratorInternal(ReadCountFileReader.java:313)
        at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCountDataIterator(ReadCountFileReader.java:282)
        at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCounts(ReadCountFileReader.java:257)
        at org.broadinstitute.sv.common.ReadCountCache.getReadCounts(ReadCountCache.java:100)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:283)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:240)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getReadCounts(GenotypingDepthModule.java:225)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getCnpReadCounts(GenotypingDepthModule.java:212)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:136)
        at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:57)
        at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnpInternal(GenotypingAlgorithm.java:149)
        at org.broadinstitute.sv.genotyping.GenotypingAlgorithm.genotypeCnp(GenotypingAlgorithm.java:114)
        at org.broadinstitute.sv.genotyping.SVGenotyperWalker.processVCFFile(SVGenotyperWalker.java:276)
        at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:220)
        at org.broadinstitute.sv.genotyping.SVGenotyperWalker.map(SVGenotyperWalker.java:57)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:106)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:133)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:87)
        at org.broadinstitute.sv.main.SVGenotyper.main(SVGenotyper.java:21)
##### ERROR ------------------------------------------------------------------------------------------

Any help would be greatly appreciated!

Post edited by shlee on

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @lucasl,

    java.lang.RuntimeException: Invalid interval: chrY:16566323-16571422

    Shows an interval outside the error intervals you list:

    chr22:24317068-24322453
    chrY:10022097-10027150
    chr21:20211411-20216694
    

    From the README you link:

    Note that use of the hg38 human genome reference and alt-aware alignments requires the latest version of Genome STRiP (2.00.1685 or later).

    Be sure to be using the correct version of the tool for how you aligned your reads to GRCh38. I will contact the GenomeStrip developer. In the meanwhile, it would be helpful to have the command you ran. Thanks.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @bhandsaker, can you please advise? Thanks.

  • bhandsakerbhandsaker Member, Broadie, Moderator

    FIrst of all, you should use the official reference bundle for hg38, which is here:
    ftp://ftp.broadinstitute.org/pub/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly38_12Oct2016.tar.gz

    I just deleted that pre-release version for hg38, to avoid future confusion.

    Beyond that, I would like to see the command you are running and also the VCF record that corresponds to that interval (it looks like you are running genotyping against a vcf).

  • lucasllucasl Planet EarthMember

    Hi, I tried using the official reference bundle, but I'm still getting the same error. I'm actually running Genomestrip on a BAM. My command line is:

    module load java/1.8.0_73
    module load jre
    module load pbs-drmaa
    cd /home/lochol/code/1kg-phase4
    ./genomestrip-run.sh HG00512 /home/lochol/projects/data/1kg-phase4/illumina/output/HG00512.bam /home/lochol/projects/data/1kg-phase4/illumina/alt-ref-3/output /home/lochol/projects/data/hg38/Homo_sapiens_assembly38 /home/lochol/projects/data/1kg-phase4/illumina/output/genomestrip
    

    So control passes to genomestrip-run.sh, which looks like this:

    #!/bin/sh
    # Note:
    #    $1 - file_id           # pop
    #    $2 - file_name         # pop.list
    #    $3 - data_dir, directory containing subject data
    #    $4 - ref_dir, directory containing reference data
    #    $5 - top_container, root folder containing all subfolders
    
    
    # ---------------- Start Time -----------------
    echo "START: " `date`
    echo ""
    
    
    # ------------- PATH Executables --------------
    source ~/.bashrc     # JPL
    which java > /dev/null || exit 1
    which Rscript > /dev/null || exit 1
    which samtools > /dev/null || exit 1
    
    # ------------- Executables Version -----------
    echo `java -version`
    echo `Rscript --version`
    echo `samtools --version`
    echo ""
    
    # ------------- Containers / Directory Structure -------------
    top_container=$5    # pop # $5
    file_id=$1             # $1
    runDir=$top_container/$file_id
    
    # ------------- Sample Data -------------
    data_dir=$3 # $3
    file_name=$2   # $2
    # bam=$data_dir/$file_name
    bam=$file_name
    genderMapFile=/home/lochol/projects/data/1kg-phase4/illumina/output/genomestrip/HG00512_gender_map.txt # Required
    
    # ------------- Reference Data -------------
    refDir=$4 # $4
    reference_genome=$refDir/Homo_sapiens_assembly38.fasta
    ploidyMapFile=$refDir/Homo_sapiens_assembly38.ploidymap.txt
    genomeMaskFile=$refDir/Homo_sapiens_assembly38.svmask.fasta # Required
    copyNumberMaskFile=$refDir/Homo_sapiens_assembly38.gcmask.fasta # Required
    
    # ------------- Output Data -------------
    sites=$runDir/$file_id.discovery.vcf
    genotypes=$runDir/$file_id.genotypes.vcf
    
    # ------------- Executables -------------
    # SV_DIR is the SVToolkit installation directory - it must be an exported environment variable.
    # SV_TMPDIR is a directory for writing temp files, which may be large if you have a large data set.
    #JPL export SV_DIR=`cd .. && pwd`
    ###export SV_DIR=$HOME/builds/svtoolkit        # NB: must be an ENV variable (i.e., export) JPL
    SV_TMPDIR=$runDir/tmpdir
    
    
    echo $SV_DIR    # JPL
    echo $SV_TMPDIR # JPL
    
    ###module load java/1.7.0   # JPL
    
    
    # ------------- Library & Class PATH -------------
    # For SVAltAlign, you must use the version of bwa compatible with Genome STRiP.
    ###export PATH=${SV_DIR}/bwa:${PATH}
    ###export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
    ###export LD_LIBRARY_PATH=${DRMAA_LIBRARY_PATH}:${LD_LIBRARY_PATH}
    
    mx="-Xmx4g"
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
    
    
    mkdir -p ${runDir}/logs || exit 1
    mkdir -p ${runDir}/metadata || exit 1
    
    cd $runDir      # $SV_DIR/installtest   # JPL
    echo "pwd=" `pwd`   # JPL
    
    
    # Display version information.
    java -cp ${classpath} ${mx} -jar ${SV_DIR}/lib/SVToolkit.jar
    
    
    # +--------------------------------------------------------------------+
    # | Note:                                                              |
    # |    { Tell Queue kind of jobRunner (e.g., PBS) and its arguments. } |
    # |                                                                    |
    # |   -gatkJobRunner PbsEngine \                                       |
    # |   -jobRunner PbsEngine \                                           |
    # |                                                                    |
    # |   -jobNative "-v SV_DIR=$SV_DIR" \                                 |
    # |   -jobNative "-v SV_TMPDIR=$SV_TMPDIR" \                           |
    # |   -jobNative "-v PATH=$PATH" \                                     |
    # |   -jobNative "-v LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \               |
    # |   -jobNative "-v classpath=$classpath" \                           |
    # |   -jobNative "-l nodes=1:ppn=6,walltime=344:00:00" \               |
    # |   -jobNative "-q test" \                                           |
    # |                                                                    |
    # | NOTE: Could all "-v ..." be replaced by single "-S /bin/bash"?     |
    # |       Yes, except SV_TMPDIR and classpath, which are defined       |
    # |       within this script.                                          |
    # +--------------------------------------------------------------------+
    
    
    # ----------------------------------------------------------------------------------------
    # --SVPreprocess
    # Preprocess a set of input BAM files to generate genome-wide metadata used by other
    # Genome STRiP modules. This is a pre-requisite for all other Genome STRiP pipelines.
    
    echo " ================== Preprocessing (`date`) =================="
    
    # -L /home/lochol/projects/data/grch38/canonical-intervals-test.list \
    
    # Run preprocessing.
    # For large scale use, you should use -reduceInsertSizeDistributions, but this is too slow for the installation test.
    # The method employed by -computeGCProfiles requires a GC mask and is currently only supported for human genomes.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVPreprocess.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        -gatkJobRunner PbsEngine \
        -jobRunner PbsEngine \
        --disableJobReport \
        -cp ${classpath} \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R ${reference_genome} \
        -genomeMaskFile ${genomeMaskFile} \
        -copyNumberMaskFile ${copyNumberMaskFile} \
        -genderMapFile ${genderMapFile} \
        -L chr22 \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -disableGATKTraversal \
        -useMultiStep \
        -reduceInsertSizeDistributions true \
        -bamFilesAreDisjoint true \
        -computeGCProfiles true \
        -computeReadCounts true \
        -jobLogDir ${runDir}/logs \
        -I ${bam} \
        -jobNative "-v SV_DIR=$SV_DIR" \
        -jobNative "-v SV_TMPDIR=$SV_TMPDIR" \
        -jobNative "-v PATH=$PATH" \
        -jobNative "-v LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
        -jobNative "-v classpath=$classpath" \
        -run \
        || exit 1
    
    
    # ----------------------------------------------------------------------------------------
    # --SVDiscovery
    # Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.
    
    echo " ================== Discovery (`date`) ======================"
    
    # Run discovery.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVDiscovery.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        -gatkJobRunner PbsEngine \
        -jobRunner PbsEngine \
        --disableJobReport \
        -cp ${classpath} \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R ${reference_genome} \
        -genomeMaskFile ${genomeMaskFile} \
        -genderMapFile ${genderMapFile} \
        -L chr22 \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -disableGATKTraversal \
        -jobLogDir ${runDir}/logs \
        -minimumSize 100 \
        -maximumSize 1000000 \
        -suppressVCFCommandLines \
        -P select.validateReadPairs:false \
        -I ${bam} \
        -O ${sites} \
        -jobNative "-v SV_DIR=$SV_DIR" \
        -jobNative "-v SV_TMPDIR=$SV_TMPDIR" \
        -jobNative "-v PATH=$PATH" \
        -jobNative "-v LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
        -jobNative "-v classpath=$classpath" \
        -run \
        || exit 1
    
    
    # ----------------------------------------------------------------------------------------
    # --SVGenotyper
    # Genotype a set of polymorphic structural variation loci described in an input VCF file.
    #
    # SVDiscovery *VS* SVGenotyper
    #     Discovery - list of potentially variant sites.
    #     Genotyping - the process of determining which genetic variants an individual possesses.
    
    
    echo " ================== Genotyper (`date`) ======================"
    
    # Run genotyping on the discovered sites.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVGenotyper.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        -gatkJobRunner PbsEngine \
        -jobRunner PbsEngine \
        --disableJobReport \
        -cp ${classpath} \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R ${reference_genome} \
        -genomeMaskFile ${genomeMaskFile} \
        -genderMapFile ${genderMapFile} \
        -L chr22 \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -disableGATKTraversal \
        -jobLogDir ${runDir}/logs \
        -I ${bam} \
        -vcf ${sites} \
        -O ${genotypes} \
        -jobNative "-v SV_DIR=$SV_DIR" \
        -jobNative "-v SV_TMPDIR=$SV_TMPDIR" \
        -jobNative "-v PATH=$PATH" \
        -jobNative "-v LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
        -jobNative "-v classpath=$classpath" \
        -run \
        || exit 1
    
    
    
    # ----------------------------------------------------------------------------------------
    # --CNVDiscovery
    # Run the Genome STRiP 2.0 pipeline for discovery and genotyping of CNVs (including
    # deletions, duplications and mCNVs), seeding on read depth.
    
    echo " ================== CNV Discovery (`date`) ======================"
    
    # Run CNV discovery pipline.
    java -showversion ${mx} -cp ${classpath} \
         org.broadinstitute.gatk.queue.QCommandLine \
         -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
         -S ${SV_DIR}/qscript/SVQScript.q \
         -cp ${classpath} \
         -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
         -gatkJobRunner PbsEngine \
         -jobRunner PbsEngine \
         -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
         -R ${reference_genome} \
         -I ${bam} \
         -genderMapFile ${genderMapFile} \
         -L chr22 \
         -md ${runDir}/metadata \
         -runDirectory ${runDir} \
         -jobLogDir ${runDir}/logs \
         -tilingWindowSize 5000 \
         -tilingWindowOverlap 2500 \
         -maximumReferenceGapLength 25000 \
         -boundaryPrecision 200 \
         -minimumRefinedLength 2500 \
         -jobNative "-v SV_DIR=$SV_DIR" \
         -jobNative "-v SV_TMPDIR=$SV_TMPDIR" \
         -jobNative "-v PATH=$PATH" \
         -jobNative "-v LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
         -jobNative "-v classpath=$classpath" \
         -run \
         || exit 1
    

    Again, I appreciate any help you can give me!

  • skashinskashin Member

    Hi @lucasl,

    It seems that you ran preprocessing only on chr22, so I’d look at the vcf file you are trying to genotype, to make sure that all the sites there are on chr22.

    Also, GenomeStrip is a multi-sample suite of software tools, so you will need to have a minimum of 20 to 30 genomes to get acceptable results when running discovery and genotyping.

    Seva

  • lucasllucasl Planet EarthMember

    OK, so I removed the line of code that was limiting the analysis to chr22, and also gave Genomestrip a list of all my BAMs, rather than a single BAM as I had done previously.

    What I found is that the subjobs started by Genomestrip in the PBS queue system were exceeding their walltime, so I took out the PBS engine options in my command line. Now, however, the program ends with something like:

    INFO  20:32:22,451 QCommandLine - Script failed: 53 Pend, 0 Run, 2 Fail, 18 Done
    

    I checked the logs and there's no sign of an error message, so I'm not sure what's going on here. I'm currently seeing if I can run Genomestrip on a queue with a longer maximum walltime. Aside from that, I'm not sure how to proceed from here.

Sign In or Register to comment.