Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

SVDiscovery walker

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

The SVDiscovery walker traverses a set of BAM files to perform structural variation discovery. This walker is the main component of the SVDiscovery pipeline.

Currently, only discovery of deletions relative to the reference is implemented.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -runDirectory <directory> : The directory where auxilliary output files will be written (default is the current directory).

  • -md <directory> : The metadata directory containing metadata about the input data set. See SVPreprocess.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. : See Genome Mask Files.

  • -configFile <configuration-file> : This file contains settings for specialized settings that do not normally need to be changed. : A default configuration file is provided in conf/genstrip_parameters.txt.

  • -partitionName <string> : This specifies the name of the partition being computed during parallel runs. : The output files will be prefixed with the name of the partition.

  • -searchLocus <interval> : The genomic locus being searched. : Only structural variations that fit within the specified locus will be output. If non-overlapping search loci are used, then the union of the discovered variants should be non-redundant.

  • -searchWindow <interval> : The interval to be used for searching the input BAM files. : This is typically larger than the search locus to avoid missing events due to boundary effects. : This argument should typically be set to the same value as the GATK -L argument.

  • -searchMinimumSize <size> : The minimum length of a deletion event for it to be included in the output.

  • -searchMaximumSize <size> : The maximum length of a deletion event for it to be included in the output.

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing descriptions of the variant sites along with annotations about the evidence for the variability of the site. : The output VCF file will need to be filtered, based on the annotations, to select a final set of high specificity variants.

Depending on settings in the configuration file, this walker will also produce a number of auxilliary output files. These files are mostly useful for debugging. The content and format of these files is subject to change.

4. Running

Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVDiscovery \ 
    -T SVDiscovery \ 
    -configFile conf/genstrip_parameters.txt \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -I input1.bam -I input2.bam \ 
    -O output.sites.vcf \ 
    -runDirectory run1 \ 
    -minimumSize 100 \
    -maximumSize 1000000 \ 
    -searchLocus chr20::1-1000000 \ 
    -L chr20:1-1000000 \
    -searchWindow chr20:1-1000000 

5. Dependencies

The SV Discovery code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • santinosantino Posts: 20Member

    I got this error when I running SVDiscovery. What does it mean? Thanks!

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.RuntimeException: Cannot get platform for read FCD03GDACXX:2:2102:1598:195682#GATCAGAT at org.broadinstitute.sv.util.ReadPairOrientation.getOrientation(ReadPairOrientation.java:99) at org.broadinstitute.sv.discovery.ReadPairRecordSelector.getReadPairOrientation(ReadPairRecordSelector.java:212) at org.broadinstitute.sv.discovery.ReadPairRecordSelector.selectReadPairRecord(ReadPairRecordSelector.java:110) at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.processRead(DeletionDiscoveryAlgorithm.java:164) at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.runTraversal(DeletionDiscoveryAlgorithm.java:149) at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:107) at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:42) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:123) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:125) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:79) at org.broadinstitute.sv.main.SVDiscovery.main(SVDiscovery.java:21)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-g5671483):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Cannot get platform for read FCD03GDACXX:2:2102:1598:195682#GATCAGAT
    ERROR ------------------------------------------------------------------------------------------
  • bhandsakerbhandsaker Posts: 140Member, Third-party Developer ✭✭✭

    What do your bam headers look like? Especially the PU tags on your @RG (read group) records?

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member
    edited February 17

    @bhandsaker said: What do your bam headers look like? Especially the PU tags on your RG (read group) records?

    It looks like this: samtools view -h input0001.bam 16:30000001-40000000 |less

    .
    .
    .
    @SQ     SN:GL000225.1   LN:211173
    @SQ     SN:GL000192.1   LN:547496
    @RG     ID:110611_I297_FCD03GDACXX_L1_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa    PU:illumina     LB:ipsych       SM:210099
    @RG     ID:110611_I297_FCD03GDACXX_L2_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa    PU:illumina     LB:ipsych       SM:210099
    @PG     ID:GATK IndelRealigner  VN:2.7-2-g6bda569       CL:knownAlleles=[(RodBinding name=knownAlleles source=/home/lescai/resources/GATKbundle/2.3/b37/dbsnp_137.b37.vcf), (RodBinding name=knownAlleles2 source=/home/lescai/resources/GATKbundle/2.3/b37/1000G_phase1.indels.b37.vcf)] targetIntervals=/project/GATKQUEUElescai/faroeWGS.210099.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
    @PG     ID:GATK PrintReads      VN:2.7-2-g6bda569       CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
    @PG     ID:MarkDuplicates       PN:MarkDuplicates       CL:net.sf.picard.sam.MarkDuplicates INPUT=[/project/GATKQUEUElescai/faroeWGS.210099.clean.bam] OUTPUT=/project/GATKQUEUElescai/faroeWGS.210099.clean.dedup.bam METRICS_FILE=/project/GATKQUEUElescai/faroeWGS.210099.metrics TMP_DIR=[/project/GATKQUEUElescai/tmp] VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
    @PG     ID:bwa  PN:bwa  VN:0.7.4-r385
    FCD03GDACXX:1:2201:7486:120317#GATCAGAT 163     16      2999992060      90M     =       30000294464     CTGTAGAACTTCACCGAACTTCTTGGTCCCACAGAGCGCCTGTTTGCGGGTTCCTTCCAGAGGGCTTAGTTTCCAGGCCCTTGGTCTCCT      =@->=CCBBCD:E;;;CCBDD+DDBB9?D>A9;7*8*7@@AACDCCC<D>/@AC:'555*,<."?DD<>AEE7CECC<+D=B>AAA@<9@      X0:i:1  X1:i:0  BD:Z:KKOJQQNNMNLMMINLMLLMKLKKMOMMMJMHHNKLOLLNMNIKCLNLKJMKKMMJKMMNKLNJNNJKMMLDLNNOOOPLOLOQQQKLNN MD:Z:63T26      PG:Z:MarkDuplicates     RG:Z:110611_I297_FCD03GDACXX_L1_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa  XG:i:0  BI:Z:MMQMSRPPOQNOOLOOPNNPMNNMOOQPOLPLLQNORNOPPQMNGOQNNLQNNOPMNOPQNOOLQQMOPROHOPQRPRQNROQQTSNLOP AM:i:37 NM:i:1  SM:i:37 XM:i:1  XO:i:0  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ MQ:i:60 XT:A:U
    FCD03GDACXX:1:1306:3596:91716#GATCAGAT  147     16      2999992560      90M     =       29999513-502    GAACTTCACCGAACTTCTTGGTCCCACAGAGCGCCTGTTTGCGGGTTCCTTCCAGAGGTCTTAGTTTCCAGGCCCTTGGTCTCCTCATCC      ==<==BA@:8-A@6E>EDEDDDDCE<DFDFB<)CAFCEDFD9DB>?CC@CB@A@@8A?CC@=DBCCC7BDCC@D>@ECBD4>&><9658;      X0:i:1  X1:i:0  BD:Z:LLNLOQLPMNMMNLLLLKNNMMJNIINJKNLLNNNHKBJNLKJMKKMNKKMNNJKMMMKKKLLKCLNOONOOKOLKNNOMNOQOQQMNKK MD:Z:90 PG:Z:MarkDuplicates     RG:Z:110611_I297_FCD03GDACXX_L1_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa  XG:i:0  BI:Z:NNQOQTOSPPPORPOOPNQPPPLOMMQNLQONQOQLNFMQOOLONNPOONPOQNLPOONONPPNFNPOQPPQLOOMPOPOPQQQSSPPMM AM:i:37 NM:i:0  SM:i:37 XM:i:0  XO:i:0  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ MQ:i:60 XT:A:U
    
    Post edited by santino on
  • bhandsakerbhandsaker Posts: 140Member, Third-party Developer ✭✭✭
    edited February 17

    I mistyped before: It's the PL tag that the software is looking for here. You seem to have PU:illumina, when what is expected is PL:illumina. PU is platform unit (i.e. a run identifier) whereas PL says what kind of sequencing (and we interpret this to mean what kind of library construction, in particular how normal read pairs should be oriented).

    Re-headering your bams is the right thing to do. But if you don't want to do this, another option is that you can run Genome STRiP with -P input.platformMapFile:platform_map.txt where platform_map.txt is a file you create, tab-delimited, with two columns: the first being the read group ID (e.g. 110611_I297_FCD03GDACXX_L1_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa) and the second column being what should have been in the PL tag. You need to include one line for every read group in your input data set.

    Post edited by bhandsaker on

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member

    Yea, that's pretty strange. I don't know why the files have PU:illumina instead of PL:illumina. I'll try the resolutions you mentioned. Thanks!

  • santinosantino Posts: 20Member
    edited February 17

    @bhandsaker BTW, I have one more question: it seems that the SVDiscovery doesn't support "ploidyMapFile" option. But the online tutorial (http://www.broadinstitute.org/software/genomestrip/sites/default/files/materials/GATKWorkshop_GenomeSTRiP_tutorial_July2013.pdf page 28) has the example command line with 'ploidyMapFile' option. Is the tutorial wrong?

    Post edited by santino on
  • bhandsakerbhandsaker Posts: 140Member, Third-party Developer ✭✭✭

    Yes, this is a bug in the tutorial. Thanks for catching it!

    Currently, SVDiscovery does not use the ploidy map, although other parts of Genome STRiP do. SVDiscovery is "ploidy-aware", but it is hard-coded for humans and assumes the sex chromsomes are named X/Y or chrX/chrY, and it doesn't understand the pseudo-autosomal regions. This is on our todo list.

    I will also put in a temporary fix so that the SVDiscovery Queue script accepts the ploidy map, but it will silently ignore it.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member

    @bhandsaker said: I mistyped before: It's the PL tag that the software is looking for here. You seem to have PU:illumina, when what is expected is PL:illumina. PU is platform unit (i.e. a run identifier) whereas PL says what kind of sequencing (and we interpret this to mean what kind of library construction, in particular how normal read pairs should be oriented).

    Re-headering your bams is the right thing to do. But if you don't want to do this, another option is that you can run Genome STRiP with -P input.platformMapFile:platform_map.txt where platform_map.txt is a file you create, tab-delimited, with two columns: the first being the read group ID (e.g. 110611_I297_FCD03GDACXX_L1_HUMcdgRDCDIAAPEI-9_1.fq.gz_bwa) and the second column being what should have been in the PL tag. You need to include one line for every read group in your input data set.

    Hi @bhandsaker, I meet the problem again when using SVGenotyper, seems this time -P input.platformMapFile:platform_map.txt doesn't work. Does it using different parameter or I have to re-head my bams? Thanks.

  • santinosantino Posts: 20Member
    edited February 24

    hello, @bhandsaker, could you help me?

    I meet the problem again when using SVGenotyper, seems this time -P input.platformMapFile:platform_map.txt doesn't work for SVGenotyper. It only works for SVDiscovery. Does SVGenotyper using different parameter or I have to re-head my bams? Thanks.

    Post edited by santino on
  • bhandsakerbhandsaker Posts: 140Member, Third-party Developer ✭✭✭

    You are right, that parameter isn't currently supported by SVGenotyper. I will file a bug report, but it may take a week or two to get the next release out because we're making some larger changes. In the meantime, the best option is to re-header your bams.

    As an alternative, you could disable read-pair genotyping, but this will entail a loss of genotyping accuracy. You could use either -P genotyping.modules:depth or if you have any assembled breakpoints -P genotyping.modules:depth,split.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member
  • iancianc Posts: 3Member

    Hi, I'm getting an error when running SVDiscover with this command

    java -Xmx8g -cp ${SV_DIR}/lib/gatk/Queue.jar:${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \ org.broadinstitute.sting.queue.QCommandLine \ -S /${GENOMESTRIP_PATH}/GenomeSTRiP/svtoolkit/qscript/SVDiscovery.q \ -S /${GENOMESTRIP_PATH}/GenomeSTRiP/svtoolkit/qscript/SVQScript.q \ -gatk /${GENOMESTRIP_PATH}/GenomeSTRiP/svtoolkit/lib/gatk/GenomeAnalysisTK.jar \ -cp ${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \ -configFile /${GENOMESTRIP_PATH}/GenomeSTRiP/svtoolkit/conf/genstrip_parameters.txt \ -tempDir / tmp/ \ -runDirectory run1 \ -md /master/ianc/align/MA_CNVs/metadata \ -R /${REF_PATH}/PlasmoDB-9.2_Pfalciparum3D7_Genome.fasta \ -genomeMaskFile /${REF_PATH}/GenomeSTRiP/PlasmoDB-9.2_Pfalciparum_Genome_mask.fasta \ -genderMapFile MA_gendermap.txt \ -ploidyMapFile MA_ploidymap.txt \ -copyNumberMaskFile MA_CN2.fasta \ -minimumSize 1 \ -maximumSize 1000000 \ -windowSize 10000000 \ -windowPadding 10000 \ -bamFilesAreDisjoint \ -I /First.bam -I /Second.bam -I /Third.bam \ -run

    The error is:

    INFO 11:51:15,284 HelpFormatter - Date/Time: 2014/06/18 11:51:15 INFO 11:51:15,285 HelpFormatter - ---------------------------------------------------------------------- INFO 11:51:15,285 HelpFormatter - ---------------------------------------------------------------------- INFO 11:51:15,297 QCommandLine - Scripting SVDiscovery INFO 11:51:15,418 QCommandLine - Done with errors

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    org.broadinstitute.sting.utils.exceptions.UserException$CannotExecuteQScript: Unable to execute QScript: SVDiscovery.script() threw the following exception: java.lang.NullPointerException at org.broadinstitute.sting.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:159) at org.broadinstitute.sting.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:147) at scala.collection.Iterator$class.foreach(Iterator.scala:772) at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:573) at scala.collection.IterableLike$class.foreach(IterableLike.scala:73) at scala.collection.JavaConversions$JListWrapper.foreach(JavaConversions.scala:615) at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:147) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62) at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala) Caused by: java.lang.NullPointerException at org.broadinstitute.sv.qscript.SVQScript.swapExt(SVQScript.q:409) at SVDiscovery.script(SVDiscovery.q:22) at org.broadinstitute.sting.queue.QCommandLine$$anonfun$execute$5.apply(QCommandLine.scala:156) ... 10 more

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-g5671483):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Unable to execute QScript: SVDiscovery.script() threw the following exception: java.lang.NullPointerException
    ERROR ------------------------------------------------------------------------------------------

    INFO 11:51:15,440 QCommandLine - Shutting down jobs. Please wait...

    Can you shed any light on whats happening here?

    Thanks

    Ian

  • iancianc Posts: 3Member

    I should mention I'm running this on a non-human, haploid genome also

  • bhandsakerbhandsaker Posts: 140Member, Third-party Developer ✭✭✭

    Looking at the stack trace, I think the problem is that -O output-file is missing. If that's the cause, we should clearly provide a better error message.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • iancianc Posts: 3Member

    Yup that fixed that problem, thanks for the help

Sign In or Register to comment.