Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

SVGenotyper walker

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,903Administrator, GATK Developer admin
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

The SVGenotyper walker traverses a VCF file to compute genotypes for structural variations. This walker is the main component of the SVGenotyper pipeline.

Currently, only genotyping of deletions relative to the reference is implemented.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -runDirectory <directory> : The directory where auxilliary output files will be written (default is the current directory).

  • -md <directory> : The metadata directory containing metadata about the input data set. See SVPreprocess.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence that the input BAM files were aligned against. The fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. See Genome Mask Files.

  • -configFile <configuration-file> : This file contains settings for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.

  • -sample <sample-ID> : The sample to gentoype (or list of samples if multiple arguments are supplied). By default, genotypes are computed for all samples present in the input BAM files.

  • -sampleList <file> : A file containing the list of samples to genotype (one sample ID per line).

  • -altAlleleAlignments <bam-file> : A BAM file containing alignments to the alternate alleles of events present in the input VCF file. These alternate alignments should be computed by the SVAltAlign pipeline.

  • -partitionName <string> : This specifies the name of the partition being computed during parallel runs. The output files will be prefixed with the name of the partition.

  • -partition <partition-spec> : Describes the subset of the VCF file to process. : The format is "records:N-M" where ''N'' and ''M'' are the 1-based indexes of a range of records from the input VCF file that will be processed.

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing genotypes for structural variation sites from the input VCF file.

Depending on settings in the configuration file, this walker will also produce a number of auxilliary output files. These files are mostly useful for debugging. The content and format of these files is subject to change.

4. Running

Currently, this walker needs to be invoked through a special wrapper around the GATK command line interface. This wrapper accepts all of the standard GATK command line options. An example is shown below.

The input VCF file should be passed as a GATK ROD (reference ordered datum) file. This walker also requires the -BTI argument to be passed to the GATK engine.

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVGenotyper \ 
    -T SVGenotyper \ 
    -configFile conf/genstrip_parameters.txt \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -altAlignments alt_allele_alignments.bam \ 
    -B:input,VCF input.sites.vcf \ 
    -BTI \ 
    -I input1.bam -I input2.bam \ 
    -O output.genotypes.vcf \ 
    -runDirectory run1

5. Dependencies

The SV Genotyping code uses some R scripts. R needs to be installed and the Rscript executable needs to be on your path to run this walker.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • santinosantino Posts: 20Member

    hi, I'm running SVGenotyper on a big data set. Some partition records take only less than 10 mins. Some of them take like 3 days and still running. If I kill one of them, the whole task fails. How can I limit the maximum running time for each job without fail the whole task? Thanks.

  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    For genotyping, the run time is generally proportional to the size of the site. Are you genotyping really large deletions?

    One thing you can do to smooth out the performance is to segregate the large sites into a separate VCF, and then run one record per partition (e.g. -parallelRecords 1) for the large sites.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member

    @bhandsaker said: For genotyping, the run time is generally proportional to the size of the site. Are you genotyping really large deletions?

    One thing you can do to smooth out the performance is to segregate the large sites into a separate VCF, and then run one record per partition (e.g. -parallelRecords 1) for the large sites.

    I'm using -parallelRecords 10, I thought it's already small enough. If I change to -parallelRecords 1, I have to rerun all of them. Is there any way to avoid this? Thanks.

  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    I'm surprised it is taking so long, unless they are really big sites. There are some performance enhancements coming that speed up genotyping quite a bit, but that doesn't help you right now.

    You can estimate the progress by looking at the *.gmm.dat file for each partition (one line per record that has completed).

    If you end up having to abandon some of the partitions, then you will need to do the merging step manually. The merge is just a simple concatenation of all of the intermediate files (which are all text or VCF) but not duplicating the headers. You will need to watch out for partial records (i.e. partial lines) in the terminated partitions (or skip those partitions altogether).

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member

    It's seems that the software runs into eternal loops. No output for days. All the output files' date are days ago. How can I check whether the software has bug? Does it have any heartbeat output?

  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    No heartbeat output, unfortunately.

    I would suggest two things: First, check the output so far. Feel free to send me the tail of a sample run. You can see which sites are being processed and how long each has taken like so (you can do this with partial output too):

    $ grep "Genotyping site:" SVGenotyper-1.log
    INFO  09:26:00,159 SVGenotyper - Genotyping site: CNP CNV_3_64078968_64084029 CNV 3 64078969 64078968 64084030 64084029 
    INFO  09:26:00,522 SVGenotyper - Genotyping site: CNP CNV_3_65876950_65882130 CNV 3 65876951 65876950 65882131 65882130 
    INFO  09:26:01,768 SVGenotyper - Genotyping site: CNP CNV_3_68009863_68015684 CNV 3 68009864 68009863 68015685 68015684

    Second, it might be helpful to get a thread dump from one of the stuck processes using kill -QUIT <pid>. This won't kill the process (despite the ominous name) but will cause the JVM to write a thread dump to stdout, which hopefully you have access to. Feel free to email me directly with the thread dump and the tail of the output from one of the stuck jobs.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member
    cat /home/list/work/iPSYCH/CNV/Genome_STRiP_test/20140207/test1/logs/SVGenotyper-43.out|grep "Genotyping site:"
    INFO  03:06:46,245 SVGenotyper - Genotyping site: CNP DEL_P0381_606 DEL GL000225.1 91274 91411 122645 122782
    INFO  03:28:25,245 SVGenotyper - Genotyping site: CNP DEL_P0381_607 DEL GL000225.1 91441 91440 128997 128996
    INFO  03:32:40,279 SVGenotyper - Genotyping site: CNP DEL_P0381_608 DEL GL000225.1 91625 91749 92310 92434
    INFO  03:36:00,134 SVGenotyper - Genotyping site: CNP DEL_P0381_609 DEL GL000225.1 91629 91796 91753 91920
    

    Hi @bhandsaker , could you please tell me what's the numbers of the last four columns means? I found that all eternal looped jobs have the same trait: the second number >= the third number.

    If this is the reason why all the jobs have eternal loop, then it must be a really stupid bug either in SVDiscovery or in SVGenotyper. Could you please tell me how to filter out these kind of sites from the vcf output of SVDiscovery? Thanks.

  • santinosantino Posts: 20Member
    edited March 4
    GL000225.1      91712   DEL_P0381_609   G   <DEL>       .       COVERAGE;DEPTH;DEPTHPVAL        CIEND=-84,84;CIPOS=-84,84;END=91836;GSCOHERENCE=-1.6410100758528952;GSCOHFN=-0.8205050379264476;GSCOHPVALUE=0.51;GSCOORDS=91574,91619,92097,92189;GSDEPTHCALLTHRESHOLD=5.920481287710887;GSDEPTHNOBSSAMPLES=1;GSDEPTHNTOTALSAMPLES=1;GSDEPTHOBSSAMPLES=210213;GSDEPTHPVALUE=0.264107;GSDEPTHPVALUECOUNTS=104,89255462,25315,24368370738;GSDEPTHRANKSUMPVALUE=NA;GSDEPTHRATIO=1.1216234965603349;GSDMAX=577;GSDMIN=1;GSDOPT=123;GSDSPAN=477;GSMEMBNPAIRS=2;GSMEMBNSAMPLES=1;GSMEMBOBSSAMPLES=210213;GSMEMBPVALUE=0.0025;GSMEMBSTATISTIC=516.8776367126542;GSNDEPTHCALLS=202;GSNPAIRS=2;GSNSAMPLES=1;GSOUTLEFT=0;GSOUTLIERS=0;GSOUTRIGHT=0;GSREADGROUPS=110613_I327_FCC01KCACXX_L5_HUMcdgRGJDIAAPEI-10_1.fq.gz_bwa,110613_I327_FCC01KCACXX_L6_HUMcdgRGJDIAAPEI-10_1.fq.gz_bwa;GSREADNAMES=FCC01KCACXX:5:2206:16841:65470#TAGCTTAT,FCC01KCACXX:6:2106:11104:118311#TAGCTTAT;GSSAMPLES=210213,210213;IMPRECISE;SVLEN=-123;SVTYPE=DEL
    

    The above line is the site in the vcf file. How is it reformed into the following line:

    INFO  03:36:00,134 SVGenotyper - Genotyping site: CNP DEL_P0381_609 DEL GL000225.1 91629 91796 91753 91920
    

    Thanks!

    Post edited by santino on
  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    Yes, maybe this is a bug. I will look into it.

    Those four numbers are the boundaries of the deletion with the confidence intervals (CIPOS, CIEND) applied (e.g. 91629 = 91712 -84 + 1).

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • skashinskashin Posts: 3Member

    santino, I am trying to reproduce your problem here at the Broad, but when I attempt to genotype the site DEL_P0381_609, it works fine for me. What version of SVToolkit are you using? You can run this to determine it: $ java -jar SVToolkit.jar --version Could you send me the config file and the script you are using to invoke SVGenotyper? Thanks

  • santinosantino Posts: 20Member
    edited March 5

    @skashin said: santino, I am trying to reproduce your problem here at the Broad, but when I attempt to genotype the site DEL_P0381_609, it works fine for me. What version of SVToolkit are you using? You can run this to determine it: $ java -jar SVToolkit.jar --version Could you send me the config file and the script you are using to invoke SVGenotyper? Thanks

    I really wondering how could you reproduce my problem without my data?

    My SVToolkit version is:

    SVToolkit version 1.04 (build 1336)
    Build date: 2014/01/12 16:48:22
    Web site: http://www.broadinstitute.org/software/genomestrip
    

    My config file is -configFile ${SV_DIR}/conf/genstrip_parameters.txt. I used the default config file, I didn't even touch it. My script is:

    java -cp ${classpath} ${mx} \
        org.broadinstitute.sting.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVGenotyper.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        -cp ${classpath} \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R /data/refseq/tooldata/GATKbundle/2.5/b37/human_g1k_v37.fasta \
        -genomeMaskFile ../raw_data/human_g1k_v37.mask.36.fasta \
        -genderMapFile ../raw_data/gender.map \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -jobLogDir ${runDir}/logs \
        -P input.platformMapFile:platform_map.txt \
        -I ${bam} \
        -vcf ${sites} \
        -O ${genotypes} \
        -jobRunner Drmaa \
        -jobProject CNV \
        -jobNative "-l mem=4gb,walltime=72:00:00,nodes=1:ppn=1 -A m128" \
        -parallelRecords 10 \
        -run 
    

    Hope this could help. Just ask me if you want my data or anything else.

    Post edited by santino on
  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    Well, we wanted to try to see if there was something about that locus or the VCF record or the fact that the problem intervals all seem to share this property that the confidence intervals overlap, as you pointed out.

    If you would, could you try to just genotype that one site in isolation? Just make a tiny one-record vcf and run it against your data. If you don't run it through Drmaa, then you should be able to get a better sense of whether it is in an infinite loop. You can also run Queue once in dry-run mode to get the command line and then invoke the java command directly (bypassing Queue altogether). It shouldn't take more than a few minutes to run.

    If you can reproduce it like this, there are other things we can do to try to narrow it down. For example, you can try -P genotyping.modules:depth (which disables read pair processing) to see if this makes the problem go away. If you are able to send us your data, the other thing might be to try to reproduce it with a subset of your bam files and/or a small slice of the data just within a few Kb of this locus.

    If we can figure out how to reproduce the problem, we're happy to fix it.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • santinosantino Posts: 20Member

    Hi all, thanks for the suggestions. I've tried to modify the error lines in discover.vcf like the following:

    if (POS+r1 >= END+l2) {
        mid=int((END-POS-1)/2);
        replace CIPOS=l1,r1 with CIPOS=l1,mid
        replace CIEND=l2,r2 with CIEND=-mid,r2
    }
    

    Then everything goes right. All jobs finished now.

  • bhandsakerbhandsaker Posts: 136Member, Third-party Developer ✭✭✭

    Thanks for letting us know.

    As noted above, we weren't able to reproduce this (on the 1000 Genomes data) using the same coordinates, so it doesn't seem like it is due solely to the overlapping confidence intervals.

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

Sign In or Register to comment.