SVGenotyper Queue script

Geraldine_VdAuwera (Administrator, GATK Developer)
edited September 2012 in GenomeSTRiP Documentation

Introduction

SVGenotyper.q is a sample Queue script that is part of Genome STRiP.

This script genotypes a set of input structural variation loci to determine
the structural alleles carried by each sample. The script takes as input a VCF
file of variant sites and a set of input BAM files that have previously been
run through the SVPreprocess pipeline to generate auxiliary
metadata. To use split reads in genotyping, you also need to run the input VCF
file through the SVAltAlign pipeline, which realigns all previously
unmapped reads to the alternate alleles specified in the VCF file.

The input VCF file can be the output of Genome STRiP SV discovery, or it can
be a VCF file of known or putative variants, or it can be the output from
another SV discovery algorithm.

Currently, only genotyping of deletions (relative to the reference sequence)
is supported. Although there is experimental code for genotyping other
categories of structural variation, this code is not ready for external use.

Inputs / Arguments

  • -vcf <input-vcf-file> : A VCF file containing descriptions of the
    structural variations to genotype.

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory in which to store computed
    metadata about the input data set.

  • -R <fasta-file> : Reference sequence. An indexed FASTA file containing
    the reference sequence that the input BAM files were aligned against. The
    FASTA file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of
    the reference sequence. See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A file that contains the expected
    gender for each sample. Tab-delimited file with a sample ID and gender on
    each line. Gender can be specified as M or F, or as 1 (male) or 2 (female).

  • -configFile <configuration-file> : This file contains values for
    specialized settings that do not normally need to be changed. A default
    configuration file is provided in conf/genstrip_parameters.txt.

  • -runDirectory <directory> : Directory in which to place output files and
    intermediate run files.

  • -altAlignments <bam-file> : A BAM file of alternate allele alignments
    produced by the SVAltAlign pipeline.

  • -parallelJobs <n> : Run using N parallel jobs by partitioning the input
    VCF file into N subsets.

  • -parallelRecords <n> : Run in parallel processing N VCF records in each
    parallel job.
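
The gender map format described above is simple enough to check before launching a run. The following is a minimal sketch, not part of Genome STRiP: the function name check_gender_map and the sample IDs are made up for illustration, and it accepts only the four values listed above (the tool itself may be more lenient).

```shell
# Hypothetical helper, not part of Genome STRiP: checks that every line of a
# gender map file has exactly two tab-separated fields and that the second
# field is one of M, F, 1, or 2. Prints each offending line and returns
# non-zero if any line fails the check.
check_gender_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 !~ /^[MF12]$/ {
            printf "line %d: invalid entry: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Write an example map with made-up sample IDs, then validate it.
map_file=$(mktemp)
printf 'SAMPLE_A\tM\nSAMPLE_B\tF\nSAMPLE_C\t2\n' > "$map_file"
check_gender_map "$map_file" && echo "gender map OK"
```

A file that passes this check could then be supplied with -genderMapFile.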

Outputs

  • -O <vcf-file> : The main output is a VCF file containing the input SV
    records plus genotypes for each sample in the input BAM files. The output
    VCF file will include genotype likelihoods for each sample at each variant
    site plus hard calls at a threshold of 95% confidence.
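
To make the relationship between genotype likelihoods and a 95% hard-call threshold concrete, here is a rough sketch. It is an illustration under stated assumptions, not SVGenotyper's actual calculation: it assumes three log10-scaled likelihoods for the genotypes 0/0, 0/1 and 1/1, assumes a flat prior, and uses "./." as its own convention for a genotype that does not reach the threshold. The function name hard_call is hypothetical.

```shell
# Hypothetical helper: takes log10 likelihoods for genotypes 0/0, 0/1 and 1/1
# plus an optional confidence threshold (default 0.95). Assumes a flat prior,
# so the posterior of the best genotype is its share of the total likelihood.
hard_call() {
    awk -v a="$1" -v b="$2" -v c="$3" -v t="${4:-0.95}" 'BEGIN {
        gt[1] = "0/0"; gt[2] = "0/1"; gt[3] = "1/1"
        l[1] = a + 0; l[2] = b + 0; l[3] = c + 0
        best = 1
        for (i = 2; i <= 3; i++)
            if (l[i] > l[best]) best = i
        # Rescale by the largest value before exponentiating to avoid underflow.
        for (i = 1; i <= 3; i++)
            sum += exp((l[i] - l[best]) * log(10))
        posterior = 1 / sum
        if (posterior >= t + 0) {
            print gt[best]
        } else {
            print "./."
        }
    }'
}

hard_call -0.001 -3 -10   # prints 0/0 (posterior about 0.999)
hard_call -0.3 -0.3 -5    # prints ./. (two genotypes are equally likely)
```

In the second call the best genotype has a posterior of about 0.5, so no hard call is made even though likelihoods are still available for downstream use.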

The SVGenotyper pipeline also produces a number of other intermediate output
files, useful mostly for debugging. The content of these files is not
documented and is subject to change. If the genome is processed in parallel,
there will be output from each parallel partition plus merged genome-wide
output.

Running

The SVGenotyper.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line
must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVGenotyper.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -runDirectory run1 \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -genderMapFile sample_genders.map \
    -I input1.bam -I input2.bam \
    -altAlignments altalign.bam \
    -vcf input.sites.vcf \
    -O output.genotypes.vcf \
    -parallelJobs 100 \
    -run \
    -bsub \
    -jobQueue lsf_queue_name \
    -jobProject lsf_project \
    -jobLogDir logs

Parallel Processing

The genotyping pipeline is designed to allow parallelism across many
processors. Parallelism is achieved by partitioning the input VCF file,
genotyping each subset of variants separately, and merging the results into
the final output VCF file.
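
The partition, genotype, merge pattern described above can be sketched in a few lines of shell. This is only an illustration of the idea: SVGenotyper does its own partitioning and merging internally and may choose partitions differently. The function names, the part file naming, and the tiny sites file are all made up. The chunk size plays a role similar to -parallelRecords.

```shell
# Hypothetical helpers for illustration only; not part of Genome STRiP.

# Split a VCF into files of at most $2 records each, repeating the header in
# every part. Prints the number of parts written.
partition_vcf() {
    awk -v size="$2" -v prefix="$3" '
        /^#/ { header = header $0 "\n"; next }
        {
            part = int(nrec / size) + 1
            out = sprintf("%s.P%04d.vcf", prefix, part)
            if (out != prev) {
                if (prev != "") close(prev)
                printf "%s", header > out
                prev = out
            }
            print > out
            nrec++
        }
        END { print part + 0 }
    ' "$1"
}

# Merge parts back into one VCF, keeping the header from the first file only.
merge_parts() {
    awk 'FNR == 1 { nfile++ }
         /^#/ { if (nfile == 1) print; next }
         { print }' "$@"
}

# Example with a tiny made-up sites file: 5 records, 2 records per part.
work=$(mktemp -d)
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    for i in 1 2 3 4 5; do
        printf 'chr1\t%d\tDEL%d\tN\t<DEL>\t.\t.\tEND=%d\n' \
            "$((i * 1000))" "$i" "$((i * 1000 + 500))"
    done
} > "$work/sites.vcf"

n_parts=$(partition_vcf "$work/sites.vcf" 2 "$work/part")
merge_parts "$work"/part.P*.vcf > "$work/merged.vcf"
echo "parts: $n_parts"
```

In a real run each part would be genotyped as a separate job between the two steps. Here the merge simply reproduces the original file, which shows that the partitioning loses no records.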

Overlapping Variants

Overlapping variants are currently genotyped independently. The posterior
genotype likelihoods for one event do not affect the posterior genotype
likelihoods for any other event, even if the events overlap and have
incompatible alleles.
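
Because overlapping events are not reconciled, it can be useful to know which records in the input VCF overlap before interpreting their genotypes. The sketch below is a hypothetical helper, not part of Genome STRiP. It assumes the VCF is sorted by chromosome and position and that each record carries an END= entry in its INFO column. It reports each record that starts before the furthest-reaching earlier record on the same chromosome ends, so it does not enumerate every overlapping pair.

```shell
# Hypothetical helper: prints "earlier_id<TAB>later_id" for each record that
# overlaps the furthest-reaching earlier record on the same chromosome.
# Assumes a position-sorted VCF with END= in the INFO column (column 8).
list_overlaps() {
    awk -F '\t' '
        /^#/ { next }
        {
            end = $2 + 0
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^END=/) end = substr(info[i], 5) + 0
            if ($1 != chrom) { chrom = $1; max_end = 0; max_id = "" }
            if (max_id != "" && $2 + 0 <= max_end) print max_id "\t" $3
            if (end > max_end) { max_end = end; max_id = $3 }
        }
    ' "$1"
}

# Example with made-up deletions: DEL1 and DEL2 overlap, the others do not.
sites=$(mktemp)
{
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\tDEL1\tN\t<DEL>\t.\t.\tEND=500\n'
    printf 'chr1\t400\tDEL2\tN\t<DEL>\t.\t.\tEND=600\n'
    printf 'chr1\t700\tDEL3\tN\t<DEL>\t.\t.\tEND=800\n'
    printf 'chr2\t100\tDEL4\tN\t<DEL>\t.\t.\tEND=200\n'
} > "$sites"
list_overlaps "$sites"
```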

Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP
pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the
    SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The Java classpath to use for pipeline commands. This
    must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp
    arguments are required in the example command. The first -cp argument is for
    the invocation of Queue itself, the second -cp argument is for the invocation
    of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.

Geraldine Van der Auwera, PhD