The current GATK version is 3.3-0

GenerateAltAlleleFasta

edited September 2012

1. Introduction

The GenerateAltAlleleFasta utility processes a VCF file to extract the sequences of the alternate alleles.

For each structural variation record in the VCF, this utility will generate one output sequence in fasta format for each alternative allele that has precise breakpoints. The identifier for the alternate allele will be variantID_alleleNumber where alleleNumber is the number of the allele in the ALT column of the VCF file (the first ALT allele is allele 1).
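
The numbering rule can be illustrated with a small Python sketch. It is not part of the toolkit: the function name `alt_allele_ids` and the ALT values are hypothetical, and the sketch numbers every allele in the ALT column, whereas the utility itself only writes sequences for alleles with precise breakpoints.

```python
def alt_allele_ids(variant_id, alt_column):
    """Pair each ALT allele with an identifier of the form variantID_alleleNumber.

    Alleles are numbered from 1 in the order they appear in the
    comma-separated ALT column of the VCF record.
    """
    alleles = alt_column.split(",")
    return [(f"{variant_id}_{number}", allele)
            for number, allele in enumerate(alleles, start=1)]


# Hypothetical record with two alternate alleles.
ids = alt_allele_ids("P2_M_061510_20_81", "G,GTT")
for allele_id, allele in ids:
    print(allele_id, allele)
```

Because variant IDs can themselves contain underscores, the allele number should be recovered from the end of the identifier (for example with `rsplit("_", 1)`) rather than from the first underscore.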

The remainder of each fasta header line after the ID contains an encoded description of how the allele sequence maps back to the reference genome. The naming convention for the fasta sequences and the format of the rest of the header line is understood by other programs that use the alternate allele fasta file as input.

Here is an example of a generated fasta header:

>P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200|R:chr20:51913736-51913935;202-401|LENGTH:401


This example is for the first alternate allele of a variant with ID P2_M_061510_20_81. The length of the generated fasta sequence is 401 bases. Bases 1-200 of the alternate allele sequence align to chr20:51913435-51913634 of the reference sequence, and bases 202-401 align to chr20:51913736-51913935. Thus, this event represents a deletion of 101bp of the reference (chr20:51913635-51913735) with one base of non-template sequence (base 201) present in the alternate allele.
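
The arithmetic above can be checked with a short Python sketch that parses the example header. The field layout is inferred from this single example only (segments separated by `|`, each written as `label:chrom:refStart-refEnd;alleleStart-alleleEnd`, plus a `LENGTH` field); headers for other event types may be encoded differently, and the names `Segment` and `parse_header` are hypothetical.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    side: str       # label from the header, "L" or "R" in the example
    chrom: str
    ref_start: int  # inclusive reference coordinates
    ref_end: int
    alt_start: int  # inclusive positions within the allele sequence
    alt_end: int


# Assumed shape of one segment field, e.g. L:chr20:51913435-51913634;1-200
SEGMENT_RE = re.compile(r"^([A-Za-z]+):(.+):(\d+)-(\d+);(\d+)-(\d+)$")


def parse_header(line):
    """Return (variant_id, allele_number, segments, length) for one header line."""
    name, description = line.lstrip(">").split(None, 1)
    # The variant ID may contain underscores, so split on the last one only.
    variant_id, allele_number = name.rsplit("_", 1)
    segments, length = [], None
    for field in description.strip().split("|"):
        if field.startswith("LENGTH:"):
            length = int(field.split(":", 1)[1])
            continue
        match = SEGMENT_RE.match(field)
        if match is None:
            raise ValueError(f"unrecognized header field: {field!r}")
        side, chrom, r_start, r_end, a_start, a_end = match.groups()
        segments.append(Segment(side, chrom, int(r_start), int(r_end),
                                int(a_start), int(a_end)))
    return variant_id, int(allele_number), segments, length


header = (">P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200"
          "|R:chr20:51913736-51913935;202-401|LENGTH:401")
variant_id, allele, segments, length = parse_header(header)
left, right = segments
deleted = right.ref_start - left.ref_end - 1   # reference bases between the flanks
inserted = right.alt_start - left.alt_end - 1  # allele bases aligned to neither flank
print(variant_id, allele, length, deleted, inserted)
```

For the example header this reports 101 deleted reference bases and 1 non-template base, matching the description above.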

2. Inputs / Arguments

• -I <vcf-file> : The input VCF file.

• -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence. The fasta file must be indexed with 'samtools faidx' or the equivalent.

• -flankLength <N> : The number of reference bases to include around each alternate allele (default 200). The flank length is counted outside of any micro-homology around the breakpoints.

3. Outputs

• -O <fasta-file> : An output fasta file containing one entry for each alternative structural allele. The default is to write to stdout.
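
As a rough illustration of how these arguments fit together, the Python sketch below assembles a command line and checks for the reference index. The class name and classpath are taken from the invocation quoted in the replies further down this thread; the file names and the helper names `build_command` and `reference_is_indexed` are hypothetical, and the classpath will depend on where the toolkit is installed.

```python
import os
import shlex


def build_command(vcf, reference, output=None, flank_length=None,
                  classpath="lib/SVToolkit.jar:lib/gatk/GenomeAnalysisTK.jar"):
    """Return the argument list for a GenerateAltAlleleFasta run."""
    cmd = ["java", "-cp", classpath,
           "org.broadinstitute.sv.apps.GenerateAltAlleleFasta",
           "-I", vcf, "-R", reference]
    if output is not None:
        cmd += ["-O", output]          # omitted: the tool writes to stdout
    if flank_length is not None:
        cmd += ["-flankLength", str(flank_length)]  # omitted: default is 200
    return cmd


def reference_is_indexed(reference):
    """samtools faidx writes its index next to the fasta file as <fasta>.fai."""
    return os.path.exists(reference + ".fai")


# Hypothetical file names, for illustration only.
cmd = build_command("variants.vcf", "reference.fasta",
                    output="alt_alleles.fasta", flank_length=200)
print(shlex.join(cmd))
```

The list returned by `build_command` can be passed to `subprocess.run(cmd, check=True)` once `reference_is_indexed` confirms that the `.fai` file is present.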

Geraldine Van der Auwera, PhD

• Posts: 1Member

Hello, As I am sure you hear a lot, I am new to SNP calling using GATK; however, I have been rapidly learning from the resources on this site. Currently, I am working on a pipeline to process short read Illumina data and extract SNPs for phylogenetic analysis. I am also doing this within the Galaxy interface. I have been successful in generating a VCF file using UnifiedGenotyper from multiple BAM files. At this point, I would like to create a multiple alignment of all the SNPs by sample name in a FASTA format which I can take forward to phylogenetic analysis. I am having difficulty finding the proper tool to do this. I don't even know if this is something common that is being done. It appears that this tool may be what I am looking for. Does this sound like I am going in the right direction? Thank you for any assistance you are able to provide.

• Posts: 151Member, Third-party Developer ✭✭✭

No, I'm afraid this is not the tool you are looking for.

This is a specialized tool for structural variants. It generates a fasta file containing the non-reference alleles with some flanking reference sequence. Reads that were left unmapped because they could not be aligned across a structural variant junction can then be aligned to these sequences and captured. This utility is mostly useful for shorter reads, where a novel junction can cause the read to be unaligned.

Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

• Posts: 11Member

Hi, could you please tell me how to use GenerateAltAlleleFasta? I cannot find this module in the GenomeSTRIP package (svtoolkit_1.04.1228.tar.gz) or in the GATK module.

• Posts: 151Member, Third-party Developer ✭✭✭
edited November 2013

Hi, the program is still there. It is a java command line program; the full java class name is org.broadinstitute.sv.apps.GenerateAltAlleleFasta. Here is an invocation that generates a help message because the required -I and -R arguments are missing.

$ java -cp lib/SVToolkit.jar:lib/gatk/GenomeAnalysisTK.jar org.broadinstitute.sv.apps.GenerateAltAlleleFasta
---------------------------------------------------------------
---------------------------------------------------------------
---------------------------------------------------------------
usage: java -jar SVToolkit.jar -I <inputFile> -R <referenceSequence> [-args <arg_file>] [-O <outputFile>] [-flankLength <flankLength>] [-l <logging_level>] [-log <log_to_file>] [-h] [-version]

-I,--inputFile <inputFile>                   VCF input file
-R,--referenceSequence <referenceSequence>   Reference sequence fasta file
-args,--arg_file <arg_file>                  Reads arguments from the specified file
-O,--outputFile <outputFile>                 Output file (default stdout)
-flankLength,--flankLength <flankLength>     Flank length (default 200)
-l,--logging_level <logging_level>           Set the minimum level of logging, i.e. setting INFO get's you INFO up to
                                             FATAL, setting ERROR gets you ERROR and FATAL level logging.
-log,--log_to_file <log_to_file>             Set the logging location
-h,--help                                    Generate this help message
-version,--version                           Output version information