Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

The GenerateAltAlleleFasta utility processes a VCF file to extract the
sequences of the alternate alleles.

For each structural variation record in the VCF, this utility will generate
one output sequence in fasta format for each alternative allele that has
precise breakpoints. The identifier for the alternate allele will be
variantID_alleleNumber where alleleNumber is the number of the allele
in the ALT column of the VCF file (the first ALT allele is allele 1).

The remainder of each fasta header line after the ID contains an encoded
description of how the allele sequence maps back to the reference genome. The
naming convention for the fasta sequences and the format of the rest of the
header line is understood by other programs that use the alternate allele
fasta file as input.

Here is an example of a generated fasta header:

>P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200|R:chr20:51913736-51913935;202-401|LENGTH:401

This example is for the first alternate allele of a variant with ID P2_M_061510_20_81. The length of the generated fasta sequence is 401 bases.
Bases 1-200 of the alternate allele sequence align to chr20:51913435-51913634
of the reference sequence and bases 202-401 of the fasta sequence align to
chr20:51913736-51913935 of the reference sequence. Thus, this event
represents a deletion of 101bp of the reference (chr20:51913635-51913735) with
one base of non-template sequence (base 201) present in the alternate allele.
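
The arithmetic in this example can be checked with a short script. The following is a minimal Python sketch, not part of GenomeSTRiP: the function names are hypothetical, and the parser assumes every header follows the `LABEL:chrom:start-end;altStart-altEnd` pattern seen in the single example above, with fields separated by `|` and a trailing `LENGTH:` field. Other header layouts may exist and are not handled.

```python
import re

# Assumed field layout, inferred from the one example header in this article:
#   LABEL:chrom:refStart-refEnd;altStart-altEnd
FIELD_PATTERN = re.compile(r"(\w+):(.+):(\d+)-(\d+);(\d+)-(\d+)")


def parse_alt_allele_header(header):
    """Split a fasta header into (variant_id, allele_number, segments, length)."""
    name, _, description = header.lstrip(">").strip().partition(" ")
    # The ID is variantID_alleleNumber, so the allele number follows the last "_".
    variant_id, _, allele_number = name.rpartition("_")
    segments = []
    length = None
    for field in description.split("|"):
        if field.startswith("LENGTH:"):
            length = int(field.split(":", 1)[1])
            continue
        match = FIELD_PATTERN.fullmatch(field)
        if match is None:
            raise ValueError(f"unrecognized header field: {field!r}")
        label, chrom, ref_start, ref_end, alt_start, alt_end = match.groups()
        segments.append({
            "label": label,
            "chrom": chrom,
            "ref_start": int(ref_start),
            "ref_end": int(ref_end),
            "alt_start": int(alt_start),
            "alt_end": int(alt_end),
        })
    return variant_id, int(allele_number), segments, length


def describe_gap(left, right):
    """Return (deleted_ref_interval, deleted_length, inserted_bases) between two flanks."""
    deleted = (left["ref_end"] + 1, right["ref_start"] - 1)
    deleted_length = deleted[1] - deleted[0] + 1
    # Bases of the alternate allele that lie between the two aligned flanks.
    inserted_bases = right["alt_start"] - left["alt_end"] - 1
    return deleted, deleted_length, inserted_bases


header = (">P2_M_061510_20_81_1 L:chr20:51913435-51913634;1-200"
          "|R:chr20:51913736-51913935;202-401|LENGTH:401")
variant_id, allele_number, segments, length = parse_alt_allele_header(header)
deleted, deleted_length, inserted_bases = describe_gap(segments[0], segments[1])
print(variant_id, allele_number, length, deleted, deleted_length, inserted_bases)
```

For the example header this reproduces the figures quoted above: a 101bp deletion spanning 51913635-51913735 with one inserted base.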

2. Inputs / Arguments

  • -I <vcf-file> : The input VCF file.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing
    the reference sequence. The fasta file must be indexed with 'samtools faidx'
    or the equivalent.

  • -flankLength <N> : The number of reference bases to include around each
    alternate allele (default 200). The flank length is counted outside of any
    micro-homology around the breakpoints.

3. Outputs

  • -O <fasta-file> : An output fasta file containing one entry for each
    alternative structural allele. The default is to write to stdout.
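
As a rough illustration of how these arguments fit together, here is a small Python sketch that assembles a command line. The class name and classpath are the ones quoted in the discussion thread on this page; the helper function and the file names (`calls.vcf`, `ref.fa`, `alt_alleles.fa`) are hypothetical, and the jar locations will depend on your installation. The sketch only builds the argument list and does not run anything.

```python
# Classpath and class name as quoted in the thread below; adjust to your install.
DEFAULT_CLASSPATH = "lib/SVToolkit.jar:lib/gatk/GenomeAnalysisTK.jar"
MAIN_CLASS = "org.broadinstitute.sv.apps.GenerateAltAlleleFasta"


def build_command(vcf, reference, output=None, flank_length=None,
                  classpath=DEFAULT_CLASSPATH):
    """Assemble the argument list for a GenerateAltAlleleFasta run (hypothetical helper)."""
    command = ["java", "-cp", classpath, MAIN_CLASS, "-I", vcf, "-R", reference]
    if output is not None:
        # Without -O the program writes the fasta to stdout.
        command += ["-O", output]
    if flank_length is not None:
        # Without -flankLength the program uses its default of 200.
        command += ["-flankLength", str(flank_length)]
    return command


# Hypothetical file names, for illustration only.
command = build_command("calls.vcf", "ref.fa", output="alt_alleles.fa", flank_length=100)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` once the paths point at real files and an indexed reference.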


  • EpiDemos82 (Member)

    As I am sure you hear a lot, I am new to SNP calling using GATK; however, I have been rapidly learning from the resources on this site. Currently, I am working on a pipeline to process short read Illumina data and extract SNPs for phylogenetic analysis. I am also doing this within the Galaxy interface. I have been successful in generating a VCF file using UnifiedGenotyper from multiple BAM files. At this point, I would like to create a multiple alignment of all the SNPs by sample name in a FASTA format which I can take forward to phylogenetic analysis. I am having difficulty finding the proper tool to do this. I don't even know if this is something common that is being done. It appears that this tool may be what I am looking for. Does this sound like I am going in the right direction?
    Thank you for any assistance you are able to provide.

  • bhandsaker (Member, Broadie)

    No, I'm afraid this is not the tool you are looking for.

    This is a specialized tool for structural variants. It generates a fasta file containing non-reference alleles with some flanking reference sequence; you can align against that file to capture reads that were left unmapped because they could not be aligned across a structural variant junction. This utility is mostly useful for shorter reads, where a novel junction can cause the read to be unaligned.

  • Hi, could you please tell me how to use GenerateAltAlleleFasta? I cannot find this module in the GenomeSTRiP package (svtoolkit_1.04.1228.tar.gz) or in the GATK module.

  • bhandsaker (Member, Broadie)
    edited November 2013

    Hi, the program is still there.
    It is a Java command-line program; the fully qualified class name is org.broadinstitute.sv.apps.GenerateAltAlleleFasta.
    Here is an invocation that generates a help message (because the required -I and -R arguments are missing).

    $ java -cp lib/SVToolkit.jar:lib/gatk/GenomeAnalysisTK.jar org.broadinstitute.sv.apps.GenerateAltAlleleFasta
    Program Name: org.broadinstitute.sv.apps.GenerateAltAlleleFasta
    usage: java -jar SVToolkit.jar -I <inputFile> -R <referenceSequence> [-args <arg_file>] [-O <outputFile>] [-flankLength 
           <flankLength>] [-l <logging_level>] [-log <log_to_file>] [-h] [-version]
     -I,--inputFile <inputFile>                   VCF input file
     -R,--referenceSequence <referenceSequence>   Reference sequence fasta file
     -args,--arg_file <arg_file>                  Reads arguments from the specified file
     -O,--outputFile <outputFile>                 Output file (default stdout)
     -flankLength,--flankLength <flankLength>     Flank length (default 200)
     -l,--logging_level <logging_level>           Set the minimum level of logging, i.e. setting INFO get's you INFO up to 
                                                  FATAL, setting ERROR gets you ERROR and FATAL level logging.
     -log,--log_to_file <log_to_file>             Set the logging location
     -h,--help                                    Generate this help message
     -version,--version                           Output version information
    Exception in thread "main" org.broadinstitute.sting.commandline.MissingArgumentException: 
    Argument with name '--inputFile' (-I) is missing.
    Argument with name '--referenceSequence' (-R) is missing.

    If you have more specific questions that aren't answered on this page, let me know.
    Note that this program is only useful if you have deletions with exact breakpoints.
    Moreover, with modern sequencing (reads 70bp and up), there is not much utility in using unaligned reads in genotyping, which is what this program is used for.

  • jglessner, Boston, MA (Member)

    What is the specific input VCF requirement for GenerateAltAlleleFasta?
    I used TIGRA-ext.pl to run TIGRA, BWA alignment, infer_sv.pl, and overlap with putative GenomeSTRiP calls. The output VCF looks like this (I added REF using samtools faidx as the "." in the original file threw an error):

    chr1    148599842   N   G   <DEL>   40  PASS    PRECISE;CIPOS=-10,10;CIEND=-10,10;SOMATIC;SVTYPE=DEL;CHR2=chr1;END=149222367;SVLEN=-622525;MICRO=NA;TEMPLATE=NA;NONTEMPLATE=NA; GT  ./. chr1^148600016^chr1^149222471^ITX^622455^+-.Contig17.

    Currently I am generating an empty .fa output from GenerateAltAlleleFasta called by SVAltAlign-1.

  • bhandsaker (Member, Broadie)

    I recommend you don't try to use GenerateAltAlleleFasta any longer, unless you have some special problem (like very short reads). This was more important when reads were short (36bp or 50bp) and many reads could not be partially aligned across breakpoint junctions.

    If you do need to use it for some reason, you need to have precise breakpoints (i.e. exact base sequences for both the ref and alt alleles, instead of "G" and "<DEL>").

    Your VCF also looks like it is malformed and has an extra field:

  • jglessner, Boston, MA (Member)

    I got it to work. REF G and ALT <DEL> does not work. You need the REF to be the full sequence impacted by the deletion and ALT to be the first nucleotide of that sequence. To get the full sequence, run samtools faidx ref.fa on the CHROM:POS-END values. The IDs must be unique (for example, the contig string) rather than N. The VCF must also be sorted using vcf-tools. The developer of TIGRA-ext.pl fixed the extra field problem by adding the contig string to the INFO field.
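
The REF/ALT recipe in the post above can be sketched in a few lines of Python. This is only an illustration of that recipe, not code from GenomeSTRiP or TIGRA: the function name is hypothetical, and `fetch` stands in for whatever returns the reference bases for a 1-based inclusive CHROM:POS-END region (for example, the output of `samtools faidx`). A toy in-memory reference is used here so the example runs on its own.

```python
def make_precise_deletion(chrom, pos, end, fetch):
    """Return (REF, ALT) for a deletion spanning chrom:pos-end (1-based, inclusive).

    REF is the full reference sequence affected by the deletion and ALT is its
    first base, following the recipe described in the post above.
    """
    ref = fetch(chrom, pos, end).upper()
    if len(ref) != end - pos + 1:
        raise ValueError("fetched sequence length does not match POS-END span")
    alt = ref[0]
    return ref, alt


# Toy reference standing in for an indexed fasta file (assumption for the demo).
TOY_REFERENCE = {"chr1": "ACGTACGTAC"}


def toy_fetch(chrom, start, end):
    """Fetch a 1-based inclusive region from the toy reference."""
    return TOY_REFERENCE[chrom][start - 1:end]


ref, alt = make_precise_deletion("chr1", 3, 6, toy_fetch)
print(ref, alt)
```

With a real reference, `toy_fetch` would be replaced by a lookup against the indexed fasta file, and each record would also need a unique ID and a sorted VCF as noted in the post.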

  • Stooff (Member)

    Hi Geraldine,
    I understand that this tool outputs the alternate allele as an alternative to the reference fasta file. However, I am wondering about the diploid case where a site is homozygous for an allele that differs from the reference. How can we recover the "non-Alternate" fasta when it differs from the reference?
    reference fasta : ATT
    "non-Alternate fasta" : ACT
    Alternate fasta : ACG
    In that case I guess the VCF would show no SNP for the first nucleotide, 1/1 for the second and 0/1 for the third. The alternate fasta would contain the alternate alleles, but how do I get the "non-Alternate fasta"?
    To explain: I am trying to estimate the site allele frequency spectrum for my population, and I would lose the 1/1 information if I used the reference as the second fasta sequence of my diploid.
    Please help me,
    I might not sound really clear so don't hesitate to ask me for more details,

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @Stooff Although I posted the content of these articles, I'm not actually the one who supports GenomeSTRiP -- that honor goes to Bob Handsaker @bhandsaker, who may be able to help you.

  • bhandsaker (Member, Broadie)

    This is probably not the tool you are looking for. It was developed as part of a tool chain for analyzing structural variation and was used for older technology with very short Illumina reads (36 to 50bp) to remap completely unaligned reads that crossed structural variation boundaries. This tool chain is now deprecated. With modern technology and reads > 70bp, it is not useful.
