# FastaAlternateReferenceMaker realigns intervals in which order?

Hi,
I am using FastaAlternateReferenceMaker with a set of intervals ordered first by chromosome and then by start position. I have tried ordering the chromosomes alphabetically (chr1, chr10, chr11, ...) as well as numerically (chr1, chr2, chr3, ...), but the output FASTA sequences are not returned in the order listed in the interval file. The interval names (target_1, target_2, etc.) are also not used as FASTA headers in the output file, so I am stuck mapping the input intervals to the output FASTA sequences.
Thanks in advance for all the help,
Ramya

Hi Ramya,

The output sequence will always be returned in the same order, with the same contig names, as the master reference your data is aligned against. If you want that order to be different, you will need to reorder/rename contigs in your master reference first. Or modify your new reference after it is generated, of course.
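As a rough illustration of the point above, an interval list can be pre-sorted into the reference's contig order so that it lines up with the output. This is only a sketch: it assumes the contig order can be read from the first tab-separated column of the reference's `.fai` index, and the contig names, lengths, and helper functions (`contig_order_from_fai`, `sort_like_reference`) are hypothetical, not part of GATK.

```python
def contig_order_from_fai(fai_lines):
    """Return {contig_name: rank} in the order contigs appear in a .fai index."""
    order = {}
    for line in fai_lines:
        line = line.strip()
        if not line:
            continue
        name = line.split("\t")[0]  # first column of a .fai line is the contig name
        order.setdefault(name, len(order))
    return order


def sort_like_reference(intervals, order):
    """Sort (contig, start, end, name) tuples by reference contig rank, then by position."""
    return sorted(intervals, key=lambda iv: (order[iv[0]], iv[1], iv[2]))


# Hypothetical index lines; only the first column (contig name) is used here.
fai = [
    "chr2\t1000\t6\t60\t61",
    "chr10\t800\t1030\t60\t61",
    "chr1\t1200\t1850\t60\t61",
]
order = contig_order_from_fai(fai)

# Hypothetical intervals, listed in "numeric" chromosome order.
intervals = [
    ("chr1", 50, 90, "target_1"),
    ("chr10", 10, 40, "target_2"),
    ("chr2", 300, 350, "target_3"),
    ("chr2", 100, 150, "target_4"),
]

for iv in sort_like_reference(intervals, order):
    print(iv)
```

In this made-up reference the contigs are stored as chr2, chr10, chr1, so the intervals come out in that order regardless of how the interval file was sorted.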

Geraldine Van der Auwera, PhD

Thanks Geraldine. You are right that the sequence is always returned in the same order as the master reference. I figured out that my problem was with overlapping intervals in the BED file. Overlapping intervals are NOT allowed, and FastaAlternateReferenceMaker combines overlapping intervals into one when returning the output.

I also observe something very peculiar: if a deletion overlaps the boundary of an interval, it affects the sequence of the next interval.

To clarify my point, I am attaching some sample files in a zip archive: (1) a sample reference FASTA file (sample_reference.fasta) and its indexes (sample_reference.fasta.fai, sample_reference.dict); (2) a sample variant file (sample_variant.bcf); (3) two sample interval files (sample_interval.bed, sample_interval1.bed) and their corresponding output files (sample_interval.fasta, sample_interval1.fasta). The command I used is:

java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -l ERROR -R sample_reference.fasta -L sample_interval.bed -o sample_interval.fasta --variant sample_variant.bcf

Kindly pay attention to the second interval in sample_interval.bed, whose output depends on whether the preceding interval's boundary overlaps a deletion. With sample_interval1.bed, which contains only that single interval, the output sequence is correct.

Thanks,
Ramya

Well, overlapping intervals are allowed in the sense that they won't cause an error -- it's just that the engine will combine them before proceeding with the analysis. This is documented in the FAQ about intervals.
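To see which input targets end up folded into which combined interval, the merging step can be imitated outside the tool. The following is a minimal sketch of generic overlap merging, not the engine's actual implementation: it treats coordinates as inclusive, merges only strictly overlapping intervals, and uses hypothetical interval values and a hypothetical helper name (`merge_overlapping`).

```python
def merge_overlapping(intervals):
    """Merge overlapping (contig, start, end, name) intervals on the same contig.

    Coordinates are treated as inclusive (an assumption of this sketch).
    Returns (contig, start, end, [names]) so each merged record remembers
    which input intervals it came from.
    """
    merged = []
    # Sorting by contig name is only used to group intervals on the same contig.
    for contig, start, end, name in sorted(intervals, key=lambda iv: (iv[0], iv[1], iv[2])):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end), prev[3] + [name])
        else:
            merged.append((contig, start, end, [name]))
    return merged


# Hypothetical intervals: target_1 and target_2 overlap, target_3 stands alone.
example = [
    ("Sbay_1", 100, 200, "target_1"),
    ("Sbay_1", 150, 250, "target_2"),
    ("Sbay_1", 400, 450, "target_3"),
]

for record in merge_overlapping(example):
    print(record)
```

Three input intervals yield two merged records here, which is one way an output file can contain fewer sequences than the interval file has lines.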

Can you explain in a little more detail what is the issue you observe with the deletion affecting the next interval? I need you to tell me explicitly rather than show me the files at this stage, because unfortunately I don't have the time to go through multiple files trying to guess what I should be seeing.

Geraldine Van der Auwera, PhD

Dear Geraldine,

There is an INDEL at position 110349 at which AAGGG in the reference is called as A in the alternate reference (file: sample_variant.bcf). When I give the intervals (file: sample_interval.bed) as

Sbay_1 110346 110350 . target_1
Sbay_1 110466 110470 . target_2

The output fasta sequences are TAA and T respectively (file: sample_interval.fasta).

Whereas if I give just the second interval in a different file (file: sample_interval1.bed)

Sbay_1 110466 110470 . target_2

The output fasta sequence is TCAT (file: sample_interval1.fasta).

This happens because in the first case the previous interval (target_1) spans an INDEL. However, if the end of the previous interval (i.e. target_1) is >= 110353 (i.e. 110349 + length(AAGGG) - 1), the effect on the subsequent interval's output vanishes.

Thus, when an interval only partially spans an INDEL, the next interval's output is affected. Correct me if I am wrong.
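The arithmetic above can be written out as a small check for intervals that end inside a deletion's reference span. This is only a sketch of the reasoning, not of what the tool does internally: it treats interval coordinates as inclusive and in the same coordinate system as the variant position, which is a simplification (BED itself is 0-based and half-open), and the helper names (`deletion_ref_span`, `partially_spans`) are hypothetical.

```python
def deletion_ref_span(pos, ref):
    """Return the inclusive (first, last) reference positions covered by the REF allele."""
    return pos, pos + len(ref) - 1


def partially_spans(interval_start, interval_end, pos, ref):
    """True if the interval overlaps the REF allele's span but ends before the span does."""
    first, last = deletion_ref_span(pos, ref)
    overlaps = interval_start <= last and interval_end >= first
    return overlaps and interval_end < last


# Values taken from the post: REF AAGGG at position 110349.
print(deletion_ref_span(110349, "AAGGG"))
print(partially_spans(110346, 110350, 110349, "AAGGG"))  # target_1 as given
print(partially_spans(110346, 110353, 110349, "AAGGG"))  # target_1 extended to the end of the span
```

Flagging such intervals before running the tool, and extending them to the end of the REF allele's span, would be one way to test whether the effect on the next interval really disappears.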

Thanks,
Ramya

Hello,

I would just add that I am having a similar problem, which could be solved if FastaAlternateReferenceMaker used the interval names (target1, target2) as headers in the FASTA output file.
Using numbers as FASTA headers makes it harder to find which sequences overlap (and which ones are not processed).
Having the interval names in the output file would make the results easier to analyse.