FastaAlternateReferenceMaker realigns intervals in which order?

RP1RP1 Posts: 3Member

Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,421Administrator, GATK Developer admin

    Hi Ramya,

    The output sequence will always be returned in the same order, with the same contig names, as the master reference your data is aligned against. If you want that order to be different, you will need to reorder/rename contigs in your master reference first. Or modify your new reference after it is generated, of course.

    Geraldine Van der Auwera, PhD

  • RP1RP1 Posts: 3Member

    Thanks Geraldine. You are right that the sequence is always returned in the same order as the master reference. I figured that my problem was with overlapping intervals in the bed file. Overlapping intervals are NOT allowed and FastaAlternateReferenceMaker combines overlapping intervals into one while returning the output.I also observe something very peculiar. If there is a deletion overlapping with the boundaries of an interval, then it affects the sequence of the next interval.

    To clarify my point I am attaching some sample files in zipped version: (1) sample reference fasta file (sample_reference.fasta) and its indexes (sample_reference.fasta.fai, sample_reference.dict) (2) sample variant file (sample_variant.bcf) (3) two sample interval files (sample_interval.bed, sample_interval1.bed) and their corresponding output files (sample_interval.fasta, sample_interval1.fasta). The command I used is : java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -l ERROR -R sample_reference.fasta -L sample_interval.bed -o sample_interval.fasta --variant sample_variant.bcf

    Kindly pay attention to the second interval in sample_interval.bed file whose output depends on its preceding interval's boundary overlapping with a deletion. In sample_interval1.bed file, if I give only single interval, then the output sequence is correct.

    Thanks, Ramya

    zip
    zip
    GATK_Query.zip
    56K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,421Administrator, GATK Developer admin

    Well, overlapping intervals are allowed in the sense that they won't cause an error -- it's just that the engine will combine them before proceeding with the analysis. This is documented in the FAQ about intervals.

    Can you explain in a little more detail what is the issue you observe with the deletion affecting the next interval? I need you to tell me explicitly rather than show me the files at this stage, because unfortunately I don't have the time to go through multiple files trying to guess what I should be seeing.

    Geraldine Van der Auwera, PhD

  • RP1RP1 Posts: 3Member

    Dear Geraldine,

    There is an INDEL at position 110349 at which AAGGG in the reference is called as A in the alternate reference (file: sample_variant.bcf). When I give the intervals (file: sample_interval.bed) as

    Sbay_1 110346 110350 . target_1 Sbay_1 110466 110470 . target_2

    The output fasta sequences are TAA and T respectively (file: sample_interval.fasta).

    Whereas if I give just the second interval in a different file (file: sample_interval1.bed)

    Sbay_1 110466 110470 . target_2

    The output fasta sequence is TCAT (file: sample_interval1.fasta).

    This happens because in the first case the previous interval (target_1) spans an INDEL. However if the end of the previous interval (i.e. target_1) is >=110353 (i.e 110349 + length(AAGGG) - 1), the effect on subsequent interval's output vanishes.

    Thus, in case an interval partially spans an INDEL, the next interval's output gets affected.Correct me if I am wrong.

    Thanks, Ramya

  • MartinBMartinB Posts: 8Member

    Hello,

    I would just add that I am having a similar problem which could be solved if FastaAlternateReferenceMaker was given the name (target1, target2) as header of the fasta results file. Giving number as fasta header make it harder to find which sequences are overlapping (and which one are not processed ). By having the name of the interval in the output file it will make it easier to analyse.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,421Administrator, GATK Developer admin

    Hi Martin, that would be problematic for large sets of targets. It would be better to run some kind of validation on your targets first to check that there won't be any adverse interactions. Unfortunately we don't have the resources right now to devote to improving this tool; it is provided as-is.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.