Service note: Geraldine is on vacation this week; other members of GSA will be responding to questions, but they have a lot of work besides this, so be aware that responses may be a little slower than usual. Thank you for your patience.

Local Realignment around Indels

delangeldelangel Posts: 61GSA Official Member mod

Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.


Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.


Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <lane-level.bam> \
  -R <ref.fasta> \
  -T IndelRealigner \
  -targetIntervals <intervalListFromStep1Above.intervals> \
  -o <realignedBam.bam> \
  -known /path/to/indel_calls.vcf
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
Post edited by Geraldine_VdAuwera on

Comments

  • aeonsimaeonsim Posts: 24Member

    Hi

    Is there any way to run the IndelRealigner on BAM/SAM files where the multiple possible alignments are reported for a read, rather than the Top Random alignment that BWA uses?

    As the aligner we are using generates BAMs which report multiple possible alignments for reads, that have no unique best alignment. ie if a Read has three equal likely alignments (a repeat for example) the aligner will report all three and flag them as non unique.

    However if we try to feed these BAMs into GATK IndelRealigner we get an error due to these duplicate reads.

    ERROR MESSAGE: Error caching SAM record HWI-ST196R_0530:2:2204:3691:99597#0/1, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.

    Is there any way around this, an on the fly filtering option we can add to GATK to discard these reads, or get it to work properly with them?

    Thanks

    Post edited by aeonsim on
  • ebanksebanks Posts: 475GSA Official Member mod

    Hi there, I'm pretty sure we already have a read filter that will work for you: try the NotPrimaryAlignmentFilter ('-rf NotPrimaryAlignment'). See the Read Filters section for more details. And if this doesn't work for you, it should be extremely easy for you to create a filter modeled after this one that will work. Good luck!

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • aeonsimaeonsim Posts: 24Member

    That seems to have worked, thanks! Might be worth extending your error message to mention it as an option.

  • virenpatelvirenpatel Posts: 6Member

    Hello. I am trying to run the Indel Realigner but the message I get is:

    ##### ERROR MESSAGE: File associated with name s_1.bwt2.sorted.wRG.recal.interval_list is malformed: Interval file could not be parsed in any supported format. caused by Failed to parse Genome Location string: chr1:58879-59946:95-193
    

    The command line I am using is:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -T reference.fasta -I input.bam -o output.bam -targetIntervals forIndelRealigner.interval_list -known reference_snps.vcf --validation_strictness LENIENT
    

    There are no errors from the previous step of RealignerTargetCreator. What am I doing wrong?

    Post edited by Geraldine_VdAuwera on
  • virenpatelvirenpatel Posts: 6Member

    Sorry. Above command has -R reference.fasta, not -T as shown.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    The error message says it all: your interval list is malformed. Look closely at the Genome Loc string at the end of the error: chr1:58879-59946:95-193 and ask yourself, what is this supposed to represent, and what's wrong about it?

    Geraldine Van der Auwera, PhD

  • virenpatelvirenpatel Posts: 6Member

    I am sorry but I don't know what's wrong other than it's obviously not a valid genomic coordinate in the usual sense. Since this was output by RealignerTargetCreator I assumed it would be correctly interpreted by IndelRealigner. I have a contig in my reference that has coordinates chr1:58879-59946 but I do not know what the 95-193 refers to.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Alright, well the 95-193 part shouldn't be there at all, it's not part of the normal makeup of a genome loc, so it seems the interval list was not created correctly by the RTC. Please realize that a more accurate description of your problem would have been "I just ran RealignerTargetCreator and it produced an interval list that looks like it's incorrectly formatted". Technically this is a problem that should be posted under RealignerTargetCreator, not IndelRealigner...

    Can you check (by using -XL to exclude the problem location) whether this only occurs a single time in your intervals file, or whether there are other instances?

    Another helpful test would be to run your original RTC command again to see if it produces the same malformed locus a second time. This could have been due to a one-time hiccup, so let's check for that before going too deep.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 475GSA Official Member mod

    I agree with Geraldine that it looks like a one time blip with your file system.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • virenpatelvirenpatel Posts: 6Member

    The whole interval list file looks the same. Some contigs have multiple entries, e.g:

    chr1:239794835-239795056:6-63
    chr1:239794835-239795056:142
    chr1:239794835-239795056:218-219
    

    I did rerun the RTC command and the output was the same.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    This is weird -- can you please post your full command line for the RealignerTargetCreator step?

    Geraldine Van der Auwera, PhD

  • virenpatelvirenpatel Posts: 6Member
    java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I input.bam -o forIndelRealigner.interval_list --known reference_snps.vcf --validation_strictness LENIENT
    
    Post edited by Geraldine_VdAuwera on
  • virenpatelvirenpatel Posts: 6Member

    I have attached the output of RealignerTargetCreator.

    log
    log
    RealignerTargetCreator.log
    54K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    What happens if you run that command without the --validation_strictness flag?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Also, can you tell me what reference you are using? Is it one of the standard ones from our bundle or is it something else? It looks like the locations are formatted in a way that's not compatible with our tools. If you upload your reference .idx file we may be able to verify if that's where the problem is.

    Geraldine Van der Auwera, PhD

  • hovelsonhovelson Posts: 15Member

    Here is a problem related to realigning around know indel sites that I cannot solve:

    After realigning around known indel sites (using the 1000G phase1 vcf file file from your bundle, as well as different copy of the same 1000G phase1 known indels vcf file), I notice some strange results when looking at plots of empirical and observed base-level phred scores pre- and post-indel-realignment. The attached PDF file shows side-by-side empirical v. reported phred score plots generated from both a raw bam file and a bam file after local indel realignment. These bam files are generated from the same sample, but you can see that after indel realignment there is a substantial deflation of empirical phred scores compared to pre-realignment results.

    My question is: why do we see such results? The trend appears similar (pre- and post-realignment), but the empirical scores appear capped at ~25. I'm still attempting to better understand what might be happening here, but something about the way I am running indel realignment appears to also be impacting downstream base quality re-calibration results as well (similar, strange empirical vs. reported plot results post-recalibration). If I exclude the indel realignment step, the base quality re-calibration appears to work successfully (a much tighter adherence to the expected empirical v. reported line post-recalibration).

    The code I used to carry out the two-stage indel realignment step is as follows:

    /usr/bin/java -Xmx2g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -known /net/fantasia/home/gjun/data/indel/1000G_phase1.indels.b37.vcf

    /usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    samtools index /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    I've spent a lot of time trying to troubleshoot this to no avail. I hope I'm just missing something simple, but am very eager to see whether you folks have any thoughts about what might be going wrong!

    Best, Dan

    pdf
    pdf
    indel_realignment_neava.20121009.pdf
    100K
    Post edited by hovelson on
  • ebanksebanks Posts: 475GSA Official Member mod

    Hi Dan,

    That isn't the behavior I would have expected. A few thoughts: 1. What do the AnalyzeCovariates (since you are using an older version of the code) plots look like? Can you confirm that they show the same behavior? 2. What if you were to remove the -L argument in the IndelRealigner command? Does the plot for the realigned bam look better?

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    Thanks very much for your reply, Eric.

    I assume by AnalyzeCovariates plots you mean plots after base quality recalibration (is this correct?). If I run recalibration after local indel realignment, I get the very odd plot you see on the lefthand-side of the attached PDF doc. If I run recalibration without any local indel realignment, I get the more sensible plot you see on the right side of the attached doc.

    I will re-run local realignment without supplying the target region file and let you know what happens shortly.

    pdf
    pdf
    indel_realign_AnalyzeCovariates.pdf
    87K
    Post edited by hovelson on
  • ebanksebanks Posts: 475GSA Official Member mod

    No, this isn't right. I don't want you to generate these plots with your own tools, but rather I want to see what they look like with our plotting tools. Perhaps you should upgrade to the latest release of the GATK and use BaseRecalibrator which is documented (and supported).

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    I'll work to generate these plots with your tools. I am a little unclear, though: how will proceeding with base quality recalibration help us sort out the unexpected results from the local indel realignment?

    Thanks, Dan

  • hovelsonhovelson Posts: 15Member

    For what it's worth, our tools do still show the same deflation in empirical phred score values for bam files created from local realignment when removing the -L target file option.

  • ebanksebanks Posts: 475GSA Official Member mod

    I just want to confirm that the problem is not with your plotting tools.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    Ok, thanks Eric! I've updated to the latest release of GATK, and am currently re-running indel realignment and base quality recalibration on this sample. I'll generate the QQ plots using your tools, and let you know what I find.

  • ebanksebanks Posts: 475GSA Official Member mod

    You shouldn't need to rerun the re-alignment - just the recalibration.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    So (using the latest GATK release) the target interval creation step (for local realignment) ran smoothly. However, when running IndelRealigner I receive a stack trace error that is as follows:

    INFO  12:07:59,775 ArgumentTypeDescriptor - Dynamically determined type of /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed to be BED
    INFO  12:07:59,811 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,811 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g0f021e6, Compiled 2012/10/10 19:42:23
    INFO  12:07:59,811 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO  12:07:59,812 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO  12:07:59,812 HelpFormatter - Program Args: -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam
    INFO  12:07:59,813 HelpFormatter - Date/Time: 2012/10/11 12:07:59
    INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,892 GenomeAnalysisEngine - Strictness is SILENT
    INFO  12:07:59,961 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO  12:07:59,982 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
    INFO  12:08:01,566 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
    INFO  12:08:01,567 TraversalEngine -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
    INFO  12:08:02,640 GATKRunReport - Uploaded run statistics report to AWS S3
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace
    java.lang.ArrayIndexOutOfBoundsException: -95
            at org.broadinstitute.sting.utils.BaseUtils.simpleBaseToBaseIndex(BaseUtils.java:201)
            at org.broadinstitute.sting.utils.BaseUtils.isRegularBase(BaseUtils.java:235)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.mismatchQualitySumIgnoreCigar(IndelRealigner.java:649)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.determineReadsThatNeedCleaning(IndelRealigner.java:917)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.clean(IndelRealigner.java:681)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.cleanAndCallMap(IndelRealigner.java:547)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:519)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:114)
            at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:104)
            at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
            at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
            at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
            at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
            at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-13-g0f021e6):
    ##### ERROR
    ##### ERROR Please visit the wiki to see if this is a known problem
    ##### ERROR If not, please post the error, with stack trace, to the GATK forum
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: -95
    ##### ERROR ------------------------------------------------------------------------------------------
    

    The code used to generate this is as follows (I tried with both a -L genomic interval file and without):

    /usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTKLite-2.1-13-g0f021e6/GenomeAnalysisTKLite.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    Any thoughts?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    I noticed you are using a BED interval file. Have you checked the coordinates/ indexing? BED files are indexed differently, this may just be an "off by one" error.

    Geraldine Van der Auwera, PhD

  • hovelsonhovelson Posts: 15Member

    I had wondered the same, except I've tested the local realignment without using the BED interval file and still receive the same error. The BED file I have been using has start position as 0-based, end position as 1-based.

  • ebanksebanks Posts: 475GSA Official Member mod

    It looks like your BAM file is malformed. Can you please run it through Picard's ValidateSAMFile?

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    ValidateSAMFile Results:

    I have >50 million reads in the BAM file, and ~98,000 records are flagged with an error essentially the same as below:

    ERROR: Record 1234, Read name <READ_NAME>, Mate negative strand flag does not match read negative strand flag of mate

    There's also ~50 errors that include one of the two following errors:

    ERROR: .....MAPQ should be 0 for unmapped read or ERROR: .....Mate unmapped flag does not match read unmapped flag of mate

    Is this type of mismatched read-pair information a likely culprit for causing problems at the indel realignment stage?

  • hovelsonhovelson Posts: 15Member

    And before I forget - THANK YOU very much for your help/suggestions in trying to sort all this out. I sincerely appreciate it!

  • hovelsonhovelson Posts: 15Member

    In playing around a little bit with some of these command line arguments, I realized that when running BaseRecalibrator (for example) with an unzipped reference FASTA file, I was able to get it to run just fine. If I run with a gzipped reference file - I get the same sorts of stack error problems described above for the IndelRealigner.

    Could a gzipped reference file be the problem here?

  • ebanksebanks Posts: 475GSA Official Member mod

    Yes, that looks like the culprit.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 15Member

    Okay - it's still a bit strange, though, because I can run the first step of local indel realignment without issue. It only presents problems for the 2nd step of indel realignment (or BaseRecalibrator).

    Post edited by hovelson on
  • Mark_DePristoMark_DePristo Posts: 132Administrator, GSA Official Member admin

    No, actually we are moving away even allowing zipped reference files at all, because the code we use to read it doesn't work so properly...

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • hovelsonhovelson Posts: 15Member

    Okay - thanks very much for the help/feedback with this issue. It's greatly appreciated!

  • akbariakbari Posts: 6Member

    The IndelRealigner works fine with one know indel target intervals. However, when I try to use two known indel target intervals with two separate -targetintervals arguments and -two separate -known arguments, I get the error of too many values for the -targetintervals argument. Should I run the IndelRealigner for each known target intervals separately or there is a way to combine them in a single run.Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    You can give it a file containing a list of target intervals as explained here:

    http://www.broadinstitute.org/gatk/guide/article?id=1204

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 475GSA Official Member mod

    Just combine (e.g. with 'cat') the files into a single file first.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • akbariakbari Posts: 6Member

    I tried to use the MergeIntervalLists walker before, but it seems this walker is not included in the public release of the GATK. If I use cat for combining the intervals, then I will have many overlapping intervals in the final list. Is that ok for indelrealigner walker? Do I need to combine the vcf files as well or I can use them with two separate -known arguments?Thanks,

  • ebanksebanks Posts: 475GSA Official Member mod

    Actually, let's step back for a second. Why do you have 2 input target interval lists? That's not right. You should only have one input list, otherwise you have done something wrong...

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • akbariakbari Posts: 6Member

    Good question. Thank you for noting that. I do not know why I made a separate interval list for each vcf file. I think that I should use the two vcf files together with RealignerTargetCreator walker and make a combined interval list for both vcf files, shouldn't I?

  • ebanksebanks Posts: 475GSA Official Member mod

    Yes!

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • sarmadymsarmadym Posts: 5Member

    I'm working on a pipeline for a very high coverage targeted resequencing project using GATK 2.2. We are interested only in variants within the targeted region. I tested two pipelines one with -L targeted_regions.bed for IndelRealigner, RealignerTargetCreator, BaseRecalibrator, PrintReads and one without -L for those steps. Out of about 430 variants, there are 3 variants that are discordant between the two variant sets (all three with >30 QUAL and >100 DP).

    Which set should I trust? is -L necessary for such targeted resequencing projects? or I should avoid it for certain commands.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi there,

    There is no strong argument for either side (and what you're seeing is a very minor difference) but I can tell you that we don't use intervals until the PrintReads step (after Base Recalibration).

    Geraldine Van der Auwera, PhD

  • hmviithmviit Posts: 4Member

    Hi,

    I just recently started using GATK 2. I tried but I couldn't quite figure out where to find answers for my questions from the documentations, so I'll just aks straight up.

    I'm working with 2 different Gasterosteus aculeatus lineages. Even though there is a genome, I haven't found a GVF file for the species. The sequences are from illumina. I've following throught the pipeline for local realignment with creating targets and then realigning followed with baserecalibration. I ran into few questions:

    1) When I ran RealignerTargetCreator with a VCF of my own for these samples, I endedn up having in addition to my previously found indels (from samtools) alot of other targets as well. Some of them did seem like candidates even without an indel, but some of them were fine to my eye. What could be causing this? This is what I ran java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R gasAcu_combinedbac_inv7.fa -I readgroup_FF1_inv7c.sorted.bam -L Capture_Target_Regions.intervals -o intervals2_FF1_inv7c -known FF1_inv7c_var.vcf -fixMisencodedQuals

    2) When I ran IndelRealigner using the targets created in the previous, I couldn't any more add my vcf as the --knownAlleles. It ran fine without it though, but when I added it in as an argument it errored with this:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.3-9-ge5ebf34):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Argument with name 'knownAlleles' isn't defined.
    ERROR ------------------------------------------------------------------------------------------

    The code I tried running was java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T IndelRealigner -R gasAcu_combinedbac_inv7.fa -I readgroup_FF1_inv7c.sorted.bam -targetIntervals intervals2_FF1_inv7c.intervals -L Capture_Target_Regions.intervals -knownAlleles FF1_inv7c_var.vcf -o realigned_FF1_inv7c.bam -fixMisencodedQuals

    3) I'm know at the stage of BaseRecalibrator, but my problem is that I'm not sure what to add as the -knownSites based on previous questions and the fact that I know there are samples for which I have the case of SNPs being Ref/Alt and Atl1/Alt2. I wasn't sure after reading the documentation, should I use -knownSites to mask potential SNPs from recalibration and if yes, then can it be my own VCF based on unrealigned and uncalibrated samtools results or should it be something else?

    I appreciate any suggestions here.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    1) Don't worry about it. There are various things the RTC looks for when determining what regions may need to be realigned. Even if a region is flagged that doesn't really need to be, it won't hurt or mess up anything.

    2) The argument name is slightly different for IR than for RTC: --knownAlleles / -known. Always worth looking up the exact arg names in the tech docs if you're getting an error.

    3) You should probably filter your knowns VCF to keep only the high quality calls, otherwise you risk over-masking. This is covered in the Best Practices documentation.

    Geraldine Van der Auwera, PhD

  • hmviithmviit Posts: 4Member

    Thanks for your reply Geraldine, I'll check my argument names and filter my VCF before running the BaseRecalibrator.

    Post edited by hmviit on
  • chongmchongm Posts: 23Member

    I had a conceptual question regarding "strand discordant loci." In the local realignment video tutorial, there was a slide on this concept whereby a unique albeit artefactual SNP cluster seems to be biased towards the ends of either forward or reverse strands due to the presence of an indel. It is not quite clear to me why exactly there is this strand bias. Let's say we have two identical reads (same sequence) which map to the edge of an indel, but one read is read from the forward strand (sequencing towards the indel let's say) and then another read is read from the reverse strand (sequencing away from the indel let's say). Are these two reads not equally likely to map to the same position on the indel? Does it have something to do with the fact that bases at the end of the reads tend to have lesser qualities (and thus a greater chance of mapping to an indelic region with artefactual mismatches?)

    Regardless of this strand discordance, I understand that a high SNP density is also a huge flag for "suspicious" regions requiring realignment but I'd like to understand where this strand bias comes from.

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi there,

    This is an interesting question, but it's more so a question on how aligners work rather than how to use the GATK, so I'm afraid we can't spend time discussing it. I think you would do better to start that discussion somewhere like SeqAnswers.

    That said if someone in the community would like to discuss this with you they should feel free to do so here; we're quite happy to host the discussion even if we don't have time to participate.

    Geraldine Van der Auwera, PhD

  • bioSGbioSG Posts: 2Member

    Hi there, Is it possible to perform RealignerTargetCreator and IndelRealigner with multiple samples at once in the same BAM file or does it need to be done one by one? In case it is possible to perform these steps with multiple samples at once, is there an optimal number of samples? Thanks very much in advance.

    S.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi @bioSG,

    You can realign multiple samples at the same time, yes. They can be in the same file or in separate files that you pass in together, as you prefer.

    There is no hard limit to the number of samples you can realign together, but it is computationally expensive, so the number you can realistically process depends on the capabilities of your computer/platform.

    Geraldine Van der Auwera, PhD

  • julianjulian Posts: 2Member

    Hi,

    In case I run the realigner with multiple samples at once, will it try to find a consistent realignment taking all the samples at a certain site into account or will it treat the samples independently?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Yes, it will apply the same consensus to all the samples.

    Geraldine Van der Auwera, PhD

  • bioSGbioSG Posts: 2Member

    Going back to my question on the realigner with multiple samples at once, is there the risk of not performing the realignment in regions that would have been realigned doing a one by one sample realignment? It looks like realigning with multiple samples at once could decrease the percentages of mismatches at a locus having high entropy, and thus, further tuning of the entropy threshold would be needed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    That's right, which is why if you want to be really thorough you can first realign per-sample, then do a second round of realignment on multiple samples. That way you're sure that individual samples are realigned internally even if the region doesn't get triggered when running multi-samples. But that takes a whole lot of time & compute, so it's not very popular.

    Geraldine Van der Auwera, PhD

  • mpviverompvivero Posts: 5Member

    Is there anyway to reorder the contigs for the known indels the Broad provides in their hg19 bundle? The hg19 reference genome I have been using starts chr1, chr2, chr3...chr7, chrX, chr8. However, the known indel contigs (1000G and Mills set) start chrM, chr1, chr2...chr7, chr8, chr9. I am trying to run RealignerTargetCreator (and later IndelRealigner) and I get an error stating that GATK doesn't function with different contig orders, it is "unsafe."

    I would much rather reorder the contigs for the known indel VCF once than having to reorder all my BAM files everytime. Or is there another known indel VCF with the chr1, chr2, chr3...chr7, chrX, chr8 order I can download?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    I'm afraid our hg19 resource files all start with chrM too. Maybe someone else has a differently ordered vcf they can share with you; otherwise we have heard of a script tool called VCFsorter that will allow you to reorder the contigs in your VCF according to a dictionary, using Picard: https://code.google.com/p/vcfsorter/

    Just keep in mind that this is not one of our tools so we can't guarantee results nor provide support for that particular tool. Good luck!

    Geraldine Van der Auwera, PhD

  • mpviverompvivero Posts: 5Member

    As long as the -known [file] is in VCF format, will the RealignerTargetCreator function properly, regardless of chromosomal order? I want to make sure this will not hinder GATKs functionally.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    No, the order of the contigs has to be the same for all the files involved, otherwise GATK will freak out. This is an important requirement of how the GATK engine retrieves data from the files.

    Generally speaking, you should always use data resources that match the reference. From when you start designing your experiment, you should plan which reference you will use for alignment to go with which data resources you will use. Choosing afterwards leads to this kind of problem of having to reorder things around, it's messy. Sorry to be the bearer of bad news...

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Hi, I used the RealignerTargetCreator to create a list file (sample.intervals) with intervals to be used with the IndelRealigner. But when I call the command I get a error saying that my argument 'targetIntervals' has to many values. There is only one argument given to this parameter:

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -I ... -o ... -targetIntervals sample.intervals

    Do you know some possible work around?

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi @mjaeger,

    Can you please post the full command that you ran?

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Yes of cause.

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -I results/exom/HOCsampleA.bam -o results/exom/HOCsampleA_realigned.bam -targetIntervals sample.intervals

    Here is the head of the intervals file:

    chrM:302-303 chrM:516-517 chr1:132032 chr1:448309 chr1:523157-523158 chr1:533066 chr1:535082 ...

    The file contains about 62.000 positions/lines.

    Thanks. Marten

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Marten,

    That's very odd. You could try running this command again with a copy of the targets file containing only a few lines, and see if that works properly. If it does, it might mean that something is wrong in the original targets file. If it does not, then the problem is elsewhere. Can you also make sure that you are running the latest version of the GATK?

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Hi Geraldine,

    The latest version of GATK (2.5-2-gf57256b) was downloaded the same day I first run the realigner. I tried to run the command with a copy with only the first 1.000 lines of the intervals file but got the same error message. So maybe I will rerun the analysis on an other machine.

    Marten

  • CarneiroCarneiro Posts: 153Administrator, GSA Official Member admin

    can you copy & paste the GATK error message? There shouldn't be a limit on the number of intervals.

Sign In or Register to comment.