Local Realignment around Indels

delangeldelangel Posts: 71GATK Developer mod
edited September 2012 in Methods and Workflows

Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.


Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.


Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <lane-level.bam> \
  -R <ref.fasta> \
  -T IndelRealigner \
  -targetIntervals <intervalListFromStep1Above.intervals> \
  -o <realignedBam.bam> \
  -known /path/to/indel_calls.vcf
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
Post edited by Geraldine_VdAuwera on
«1

Comments

  • aeonsimaeonsim Posts: 64Member ✭✭
    edited August 2012

    Hi

    Is there any way to run the IndelRealigner on BAM/SAM files where the multiple possible alignments are reported for a read, rather than the Top Random alignment that BWA uses?

    As the aligner we are using generates BAMs which report multiple possible alignments for reads, that have no unique best alignment. ie if a Read has three equal likely alignments (a repeat for example) the aligner will report all three and flag them as non unique.

    However if we try to feed these BAMs into GATK IndelRealigner we get an error due to these duplicate reads.

    ERROR MESSAGE: Error caching SAM record HWI-ST196R_0530:2:2204:3691:99597#0/1, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.

    Is there any way around this, an on the fly filtering option we can add to GATK to discard these reads, or get it to work properly with them?

    Thanks

    Post edited by aeonsim on
  • ebanksebanks Posts: 683GATK Developer mod

    Hi there, I'm pretty sure we already have a read filter that will work for you: try the NotPrimaryAlignmentFilter ('-rf NotPrimaryAlignment'). See the Read Filters section for more details. And if this doesn't work for you, it should be extremely easy for you to create a filter modeled after this one that will work. Good luck!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • aeonsimaeonsim Posts: 64Member ✭✭

    That seems to have worked, thanks! Might be worth extending your error message to mention it as an option.

  • virenpatelvirenpatel Posts: 7Member
    edited October 2012

    Hello. I am trying to run the Indel Realigner but the message I get is:

    ##### ERROR MESSAGE: File associated with name s_1.bwt2.sorted.wRG.recal.interval_list is malformed: Interval file could not be parsed in any supported format. caused by Failed to parse Genome Location string: chr1:58879-59946:95-193
    

    The command line I am using is:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -T reference.fasta -I input.bam -o output.bam -targetIntervals forIndelRealigner.interval_list -known reference_snps.vcf --validation_strictness LENIENT
    

    There are no errors from the previous step of RealignerTargetCreator. What am I doing wrong?

    Post edited by Geraldine_VdAuwera on
  • virenpatelvirenpatel Posts: 7Member

    Sorry. Above command has -R reference.fasta, not -T as shown.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    The error message says it all: your interval list is malformed. Look closely at the Genome Loc string at the end of the error: chr1:58879-59946:95-193 and ask yourself, what is this supposed to represent, and what's wrong about it?

    Geraldine Van der Auwera, PhD

  • virenpatelvirenpatel Posts: 7Member

    I am sorry but I don't know what's wrong other than it's obviously not a valid genomic coordinate in the usual sense. Since this was output by RealignerTargetCreator I assumed it would be correctly interpreted by IndelRealigner. I have a contig in my reference that has coordinates chr1:58879-59946 but I do not know what the 95-193 refers to.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Alright, well the 95-193 part shouldn't be there at all, it's not part of the normal makeup of a genome loc, so it seems the interval list was not created correctly by the RTC. Please realize that a more accurate description of your problem would have been "I just ran RealignerTargetCreator and it produced an interval list that looks like it's incorrectly formatted". Technically this is a problem that should be posted under RealignerTargetCreator, not IndelRealigner...

    Can you check (by using -XL to exclude the problem location) whether this only occurs a single time in your intervals file, or whether there are other instances?

    Another helpful test would be to run your original RTC command again to see if it produces the same malformed locus a second time. This could have been due to a one-time hiccup, so let's check for that before going too deep.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    I agree with Geraldine that it looks like a one time blip with your file system.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • virenpatelvirenpatel Posts: 7Member
    edited October 2012

    The whole interval list file looks the same. Some contigs have multiple entries, e.g:

    chr1:239794835-239795056:6-63
    chr1:239794835-239795056:142
    chr1:239794835-239795056:218-219
    

    I did rerun the RTC command and the output was the same.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    This is weird -- can you please post your full command line for the RealignerTargetCreator step?

    Geraldine Van der Auwera, PhD

  • virenpatelvirenpatel Posts: 7Member
    edited October 2012
    java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I input.bam -o forIndelRealigner.interval_list --known reference_snps.vcf --validation_strictness LENIENT
    
    Post edited by Geraldine_VdAuwera on
  • virenpatelvirenpatel Posts: 7Member

    I have attached the output of RealignerTargetCreator.

    log
    log
    RealignerTargetCreator.log
    54K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    What happens if you run that command without the --validation_strictness flag?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Also, can you tell me what reference you are using? Is it one of the standard ones from our bundle or is it something else? It looks like the locations are formatted in a way that's not compatible with our tools. If you upload your reference .idx file we may be able to verify if that's where the problem is.

    Geraldine Van der Auwera, PhD

  • hovelsonhovelson Posts: 19Member
    edited October 2012

    Here is a problem related to realigning around know indel sites that I cannot solve:

    After realigning around known indel sites (using the 1000G phase1 vcf file file from your bundle, as well as different copy of the same 1000G phase1 known indels vcf file), I notice some strange results when looking at plots of empirical and observed base-level phred scores pre- and post-indel-realignment. The attached PDF file shows side-by-side empirical v. reported phred score plots generated from both a raw bam file and a bam file after local indel realignment. These bam files are generated from the same sample, but you can see that after indel realignment there is a substantial deflation of empirical phred scores compared to pre-realignment results.

    My question is: why do we see such results? The trend appears similar (pre- and post-realignment), but the empirical scores appear capped at ~25. I'm still attempting to better understand what might be happening here, but something about the way I am running indel realignment appears to also be impacting downstream base quality re-calibration results as well (similar, strange empirical vs. reported plot results post-recalibration). If I exclude the indel realignment step, the base quality re-calibration appears to work successfully (a much tighter adherence to the expected empirical v. reported line post-recalibration).

    The code I used to carry out the two-stage indel realignment step is as follows:

    /usr/bin/java -Xmx2g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -known /net/fantasia/home/gjun/data/indel/1000G_phase1.indels.b37.vcf

    /usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    samtools index /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    I've spent a lot of time trying to troubleshoot this to no avail. I hope I'm just missing something simple, but am very eager to see whether you folks have any thoughts about what might be going wrong!

    Best, Dan

    pdf
    pdf
    indel_realignment_neava.20121009.pdf
    100K
    Post edited by hovelson on
  • ebanksebanks Posts: 683GATK Developer mod

    Hi Dan,

    That isn't the behavior I would have expected. A few thoughts: 1. What do the AnalyzeCovariates (since you are using an older version of the code) plots look like? Can you confirm that they show the same behavior? 2. What if you were to remove the -L argument in the IndelRealigner command? Does the plot for the realigned bam look better?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member
    edited October 2012

    Thanks very much for your reply, Eric.

    I assume by AnalyzeCovariates plots you mean plots after base quality recalibration (is this correct?). If I run recalibration after local indel realignment, I get the very odd plot you see on the lefthand-side of the attached PDF doc. If I run recalibration without any local indel realignment, I get the more sensible plot you see on the right side of the attached doc.

    I will re-run local realignment without supplying the target region file and let you know what happens shortly.

    pdf
    pdf
    indel_realign_AnalyzeCovariates.pdf
    87K
    Post edited by hovelson on
  • ebanksebanks Posts: 683GATK Developer mod

    No, this isn't right. I don't want you to generate these plots with your own tools, but rather I want to see what they look like with our plotting tools. Perhaps you should upgrade to the latest release of the GATK and use BaseRecalibrator which is documented (and supported).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member

    I'll work to generate these plots with your tools. I am a little unclear, though: how will proceeding with base quality recalibration help us sort out the unexpected results from the local indel realignment?

    Thanks, Dan

  • hovelsonhovelson Posts: 19Member

    For what it's worth, our tools do still show the same deflation in empirical phred score values for bam files created from local realignment when removing the -L target file option.

  • ebanksebanks Posts: 683GATK Developer mod

    I just want to confirm that the problem is not with your plotting tools.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member

    Ok, thanks Eric! I've updated to the latest release of GATK, and am currently re-running indel realignment and base quality recalibration on this sample. I'll generate the QQ plots using your tools, and let you know what I find.

  • ebanksebanks Posts: 683GATK Developer mod

    You shouldn't need to rerun the re-alignment - just the recalibration.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member

    So (using the latest GATK release) the target interval creation step (for local realignment) ran smoothly. However, when running IndelRealigner I receive a stack trace error that is as follows:

    INFO  12:07:59,775 ArgumentTypeDescriptor - Dynamically determined type of /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed to be BED
    INFO  12:07:59,811 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,811 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g0f021e6, Compiled 2012/10/10 19:42:23
    INFO  12:07:59,811 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO  12:07:59,812 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO  12:07:59,812 HelpFormatter - Program Args: -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam
    INFO  12:07:59,813 HelpFormatter - Date/Time: 2012/10/11 12:07:59
    INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
    INFO  12:07:59,892 GenomeAnalysisEngine - Strictness is SILENT
    INFO  12:07:59,961 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO  12:07:59,982 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
    INFO  12:08:01,566 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
    INFO  12:08:01,567 TraversalEngine -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
    INFO  12:08:02,640 GATKRunReport - Uploaded run statistics report to AWS S3
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace
    java.lang.ArrayIndexOutOfBoundsException: -95
            at org.broadinstitute.sting.utils.BaseUtils.simpleBaseToBaseIndex(BaseUtils.java:201)
            at org.broadinstitute.sting.utils.BaseUtils.isRegularBase(BaseUtils.java:235)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.mismatchQualitySumIgnoreCigar(IndelRealigner.java:649)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.determineReadsThatNeedCleaning(IndelRealigner.java:917)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.clean(IndelRealigner.java:681)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.cleanAndCallMap(IndelRealigner.java:547)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:519)
            at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:114)
            at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:104)
            at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
            at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
            at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
            at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
            at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-13-g0f021e6):
    ##### ERROR
    ##### ERROR Please visit the wiki to see if this is a known problem
    ##### ERROR If not, please post the error, with stack trace, to the GATK forum
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: -95
    ##### ERROR ------------------------------------------------------------------------------------------
    

    The code used to generate this is as follows (I tried with both a -L genomic interval file and without):

    /usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTKLite-2.1-13-g0f021e6/GenomeAnalysisTKLite.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

    Any thoughts?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I noticed you are using a BED interval file. Have you checked the coordinates/ indexing? BED files are indexed differently, this may just be an "off by one" error.

    Geraldine Van der Auwera, PhD

  • hovelsonhovelson Posts: 19Member

    I had wondered the same, except I've tested the local realignment without using the BED interval file and still receive the same error. The BED file I have been using has start position as 0-based, end position as 1-based.

  • ebanksebanks Posts: 683GATK Developer mod

    It looks like your BAM file is malformed. Can you please run it through Picard's ValidateSAMFile?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member

    ValidateSAMFile Results:

    I have >50 million reads in the BAM file, and ~98,000 records are flagged with an error essentially the same as below:

    ERROR: Record 1234, Read name <READ_NAME>, Mate negative strand flag does not match read negative strand flag of mate

    There's also ~50 errors that include one of the two following errors:

    ERROR: .....MAPQ should be 0 for unmapped read or ERROR: .....Mate unmapped flag does not match read unmapped flag of mate

    Is this type of mismatched read-pair information a likely culprit for causing problems at the indel realignment stage?

  • hovelsonhovelson Posts: 19Member

    And before I forget - THANK YOU very much for your help/suggestions in trying to sort all this out. I sincerely appreciate it!

  • hovelsonhovelson Posts: 19Member

    In playing around a little bit with some of these command line arguments, I realized that when running BaseRecalibrator (for example) with an unzipped reference FASTA file, I was able to get it to run just fine. If I run with a gzipped reference file - I get the same sorts of stack error problems described above for the IndelRealigner.

    Could a gzipped reference file be the problem here?

  • ebanksebanks Posts: 683GATK Developer mod

    Yes, that looks like the culprit.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hovelsonhovelson Posts: 19Member
    edited October 2012

    Okay - it's still a bit strange, though, because I can run the first step of local indel realignment without issue. It only presents problems for the 2nd step of indel realignment (or BaseRecalibrator).

    Post edited by hovelson on
  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    No, actually we are moving away even allowing zipped reference files at all, because the code we use to read it doesn't work so properly...

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • hovelsonhovelson Posts: 19Member

    Okay - thanks very much for the help/feedback with this issue. It's greatly appreciated!

  • akbariakbari Posts: 6Member

    The IndelRealigner works fine with one know indel target intervals. However, when I try to use two known indel target intervals with two separate -targetintervals arguments and -two separate -known arguments, I get the error of too many values for the -targetintervals argument. Should I run the IndelRealigner for each known target intervals separately or there is a way to combine them in a single run.Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    You can give it a file containing a list of target intervals as explained here:

    http://www.broadinstitute.org/gatk/guide/article?id=1204

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    Just combine (e.g. with 'cat') the files into a single file first.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • akbariakbari Posts: 6Member

    I tried to use the MergeIntervalLists walker before, but it seems this walker is not included in the public release of the GATK. If I use cat for combining the intervals, then I will have many overlapping intervals in the final list. Is that ok for indelrealigner walker? Do I need to combine the vcf files as well or I can use them with two separate -known arguments?Thanks,

  • ebanksebanks Posts: 683GATK Developer mod

    Actually, let's step back for a second. Why do you have 2 input target interval lists? That's not right. You should only have one input list, otherwise you have done something wrong...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • akbariakbari Posts: 6Member

    Good question. Thank you for noting that. I do not know why I made a separate interval list for each vcf file. I think that I should use the two vcf files together with RealignerTargetCreator walker and make a combined interval list for both vcf files, shouldn't I?

  • ebanksebanks Posts: 683GATK Developer mod

    Yes!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • sarmadymsarmadym Posts: 5Member
    edited November 2012

    I'm working on a pipeline for a very high coverage targeted resequencing project using GATK 2.2. We are interested only in variants within the targeted region. I tested two pipelines one with -L targeted_regions.bed for IndelRealigner, RealignerTargetCreator, BaseRecalibrator, PrintReads and one without -L for those steps. Out of about 430 variants, there are 3 variants that are discordant between the two variant sets (all three with >30 QUAL and >100 DP).

    Which set should I trust? is -L necessary for such targeted resequencing projects? or I should avoid it for certain commands.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi there,

    There is no strong argument for either side (and what you're seeing is a very minor difference) but I can tell you that we don't use intervals until the PrintReads step (after Base Recalibration).

    Geraldine Van der Auwera, PhD

  • hmviithmviit Posts: 4Member

    Hi,

    I just recently started using GATK 2. I tried but I couldn't quite figure out where to find answers for my questions from the documentations, so I'll just aks straight up.

    I'm working with 2 different Gasterosteus aculeatus lineages. Even though there is a genome, I haven't found a GVF file for the species. The sequences are from illumina. I've following throught the pipeline for local realignment with creating targets and then realigning followed with baserecalibration. I ran into few questions:

    1) When I ran RealignerTargetCreator with a VCF of my own for these samples, I endedn up having in addition to my previously found indels (from samtools) alot of other targets as well. Some of them did seem like candidates even without an indel, but some of them were fine to my eye. What could be causing this? This is what I ran java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R gasAcu_combinedbac_inv7.fa -I readgroup_FF1_inv7c.sorted.bam -L Capture_Target_Regions.intervals -o intervals2_FF1_inv7c -known FF1_inv7c_var.vcf -fixMisencodedQuals

    2) When I ran IndelRealigner using the targets created in the previous, I couldn't any more add my vcf as the --knownAlleles. It ran fine without it though, but when I added it in as an argument it errored with this:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.3-9-ge5ebf34):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Argument with name 'knownAlleles' isn't defined.
    ERROR ------------------------------------------------------------------------------------------

    The code I tried running was java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T IndelRealigner -R gasAcu_combinedbac_inv7.fa -I readgroup_FF1_inv7c.sorted.bam -targetIntervals intervals2_FF1_inv7c.intervals -L Capture_Target_Regions.intervals -knownAlleles FF1_inv7c_var.vcf -o realigned_FF1_inv7c.bam -fixMisencodedQuals

    3) I'm know at the stage of BaseRecalibrator, but my problem is that I'm not sure what to add as the -knownSites based on previous questions and the fact that I know there are samples for which I have the case of SNPs being Ref/Alt and Atl1/Alt2. I wasn't sure after reading the documentation, should I use -knownSites to mask potential SNPs from recalibration and if yes, then can it be my own VCF based on unrealigned and uncalibrated samtools results or should it be something else?

    I appreciate any suggestions here.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    1) Don't worry about it. There are various things the RTC looks for when determining what regions may need to be realigned. Even if a region is flagged that doesn't really need to be, it won't hurt or mess up anything.

    2) The argument name is slightly different for IR than for RTC: --knownAlleles / -known. Always worth looking up the exact arg names in the tech docs if you're getting an error.

    3) You should probably filter your knowns VCF to keep only the high quality calls, otherwise you risk over-masking. This is covered in the Best Practices documentation.

    Geraldine Van der Auwera, PhD

  • hmviithmviit Posts: 4Member
    edited January 2013

    Thanks for your reply Geraldine, I'll check my argument names and filter my VCF before running the BaseRecalibrator.

    Post edited by hmviit on
  • chongmchongm Posts: 33Member

    I had a conceptual question regarding "strand discordant loci." In the local realignment video tutorial, there was a slide on this concept whereby a unique albeit artefactual SNP cluster seems to be biased towards the ends of either forward or reverse strands due to the presence of an indel. It is not quite clear to me why exactly there is this strand bias. Let's say we have two identical reads (same sequence) which map to the edge of an indel, but one read is read from the forward strand (sequencing towards the indel let's say) and then another read is read from the reverse strand (sequencing away from the indel let's say). Are these two reads not equally likely to map to the same position on the indel? Does it have something to do with the fact that bases at the end of the reads tend to have lesser qualities (and thus a greater chance of mapping to an indelic region with artefactual mismatches?)

    Regardless of this strand discordance, I understand that a high SNP density is also a huge flag for "suspicious" regions requiring realignment but I'd like to understand where this strand bias comes from.

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi there,

    This is an interesting question, but it's more so a question on how aligners work rather than how to use the GATK, so I'm afraid we can't spend time discussing it. I think you would do better to start that discussion somewhere like SeqAnswers.

    That said if someone in the community would like to discuss this with you they should feel free to do so here; we're quite happy to host the discussion even if we don't have time to participate.

    Geraldine Van der Auwera, PhD

  • bioSGbioSG Posts: 18Member

    Hi there, Is it possible to perform RealignerTargetCreator and IndelRealigner with multiple samples at once in the same BAM file or does it need to be done one by one? In case it is possible to perform these steps with multiple samples at once, is there an optimal number of samples? Thanks very much in advance.

    S.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @bioSG,

    You can realign multiple samples at the same time, yes. They can be in the same file or in separate files that you pass in together, as you prefer.

    There is no hard limit to the number of samples you can realign together, but it is computationally expensive, so the number you can realistically process depends on the capabilities of your computer/platform.

    Geraldine Van der Auwera, PhD

  • julianjulian Posts: 4Member

    Hi,

    In case I run the realigner with multiple samples at once, will it try to find a consistent realignment taking all the samples at a certain site into account or will it treat the samples independently?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Yes, it will apply the same consensus to all the samples.

    Geraldine Van der Auwera, PhD

  • bioSGbioSG Posts: 18Member

    Going back to my question on the realigner with multiple samples at once, is there the risk of not performing the realignment in regions that would have been realigned doing a one by one sample realignment? It looks like realigning with multiple samples at once could decrease the percentages of mismatches at a locus having high entropy, and thus, further tuning of the entropy threshold would be needed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    That's right, which is why if you want to be really thorough you can first realign per-sample, then do a second round of realignment on multiple samples. That way you're sure that individual samples are realigned internally even if the region doesn't get triggered when running multi-samples. But that takes a whole lot of time & compute, so it's not very popular.

    Geraldine Van der Auwera, PhD

  • mpviverompvivero Posts: 7Member

    Is there anyway to reorder the contigs for the known indels the Broad provides in their hg19 bundle? The hg19 reference genome I have been using starts chr1, chr2, chr3...chr7, chrX, chr8. However, the known indel contigs (1000G and Mills set) start chrM, chr1, chr2...chr7, chr8, chr9. I am trying to run RealignerTargetCreator (and later IndelRealigner) and I get an error stating that GATK doesn't function with different contig orders, it is "unsafe."

    I would much rather reorder the contigs for the known indel VCF once than having to reorder all my BAM files everytime. Or is there another known indel VCF with the chr1, chr2, chr3...chr7, chrX, chr8 order I can download?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I'm afraid our hg19 resource files all start with chrM too. Maybe someone else has a differently ordered vcf they can share with you; otherwise we have heard of a script tool called VCFsorter that will allow you to reorder the contigs in your VCF according to a dictionary, using Picard: https://code.google.com/p/vcfsorter/

    Just keep in mind that this is not one of our tools so we can't guarantee results nor provide support for that particular tool. Good luck!

    Geraldine Van der Auwera, PhD

  • mpviverompvivero Posts: 7Member

    As long as the -known [file] is in VCF format, will the RealignerTargetCreator function properly, regardless of chromosomal order? I want to make sure this will not hinder GATKs functionally.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    No, the order of the contigs has to be the same for all the files involved, otherwise GATK will freak out. This is an important requirement of how the GATK engine retrieves data from the files.

    Generally speaking, you should always use data resources that match the reference. From when you start designing your experiment, you should plan which reference you will use for alignment to go with which data resources you will use. Choosing afterwards leads to this kind of problem of having to reorder things around, it's messy. Sorry to be the bearer of bad news...

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Hi, I used the RealignerTargetCreator to create a list file (sample.intervals) with intervals to be used with the IndelRealigner. But when I call the command I get a error saying that my argument 'targetIntervals' has to many values. There is only one argument given to this parameter:

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -I ... -o ... -targetIntervals sample.intervals

    Do you know some possible work around?

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @mjaeger,

    Can you please post the full command that you ran?

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Yes of cause.

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa 
     -I results/exom/HOCsampleA.bam -o results/exom/HOCsampleA_realigned.bam -targetIntervals sample.intervals

    Here is the head of the intervals file:

    chrM:302-303
    chrM:516-517
    chr1:132032
    chr1:448309
    chr1:523157-523158
    chr1:533066
    chr1:535082
    ...

    The file contains about 62.000 positions/lines.

    Thanks. Marten

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Marten,

    That's very odd. You could try running this command again with a copy of the targets file containing only a few lines, and see if that works properly. If it does, it might mean that something is wrong in the original targets file. If it does not, then the problem is elsewhere. Can you also make sure that you are running the latest version of the GATK?

    Geraldine Van der Auwera, PhD

  • mjaegermjaeger Posts: 3Member

    Hi Geraldine,

    The latest version of GATK (2.5-2-gf57256b) was downloaded the same day I first run the realigner. I tried to run the command with a copy with only the first 1.000 lines of the intervals file but got the same error message. So maybe I will rerun the analysis on an other machine.

    Marten

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    can you copy & paste the GATK error message? There shouldn't be a limit on the number of intervals.

  • DavidRiesDavidRies Posts: 11Member
    edited May 2013

    Hi, Indel realigner didn't give me any error messages whe I used it like this: java -Xmx8g -jar ~/Java/GenomeAnalysisTKLite-2.2-16-g2cc9ef8/GenomeAnalysisTKLite.jar -T IndelRealigner -R /prj/gf-nugget/processed-data/data/Assemblies/beet/RefBeet-1.0.chromosomes/RefBeet-1.0.gendbe.chrmosomes.fa -I merged_Red_RG.bam -targetIntervals forIndelRealigner.red.intervals -o merged_Red_RG_realigned_CHR2.bam --maxReadsForConsensuses 200 --maxReadsForRealignment 50000 --maxReadsInMemory 300000 -L lcl\|Chr2:6000000-7500000

    But for some indels, it doesn't quite seem to work, since none of the reads are realigned. The Problem might be, that it can not clearly be decided for some reads, if there is an indel or not, the mapping is equally fine in both cases. The indel was in the list created by RealignerTargetCreator.

    Is there an option, so that an ambiguous read is rather realigned than not realigned (which would essentially mean to prefer a gap)?

    I attached a picture of the region.

    Best regards,

    David

    Post edited by DavidRies on
  • DavidRiesDavidRies Posts: 11Member

    And the attachment ...

    igv_snapshot.png
    1673 x 886 - 54K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @DavidRies,

    One thing you can try is change the --consensusDeterminationModel / -model (see the documentation page for the different options). You should also check how much depth you have at that position -- if there's too much depth you may be maxing out the maxReads parameters.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    Hi David,

    First off, you should re-run the IndelRealignment with the latest version of the GATK (we do not support older versions) and the default command-line arguments. See how that fares. If it still doesn't catch this indel then you should also check whether there's another indel nearby in the same target interval (only 1 indel is allowed to get realigned in an given interval).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • DavidRiesDavidRies Posts: 11Member
    edited June 2013

    Hi, first of all thanks for your help. I did rerun the IndelRealignment with GenomeAnalysisTK-2.5 and the default arguments. I only have a coverage of about 16x in that region, and didn't get an error complaining about to many reads. The nearest next indel is about 1900 bases away. Is that to close? I couldn't find an option defining this parameter.
    This might be a little bit off-topic, but the UnifiedGenotyper further downstream identified the indel correctly. Still, I would like to see the realignment in my IGV.

    Best regards,

    David

    Post edited by DavidRies on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    1900 bases away should be far enough to not interfere (or Eric will correct me).

    You can try passing the indel as a known indel when you do the realignment. I think this should force the realignment you want to see.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    I agree with Geraldine. If all you want is a nice IGV screenshot, you can also just pass in the small interval around the indel for the -targetIntervals argument.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mikemike Posts: 103Member
    edited June 2013

    Hi,

    I am trying to run the IndelRealigner for my new dataset and revisit this page for update and realized that the command has slightly changed in particular, now it has options as:

    --consensusDeterminationModel KNOWNS_ONLY 
    -LOD 0.4 
    

    Could you explain a bit about why we have these options now and what circumstance we need use it or not? Is it a default setting and we need use it most of the cases?

    Thanks a lot!

    Mike

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @mike,

    For an explanation of these two arguments, please see the tech doc page. I'm happy to clarify if there's anything there you don't understand, but please read that first.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited June 2013

    Hi, Geraldine:

    Thanks for the prompt response. In fact, that's exactly what I am trying to ask. The tech doc page you refer to did give some default value, e,g

    --LODThresholdForCleaning / -LOD ( double with default value 5.0 ) 
    

    but the example as in my first post is

    -LOD 0.4 
    

    which is much smaller than the default, and the tech doc page mentioned:

    Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.

    The question is how smaller. The dataset I have may also need to look for indels with low allele frequence but my coverage is high (Illumina) and so if I used -LOD 0.4, what would be the pros and cons? Also I do not want this bother my SNP detection as well, any impact on that? I indeed did not use the option in my old dataset for this step, and apparently used the default value. I was concerned the change in the example code may mean some insight there or some cons not using it.

    Also for option: --consensusDeterminationModel, it has three possible values as below and default is USE_READS not the one in your example (KNOWNS_ONLY)

    3 possible value:

    KNOWNS_ONLY
    Uses only indels from a provided ROD of known indels.
    USE_READS
    Additionally uses indels already present in the original alignments of the reads.
    USE_SW
    Additionally uses 'Smith-Waterman' to generate alternate consenses.

    My question for this is what is the pros and cons for using these different values particular for my dataset I just mentioned. For example, I used to not using this option and so I must have used the default (USE_READS), it sound more aggressive, is that a better approach disregard of the time cost or performance issue?

    Thanks a lot for your advice!

    Mike

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I see. I'm not sure when we changed the example value, but you shouldn't take it as a sign of anything important. The values we give in examples like this are only for demonstration purposes. The default values that are pre-set are reasonable values and are what we normally use most for our projects, but if your project is different, you may need to adapt them. There are a few places where we actually recommend values (like in VQSR and filtering) but even there the problem is that there is no single good value for all projects, and there are so many types of projects that it would be too hard to enumerate all the possible configurations. So I really can't give you numbers.

    What I can do is clarify what the tradeoffs are. The main tradeoff is usually between performance and accuracy.

    Lowering the LOD means that the realignment "cleaner" subprogram will be triggered in more cases. That is good because it increases the chance that your data will be realigned correctly around your rare indels (and that will give better results for SNP detection). But it's also bad because if you have high depth, that will mean the program has to do a lot more work, and it will take more time.

    Using the reads for generating consensus alignments is better than just known indels because the result will be more representative of your data. In your case, especially since you are looking at rare indels, the USE_READS option will likely give more accurate results. But again, the deep coverage means that it will be more work and take more time. Full SW realignment would take even more time and is usually not necessary -- unless you used a gapped aligner to map your reads, the degree of improvement tends to be minor compared to the performance burden.

    Ultimately the best way to find the compromise that works for your data is to experiment with a range of settings on a test region, and see what the results look like. And of course you have to decide for yourself how much time you're willing to take to do this.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Geraldine:

    Thanks a lot for your advice. Appreciated very much! That's all I wish to know

    Best

    Mike

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Just a quick clarification question: you mentioned "Using the reads for generating consensus alignments is better than just known indels because ...", do you specifically mean USE_READS or USE_SW option? the document mentioned:

    USE_READS Additionally uses indels already present in the original alignments of the reads. USE_SW Additionally uses 'Smith-Waterman' to generate alternate consenses.

    You seem mean USE_SW, right? What if USE_READS, what is the difference compared to USE_SW. We used BWA for the mapping, which one would be the better choice here?

    Thanks and best

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Mike,

    No, I do mean you should use USE_READS. The name is a little misleading; in this mode, the reads are not realigned de novo when considering consensus, but the tool will consider indels that are encoded in the CIGAR strings of reads when evaluate possible consensus. In USE_SW mode, the tool assumes that the original alignment provided by the mapper cannot be trusted, and the reads are actually realigned de novo before evaluating possible consensus.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for the clarification, which is very helpful!

    Best

    Mike

  • EmilyEmily Posts: 2Member

    Hi, I have known indel sites and I am using GATK2.5 to realign my reads only around those sites. At some loci there are several positions where the indel could be reasonably aligned and different reads have the indel in different positions.

    Is there a way to force a consensus such that all the reads with the indel have it aligned at the exact same position? I thought that using --consensusDeterminationModel KNOWNS_ONLY would do this but it doesn't.

    Thanks, Emily

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Emily,

    The expected behavior of the KNOWNS_ONLY mode is indeed to force realignment to the given indel. Perhaps this indel is in a complex region? Do you see any realignment at all? If there is too much complexity in the region or excessive read depth, the realigner may give up on realigning the region in order to avoid crashing. If so there are some options you can tweak to override the default safeguards...

    Geraldine Van der Auwera, PhD

  • EmilyEmily Posts: 2Member

    Thank you for your quick response. I am not seeing any realignment. There are 300 reads at the site I am considering. Is this considered excessive depth? There are some repeats around the indel and therefore more than one way for reads with the indel to align. I am not sure if that is what is causing the problem.

    Which parameters can I change to override the default safeguards you mention and force a realignment to the known sites I am providing?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    300 reads shouldn't be excessive. But it's difficult to say without seeing the region why the realigner is quitting.

    Have a look at this page here for arguments that you can tweak: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner.html

    Geraldine Van der Auwera, PhD

  • julianjulian Posts: 4Member

    How does the computational time of the indel realigner scale with the number of samples / reads? Is it a linear relationship?

  • livefallflylivefallfly Posts: 23Member
    edited August 2013

    @ebanks said: Hi there, I'm pretty sure we already have a read filter that will work for you: try the NotPrimaryAlignmentFilter ('-rf NotPrimaryAlignment'). See the Read Filters section for more details. And if this doesn't work for you, it should be extremely easy for you to create a filter modeled after this one that will work. Good luck!

    Can you paste the website address here? I can't find the reads filtered section that has details for solving the problem. Just an empty NotPrimaryAlignmentFilter page thanks

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Ah, that's the document you want but the content is not correct/missing. I will fix it. There is no other resource available in the meantime, sorry. What exactly are you having problems with? Perhaps I can give you the info you need.

    Geraldine Van der Auwera, PhD

  • livefallflylivefallfly Posts: 23Member

    when I ran the IndelRealigner command, I caught up with the question --"MESSAGE: Error caching SAM record FCD054AACXX:4:2102:2977:12785#TTAGGCAT, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present." And I don't know how to fix it. Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Oh, I see. Did you try adding -rf NotPrimaryAlignment to your command as ebanks suggested?

    Geraldine Van der Auwera, PhD

  • yfwangyfwang Posts: 2Member

    I found mapping quality of some reads reach 70 after local realignment, what does it mean? I remember the maximum MQ is 60 in BWA?

    Looking forward to your reply!

    Best regards,

    Lucius

  • ebanksebanks Posts: 683GATK Developer mod

    The Realigner adds Q10 to any read that it realigns.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • yfwangyfwang Posts: 2Member

    @ebanks said: The Realigner adds Q10 to any read that it realigns.

    Thank you. That means a read with Q30 will increase the Q to 40, if it was realigned. It's very helpful for me!

    Best regards,

    Lucius

  • vivekdas_1987vivekdas_1987 MilanPosts: 30Member
    edited November 2013

    I have a query that we do the local realignment around the indels with the known sites because it identifies most parsimonious alignment along all reads at a problematic locus. InDels in reads among the near ends can cause the mappers into mis-aligning with mismatches. These artifactual mismatches can harm base quality recalibration and variant detection. However if one is not interested in performing this step as one might not be interested in the InDel assessment downstream but if one wants a local realignment of the mapped reads around the exonic regions using the bed files containing the probes that was provided by the company to check the quality of reads that mapped only to the exonic regions. Can this be done? I am particularly interested in trying to see how many of my mapped reads which aligned to the whole hg19 genome is actually getting mapped on the exonic region, then we can assess the coverage of the reads in that way. I believe we can do this also by using the DepthofCoverage walker and use it on the local indel realigned file along with the target interval bed file which captures the probe for the exonic regions to find out the statistics of the mapping. But if we skip the local realignment around the indels and rather perform alignment around the exonic region with the bed files before variant calling can we do that? Is there any walker in the GATK 2.3 which can perform this task?

    Thanks,

    Vivek

    Post edited by vivekdas_1987 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @vivekdas_1987,

    To be clear, realignment around indels is not just useful if you are interested in calling indels; it will also improve the results of SNP calling. We do not recommend skipping this step at all.

    As to the other part of your question, there is no tool in GATK to do local realignment in exon regions as such; but the HaplotypeCaller includes a de novo reassembly step that will improve the alignment in the intervals of interest before it proceeds to call variants.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 30Member
    edited November 2013

    HIi @Geraldine_VdAuwera,

    Thanks a lot for the quick response , that makes sense. I missed out on the part of local indel realignment improving the quality of the SNP. But yes I guess HaploypeCaller and Unified Genotyper both require the interval bed files which actually does a reassembly step to improve the alignment. I would say though I am using 6 samples I went ahead with UnifiedGenotyper, I would try with haplotype caller as well to see the changes in my results for calling the true variants. But I was more interested in getting some sort of coverage measures for my mapped reads around the exonic regions and if am not wrong it can be only done with the DepthofCoverage walker right if I use it with the bam files and the interval bed files as I was asking you last day when I was trying to gain some statistical measures to do a quality control check on my mapped data. Please let me know.

    Regards,

    Vivek

    Post edited by vivekdas_1987 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    You can use DepthOfCoverage or DiagnoseTargets to evaluate coverage of your exonic regions; they have slightly different capabilities and usefulness so I encourage you to try both and see which one gives you the most useful information.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 30Member

    Thank you very much for the reply. I will perform it and keep it posted in the forum.

    Regards, Vivek

  • vivekdas_1987vivekdas_1987 MilanPosts: 30Member

    Hi @Geraldine_VdAuwera,

    I have run the DepthofCoverage and the DaignoseTarget walker for my tumor samples and its IPS lines. I am finding it difficult to settle with the Diagnose Targets output but the DepthofCoverage outputs are quite clear. I have some query I would like to get it clarified. I was looking at the file _cumulative_coverage_counts which helps to draw a histogram with cumulative frequency of all bases that have been mapped on the exomes. Here I see two columns at the beginning

                    gte_0    gte_1
    

    NSamples_1 80050421 64522700

    Can you tell me what does gte stands for and if I want to understand what the highest number of reads that got mapped on the exonic region with my samples it should the 64 million read count in the second column right? As this number varies across all samples but the gte_0 is same across all the samples and the same 80 million reads is showing up in all samples. So if I think of the reads that mapped ultimately on the exome it should be second column right? Please suggest me.

    Regards,

    Vivek

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Vivek,

    Please note that DepthOfCoverage looks at coverage per locus, not where reads are, so these counts refer to loci, not to reads. gte stands for "greater than or equal to". Each gte_ column represents the left endpoint of a bin in the cumulative coverage histogram. So in your example, 80050421 is the number of loci that are covered by 0 or more reads, 64522700 is the number of loci covered by 1 or more reads, and so on.

    We generally are mostly interested in analyzing coverage per locus. However, if what you want is a count of how many reads fall in your exonic regions, the simplest way to get that is to use CountReads over your exome targets list.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 30Member

    Dear @Geraldine_VdAuwera Thank you for the reply. I tried the walker CountReads for one of my sample and I see the output in the terminal as 0 reads were filtered out during traversal out of 460289 total (0.00%). This means with my input.bam file and the exome target list (which is the exome baits bed file) I see only 460289 reads getting mapped on the exonic regions. This would be very low as the number of reads for my mapped bam file on the reference genome is 79609248 ( ~80 millions). If am not wrong the number of reads which should map on the exonic region for any normal samples should be roughly around 60-70% of the total reads mapped on the reference genome. But this statistics cannot be correct right? Please guide me where am doing wrong. My intension is to find out of 80 million reads that got mapped onto the reference genome how many got mapped on the exonic regions. So for that I need the bam file, the reference genome and the bed file with the exome baits provided by the company. So you say that this CountReads is the only method to detect how many reads are actually mapping onto the exome? Please enlighten me.

    Regards, Vivek

Sign In or Register to comment.