The current GATK version is 3.4-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Local Realignment around Indels

Posts: 71GATK Dev mod
edited September 2012

## Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.

## Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.

# Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R /path/to/reference.fasta \
-o /path/to/output.intervals \
-known /path/to/indel_calls.vcf


The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
-jar /path/to/GenomeAnalysisTK.jar \
-I <lane-level.bam> \
-R <ref.fasta> \
-T IndelRealigner \
-targetIntervals <intervalListFromStep1Above.intervals> \
-o <realignedBam.bam> \
-known /path/to/indel_calls.vcf
--consensusDeterminationModel KNOWNS_ONLY \
-LOD 0.4

Post edited by Geraldine_VdAuwera on
Tagged:
«1

• Posts: 65Member ✭✭
edited August 2012

Hi

Is there any way to run the IndelRealigner on BAM/SAM files where the multiple possible alignments are reported for a read, rather than the Top Random alignment that BWA uses?

As the aligner we are using generates BAMs which report multiple possible alignments for reads, that have no unique best alignment.
ie if a Read has three equal likely alignments (a repeat for example) the aligner will report all three and flag them as non unique.

However if we try to feed these BAMs into GATK IndelRealigner we get an error due to these duplicate reads.

##### ERROR MESSAGE: Error caching SAM record HWI-ST196R_0530:2:2204:3691:99597#0/1, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.

Is there any way around this, an on the fly filtering option we can add to GATK to discard these reads, or get it to work properly with them?

Thanks

Post edited by aeonsim on

Hi there,
I'm pretty sure we already have a read filter that will work for you: try the NotPrimaryAlignmentFilter ('-rf NotPrimaryAlignment'). See the Read Filters section for more details. And if this doesn't work for you, it should be extremely easy for you to create a filter modeled after this one that will work. Good luck!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 65Member ✭✭

That seems to have worked, thanks!
Might be worth extending your error message to mention it as an option.

• Posts: 7Member
edited October 2012

Hello. I am trying to run the Indel Realigner but the message I get is:

##### ERROR MESSAGE: File associated with name s_1.bwt2.sorted.wRG.recal.interval_list is malformed: Interval file could not be parsed in any supported format. caused by Failed to parse Genome Location string: chr1:58879-59946:95-193


The command line I am using is:

java -Xmx2g -jar GenomeAnalysisTK.jar -T IndelRealigner -T reference.fasta -I input.bam -o output.bam -targetIntervals forIndelRealigner.interval_list -known reference_snps.vcf --validation_strictness LENIENT


There are no errors from the previous step of RealignerTargetCreator. What am I doing wrong?

Post edited by Geraldine_VdAuwera on
• Posts: 7Member

Sorry. Above command has -R reference.fasta, not -T as shown.

The error message says it all: your interval list is malformed. Look closely at the Genome Loc string at the end of the error: chr1:58879-59946:95-193 and ask yourself, what is this supposed to represent, and what's wrong about it?

Geraldine Van der Auwera, PhD

• Posts: 7Member

I am sorry but I don't know what's wrong other than it's obviously not a valid genomic coordinate in the usual sense. Since this was output by RealignerTargetCreator I assumed it would be correctly interpreted by IndelRealigner. I have a contig in my reference that has coordinates chr1:58879-59946 but I do not know what the 95-193 refers to.

Alright, well the 95-193 part shouldn't be there at all, it's not part of the normal makeup of a genome loc, so it seems the interval list was not created correctly by the RTC. Please realize that a more accurate description of your problem would have been "I just ran RealignerTargetCreator and it produced an interval list that looks like it's incorrectly formatted". Technically this is a problem that should be posted under RealignerTargetCreator, not IndelRealigner...

Can you check (by using -XL to exclude the problem location) whether this only occurs a single time in your intervals file, or whether there are other instances?

Another helpful test would be to run your original RTC command again to see if it produces the same malformed locus a second time. This could have been due to a one-time hiccup, so let's check for that before going too deep.

Geraldine Van der Auwera, PhD

I agree with Geraldine that it looks like a one time blip with your file system.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 7Member
edited October 2012

The whole interval list file looks the same. Some contigs have multiple entries, e.g:

chr1:239794835-239795056:6-63
chr1:239794835-239795056:142
chr1:239794835-239795056:218-219


I did rerun the RTC command and the output was the same.

Post edited by Geraldine_VdAuwera on

This is weird -- can you please post your full command line for the RealignerTargetCreator step?

Geraldine Van der Auwera, PhD

• Posts: 7Member
edited October 2012
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I input.bam -o forIndelRealigner.interval_list --known reference_snps.vcf --validation_strictness LENIENT

Post edited by Geraldine_VdAuwera on
• Posts: 7Member

I have attached the output of RealignerTargetCreator.

What happens if you run that command without the --validation_strictness flag?

Geraldine Van der Auwera, PhD

Also, can you tell me what reference you are using? Is it one of the standard ones from our bundle or is it something else? It looks like the locations are formatted in a way that's not compatible with our tools. If you upload your reference .idx file we may be able to verify if that's where the problem is.

Geraldine Van der Auwera, PhD

• Posts: 19Member
edited October 2012

Here is a problem related to realigning around know indel sites that I cannot solve:

After realigning around known indel sites (using the 1000G phase1 vcf file file from your bundle, as well as different copy of the same 1000G phase1 known indels vcf file), I notice some strange results when looking at plots of empirical and observed base-level phred scores pre- and post-indel-realignment. The attached PDF file shows side-by-side empirical v. reported phred score plots generated from both a raw bam file and a bam file after local indel realignment. These bam files are generated from the same sample, but you can see that after indel realignment there is a substantial deflation of empirical phred scores compared to pre-realignment results.

My question is: why do we see such results? The trend appears similar (pre- and post-realignment), but the empirical scores appear capped at ~25. I'm still attempting to better understand what might be happening here, but something about the way I am running indel realignment appears to also be impacting downstream base quality re-calibration results as well (similar, strange empirical vs. reported plot results post-recalibration). If I exclude the indel realignment step, the base quality re-calibration appears to work successfully (a much tighter adherence to the expected empirical v. reported line post-recalibration).

The code I used to carry out the two-stage indel realignment step is as follows:

/usr/bin/java -Xmx2g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -known /net/fantasia/home/gjun/data/indel/1000G_phase1.indels.b37.vcf

/usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTK-1.4-13-gef50e77/GenomeAnalysisTK.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

samtools index /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

I've spent a lot of time trying to troubleshoot this to no avail. I hope I'm just missing something simple, but am very eager to see whether you folks have any thoughts about what might be going wrong!

Best,
Dan

Post edited by hovelson on

Hi Dan,

That isn't the behavior I would have expected. A few thoughts:
1. What do the AnalyzeCovariates (since you are using an older version of the code) plots look like? Can you confirm that they show the same behavior?
2. What if you were to remove the -L argument in the IndelRealigner command? Does the plot for the realigned bam look better?

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member
edited October 2012

I assume by AnalyzeCovariates plots you mean plots after base quality recalibration (is this correct?). If I run recalibration after local indel realignment, I get the very odd plot you see on the lefthand-side of the attached PDF doc. If I run recalibration without any local indel realignment, I get the more sensible plot you see on the right side of the attached doc.

I will re-run local realignment without supplying the target region file and let you know what happens shortly.

Post edited by hovelson on

No, this isn't right. I don't want you to generate these plots with your own tools, but rather I want to see what they look like with our plotting tools. Perhaps you should upgrade to the latest release of the GATK and use BaseRecalibrator which is documented (and supported).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

I'll work to generate these plots with your tools. I am a little unclear, though: how will proceeding with base quality recalibration help us sort out the unexpected results from the local indel realignment?

Thanks,
Dan

• Posts: 19Member

For what it's worth, our tools do still show the same deflation in empirical phred score values for bam files created from local realignment when removing the -L target file option.

I just want to confirm that the problem is not with your plotting tools.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

Ok, thanks Eric! I've updated to the latest release of GATK, and am currently re-running indel realignment and base quality recalibration on this sample. I'll generate the QQ plots using your tools, and let you know what I find.

You shouldn't need to rerun the re-alignment - just the recalibration.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

So (using the latest GATK release) the target interval creation step (for local realignment) ran smoothly. However, when running IndelRealigner I receive a stack trace error that is as follows:

INFO  12:07:59,775 ArgumentTypeDescriptor - Dynamically determined type of /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed to be BED
INFO  12:07:59,811 HelpFormatter - ---------------------------------------------------------------------------------
INFO  12:07:59,811 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g0f021e6, Compiled 2012/10/10 19:42:23
INFO  12:07:59,812 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:07:59,812 HelpFormatter - Program Args: -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -L /net/amd/hovelson/GSK/studydata/bed/SureSelect_All_Exon_50mb.hg19.sites.XY.unique_exons.bed -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam
INFO  12:07:59,813 HelpFormatter - Date/Time: 2012/10/11 12:07:59
INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
INFO  12:07:59,813 HelpFormatter - ---------------------------------------------------------------------------------
INFO  12:07:59,892 GenomeAnalysisEngine - Strictness is SILENT
INFO  12:07:59,961 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:07:59,982 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO  12:08:01,566 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  12:08:02,640 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -95
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-13-g0f021e6):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: -95
##### ERROR ------------------------------------------------------------------------------------------


The code used to generate this is as follows (I tried with both a -L genomic interval file and without):

/usr/bin/java -Xmx4g -jar /net/amd/hovelson/bin/GenomeAnalysisTKLite-2.1-13-g0f021e6/GenomeAnalysisTKLite.jar -T IndelRealigner -R /net/amd/hovelson/umake/ref/hs37d5.fa.gz -targetIntervals /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.bam_indels.intervals -I /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.dedup.bam -o /net/amd/hovelson/GSK/studydata/NEAVAbams/bam/KP_083_TAGCTT_L008.indel_realign.bam

Any thoughts?

I noticed you are using a BED interval file. Have you checked the coordinates/ indexing? BED files are indexed differently, this may just be an "off by one" error.

Geraldine Van der Auwera, PhD

• Posts: 19Member

I had wondered the same, except I've tested the local realignment without using the BED interval file and still receive the same error. The BED file I have been using has start position as 0-based, end position as 1-based.

It looks like your BAM file is malformed. Can you please run it through Picard's ValidateSAMFile?

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

ValidateSAMFile Results:

I have >50 million reads in the BAM file, and ~98,000 records are flagged with an error essentially the same as below:

ERROR: Record 1234, Read name <READ_NAME>, Mate negative strand flag does not match read negative strand flag of mate

There's also ~50 errors that include one of the two following errors:

ERROR: .....MAPQ should be 0 for unmapped read
or
ERROR: .....Mate unmapped flag does not match read unmapped flag of mate

Is this type of mismatched read-pair information a likely culprit for causing problems at the indel realignment stage?

• Posts: 19Member

And before I forget - THANK YOU very much for your help/suggestions in trying to sort all this out. I sincerely appreciate it!

• Posts: 19Member

In playing around a little bit with some of these command line arguments, I realized that when running BaseRecalibrator (for example) with an unzipped reference FASTA file, I was able to get it to run just fine. If I run with a gzipped reference file - I get the same sorts of stack error problems described above for the IndelRealigner.

Could a gzipped reference file be the problem here?

Yes, that looks like the culprit.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member
edited October 2012

Okay - it's still a bit strange, though, because I can run the first step of local indel realignment without issue. It only presents problems for the 2nd step of indel realignment (or BaseRecalibrator).

Post edited by hovelson on

No, actually we are moving away even allowing zipped reference files at all, because the code we use to read it doesn't work so properly...

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard

• Posts: 19Member

Okay - thanks very much for the help/feedback with this issue. It's greatly appreciated!

• Posts: 6Member

The IndelRealigner works fine with one know indel target intervals. However, when I try to use two known indel target intervals with two separate -targetintervals arguments and -two separate -known arguments, I get the error of too many values for the -targetintervals argument. Should I run the IndelRealigner for each known target intervals separately or there is a way to combine them in a single run.Thanks,

You can give it a file containing a list of target intervals as explained here:

Geraldine Van der Auwera, PhD

Just combine (e.g. with 'cat') the files into a single file first.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 6Member

I tried to use the MergeIntervalLists walker before, but it seems this walker is not included in the public release of the GATK. If I use cat for combining the intervals, then I will have many overlapping intervals in the final list. Is that ok for indelrealigner walker? Do I need to combine the vcf files as well or I can use them with two separate -known arguments?Thanks,

Actually, let's step back for a second. Why do you have 2 input target interval lists? That's not right. You should only have one input list, otherwise you have done something wrong...

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 6Member

Good question. Thank you for noting that. I do not know why I made a separate interval list for each vcf file. I think that I should use the two vcf files together with RealignerTargetCreator walker and make a combined interval list for both vcf files, shouldn't I?

Yes!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 5Member
edited November 2012

I'm working on a pipeline for a very high coverage targeted resequencing project using GATK 2.2. We are interested only in variants within the targeted region.
I tested two pipelines one with -L targeted_regions.bed for IndelRealigner, RealignerTargetCreator, BaseRecalibrator, PrintReads and one without -L for those steps. Out of about 430 variants, there are 3 variants that are discordant between the two variant sets (all three with >30 QUAL and >100 DP).

Which set should I trust? is -L necessary for such targeted resequencing projects? or I should avoid it for certain commands.

Post edited by Geraldine_VdAuwera on

Hi there,

There is no strong argument for either side (and what you're seeing is a very minor difference) but I can tell you that we don't use intervals until the PrintReads step (after Base Recalibration).

Geraldine Van der Auwera, PhD

• Posts: 4Member

Hi,

I just recently started using GATK 2. I tried but I couldn't quite figure out where to find answers for my questions from the documentations, so I'll just aks straight up.

I'm working with 2 different Gasterosteus aculeatus lineages. Even though there is a genome, I haven't found a GVF file for the species. The sequences are from illumina. I've following throught the pipeline for local realignment with creating targets and then realigning followed with baserecalibration. I ran into few questions:

1) When I ran RealignerTargetCreator with a VCF of my own for these samples, I endedn up having in addition to my previously found indels (from samtools) alot of other targets as well. Some of them did seem like candidates even without an indel, but some of them were fine to my eye. What could be causing this? This is what I ran
java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R gasAcu_combinedbac_inv7.fa
-I readgroup_FF1_inv7c.sorted.bam -L Capture_Target_Regions.intervals -o intervals2_FF1_inv7c -known FF1_inv7c_var.vcf
-fixMisencodedQuals

2) When I ran IndelRealigner using the targets created in the previous, I couldn't any more add my vcf as the --knownAlleles. It ran fine without it though, but when I added it in as an argument it errored with this:

##### ERROR ------------------------------------------------------------------------------------------

The code I tried running was
java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T IndelRealigner -R gasAcu_combinedbac_inv7.fa
-I readgroup_FF1_inv7c.sorted.bam -targetIntervals intervals2_FF1_inv7c.intervals -L Capture_Target_Regions.intervals
-knownAlleles FF1_inv7c_var.vcf -o realigned_FF1_inv7c.bam -fixMisencodedQuals

3) I'm know at the stage of BaseRecalibrator, but my problem is that I'm not sure what to add as the -knownSites based on previous questions and the fact that I know there are samples for which I have the case of SNPs being Ref/Alt and Atl1/Alt2. I wasn't sure after reading the documentation, should I use -knownSites to mask potential SNPs from recalibration and if yes, then can it be my own VCF based on unrealigned and uncalibrated samtools results or should it be something else?

I appreciate any suggestions here.

1) Don't worry about it. There are various things the RTC looks for when determining what regions may need to be realigned. Even if a region is flagged that doesn't really need to be, it won't hurt or mess up anything.

2) The argument name is slightly different for IR than for RTC: --knownAlleles / -known. Always worth looking up the exact arg names in the tech docs if you're getting an error.

3) You should probably filter your knowns VCF to keep only the high quality calls, otherwise you risk over-masking. This is covered in the Best Practices documentation.

Geraldine Van der Auwera, PhD

• Posts: 4Member
edited January 2013

I'll check my argument names and filter my VCF before running the BaseRecalibrator.

Post edited by hmviit on
• Posts: 33Member

I had a conceptual question regarding "strand discordant loci." In the local realignment video tutorial, there was a slide on this concept whereby a unique albeit artefactual SNP cluster seems to be biased towards the ends of either forward or reverse strands due to the presence of an indel. It is not quite clear to me why exactly there is this strand bias. Let's say we have two identical reads (same sequence) which map to the edge of an indel, but one read is read from the forward strand (sequencing towards the indel let's say) and then another read is read from the reverse strand (sequencing away from the indel let's say). Are these two reads not equally likely to map to the same position on the indel? Does it have something to do with the fact that bases at the end of the reads tend to have lesser qualities (and thus a greater chance of mapping to an indelic region with artefactual mismatches?)

Regardless of this strand discordance, I understand that a high SNP density is also a huge flag for "suspicious" regions requiring realignment but I'd like to understand where this strand bias comes from.

Thanks,

MC

Hi there,

This is an interesting question, but it's more so a question on how aligners work rather than how to use the GATK, so I'm afraid we can't spend time discussing it. I think you would do better to start that discussion somewhere like SeqAnswers.

That said if someone in the community would like to discuss this with you they should feel free to do so here; we're quite happy to host the discussion even if we don't have time to participate.

Geraldine Van der Auwera, PhD

• Posts: 18Member

Hi there,
Is it possible to perform RealignerTargetCreator and IndelRealigner with multiple samples at once in the same BAM file or does it need to be done one by one? In case it is possible to perform these steps with multiple samples at once, is there an optimal number of samples?

S.

Hi @bioSG,

You can realign multiple samples at the same time, yes. They can be in the same file or in separate files that you pass in together, as you prefer.

There is no hard limit to the number of samples you can realign together, but it is computationally expensive, so the number you can realistically process depends on the capabilities of your computer/platform.

Geraldine Van der Auwera, PhD

• Posts: 4Member

Hi,

In case I run the realigner with multiple samples at once, will it try to find a consistent realignment taking all the samples at a certain site into account or will it treat the samples independently?

Yes, it will apply the same consensus to all the samples.

Geraldine Van der Auwera, PhD

• Posts: 18Member

Going back to my question on the realigner with multiple samples at once, is there the risk of not performing the realignment in regions that would have been realigned doing a one by one sample realignment? It looks like realigning with multiple samples at once could decrease the percentages of mismatches at a locus having high entropy, and thus, further tuning of the entropy threshold would be needed.

That's right, which is why if you want to be really thorough you can first realign per-sample, then do a second round of realignment on multiple samples. That way you're sure that individual samples are realigned internally even if the region doesn't get triggered when running multi-samples. But that takes a whole lot of time & compute, so it's not very popular.

Geraldine Van der Auwera, PhD

• Posts: 7Member

Is there anyway to reorder the contigs for the known indels the Broad provides in their hg19 bundle? The hg19 reference genome I have been using starts chr1, chr2, chr3...chr7, chrX, chr8. However, the known indel contigs (1000G and Mills set) start chrM, chr1, chr2...chr7, chr8, chr9. I am trying to run RealignerTargetCreator (and later IndelRealigner) and I get an error stating that GATK doesn't function with different contig orders, it is "unsafe."

I would much rather reorder the contigs for the known indel VCF once than having to reorder all my BAM files everytime. Or is there another known indel VCF with the chr1, chr2, chr3...chr7, chrX, chr8 order I can download?

Thanks!

I'm afraid our hg19 resource files all start with chrM too. Maybe someone else has a differently ordered vcf they can share with you; otherwise we have heard of a script tool called VCFsorter that will allow you to reorder the contigs in your VCF according to a dictionary, using Picard: https://code.google.com/p/vcfsorter/

Just keep in mind that this is not one of our tools so we can't guarantee results nor provide support for that particular tool. Good luck!

Geraldine Van der Auwera, PhD

• Posts: 7Member

As long as the -known [file] is in VCF format, will the RealignerTargetCreator function properly, regardless of chromosomal order? I want to make sure this will not hinder GATKs functionally.

No, the order of the contigs has to be the same for all the files involved, otherwise GATK will freak out. This is an important requirement of how the GATK engine retrieves data from the files.

Generally speaking, you should always use data resources that match the reference. From when you start designing your experiment, you should plan which reference you will use for alignment to go with which data resources you will use. Choosing afterwards leads to this kind of problem of having to reorder things around, it's messy. Sorry to be the bearer of bad news...

Geraldine Van der Auwera, PhD

• Posts: 5Member

Hi, I used the RealignerTargetCreator to create a list file (sample.intervals) with intervals to be used with the IndelRealigner. But when I call the command I get a error saying that my argument 'targetIntervals' has to many values. There is only one argument given to this parameter:

java -jar GenomeAnalysisTK.jar -T IndelRealigner -I ... -o ... -targetIntervals sample.intervals

Do you know some possible work around?

Thanks.

Hi @mjaeger,

Can you please post the full command that you ran?

Geraldine Van der Auwera, PhD

• Posts: 5Member

Yes of cause.

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
-I results/exom/HOCsampleA.bam -o results/exom/HOCsampleA_realigned.bam -targetIntervals sample.intervals

Here is the head of the intervals file:

chrM:302-303
chrM:516-517
chr1:132032
chr1:448309
chr1:523157-523158
chr1:533066
chr1:535082
...

The file contains about 62.000 positions/lines.

Thanks.
Marten

Hi Marten,

That's very odd. You could try running this command again with a copy of the targets file containing only a few lines, and see if that works properly. If it does, it might mean that something is wrong in the original targets file. If it does not, then the problem is elsewhere. Can you also make sure that you are running the latest version of the GATK?

Geraldine Van der Auwera, PhD

• Posts: 5Member

Hi Geraldine,

I tried to run the command with a copy with only the first 1.000 lines of the intervals file but got the same error message. So maybe I will rerun the analysis on an other machine.

Marten

can you copy & paste the GATK error message? There shouldn't be a limit on the number of intervals.

• Posts: 11Member
edited May 2013

Hi,
Indel realigner didn't give me any error messages whe I used it like this:
java -Xmx8g -jar ~/Java/GenomeAnalysisTKLite-2.2-16-g2cc9ef8/GenomeAnalysisTKLite.jar -T IndelRealigner -R /prj/gf-nugget/processed-data/data/Assemblies/beet/RefBeet-1.0.chromosomes/RefBeet-1.0.gendbe.chrmosomes.fa -I merged_Red_RG.bam -targetIntervals forIndelRealigner.red.intervals -o merged_Red_RG_realigned_CHR2.bam --maxReadsForConsensuses 200 --maxReadsForRealignment 50000 --maxReadsInMemory 300000 -L lcl\|Chr2:6000000-7500000

But for some indels, it doesn't quite seem to work, since none of the reads are realigned. The Problem might be, that it can not clearly be decided for some reads, if there is an indel or not, the mapping is equally fine in both cases. The indel was in the list created by RealignerTargetCreator.

Is there an option, so that an ambiguous read is rather realigned than not realigned (which would essentially mean to prefer a gap)?

I attached a picture of the region.

Best regards,

David

Post edited by DavidRies on
• Posts: 11Member

And the attachment ...

Hi @DavidRies,

One thing you can try is change the --consensusDeterminationModel / -model (see the documentation page for the different options). You should also check how much depth you have at that position -- if there's too much depth you may be maxing out the maxReads parameters.

Geraldine Van der Auwera, PhD

Hi David,

First off, you should re-run the IndelRealignment with the latest version of the GATK (we do not support older versions) and the default command-line arguments. See how that fares. If it still doesn't catch this indel then you should also check whether there's another indel nearby in the same target interval (only 1 indel is allowed to get realigned in an given interval).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member
edited June 2013

Hi,
first of all thanks for your help.
I did rerun the IndelRealignment with GenomeAnalysisTK-2.5 and the default arguments. I only have a coverage of about 16x in that region, and didn't get an error complaining about to many reads. The nearest next indel is about 1900 bases away. Is that to close? I couldn't find an option defining this parameter.
This might be a little bit off-topic, but the UnifiedGenotyper further downstream identified the indel correctly.
Still, I would like to see the realignment in my IGV.

Best regards,

David

Post edited by DavidRies on

1900 bases away should be far enough to not interfere (or Eric will correct me).

You can try passing the indel as a known indel when you do the realignment. I think this should force the realignment you want to see.

Geraldine Van der Auwera, PhD

I agree with Geraldine. If all you want is a nice IGV screenshot, you can also just pass in the small interval around the indel for the -targetIntervals argument.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 103Member
edited June 2013

Hi,

I am trying to run the IndelRealigner for my new dataset and revisit this page for update and realized that the command has slightly changed in particular, now it has options as:

--consensusDeterminationModel KNOWNS_ONLY
-LOD 0.4


Could you explain a bit about why we have these options now and what circumstance we need use it or not? Is it a default setting and we need use it most of the cases?

Thanks a lot!

Mike

Post edited by Geraldine_VdAuwera on

Hi @mike,

For an explanation of these two arguments, please see the tech doc page. I'm happy to clarify if there's anything there you don't understand, but please read that first.

Geraldine Van der Auwera, PhD

• Posts: 103Member
edited June 2013

Hi, Geraldine:

Thanks for the prompt response. In fact, that's exactly what I am trying to ask. The tech doc page you refer to did give some default value, e,g

--LODThresholdForCleaning / -LOD ( double with default value 5.0 )


but the example as in my first post is

-LOD 0.4


which is much smaller than the default, and the tech doc page mentioned:

Note that this number should be adjusted based on your particular data set. For low coverage and/or when looking for indels with low allele frequency, this number should be smaller.

The question is how smaller. The dataset I have may also need to look for indels with low allele frequence but my coverage is high (Illumina) and so if I used -LOD 0.4, what would be the pros and cons? Also I do not want this bother my SNP detection as well, any impact on that? I indeed did not use the option in my old dataset for this step, and apparently used the default value. I was concerned the change in the example code may mean some insight there or some cons not using it.

Also for option: --consensusDeterminationModel, it has three possible values as below and default is USE_READS not the one in your example (KNOWNS_ONLY)

3 possible value:

KNOWNS_ONLY
Uses only indels from a provided ROD of known indels.
USE_SW
Additionally uses 'Smith-Waterman' to generate alternate consenses.

My question for this is what is the pros and cons for using these different values particular for my dataset I just mentioned. For example, I used to not using this option and so I must have used the default (USE_READS), it sound more aggressive, is that a better approach disregard of the time cost or performance issue?

Mike

Post edited by Geraldine_VdAuwera on

I see. I'm not sure when we changed the example value, but you shouldn't take it as a sign of anything important. The values we give in examples like this are only for demonstration purposes. The default values that are pre-set are reasonable values and are what we normally use most for our projects, but if your project is different, you may need to adapt them. There are a few places where we actually recommend values (like in VQSR and filtering) but even there the problem is that there is no single good value for all projects, and there are so many types of projects that it would be too hard to enumerate all the possible configurations. So I really can't give you numbers.

What I can do is clarify what the tradeoffs are. The main tradeoff is usually between performance and accuracy.

Lowering the LOD means that the realignment "cleaner" subprogram will be triggered in more cases. That is good because it increases the chance that your data will be realigned correctly around your rare indels (and that will give better results for SNP detection). But it's also bad because if you have high depth, that will mean the program has to do a lot more work, and it will take more time.

Using the reads for generating consensus alignments is better than just known indels because the result will be more representative of your data. In your case, especially since you are looking at rare indels, the USE_READS option will likely give more accurate results. But again, the deep coverage means that it will be more work and take more time. Full SW realignment would take even more time and is usually not necessary -- unless you used a gapped aligner to map your reads, the degree of improvement tends to be minor compared to the performance burden.

Ultimately the best way to find the compromise that works for your data is to experiment with a range of settings on a test region, and see what the results look like. And of course you have to decide for yourself how much time you're willing to take to do this.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Geraldine:

Thanks a lot for your advice. Appreciated very much! That's all I wish to know

Best

Mike

• Posts: 103Member

Hi, Geraldine:

Just a quick clarification question: you mentioned "Using the reads for generating consensus alignments is better than just known indels because ...", do you specifically mean USE_READS or USE_SW option? the document mentioned:

USE_SW
Additionally uses 'Smith-Waterman' to generate alternate consenses.

You seem mean USE_SW, right? What if USE_READS, what is the difference compared to USE_SW. We used BWA for the mapping, which one would be the better choice here?

Thanks and best

Mike

Hi Mike,

No, I do mean you should use USE_READS. The name is a little misleading; in this mode, the reads are not realigned de novo when considering consensus, but the tool will consider indels that are encoded in the CIGAR strings of reads when evaluate possible consensus. In USE_SW mode, the tool assumes that the original alignment provided by the mapper cannot be trusted, and the reads are actually realigned de novo before evaluating possible consensus.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Hi, Geraldine:

Thanks a lot for the clarification, which is very helpful!

Best

Mike

• Posts: 2Member

Hi,
I have known indel sites and I am using GATK2.5 to realign my reads only around those sites. At some loci there are several positions where the indel could be reasonably aligned and different reads have the indel in different positions.

Is there a way to force a consensus such that all the reads with the indel have it aligned at the exact same position? I thought that using --consensusDeterminationModel KNOWNS_ONLY would do this but it doesn't.

Thanks,
Emily

Hi Emily,

The expected behavior of the KNOWNS_ONLY mode is indeed to force realignment to the given indel. Perhaps this indel is in a complex region? Do you see any realignment at all? If there is too much complexity in the region or excessive read depth, the realigner may give up on realigning the region in order to avoid crashing. If so there are some options you can tweak to override the default safeguards...

Geraldine Van der Auwera, PhD

• Posts: 2Member

Thank you for your quick response. I am not seeing any realignment. There are 300 reads at the site I am considering. Is this considered excessive depth? There are some repeats around the indel and therefore more than one way for reads with the indel to align. I am not sure if that is what is causing the problem.

Which parameters can I change to override the default safeguards you mention and force a realignment to the known sites I am providing?

300 reads shouldn't be excessive. But it's difficult to say without seeing the region why the realigner is quitting.

Geraldine Van der Auwera, PhD

• Posts: 4Member

How does the computational time of the indel realigner scale with the number of samples / reads? Is it a linear relationship?

• Posts: 23Member
edited August 2013

@ebanks said:
Hi there,
I'm pretty sure we already have a read filter that will work for you: try the NotPrimaryAlignmentFilter ('-rf NotPrimaryAlignment'). See the Read Filters section for more details. And if this doesn't work for you, it should be extremely easy for you to create a filter modeled after this one that will work. Good luck!

Can you paste the website address here? I can't find the reads filtered section that has details for solving the problem. Just an empty
NotPrimaryAlignmentFilter page
thanks

Post edited by Geraldine_VdAuwera on

Ah, that's the document you want but the content is not correct/missing. I will fix it. There is no other resource available in the meantime, sorry. What exactly are you having problems with? Perhaps I can give you the info you need.

Geraldine Van der Auwera, PhD

• Posts: 23Member

when I ran the IndelRealigner command, I caught up with the question --"MESSAGE: Error caching SAM record FCD054AACXX:4:2102:2977:12785#TTAGGCAT, which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present."
And I don't know how to fix it.
Thanks

Oh, I see. Did you try adding -rf NotPrimaryAlignment to your command as ebanks suggested?

Geraldine Van der Auwera, PhD

• Posts: 2Member

I found mapping quality of some reads reach 70 after local realignment, what does it mean? I remember the maximum MQ is 60 in BWA?

Best regards,

Lucius

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 2Member

@ebanks said:

Thank you. That means a read with Q30 will increase the Q to 40, if it was realigned. It's very helpful for me!

Best regards,

Lucius

• MilanPosts: 33Member
edited November 2013

I have a query that we do the local realignment around the indels with the known sites because it identifies most parsimonious alignment along all reads at a problematic locus. InDels in reads among the near ends can cause the mappers into mis-aligning with mismatches. These artifactual mismatches can harm base quality recalibration and variant detection. However if one is not interested in performing this step as one might not be interested in the InDel assessment downstream but if one wants a local realignment of the mapped reads around the exonic regions using the bed files containing the probes that was provided by the company to check the quality of reads that mapped only to the exonic regions. Can this be done? I am particularly interested in trying to see how many of my mapped reads which aligned to the whole hg19 genome is actually getting mapped on the exonic region, then we can assess the coverage of the reads in that way. I believe we can do this also by using the DepthofCoverage walker and use it on the local indel realigned file along with the target interval bed file which captures the probe for the exonic regions to find out the statistics of the mapping. But if we skip the local realignment around the indels and rather perform alignment around the exonic region with the bed files before variant calling can we do that? Is there any walker in the GATK 2.3 which can perform this task?

Thanks,

Vivek

Post edited by vivekdas_1987 on

To be clear, realignment around indels is not just useful if you are interested in calling indels; it will also improve the results of SNP calling. We do not recommend skipping this step at all.

As to the other part of your question, there is no tool in GATK to do local realignment in exon regions as such; but the HaplotypeCaller includes a de novo reassembly step that will improve the alignment in the intervals of interest before it proceeds to call variants.

Geraldine Van der Auwera, PhD

• MilanPosts: 33Member
edited November 2013

Thanks a lot for the quick response , that makes sense. I missed out on the part of local indel realignment improving the quality of the SNP. But yes I guess HaploypeCaller and Unified Genotyper both require the interval bed files which actually does a reassembly step to improve the alignment. I would say though I am using 6 samples I went ahead with UnifiedGenotyper, I would try with haplotype caller as well to see the changes in my results for calling the true variants. But I was more interested in getting some sort of coverage measures for my mapped reads around the exonic regions and if am not wrong it can be only done with the DepthofCoverage walker right if I use it with the bam files and the interval bed files as I was asking you last day when I was trying to gain some statistical measures to do a quality control check on my mapped data. Please let me know.

Regards,

Vivek

Post edited by vivekdas_1987 on

You can use DepthOfCoverage or DiagnoseTargets to evaluate coverage of your exonic regions; they have slightly different capabilities and usefulness so I encourage you to try both and see which one gives you the most useful information.

Geraldine Van der Auwera, PhD

• MilanPosts: 33Member

Thank you very much for the reply. I will perform it and keep it posted in the forum.

Regards,
Vivek

• MilanPosts: 33Member

I have run the DepthofCoverage and the DaignoseTarget walker for my tumor samples and its IPS lines. I am finding it difficult to settle with the Diagnose Targets output but the DepthofCoverage outputs are quite clear. I have some query I would like to get it clarified. I was looking at the file _cumulative_coverage_counts which helps to draw a histogram with cumulative frequency of all bases that have been mapped on the exomes. Here I see two columns at the beginning

                gte_0    gte_1


NSamples_1 80050421 64522700

Can you tell me what does gte stands for and if I want to understand what the highest number of reads that got mapped on the exonic region with my samples it should the 64 million read count in the second column right? As this number varies across all samples but the gte_0 is same across all the samples and the same 80 million reads is showing up in all samples. So if I think of the reads that mapped ultimately on the exome it should be second column right? Please suggest me.

Regards,

Vivek

Hi Vivek,

Please note that DepthOfCoverage looks at coverage per locus, not where reads are, so these counts refer to loci, not to reads. gte stands for "greater than or equal to". Each gte_ column represents the left endpoint of a bin in the cumulative coverage histogram. So in your example, 80050421 is the number of loci that are covered by 0 or more reads, 64522700 is the number of loci covered by 1 or more reads, and so on.

We generally are mostly interested in analyzing coverage per locus. However, if what you want is a count of how many reads fall in your exonic regions, the simplest way to get that is to use CountReads over your exome targets list.

Geraldine Van der Auwera, PhD

• MilanPosts: 33Member

Dear @Geraldine_VdAuwera
Thank you for the reply. I tried the walker CountReads for one of my sample and I see the output in the terminal as 0 reads were filtered out during traversal out of 460289 total (0.00%). This means with my input.bam file and the exome target list (which is the exome baits bed file) I see only 460289 reads getting mapped on the exonic regions. This would be very low as the number of reads for my mapped bam file on the reference genome is 79609248 ( ~80 millions). If am not wrong the number of reads which should map on the exonic region for any normal samples should be roughly around 60-70% of the total reads mapped on the reference genome. But this statistics cannot be correct right? Please guide me where am doing wrong. My intension is to find out of 80 million reads that got mapped onto the reference genome how many got mapped on the exonic regions. So for that I need the bam file, the reference genome and the bed file with the exome baits provided by the company. So you say that this CountReads is the only method to detect how many reads are actually mapping onto the exome? Please enlighten me.

Regards,
Vivek