The current GATK version is 3.7-0
# IndelRealigner Not attempting realignment in interval

Freiburg Member Posts: 9

Hi,

First of all, sorry if this is not the right forum for my question; I couldn't find a place where it really seemed to fit.

On the most recent data I received, I ran into messages like:
"INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval chrY:27782626-27782640 because there are too many reads."

I've read about the option "--maxReadsForRealignment" and set it to 100000. I had a look at the BAM file and saw that at some positions I've got more than 150000 reads. The average read depth is about 50. I'm quite new to the field of NGS and thus am not sure what that means.

How exactly does the program handle this, since only small parts are actually covered too highly? Are only those few bases not realigned, or is everything in that step skipped? As an example:
chrY:27782626-27782640 is an interval of only 14bp. The whole Y chromosome is covered by only two steps shown in the log file. What is not realigned: one part of the Y chromosome or just those 14bp?

Besides these technical questions: is it safe to just increase "--maxReadsForRealignment" until all intervals are realigned, or is there normally a general failure (e.g. PCR) behind this, so that it seems safer to ignore those intervals?

If other users have encountered the same issue, it would be nice to hear how they have handled it.

Thanks a lot
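
To see exactly which intervals were skipped, the log can be scanned for this message. The following is a minimal sketch that assumes the log line format quoted in the post; the function name `skipped_intervals` is made up for illustration and is not part of GATK.

```python
import re

# Pattern built from the single INFO line quoted in the post. The exact wording
# is an assumption and may differ between GATK versions; an interval printed
# without an end coordinate would not match.
SKIPPED = re.compile(
    r"IndelRealigner - Not attempting realignment in interval "
    r"(?P<contig>[^:\s]+):(?P<start>\d+)-(?P<end>\d+) "
    r"because there are too many reads"
)

def skipped_intervals(log_lines):
    """Return (contig, start, end) for every interval the log reports as skipped."""
    found = []
    for line in log_lines:
        match = SKIPPED.search(line)
        if match:
            found.append((match["contig"], int(match["start"]), int(match["end"])))
    return found

log = [
    "INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval "
    "chrY:27782626-27782640 because there are too many reads.",
]
print(skipped_intervals(log))
```

Collecting the intervals this way makes it easier to check each one individually instead of reading the log by eye.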

• Freiburg Member Posts: 9

Hi,

thanks a lot for the answer. All intervals with this message actually were repeating units of either GT or AC. So I can definitely ignore the message safely.

One more question concerning -dcov. Am I correct that, since it is a general option listed in the command-line options, it is usable for each step of the GATK pipeline? And since it actually removes all reads in excess, is it only needed once, because all downstream files will contain only the reduced number of reads?

And last: the documentation "CommandlineGATK" says that -dcov is biased, and that the option -dfrac should be used if an unbiased approach is needed. Sadly there is almost nothing written about the -dfrac option. Am I correct that the bias mentioned for -dcov is the fact that it cuts off only the excess reads, so that only a small number of reads is affected and the proportions of read depth between regions change? Based on this, I would interpret the -dfrac option (sadly not well documented) as follows: I choose a percentage between 0% and 100% [0.0-1.0] and the number of ALL reads is reduced accordingly. This keeps the proportions between reads unbiased, but decreases the number of reads everywhere, even where coverage is not high.

Thanks again.
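
The difference described in this post can be illustrated with a toy calculation. This is only a sketch of the interpretation given above (capping per-position depth at a target versus keeping a fixed fraction of reads everywhere); it is not GATK's downsampling code, and the function names and depth values are hypothetical.

```python
def cap_depth(depths, target):
    """-dcov-like idea: leave positions under the target alone, cap the rest."""
    return [min(d, target) for d in depths]

def scale_depth(depths, fraction):
    """-dfrac-like idea: keep the same fraction of reads at every position."""
    return [round(d * fraction) for d in depths]

depths = [50, 150000, 40]            # made-up per-position depths
capped = cap_depth(depths, 1000)     # only the extreme position changes
scaled = scale_depth(depths, 0.1)    # every position shrinks by the same factor
print(capped)
print(scaled)
```

Under this reading, capping changes the ratio between the high-coverage position and its neighbours, while scaling preserves the ratio at the cost of depth everywhere.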

• Member Posts: 11

Hi Geraldine, I am wondering whether you provide any tool to sanity-check the regions with 'too many reads'?

Hi @apallav2,

Have a look either at DepthOfCoverage or DiagnoseTargets (for exomes) in the QC section of the Technical Docs.

Geraldine Van der Auwera, PhD

• Member Posts: 11

Oops, sorry, I should have been clearer.

Is there any method you (the specialists mentioned above) would prefer for assessing the 'too many reads' areas (apart from DoC and DT, as they just give me counts, which is what the 'too many reads' message already indicates)? I guess I am after an automated way to look at these areas before ruling them out as bad reads.

Thanks!

@apallav2, our tools can only tell you whether intervals have too many reads or not. Something like DT can tell which intervals you should troubleshoot at the sequencing level, or ignore in your analysis. We don't have anything that will tell you why there are too many reads somewhere (e.g. presence of repeats), or how safe it is to still use them anyway, if that's what you're looking for.

Geraldine Van der Auwera, PhD
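
One low-tech check, in line with the GT and AC repeats reported earlier in the thread, is to test whether the reference sequence of a flagged interval is dominated by a simple dinucleotide repeat. The sketch below is not a GATK tool; the function name is hypothetical, and it assumes the reference sequence has already been extracted as a plain string (for example with `samtools faidx` on your own reference file).

```python
import re

def longest_dinucleotide_run(seq):
    """Return (unit, length_in_bases) of the longest tandem 2-mer repeat found.

    A homopolymer such as AAAA also matches, reported with unit "AA".
    Returns ("", 0) when no 2-mer is repeated back to back.
    """
    best = ("", 0)
    # ([ACGT]{2})\1+ matches a 2-mer that is immediately repeated at least once
    for match in re.finditer(r"([ACGT]{2})\1+", seq.upper()):
        run_length = len(match.group(0))
        if run_length > best[1]:
            best = (match.group(1), run_length)
    return best

example = "AAGTGTGTGTGTCC"   # made-up sequence, not taken from chrY
print(longest_dinucleotide_run(example))
```

A long run relative to the interval length suggests a repeat-driven pileup, but it does not say whether the reads there are safe to use.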

• Singapore Member Posts: 19

Hi Hoppmann, may I ask how you know how many reads cover a certain position in your BAM file?

• Freiburg Member Posts: 9

Hi liuxingliang,

I wrote down the position mentioned in the message and then kept increasing the number given to "--maxReadsForRealignment" for as long as the message appeared. Once all reads were being realigned, I loaded the BAM file in my genome browser (I use the one from GoldenHelix) and looked at the position.
When I click on the position, the number of reads at that position is shown in the sidebar.

Hope this helps.

Greetings, Anselm
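
As an alternative to a genome browser, per-position depth can also be read from a text table. The sketch below assumes the three-column, tab-separated output of `samtools depth` (contig, 1-based position, depth); the function name and the example depths are made up for illustration.

```python
def positions_above(depth_lines, threshold):
    """Yield (contig, position, depth) for rows whose depth exceeds threshold."""
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        contig, pos, depth = line.split("\t")[:3]
        if int(depth) > threshold:
            yield contig, int(pos), int(depth)

# Made-up rows in the assumed contig<TAB>position<TAB>depth layout
rows = ["chrY\t27782626\t150312", "chrY\t27782627\t48"]
deep = list(positions_above(rows, 100000))
print(deep)
```

Older samtools releases cap the reported depth unless the `-d` option is set, so it is worth checking the documentation for your version before trusting very high values.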

• Member Posts: 27

I am getting similar errors, too. But I think it is because my mean mapped coverage is about 3000x for my targeted sequencing data. Does that mean it is better for me to set a higher --maxReadsForRealignment if I don't care about runtime?

Oof, that's a lot of depth. You can either allow it using the max reads setting, or downsample the data preemptively, depending on what you care about most (time or preserving depth).

Geraldine Van der Auwera, PhD

• Member Posts: 4

I am getting the same message from IndelRealigner, but my BAM file comes from reads simulated from a known reference sequence. The only reason I can see for this message is that my sequence coverage is very high (45k on average). Is it possible to do the indel realignment by adjusting parameters? Thank you.

@PG ID:GATK IndelRealigner CL:knownAlleles=[(RodBinding name=knownAlleles source=mit.vcf)] targetIntervals=indelrealign.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=KNOWNS_ONLY entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
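
The header line above records maxReadsForRealignment=20000, which is below the 45k average coverage mentioned in the post. A small sketch for reading such a value back from a `@PG` line follows; the helper name `pg_setting` is hypothetical, and the simple key=value matching is an assumption that fits the excerpt shown here.

```python
import re

def pg_setting(pg_line, key):
    """Return the value recorded for `key` in a GATK @PG CL string, or None."""
    # (?<!\w) keeps `key` from matching the tail of a longer option name
    match = re.search(r"(?<!\w)" + re.escape(key) + r"=(\S+)", pg_line)
    return match.group(1) if match else None

# Verbatim excerpt of the @PG line quoted in the post
pg_excerpt = (
    "maxReadsInMemory=150000 maxIsizeForMovement=3000 "
    "maxPositionalMoveAllowed=200 maxConsensuses=30 "
    "maxReadsForConsensuses=120 maxReadsForRealignment=20000 "
    "noOriginalAlignmentTags=false"
)
print(pg_setting(pg_excerpt, "maxReadsForRealignment"))
```

Checking the header of an already realigned BAM this way shows which threshold was actually in effect for that run.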

@sendru
Hi,

Wow. 45,000x is really really high. It is best to follow Geraldine's above response and either downsample or set the maxReadsForRealignment to a super high number. Downsampling will reduce runtime quite a bit.

-Sheila

• Member Posts: 4

@Sheila
Hi, this is not exactly the same scenario as normal SNP calling. The reason for such high sequence coverage is that we want to see minor alleles, which has many applications such as cancer cell development, pooled sequencing or mitochondrial heteroplasmy. Downsampling does not look like an option for me, so maybe I will try maxReadsForRealignment. Thank you for your advice.

• Member Posts: 21

@Geraldine_VdAuwera said:
Oof, that's a lot of depth. You can either allow it using the max reads setting, or downsample the data preemptively, depending on what you care about most (time or preserving depth).

Thank you for your detailed information. I used "--maxReadsForRealignment 200000" in IndelRealigner to solve the same problem in my case. The mean read depth is about 3000~40000 for my targeted sequencing data samples.

But I found it interesting that even though I did not set the -dcov parameter, RealignerTargetCreator defaults to 1000, because I see in the output
"INFO 17:15:47,314 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000", whereas the default for IndelRealigner is
"INFO 17:16:17,551 GenomeAnalysisEngine - Downsampling Settings: No downsampling". Is there any reason for that? Considering the high coverage in my target regions, using the default setting would ignore the majority of my reads wherever coverage is >= 1000 when determining the intervals that likely need realignment, right? So do I need to reset the downsampling parameter in RealignerTargetCreator with "-dcov NA" or "-dcov 200000" (a large number) to turn downsampling off? I want to realign all the regions that need it while preserving depth.

• Member Posts: 21

I just read that I can use -dt NONE to disable the downsampler in RealignerTargetCreator.

The difference between the tools is because RTC only writes out intervals, so downsampling doesn't really remove any information from the analysis flow (assuming the downsampled set is representative enough for that step). In contrast, IR writes the bam that will be used in the next step, so if we downsample there, we only write out a subset of reads. That would be bad.

Geraldine Van der Auwera, PhD

• Member Posts: 21

Yes, I understand it now. I was worried that RTC would not write out intervals where a huge number of reads lie, so I disabled the downsampler in RealignerTargetCreator to keep all the intervals that need to be realigned. Thank you for the clarification.
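
The two settings discussed in the last few posts can be put together as a pair of command lines. The sketch below only assembles the arguments; the jar and file names are placeholders, the helper names are hypothetical, and the value 200000 is simply the one quoted in the thread, not a recommendation.

```python
import shlex

def rtc_command(jar, ref, bam, intervals_out):
    """RealignerTargetCreator with the engine downsampler disabled (-dt NONE)."""
    return ["java", "-jar", jar, "-T", "RealignerTargetCreator",
            "-R", ref, "-I", bam, "-o", intervals_out, "-dt", "NONE"]

def ir_command(jar, ref, bam, intervals, bam_out, max_reads):
    """IndelRealigner with a raised --maxReadsForRealignment threshold."""
    return ["java", "-jar", jar, "-T", "IndelRealigner",
            "-R", ref, "-I", bam, "-targetIntervals", intervals,
            "-o", bam_out, "--maxReadsForRealignment", str(max_reads)]

# Placeholder paths; substitute your own files
rtc = rtc_command("GenomeAnalysisTK.jar", "ref.fasta", "in.bam", "targets.intervals")
ir = ir_command("GenomeAnalysisTK.jar", "ref.fasta", "in.bam",
                "targets.intervals", "realigned.bam", 200000)

for cmd in (rtc, ir):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments as lists makes it straightforward to pass them to `subprocess.run` later without shell quoting problems.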