IndelRealigner Not attempting realignment in interval

Hoppmann Freiburg


First of all, sorry if this is the wrong place in the forum, but I didn't find one that really seemed right.

In my latest data I ran into messages like: "INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval chrY:27782626-27782640 because there are too many reads."
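To see which intervals were affected, the log can be scanned for this message. The following is a minimal sketch that assumes every skipped interval is reported in the exact wording quoted above; the function name `skipped_intervals` and the second log line are made up for illustration, and treating an interval printed without an end coordinate as a single position is an assumption.

```python
import re

# Pattern built from the log message quoted above. The optional "-end"
# part is an assumption to cope with single-position intervals.
SKIP_RE = re.compile(
    r"IndelRealigner - Not attempting realignment in interval "
    r"(?P<contig>[^:\s]+):(?P<start>\d+)(?:-(?P<end>\d+))? "
    r"because there are too many reads"
)


def skipped_intervals(log_lines):
    """Return (contig, start, end) for every skipped-interval message."""
    found = []
    for line in log_lines:
        match = SKIP_RE.search(line)
        if match is None:
            continue
        start = int(match.group("start"))
        end = int(match.group("end")) if match.group("end") else start
        found.append((match.group("contig"), start, end))
    return found


if __name__ == "__main__":
    log = [
        "INFO 22:14:12,352 IndelRealigner - Not attempting realignment in "
        "interval chrY:27782626-27782640 because there are too many reads.",
        # Hypothetical unrelated line, only here to show that it is ignored.
        "INFO 22:14:13,000 IndelRealigner - some other message",
    ]
    for contig, start, end in skipped_intervals(log):
        print(contig, start, end)
```

The resulting list can then be compared against the BAM file or the reference to see what the affected regions have in common.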

I've read about the option "--maxReadsForRealignment" and set it to 100000. I had a look at the BAM file and saw that at some positions I have more than 150000 reads. The average read depth is about 50. I'm quite new to the field of NGS and thus am not sure what that means.

How exactly does the program handle this? There are only small parts that are actually covered too highly. Are only those few bases not realigned, or is everything in the step skipped? As an example: chrY:27782626-27782640 is an interval of only 14bp. The whole Y chromosome is covered by only two steps shown in the log file. What is not realigned? One part of the Y chromosome, or just those 14bp?

Besides these technical questions: is it safe to just increase "--maxReadsForRealignment" until all intervals are realigned, or is there normally a general failure (e.g. PCR) behind this, so that it seems safer to ignore those intervals?

If other users have encountered the same issue, it would be nice to hear how they handled it.

Thanks a lot


  Hoppmann Freiburg


    Thanks a lot for the answer. All intervals with this message were actually repeating units of either GT or AC. So I can definitely ignore the message safely.
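    A check like this could be automated once the reference sequence of each flagged interval has been extracted. The sketch below is only an illustration: `repeat_unit` is a made-up helper, the thresholds `max_unit` and `min_copies` are arbitrary choices, and the example sequences are invented, not the actual chrY sequence.

```python
def repeat_unit(seq, max_unit=6, min_copies=3):
    """Return the shortest unit whose tandem repetition spells `seq`.

    A partial final copy is allowed. Returns None when no unit of up to
    `max_unit` bases fits at least `min_copies` times.
    """
    seq = seq.upper()
    for k in range(1, max_unit + 1):
        if len(seq) < k * min_copies:
            break  # too short to hold min_copies copies of a k-base unit
        # seq is periodic with period k if every base equals the base
        # at the same offset inside the first unit
        if all(seq[i] == seq[i % k] for i in range(len(seq))):
            return seq[:k]
    return None


if __name__ == "__main__":
    # Invented example sequences
    for s in ["GTGTGTGTGTGTGTG", "ACACACACAC", "ACGTTGCA"]:
        print(s, repeat_unit(s))
```

    Note that a GT repeat on one strand reads as an AC repeat on the opposite strand, so both motifs describe the same kind of region.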

    One more question concerning -dcov. Am I correct that, since it is a general option listed in the command-line options, it is usable for each step of the GATK pipeline? And since it actually removes all excess sequences, is it only needed once, because all downstream files will contain only the reduced number of sequences?

    And last: the documentation "CommandlineGATK" says that -dcov is biased, and that the option -dfrac should be used if an unbiased approach is needed. Sadly there is almost nothing written about the -dfrac option. Am I correct that the bias mentioned for -dcov comes from the fact that it cuts off excess sequences only where the coverage is too high, so that only a small number of sequences is affected and the proportions of read depth between the different regions change? Based on this, I would interpret the -dfrac option (sadly not well documented) as follows: I choose a percentage between 0% and 100% [0.0-1.0] and the number of ALL reads is reduced accordingly. This results in unbiased proportions between reads, but it decreases the number of reads everywhere, even where the coverage is not high.
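    To make this interpretation concrete, here is a toy model of the two strategies as I understand them. It is not GATK's downsampling code, and the real behaviour of -dcov and -dfrac may differ; `cap_depth`, `keep_fraction`, the cap of 1000 and the fraction of 0.1 are all made up for illustration. The two depths (50 and 150000) are the figures from the first post.

```python
import random


def cap_depth(reads, max_depth, rng):
    """Coverage cap: only loci deeper than max_depth lose reads."""
    if len(reads) <= max_depth:
        return list(reads)
    return rng.sample(reads, max_depth)


def keep_fraction(reads, fraction, rng):
    """Fraction: every read is kept with the same probability."""
    return [read for read in reads if rng.random() < fraction]


if __name__ == "__main__":
    rng = random.Random(0)
    # Reads are represented by plain integers; only the counts matter.
    loci = {
        "typical locus": list(range(50)),
        "repeat locus": list(range(150000)),
    }
    for name, reads in loci.items():
        capped = cap_depth(reads, 1000, rng)
        thinned = keep_fraction(reads, 0.1, rng)
        print(name, len(reads), len(capped), len(thinned))
```

    In this model the cap leaves the typical locus untouched and flattens the repeat locus, so the depth ratio between the two changes. The fraction thins both by roughly the same factor, so the ratio is roughly preserved but the typical locus also loses reads.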

    Thanks again.

  apallav2

    Hi Geraldine, I am wondering if you provide any tool to sanity-check the regions with 'too many reads'?

  Geraldine_VdAuwera

    Hi @apallav2,

    Have a look either at DepthOfCoverage or DiagnoseTargets (for exomes) in the QC section of the Technical Docs.

    Geraldine Van der Auwera, PhD

  apallav2

    Oops - sorry - I should have been clearer.

    Is there any method you (the specialists mentioned above) would prefer for assessing the 'too many reads' areas (apart from DoC and DT, as they just give me counts, which is what the 'too many reads' message already indicates)? I guess I am after an automated way to look at these areas before ruling them out as bad reads.


  Geraldine_VdAuwera

    @apallav2, our tools can only tell you whether intervals have too many reads or not. Something like DT can tell which intervals you should troubleshoot at the sequencing level, or ignore in your analysis. We don't have anything that will tell you why there are too many reads somewhere (e.g. presence of repeats), or how safe it is to still use them anyway, if that's what you're looking for.

    Geraldine Van der Auwera, PhD

  apallav2
