IndelRealigner Not attempting realignment in interval

Hoppmann · Freiburg · Posts: 8 · Member

Hi,

First of all, sorry if this is the wrong forum; I couldn't find a place that really seemed to fit.

With my latest data I ran into messages like:
"INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval chrY:27782626-27782640 because there are too many reads."

I've read about the option "--maxReadsForRealignment" and set it to 100000. I looked at the BAM file and saw that at some positions I have more than 150000 reads. The average read depth is about 50. I'm quite new to the field of NGS and so am not sure what that means.

How exactly does the program handle this? Only small regions actually have excessive coverage. Are only those few bases not realigned, or is the whole step skipped? As an example:
chrY:27782626-27782640 is an interval of only 14bp. The whole Y chromosome is covered by only two steps shown in the log file. What is not realigned: one part of the Y chromosome, or just those 14bp?
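To see exactly which intervals were skipped, the log can be scanned for this message. The following is a minimal Python sketch, not part of GATK: the pattern is modeled only on the single log line quoted above, and the function name `skipped_intervals` is hypothetical. It also assumes the coordinates are 1-based and inclusive, under which the example interval counts 15 positions.

```python
import re

# Pattern modeled on the one log line quoted in this thread; other GATK
# versions may word the message differently (assumption).
SKIP_PATTERN = re.compile(
    r"IndelRealigner - Not attempting realignment in interval "
    r"(?P<contig>[^:\s]+):(?P<start>\d+)-(?P<end>\d+) "
    r"because there are too many reads"
)


def skipped_intervals(log_text):
    """Return (contig, start, end, length) for each skipped interval in the log."""
    results = []
    for match in SKIP_PATTERN.finditer(log_text):
        start, end = int(match["start"]), int(match["end"])
        # Assumption: coordinates are 1-based and inclusive at both ends.
        results.append((match["contig"], start, end, end - start + 1))
    return results


example_log = (
    "INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval "
    "chrY:27782626-27782640 because there are too many reads.\n"
)

for contig, start, end, length in skipped_intervals(example_log):
    print(f"{contig}:{start}-{end} spans {length} positions")
```

Collecting the intervals this way only shows where the message was emitted; it does not say anything about why those regions are so deeply covered.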

Besides these technical questions: is it safe to just increase "--maxReadsForRealignment" until all intervals are realigned, or is there normally a general failure (e.g. PCR) behind this, so that it seems safer to ignore those intervals?

If other users have encountered the same issue, it would be nice to hear how they handled it.

Thanks a lot

Answers

  • Hoppmann · Freiburg · Posts: 8 · Member

    Hi,

    Thanks a lot for the answer. All intervals with this message were in fact repeat units of either GT or AC, so I can definitely ignore the message safely.

    One more question concerning -dcov. Am I correct that, since it is a general option listed among the command-line options, it can be used in every step of the GATK pipeline? And since it actually removes all excess sequences, is it only needed once, because all downstream files will contain only the reduced number of sequences?

    And last: the documentation "CommandlineGATK" says that -dcov is biased, and that -dfrac should be used if an unbiased approach is needed. Sadly, almost nothing is written about the -dfrac option. Am I correct that the bias mentioned for -dcov comes from the fact that it cuts off only the excess sequences, so that only a small number of sequences is affected, meaning that the bias refers to the altered proportions of read depth across the reads? Based on this, I would interpret the (sadly poorly documented) -dfrac option as follows: I choose a fraction between 0% and 100% [0.0-1.0], and the number of ALL reads is reduced accordingly. This gives unbiased proportions between reads but decreases the number of reads everywhere, even where coverage is not high.
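The difference between the two interpretations above can be illustrated with a toy simulation. This is only a sketch of the reasoning in this post, not GATK's actual downsampling code: the functions `downsample_to_cap` and `downsample_by_fraction`, the site names, and the numbers are all hypothetical.

```python
import random


def downsample_to_cap(reads, cap, rng):
    """Keep at most `cap` reads at a site; sites at or below the cap are untouched."""
    if len(reads) <= cap:
        return list(reads)
    return rng.sample(reads, cap)


def downsample_by_fraction(reads, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    return [read for read in reads if rng.random() < fraction]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical sites: one at the average depth mentioned in the thread,
# one at the extreme depth seen in the repeat regions.
sites = {
    "typical_site": [f"read{i}" for i in range(50)],
    "repeat_site": [f"read{i}" for i in range(150_000)],
}

for name, reads in sites.items():
    capped = downsample_to_cap(reads, cap=1000, rng=rng)
    thinned = downsample_by_fraction(reads, fraction=0.1, rng=rng)
    print(name, "original:", len(reads), "capped:", len(capped), "fraction:", len(thinned))
```

In this toy model the cap changes only the over-covered site, so the ratio of depths between sites shifts, while the fraction thins every site by roughly the same factor and keeps that ratio.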

    Thanks again.

  • apallav2 · Posts: 11 · Member

    Hi Geraldine, I am wondering whether you provide any tool for sanity-checking the regions with 'too many reads'?

  • Geraldine_VdAuwera · Posts: 8,137 · Administrator, GATK Dev admin

    Hi @apallav2,

    Have a look either at DepthOfCoverage or DiagnoseTargets (for exomes) in the QC section of the Technical Docs.

    Geraldine Van der Auwera, PhD

  • apallav2 · Posts: 11 · Member

    Oops, sorry, I should have been clearer.

    Is there any method you (the specialists mentioned above) would prefer for assessing the 'too many reads' regions, apart from DoC and DT? Those just give me counts, which is what the 'too many reads' message already indicates. I am looking for an automated way to inspect these regions before ruling them out as bad reads.

    Thanks!

  • Geraldine_VdAuwera · Posts: 8,137 · Administrator, GATK Dev admin

    @apallav2, our tools can only tell you whether intervals have too many reads or not. Something like DT can tell which intervals you should troubleshoot at the sequencing level, or ignore in your analysis. We don't have anything that will tell you why there are too many reads somewhere (e.g. presence of repeats), or how safe it is to still use them anyway, if that's what you're looking for.

    Geraldine Van der Auwera, PhD

  • apallav2 · Posts: 11 · Member

    Alright - Thanks for answering.

  • liuxingliang · Singapore · Posts: 13 · Member


    Hi Hoppmann, may I ask how you know how many reads cover a certain position in your BAM file?

  • Hoppmann · Freiburg · Posts: 8 · Member

    Hi liuxingliang,

    I wrote down the position mentioned in the message and then kept increasing the number in "--maxReadsForRealignment" for as long as the message appeared. Once all reads were being realigned, I loaded the BAM file into my genome browser (I use the one from GoldenHelix) and looked at the position.
    When I click on the position, the number of reads at that position is shown in the sidebar.
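For a scripted alternative to clicking in a genome browser, the count can also be derived from alignment records. Below is a minimal Python sketch that works on SAM-format text (a BAM file would first have to be converted to SAM by a separate tool). The function names and the example reads are hypothetical; the sketch counts mapped reads whose alignment spans the position, which includes reads with a deletion or skipped region there.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance along the reference, per the SAM specification.
REFERENCE_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REFERENCE_CONSUMING)


def reads_overlapping(sam_text, contig, position):
    """Count mapped reads whose alignment spans `position` (1-based) on `contig`."""
    count = 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*" or rname != contig:  # unmapped or other contig
            continue
        if pos <= position <= pos + reference_span(cigar) - 1:
            count += 1
    return count


def sam_line(name, flag, contig, pos, cigar):
    """Build a minimal 11-field SAM record for the example (hypothetical reads)."""
    return "\t".join([name, str(flag), contig, str(pos), "60", cigar, "*", "0", "0", "*", "*"])


example_sam = "\n".join([
    "@HD\tVN:1.6",
    sam_line("read1", 0, "chrY", 100, "10M"),      # covers 100-109
    sam_line("read2", 0, "chrY", 105, "5M2D5M"),   # covers 105-116
    sam_line("read3", 0, "chrY", 200, "10M"),      # covers 200-209
    sam_line("read4", 4, "*", 0, "*"),             # unmapped
    sam_line("read5", 0, "chr1", 100, "10M"),      # different contig
])

print(reads_overlapping(example_sam, "chrY", 107))
```

Scanning a whole file line by line like this is slow on real data; it is meant only to show what "number of reads at a position" amounts to.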

    Hope this helps.

    Greetings, Anselm

  • ymc · Posts: 27 · Member

    I am getting similar messages, too. I think it is because the mean mapped coverage of my targeted sequencing data is about 3000x. Does that mean it is better for me to raise --maxReadsForRealignment if I don't care about runtime?

  • Geraldine_VdAuwera · Posts: 8,137 · Administrator, GATK Dev admin

    Oof, that's a lot of depth. You can either allow it using the max reads setting, or downsample the data preemptively, depending on what you care about most (time or preserving depth).

    Geraldine Van der Auwera, PhD

  • sendru · Posts: 4 · Member

    Hi, @Geraldine_VdAuwera

    I am getting the same message from IndelRealigner, but my BAM file comes from simulated reads of a known reference sequence. The only reason I can see for this message is that my sequence coverage is very high (45k on average). Is it possible to do the indel realignment by adjusting parameters? Thank you.

    @PG ID:GATK IndelRealigner CL:knownAlleles=[(RodBinding name=knownAlleles source=mit.vcf)] targetIntervals=indelrealign.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=KNOWNS_ONLY entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
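The header line above records the settings the run actually used, including maxReadsForRealignment=20000, which is below the 45k average coverage mentioned in this post. A small Python sketch for pulling such settings out of a header line follows. The function name `parse_pg_command_line` is hypothetical, the example line is an abbreviated copy of the one posted, and the parsing assumes space-separated `key=value` pairs after `CL:` as they appear in this thread.

```python
import re

# A value is either a bracketed group (which may contain spaces) or a run of
# non-space characters. Assumption: brackets are not nested.
PARAM = re.compile(r"(\w+)=(\[[^\]]*\]|\S+)")


def parse_pg_command_line(pg_line):
    """Return the key=value settings found after 'CL:' in a @PG header line."""
    _, _, command_line = pg_line.partition("CL:")
    return dict(PARAM.findall(command_line))


# Abbreviated copy of the header line posted above.
pg_line = (
    "@PG ID:GATK IndelRealigner "
    "CL:knownAlleles=[(RodBinding name=knownAlleles source=mit.vcf)] "
    "targetIntervals=indelrealign.intervals LODThresholdForCleaning=5.0 "
    "consensusDeterminationModel=KNOWNS_ONLY maxReadsInMemory=150000 "
    "maxReadsForRealignment=20000"
)

params = parse_pg_command_line(pg_line)
print("maxReadsForRealignment =", params["maxReadsForRealignment"])
```

Checking the recorded value this way is a quick confirmation of which threshold was in effect for a given BAM file.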

  • Sheila · Broad Institute · Posts: 1,608 · Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @sendru
    Hi,

    Wow. 45,000x is really really high. It is best to follow Geraldine's above response and either downsample or set the maxReadsForRealignment to a super high number. Downsampling will reduce runtime quite a bit.

    -Sheila

  • sendru · Posts: 4 · Member

    @Sheila
    Hi, this is not exactly the same scenario as normal SNP calling. The reason for such high sequence coverage is that we want to see minor alleles, which has many applications, such as cancer cell development, pooled sequencing, or mitochondrial heteroplasmy. Downsampling does not look like an option for me; maybe I will try maxReadsForRealignment. Thank you for your advice.
