We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Using IntervalListTools to create scatter intervals for HaplotypeCaller

I was looking at the GATK4 $5 WDL file and see that it uses IntervalListTools to create the interval list for scattering over HaplotypeCaller. In the call to IntervalListTools, it sets the parameter BREAK_BANDS_AT_MULTIPLES_OF to a non-zero value (100,000 in the hg38 json file ). The ScatterIntervalsByNs call to generate the interval list which is used as input to this step is very careful to split at N's, but then in this call we may split in the middle of actual sequence. Won't we potentially be introducing edge effects by doing this? If we split every 100,000 bases, then won't we have issues calling an indel that spans one of these break points?

Best Answer


  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @BTH

    I am not sure what you mean by:

    The ScatterIntervalsByNs call to generate the interval list which is used as input to this step is very careful to split at N's, but then in this call we may split in the middle of actual sequence.

    Can you please be more specific. Thank you.

  • BTHBTH Member

    Then during the ScatterIntervalsByNs might create intervals of:
    It never breaks except at an N. HapplotypeCaller can operate on each of this intervals independently and will get the same results as if we had not split by intervals at all.

    But then the pipeline runs IntervalListTools BREAK_BANDS_AT_MULTIPLES_OF. Let's say it sets it to 5. Then we would break the above intervals to:

    The challenge as I see it here is that let's say there is an indel that splits the first two of these intervals, i.e.:
    Reference: ACCCTCCTGG
    Sample: ACCC--CTGG

    Since HaplotypeCaller will be run independently on ACCCT and CCTGG, we'll get edge effects when compared to a run that saw ACCCTCCTGG as a single interval. I'm actually not sure how the results will vary, but it seems to me that the results at these edge regions will be different.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 2019

    Hi @BTH

    As mentioned in the tool docs for ScatterIntervalsByNs

    This tool is used for creating a broken-up interval list that can be used for scattering a variant-calling pipeline in a way that will not cause problems at the edges of the intervals. By using large enough N blocks (so that the tools will not be able to anchor on both sides) we can be assured that the results of scattering and gathering the variants with the resulting interval list will be the same as calling with one large region.

    So looks like the way the intervals are created, the tool avoids problems at the edges. I hope this doc helps.

  • BTHBTH Member

    Right, that's my point. There are two parts to creating the scatter intervals in the WDL definition:
    1. ScatterInteralsByNs - this step is very careful to avoid the edge problem as we both described.
    2. IntervalListTools using BREAK_BANDS_AT_MULTIPLES_OF - this step then further breaks the intervals from step 1 and is not at all careful about edges so it appears to me you would get edge effects every 100,000 bases on average (since this second step breaks all of the intervals every 100,000 bases without regard to Ns etc.)

  • BTHBTH Member

    Ah-ha - that's great to know. Thanks Bhanu!

  • TracyCTracyC SydneyMember
    Hi @bhanuGandham , I have a related question. If HaplotypeCaller intrinsically does padding on sequences, and there is an indel that sits on the edge of 2 intervals, does that mean that the indel be called in both VCF files for each interval?

    I am looking into tools that do the "gather" step. GATK 4's GatherVcfs seems purposefully built, however it explicitly states that "Input files must be supplied in genomic order and must not have events at overlapping positions."

    What is the best way to handle this scenario?

    Also, testing GatherVcfs, I noticed that the header in the final, merged VCF file subsets the VCF headers for the "##GATKCommandLine=" and only shows the command for the first interval. There is also no header to show "GatherVcfs" and so there is no record of which files have been merged. Is there a way to get this info in the header?

    Many thanks. :)
  • bshifawbshifaw Member, Broadie, Moderator admin
    edited August 2019

    1) I'll have to double check with the team about Haplotypecaller.
    2) Switch to MergeVcfs, it doesn't have the problems with events at overlapping positions.

  • bshifawbshifaw Member, Broadie, Moderator admin


    Dev team response :

    The padding in this case means that HaplotypeCaller will look at reads that extend beyond the boundaries of the requested interval when assembling and finding variants. This is to help resolve variants that might be near the boundary of intervals. However the tool retains the information of interval it's supposed to call on, so it will not make calls outside the requested interval. Each variant will only be called once.

  • manolismanolis Member ✭✭✭
    edited August 2019

    Hi @bshifaw ,

    looking your last post I do not understand if in general HC call variants in the padding regions (-ip option). That because you wrote "in this case means" and also, what is the meaning of "requested interval"? only the original interval or interval+padding? or is just a my confusion between "-ip" and "padding intrinsically"?

    Many thanks

  • manolismanolis Member ✭✭✭
    edited August 2019

    Sorry, I made some confusion reading "padding" in the previous posts, it's ok now!

    Post edited by manolis on
Sign In or Register to comment.