Collected FAQs about interval lists

Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
edited October 2016 in Frequently Asked Questions

1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?

Absolutely. Just use the -L argument to provide the list of intervals you wish to run on. Or you can use -XL to exclude intervals, e.g. to blacklist genome regions that are problematic.

2. What file formats does GATK support for interval lists?

GATK supports several interval list formats: Picard-style .interval_list, GATK-style .list or .intervals, BED files with extension .bed, and VCF files.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Remember that the coordinates in this file are 1-based, not 0-based (the first position in the genome is position 1).
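To make the layout concrete, here is a minimal Python sketch that reads a Picard-style interval list and checks each interval against the @SQ lines of its own header. The function name `parse_interval_list` and the validation rules are illustrative assumptions, not GATK or Picard code.

```python
def parse_interval_list(text):
    """Split a Picard-style interval list into (contig_lengths, intervals).

    Hypothetical helper for illustration; not part of GATK or Picard.
    """
    contig_lengths = {}  # contig name -> length, taken from @SQ header lines
    intervals = []       # (contig, start, stop, strand, name), 1-based inclusive
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                # Header tags look like SN:1 or LN:249250621
                tags = dict(field.split(":", 1) for field in fields[1:])
                contig_lengths[tags["SN"]] = int(tags["LN"])
            continue
        contig, start, stop, strand, name = fields
        start, stop = int(start), int(stop)
        if contig not in contig_lengths:
            raise ValueError(f"contig {contig!r} is not in the sequence dictionary")
        # Assumed sanity check: coordinates must fall inside the contig
        if not 1 <= start <= stop <= contig_lengths[contig]:
            raise ValueError(f"bad coordinates on {contig}: {start}-{stop}")
        intervals.append((contig, start, stop, strand, name))
    return contig_lengths, intervals


example = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "1\t30366\t30503\t+\ttarget_1",
    "1\t69089\t70010\t+\ttarget_2",
])
lengths, targets = parse_interval_list(example)
print(lengths)
print(targets)
```

Because the dictionary travels with the intervals, a mismatch such as an unknown contig name can be caught before any data is processed.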

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify whole chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> entries can be present in the same file, one per line. You can also specify intervals in this format directly at the command line instead of writing them in a file.
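As a rough illustration of what this format looks like in practice, the Python sketch below parses GATK-style interval strings. The helper name `parse_gatk_interval` is hypothetical, and treating <chr>:<pos> as a single-base interval is an assumption based on the discussion further down this page. Contig names that themselves contain a colon are not handled.

```python
import re

# <chr>, <chr>:<pos> or <chr>:<start>-<stop>; contig names containing ':' are
# not supported by this simplified pattern.
_INTERVAL_RE = re.compile(
    r"^(?P<contig>[^:]+)(?::(?P<start>\d+)(?:-(?P<stop>\d+))?)?$"
)


def parse_gatk_interval(text):
    """Return (contig, start, stop); start and stop are None for a whole contig.

    Hypothetical helper for illustration; coordinates are 1-based inclusive.
    """
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a GATK-style interval: {text!r}")
    contig = match["contig"]
    if match["start"] is None:
        return contig, None, None
    start = int(match["start"])
    # Assumption: <chr>:<pos> means the single-base interval pos-pos
    stop = int(match["stop"]) if match["stop"] is not None else start
    if start < 1 or stop < start:
        raise ValueError(f"bad coordinates in {text!r}")
    return contig, start, stop


# Example contents of a .list / .intervals file, one interval per line
example_file = """20:10000-20000
20:30000-40000
21
"""
for line in example_file.splitlines():
    print(parse_gatk_interval(line))
```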

C. BED files with extension .bed

We also accept the widely used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that BED start coordinates are 0-based, while the stop coordinate is exclusive (which makes it numerically equal to a 1-based inclusive stop). So if you're cooking up a custom interval list derived from a file in a 1-based format, subtract 1 from each start coordinate and leave the stop coordinate unchanged. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
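The coordinate shift is easy to get wrong, so here is a small Python sketch of the conversion in both directions. The function names are made up for this example; the arithmetic follows the standard BED convention of a 0-based start and an exclusive end.

```python
def one_based_to_bed(contig, start, stop):
    """Convert a 1-based inclusive interval to BED (0-based start, exclusive end)."""
    if start < 1 or stop < start:
        raise ValueError(f"bad 1-based interval: {contig}:{start}-{stop}")
    # Only the start shifts; an inclusive 1-based stop equals an exclusive 0-based end
    return contig, start - 1, stop


def bed_to_one_based(contig, bed_start, bed_end):
    """Convert a BED interval back to 1-based inclusive coordinates."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError(f"bad BED interval: {contig} {bed_start} {bed_end}")
    return contig, bed_start + 1, bed_end


bed = one_based_to_bed("1", 30366, 30503)
print("\t".join(str(field) for field in bed))  # one tab-separated BED line
print(bed_to_one_based(*bed))
```

A quick sanity check is that the interval length is the same in both systems: stop - start + 1 in 1-based coordinates, end - start in BED.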

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
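To picture what the engine ends up working on, the Python sketch below turns VCF records into padded single-base intervals, following the description above. It is a simplified illustration with a hypothetical function name, not the engine's actual logic, and it does not clamp intervals at the end of a contig.

```python
def vcf_sites_to_intervals(vcf_text, padding=0):
    """Turn each VCF record into a 1-based (contig, start, stop) interval.

    Hypothetical helper: every record is treated as a single-base interval at
    POS and then widened by `padding` bases on each side.
    """
    intervals = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos = line.split("\t")[:2]
        pos = int(pos)
        # Padding must not push the start below position 1
        intervals.append((chrom, max(1, pos - padding), pos + padding))
    return intervals


vcf_text = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t50\t.\tC\tT\t50\tPASS\t.",
    "1\t30400\t.\tA\tG\t50\tPASS\t.",
])
print(vcf_sites_to_intervals(vcf_text))
print(vcf_sites_to_intervals(vcf_text, padding=100))
```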

3. Is there a required order of intervals?

Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons.
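If you want to check a list before handing it to GATK, something along the lines of this Python sketch would do. The function name and the tuple layout are assumptions for illustration; the contig order would come from your reference's sequence dictionary.

```python
def intervals_are_sorted(intervals, contig_order):
    """Check that intervals follow the sequence dictionary's contig order and
    are sorted by increasing coordinate within each contig.

    Hypothetical helper; `intervals` holds (contig, start, stop) tuples and
    `contig_order` lists contig names as they appear in the dictionary.
    """
    rank = {contig: index for index, contig in enumerate(contig_order)}
    previous = None
    for contig, start, stop in intervals:
        if contig not in rank:
            raise ValueError(f"contig {contig!r} is not in the sequence dictionary")
        key = (rank[contig], start, stop)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


dictionary_order = ["1", "2", "10", "X"]
good = [("1", 30366, 30503), ("1", 69089, 70010), ("2", 100, 200), ("10", 5, 50)]
bad = [("10", 5, 50), ("2", 100, 200)]
print(intervals_are_sorted(good, dictionary_order))
print(intervals_are_sorted(bad, dictionary_order))
```

Note that the contig order is whatever the dictionary says, not alphabetical: here contig 2 comes before contig 10.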

4. Can I provide multiple sets of intervals?

Sure, no problem -- just pass them in using separate -L arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by setting an interval_set rule.
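The difference between taking the union and taking the intersection of two interval sets can be sketched in Python as follows. This is a simplified illustration with hypothetical function names, not the engine's code; it sorts contigs alphabetically rather than by sequence dictionary order, and it merges abutting intervals in the result.

```python
def _merge(intervals):
    """Sort (contig, start, stop) tuples and merge those that overlap or abut."""
    merged = []
    for contig, start, stop in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            last = merged[-1]
            merged[-1] = (contig, last[1], max(last[2], stop))
        else:
            merged.append((contig, start, stop))
    return merged


def interval_union(set_a, set_b):
    """Every position covered by at least one of the two sets."""
    return _merge(list(set_a) + list(set_b))


def interval_intersection(set_a, set_b):
    """Only the positions covered by both sets."""
    shared = []
    for contig_a, start_a, stop_a in set_a:
        for contig_b, start_b, stop_b in set_b:
            if contig_a != contig_b:
                continue
            start, stop = max(start_a, start_b), min(stop_a, stop_b)
            if start <= stop:
                shared.append((contig_a, start, stop))
    return _merge(shared)


first = [("1", 100, 200), ("1", 300, 400)]
second = [("1", 150, 350)]
print(interval_union(first, second))
print(interval_intersection(first, second))
```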

5. How will GATK handle intervals that abut or overlap?

Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an interval_merging rule.
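Here is a minimal Python sketch of that merging behavior. The function and its `merge_abutting` switch are hypothetical stand-ins for the engine's merging rule, and the input is assumed to be sorted already.

```python
def merge_intervals(intervals, merge_abutting=True):
    """Merge sorted 1-based (contig, start, stop) intervals.

    With merge_abutting=True, intervals that merely touch (for example
    100-200 and 201-300) are merged as well as overlapping ones; with False,
    only true overlaps are merged. Hypothetical helper for illustration.
    """
    allowed_gap = 1 if merge_abutting else 0
    merged = []
    for contig, start, stop in intervals:
        same_contig = bool(merged) and merged[-1][0] == contig
        if same_contig and start <= merged[-1][2] + allowed_gap:
            last = merged[-1]
            merged[-1] = (contig, last[1], max(last[2], stop))
        else:
            merged.append((contig, start, stop))
    return merged


targets = [("1", 100, 200), ("1", 201, 300), ("1", 250, 400), ("2", 1, 10)]
print(merge_intervals(targets))
print(merge_intervals(targets, merge_abutting=False))
```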

6. What's the best way to pad intervals?

You can use the -ip engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right?

Note that if padding causes intervals that previously didn't abut or overlap to do so, the GATK engine will by default merge them as described above. This behavior can be modified by setting an interval_merging rule.
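To see how padding can bring neighboring targets into contact, consider this small Python sketch. The function names, the second target and the clamping to contig boundaries are assumptions made for the example, not a description of the engine's internals.

```python
def pad_intervals(intervals, padding, contig_lengths):
    """Widen each 1-based (contig, start, stop) interval by `padding` bases on
    both sides, staying within positions 1..contig length."""
    padded = []
    for contig, start, stop in intervals:
        new_start = max(1, start - padding)
        new_stop = min(contig_lengths[contig], stop + padding)
        padded.append((contig, new_start, new_stop))
    return padded


def touching_neighbours(intervals):
    """Return index pairs of adjacent sorted intervals that abut or overlap."""
    pairs = []
    for index in range(len(intervals) - 1):
        contig_a, _, stop_a = intervals[index]
        contig_b, start_b, _ = intervals[index + 1]
        if contig_a == contig_b and start_b <= stop_a + 1:
            pairs.append((index, index + 1))
    return pairs


contig_lengths = {"1": 249250621}
# The second target is invented for this example
targets = [("1", 30366, 30503), ("1", 30650, 30800)]
padded = pad_intervals(targets, 100, contig_lengths)
print(touching_neighbours(targets))  # separate before padding
print(padded)
print(touching_neighbours(padded))   # in contact after padding
```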


  • prepagam · Member

    If I were only interested in calling variants in a set of neutral regions, would there be any negative implications to intersecting my BAM with a BED file of these regions PRIOR to running GATK, i.e. doing this rather than using the genomic intervals option that GATK offers? For me this is preferable for various storage reasons, but perhaps it has some unknown side effect with GATK.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    No problem at all, you can use whatever intervals you want. This may influence the expected Ti/Tv ratio, so keep that in mind when you analyze your callset, but it shouldn't have any effect on the quality of results.

  • eflannery · San Diego · Member

    Hi Geraldine, it seems like an interval in the interval list needs to be a minimum size to be output by the DiagnoseTargets walker. Do you know this minimum? Is it a default or calculated each time? Is there a way to change it?



  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @eflannery,

    I just looked at the code and didn't find any hardcoded limits. The only limitation that I'm aware of is that intervals must be non-null (i.e. not zero-length). Why do you think there's a limit?

  • eflannery · San Diego · Member

    When I run DiagnoseTargets, some intervals that are present in the interval_list file are missing from the output file. All of the excluded intervals are very small (<500 bp), so I assumed that was why they were not included. Shouldn't every interval in the interval_list be included in the output of DiagnoseTargets?



  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    Hi Erika,

    Sorry for the late response. I was going through my old emails and found this! Are you still having an issue with this? Is it possible that the short intervals overlap some other longer intervals and are getting output as part of the longer intervals?


  • Katie · United States · Member ✭✭

    Is there a way to define an interval list by position rather than by interval? For example, if I am interested in using SelectVariants, can I query a VCF with a list containing only contig and SNP position? I've tried this, but it seems like I need to define regions rather than positions.
    Thank you!

  • Katie · United States · Member ✭✭

    Sorry to bother, I found that vcftools will filter with a tab-delimited list of chromosome and position with the command:
    vcftools --vcf 'VCFfile' --positions 'positions_list'


  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    You can do this with SelectVariants, sure. You can pass in single positions using either the interval list format or a VCF of sites of interest.

  • QazSeDc · Hong Kong · Member

    I've had a hard time running DepthOfCoverage with the correct interval file format.
    I tried following the GATK instructions but it still wouldn't work.
    Could anyone please give an example for each of the .list, .intervals and .interval_list formats?

  • QazSeDc · Hong Kong · Member
    edited October 2016

    Hi @Geraldine_VdAuwera ,

    I have tried the [chr] [start] [stop] format with the .list, .intervals and .interval_list filename extensions mentioned in https://software.broadinstitute.org/gatk/guide/article?id=1204 but it wouldn't work.
    I figured out that the [chr] [start] [stop] format only works for .bed files, and that .list, .intervals and .interval_list files only worked when I used the [chr]:[start]-[stop] format.
    Am I missing something?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @QazSeDc, I rewrote this article to be clearer about what is supported, what the requirements are, and some of the convenience options related to intervals. I hope this helps.

  • QazSeDc · Hong Kong · Member

    Thank you @Geraldine_VdAuwera!
    This new guideline explains everything clearly!

  • biojiangke · Member ✭✭


    I have a question about the behavior of the interval option in CombineGVCFs. I understand it can take the standard samtools/GATK format chr:start-end as well as BED format, but as I found when I tried it, it can also take the format chr:pos. I would expect GATK to process a single genomic position in this situation, but instead I'm getting results up to 5 bp away from the specified position. Could anyone provide more information about this behavior?

    The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

    P. S. also posted this in a different thread before finding this one.



  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    Hi Ke,

    I will answer there.


  • Jason_Wu · Member
    edited December 2018

    Dear GATK team,

    I got an error when using a BED file as the INTERVAL input for "gatk GenotypeConcordance" (Picard).
    I worked around it by using "gatk BedToIntervalList" to produce a new interval file and passing that as the INTERVAL input instead.

    But this article says that GATK also accepts BED files as interval input.
    So I was wondering whether that statement covers only the "original" GATK tools, and not the Picard tools that are now also included in the GATK tool list.


  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭
    edited December 2018

    Hi @Jason_Wu,

    Picard definitely accepts Picard-style interval lists, and GATK accepts both Picard-style and BED interval files. Given what you report, I will put in a request for Picard tools called through GATK to also accept BED format. Thanks for bringing this to our attention.

    P.S. Here is the GitHub issue ticket I placed for you: https://github.com/broadinstitute/gatk/issues/5472