Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Intervals and interval lists

GATK_TeamGATK_Team
edited December 2017 in Dictionary

Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:

  • You want to run a quick test on a subset of data (often used in troubleshooting)
  • You want to parallelize execution of an analysis across genomic regions
  • You need to exclude regions that have bad or uninformative data where a tool is getting stuck
  • The analysis you're running should only take data from those subsets due to how the underlying algorithm works

Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.


Interval-related arguments and syntax

Arguments for specifying and modifying intervals are provided by the engine and can be applied to most of not all tools. The main arguments you need to know about are the following:

  • -L / --intervals allows you to specify an interval or list of intervals to include.
  • -XL / --exclude-intervals allows you to specify an interval or list of intervals to exclude.
  • -ip / --interval-padding allows you to add padding (in bp) to the intervals you include.
  • -ixp / --interval-exclusion-padding allows you to add padding (in bp) to the intervals you exclude.

By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).

The syntax for using -L is as follows; it applies equally to -XL:

  • -L chr20 for contig chr20.
  • -L chr20:1-100 for contig chr20, positions 1-100.
  • -L intervals.list (or intervals.interval_list, or intervals.bed) when specifying a text file containing intervals (see supported formats below).
  • -L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.

If you want to provide several intervals or several interval lists, just pass them in using separate -L or -XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule in the Tool Docs).


Supported interval list formats

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is require for efficiency reasons.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.


Obtaining suitable interval lists

So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?

Targeted sequencing (exomes, gene panels etc.)

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.

Whole genomes (WGS)

For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.

We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.

Comments

  • Is there a Google bucket where I can find an hg19 WGS interval list file for use with "help-gatk/Somatic-CNVs-GATK4"?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ehodis
    Hi,

    Unfortunately, we do not provide one for hg19.

    -Sheila

  • hexyhexy ChinaMember

    Hi @ehodis ,
    I want hg19 WGS interval list file also, have you found that?

  • A link to where these list can be found would be helpful! Or better yet, a tutorial on how to create an interval list from a reference genome.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @registered_user
    Hi,

    We don't provide hg19 intervals. I don't know about a tutorial on how to make your own interval list, but it may come in the future.

    -Sheila

  • sleeslee Member, Broadie, Dev ✭✭✭

    Hi @ehodis, @ehodis, and @registered_user,

    For running the Somatic-CNVs-GATK4 workflow on WGS data, you can start by simply passing a .list or .intervals file containing all of the autosomes (1, 2, 3, etc.) This will run the CNV workflow on the autosomes, covering them with bins of size bin_length. For reasons discussed in the workflow and tool documentation, extra care must be taken to run the workflow on the sex chromosomes.

    For running this workflow on WES data, we do not provide intervals---you should use whatever interval list is appropriate for the capture kit used to generate your data.

  • pkhadkapkhadka Member

    Hi, it says here that you make your WGS interval list available. Can you point to me where I might be able to find this? I am using hg18 data.

  • alongaloralongalor Member
    edited October 2018

    I am using hg19 data and would find a WGS interval list helpful too...

  • kclemkclem Member

    Bump this. Can you add the link to the interval lists for download? For hg38?

  • sen_guo1sen_guo1 Member
    above are the links that can get the interval_list file, while when use thefile < somatic-b37_whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.interval_list > , GATK says
    Homo_sapiens_assembly19.targets.interval_list is malformed:Interval file could not be parsed in any supported format
    Homo_sapiens_assembly19.targets.interval_list has an invalid interval : 1:30366-30503 + target_1


    Any advice would be great appreciate!
  • sen_guo1sen_guo1 Member
     console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37/?project=broad-dsde-outreach 

    console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38?project=broad-dsde-outreach
  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin
  • pcuscopcusco Member
    Hi,

    I am getting a Java NullPointerException when I try to use .bed file with the --interval-list/-L option.

    This doesn't happen at chromosome boundaries or the like. And doesn't happen if I take only the first ~20k lines of the bed.
    Also, I don't get any error if I use a different, smaller bed file with the same chromosome order as the problematic file, so I ruled out this as the source of the error.

    Is there a maximum number of lines or bp? Anyone knows why is this happening?

    Thanks!
  • RFRichholtRFRichholt Phoenix, AZMember
    edited March 12

    Can you describe the "inclusiveness" of these intervals better. For example, to represent a single base interval at chr1 position 100 would it be chr1:100-100 or chr1:100-101?

  • RosmaninhoRosmaninho Member

    I suppose the recommended file is b37_wgs_consolidated_calling_intervals.list ?

    Does this exclude highly repetitive regions, it's not present on the resource bundle so it would be nice to have some more info about it...

  • RosmaninhoRosmaninho Member
    edited April 2

    While running HaplotypeCaller by defining a while loop pointing to the Broad file of interval lists I got the following error on -L.

    broad_interval_list=$interval_file/intervals_b37_wgs_consolidated_calling_intervals.list
    
    
    while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Text read from file: $line" | \
    ls -d $bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam | srun shifter --image=broadinstitute/gatk:latest gatk --java-options '-Xmx32G' HaplotypeCaller -I=$bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam \
    -O=$output/${names[${SLURM_ARRAY_TASK_ID}]}_${line}_50_wgs.g.vcf.gz \
    -ERC=GVCF -L=${line} -ip=50 -R=$genome/human_g1k_v37.fasta --native-pair-hmm-threads=$SLURM_CPUS_PER_TASK ; done < "$broad_interval_list"
    

    ``A USER ERROR has occurred: No value found for tagged argument: L=1:10001-29878082

    Also, is it okay to include a string 1:10001-29878082 in the name of the output file? I find having : in a file name not be a good practice.

  • RosmaninhoRosmaninho Member

    Well, just solved this minor niggle, no worries.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @Rosmaninho

    b37_wgs_consolidated_calling_intervals.list is a list of the intervals used for WGS short variant calling (as seen in the $5G pipeline), which are available as this "consolidated" list (ie in a single file) and as a scattered list of lists. And this is the b37 reference version.

  • RosmaninhoRosmaninho Member

    Thank you for the reply Bhanu!!

    Is there more information on the regions removed from this list? Repetitive regions, telomeres, etc?...
    I used it to treat my WGS data instead of using the full chromossomes and it worked quite well (created a list of lists from it). :)

  • javisjavis Member
    edited July 26

    could I use the interval wgs_calling_regions.hg38.interval_list to filter bad regions with API SelectVariants?
    gatk SelectVariants -L wgs_calling_regions.hg38.interval_list

    wgs_calling_regions.hg38.interval_list download from https://console.cloud.google.com/storage/browser/_details/gatk-test-data/intervals/wgs_calling_regions.hg38.interval_list

Sign In or Register to comment.