The current GATK version is 3.6-0

Collected FAQs about interval lists

Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin
edited October 14 in FAQs

1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?

Absolutely. Just use the -L argument to provide the list of intervals you wish to run on. Or you can use -XL to exclude intervals, e.g. to blacklist genome regions that are problematic.


2. What file formats does GATK support for interval lists?

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list or .intervals, BED files with extension .bed, and VCF files.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this format is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
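To make the two accepted forms concrete, here is a minimal Python sketch that reads GATK-style entries. It is only an illustration, not GATK's own parser: the function name parse_gatk_interval is made up for this example, and it assumes contig names contain no colon.

```python
def parse_gatk_interval(text):
    """Parse '<chr>' or '<chr>:<start>-<stop>' into (contig, start, stop).

    start and stop are None when only a contig name is given.
    Assumption: contig names do not contain ':'.
    """
    text = text.strip()
    contig, colon, coords = text.partition(":")
    if not colon:
        # Whole-contig entry, e.g. "2"
        return (contig, None, None)
    start_text, _, stop_text = coords.partition("-")
    start, stop = int(start_text), int(stop_text)
    # Coordinates are 1-based, so the smallest valid start is 1.
    if start < 1 or stop < start:
        raise ValueError(f"invalid 1-based interval: {text!r}")
    return (contig, start, stop)


# Both forms can be mixed in the same list.
lines = ["1:30366-30503", "2", "1:69089-70010"]
for line in lines:
    print(parse_gatk_interval(line))
```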

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates and the stop coordinate is not included in the interval, so if you're cooking up a custom interval list derived from a file in a 1-based format, subtract 1 from each start coordinate and leave the stop coordinate as it is. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
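The offset is easy to get backwards, so here is a small Python sketch of the conversion in both directions. The function names are hypothetical helpers for this example; the arithmetic follows the standard BED convention (0-based start, stop not included).

```python
def one_based_to_bed(contig, start, stop):
    """Convert a 1-based inclusive interval to BED (0-based start, exclusive stop)."""
    return (contig, start - 1, stop)


def bed_to_one_based(contig, start, stop):
    """Convert a BED interval back to 1-based inclusive coordinates."""
    return (contig, start + 1, stop)


# target_1 from the Picard-style example above
contig, bed_start, bed_stop = one_based_to_bed("1", 30366, 30503)
print(f"{contig}\t{bed_start}\t{bed_stop}")

# The interval length is the same in both systems.
assert bed_stop - bed_start == 30503 - 30366 + 1
```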

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.


3. Is there a required order of intervals?

Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons.
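If your intervals come from a source that doesn't guarantee this order, a sketch like the following can sort them. It is written in Python with a made-up function name, and it assumes you already have the contig names in sequence dictionary order (for example, read from the @SQ lines of the reference dictionary).

```python
def sort_intervals(intervals, contig_order):
    """Sort (contig, start, stop) tuples by dictionary contig order, then by coordinate.

    contig_order is the list of contig names in the same order as the
    sequence dictionary. A contig missing from it raises KeyError.
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    return sorted(intervals, key=lambda iv: (rank[iv[0]], iv[1], iv[2]))


# Note that dictionary order is not alphabetical: "10" comes after "2" here.
contig_order = ["1", "2", "10", "X"]
intervals = [("10", 5, 9), ("2", 100, 200), ("1", 500, 600), ("1", 30, 50)]
for interval in sort_intervals(intervals, contig_order):
    print(interval)
```

Sorting on the contig name as a plain string would put "10" before "2", which is why the rank lookup is used instead.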


4. Can I provide multiple sets of intervals?

Sure, no problem -- just pass them in using separate -L arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by setting an interval_set rule.
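To illustrate what the set rule changes, here is a rough Python sketch comparing a union with an intersection of two interval sets. Intersection is shown as the assumed alternative to the default union; the function names are invented for this example, and each input set is assumed to contain no overlapping intervals of its own.

```python
def union(set_a, set_b):
    """Every interval from either set, duplicates removed (no merging here)."""
    return sorted(set(set_a) | set(set_b))


def intersection(set_a, set_b):
    """Only the stretches covered by both sets."""
    shared = []
    for contig_a, start_a, stop_a in set_a:
        for contig_b, start_b, stop_b in set_b:
            if contig_a != contig_b:
                continue
            start, stop = max(start_a, start_b), min(stop_a, stop_b)
            if start <= stop:  # the two intervals share at least one base
                shared.append((contig_a, start, stop))
    return sorted(shared)


targets = [("1", 100, 200), ("1", 300, 400)]
region = [("1", 150, 350)]
print(union(targets, region))
print(intersection(targets, region))
```

The sort here is plain tuple order, which is fine for a single contig but is not sequence dictionary order.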


5. How will GATK handle intervals that abut or overlap?

Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an interval_merging rule.
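As a rough picture of what merging does, here is a minimal Python sketch using 1-based inclusive (contig, start, stop) tuples. It is not the engine's actual code; the function name and the merge_abutting switch are hypothetical and only stand in for the idea of a different merging rule.

```python
def merge_intervals(intervals, merge_abutting=True):
    """Merge overlapping (and, optionally, abutting) intervals on the same contig.

    With 1-based inclusive coordinates, two intervals abut when the second
    starts exactly one base after the first one stops.
    """
    gap = 1 if merge_abutting else 0
    merged = []
    for contig, start, stop in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + gap:
            last_contig, last_start, last_stop = merged[-1]
            merged[-1] = (last_contig, last_start, max(last_stop, stop))
        else:
            merged.append((contig, start, stop))
    return merged


intervals = [
    ("1", 100, 200),
    ("1", 201, 300),  # abuts the first interval
    ("1", 250, 400),  # overlaps the second interval
    ("1", 500, 600),
    ("2", 1, 10),
]
print(merge_intervals(intervals))
print(merge_intervals(intervals, merge_abutting=False))
```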


6. What's the best way to pad intervals?

You can use the -ip engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right?

Note that if intervals that didn't abut or overlap before you added padding now do so, by default the GATK engine will merge them as described above. This behavior can be modified by setting an interval_merging rule.
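Here is a small Python sketch of how padding can make separate intervals start to overlap. The function name is invented for this example, and clamping the padded interval to the ends of the contig is an assumption made for the sketch rather than a description of the engine's code.

```python
def pad_interval(interval, padding, contig_length=None):
    """Extend a 1-based inclusive (contig, start, stop) interval on both sides.

    Assumption: the result is clamped to position 1 and, when contig_length
    is given, to the end of the contig.
    """
    contig, start, stop = interval
    padded_start = max(1, start - padding)
    padded_stop = stop + padding
    if contig_length is not None:
        padded_stop = min(contig_length, padded_stop)
    return (contig, padded_start, padded_stop)


# Two targets separated by 149 bases
targets = [("1", 1000, 1100), ("1", 1250, 1400)]
padded = [pad_interval(target, 100) for target in targets]
print(padded)

# After 100 bases of padding the first interval reaches past the start of the second.
print(padded[0][2] >= padded[1][1])
```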


Geraldine Van der Auwera, PhD


Comments

  • prepagam Member Posts: 57

    If I was only interested in calling variants in a set of neutral regions, I wonder if there are any negative implications to intersecting my BAM with a BED file of these regions PRIOR to running GATK, i.e. doing this rather than using the genomic intervals that GATK offers. For me this is preferable for various storage reasons, but perhaps this has some unknown side effect with GATK.

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin

    No problem at all, you can use whatever intervals you want. This may influence the expected Ti/Tv ratio, so keep that in mind when you analyze your callset, but it shouldn't have any effect on the quality of results.

    Geraldine Van der Auwera, PhD

  • eflannery San Diego Member Posts: 9

    Hi Geraldine, It seems like there is a minimum size the interval in the interval list needs to be to get outputted in the Diagnose Targets walker. Do you know this minimum? Is it default or calculated each time? Is there a way to change it?

    Thanks!

    Erika

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin

    Hi @eflannery,

    I just looked at the code and didn't find any hardcoded limits. The only limitation that I'm aware of is that intervals must be non-null (i.e. not zero-length). Why do you think there's a limit?

    Geraldine Van der Auwera, PhD

  • eflannery San Diego Member Posts: 9

    When I run Diagnose Targets there are intervals that are present in the interval_list file but not in the output file. All of the excluded intervals are very small, <500bp, so I assumed this was why they were not included. Shouldn't every interval in the interval_list be included in the output of Diagnose Targets?

    Thanks!

    Erika

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,308 admin

    @eflannery
    Hi Erika,

    Sorry for the late response. I was going through my old emails and found this! Are you still having an issue with this? Is it possible that the short intervals overlap some other longer intervals and are getting output as part of the longer intervals?

    Thanks,
    Sheila

  • Katie United States Member Posts: 28

    Is there a way to define an interval list by position rather than interval? For example, if I am interested in using SelectVariants, can I query a VCF with a list containing only contig and SNP position? I've tried this but it seems like I need to define regions rather than positions.
    Thank you!

  • Katie United States Member Posts: 28

    Sorry to bother, I found that vcftools will filter with a tab-delimited list of chromosome and position with the command:
    vcftools --vcf 'VCFfile' --positions 'positions_list'

    Cheers,

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin

    You can do this with SelectVariants, sure. You can pass in single positions using either the interval list format or a VCF of sites of interest.

    Geraldine Van der Auwera, PhD

  • QazSeDc Hong Kong Member Posts: 3

    I've had a hard time running DepthOfCoverage with the correct format of interval file.
    I tried following the GATK instructions but it still wouldn't work.
    Would anyone please give an example for each of the .list, .intervals and .interval_list formats?

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin
  • QazSeDc Hong Kong Member Posts: 3
    edited October 11

    Hi @Geraldine_VdAuwera ,

    I have tried the [chr] [start] [stop] format with the .list, .intervals and .interval_list filename extensions mentioned in https://software.broadinstitute.org/gatk/guide/article?id=1204 but it wouldn't work.
    I figured out that the [chr] [start] [stop] format only worked for .bed files, and that .list, .intervals and .interval_list only worked when I used the [chr]:[start]-[stop] format.
    Am I missing something?

    GitHub issue 1343 (filed by Sheila): state closed, closed by vdauwera.
  • Geraldine_VdAuwera Administrator, Dev Posts: 10,739 admin

    Hi @QazSeDc, I rewrote this article to be clearer about what is supported, what the requirements are, and also some of the convenience options that are related to intervals. I hope this helps.

    Geraldine Van der Auwera, PhD

  • QazSeDc Hong Kong Member Posts: 3

    Thank you @Geraldine_VdAuwera!
    This new guideline explains everything clearly!
