Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

When should I use -L to pass in a list of intervals?

SheilaSheila Broad InstitutePosts: 288Member, GATK Developer, Broadie, Moderator admin
edited June 13 in FAQs

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important note:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.


So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool -L? Why / why not
RealignerTargetCreator YES Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.
IndelRealigner NO IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.
BaseRecalibrator YES This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads NO Output is a bam file; using -L would lead to lost data.
UnifiedGenotyper/Haplotype Caller YES We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
Next steps NO No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • rxy712rxy712 Posts: 12Member

    This is very helpful! Thank you very much. As I am doing GATK for the whole exome sequencing data with the bam files downloaded from TCGA. I am not sure how to make the intervals file for the option -L. I saw some people use the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) to generate a BED file containing all the exons plus 10bp at each end. it is based on hg19 assembly. Would that be a good way to go? I do not see any exome interval file in the GATK ftp bundle, although it has been mentioned in the forum. I also saw that GATK accepts BED style interval lists. But the warning message is that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1. I do not know how does this fit into the story if I use UCSC table browser to generate a bed file. As I set 10bp at each end, there will be no problem? The output should still be 1 based because the annotation is based on the reference fasta file, human_g1k_v37.fasta, instead of the intervals, right?

    Any input would be very appreciated!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,891Administrator, GATK Developer admin

    BED files are always 0-based, that's how the format is specified. It isn't affected by what reference the file corresponds to.

    From http://bedtools.readthedocs.org/en/latest/content/general-usage.html:

    **start ** - The zero-based starting position of the feature in the chromosome.
    The first base in a chromosome is numbered 0.
    The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.

    If you use a program or website that generates a bed file natively that's fine, the intervals will be issued and interpreted correctly by GATK. The warnings are meant for users who generate interval files with their own scripts, to remind them to define intervals appropriately.

    But you should pay attention to what reference you are choosing. The UCSC file will be based on the hg19 build, but the human_g1k_v37.fasta file you are referring to is the b37 build. Have a look at our documentation on reference files if this is confusing to you. You can find both the hg19 and the b37 build files in our resource bundle.

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thank you very much for your explanation! You are right about the reference issue. The problem that I ran into is that GATK does not accept the UCSC version of intervals (hg19 build) when I use the human_g1k_v37.fasta reference (b37 build). But I have to use the human_g1k_v37.fasta reference, because the bam files from TCGA originally used that version. I am wondering if there is any way to solve this problem. It is simple to change chr1 to 1 for the chromosomes in the interval file, but it would be problematic for others like chr6_apd_hap1, chr6_cox_hap2, chrUn_gl000211, chr1_gl000191_random, etc. Or is there any other website for generating exome.intervals based on b37 build?

    ERROR MESSAGE: File associated with name exome_intervals.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,891Administrator, GATK Developer admin

    I don't know the non-chromosomal contigs of hg19 well enough to really answer that. I think they're not represented by equivalents in b37 but I could be wrong. Do you actually need those targets in your analysis? What intervals did the TCGA study use?

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thanks a lot for your information. I am also considering that maybe the non-chromosomal contigs are not as important, and I plan to remove them from the interval file to give it a try. I do not know what intervals the TCGA study use. Thank you!

Sign In or Register to comment.