When should I use -L to pass in a list of intervals?

SheilaSheila Broad InstitutePosts: 1,409Member, GATK Dev, Broadie, Moderator, DSDE Dev admin
edited August 2014 in FAQs

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

-L 20 (for chromosome 20 in b37/b39 build)

-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool -L? Why / why not
RealignerTargetCreator YES Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.
IndelRealigner NO IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.
BaseRecalibrator YES This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads NO Output is a bam file; using -L would lead to lost data.
UnifiedGenotyper/Haplotype Caller YES We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
Next steps NO No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

Post edited by Sheila on


  • rxy712rxy712 Posts: 12Member

    This is very helpful! Thank you very much. As I am doing GATK for the whole exome sequencing data with the bam files downloaded from TCGA. I am not sure how to make the intervals file for the option -L. I saw some people use the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) to generate a BED file containing all the exons plus 10bp at each end. it is based on hg19 assembly. Would that be a good way to go? I do not see any exome interval file in the GATK ftp bundle, although it has been mentioned in the forum. I also saw that GATK accepts BED style interval lists. But the warning message is that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1. I do not know how does this fit into the story if I use UCSC table browser to generate a bed file. As I set 10bp at each end, there will be no problem? The output should still be 1 based because the annotation is based on the reference fasta file, human_g1k_v37.fasta, instead of the intervals, right?

    Any input would be very appreciated!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,043Administrator, GATK Dev admin

    BED files are always 0-based, that's how the format is specified. It isn't affected by what reference the file corresponds to.

    From http://bedtools.readthedocs.org/en/latest/content/general-usage.html:

    **start ** - The zero-based starting position of the feature in the chromosome.
    The first base in a chromosome is numbered 0.
    The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.

    If you use a program or website that generates a bed file natively that's fine, the intervals will be issued and interpreted correctly by GATK. The warnings are meant for users who generate interval files with their own scripts, to remind them to define intervals appropriately.

    But you should pay attention to what reference you are choosing. The UCSC file will be based on the hg19 build, but the human_g1k_v37.fasta file you are referring to is the b37 build. Have a look at our documentation on reference files if this is confusing to you. You can find both the hg19 and the b37 build files in our resource bundle.

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thank you very much for your explanation! You are right about the reference issue. The problem that I ran into is that GATK does not accept the UCSC version of intervals (hg19 build) when I use the human_g1k_v37.fasta reference (b37 build). But I have to use the human_g1k_v37.fasta reference, because the bam files from TCGA originally used that version. I am wondering if there is any way to solve this problem. It is simple to change chr1 to 1 for the chromosomes in the interval file, but it would be problematic for others like chr6_apd_hap1, chr6_cox_hap2, chrUn_gl000211, chr1_gl000191_random, etc. Or is there any other website for generating exome.intervals based on b37 build?

    ERROR MESSAGE: File associated with name exome_intervals.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,043Administrator, GATK Dev admin

    I don't know the non-chromosomal contigs of hg19 well enough to really answer that. I think they're not represented by equivalents in b37 but I could be wrong. Do you actually need those targets in your analysis? What intervals did the TCGA study use?

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thanks a lot for your information. I am also considering that maybe the non-chromosomal contigs are not as important, and I plan to remove them from the interval file to give it a try. I do not know what intervals the TCGA study use. Thank you!

  • GiusyGiusy Posts: 9Member

    Hi there! I've a similar problem to the one rxy712 had, but I'm afraid I can't do without the -L argument. I'm performing an analysis on exome sequencing data so I need to restrict it to target regions. My problem is that I had to use the b37 build for the alignment, but the .bed file of target region (provided by Agilent company) is in the UCSC format so the program gives me an error of no overlapping contigs. Is there a way to rename my .bed file contigs in order to match the b37 reference? In case there isn't, if I just decide not tu use the -L argument, there will be consequences on my analysis? Thank you!

  • SheilaSheila Broad InstitutePosts: 1,409Member, GATK Dev, Broadie, Moderator, DSDE Dev admin



    It is a good idea to use intervals to speed up runtime, but you must use the correct intervals.

    If your dataset does not include mitochondrial contigs or unmapped contigs, then you can simply rename the contigs to match the b37 build. For example, you would change chr1 to 1 in the bed file.

    I hope this helps.


  • GiusyGiusy Posts: 9Member

    Hello, Sheila! It was great help thank you. Actually before your reply, I tried the sed command with very few hopes, but it seems it worked well. Now everything is running smoothly. I don't have in my data mitochondrial contigs, but I'm not sure what you mean for unmapped contigs. Is it a separate file that should come out from BWA? But anyway if I get it right, in the case I had unmapped contigs it would not be sufficient to use the sed command and I would still be stuck, am I right? Thanks again!

  • SheilaSheila Broad InstitutePosts: 1,409Member, GATK Dev, Broadie, Moderator, DSDE Dev admin



    Unmapped contigs are parts of the genome sequence for which people have not yet figured out where they belong on the chromosomes. They are are put in the reference build as "decoy" contigs that collect reads that do not map to anywhere else on the chromosomes, to prevent them from getting mapped with low confidence to the canonical chromosomes. There are versions of the reference build that include them and some that don't.

    It sounds like you do not have unmapped contigs. There is some evidence that having them helps a little, but not having them is not a big problem either.


  • GiusyGiusy Posts: 9Member

    Thank you very much for your explanation. Actually I'm using the hs37d5 reference genome, which should contain also the decoy contigs. How can I find out if I have unmapped contigs in my dataset? when I downloaded my intervals list for the SureSelect All Exons Kit v5 they didn't give any info about unmapped contigs, so I guess they shouldn't be present in my sequences, hence whether I have the decoy or not in the reference genome should not make a difference for me, isn't it?

  • KurtKurt Posts: 207Member ✭✭✭

    You are going to have non-specific hybridizations (i.e. not all of your sequences are going to map to within your targets/baited intervals, the percentage has to do with how well your capture efficiency is). the decoy mainly affects non-coding regions, but there are a few exceptions. CDC27 is one. I believe a few of the MUC genes are as well (some of the reads will end up mapping to the decoy sequence as opposed to be forced onto CDC27, MUC, etc). So even though they are not targeted by your capture product it's not a bad idea when aligning your sequencing data to a reference genome to use the hs37d5 reference which has the unlocalized contigs, decoy sequence as they act as a sink for non-specific reads and removes false positive variants in those genes.

  • liuxingliangliuxingliang SingaporePosts: 13Member

    Hi all, before I read this topic, I did use -L in my PrintReads step which made the rate of mapped reads in my recalibrated bam very low, I think this is so-called "lost data", after I remove -L, the result seems to become normal. Many thanks to all of you. But I am still not very clear why this could happen? Why the data will lose when I use -L in PrintReads.

    Thank you all~

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,043Administrator, GATK Dev admin

    @liuxingliang‌ Please see this document for an explanation of what the -L argument does.

    Geraldine Van der Auwera, PhD

  • chenyu600chenyu600 Posts: 22Member


    I have a few questions about the usage of -L. Currently, I use version3.2 on exomes.

    In order to speed up, I run HaplotypeCaller with the -GVCF option and GenotypeGVCFs on split chromosomes, and then merge them into sample.g.vcf and sample.vcf, respectively. I use -L chr (for example, chr1) when running HaplotypeCaller to get more variants, and then -L chr1.bed when running GenotypeGVCFs. The reason why I use -L chr1 is that I would like to get all variants though I know that sites called outside the target region might be with high FP. The results from this step may be used together with other data captured by different chip. And -L chr1.bed (which is exactly the target region) is to get more accurate genotype.

    My question is that would this combination of commands produce the same result with that produced by with -L chr1.bed for both HaplotypeCaller and GenotypeGVCFs? And it be the same with that produced by running on per-sample bam (rather than per-chromosome)?

    Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,043Administrator, GATK Dev admin

    Hi @chenyu600,

    The problem with using -L on GVCFs that were generated without an intervals file is that if you have an interval that starts in the middle of a reference block, the entire block will be skipped, which is going to cause missing records. It would be better to run GenotypeGVCFs on the entire GVCFs, then subset the sites of interest using SelectVariants. Including regions outside the capture targets will not affect the calls within the targets.

    Geraldine Van der Auwera, PhD

  • chenyu600chenyu600 Posts: 22Member

    Thanks for the quick response @Geraldine_VdAuwera !

    I am now clear about the -L issue. But I am still not sure if the performing of HaplotypeCaller and GenotypeGVCFs on different level (per-sample vs. per-chromosome) would cause any problem?

    Thanks again!

  • chenyu600chenyu600 Posts: 22Member

    Hi @Geraldine_VdAuwera,

    I have just discussed the usage of -L with my colleague, but we dissent with each other on some point, and we would like to make it more clear. Would you please give us more detail for several questions?

    1. What does "the reference block" refer to?

    2. Say we have three exome dataset. For exome1 and exome2, there is no variant from chr1:1-5000, and exome3 got a SNP (A->G, for example) on chr1:3000. If the capture chip we used did not cover chr1:1-5000, would this record (chr1 3000 . A G) missed from the final output of GenotypeGVCFs? Still using -L chr1 for HaplotypeCaller with -GVCF option and -L chr1.bed for GenotypeGVCFs.

    3. If the record (chr1 3000 . A G) is emitted in the final VCF generated by GenotypeGVCFs, what would the record be like for exome1 and exome2?

    Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,043Administrator, GATK Dev admin

    @chenyu600 ,

    Both HaplotypeCaller and GenotypeGVCFs can be run per-chromosome. It does not matter if you do it for just one or both.

    Regarding reference blocks, please read the FAQ article on GVCF files.

    In the GVCFs, you will get records representing every single site that is included in your intervals file. If there is no usable data at a site, the record will be a no-call. In the final joint-genotyped VCF, you will only get records at variant sites unless you enable the option to emit non variants.

    A site is considered variant if at least one sample is variant.

    Geraldine Van der Auwera, PhD

  • aruncoorgaruncoorg USAPosts: 2Member

    Hello, I have a question about the-L flag. Since you mentioned that the coordinates are zero based, for the first ten thousand bases I should be using something like -L chr1:0-10000, right? When I tried this, Haplotype caller gave me an error MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The start position 0 is less than 1. My understanding is that if I use -L chr1:1-10000, it will miss the first base. How should I specify in this case? Thanks for any help!

  • thibaultthibault Posts: 28GATK Dev mod

    Intervals in the standard GATK format are 1-based, so you are not missing the first base with -L chr1:1-10000.

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

  • aruncoorgaruncoorg USAPosts: 2Member

    I am confused, why does the second comment say that it is zero based? I am using BEDtools to generate my intervals. By default, it generated -L chr1:0-10000.

  • thibaultthibault Posts: 28GATK Dev mod

    The GATK uses 1-based intervals internally.

    The comment above is referring to BED files. If you pass in a bed file (like -L intervals.bed) then it will work correctly, even though BED's format is 0-based.

    When passing in the intervals directly as text (what you are doing here, like -L 1:123) the GATK assumes you are using its standard 1-based format.

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

Sign In or Register to comment.