The Frontline Support team will be offline December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks as we get to all of your questions. Happy Holidays!
When should I use -L to pass in a list of intervals?
The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.
In a nutshell, if you’re doing:
- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet
Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.
Example Use of -L:
-L 20for chromosome 20 in b37/b39 build
-L chr20:1-100for chromosome 20 positions 1-100 in hg18/hg19 build
intervals.bed) where the value passed to the argument is a text file containing intervals
-L some_variant_calls.vcfwhere the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.
Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.
- For example,
HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the
-L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use
-L chr1:1-100. This also works for our HLA contig, e.g.
- However, when passing in an entire contig, for contigs with colons in the name, you must add
:1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.
So here’s a little more detail for each experimental design type.
Whole genome analysis
It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!
However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with
-L, or you could also provide a list of "bad" intervals with
-XL, which does the exact opposite of
-L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).
Whole exome analysis
By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.
Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.
|Tool||-L?||Why / why not|
|BaseRecalibrator||YES||This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.|
|PrintReads||NO||Output is a bam file; using -L would lead to lost data.|
|UnifiedGenotyper/Haplotype Caller||YES||We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.|
|Next steps||NO||No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.|
Small targeted experiments
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.
Debugging / troubleshooting
You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).