Due to a minor hiccup in the release process for version 3.3, the tool documentation page is currently malfunctioning. We're working on getting that fixed asap.

When should I use -L to pass in a list of intervals?

SheilaSheila Broad InstitutePosts: 537Member, GATK Developer, Broadie, Moderator admin
edited August 11 in FAQs

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

-L 20 (for chromosome 20 in b37/b39 build)

-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)


So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool -L? Why / why not
RealignerTargetCreator YES Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.
IndelRealigner NO IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.
BaseRecalibrator YES This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads NO Output is a bam file; using -L would lead to lost data.
UnifiedGenotyper/Haplotype Caller YES We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
Next steps NO No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

Post edited by Sheila on
Tagged:

Comments

  • rxy712rxy712 Posts: 12Member

    This is very helpful! Thank you very much. As I am doing GATK for the whole exome sequencing data with the bam files downloaded from TCGA. I am not sure how to make the intervals file for the option -L. I saw some people use the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) to generate a BED file containing all the exons plus 10bp at each end. it is based on hg19 assembly. Would that be a good way to go? I do not see any exome interval file in the GATK ftp bundle, although it has been mentioned in the forum. I also saw that GATK accepts BED style interval lists. But the warning message is that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1. I do not know how does this fit into the story if I use UCSC table browser to generate a bed file. As I set 10bp at each end, there will be no problem? The output should still be 1 based because the annotation is based on the reference fasta file, human_g1k_v37.fasta, instead of the intervals, right?

    Any input would be very appreciated!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,418Administrator, GATK Developer admin

    BED files are always 0-based, that's how the format is specified. It isn't affected by what reference the file corresponds to.

    From http://bedtools.readthedocs.org/en/latest/content/general-usage.html:

    **start ** - The zero-based starting position of the feature in the chromosome.
    The first base in a chromosome is numbered 0.
    The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.

    If you use a program or website that generates a bed file natively that's fine, the intervals will be issued and interpreted correctly by GATK. The warnings are meant for users who generate interval files with their own scripts, to remind them to define intervals appropriately.

    But you should pay attention to what reference you are choosing. The UCSC file will be based on the hg19 build, but the human_g1k_v37.fasta file you are referring to is the b37 build. Have a look at our documentation on reference files if this is confusing to you. You can find both the hg19 and the b37 build files in our resource bundle.

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thank you very much for your explanation! You are right about the reference issue. The problem that I ran into is that GATK does not accept the UCSC version of intervals (hg19 build) when I use the human_g1k_v37.fasta reference (b37 build). But I have to use the human_g1k_v37.fasta reference, because the bam files from TCGA originally used that version. I am wondering if there is any way to solve this problem. It is simple to change chr1 to 1 for the chromosomes in the interval file, but it would be problematic for others like chr6_apd_hap1, chr6_cox_hap2, chrUn_gl000211, chr1_gl000191_random, etc. Or is there any other website for generating exome.intervals based on b37 build?

    ERROR MESSAGE: File associated with name exome_intervals.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,418Administrator, GATK Developer admin

    I don't know the non-chromosomal contigs of hg19 well enough to really answer that. I think they're not represented by equivalents in b37 but I could be wrong. Do you actually need those targets in your analysis? What intervals did the TCGA study use?

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member

    Thanks a lot for your information. I am also considering that maybe the non-chromosomal contigs are not as important, and I plan to remove them from the interval file to give it a try. I do not know what intervals the TCGA study use. Thank you!

  • GiusyGiusy Posts: 9Member

    Hi there! I've a similar problem to the one rxy712 had, but I'm afraid I can't do without the -L argument. I'm performing an analysis on exome sequencing data so I need to restrict it to target regions. My problem is that I had to use the b37 build for the alignment, but the .bed file of target region (provided by Agilent company) is in the UCSC format so the program gives me an error of no overlapping contigs. Is there a way to rename my .bed file contigs in order to match the b37 reference? In case there isn't, if I just decide not tu use the -L argument, there will be consequences on my analysis? Thank you!

  • SheilaSheila Broad InstitutePosts: 537Member, GATK Developer, Broadie, Moderator admin

    @Giusy

    Hi,

    It is a good idea to use intervals to speed up runtime, but you must use the correct intervals.

    If your dataset does not include mitochondrial contigs or unmapped contigs, then you can simply rename the contigs to match the b37 build. For example, you would change chr1 to 1 in the bed file.

    I hope this helps.

    -Sheila

  • GiusyGiusy Posts: 9Member

    Hello, Sheila! It was great help thank you. Actually before your reply, I tried the sed command with very few hopes, but it seems it worked well. Now everything is running smoothly. I don't have in my data mitochondrial contigs, but I'm not sure what you mean for unmapped contigs. Is it a separate file that should come out from BWA? But anyway if I get it right, in the case I had unmapped contigs it would not be sufficient to use the sed command and I would still be stuck, am I right? Thanks again!

  • SheilaSheila Broad InstitutePosts: 537Member, GATK Developer, Broadie, Moderator admin

    @Giusy

    Hi,

    Unmapped contigs are parts of the genome sequence for which people have not yet figured out where they belong on the chromosomes. They are are put in the reference build as "decoy" contigs that collect reads that do not map to anywhere else on the chromosomes, to prevent them from getting mapped with low confidence to the canonical chromosomes. There are versions of the reference build that include them and some that don't.

    It sounds like you do not have unmapped contigs. There is some evidence that having them helps a little, but not having them is not a big problem either.

    -Sheila

  • GiusyGiusy Posts: 9Member

    Hello! Thank you very much for your explanation. Actually I'm using the hs37d5 reference genome, which should contain also the decoy contigs. How can I find out if I have unmapped contigs in my dataset? when I downloaded my intervals list for the SureSelect All Exons Kit v5 they didn't give any info about unmapped contigs, so I guess they shouldn't be present in my sequences, hence whether I have the decoy or not in the reference genome should not make a difference for me, isn't it?

  • KurtKurt Posts: 161Member ✭✭✭

    You are going to have non-specific hybridizations (i.e. not all of your sequences are going to map to within your targets/baited intervals, the percentage has to do with how well your capture efficiency is). the decoy mainly affects non-coding regions, but there are a few exceptions. CDC27 is one. I believe a few of the MUC genes are as well (some of the reads will end up mapping to the decoy sequence as opposed to be forced onto CDC27, MUC, etc). So even though they are not targeted by your capture product it's not a bad idea when aligning your sequencing data to a reference genome to use the hs37d5 reference which has the unlocalized contigs, decoy sequence as they act as a sink for non-specific reads and removes false positive variants in those genes.

  • liuxingliangliuxingliang SingaporePosts: 4Member

    Hi all, before I read this topic, I did use -L in my PrintReads step which made the rate of mapped reads in my recalibrated bam very low, I think this is so-called "lost data", after I remove -L, the result seems to become normal. Many thanks to all of you. But I am still not very clear why this could happen? Why the data will lose when I use -L in PrintReads.

    Thank you all~

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,418Administrator, GATK Developer admin

    @liuxingliang‌ Please see this document for an explanation of what the -L argument does.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.