
When should I use -L to pass in a list of intervals?

Sheila (Broad Institute) · Member, Broadie, Moderator, Dev admin
edited July 26 in FAQs

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

-L 20 (for chromosome 20 in the b36/b37 builds, which use unprefixed contig names)

-L chr20:1-100 (for chromosome 20, positions 1-100, in the hg18/hg19 builds)

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. Colons in contig names are a new feature of GRCh38 relative to prior assemblies. This has implications for the -L option, because GATK also uses the colon as the delimiter between the contig name and the genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.
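For example, a rough sketch of excluding chromosome Y and a hypothetical list of centromeric regions during calling might look like the following (reference, BAM, and interval file names are placeholders; chrY assumes hg19-style naming, for b37 you would use -XL Y):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample.bam \
    -XL chrY -XL centromeres.interval_list \
    -o output.vcf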

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). There is no need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument (--interval_padding / -ip), as sketched below. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
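For instance, a minimal sketch of the calling step with capture targets and 100 bp of padding (the file names here are hypothetical placeholders):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample.bam \
    -L exome_targets.bed -ip 100 \
    -o sample.vcf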

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool | -L? | Why / why not
RealignerTargetCreator | YES | Faster, since RTC only looks for regions that need realignment within the input intervals; no time is wasted on the rest.
IndelRealigner | NO | IR only realigns the regions output by RealignerTargetCreator, so there is nothing to be gained by providing the capture targets again.
BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads | NO | The output is a BAM file; using -L would lead to lost data.
UnifiedGenotyper / HaplotypeCaller | YES | We are only interested in making calls in exome regions; the rest is a waste of time and yields lots of false positives.
Next steps | NO | No need, since subsequent steps operate on the callset, which was restricted to the exome at the calling step.
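To make the first two rows of the table concrete, here is a rough sketch of the realignment pair (file names are placeholders; the -ip padding is optional but should match whatever you use at the other -L steps):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R reference.fasta -I sample.bam \
    -L exome_targets.bed -ip 100 \
    -o realigner.intervals

java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R reference.fasta -I sample.bam \
    -targetIntervals realigner.intervals \
    -o sample.realigned.bam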

Small targeted experiments

The same guidelines as for whole exome analysis apply, except that you should not run BQSR on small datasets; there is typically not enough data to build a reliable recalibration model.

Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain only the data from that interval. This is really useful when you're trying to figure out what's going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what the effect of changing a parameter would be (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you'd use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).
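As a sketch (the coordinates, file names, and -minPruning value below are purely illustrative), such a debugging run might look like:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample.bam \
    -L 20:10000000-10002000 \
    -minPruning 3 \
    -o debug.vcf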


Comments

  • rxy712

    This is very helpful! Thank you very much. I am running GATK on whole exome sequencing data, using BAM files downloaded from TCGA, and I am not sure how to make the intervals file for the -L option. I saw that some people use the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) to generate a BED file containing all the exons plus 10 bp at each end; it is based on the hg19 assembly. Would that be a good way to go? I do not see any exome interval file in the GATK ftp bundle, although it has been mentioned in the forum. I also saw that GATK accepts BED-style interval lists, but the warning says that this format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1. I do not know how this fits into the story if I use the UCSC Table Browser to generate a BED file. Since I add 10 bp at each end, will there be no problem? The output should still be 1-based because the annotation is based on the reference FASTA file, human_g1k_v37.fasta, instead of the intervals, right?

    Any input would be very appreciated!

  • Geraldine_VdAuwera

    BED files are always 0-based, that's how the format is specified. It isn't affected by what reference the file corresponds to.

    From http://bedtools.readthedocs.org/en/latest/content/general-usage.html:

    start - The zero-based starting position of the feature in the chromosome.
    The first base in a chromosome is numbered 0.
    The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.

    If you use a program or website that generates a BED file natively, that's fine; the intervals will be read and interpreted correctly by GATK. The warnings are meant for users who generate interval files with their own scripts, to remind them to define intervals appropriately.

    But you should pay attention to what reference you are choosing. The UCSC file will be based on the hg19 build, but the human_g1k_v37.fasta file you are referring to is the b37 build. Have a look at our documentation on reference files if this is confusing to you. You can find both the hg19 and the b37 build files in our resource bundle.

    Geraldine Van der Auwera, PhD

  • rxy712

    Thank you very much for your explanation! You are right about the reference issue. The problem that I ran into is that GATK does not accept the UCSC version of intervals (hg19 build) when I use the human_g1k_v37.fasta reference (b37 build). But I have to use the human_g1k_v37.fasta reference, because the bam files from TCGA originally used that version. I am wondering if there is any way to solve this problem. It is simple to change chr1 to 1 for the chromosomes in the interval file, but it would be problematic for others like chr6_apd_hap1, chr6_cox_hap2, chrUn_gl000211, chr1_gl000191_random, etc. Or is there any other website for generating exome.intervals based on b37 build?

    ERROR MESSAGE: File associated with name exome_intervals.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary
  • Geraldine_VdAuwera

    I don't know the non-chromosomal contigs of hg19 well enough to really answer that. I think they're not represented by equivalents in b37 but I could be wrong. Do you actually need those targets in your analysis? What intervals did the TCGA study use?

    Geraldine Van der Auwera, PhD

  • rxy712

    Thanks a lot for the information. I am also thinking that the non-chromosomal contigs may not be that important, so I plan to remove them from the interval file and give it a try. I do not know what intervals the TCGA study used. Thank you!

  • Giusy

    Hi there! I have a similar problem to the one rxy712 had, but I'm afraid I can't do without the -L argument. I'm performing an analysis on exome sequencing data, so I need to restrict it to the target regions. My problem is that I had to use the b37 build for the alignment, but the .bed file of target regions (provided by Agilent) is in the UCSC format, so the program gives me an error about no overlapping contigs. Is there a way to rename the contigs in my .bed file to match the b37 reference? And if there isn't, will there be consequences for my analysis if I just decide not to use the -L argument? Thank you!

  • Sheila (Broad Institute)

    @Giusy‌

    Hi,

    It is a good idea to use intervals to speed up runtime, but you must use the correct intervals.

    If your dataset does not include mitochondrial contigs or unmapped contigs, then you can simply rename the contigs to match the b37 build. For example, you would change chr1 to 1 in the BED file (a minimal sketch of this rename follows this comment).

    I hope this helps.

    -Sheila
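    As a rough sketch of the rename Sheila describes (assuming a plain, tab-separated BED file with UCSC-style chr-prefixed names and no haplotype or unplaced contigs; note that chrM would additionally need to become MT in b37, and the hg19 and b37 mitochondrial sequences differ, so treat mitochondrial intervals with caution):

    sed -e 's/^chrM/MT/' -e 's/^chr//' targets_hg19.bed > targets_b37.bed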

  • Giusy

    Hello, Sheila! That was a great help, thank you. Actually, before your reply I tried the sed command without much hope, but it seems to have worked well, and now everything is running smoothly. I don't have mitochondrial contigs in my data, but I'm not sure what you mean by unmapped contigs. Is it a separate file that should come out of BWA? In any case, if I understand correctly, if I did have unmapped contigs it would not be sufficient to use the sed command and I would still be stuck, am I right? Thanks again!

  • Sheila (Broad Institute)

    @Giusy‌

    Hi,

    Unmapped contigs are parts of the genome sequence for which people have not yet figured out where they belong on the chromosomes. They are put in the reference build as "decoy" contigs that collect reads that do not map anywhere else on the chromosomes, to prevent them from getting mapped with low confidence to the canonical chromosomes. There are versions of the reference build that include them and some that don't.

    It sounds like you do not have unmapped contigs. There is some evidence that having them helps a little, but not having them is not a big problem either.

    -Sheila

  • Giusy

    Hello!
    Thank you very much for your explanation. Actually, I'm using the hs37d5 reference genome, which should also contain the decoy contigs. How can I find out whether I have unmapped contigs in my dataset? When I downloaded my intervals list for the SureSelect All Exons Kit v5, they didn't give any info about unmapped contigs, so I guess they shouldn't be present in my sequences; hence, whether or not I have the decoy in the reference genome should not make a difference for me, right?

  • Kurt

    You are going to have non-specific hybridizations (i.e. not all of your sequences are going to map within your targeted/baited intervals; the percentage depends on your capture efficiency). The decoy mainly affects non-coding regions, but there are a few exceptions: CDC27 is one, and I believe a few of the MUC genes are as well (some reads end up mapping to the decoy sequence instead of being forced onto CDC27, MUC, etc.). So even though those genes are not targeted by your capture product, it's not a bad idea to align your sequencing data to the hs37d5 reference, which has the unlocalized contigs and decoy sequence; they act as a sink for non-specific reads and remove false-positive variants in those genes.

  • liuxingliang

    Hi all, before I read this topic I used -L in my PrintReads step, which made the rate of mapped reads in my recalibrated BAM very low; I think this is the so-called "lost data". After I removed -L, the result seems to be normal. Many thanks to all of you. But I am still not very clear on why this happens: why is data lost when I use -L in PrintReads?

    Thank you all~

  • Geraldine_VdAuwera

    @liuxingliang‌ Please see this document for an explanation of what the -L argument does.

    Geraldine Van der Auwera, PhD

  • chenyu600

    Hi,

    I have a few questions about the usage of -L. Currently, I use version 3.2 on exomes.

    In order to speed up, I run HaplotypeCaller with the -GVCF option and GenotypeGVCFs on split chromosomes, and then merge them into sample.g.vcf and sample.vcf, respectively. I use -L chr (for example, chr1) when running HaplotypeCaller to get more variants, and then -L chr1.bed when running GenotypeGVCFs. The reason I use -L chr1 is that I would like to get all variants, even though I know that sites called outside the target region may have a high false-positive rate; the results from this step may be used together with other data captured by a different chip. And -L chr1.bed (which is exactly the target region) is to get more accurate genotypes.

    My question is: would this combination of commands produce the same result as using -L chr1.bed for both HaplotypeCaller and GenotypeGVCFs? And would it be the same as the result produced by running on the per-sample BAM (rather than per chromosome)?

    Thanks a lot!

  • Geraldine_VdAuwera

    Hi @chenyu600,

    The problem with using -L on GVCFs that were generated without an intervals file is that if you have an interval that starts in the middle of a reference block, the entire block will be skipped, which is going to cause missing records. It would be better to run GenotypeGVCFs on the entire GVCFs, then subset the sites of interest using SelectVariants. Including regions outside the capture targets will not affect the calls within the targets.

    Geraldine Van der Auwera, PhD

  • chenyu600

    Thanks for the quick response @Geraldine_VdAuwera !

    I am now clear about the -L issue. But I am still not sure whether running HaplotypeCaller and GenotypeGVCFs at different levels (per-sample vs. per-chromosome) would cause any problem?

    Thanks again!

  • chenyu600

    Hi @Geraldine_VdAuwera,

    I have just discussed the usage of -L with my colleague, but we disagree on some points and would like to clear them up. Would you please give us more detail on a few questions?

    1. What does "the reference block" refer to?

    2. Say we have three exome datasets. For exome1 and exome2, there is no variant in chr1:1-5000, and exome3 has a SNP (A->G, for example) at chr1:3000. If the capture chip we used did not cover chr1:1-5000, would this record (chr1 3000 . A G) be missing from the final output of GenotypeGVCFs? Still using -L chr1 for HaplotypeCaller with the -GVCF option and -L chr1.bed for GenotypeGVCFs.

    3. If the record (chr1 3000 . A G) is emitted in the final VCF generated by GenotypeGVCFs, what would the record be like for exome1 and exome2?

    Thank you!

  • Geraldine_VdAuwera

    @chenyu600 ,

    Both HaplotypeCaller and GenotypeGVCFs can be run per-chromosome. It does not matter if you do it for just one or both.

    Regarding reference blocks, please read the FAQ article on GVCF files.

    In the GVCFs, you will get records representing every single site that is included in your intervals file. If there is no usable data at a site, the record will be a no-call. In the final joint-genotyped VCF, you will only get records at variant sites unless you enable the option to emit non-variant sites.

    A site is considered variant if at least one sample is variant.

    Geraldine Van der Auwera, PhD

  • aruncoorg

    Hello, I have a question about the -L flag. Since you mentioned that the coordinates are zero-based, for the first ten thousand bases I should be using something like -L chr1:0-10000, right? When I tried this, HaplotypeCaller gave me an error: MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect: The start position 0 is less than 1. My understanding is that if I use -L chr1:1-10000, it will miss the first base. How should I specify the interval in this case? Thanks for any help!

  • thibault (Broad Institute)

    Intervals in the standard GATK format are 1-based, so you are not missing the first base with -L chr1:1-10000.

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

  • aruncoorg

    I am confused: why does the second comment say that it is zero-based? I am using BEDtools to generate my intervals. By default, it generated -L chr1:0-10000.

  • thibault (Broad Institute)

    The GATK uses 1-based intervals internally.

    The comment above is referring to BED files. If you pass in a bed file (like -L intervals.bed) then it will work correctly, even though BED's format is 0-based.

    When passing intervals directly as text (what you are doing here, like -L 1:123), the GATK assumes you are using its standard 1-based format (see the small example after this comment).

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute
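    To make the two conventions concrete, the following refer to the same first 10,000 bases of chromosome 1 (the contig name is just illustrative; BED fields are tab-separated):

    In a BED file (0-based start, half-open):
    chr1    0    10000

    As a GATK interval string on the command line (1-based, inclusive):
    -L chr1:1-10000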

  • AJWillsey

    I understand that using -L with an intervals file will restrict the analysis to the provided intervals.

    However, does using -L result in different treatment for reads that span the boundaries of an interval? Are they excluded or kept?

    For example, if an interval spans chr1:150-300, and a read maps to chr1:100-250, would this read be 'counted' when determining variants at positions and/or read depth within the interval chr1:150-250?

  • Sheila (Broad Institute)

    @AJWillsey
    Hi,

    No, the read would not be counted. However, we do recommend using interval padding to ensure all reads within and outside your -L intervals will be used. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_padding

    -Sheila

  • AJWillsey

    Thanks @Sheila, this is really helpful.

    The reason I ask is that I have created a consensus intervals file, which contains the intervals in common between 3 exome capture arrays (it should be equivalent to using -L for each intervals file together with -isr INTERSECTION). I have done this so that I can accurately compare rates of variation across samples prepared with different arrays. My concern is that if I use interval padding, I will bias some capture arrays over others; i.e. a 'larger' capture array may have more reads spanning the intervals in the consensus file. At the same time, I don't want to lose reads unnecessarily, as this may affect variant calling at exon boundaries. Any thoughts?

    Thanks again

  • aneek

    Hi,
    I have a specific question about which BED file I should use with the -L option. We are doing whole exome sequencing on the Illumina platform, and the exome capture was performed by the service provider using the Agilent SureSelectXT V5 capture kit. From the Agilent website I have downloaded the S04380219_SureSelectV5+5UTR folder, which contains four different BED files and two text files, namely:

    1. S04380219_AllTracks.bed
    2. S04380219_Covered.bed
    3. S04380219_Padded.bed
    4. S04380219_Regions.bed
    5. S04380219_Targets.txt
    6. S04380219_Probes.txt

    Now I am confused about which BED file to pass as the intervals list with the -L option. Can you suggest which file I should use in my analysis?

  • Sheila (Broad Institute)

    @AJWillsey
    Hi,

    I don't think that should be a concern. When you use the new joint calling pipeline, if any sites in a sample don't have coverage but some other samples have coverage and a variant call, the sample with no coverage will show a no-call so you know there was no information there. https://www.broadinstitute.org/gatk/guide/article?id=3893

    I hope this answers your question.

    -Sheila

  • Sheila (Broad Institute)

    @aneek
    Hi,

    Unfortunately, I don't think I can answer that question. You probably have to ask Agilent to find out what the differences are.

    -Sheila

  • aneek

    @Sheila

    Thanks for the quick response. I shall enquire with the Agilent people regarding this.

  • Ivo

    Hi,

    I am looking to find out which reads exactly are filtered out when I use -L. Is filtering based on whether the starting position of a read is outside the intervals in the interval_list? To use AJWillsey's example above: if an interval spans chr1:150-300, the read chr1:100-250 would be filtered out. Read chr1:299-449 would not be filtered out? What about read chr1:300-450?

    Is there any documentation (or answered forum question) on this topic that you can point me to?

  • Geraldine_VdAuwera

    @Ivo The rule is different for reads and variants. When working with reads, any reads that extend into the intervals specified with -L will be included, even if they start before the interval. When working with variants, only variants that have their start position within the interval will be included.

    Geraldine Van der Auwera, PhD

  • Sheila (Broad Institute)

    @AJWillsey
    Hi,

    I have to apologize: I misled you earlier. It turns out that as long as any part of a read is contained in the interval you specify, it will be used by the tools. I told you earlier that it would not be. Sorry.

    -Sheila

  • mglclinical

    Hi @Sheila,

    I see the importance of padding for RealignerTargetCreator and BaseRecalibrator, which could be useful when indels sit at the border of an exome target capture region.

    You also mentioned that "if you do use it, you should do it consistently at all steps where you use -L". If I use padding for UnifiedGenotyper or HaplotypeCaller, wouldn't I end up with extra variant calls in my VCF that are not actually covered by my exome target capture region?

    Thanks,
    mglclinical

  • Geraldine_VdAuwera

    @mglclinical, this is true, but you can always trim them out afterwards by using the -L intervals with e.g. SelectVariants (a minimal sketch follows this comment). This is a minor inconvenience considering that the padding improves the results of the upstream steps.

    Geraldine Van der Auwera, PhD
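    As a rough sketch of that final trimming step (file names are placeholders):

    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V raw_calls.vcf \
        -L exome_targets.bed \
        -o calls.on_target.vcf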

  • mglclinical

    Thanks @Geraldine_VdAuwera, I will use SelectVariants with the -L exome_capture_targets.bed file to filter the variants in the VCF.

  • shangzhong0619

    Hi,
    The data I have is whole genome sequencing, but I'm now only interested in around 100 genes.
    I am applying GATK to an organism that doesn't have a SNP database like dbSNP for humans, so I'm using the recommended pipeline: 1st round of variant calling --> hard-filter to manually get high-confidence variants as a gold standard --> base recalibration --> 2nd round of variant calling. My question is:
    In the 1st round of variant calling, can I set -L? If -L is set I will only get variants in those targeted genes' regions; in that case, will the recalibration of the reads covering the target regions be affected? Thanks

  • Sheila (Broad Institute)

    @shangzhong0619
    Hi,

    If you have WGS data, it is best to not use -L. BaseRecalibrator depends on many reads, and it will not perform well if you don't have enough data. You can do the pipeline without using -L; once you have finished the pipeline, you can use SelectVariants to select the intervals you are interested in.

    -Sheila

  • shangzhong0619

    @Sheila
    Thanks for your reply. The problem is that the organism we are working with has over 50,000 scaffolds, so HaplotypeCaller is very slow. Based on your answer, if I extract the scaffolds of interest to use as the reference, I would still have the same problem, because only a small portion of the reads would map, correct? I guess the way to improve the speed is to parallelize the analysis across scaffolds and then merge the results.

  • Geraldine_VdAuwera

    @shangzhong0619 There are some other (better) ways to proceed.

    For the pre-processing steps, you just can't get around base recalibration's need for big data, so you need to give it all the data at once. Once you have finished pre-processing everything, then sure, you can run HaplotypeCaller with -L.

    If even the pre-processing steps are slow because of the number of scaffolds in your reference, then you may want to consider assembling some artificial scaffolds to reduce the number of separate contigs you are working with.

    Geraldine Van der Auwera, PhD

  • shangzhong0619
    edited February 19

    @Geraldine_VdAuwera
    Thanks for the suggestion. I wonder if I can just merge all the scaffolds into one long scaffold separated by runs of N? I can make the runs of N longer than the allowed gap length in bwa, so they will not be considered as inserts, and I can also change the GFF file accordingly. Is there any downside to doing this?

  • Geraldine_VdAuwera

    Yes, that's essentially what I meant, although you may want to consider making a few separate contigs as opposed to a single one. IIRC the BAM index format imposes a limit of 512 Mbp for a single contig sequence.

    Geraldine Van der Auwera, PhD

  • corlagon

    Hi,
    I'd like to know if you have any experience with, or recommendations on, using -L together with variant calling on human RNA-seq data. I know that there are probably still some unannotated genes or isoforms in the genome which I would discard with such a list, but I actually don't care about those at the moment. Beyond that, it seems reasonable to me to provide a "gene interval list".
    Any thoughts?
    Cheers

  • Geraldine_VdAuwera

    @corlagon We've never looked into that ourselves, but if you're only interested in getting variant calls from annotated gene intervals that should be completely fine, after the SplitNTrim step.

    Geraldine Van der Auwera, PhD

  • mscjulia

    Thanks for the explanation. I'm just wondering about doing this for VQSR, because the Best Practices state that "...it requires a large callset (minimum 30 exomes, more than 1 whole genome if possible)." I also noticed in some previous discussions that you seemed to be opposed to the idea of using a single chromosome. So, to my understanding, it sounds like if I use the -L option but input the WGS dataset, it's okay; but if I only input oneChromosome.bam to VQSR, it is biased. Is that correct, please? Thank you very much.

  • Geraldine_VdAuwera

    @mscjulia, it all depends what's in your interval list. It could be just one gene out of the entire genome. Or it could be everything in the genome except the really troublesome regions. Depending on that the conclusion is completely different. Basically you need to maximize the number of variants that will be available for building the VQSR model.

    Geraldine Van der Auwera, PhD

  • d0ct0r

    @Sheila @Geraldine_VdAuwera I have a question in relation to the -L option. Recently I was working with ~1500 exomes. I followed the Best Practices guidelines, used the -L option to specify the target intervals of the exome capture kit, and generated the g.vcf files. Then I split the g.vcf files chromosome-wise using the -L option and finally combined them into batches of ~50 multi-sample g.vcf files. So I used chromosome numbers to split them and then used the capture intervals during the GenotypeGVCFs step. My question: is it OK to simply use chromosome numbers to split the g.vcf files, or should I split the capture interval file into chromosomes and use those to split the g.vcf files with the SelectVariants walker?

  • Sheila (Broad Institute)

    @d0ct0r
    Hi,

    Have a look at this article. If you called variants on only the capture regions, only those regions will be in your GVCFs. If you want to save time, you can run per chromosome, but you don't need to specify the capture regions anymore.

    -Sheila

  • shlee (Broadie, Moderator)

    Hi @d0ct0r,

    Assuming you are using some version of GATK3:

    Because you generated the g.vcfs using the capture intervals, only those genomic intervals should be present in your variants and reference blocks. It is safe for you to simply use -L with chromosome numbers at the GenotypeGVCFs step to generate your per-chromosome multi-sample VCFs (a minimal sketch follows this comment). This should take about the same time to process as using your more finely defined interval list.

    If you are using GRCh38, and depending on the version of GATK3 you are using, I may have an additional comment. So let me know.
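    As a rough sketch of that per-chromosome GenotypeGVCFs invocation (file names are placeholders; repeat -V for each batch of combined g.vcfs):

    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
        -R reference.fasta \
        -V batch1.g.vcf -V batch2.g.vcf \
        -L 20 \
        -o chr20.vcf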

  • d0ct0r

    @Sheila Thanks for the message. Just to confirm: you mean that it's not necessary to specify the regions (-L) at the GenotypeGVCFs step, right?
    @shlee Thanks for the message. You are right, I am using GATK version 3.6, and for the reference I am using GRCh37 for now. But I am interested to know the additional comment you have in case the reference genome is GRCh38.

  • shlee (Broadie, Moderator)

    @d0ct0r If it's been a few days since you've read the original document above, then you'll notice a new section I recently updated that starts with

    Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6.

    Please let me know if what is outlined is unclear.

  • d0ct0r

    @shlee Thank you for the info.

    -Veera

  • Sheila (Broad Institute)

    @d0ct0r
    Hi Veera,

    Yes, that is correct.

    -Sheila

  • d0ct0r
    edited July 28

    @Sheila Thank you. Just for the record, here's what I did. After obtaining the individual g.vcf files for each of the 1500 exomes, I split the g.vcfs into individual chromosomes (1-22, X, Y, MT), combined them chromosome-wise in batches of 50, and then combined all batches into per-chromosome multi-sample g.vcf files. Finally, I genotyped them chromosome-wise and combined the resulting VCFs into a single file. This approach was really fast. Thanks again.

    -Veera

  • Lisam

    Hi,

    I noticed something unusual with the -L argument of HaplotypeCaller. When I look at a specific position (here chr1:32372389) I get a homozygous alternate genotype:

    chr1 32372389 . C A, 2201.77 . BaseQRankSum=-1.938;ClippingRankSum=0.000;DP=558;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQRankSum=0.000;RAW_MQ=1395000.00;ReadPosRankSum=3.662 GT:AD:DP:GQ:PL:SB 1/1:6,81,0:87:35:2230,35,0,2322,244,2531:5,1,20,61

    Yet, when I launch HaplotypeCaller on an interval (+/- 150 bp around the position), I get a homozygous reference genotype:

    chr1 32372389 . C . . . GT:AD:DP:GQ:PL 0/0:38,287:325:0:0,0,0

    I used HaplotypeCaller with the following arguments :
    -mbq 20 \
    -stand_call_conf 20 \
    -stand_emit_conf 20 \
    --minDanglingBranchLength 1 \
    --minPruning 1 \
    -ERC BP_RESOLUTION \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -L chr1:32372389 (or chr1:32372239-32372539 for the interval)

    The same arguments were used to obtain the two vcf files (except of course for the -L argument ).

    I don't understand why I get two different GTs by changing the interval. What's more, when I try the same method but with -ERC GVCF and GenotypeGVCFs, the position isn't detected anymore in the VCF for the region of +/- 150 bp around chr1:32372389 (it is still present in the VCF produced by looking specifically at position chr1:32372389). I suppose GenotypeGVCFs drops position chr1:32372389 since its genotype is 0/0 when called over the interval, but I am not sure this explanation is correct and I would like your opinion.

    Thanks in advance!

    Lisa

  • Sheila (Broad Institute)

    @Lisam
    Hi Lisa,

    The intervals affect the reads that are used in the calling, so inputting different intervals can cause some different calls; however, those are usually edge cases.

    In your case, however, HaplotypeCaller needs some padding around the site to be able to call variants properly. We usually set -ip 100. Remember, HaplotypeCaller does a reassembly step, so having more reads in the active region allows us to get a better reassembly.

    -Sheila

  • Lisam

    @Sheila

    Hi Sheila!

    First, thanks for your answer and sorry for the delay! I launched HaplotypeCaller with the -ip argument set to 200, but it didn't result in a 1/1 genotype for position chr1:32372389. I still get:

    chr1 32372389 . C . . . GT:AD:DP:GQ:PL 0/0:38,287:325:0:0,0,0

    I had a look at the input BAM file and the position is in the center of an exon (chr1:32372389 corresponds to the green covered position).

    [image]

    Furthermore, when I look for the specific positions I want (3595 positions in total), either by specifying the position itself or an interval around it, I get the following results: the wider the interval, the fewer variants I get.

    For instance, you can find below the Venn diagram between the positions I am looking for, the positions detected locally (the blue/grey circle at the center), and the ones detected with an interval (here +/- 150 bp, corresponding to the "global" circle):

    [image]

    When I look at the same diagram for an interval of +/- 10 bp I get:

    [image]

    So for the +/- 10 bp interval I get 522 of the variants I am looking for, while for +/- 150 bp I get only 441 of them. Note that I deleted the indels before producing the Venn diagrams.

    I didn't include all the diagrams, but I did the same for +/- 50 bp and +/- 1000 bp and always observe that the larger the interval, the fewer variants I detect. There is, however, no proportionality between the size of the interval and the number of variants lost (when I go from +/- 50 to +/- 1000 I lose 30 variants, and when changing from +/- 50 to +/- 10 I gain 56 variants).

    I hope my explanations were not too messy!

    Thanks in advance,

    Lisa

  • shlee (Broadie, Moderator)

    Hi @Lisam,

    chr1 32372389 . C . . . GT:AD:DP:GQ:PL 0/0:38,287:325:0:0,0,0

    Your read depth is incredibly deep at 325 for this one site. Are you looking at RNA-seq data? Also, I wanted to look up the sequence context of this exon. Just to confirm, is this GRCh37?

  • Lisam

    Hi @shlee !

    I am indeed looking at RNA-seq data, and using GRCh37.

    I hope it can help!

    Lisa

  • shlee (Broadie, Moderator)

    Hi Lisa (@Lisam),

    I suspect your coverage is falling off towards the end of your padded region. Your site of interest, chr1:32372389, is in the 3' UTR of an RNA transcript. Not only that, it is immediately adjacent to a ~50 bp low-complexity sequence region that repeats the dinucleotide sequence AC as shown via IGV below:

    [image]

    The transcript is on the minus strand. So 5' to 3', your site of interest comes first, then the ~50 bp low-complexity region, and finally the annotated end of the transcript ~300 bp downstream at chr1:32372020.

    Even if your library originated from DNA, this region poses a challenge for calling variants because of the low complexity region. But your library is RNA and so any issues are exacerbated. For example,

    • The location of polyA addition isn't exact and varies between cells, different tissues (e.g. cell line vs. primary tissue), and different cellular states (e.g. quiescent vs. dividing), so some proportion of transcript ends could be well upstream (or even downstream) of the annotated end.

    • Degradation of transcripts from the 3' end is common. I'm not just talking about RNA degradation from sample preparation. I'm saying transcript turnover is actively going on in cells. What we capture via RNA-seq is some equilibrium snapshot. I'm certain somewhere out there the literature records the relative turnover rate for PTP4A2 for some context.

    Compounding these issues are

    • library preparation methods whose coverage falls off in terms of capturing the very ends of transcripts,
    • higher rates of PCR error for such regions no matter the complexity but especially for low-complexity regions and their surrounds,
    • and the ambiguity of mapping reads containing such low-complexity sequence. This is what the equivalent region of your site of interest looks like in GRCh38:
      [image]
      Notice that the dinucleotide repeat region is in lowercase. I believe this means that you will find many regions in the genome with this sequence pattern, which is not surprising.

    For these and other reasons, variant calling on RNA-Seq data comes with this message. My tentative suggestions are (i) that you limit variant calling to regions that exclude the transcript edges and only include those regions for which your sample exhibits even coverage per transcript; and (ii) when it comes to low-complexity regions, be wary that any variants found in such regions will always be suspect, even with DNA-seq. Finally, consider leveraging sequence data originating from a PCR-free DNA library if calling variants for such challenging regions is important to your research.

  • Lisam

    Hi!

    Thanks for your answer @shlee! I will follow your recommendations :)

    Lisa
