Variants Called by GATK are outside the target interval regions?

mike Posts: 103 Member

Hi,

This question was raised by my colleague, who works downstream with the variants called by GATK. She used a program called ANNOVAR to annotate the variants derived from GATK calls on our exome-seq data. She then saw that many of the variants are annotated as lying in 5' or 3' UTRs, intronic regions, or even intergenic regions. However, when I called the variants with GATK, I used the -L option at the UnifiedGenotyper step with the BED file downloaded directly from Agilent (I used Agilent's enrichment kit), which is supposed to restrict the variants to exons or regions close to exons. Variants in UTRs or intronic regions may be understandable, but variants in intergenic regions are rather odd, aren't they? Has anybody made a similar observation, or is something wrong with ANNOVAR or with our local ANNOVAR setup? I hope GATK did not introduce these oddities and faithfully calls variants only within the target intervals as defined. Or could Agilent's target enrichment regions extend into intergenic regions?

Any comments or insights are appreciated!

Thanks

Mike

Answers

  • Geraldine_VdAuwera Posts: 6,227 Administrator, GATK Developer admin

    Hi Mike,

    I would recommend you take a look at the Agilent intervals and compare them with the exonic intervals you're interested in. For various technical reasons, the target intervals used in exome capture kits don't necessarily start and stop exactly at the same positions as the exons. I think you'll find that the regions covered in the target intervals occasionally include even intergenic regions.
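The comparison suggested above can be sketched in a few lines of Python. This is only an illustration, not a GATK or Agilent tool: it assumes both the kit targets and the exon annotation are available as BED-style lines (0-based start, end exclusive), and the function names `parse_bed`, `merge`, and `bases_outside` are hypothetical helpers made up for this example.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, end exclusive)."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header lines some vendor BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def merge(intervals):
    """Merge overlapping or adjacent intervals and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def bases_outside(targets, exons):
    """Count target bases that are not covered by any exon interval."""
    outside = 0
    for chrom, target_list in targets.items():
        exon_list = merge(exons.get(chrom, []))
        for t_start, t_end in merge(target_list):
            covered = 0
            for e_start, e_end in exon_list:
                overlap = min(t_end, e_end) - max(t_start, e_start)
                if overlap > 0:
                    covered += overlap
            outside += (t_end - t_start) - covered
    return outside


# Made-up coordinates, purely for illustration.
example_targets = parse_bed(["chr1\t100\t200", "chr1\t500\t600", "chr2\t0\t10"])
example_exons = parse_bed(["chr1\t120\t180", "chr1\t550\t700"])
print(bases_outside(example_targets, example_exons))
```

A large count of target bases outside the exon annotation would suggest that the kit design itself reaches beyond the exons. The nested loop is quadratic per chromosome, which is fine for a quick check but slow for very large annotations.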

    Geraldine Van der Auwera, PhD

  • mike Posts: 103 Member

    Hi, Geraldine:

    Thanks for the input, which is very helpful.

    I understand that, for technical reasons or design considerations, the kit might occasionally extend into intronic regions or 5' or 3' UTRs. However, we got up to 5k variants (SNPs) in intergenic regions. Some of them may be caused by pseudogenes or by redundant/duplicated elements near gene exons that are spread throughout the genome, but the number seems too large to be plausible. I talked to an Agilent representative, who kindly told me that their design is supposed to cover all the exons and should not extend that far into other areas such as intergenic regions. This alerted us that the cause might be either bad annotation databases/files from ANNOVAR (or a processing issue in our internal pipeline that sets up ANNOVAR) or an issue with GATK SNP calling using UnifiedGenotyper with the -L option. I confirmed with Agilent that the BED file I used is the right one, covering all the probes of the enrichment kit we used.

    So I just want to make sure of one fact: if I use -L with the BED file in UnifiedGenotyper, the variants in the resulting VCF file should lie only within the regions specified by that BED file (possibly off by 1 in the worst case, because of the 0-based vs. 1-based difference), right? I know that in some GATK steps, VQSR for example, the good variants are just flagged with "PASS" and the bad ones are still left in the VCF file. I am not sure whether the same applies to UnifiedGenotyper with the -L option: would variants outside the regions defined by the -L BED file still be left in the resulting VCF file?

    Thanks again,

    Mike

  • ebanks Posts: 683 GATK Developer mod
    edited December 2012

    Yes, if you used -L (correctly) then the variants emitted can only come from regions defined by your bed file. Just to confirm, in your UG command line you specified just that one single -L? I.e. -L targets.bed and not -L targets.bed -L chr20
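One way to check this empirically is to test every VCF record against the BED intervals. The sketch below is a plain-Python illustration, not part of GATK; `load_targets`, `in_targets`, and `off_target_records` are hypothetical names, and it assumes the standard conventions that BED starts are 0-based with an exclusive end while VCF POS is 1-based. Only the POS column is checked, so records that start outside an interval but span into it (for example long deletions) would be reported as off-target. Multiple -L arguments are combined as a union by default in GATK, which is presumably why the question about an extra -L chr20 matters.

```python
import bisect
from collections import defaultdict


def load_targets(bed_lines):
    """Parse BED lines into {chrom: sorted, merged [(start, end), ...]}.

    Coordinates stay in BED convention: 0-based start, end exclusive.
    """
    raw = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    targets = {}
    for chrom, intervals in raw.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        targets[chrom] = merged
    return targets


def in_targets(targets, chrom, pos):
    """Return True if the 1-based VCF position lies inside a target interval."""
    intervals = targets.get(chrom, [])
    zero_based = pos - 1
    # Find the last interval whose start is <= the position.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def off_target_records(vcf_lines, targets):
    """Collect (chrom, pos) for VCF data lines whose POS is outside all targets."""
    off = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        if not in_targets(targets, chrom, int(pos)):
            off.append((chrom, int(pos)))
    return off


# Made-up records, purely for illustration.
example_targets = load_targets(["chr1\t100\t200"])
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t101\t.\tA\tG",
    "chr1\t201\t.\tC\tT",
]
print(off_target_records(example_vcf, example_targets))
```

If this reports no off-target records on the real files, the intergenic annotations most likely come from the target design or the annotation step rather than from the calling step.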

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mike Posts: 103 Member

    Hi, Eric

    Thanks for the confirmation. Yes, I used only the first form, a single -L option: -L targets.bed.

    Thanks again

    Mike

  • vyellapa Posts: 29 Member
    edited January 2013

    If I need to filter variants from a whole-genome VCF to keep only those within certain regions (specified in a BED file), is there a way to do it? Again, I would like to filter the VCF and not go back to the variant calling step.

  • Geraldine_VdAuwera Posts: 6,227 Administrator, GATK Developer admin

    Yes, you just use the -L argument with the desired gene coordinates. If you run SelectVariants with -L <your gene coordinates>, you will get a new VCF file containing only the variants that are within the gene. If you use VariantFiltration, the variants within your gene will be annotated in the original VCF.
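As a rough sketch of the SelectVariants route, the snippet below assembles a command line in the GATK 2.x/3.x style (`-T SelectVariants` with `-R`, `-V`, `-L`, and `-o`). The jar location and all file names are placeholders, the helper `select_variants_command` is a hypothetical name, and other GATK versions use a different invocation, so treat it as a starting point to adapt rather than a command to copy verbatim.

```python
import shlex


def select_variants_command(reference, vcf_in, intervals, vcf_out,
                            gatk_jar="GenomeAnalysisTK.jar"):
    """Build an argument list that subsets a VCF to the given intervals.

    All paths are placeholders; `intervals` can be a BED file or a
    coordinate string such as chr20:1000-2000.
    """
    return [
        "java", "-jar", gatk_jar,
        "-T", "SelectVariants",
        "-R", reference,   # reference FASTA used for calling
        "-V", vcf_in,      # whole-genome VCF to subset
        "-L", intervals,   # regions to keep
        "-o", vcf_out,     # new VCF with only the variants in those regions
    ]


cmd = select_variants_command("reference.fasta", "genome.vcf",
                              "targets.bed", "targets_only.vcf")
# Print a shell-ready command; subprocess.run(cmd, check=True) would execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps file names with spaces intact if it is later passed to `subprocess.run`.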

    Geraldine Van der Auwera, PhD

  • fjrossello Posts: 12 Member

    Hi Guys,

    > @mike said:
    >
    > So I just want to make sure of one fact: if I use -L with the BED file in UnifiedGenotyper, the variants in the resulting VCF file should lie only within the regions specified by that BED file (possibly off by 1 in the worst case, because of the 0-based vs. 1-based difference), right? I know that in some GATK steps, VQSR for example, the good variants are just flagged with "PASS" and the bad ones are still left in the VCF file. I am not sure whether the same applies to UnifiedGenotyper with the -L option: would variants outside the regions defined by the -L BED file still be left in the resulting VCF file?

    Related to what @mike said about the target lists in BED format provided by kit vendors (see the quotation above): do we need to account for the 0-based nature of these BED files and convert them to 1-based, or does GATK recognize them as 0-based and automatically account for this? This is not completely clear to me from the description at http://www.broadinstitute.org/gatk/guide/article?id=1204, which states: "Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1."

    Thanks in advance for your support.

    Kind regards,

    Fernando

  • Geraldine_VdAuwera Posts: 6,227 Administrator, GATK Developer admin

    Hi Fernando,

    If the BED file comes from the vendor and assuming the vendor did their job properly, you should be able to use the file as is. The warning is really more about BED files that you might make yourself, to remind you to keep the indexing mode in mind when designing intervals.
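For anyone building a BED file by hand, the indexing difference comes down to shifting the start coordinate by one. The sketch below assumes the standard BED convention (0-based start, end exclusive) and a 1-based, inclusive `chrom:start-end` interval string; the two helper names are hypothetical and exist only for this example.

```python
def bed_to_one_based(bed_line):
    """Convert a BED line (0-based start, end exclusive) to chrom:start-end (1-based, inclusive)."""
    chrom, start, end = bed_line.split()[:3]
    # Only the start shifts; the exclusive BED end equals the inclusive 1-based end.
    return f"{chrom}:{int(start) + 1}-{int(end)}"


def one_based_to_bed(chrom, start, end):
    """Convert a 1-based, inclusive interval to a tab-separated BED line."""
    return f"{chrom}\t{start - 1}\t{end}"


# The first 1000 bases of chr20 are 0..1000 in BED and 1-1000 in 1-based notation.
print(bed_to_one_based("chr20\t0\t1000"))
print(one_based_to_bed("chr20", 1, 1000))
```

Forgetting the shift when writing a BED file from 1-based coordinates moves every interval start one base too far to the right, which is the off-by-one case the documentation warning refers to.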

    Geraldine Van der Auwera, PhD

  • fjrossello Posts: 12 Member

    Hi Geraldine,

    Thanks for your prompt reply. I assumed that was the case, but I wanted to make sure I was on the right path.

    Thanks for your kind reply and support.

    Cheers,

    Fernando
