Does providing an exome target interval list overlook many non-targeted, high-quality SNPs?

crojo, California, Member
edited May 2014 in Ask the GATK team

Hi GATK Team,

I recently came across a paper that states that exome sequencing can generate high-quality SNPs in non-targeted regions, and even in regions far from the targets: ncbi.nlm.nih.gov/pubmed/22607156. I just completed variant calling on some exome data and used the kit manufacturer's exome target list to restrict the variant calling during the process. However, I wonder if, in retrospect, this was the best thing to do.

I will likely go back and redo the variant calling without an exome target interval list to see how many other SNPs (if any) we get. I wanted to post this reference here in case other GATK users (particularly those doing exome sequencing) find it interesting, and to ask whether the GATK team has any thoughts on not using exome interval lists during variant calling on exome data. Perhaps it's just a tradeoff between time and overall SNP count...

Anyway, thank you and keep up the good work.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @crojo,

    In a nutshell, we really don't recommend it.

    While it is true that you can get potentially usable variant calls from off-target sequence, this comes with some pretty big caveats, most of which boil down to this: how do these "freebie variants" fit into your experimental design? If you're doing exome analysis, presumably you're looking for variants in exons and applying correspondingly specific methods -- what do you do with off-target variants in between exons or in intergenic regions? Is it still the same analysis/experiment? I understand the temptation to squeeze every last iota of information out of your sequencing dollars, but generally speaking, I feel that modifying the scope of your experiment after you get the data, depending on what you got, is questionable practice.

    On top of that, there is little guarantee that you will get consistent coverage across different samples in the off-target regions, and so you may only be able to make calls for some samples and not for others. This limits your ability to get any actionable information out of them, and to compare your results to other datasets. So you get more SNPs -- but what do you do with them? Plus, those off-target variants will probably be enriched with false positives, and can pose additional challenges for filtering and QC'ing. Do you really want to deal with that, and are you sure what you get out of it is worth it?

    But maybe I'm being too rigid on this -- up to you to decide how you can best use your data of course.

    Caveat: the above rant does not apply to the flanking regions of capture intervals. It is well established (citation needed?) that you can get good calls in the regions immediately preceding and following the bait regions, and GATK provides an interval padding argument for the purpose of exploiting this. See the new FAQ article on using -L for some further details.
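
To make the flanking-region idea concrete, here is a minimal Python sketch of what padding a target list amounts to: each interval is extended by a fixed number of bases on both sides, and intervals that then overlap are merged. This is an illustration only, not GATK's implementation; the function name `pad_and_merge`, the BED-style 0-based half-open coordinates, and the example targets are all assumptions made for the sketch.

```python
from typing import List, Tuple

# (contig, start, end) in BED-style 0-based, half-open coordinates (an assumption).
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: List[Interval], padding: int) -> List[Interval]:
    """Extend each interval by `padding` bases on both sides, then merge overlaps.

    Hypothetical helper for illustration. It clips at position 0 but does not
    clip at contig ends, because it has no access to contig lengths.
    """
    padded = sorted(
        (contig, max(0, start - padding), end + padding)
        for contig, start, end in intervals
    )
    merged: List[Interval] = []
    for contig, start, end in padded:
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps or touches the previous interval on the same contig.
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


# Made-up targets: the first two come within 200 bases of each other,
# so 100 bases of padding on each side fuses them into one interval.
targets = [("1", 1000, 1200), ("1", 1350, 1500), ("2", 50, 300)]
print(pad_and_merge(targets, 100))
# [('1', 900, 1600), ('2', 0, 400)]
```

Note that contigs are sorted as strings here, so the output order will not match a reference dictionary's contig order for real data.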

  • Sumudu, Sri Lanka, Member
    edited June 20

    Hi,

    I have a whole-exome sequencing sample. To analyze it as per the GATK Best Practices guidelines, I downloaded 30 BAMs from 1000 Genomes that were aligned to the GRCh38 reference.

    I have a few concerns that I would like to clear up, and I would greatly appreciate any assistance in moving forward.

    1) When using an exome interval BED file, does it have to be based on the same reference that was used to generate the aligned BAMs?

    If so, this is likely the reason for the genome location error I get during the HC variant calling process.

    2) My whole-exome sample was generated using a capture kit based on the GRCh37 reference, which is different from the kits used for the 1000 Genomes samples. The capture kits used for the 1000 Genomes samples differ from one another, and all are based on the hg19 reference.

    In this case, do I have to use the appropriate interval BED file for each individual BAM file when calling variants with HC in -ERC GVCF mode to generate .g.vcf files?

    3) My understanding is that the hg19 and GRCh37 references have the same coordinates and only the naming format differs, so I can simply convert, e.g., chr1 to 1, and so on.

    4) Since the exome interval BED files for the 1000 Genomes samples are based on hg19, do I have to start from the FASTQ files and align them to hg19/GRCh37 in order to process them, instead of using the BAMs aligned to GRCh38?

    Thank you very much in advance.
    Best Regards
    Sumudu

  • bhanuGandham, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    Hi @Sumudu

    1) When using an exome interval BED file, does it have to be based on the same reference that was used to generate the aligned BAMs?

    Yes.

    2) Do I have to use the appropriate interval BED file for each individual BAM file when calling variants with HC in -ERC GVCF mode to generate .g.vcf files?

    Yes.

    3) My understanding is that the hg19 and GRCh37 references have the same coordinates and only the naming format differs, so I can simply convert, e.g., chr1 to 1, and so on.

    I believe the genomic content for the two is identical, except for the mitochondrial contig.
    The contig names are also different. hg19 names them chr1, chr2, chr3, etc., while GRCh37 just has 1, 2, 3.
    Thus you can use the same GTF or BED file for both (excluding mitochondrial, of course) if you do a simple replace operation for the contig names.

    4) Since the exome interval BED files for the 1000 Genomes samples are based on hg19, do I have to start from the FASTQ files and align them to hg19/GRCh37 in order to process them, instead of using the BAMs aligned to GRCh38?

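    Yes, either that, or you can lift the data over between builds: the LiftoverVcf tool handles VCFs, and Picard's LiftOverIntervalList handles interval lists.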
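
The "simple replace operation" from point 3 could look roughly like the Python sketch below, which rewrites chr-prefixed contig names (chr1) to unprefixed ones (1) in BED lines. It is a hedged illustration: the function name `strip_chr_prefix` is hypothetical, and restricting the conversion to chromosomes 1-22, X and Y is an assumption made here, because the mitochondrial contig differs between the two references and other contigs cannot be renamed by simply dropping a prefix.

```python
from typing import Iterable, Iterator

# Assumption: only the primary chromosomes are safe to rename by prefix removal.
PRIMARY_CONTIGS = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY"}


def strip_chr_prefix(bed_lines: Iterable[str]) -> Iterator[str]:
    """Yield BED lines with 'chrN' contig names rewritten to 'N'.

    Hypothetical helper for illustration. Lines on any contig outside
    PRIMARY_CONTIGS (including chrM) are dropped rather than renamed.
    """
    for line in bed_lines:
        # Pass blank lines and BED header lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line
            continue
        contig, sep, rest = line.partition("\t")
        if contig not in PRIMARY_CONTIGS:
            continue
        yield contig[len("chr"):] + sep + rest


example = ["chr1\t100\t200\n", "chrM\t0\t50\n", "chrX\t10\t20\n"]
print(list(strip_chr_prefix(example)))
# ['1\t100\t200\n', 'X\t10\t20\n']
```

After renaming, it is still worth checking that every contig in the interval file appears in the sequence dictionary of the reference the BAMs were aligned to.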

  • Sumudu, Sri Lanka, Member

    Hi @bhanuGandham ,

    Thank you very much for the reply.

    Just one more thing: some of the 1000G BAMs I downloaded, which represent my population of interest, were sequenced at the Broad Institute with the Agilent SureSelect_All_Exome_v2 capture kit, which I could not find on the Agilent site when I searched. In that case, is it acceptable to use the v4 or v5 version, which is available on their site?

    My plan is to build an intersection BED file from the different capture kits used for my samples and use that intersection BED as the interval file in the BQSR and variant calling steps. I hope this approach is correct.

    Thank you once again.
    Best Regards
    Sumudu

  • bhanuGandham, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    Hi @Sumudu

    I don't see why the intersection bed file won't work.
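
For readers who want to see what such an intersection involves, here is a minimal Python sketch that intersects two capture-kit target lists. It is not a GATK tool: the function name `intersect` and the example kits are hypothetical, coordinates are assumed to be BED-style 0-based half-open, both lists are assumed to be on the same reference build, and the two-pointer sweep assumes the intervals within each list do not overlap one another.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (contig, start, end) in BED-style 0-based, half-open coordinates (an assumption).
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval lists.

    Hypothetical helper for illustration. Assumes intervals within each
    list do not overlap each other (merge them first if they might).
    """
    by_contig_a: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    by_contig_b: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for contig, start, end in a:
        by_contig_a[contig].append((start, end))
    for contig, start, end in b:
        by_contig_b[contig].append((start, end))

    result: List[Interval] = []
    # Only contigs present in both lists can contribute to the intersection.
    for contig in sorted(by_contig_a.keys() & by_contig_b.keys()):
        xs = sorted(by_contig_a[contig])
        ys = sorted(by_contig_b[contig])
        i = j = 0
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                result.append((contig, start, end))
            # Advance whichever interval finishes first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return result


# Made-up targets for two kits.
kit_a = [("1", 100, 200), ("1", 300, 400), ("2", 0, 50)]
kit_b = [("1", 150, 350), ("3", 0, 10)]
print(intersect(kit_a, kit_b))
# [('1', 150, 200), ('1', 300, 350)]
```

With more than two kits, the same function can be applied repeatedly, for example with `functools.reduce(intersect, kits)`. Keep in mind that an intersection only shrinks with each kit added, so the resulting interval list covers only regions targeted by every kit.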
