Which variant filtering process is better for high-coverage data?

Hi all,

I'm currently analysing 4 high-coverage (>30X) non-human mammalian genomes.
I've run Picard/GATK duplicate marking, indel realignment, and mate-information fixing, then called variants with UnifiedGenotyper, followed by BaseRecalibrator.

Given that there is no available database of variants called on other relevant samples, I'm wondering which variant filtering method is best to use:
1) Variant quality score recalibration
2) VariantFiltration
3) Both.

Thank you very much!

Sagi

Answers

  • Also: can I use a combined SNP+indel VCF file as input to the VariantFiltration tool, or do I need separate input VCFs for SNPs and indels?

    Thanks again!

    Sagi

  • Geraldine_VdAuwera (Cambridge, MA · Member, Administrator, Broadie · admin)

    Hi Sagi,

    It's best to do variant quality score recalibration (VQSR) if you can, but that does require training sets for the model to use. If there are none for your species of interest you can make some yourself, but be aware that it takes a certain amount of experimentation. If that is not an option for you, then use VariantFiltration.

    Technically, you can use a single input VCF if you're using the same parameters to filter both types, but you may want to separate them in order to treat them differently.
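    For reference, a minimal VariantFiltration invocation might look something like this (GATK 2.x command-line syntax; the file names and the thresholds in the filter expression are placeholders to adapt to your own data, not official recommendations):

        # Mark (not remove) sites that fail the hard-filter expression.
        # reference.fasta, raw_calls.vcf, and the cutoffs are placeholders.
        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R reference.fasta \
            --variant raw_calls.vcf \
            --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
            --filterName "hard_filter" \
            -o filtered_calls.vcf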

  • sagipolani (Member)
    edited October 2012

    Hi Geraldine,

    Thanks for your reply.

    Since I don't have training sets, as far as I can see I have three options:

    1. Use my initial variant calls (post self-base-recalibration) as a training set, keeping only those that pass a certain quality-score cutoff.
    2. Use my initial variant calls (post self-base-recalibration) as a training set only after hard-filtering via VariantFiltration.
    3. Use an overlapping list of variants called by at least two different programs (e.g. GATK and Samtools) as a training set.

    For option (1), I'm afraid there won't be any truly unbiased recalibration, leaving me with a lot of FPs.
    For option (2), I'm afraid I will get a lot of FNs if the stringencies are too high. What are your recommendations in terms of what stringencies to use in this case?

    I would appreciate your response on this.

    Thanks!

    Sagi

    Post edited by Geraldine_VdAuwera
  • Geraldine_VdAuwera (Cambridge, MA · Member, Administrator, Broadie · admin)

    Hi Sagi,

    This is a fairly complicated topic. We're planning to write up some recommendations for doing this, but unfortunately we haven't had the time yet, and I can't promise it will be done soon. I'm afraid for now you'll need to experiment on your own. Try generating callsets with the different approaches, then look at a set of variant calls in a viewer like IGV to see what they look like. You can use VariantEval to compare results and evaluate what's called vs. what's missed in the various sets.
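    For instance, a minimal VariantEval comparison of two callsets might look something like this (GATK 2.x syntax; the file names are placeholders):

        # Evaluate one callset against another used as the comparison track;
        # the report stratifies what is shared vs. unique between them.
        java -jar GenomeAnalysisTK.jar \
            -T VariantEval \
            -R reference.fasta \
            --eval callset_A.vcf \
            --comp callset_B.vcf \
            -o eval_report.txt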

    Good luck!

  • OK Geraldine,

    Thank you for all the effort!

    One last thing though - following the above, I'm attempting to use a list of "de novo" variant calls for training in VQSR, as previously recommended by your team for non-human data, using the following arguments: --resource known=true,training=true,truth=true,prior=10.0.
    (http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration)

    Regarding this, I'm wondering why I should set truth and known to "true" rather than "false". Isn't that misleading?

    I'd appreciate your thoughts on this.

    Thank you very much!

    Sagi

  • Geraldine_VdAuwera (Cambridge, MA · Member, Administrator, Broadie · admin)

    Well, @rpoplin may correct me on this since he's the expert on this tool, but as far as I understand it the bootstrapping approach requires you to trick the recalibrator into thinking you have a set of variants it can believe in. So it is indeed misleading, but on purpose. That's why you should use very high-confidence calls for that set -- you need to be as close to the truth as possible.
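    To sketch what that bootstrap looks like in practice (GATK 2.x syntax; the file names, annotations, and prior here are placeholders drawn from this thread, not a validated recipe):

        # Train the model on a self-generated high-confidence callset,
        # deliberately declared as known/training/truth per the bootstrap trick.
        java -jar GenomeAnalysisTK.jar \
            -T VariantRecalibrator \
            -R reference.fasta \
            -input raw_calls.vcf \
            -resource:bootstrap,known=true,training=true,truth=true,prior=10.0 highconf_calls.vcf \
            -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
            -mode SNP \
            -recalFile bootstrap.recal \
            -tranchesFile bootstrap.tranches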

  • Hi Geraldine,

    Thanks again!

    Is there any chance I can get in touch with @rpoplin in order to address this issue in a more detailed manner?

    Thanks!

    Sagi

  • Geraldine_VdAuwera (Cambridge, MA · Member, Administrator, Broadie · admin)

    We prefer not to engage in direct private conversations because it's better if the solutions and advice are available to everyone, but I'll ask him to drop by this thread and comment on your question.

  • Thanks Geraldine,

    Absolutely understandable.

    I'll wait for @rpoplin's response

    Thank you very much!

    Sagi

  • rpoplin (Member ✭✭✭)

    Hi there,

    The idea behind this experiment is that if you are very conservative and only use the variants you have the highest confidence in, then those variants are most likely true and can be used to train the model to evaluate the variants you have less confidence in.

    I hope that helps,

  • Hi rpoplin!

    Thank you very much for your comment.

    I completely understand the idea, and thus I understand why you set training=true, but what I don't understand is why you set truth=true and known=true in the following arguments:
    --resource known=true,training=true,truth=true,prior=10.0

    What's the logic?

    Thanks!

    Sagi

  • rpoplin (Member ✭✭✭)

    Does this answer your question?

    Training sites: Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model.

    Truth sites: When deciding where to set the VQSLOD cutoff, sensitivity to these truth sites is used. Typically one might want to say, for example, "I dropped my threshold until I got back 99% of HapMap sites."

    Known sites: The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. The output metrics are stratified by known status in order to aid in comparisons with other call sets.
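    As a concrete illustration, the human Best Practices of this era split those roles across separate resources along these lines (the priors shown are illustrative; check the documentation for the current values):

        # HapMap: trusted enough to serve as both training and truth.
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf
        # Omni genotyping-array sites: used for training but not as truth.
        -resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf
        # dbSNP: only annotates known/novel status in the output metrics.
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf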

    Cheers,

  • Hi rpoplin.

    Thanks again for your comment.

    I understand the definitions, but here's my dilemma:
    I've sequenced 4 non-human genomes. No trustworthy database is currently available for them (e.g. HapMap, dbSNP, etc.).

    Hence, there is no trustworthy VCF file that I can designate as training, truth, or known.

    After reading through a LOT of your documentation, I understand that you recommend choosing the "best" available dataset I can generate as the input for all three arguments (training, truth, and known), and that in that case I should set all three arguments to "true". Understood.

    In that case, the issue now is how to generate such a dataset to serve as input for all three arguments.

    So one obvious method is to use a dataset of variants (both SNPs and indels) called and intersected across at least two different callers (I used GATK's UnifiedGenotyper and Samtools' mpileup, calling both SNPs and indels simultaneously) as the input file for VQSR. After filtering out all the VQSRTrancheBOTH99.90to100.00 and Qual<50 calls, I was eventually left with 93.8% of the calls originally identified via GATK's UnifiedGenotyper, i.e. this method filtered out ~6.2% of the original calls.
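    (One way to build such an intersection with GATK itself is SelectVariants with its --concordance option; a sketch, with placeholder file names:)

        # Keep only the GATK calls that also appear in the Samtools callset.
        java -jar GenomeAnalysisTK.jar \
            -T SelectVariants \
            -R reference.fasta \
            --variant gatk_calls.vcf \
            --concordance samtools_calls.vcf \
            -o intersected_calls.vcf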

    The second option is to use a hard-filtering approach, which can vary greatly according to the specific cutoffs used. There are numerous recommendations out there (including GATK's own recommendations, available via the Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices; or another one available via SeqAnswers: http://seqanswers.com/wiki/How-to/exome_analysis#Marking_PCR_duplicates), but the problems associated with them are:
    1) They mostly relate to exome data (whilst I'm using whole-genome data).
    2) Hard filtering usually means a lot of FNs...

    Now, in contrast to human data, there is commonly no available database that is trustworthy and well curated. Hence, a researcher in my shoes needs to be very cautious about potential FNs, since every variant counts. Nevertheless, too many variants means a lot of FPs. Basically, I'm very confused about which method will best suit my needs, given that I'm researching a Mendelian disorder.

    In light of all this, my questions are as follows:

    1) Is it reasonable to filter out the VQSRTrancheBOTH99.90to100.00 and Qual<50 annotations after implementing VQSR?
    2) Is it reasonable that after applying the VQSR and filtering out the above I am left with 93.8% of the original variants?
    3) For non-human WGS data, is it recommended to perform VQSR for SNPs and indels separately, or can I run VQSR on both SNPs and indels combined in one VCF file?
    4) What are the best recommended arguments to use via VariantFiltration for WGS of non-human genomes?
    5) For non-human WGS data, is it recommended to perform VariantFiltration for SNPs and indels separately, or can I run VariantFiltration on both SNPs and indels combined in one VCF file?

    If you have any additional recommendations, I would be most grateful for them!

    Thank you very much!

    Sagi

  • rpoplin (Member ✭✭✭)

    The beautiful thing about VQSR is that it annotates every variant with a VQSLOD score. It sounds like you are worried about the tradeoff between FNs and FPs in your study. Simply thresholding on this score puts all of the control firmly in your hands. If you are worried that keeping 93.8% of the original calls is too permissive, you can lower that threshold and filter more strictly. Perhaps by stratifying your results by VQSLOD score you can decide on the best cutoff for your data. Maybe try making a plot with the VQSLOD filter threshold on the x-axis and some interesting property of your data on the y-axis.

    By using the 99.9% tranche as you have done, you are telling the VQSR to choose the VQSLOD threshold such that you keep 99.9% of the variants in your truth file. Probably the right thing is to be more conservative there, since it sounds like your confidence in the truth set is low.
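    For example, applying a more conservative tranche might look like this (GATK 2.x syntax; the file names are placeholders and 99.0 is just an illustrative sensitivity level):

        # Filter the raw calls at a stricter truth-sensitivity tranche.
        java -jar GenomeAnalysisTK.jar \
            -T ApplyRecalibration \
            -R reference.fasta \
            -input raw_calls.vcf \
            -recalFile bootstrap.recal \
            -tranchesFile bootstrap.tranches \
            --ts_filter_level 99.0 \
            -mode SNP \
            -o recalibrated_calls.vcf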

    I am sympathetic to the position you are in but the truth of the matter is that no one in our group has used the GATK with non-human data so we have no recommendations beyond what is already in the documentation. Answering any of your five questions (or using the GATK at all with your data for that matter) would require that you do careful experimentation and examination of the results.

    Good luck!
