Interpreting CNNScoreVariants Scores

Hi,

I'm playing around with the new CNNScoreVariants module, and was wondering if you had any guidelines for choosing a scoring cutoff, or in general, how to interpret the scores added to the output vcf.

Thanks,
Will


Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @willhooper
    Hi Will,

    Let me ask the developer to get back to you.

    -Sheila

  • @Sheila

    Thanks Sheila, much appreciated.

    -Will

  • samwellsamwell Cambridge, MAMember, Broadie ✭✭
    Accepted Answer

    @willhooper
    Hi Will,
    One option is to use the tranche filtering tool, FilterVariantTranches. It takes SNP and INDEL truth VCFs of common sites and a list of tranche sensitivities, and filters the input VCF in a similar way to VQSR.

    In general, positive scores indicate that the model believes the variant is real, and negative scores indicate that the model thinks the variant is an artifact.

  • manolismanolis Member ✭✭

    Hi, I'm using the FilterVariantTranches tool with the three SNP and INDEL truth VCFs, as reported in the wdl.

    My vcf has 31127 variants.
    8160 have a negative value.
    558 marked as CNN_2D_SNP_Tranche_99.90_100.00 (all with a negative CNN_2D value)
    56 marked as CNN_2D_INDEL_Tranche_99.50_100.00 (all with a negative CNN_2D value)

    Personally, when I use the VQSR tool I usually check the "PASS" variants first, and only in some cases do I check the various "non-PASS" groups.

    How should I prioritize the variants here?

    First the positive values, then the negative ones, and finally the "CNN_2D_SNP/INDEL_Tranche..." groups?

    @samwell, in your previous comment (April 2018) you said "In general ...", but is there a firmer rule now?

    Would you also advise using something like the VariantFiltration tool with the option -G-filter 'GQ < 20.0', or is the GQ value already taken into account in the CNN pipeline?

    Many thanks!!!

  • yingchen69yingchen69 nanjingMember

    @samwell,

    I am a little confused. Variants labeled CNN_2D_SNP_Tranche_99.90_100.00 or CNN_2D_INDEL_Tranche_99.50_100.00 did not pass the filtering and should be removed from the final results, right?

    Thanks,

    Ying

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @manolis and @yingchen69

    I have asked @samwell to look into your questions. We will get back to you soon.

  • samwellsamwell Cambridge, MAMember, Broadie ✭✭

    @manolis and @yingchen69 The FilterVariantTranches tool is designed to do variant prioritization for you. The arguments --snp-tranches and --indel-tranches each take a list of sensitivities (specified as percentages). The output VCF of FilterVariantTranches labels each variant with the tranche it belongs to, or with a "." if it passes all the tranche filters.

    For example, if you specify --snp-tranches 99.0 99.5, the SNP variants in your output VCF will have filter values of "." for the highest-quality variants, CNN_2D_SNP_Tranche_99.00_99.50 for medium-quality variants, and CNN_2D_SNP_Tranche_99.50_100.00 for the lowest-quality variants according to the 2D model.

    While I mentioned that negative values indicate the model "thinks" the variant is more likely to be an artifact, because of the way the model was trained we do not expect the probabilities it emits to be well calibrated. So it is better to use the FilterVariantTranches tool to group your variants than to rely on the sign of the scores directly.

    We are working on a tutorial that will explain this step in detail, and we will update this thread when it is available.
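
As a minimal sketch of the labeling scheme described above (not GATK's own code), the following Python shows how a list of tranche sensitivities maps to the filter names quoted in this thread, and how one might count how many variants fall in each group. The function names `tranche_labels` and `tally_filters` are hypothetical helpers, the default `info_key="CNN_2D"` is taken from the filter names above, and the input is assumed to be uncompressed, tab-separated VCF text with FILTER in the seventh column.

```python
from collections import Counter


def tranche_labels(sensitivities, variant_type="SNP", info_key="CNN_2D"):
    """Build the filter names expected for a list of tranche sensitivities (percent).

    Hypothetical helper: it only mirrors the naming pattern quoted in the thread.
    """
    # Each tranche spans from one sensitivity to the next; the last ends at 100.
    bounds = sorted(sensitivities) + [100.0]
    return [
        f"{info_key}_{variant_type}_Tranche_{lo:.2f}_{hi:.2f}"
        for lo, hi in zip(bounds, bounds[1:])
    ]


def tally_filters(vcf_lines):
    """Count FILTER values (7th VCF column) over data lines, skipping headers."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


labels = tranche_labels([99.0, 99.5])
# labels == ['CNN_2D_SNP_Tranche_99.00_99.50', 'CNN_2D_SNP_Tranche_99.50_100.00']
```

Passing the lines of a filtered VCF to `tally_filters` gives a per-group count, which is one way to review the "." group first and then work outward through the tranches from the lowest sensitivity range to the highest.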

  • manolismanolis Member ✭✭

    Hi @samwell and @bhanuGandham,

    In this WDL pipeline, line 173 (the FilterVariantTranches command) includes an ${extra_args} option.

    What should I add there? I can't find what exactly to add, either in this file or in the others.

    Many thanks

  • manolismanolis Member ✭✭

    Hi again,

    can the CNN pipeline handle MNP variants?

    Many thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manolis

    "in this wdl pipeline in the line 173 (FilterVariantTranches command) you report an ${extra_args} option."

    These can be any extra arguments that the tool accepts. For example, I use it to pass the --invalidate-previous-filters argument when I am filtering a VCF that has already been filtered.

    "can the CNN pipeline handle MNP variants?"

    Not entirely. The CNN scores MNP variants at the site level, so we take the lowest score of all the variants at that site.
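
To illustrate the site-level rule just described, here is a minimal sketch that collapses several per-variant scores at the same position into one score by taking the minimum. It is not the tool's implementation: the function name `site_level_scores`, the `(chrom, pos, score)` tuple layout, and the example scores are all assumptions made for illustration.

```python
def site_level_scores(variant_scores):
    """Collapse per-variant scores to one score per site, keeping the lowest.

    variant_scores: iterable of (chrom, pos, score) tuples (hypothetical layout).
    Returns a dict mapping (chrom, pos) to the minimum score seen at that site.
    """
    site = {}
    for chrom, pos, score in variant_scores:
        key = (chrom, pos)
        # Keep the lowest score among all variants sharing this site.
        site[key] = min(score, site.get(key, score))
    return site


# Made-up scores: two variants share chr1:100, so the lower score is kept.
example = [("chr1", 100, 2.5), ("chr1", 100, -1.0), ("chr2", 5, 0.3)]
per_site = site_level_scores(example)
```

Taking the minimum is the conservative choice: a site is only rated as highly as its least convincing variant.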
