Interpreting results of VQSR on a non-human species

sp580sp580 GermanyMember

Hello and happy new Year!

I have done VQSR on a non-human data set. Data corresponds to WGS (~20x) on 60 mice sequenced individually. Total number of raw SNPs (according to GATK best practices): 16,012,193

As a truth/training resource, I used the sites that PASSed generic hard filters and that are found in any of the two mouse genotyping arrays: GIGA-Mouse Universal Genotyping Array; Mouse Diversity Array. Total number of SNPs in resource: 313,776.

As a training-only resource I used variants reported by Sanger's Mouse Genomes Project found accross 36 strains and that PASSed their filters. There were 9,130,946 sites found in my raw SNPs.

Known sites correspond to dbSNP150.

This is a snipet of the the command line:

--resource TRUTH,known=false,training=true,truth=true,prior=12.0:RawSNPs_in_GIGA_or_MDA_OnlyPASS.vcf \
--resource sanger,known=false,training=true,truth=false,prior=10.0:mgp.v5.merged.snps_all.dbSNP142_PASS_final.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:mus_musculus.vcf \

After running VQSR (90% truth sensitivity), there are 8,070,948 SNPs that PASSed (~50% of the raw SNPs). Of which 8,060,873 are bi-allelic.

The tranche plot shows a Ti/Tv ratio of 1.7 and 1/5 of false positives at tranche 90. Also, Ti/Tv has a wide range across tranches (1.7 to 1.078). Overall, I think the tranche plot is telling me there is room for improvement.

However, the tranche plot referes to novel variants (not found in any of the resources, incl dbSNP), and the Ti/Tv ratio for variants found in dbSNP as reported by Picard's CollectVariantCallingMetrics corresponds to 2.15, which is much satisfactory and represents >95% of all bi-allelic SNPs.

| TOTAL_SNPS| NUM_IN_DB_SNP| NOVEL_SNPS| PCT_DBSNP| DBSNP_TITV| NOVEL_TITV|
|----------:|-------------:|----------:|---------:|----------:|----------:|
|    8060873|       7674154|     386719|  0.952025|   2.146346|   1.700175|

I would like to clarify the following:

1) Was the contruction of the truth set reasonable (i.e. enough number of sites)?

2) If novel variants are more likely to be false positives, how are false and true positives defined in the traches plot, which is constructed from novel variants only?

3) Should I deal with the low Ti/Tv ratio at tranche 90, considering that novel SNPs correspond to <5% of all PASSing SNPs after VQSR?

Your feedback will be greatly appreciated!

Answers

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Hi @sp580 ,

    Thanks for the question, I'm referring to some of my teammates on this and will get back to you.

  • sp580sp580 GermanyMember

    Thanks @bshifaw
    can you please tell me if I could expect some feedback this week?

    Thanks again!

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    Yes, a person from the dev team is looking into your three questions.

  • bshifawbshifaw moonMember, Broadie, Moderator admin

    @sp580 , I spoke with a member of the dev team and here are their responses.
    1) Your resources are probably adequate. The default maximum number of training variants for VQSR is 250K, so variants in excess of that cap will be ignored. The number of truth sites may be low, which would result in somewhat inaccurate tranches.

    2) All the SNPs that are found are considered "positives" because they were found by earlier stages of analysis. "True" vs "False" positives is simply referring to whether they pass the VQSR filter in a given tranche. Since true/false positive is not determined by directly comparing to variants in the truth set, it's possible to visualize true/false positive restricted to only novel variants.

    3) Non-human organisms isn't our expertise but a quick skim of this literature suggests that a ti/tv ratio of 1.7 is probably reasonable for mouse:
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5110614/

  • sp580sp580 GermanyMember

    Thanks @bshifaw

    The default maximum number of training variants for VQSR is 250K, so variants in excess of that cap will be ignored.

    Two questions to that: (1) This is refering, in my case, to the Mouse Genomes Project VCF I used (9,130,946 SNPs), correct? (because the truth set also was used for training (313,776 SNPs)). And (2), is there a document describing how does the excess of variants is handled (i.e. random draws, the ones with better annotations or combinations thereof)?

    The number of truth sites may be low, which would result in somewhat inaccurate tranches.

    OK, so the training set I used had excess of variants (only 250K of them are used for VQSR) but the number of variants in my Truth resource may be low (313,776 SNPs). Did I get it right?. If so, what would be a better number (i.e. number of sites in resource "hapmap_3.3.b37.sites.vcf")?

    Regarding point (3), I was aware of that publication, sorry I forgot to mention it. The problem there is that it refers to the TiTv ratio for a subset of variants (strain specific). However, I made a post to inquire about that specific question and you can find it here in case any one else in the GATK community can use what I gathered or contribute to the thread.

  • bshifawbshifaw moonMember, Broadie, Moderator admin
    edited January 16

    Hi @sp580 ,

    The following is the dev team response:
    Apologize for misreading: the cap on training variants is 2.5 million, not 250K. What happens if you exceed the cap is mentioned in the code, the variants will be randomly selected from the available training data. This is probably superior to any deliberate selection because it will produce a representative mix of annotations.

    Broad's current best-practices pipeline for finding SNPs in humans is exemplified here.
    It uses hapmap_3.3.hg38.vcf.gz and 1000G_omni2.5.hg38.vcf.gz which together have 6.6 million truth sites. Unfortunately, there may not be any hard rule about how many sites are needed or how quality scales with number of sites. It's something that probably could be determined empirically: if adding or removing good SNP resources doesn't substantially alter your results, you probably have enough. However, practically people just use the highest quality resources available (and try to estimate the prior for SNP accuracy correctly). Due to the obvious priority on human health, there are a lot of comprehensive high-quality collections of SNPs.

    Again, Non-human organisms isn't our expertise, hopefully you'll receive a response from the posted question. We do have a community led forum for non-human organism experiments called Zoo & Garden if you are interested posting there as well.

    Edit: Changed FC workspace link to gatk-workflows/gatk4-germline-snps-indels git repo.

    Post edited by bshifaw on
  • sp580sp580 GermanyMember

    Hello @bshifaw
    thanks for the reply.

    Broad's current best-practices pipeline for finding SNPs in humans is https://portal.firecloud.org/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38
    It uses hapmap_3.3.hg38.vcf.gz and 1000G_omni2.5.hg38.vcf.gz which together have 6.6 million truth sites.

    I couldn't find those details in the link. However, if you are refering to this , then only hapmap_3.3.hg38.vcf.gz is used as truth set and 1000G_omni2.5.hg38.vcf.gz is used as trainig set:

       --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
       --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
    
  • bshifawbshifaw moonMember, Broadie, Moderator admin

    That's correct. I edited the link in the previous post to point to a public git repo so you should be able to use the link now but it shows the same info you posted.

Sign In or Register to comment.