Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism

TristanTristan La Jolla, CAPosts: 12Member
edited December 2013 in Ask the GATK team

Hello there! Thanks as always for the lovely tools, I continue to live in them.

  • Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores. Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess..

The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?

pdf
pdf
ZF-vqsr-plots.pdf
4M
pdf
pdf
ZF-tranches.pdf
8K
pdf
pdf
ZF-vqslod-samples.pdf
33K
Post edited by Tristan on

Answers

  • pdexheimerpdexheimer Posts: 327Member, GSA Collaborator ✭✭✭

    The caveat here is that I've only ever run VQSR in human/mouse, so I suppose this is a semi-educated answer.

    The VQSR plots just look generally "funny" to me, but my hypothesis is that it's due to sequencing conditions more than anything. You've got a weird distribution of FS, with nearly all of the scores exactly 0 and then a small cloud from about 1 to 6. You've got a very strong mode of about 27 in your QD scores. For comparison, my (human exome) QDs are generally uniformish from about 2-35, and my FS are relatively all over the place (the majority between 0-10, but a significant number going up to 300 or so, especially at low QD). My theory for your FS scores is that you somehow only sequenced one strand, but I can't come up with even a guess for that QD spike.

    Beyond those funny distributions, it looks like your only strongly informative variable is QD. From these plots, my general sense is that your variants passed if they had an FS == 0 and QD > 25 (with a little wiggle in QD at low depths), and nothing else really contributed much information. This is definitely something you can control with your known set (and priors), but my feeling is that it's somehow tied to that QD==27 spike.

    The tranches plot is a mess, but that's not especially important since you don't have a target Ti/Tv. You could fix some of the graphical ickiness by ordering your tranches on the command line for VariantRecalibrator

    I agree with your interpretation of the VQSLOD vs nSamples plot, but I'm worried that all of the VQSLODs seem to be positive. My threshold for pass/fail varies from run to run, but it's generally between about -1 and 5 (at 99% sensitivity). I have thousands of negative scores, my lowest score is usually at least -500, sometimes lower. So if you're not getting any negatives, I would consider that odd as well.

    I'm not sure I gave you a really definitive answer here, but I guess I agree with your concern that all is not well. A couple of things to consider:

    1. Did you recalibrate base qualities with your "putative truth" set? I've never gone down that path and so don't know what effects it will have, but I know it's recommended
    2. I would never, ever run VQSR with only 6 humans. This is largely a matter of having enough variation in the cohort to get good overlap with the knowns, so you might be able to mitigate that problem with your self-generated knowns. On the other hand, you might not ever get good numbers with this small amount of variation, I just don't know
  • TristanTristan La Jolla, CAPosts: 12Member

    Thanks so much for the detailed response pdexheimer.

    I'm worried that all of the VQSLODs seem to be positive

    About 6% were less than 0

    Did you recalibrate base qualities with your "putative truth" set? I've never gone down that path and so don't know what effects it will have, but I know it's recommended

    Yes, perhaps that explains some of this?

    I would never, ever run VQSR with only 6 humans. This is largely a matter of having enough variation in the cohort to get good overlap with the knowns, so you might be able to mitigate that problem with your self-generated knowns. On the other hand, you might not ever get good numbers with this small amount of variation, I just don't know

    What would you do if you only had six humans?

    My gut reaction is to just take any variants with a VQSLOD over the median (2.53) for discerning the counts, and any of those that replicated across the samples for working with. Those super high scores still worry me a little though. Their scores are probably unrealistic, but my guess is that those are still variants that I want and taking the median lets me keep those and an appreciable number of the lower ones.

  • pdexheimerpdexheimer Posts: 327Member, GSA Collaborator ✭✭✭

    6% sounds about right for negative VQSLODs.

    If I only had 6 humans and couldn't bring in any compatible external samples, I'd probably gripe a lot before giving in and using a set of hard filters. I have run with low sample numbers before, but I won't push to less than about 12 (that was mouse exomes - very homogenous within a given strain, but I had several strains present and so still got reasonable overall variation).

    You might get lucky and find compatible public data that you can use to increase your numbers, but it would need to be at least similar sequencing conditions (I'd mix read lengths, but not Illumina and Solid) and similar library prep - in particular, you mention special digestion conditions.

    So given all of that, I'd probably spend some time with hard filters. Your VQSR plots will give you some guidance for selecting conditions and values (btw, I like MQ as well), but there's really nothing for it but trying different values and evaluating the results. It's not an easy process, not least because there's no real good evaluation process.

    Wait a second - you said three replicates of each. When I first read that, I thought you meant biological replicates (i.e., three different fish). Did you mean technical replicates (the same individual fish prepped and sequenced three times)? If that's the case, you only really have two samples and there's really nothing you can do with VQSR.

  • TristanTristan La Jolla, CAPosts: 12Member

    I'll call them pseudo-technical replicates, basically technical replicates of pooled samples in the sense that three sets of preps were done from a given sample, but that sample was an extraction from zebrafish eggs so one can assume that at least a couple individuals are in each. Further and further removed from the standard GATK usage but I'd argue that GATK still has the best tools for the task though I might need some more modification.

    Perhaps goofing with the ploidy settings would be appropriate?

  • pdexheimerpdexheimer Posts: 327Member, GSA Collaborator ✭✭✭

    Ploidy is exactly what I was going to suggest. The fun part will be determining the correct values - perhaps you could plot the AB values for each sample and look for modes?

  • TristanTristan La Jolla, CAPosts: 12Member

    Great, let's dig into that. What do you mean by AB values?

    This must be close to what they had to go through with sequencing tumor/normal pairs for cancer samples. I couldn't get mutect to work on these but maybe there is a way to simulate the process with the more standard tools...

    Is it just me or did my posted files disappear?

  • pdexheimerpdexheimer Posts: 327Member, GSA Collaborator ✭✭✭

    AB = Allele Balance. You can add it with VariantAnnotator or just calculate it from the AD values, depending on which is easier for you during plotting.

    Not sure I see the cancer parallel. You're talking about sequencing multiple individuals (Aside: Is it possible that you have unfertilized eggs = haploid individuals here?), whereas cancer is more concerned with heterogenous tissues. Maybe it's just a matter of degree, but my impression is that when the tissue samples get so heterogenous to look like three or four individuals, they're not real useful in cancer analysis. Dunno, I certainly could be wrong

    I still see the posted files

Sign In or Register to comment.