Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

VQSR in non-human

Hi Team,

I have been running an exome/RNAseq analysis and wanted to clarify a few things.

1: I have 16 exomes, 8 from two conditions. I have individually HaplotypeCaller'd them (using -L with exome BED) so have a GVCF for each. I then 'joint genotyped' GVCFs for all, cond1 and cond2 samples. I believe this to be the new way of doing things, correct (?). I made a training set from my all.GVCF by running LD-pruning in Plink (indep-pairwise with r-squared 0.25). The idea was to remove variants in LD which is a known issue in cow which I work on. I know you do not support Plink, this is just for completeness. I take the resulting ~20,000 variants and use them as my training set for VQSR on cond1 and cond2.GVCFs, prior of Q12. I also use the available dbSNP flat file (~2.9m SNPs) as 'known', prior of Q6. Is this a correct approach? FWIW I use MDS-plots to determine if the LD-pruned SNPs can reassemble my pedigree and they do a great job of that. Anecdotal but I like it!

2: Following VQSR on my cond1, cond2.GVCF I got the attached tranche, r_script plots. I am happy enough, I know I don't have any FP as I have so many 'known' in my dbSNP file.The 99% gives ~6000 SNPs to investigate, and Ti/Tv seems high enough. R_script are a bit messy but cluster around 0 on MQRankSum /ReadPosRankSum which I have taken as a good sign from reading threads here.

3: When running ApplyRecalibrator I use TS 99; TS is a bit of an enigma in my set, seeing as I made my own training set I have no idea if they are real. Any ideas on how one might validate this without further genotype data (hard to get hold of...). My annotations have turned up some interesting stuff, so I am happy with the analysis. Would you say it is on the right track?

4: I mentioned RNAseq, so I have a same animals as exome is from. Can I use my training set on this data too? As it is time-point data with same animals, would you break them into 'replicates'? Is 16 samples too small of a set? I know there is not a lot of consensus but being able to compare to exome should help get 'better' SNPs in RNAseq, righhhttt??

Thanks for all your help and the continual updating of the package,

Bruce.

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @bruce01,

    Wow, you're really using all the cutting edge stuff, that's cool.

    This all looks good to me. Out of curiosity, I'd be interested to hear more about how you use the MDS-plots -- basically as a QC / sanity check, then?

    There's no magic number for the tranche to take of course, since as you know it depends so much on how reliable your training set is. The best validation is to have orthogonal data e.g. from genotyping chip. Failing that, using metrics/expectations based on literature is helpful for well-studied organisms; for others not so much. I'm not sure where cow falls on that spectrum. Another thing people often do is to take variant calling results from several different callers and take the consensus, but that method will amplify any biases or flaws that are shared by the different callers. I think people should be more skeptical of the consensus method than they tend to be. I'm sorry this is probably not helpful, but you're already doing all the right things as far as I can tell.

    Yes, I think it would be reasonable to use your DNA-based training set on the RNAseq samples. I'm not sure how to deal with the time-point aspect as we have not done any such work ourselves yet... I guess it depends what information you are trying to get out of the data. Probably treating them as replicates is the right thing to do. I'll ask @ami to comment on this since he is our in-house RNAseq expert now. Certainly using the DNA data to improve calls in RNAseq makes a lot of sense. In upcoming versions we are planning to add functionality to process DNA and RNAseq from the same individuals at the same time.

  • bruce01bruce01 Member ✭✭

    Hi Geraldine,

    yeah, the MDS plots are basically a way to check that the SNPs in the LD-pruned set are representative of the 'population', or my tiny part of it. The LD-pruning was to try and reduce down to '10s of thousands' of SNPs which I had read here was the approximate amount to use in a 'de-novo' training set. To be honest I am not entirely sure of how --cluster calculates IBS but it manages to call all 16 samples into the correct condition, and because all animals share 8 sires (i.e. each sample is paired with another for sireage) the method does really well in picking the correct sisters. Cows can have huge LD blocks so I was immediately worried by LD so two birds and all that.

    There exist some genotype data (50K chip) for my animals but hard to get (commercially sensitive...). Consensus method doesn't make sense to me, as you say inherent bias/error may still get called. GATK is the obvious standard, and save for using a HapMap set for training I think the approach used here is OK. I will see if publishers believe so also!

    Thanks for your help,

    Bruce.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for the details, Bruce, that makes sense. Good luck with your paper! :)

  • amiami Member

    For the RNA data, I don't think it relevant to use the exome data as training, as you suggested. The VQSR is learning the error mode (and not the labels of what is true or false) so I expect that since RNA and DNA probably have different error modes, we should not train one based on the other.

    I have to say that it is not based on any experience or tests, but based on the theory behind the VQSR. If you do choose to ignore my suggestion (and it is ok if you do), please let us know what was your conclusions.

    Good luck

  • bruce01bruce01 Member ✭✭
    edited June 2014

    Hi @ami, thanks for your input. As to the RNAseq data, should I then make a separate training set for VQSR? I have essentially 6 replicates per animal (two tissues at three timepoints) so there is pretty good concordance when I do the LD pruning between these, ie they all plot out at very close to same on MDS.

    Secondly, the exome set has a bed file to guide HaplotypeCaller, RNAseq does not. Would you recommend using the same bed file to give intervals to call variants at? Seeing as exome is expected to be quite close to transcriptome (indeed my exome is built off the GTF that was used to guide RNAseq alignment) should I use -L in the RNAseq context? Without there are ~10x more variants called in exome... RNAseq I have not done -L yet (put on this morning) but assume it will be similar.

    As this is a 'discovery' analysis I would like to do what is recommended by the experts, and very little info is available otherwise, so your thoughts are much appreciated.

    Bruce.

  • bruce01bruce01 Member ✭✭

    Thanks @Geraldine_VdAuwera, I was thinking similarly myself about the training set. I find it better to remove spurious variants earlier (for a conservative approach, probably have to do it with relaxed parameters too), my MDS still look good from exome-derived LD-pruned SNPs in RNAseq so think I will take exome as 'appropriate training' set and go from there. Will be in touch if I have more questions=D

    Bruce.

Sign In or Register to comment.