Unified Genotyper Results - dnSNPs

I've spent some time looking at the various documentation and have a few lingering questions for understanding my data a little better. I was hoping you could help to clarify a bit and make sure I am making correct assumptions.

I have a set of 96 samples that we ran targeted sequencing on. I have followed the best practices for bam file processing, etc, and ran Unified Genotyper on all 96 bam files at once using -EMIT ALL SITES. I have snp chip data on all of my samples as well so I have some "truth" to compare to. It looks like around 25% of the SNPs I expected to get (are present in dbSNP and I have chip data for them) are not present in my vcf file. Where did they go - is it normal to not get all of the dbSNP sites back? Is this because the variant site did not have good enough quality of bases to make it a variant? Does Unified Genotyper not emit sites for which all samples are homozygous reference even though it knows it's there because of the dbSNP file (like CASAVA)? I looked at the coverage at these positions and in some cases it was very high (over 100X) so I'm not sure that they could all be bad. I am somewhat new to this sort of data analysis - what are the best tools to use to look at the quality of the bases at a single site to see if this is the culprit? Any other ideas as to what could be causing this?

What are the internal things in Unified Genotyper that make a site not be called?

Would I be able to use the GENOTYPE GIVEN ALLELES option to make Unified Genotyper call all dbSNPs regardless of base quality or Phred Score? If so, do I use the same reference db snp .vcf file that I use for the rest of the steps?

For the variants I do have: If I am interested in the coverage at a site, the AD is the unfiltered depth and the DP is the filtered depth (the sum of the allele depths is the total depth), correct?

FInally, I have not run VSQR on my output. We did targeted sequencing and I'm not sure if the number of bases we did is enough got VSQR to be helpful. What is the minimum number of bases that VSQR is recommended with? I have 96 samples and I know that is enough. We are mostly interested in dbSNP values anyhow. In your opinion, is it reasonable to stop after Unified Genotyper since we are only interested in dbSNP calls or should we do the VSQR step anyhow?

Thank you for all your help and insights. I really appreciate the feedback allowed for in this forum.

Lisa

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Lisa,

    For a general rundown of the reasons why the UG might not call a variant, see this article:
    http://www.broadinstitute.org/gatk/guide/article?id=1235

    A few things to keep in mind:
    - there's a lot of junk calls in dbsnp - very high coverage in some regions can be a bad thing because it might indicate unspecific mapping and therefore untrustworthy calls - because AD is unfiltered, sometimes the sum of ADs is bigger than DP (which is filtered) - sometimes it helps to eyeball the data, i.e. look at the pileup of reads at a site using a genome browser like IGV -- for example if you think you are missing some calls, look up the position in IGV and see whether your data supports a variant call or not.

    Yes, you can use GENOTYPE GIVEN ALLELES to get the UG to make a call for each site you put in (eg the list of sites on your chip).

    What you do with the variant calls depends entirely on your experimental design. Just keep in mind that the UG is very aggressive in making calls, and you should do some kind of evaluation to determine which calls are real vs. artefacts. That can be VQSR or hard filtering. But you can't just output calls and consider the results final.

    Good luck!

    Geraldine Van der Auwera, PhD

  • LisaLattaruloLisaLattarulo Posts: 9Member

    Thanks for the info. We are doing some evaluation of what comes out of UG (hard filtering).

    Will UG make a call at a variant site if all of the samples are homozygous reference?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    No, by definition if they are all hom-ref there is no variant to call. But you can get the UG to output hom-ref sites to the VCF by using EMIT ALL SITES.

    Geraldine Van der Auwera, PhD

  • LisaLattaruloLisaLattarulo Posts: 9Member

    I have a follow up question. I re-ran my analysis using the EMIT ALL SITES (what a text file that is - thank goodness it is targeted sequencing!!) and am basically getting out every location we sequenced. I had expected that in this file I would get out the dbSNP sites for which my 96 samples were all homozygous reference. It looks like I get the position out and the base call for all of my 96 samples (which matches the reference as expected with no alternate alleles), though i do not get an rsNumber at this position still - even though I went back and double checked that this position is annotated with an rs number in the dbSNP reference vcf. Is this the behavior that is typical? I had thought that UG will annotate all positions with the rsNumber if it was so labeled in the input files. Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Lisa,

    Yes, that's actually the expected behavior, because there's technically no dbsnp variant there. We could add the option to add the rsID but to be honest that's very low priority and we've got a lot going on right now. But if someone wants to implement this we'd be happy to look at a patch.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.