QD Distribution for sequence capture data

Hi there,
I am working with sequence capture data from a non model organism (based on a de novo genome). Our goal is to get the site frequency spectrum for use in demographic inference, so the number of singleton mutations is important to us.

I am using the recommended hard filters, and am losing 50,000 variants due to the QD < 2.0 filter. I wanted to get your advice to see if that filter is appropriate for sequence capture data, as I know the normal DP filters are not appropriate with capture data. My QD distribution looks very different from the QD distribution shown here.

When I use a straight QUAL < 30 filter, I get many more singletons in my SFS, some of which are probably false positives, but I am not sure what proportion. (Figure shows use of QUAL filter in pink, and QD outlined in blue).

Do you have any recommendations for adjusting the QD filter for use with sequence capture data, or for QD distributions that look like mine?

Thanks so much for your help!

~ Annabel

Other info:
GATK version 3.7
Best Practices (though without VQSR, since I am working with a de novo genome from a non-model organism and don't have a good set of trusted SNPs)
Mean coverage: 25-35x
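
To make the comparison between the two filters concrete, here is a minimal Python sketch that checks which sites fail `QD < 2.0` but not `QUAL < 30`, and vice versa. The records, site names, and the helper `failing_sites` are hypothetical and only illustrate the logic; they are not the poster's data or a GATK API.

```python
def failing_sites(records, field, threshold):
    """Return the records whose `field` value is below `threshold`."""
    return [r for r in records if r[field] < threshold]


# Hypothetical toy records (assumed values, not real data).
records = [
    {"site": "scaffold1:100", "QUAL": 45.0, "QD": 1.5},
    {"site": "scaffold1:250", "QUAL": 320.0, "QD": 9.8},
    {"site": "scaffold2:40", "QUAL": 25.0, "QD": 2.4},
    {"site": "scaffold2:900", "QUAL": 60.0, "QD": 1.9},
]

fail_qd = {r["site"] for r in failing_sites(records, "QD", 2.0)}
fail_qual = {r["site"] for r in failing_sites(records, "QUAL", 30.0)}

# Sites removed by one filter but kept by the other.
only_qd = sorted(fail_qd - fail_qual)
only_qual = sorted(fail_qual - fail_qd)

print("Fail QD only:", only_qd)
print("Fail QUAL only:", only_qual)
```

Listing the sites that only one filter removes is one way to pick candidates for manual inspection before deciding which threshold to trust.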

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @annabelbeichman
    Hi Annabel,

    Unfortunately, we work mostly with human data, so we don't have much experience with data like yours.

    Are you saying you think the QD<2.0 filter is too stringent? How many variants do you have in your entire dataset and how many samples do you have?

    You may find the hard filtering tutorials in the GATK presentations section helpful as well.

    -Sheila

    P.S. Your QD graph does show that most of your variants are heterozygous (the large peak around 9-10). You don't seem to have many homozygous variants (should be another large peak around 20-30). Is that expected?

  • Hi Sheila,
    Thanks so much for your help!

    I am worried that the QD < 2.0 filter is too stringent for this data, as it cuts right around the mode of my variants (the tip of that large peak). My distribution looks very different from the one shown in the tutorial (I have one large peak around QD = 2-3, while your plot has the heterozygote peak around 9-10 and the homozygote peak around 20-30), so I am not sure whether this filter is making me lose real singletons.

    Overall, I end up with 23,000 SNPs passing filters from ~52Mb of callable sites from 22 individuals. Around 14,000 SNPs are lost due to using the QD < 2.0 filter as opposed to the QUAL < 30 filter. (Most of these SNPs are low-frequency, as you can see in the SFS I attached).

    Thanks!
    ~ Annabel
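
Since the goal of the thread is a site frequency spectrum in which singletons matter, a small sketch of how a folded SFS could be tabulated from per-site alternate allele counts may help. It assumes diploid samples with no missing genotypes (22 individuals, so 44 chromosomes); the function name `folded_sfs` and the toy counts are hypothetical.

```python
from collections import Counter


def folded_sfs(alt_counts, n_chromosomes):
    """Tabulate a folded SFS from alternate allele counts.

    Monomorphic sites (count 0 or n_chromosomes) are skipped.
    Entry k-1 of the result is the number of sites whose minor
    allele count is k.
    """
    sfs = Counter()
    for ac in alt_counts:
        if 0 < ac < n_chromosomes:
            minor = min(ac, n_chromosomes - ac)
            sfs[minor] += 1
    return [sfs[k] for k in range(1, n_chromosomes // 2 + 1)]


n = 2 * 22  # assumed: 22 diploid individuals, no missing data
toy_counts = [1, 1, 43, 2, 22, 0, 44, 5]  # hypothetical values
sfs = folded_sfs(toy_counts, n)

print("Singletons:", sfs[0])
print("Folded SFS:", sfs)
```

Running this once on the QD-filtered calls and once on the QUAL-filtered calls would show directly how much of the difference sits in the singleton class.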

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)
    edited October 2017

    @annabelbeichman
    Hi Annabel,

    Wow, so you started with ~52,000,000 variant sites before filtering and ended with 23,000 variant sites after filtering? If so, the filters are too stringent.

    The peak around low QD (2-3) suggests your data quality is bad. Did you run BQSR? Have you looked at the base qualities/mapping qualities around the variant sites? Do the other annotation plots look like the ones from our tutorial?

    Have you tried less stringent filters? What is your end goal?

    For QC, we have some basic recommendations:
    1) collect QC data (coverage, contamination, chimerism, sequencing artifact metrics)
    2) look at the data (metrics, sequence data and variants)
    3) look at the code/pipeline
    4) look at the data again

    I hope this helps.

    -Sheila

    P.S. I forgot to add, we don't recommend using QUAL for filtering because sites with more reads will get an inflated QUAL (due to more evidence for the variant). That is why we use the normalized QD.
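
The point about QUAL versus QD can be illustrated with a short sketch. It assumes the simplified definition of QD as QUAL divided by the total read depth across the samples carrying the variant; the function `qd` and the numbers are hypothetical, and GATK's actual annotation has more detail than this.

```python
def qd(qual, variant_sample_depths):
    """Simplified QD: QUAL divided by total depth in variant samples."""
    total_depth = sum(variant_sample_depths)
    if total_depth == 0:
        raise ValueError("no reads in variant-carrying samples")
    return qual / total_depth


# Hypothetical sites with the same per-read evidence but 10x different coverage.
low_cov = qd(qual=300.0, variant_sample_depths=[30])
high_cov = qd(qual=3000.0, variant_sample_depths=[300])

print(low_cov, high_cov)
```

Under these assumed numbers the two sites differ tenfold in QUAL, so a fixed QUAL cutoff treats them very differently, while their QD values are identical.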
