Most Variants Called Have Quality by Depth (QD) <2.0 in VCF Files

ajliv — Liverpool, Member

Hello all!

So I am forced to do hard-filtering on my VCF files. Looking at them before filtering, I see that ~99% of my variants have a QD <2.0. The distribution plots I made in ggplot do not follow the pattern shown in http://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations. I have 24 samples, and they are not at all similar in their distributions.
The other annotations, FS and MQ, are all within the recommendations. I wasn't sure where to find the values for ReadPosRankSum and MQRankSum, so I couldn't plot those.

I have used a ploidy of 20, and I'm looking at a population of bacteria. Does anyone know why the QD is so low?

I'm going to lower the recommended QD filtering cutoff to 0.8, thereby keeping the top ~10% of variants by QD. Does that seem sensible?
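A quick sketch of how such a percentile-based cutoff could be derived from the observed QD distribution (toy QD values; the helper below is hypothetical, not a GATK tool):

```python
# Hypothetical sketch: choose a QD cutoff that keeps the top ~10% of
# variants by QD, instead of the generic QD < 2.0 hard filter.
# Assumes QD values have already been extracted from the VCF INFO field.

def qd_cutoff(qd_values, keep_fraction=0.10):
    """Return the QD value below which (1 - keep_fraction) of variants fall."""
    ranked = sorted(qd_values)
    index = int(len(ranked) * (1.0 - keep_fraction))
    return ranked[min(index, len(ranked) - 1)]

# Toy QD distribution skewed toward low values, as described above.
qds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.5]
cutoff = qd_cutoff(qds, keep_fraction=0.10)
kept = [q for q in qds if q >= cutoff]
```

The cutoff then depends on the observed distribution rather than on an absolute threshold tuned for diploid human data.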

Thanks,

A


Answers

  • Geraldine_VdAuwera — Cambridge, MA, Member, Administrator, Broadie admin
    If you're working with bacteria, it's not surprising that the different ploidy and very different population genetics would produce a different QD profile compared to human data.

    That being said, your QD values seem worryingly low to me, so I would recommend checking the QC metrics for your sequence data; but it might just be a side effect of high ploidy that I'm not familiar with.
  • apfuentes — Member

    Hi @Geraldine_VdAuwera
    I have a similar situation. I am studying 16 populations of a non-model organism and used ploidy 10 for variant calling (the actual ploidy is 100). The data come from a Pool-seq experiment in which 50 individuals per population were pooled and sequenced to 50x coverage.

    This is the QD plot for my data (16 pops):

    In the documentation that explains how the QUAL score is calculated (here) it is mentioned that the calculation of the probability that an allele has a different ancestor assumes ploidy=2.

    Would this explain such low values of QD for data with ploidy higher than 2?
    If so, is there any way to adjust this calculation to higher ploidies?

    I don't want to discard true variants because QD behaves unusually with high ploidies...

    Thanks for any help!

    Issue · Github #2602 — opened by Sheila, closed by vdauwera
  • apfuentes — Member

    Hi,
    I forgot to mention that my dataset is mainly composed of high-quality reads (Q-scores >20), as shown in the plots below (obtained with FastQC):

    and the other variant annotations are within the recommended values (for example, high mappability scores, no evidence of strand bias, etc.):

    Thanks for any advice.

  • Sheila — Broad Institute, Member, Broadie, Moderator admin

    @apfuentes
    Hi,

    Let me check with the team and get back to you.

    -Sheila

  • Sheila — Broad Institute, Member, Broadie, Moderator admin
    edited October 2017

    @apfuentes
    Hi,

    I have heard back from the developer :smile: He says:

    The first thing to try is re-running with -newQual. The old qual may not handle more than two alleles or any ploidy greater than 2 very well. The new qual handles any ploidy naturally. It is possible that the QD will still be low, but at least we will have ruled out funky behavior of the qual.

    -Sheila

  • apfuentes — Member

    @Sheila
    Thanks very much for asking my question to the developers.

    I did indeed use -newQual for all my runs, in both HaplotypeCaller and GenotypeGVCFs... Is there something else that could explain such low values when the sequence data and mapping seem good?

  • Sheila — Broad Institute, Member, Broadie, Moderator admin

    @apfuentes
    Hi,

    Okay, I think we can rule out QUAL calculation issues.

    Have you done any QC on your data? Some things you can try are:

    1) collect QC data (coverage, contamination, chimerism, sequencing artifact metrics)
    2) look at the data (metrics, sequence data and variants)
    3) look at the code/pipeline

    -Sheila

  • apfuentes — Member

    @Sheila
    Yes, I have done QC of both the raw sequence data and the read mapping. I have FastQC summary statistics on the raw and cleaned sequence data (I trimmed low-quality bases and Illumina adapters with Trimmomatic), as well as read-mapping summary statistics obtained with Qualimap. Would you prefer I show this data here or send it through other means?

  • apfuentes — Member

    Hi @Sheila,
    Please find attached the summary statistics of sequence quality (FastQC plots) and mapping quality (Qualimap) for the 16 pools. In case the interactive HTML report generated by MultiQC does not work, I have also sent static versions of these plots.

    Just to clarify, one pool corresponds to whole-genome resequencing data of a mixture of equal amounts of DNA from 50 individuals of a population, sequenced to high depth (20-50x in my case). Therefore, there is one BAM file per pool. In theory the ploidy per pool would be 100, but due to the memory limitations of GATK 3.x when handling high ploidies, I decided to use ploidy 10 (to increase sensitivity for low-frequency variants and alternative alleles in a pool).
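    The memory cost of high ploidy grows combinatorially: the number of possible genotypes for ploidy P and A alleles is the multiset coefficient C(P + A - 1, A - 1). A small sketch (the formula is standard combinatorics; the connection to GATK's internal limits, such as the --max_genotype_count value of 286 used in the script further down, is my own reading):

```python
# Sketch: why high ploidy is memory-hungry for genotype likelihoods.
# Number of possible genotypes for ploidy P and A alleles:
#   C(P + A - 1, A - 1)   (multiset coefficient)
from math import comb

def n_genotypes(ploidy, n_alleles):
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

# diploid, biallelic:            3 genotypes
# ploidy 10, up to 4 alleles:  286 genotypes
# ploidy 100, 4 alleles:    176851 genotypes
```

    So moving from ploidy 10 to the true pool ploidy of 100 multiplies the genotype space by several hundred, which is consistent with the choice of a reduced ploidy above.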

    Regarding the points suggested:

    1) collect QC data (coverage, contamination, chimerism, sequencing artifact metrics):
    Average depth of coverage varied between the two sequencing batches: the pools in the first batch (6) had an average depth of 15-20x, whereas the pools in the second batch (10) had 44-50x. The PCR duplicate level also varied: about 50% in the first batch and 20% in the second. To prevent PCR duplicates from being considered during SNP calling, I marked duplicates in the BAM files with Picard tools before calling SNPs with GATK.

    2) look at the data (metrics, sequence data and variants):
    After SNP calling with GATK 3.8 (nightly build Sept 12 2017), I extracted the variant annotations from the VCF of raw variants and created the density plots posted in this forum on Oct 17.

    3) look at the code/pipeline:
    Attached there is a diagram of the steps I followed for SNP calling with the GATK.

    Here are some examples of the scripts I used (I scattered by scaffold).

    To obtain the individual gVCF for each pool with HaplotypeCaller:

    # GATK 3.8 (nightly build Sep 12 2017 with fix for hanging CentOS and logger)
    
    java -Djava.io.tmpdir=/ltmp/ -XX:ParallelGCThreads=1 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=4194304 -Xmx8g -jar /path/GATK/2017_09_12/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /path/genome.fasta \
    -I /path/BAM/pop1.PoolSeq.sorted.MarkDup.RG.bam \
    -L Super_Scaffold0 \
    -l DEBUG \
    -ERC GVCF \
    -ploidy 10 \
    --max_genotype_count 286 \
    -newQual \
    -mbq 20 \
    -minPruning 5 \
    --read_filter OverclippedRead \
    -o /path/gVCFs/pop1.Super_Scaffold0.raw.g.vcf
    

    And to perform joint SNP calling on the 16 gVCFs with GenotypeGVCFs, I used:

    java -Djava.io.tmpdir=/ltmp/ -XX:ParallelGCThreads=1 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=419304 -Xmx32g -jar /path/GATK/2017_09_12/GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R /path/genome.fasta \
    -V /path/GATK/gVCFs/pop1.raw.g.vcf \
    -V /path/GATK/gVCFs/pop2.raw.g.vcf \
    -V /path/GATK/gVCFs/pop3.raw.g.vcf \
    -V /path/GATK/gVCFs/pop4.raw.g.vcf \
    -V /path/GATK/gVCFs/pop5.raw.g.vcf \
    -V /path/GATK/gVCFs/pop6.raw.g.vcf \
    -V /path/GATK/gVCFs/pop7.raw.g.vcf \
    -V /path/GATK/gVCFs/pop8.raw.g.vcf \
    -V /path/GATK/gVCFs/pop9.raw.g.vcf \
    -V /path/GATK/gVCFs/pop10.raw.g.vcf \
    -V /path/GATK/gVCFs/pop11.raw.g.vcf \
    -V /path/GATK/gVCFs/pop12.raw.g.vcf \
    -V /path/GATK/gVCFs/pop13.raw.g.vcf \
    -V /path/GATK/gVCFs/pop14.raw.g.vcf \
    -V /path/GATK/gVCFs/pop15.raw.g.vcf \
    -V /path/GATK/gVCFs/pop16.raw.g.vcf \
    -l DEBUG \
    -L Super_Scaffold0 \
    -newQual \
    -o /path/GATK/VCFs/16.pops.raw.SNPs-indels.vcf
    

    Thanks for all your help!

  • Sheila — Broad Institute, Member, Broadie, Moderator admin

    @apfuentes
    Hi,

    Okay. Thanks for sharing. I will get back to you soon.

    -Sheila

  • Geraldine_VdAuwera — Cambridge, MA, Member, Administrator, Broadie admin

    @apfuentes Can you also post the QUAL and DP plots? And do you have any truth data that you could use to gauge the QD distribution in trusted sites?

  • apfuentes — Member

    Hi, sure. The requested plots are below.

    Density plot for DP (bottom image: zoom in):

    Density plot for QUAL (bottom image: zoom in):

    I have a set of ~100 SNPs that were validated using Sequenom SNP genotyping platform. Would that number of SNPs be enough to explore QD distribution in trusted sites?
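    One way such a comparison could look (toy QD values below; the real check would pull QD from the VCF at the validated positions):

```python
# Hypothetical sketch: compare QD at ~100 validated (trusted) sites
# against the full call set, to see whether trusted sites are also
# pushed below the generic QD < 2.0 cutoff. All values here are toy.
from statistics import median

all_qds = [0.3, 0.5, 0.6, 0.8, 0.9, 1.1, 1.4, 2.5]   # full call set
trusted_qds = [0.7, 0.9, 1.0, 1.2]                    # QD at validated sites

trusted_median = median(trusted_qds)
# If even trusted sites sit below 2.0, the low QD reflects depth and
# population structure rather than poor variant quality.
```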

    Thanks

  • Sheila — Broad Institute, Member, Broadie, Moderator admin

    @apfuentes
    Hi,

    Okay, great. So, I think those plots give us the answer to why the QD is low :smile: Notice that most of your QUALs range from 0-1000 and your depths range from 0-1250. Dividing the QUALs by the depths would indeed result in an average QD of less than 2.

    I suspect this is a "quirk" of having high depth. Our tools are tested on ~30X coverage data. Let me confirm with the team and get back to you with any other information.
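    As a rough sanity check of that arithmetic (toy numbers, not taken from the actual dataset):

```python
# QD is roughly QUAL divided by the combined depth of the variant
# samples, so high pooled depth drags QD down even when QUAL is high.

def approx_qd(qual, depth):
    return qual / depth

# A well-supported site: QUAL near the top of the observed 0-1000
# range, but 16 pools at ~50x each give a combined depth near 800.
qd = approx_qd(1000, 800)   # 1.25, below the generic QD < 2.0 cutoff
```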

    -Sheila

  • apfuentes — Member

    Thanks so much Sheila, that makes sense! Depth of coverage per sample is ~50x and I have 16 samples, so the DP in the INFO field should peak around 800-1000.

    These density plots correspond to raw variants. I have a question, should SNPs falling within repetitive regions be removed before or after obtaining the density plots and applying hard filters with the GATK?

    To exclude SNPs in repetitive regions,
    (1) should I follow the steps described here?
    (2) Or, should I remove them in a less strict way by filtering out SNPs with DP (FORMAT field) above mean coverage across samples + 2-3 SD?
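    Option (2) could be sketched like this (toy numbers; the helper is hypothetical, not a GATK tool):

```python
# Hypothetical sketch of option (2): flag SNPs whose site DP exceeds
# the mean coverage across samples plus k standard deviations, as a
# loose proxy for collapsed repeats. All values below are toy.

def dp_ceiling(mean_dp, sd_dp, k=3):
    return mean_dp + k * sd_dp

# e.g. 16 pools at ~50x each -> expected site DP around 800
ceiling = dp_ceiling(mean_dp=800, sd_dp=100, k=3)          # 1100
site_dps = {"snp1": 760, "snp2": 980, "snp3": 2400}
flagged = [s for s, d in site_dps.items() if d > ceiling]  # ["snp3"]
```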

    Thanks for your help.

  • Sheila — Broad Institute, Member, Broadie, Moderator admin

    @apfuentes
    Hi,

    I think it would be good to compare both approaches, as we don't have any specific recommendations for dealing with repeat regions. If you do find something interesting, please let us know here :smile:

    -Sheila

    P.S. For the actual procedure, it is fine to use option 1 and the suggestions from the other user.

  • apfuentes — Member

    @Sheila
    Hi,
    Thanks for your response!

    For sure, I will keep you posted.

    Just wondering, have you heard from the team any additional information on the low QD?
    Considering that the low QDs in my dataset are most likely the result of the high total coverage across samples and the high QUAL values (not of low variant quality), would it be OK not to apply this hard filter to the raw variant data? I would instead filter on MQ, Fisher strand, etc...

    Thanks :)

    Issue · Github #2657 — opened by shlee, closed by vdauwera
  • Geraldine_VdAuwera — Cambridge, MA, Member, Administrator, Broadie admin

    @apfuentes It sounds like the annotation is not informative in your case, so you would indeed be better off not using it. None of the annotations is required for filtering as such; the key guideline is really just to use whatever makes sense given the shape of your data.

  • apfuentes — Member

    @Geraldine_VdAuwera @Sheila
    Thanks for the clarification and for all your help! Much appreciated :)
