Discrepancies between HC and UG during Known Variant Bootstrapping

Hello,

I have asked a question about this project before, but the problem in this post is harder to pin down than the one in my first question.

I am using GATK (GenomeAnalysisTK-3.3-0) to compare levels of heterozygosity between three separate diploid lizards.

I have around 18x coverage per animal, though the data are combined from three separate sequencing runs.

One of the three samples is suspected to be highly homozygous, while the other two should have much higher levels of heterozygosity.

The DNA from the homozygous animal was used to generate the genome assembly being used as the reference genome for all three animals.

To assess levels of heterozygosity, I planned to use the bootstrapping method with BQSR and HaplotypeCaller to train a set of known variants, then perform the final round of recalibration with that set and call variants in GVCF mode for each sample. I would then perform joint genotyping and only look at sites with sufficient data to make calls in all three samples (i.e. no "./." genotypes). This way there would be no bias toward SNP detection in the animal used to make the reference.

I ran into my first difficulty when training for known variants with HaplotypeCaller, and I mention it here in case it provides insight to a more experienced GATK user. After my second round of BQSR I was no longer detecting SNPs for either of my heterozygous animals, and very few for the homozygous animal. I have attached my training commands to this post.

If I examine the total number of passing SNPs per round:

all_SNPs_0.recode.vcf: 6694578
all_SNPs_1.recode.vcf: 10660
all_SNPs_2.recode.vcf: 107
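For context, these counts come from tallying passing SNP records in each filtered VCF. A minimal stdlib-only sketch of that count (the parsing logic and demo records are illustrative, not the actual pipeline):

```python
# Count biallelic SNP records that passed filters in a VCF.
# Stdlib-only sketch; the demo records below are toy data.

def count_passing_snps(vcf_lines):
    """Count records with FILTER == PASS (or '.') whose REF and ALT are single bases."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue
        ref, alt, filt = fields[3], fields[4], fields[6]
        if filt not in ("PASS", "."):
            continue  # record failed a hard filter
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            n += 1  # simple biallelic SNP
    return n

# toy demonstration: one passing SNP, one filtered record
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "scaffold_1\t100\t.\tA\tG\t53.2\tPASS\t.",
    "scaffold_1\t250\t.\tC\tT\t11.0\tmy_snp_filter\t.",
]
print(count_passing_snps(demo))  # → 1
```

In practice the lines would come from iterating over the filtered VCF file itself rather than an inline list.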

Then the number of SNPs for each sample:

Atig001.filtSNP.vcf (heterozygous): 4130988
Atig001.bsqr_0.filtSNP.vcf: 0
Atig001.bsqr_1.filtSNP.vcf: 0

Atig003.filtSNP.vcf (heterozygous): 4959403
Atig003.bsqr_0.filtSNP.vcf: 3635
Atig003.bsqr_1.filtSNP.vcf: 0

A_tigris8450.filtSNP.vcf (homozygous; used for the reference genome): 272215
A_tigris8450.bsqr_0.filtSNP.vcf: 8822
A_tigris8450.bsqr_1.filtSNP.vcf: 125

I examined the recalibrated bam files and found that the average phred score had dropped below 25 for the two heterozygous animals across all sites. Instead of further troubleshooting HaplotypeCaller, I decided to try training with UnifiedGenotyper. I will attach the BQSR plot for Atig001.
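That average-phred check can be reproduced by decoding the QUAL column of the alignments (Phred+33 ASCII). A stdlib-only sketch, assuming SAM text such as the output of `samtools view` on a recalibrated bam (the demo record is toy data):

```python
# Mean base quality across reads, decoded from the QUAL column of SAM
# records (Phred+33 ASCII). Sketch only; real input would come from
# `samtools view recalibrated.bam`.

def mean_phred(sam_lines):
    total, bases = 0, 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[10] == "*":
            continue  # no quality string
        for ch in fields[10]:        # column 11 is QUAL
            total += ord(ch) - 33    # Phred+33 decoding
            bases += 1
    return total / bases if bases else 0.0

# toy record: quality characters 'I' (Q40) and '5' (Q20), mean Q30
demo = ["@HD\tVN:1.5",
        "read1\t0\tscaffold_1\t100\t60\t4M\t*\t0\t0\tACGT\tII55"]
print(mean_phred(demo))  # → 30.0
```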

Unsure of what to do next, I decided to try training for a set of known variants with UnifiedGenotyper. This seemed to work: I continued to detect SNPs after each round of recalibration. However, I misunderstood the documentation and expected the number of SNPs to converge, so I ended up running more recalibration steps than needed (my last question)...

Here are some of the numbers for UnifiedGenotyper training:

Atig001.filtSNP.vcf: 5546061
Atig001.bsqr_0.filtSNP.vcf: 950172
Atig001.bsqr_1.filtSNP.vcf: 523390
Atig001.bsqr_2.filtSNP.vcf: 574732
Atig001.bsqr_3.filtSNP.vcf: 588325
Atig001.bsqr_4.filtSNP.vcf: 590560
Atig001.bsqr_5.filtSNP.vcf: 590773

Atig003.filtSNP.vcf: 5757825
Atig003.bsqr_0.filtSNP.vcf: 1489475
Atig003.bsqr_1.filtSNP.vcf: 555349
Atig003.bsqr_2.filtSNP.vcf: 600669
Atig003.bsqr_3.filtSNP.vcf: 613971
Atig003.bsqr_4.filtSNP.vcf: 616467
Atig003.bsqr_5.filtSNP.vcf: 615607

A_tigris8450.filtSNP.vcf: 355944
A_tigris8450.bsqr_0.filtSNP.vcf: 161685
A_tigris8450.bsqr_1.filtSNP.vcf: 92969
A_tigris8450.bsqr_2.filtSNP.vcf: 100055
A_tigris8450.bsqr_3.filtSNP.vcf: 105300
A_tigris8450.bsqr_4.filtSNP.vcf: 110698
A_tigris8450.bsqr_5.filtSNP.vcf: 104465

Does anyone have ideas as to why HaplotypeCaller is giving such different results? I have proceeded to final variant calling using the set of known variants from UG training, but I have another question to post about that as well.

Answers

  • DuncanTormey (Kansas City; Member)

    Hi Geraldine,

    Thank you very much for the response. I'll stick to the problems we are having with HaplotypeCaller for now.

    Can you summarize the numbers of SNPs you get for each round before and after filtering?

    Total SNPs between all three animals

    Round | SNPs Before Filtering | SNPs After Filtering
    0     | 6763241               | 6694578
    1     | 11712                 | 10660
    2     | 125                   | 107
    3     | 0                     | 0

    Also, did you look at the distribution of values of the annotations at all?

    I am not sure what you mean by this. This is a de novo genome assembly, and I have not looked into any SNP annotation tools yet. I have a GFF3 file produced by the MAKER pipeline; I will look into what can be done with what I have.

    Can you provide a summary of the commands you ran?

    Sorry about that! I attached a more readable version of the commands. The basic outline of the training is:

    Animal 8450 Round 0 Variant Calling and applying filtering
    Animal 003 Round 0 Variant Calling and applying filtering
    Animal 001 Round 0 Variant Calling and applying filter

    Round 0 Combine Variants across animals
    Round 0 Filtering out variants that did not pass filters

    Animal 8450 Base Recalibration using Round 0 filtered Combined Variants
    Animal 003 Base Recalibration using Round 0 filtered Combined Variants
    Animal 001 Base Recalibration using Round 0 filtered Combined Variants

    Animal 8450 Round 1 Variant Calling and applying filtering
    Animal 003 Round 1 Variant Calling and applying filtering
    Animal 001 Round 1 Variant Calling and applying filtering

    Round 1 Combine Variants across animals
    Round 1 Filtering out variants that did not pass filters

    Animal 8450 Base Recalibration using Round 1 filtered Combined Variants
    Animal 003 Base Recalibration using Round 1 filtered Combined Variants
    Animal 001 Base Recalibration using Round 1 filtered Combined Variants
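The outline above can be sketched as GATK 3.x command lines. This is a hedged reconstruction, not the attached commands: the file names, filter expression, and output naming are hypothetical, while the walkers and flags (HaplotypeCaller, VariantFiltration, CombineVariants, SelectVariants with --excludeFiltered, BaseRecalibrator, PrintReads with -BQSR) are standard GATK 3 usage. The commands are built in Python but not executed:

```python
# One bootstrapping round as GATK 3.x command lines, built but not run.
# File names, the filter expression, and the reference name are hypothetical.

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]
REF = "assembly.fa"                      # hypothetical reference name
SAMPLES = ["A_tigris8450", "Atig003", "Atig001"]

def round_commands(rnd, samples=SAMPLES):
    cmds = []
    # 1. per-animal variant calling, then hard filtering
    for s in samples:
        cmds.append(GATK + ["-T", "HaplotypeCaller", "-R", REF,
                            "-I", f"{s}.bam", "-o", f"{s}.raw_{rnd}.vcf"])
        cmds.append(GATK + ["-T", "VariantFiltration", "-R", REF,
                            "-V", f"{s}.raw_{rnd}.vcf",
                            "--filterExpression", "QD < 2.0 || MQ < 40.0",
                            "--filterName", "hard_filter",
                            "-o", f"{s}.filt_{rnd}.vcf"])
    # 2. combine across animals, then drop records that failed the filter
    combine = GATK + ["-T", "CombineVariants", "-R", REF]
    for s in samples:
        combine += ["--variant", f"{s}.filt_{rnd}.vcf"]
    cmds.append(combine + ["-o", f"all_{rnd}.vcf"])
    cmds.append(GATK + ["-T", "SelectVariants", "-R", REF,
                        "-V", f"all_{rnd}.vcf", "--excludeFiltered",
                        "-o", f"all_{rnd}.pass.vcf"])
    # 3. per-animal recalibration against the passing combined set
    for s in samples:
        cmds.append(GATK + ["-T", "BaseRecalibrator", "-R", REF,
                            "-I", f"{s}.bam",
                            "-knownSites", f"all_{rnd}.pass.vcf",
                            "-o", f"{s}.recal_{rnd}.table"])
        cmds.append(GATK + ["-T", "PrintReads", "-R", REF,
                            "-I", f"{s}.bam",
                            "-BQSR", f"{s}.recal_{rnd}.table",
                            "-o", f"{s}.recal_{rnd}.bam"])
    return cmds

cmds = round_commands(0)
print(len(cmds))  # → 14 (3 animals x 4 per-animal steps + 2 combine steps)
```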

    Another aspect we noticed was the rapid phred-score drop-off in the recalibrated bam files. The phred scores in the preprocessed bam files all indicate that our reads are high quality (see attached QC plot Atig001_preprocessed_bam_per_base_quality.png).

    However, by the second round of recalibration the per-position quality distribution drops below 30 at every position (see attached QC plot Atig001_recalibrated_round_1_per_base_quality.png).

    I also attached the per-base quality plot from the first round of recalibration (Atig001_recalibrated_round_0_per_base_quality.png).

    This could explain why we are no longer getting variants called after the second round of recalibration, especially since we were setting -mbq to 30.

    I am currently re-running the whole pipeline with fewer scaffolds (I had around 3000 scaffolds in the raw assembly). I also trimmed the reads using Trimmomatic prior to starting the pipeline. I am hoping that eliminating some of the low-quality trailing bases, which show up in several of the raw fastq files prior to merging, might reduce the quality drop-off we are seeing during recalibration.

    If we still see the rapid drop off after a single round of recalibration, we are going to try and reduce the quality requirements for base calling (-mbq) from 30 to 20.

    I am very concerned about the low quality we are seeing after recalibration. In your opinion, should this be taken as a red flag that something is wrong with our input data? Or is this an indication that we need to fine tune more parameters to work with our data?

  • DuncanTormey (Kansas City; Member)

    Illumina HiSeq in rapid run mode: 150 bp paired-end reads.

    The sequence data come from three separate flowcells; the libraries for each animal were pooled and multiplexed across two lanes each time.
    *Total of 18 pairs of fastq files (3 animals, 2 lanes per flowcell, 3 flowcells).

    The total reads from each flowcell add up to a total estimated coverage of around 18x for each animal.
    *This was later confirmed in the alignments.

    There were problems with the first two flowcells due to overclustering, which led to less than the desired coverage. However, according to our sequencing team the data were still usable.

    GitHub issue filed by Sheila: Issue #288, closed by vdauwera.
  • DuncanTormey (Kansas City; Member)

    Also, the libraries were remade with the same stock DNA for all three animals for use in the third flowcell.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    This is definitely an odd case. Are you seeing any differences per read group at all?

    I'd be interested to see if re-running with fewer scaffolds helps. Not sure how I feel about the trimming -- normally we don't recommend trimming on quality, preferring instead to let the tools weigh the evidence accordingly. Not sure whether it might hurt or help in this case. (Trimming adaptors is of course necessary.)

    My earlier question about the distribution of annotations was related to the context annotations that get added by HaplotypeCaller during the calling stages of the bootstrapping (not any functional annotation). Evaluating how those annotations (depth of coverage etc) are distributed across all called SNPs can help determine whether the hard-filtering parameters are appropriate or not.
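Pulling those call-level annotations out of the INFO column for inspection is straightforward. A stdlib-only sketch that collects one annotation's values for summarizing or plotting (the records and values below are toy data):

```python
# Extract the values of one INFO annotation (e.g. QD, DP, MQ) across all
# records of a VCF, so the distribution can be inspected before choosing
# hard-filter thresholds. Stdlib-only sketch with toy records.

def info_values(vcf_lines, key):
    """Return float values of one INFO annotation across all records."""
    vals = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        for kv in fields[7].split(";"):   # column 8 is INFO
            if kv.startswith(key + "="):
                try:
                    vals.append(float(kv.split("=", 1)[1]))
                except ValueError:
                    pass  # non-numeric value; skip
    return vals

demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "scaffold_1\t100\t.\tA\tG\t60\tPASS\tDP=18;QD=25.1;MQ=60.0",
    "scaffold_1\t220\t.\tC\tT\t33\tPASS\tDP=17;QD=12.4;MQ=59.2",
]
qd = info_values(demo, "QD")
print(min(qd), max(qd))  # → 12.4 25.1
```

Feeding the collected values into a histogram then shows where the bulk of the calls sit relative to a proposed threshold.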

  • msr (Stowers Institute for Medical Research; Member)

    Hi Geraldine,

    I am also working on this project with Duncan. In the attached figures, I plotted the distributions of the QUAL, BaseQRankSum, DP, MQ, and QD annotations from the initial run of variant calling with HC. The distributions are similar for all three animals except in the QD plots. Do any of these look problematic to you?

    Thank you,
    Morgan

    GitHub issue filed by Sheila: Issue #321, closed by vdauwera.
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there, sorry for the late response (again) -- we've been pretty busy with a local workshop.

    It's hard to evaluate the DP plots since the scale is not showing much resolution on the low end. But for sure the QualByDepth (QD) distribution looks really bad for the 8450 animal.

  • msr (Stowers Institute for Medical Research; Member)

    Hi Geraldine,

    I attached plots with each sample on its own scale.

    To summarize, animal 8450 has far fewer variants called initially because it was used to make our reference genome. However, after one round of recalibration no variants are called in animal 001. After the second round of recalibration no variants are called in animal 003 either, and we are left with only a few variants (< 200) in animal 8450.

    Thank you for all your help!

    Best,
    Morgan

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ok, the QD plot makes more sense now -- normally you expect a bimodal QD distribution due to hom-vars and hets clustering differently, since on average you have about twice the evidence for hom-vars than hets. Since the homozygous animal was used to make the reference, it makes sense that pretty much all its variant calls should be hets.
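The twice-the-evidence point can be made concrete with toy numbers. The per-ALT-read QUAL contribution below is made up purely for illustration; only the ratio matters:

```python
# Toy arithmetic behind the bimodal QD expectation. Assume each
# ALT-supporting read adds a fixed amount to QUAL (3.0 here is a made-up
# illustration value); QD is roughly QUAL divided by depth at the site.

PER_READ_QUAL = 3.0   # hypothetical per-read contribution to QUAL
DEPTH = 20            # toy site depth

def toy_qd(alt_reads, depth=DEPTH):
    return (alt_reads * PER_READ_QUAL) / depth

het_qd = toy_qd(10)   # het: roughly half the reads support ALT
hom_qd = toy_qd(20)   # hom-var: essentially all reads support ALT
print(het_qd, hom_qd)  # → 1.5 3.0
```

So hets and hom-vars land in two clusters about a factor of two apart, which is why a nearly all-het sample (like the reference animal here) shows a single, shifted QD mode.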

    The DP is still hard to interpret because of the x axis scale not showing much resolution.

    After further thought I do think the -mbq requirement you set may be contributing to the problem. Did you try using the default as you mentioned you were thinking of doing?

  • DuncanTormey (Kansas City; Member)

    Hi Geraldine,

    I am currently running the data set through a third round of BQSR with HaplotypeCaller, using the default -mbq. It has finished for A_tig8450, and it looks like the before and after plots have converged. So far we have retained high-quality SNPs through each round of recalibration: about 5 million for each of the non-reference animals, and about 300,000 for the reference animal.

    As soon as the third round of recalibration finishes, I will use the resulting set of known variants to recalibrate the original processed bam files (deduped, merged, deduped, realigned). Then, I will perform final variant calling with HaplotypeCaller in GVCF mode and proceed to JointGenotyping.
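Those final steps can be sketched as GATK 3.x command lines, built but not executed. The file names are hypothetical; the -ERC GVCF and GenotypeGVCFs usage is standard GATK 3:

```python
# Final calling step: HaplotypeCaller in GVCF mode per animal, then joint
# genotyping with GenotypeGVCFs. GATK 3.x command lines built but not run;
# file names and the reference name are hypothetical.

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]
REF = "assembly.fa"
SAMPLES = ["A_tigris8450", "Atig003", "Atig001"]

def final_commands(samples=SAMPLES):
    cmds = []
    # per-animal calling in GVCF mode on the final recalibrated bams
    for s in samples:
        cmds.append(GATK + ["-T", "HaplotypeCaller", "-R", REF,
                            "-I", f"{s}.final_recal.bam",
                            "-ERC", "GVCF",
                            "-o", f"{s}.g.vcf"])
    # joint genotyping across all animals
    joint = GATK + ["-T", "GenotypeGVCFs", "-R", REF]
    for s in samples:
        joint += ["--variant", f"{s}.g.vcf"]
    cmds.append(joint + ["-o", "joint_genotyped.vcf"])
    return cmds

print(len(final_commands()))  # → 4 (3 per-animal calls + 1 joint step)
```

Sites where any animal lacks a confident call then show up as "./." in the joint VCF and can be excluded downstream.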

    Would you suggest keeping -mbq at the default for the final variant calling? Or bumping it back up to 30?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah, that's a good sign.

    Just keep the default value for mbq -- there is no reason to constrain it to be so high. The caller is aware of base qualities and integrates them into the model that weighs the evidence. Bases with lower quals will accordingly "count" less in the calculation. This is the same reason why quality trimming is unnecessary. Modern tools take the quality into account so there is no need for hard thresholds.

  • DuncanTormey (Kansas City; Member)

    Ah, I see; well, hopefully this works. I think we got the idea to use -mbq 30 from this post.

    Thank you so much for all of the help, it's really awesome that you guys have this forum set up!

    Duncan

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah, I might not have looked closely enough at the value of the argument when I answered that one. Sorry it led to some confusion and time sink for you.

    Glad you find the forum useful; it's gratifying for us to see people use our tools, so we want to do all we can to help them be successful :)
