If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Discrepancies between HC and UG during Known Variant Bootstrapping
I have asked a question about this project before, but the problem in this post is less easy to define than the problem in my first question.
I am using GATK (GenomeAnalysisTK-3.3-0) to compare levels of heterozygosity between three separate diploid lizards.
I have around 18x coverage per animal, however that data is combined from three separate sequencing runs.
One of the three samples is suspected to be highly homozygous, while the other two should have much higher levels of heterozygosity.
The DNA from the homozygous animal was used to generate the genome assembly being used as the reference genome for all three animals.
To assess levels of heterozygosity, I had planned to use the bootstrapping method with BQSR and HaplotypeCaller to train for a set of known variants. Next perform the final round of recalibration with the set of known variants and variant calling in GVCF mode for each sample. Then I would be able to perform joint genotyping and only look at sites with sufficient data to make calls across all three samples (i.e. not "./." for all three samples). This way there would be no biased towards SNP detection for the animal used to make the reference.
I ran into my first difficulty when training for known variants with HaplotypeCaller, and I mention it here incase it may provide insight to a more experienced GATK user. After my second round of BSQR I was no longer detecting SNPs for either of my heterozygous animals, and very few for the homozygous animal. I have attached my training commands to this post.
If I examine the number total number of passing SNPs per each round:
Then look at the number of SNPs for each sample:
Atig001.filtSNP.vcf <- heterozygous
Atig003.filtSNP.vcf <- heterozygous
A_tigris8450.filtSNP.vcf <- homozygous animal, used for reference genome
I examined the recalibrated bam files and found that the average phred score had dropped below 25 for the two heterozygous animals across all sites. Instead of further trouble shooting the HaplotypeCaller I instead decided to try training with UnifiedGenotyper. I will attach the BSQR plot for Atig001.
Unsure of what to do next I decided to try and train for a set of known variants with UnifiedGenotyper. This seemed to work, I continued to detect SNPs after each round of recalibration. However I misunderstood the documentation, and expected the number of SNPs to converge, so I ended up running more recalibration steps than needed (my last question)...
Here are some of the numbers for UnifiedGenotyper training:
Anyone have any ideas as to why HaplotypeCaller is showing such different results? I have proceeded to final variant calling using my set of known variants from UG training, but I actually have another question to post on that as well.