If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Variant discovery in aneuploid organisms
I am working with a set of samples at the moment which have the following problems:
1) The organism is aneuploid
2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling)
3) Coverage is thin (sequencing was done on a student's budget)
I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.
I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).
I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.
So my questions are:
1) Does this strategy makes sense? Is there any reason not to do it this way?
2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?
Any input from the GATK team or the wider community is appreciated!