Variant discovery in aneuploid organisms

Hi all,

I am working with a set of samples at the moment which have the following problems:

1) The organism is aneuploid
2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling)
3) Coverage is thin (sequencing was done on a student's budget)

I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.

I initially tried calling variants for each sample separately, grouping chromosomes by ploidy (i.e., for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes, and so on). I also tried multi-sample variant calling across all the samples, setting the ploidy to 2 (the modal chromosome number is 2). Comparing these two analyses shows that each is missing good SNPs that are called by the other. The multi-sample analysis is missing or mis-calling SNPs on aneuploid chromosomes, and the individual analysis is missing SNPs in individual samples due to thin coverage (or, more accurately, they are being called but fail the QD or quality filters due to thin coverage; these SNPs pass the filters in the multi-sample analysis).

I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do one multi-sample call on all the samples that are diploid for it, another multi-sample call on all the samples that are triploid for it, and so on. I am hoping that this will give me the best of the two approaches above.
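
To make the grouping step concrete, here is a minimal Python sketch. It assumes the ploidy estimates from the coverage script are available as (sample, chromosome, ploidy) records; that layout, the function name and the sample/chromosome names are all hypothetical and only for illustration.

```python
from collections import defaultdict


def group_samples_by_ploidy(ploidy_estimates):
    """Map each (chromosome, ploidy) pair to the sorted list of samples
    estimated to have that ploidy on that chromosome."""
    groups = defaultdict(list)
    for sample, chrom, ploidy in ploidy_estimates:
        groups[(chrom, ploidy)].append(sample)
    return {key: sorted(samples) for key, samples in sorted(groups.items())}


# Hypothetical per-chromosome ploidy estimates (sample, chromosome, ploidy).
estimates = [
    ("sampleA", "chr1", 2),
    ("sampleB", "chr1", 2),
    ("sampleC", "chr1", 3),
    ("sampleA", "chr2", 3),
    ("sampleB", "chr2", 2),
    ("sampleC", "chr2", 2),
]

groups = group_samples_by_ploidy(estimates)
for (chrom, ploidy), samples in groups.items():
    # Each line corresponds to one multi-sample calling run.
    print(chrom, ploidy, samples)
```

Each (chromosome, ploidy) group would then get its own multi-sample call, with the ploidy setting matched to the group.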

So my questions are:

1) Does this strategy make sense? Is there any reason not to do it this way?

2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my VCF files before filtering and annotation, or is there a reason to keep them separate?

Any input from the GATK team or the wider community is appreciated!

Thanks

Kathryn

Answers

  • kathryncrouch (Glasgow, Member)
    edited July 2014

    Hi Geraldine,

    Thanks for your response. I had a play with this yesterday and the initial results look good.

    After playing around with different ways of combining VCFs, I ended up running the UnifiedGenotyper using a read group blacklist and -onlyEmitSamples flags. So, for example, for chromosome 1 diploid samples I created a read group blacklist of all the samples that weren't diploid and only emitted the samples that were diploid. I wrote a script that will group the samples and generate the command lines automatically.

    This approach has allowed me to do a true merge without worrying about priority as records are only emitted for samples at the appropriate ploidy.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Sounds like you've got a nice setup there! Good luck and let us know how it goes -- I'm sure others would be interested in the workflow.
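
For readers who want to try the blacklist plus -onlyEmitSamples approach described in the thread, the following Python sketch shows one way a command line for a single (chromosome, ploidy) group might be generated. It is not Kathryn's script. The argument names (-T UnifiedGenotyper, -ploidy, -rgbl for the read group blacklist, -onlyEmitSamples) and the "SM:<sample>" blacklist line format are assumed from GATK 3.x and should be checked against the documentation for the version in use. All file names, sample names and the function itself are hypothetical.

```python
import shlex


def build_ug_command(chrom, ploidy, emit_samples, all_samples, bams,
                     reference="reference.fasta",
                     gatk_jar="GenomeAnalysisTK.jar"):
    """Return (command, blacklist_path, blacklist_lines) for one group.

    Samples that are not in emit_samples are blacklisted by their SM tag,
    and only emit_samples are written to the output VCF.
    """
    excluded = sorted(set(all_samples) - set(emit_samples))
    blacklist_path = f"{chrom}.ploidy{ploidy}.blacklist.txt"
    # Assumed blacklist format: one TAG:STRING filter per line.
    blacklist_lines = [f"SM:{sample}" for sample in excluded]

    args = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        args += ["-I", bam]
    args += ["-L", chrom, "-ploidy", str(ploidy)]
    if excluded:
        args += ["-rgbl", blacklist_path]
    for sample in sorted(emit_samples):
        args += ["-onlyEmitSamples", sample]
    args += ["-o", f"{chrom}.ploidy{ploidy}.vcf"]

    command = " ".join(shlex.quote(arg) for arg in args)
    return command, blacklist_path, blacklist_lines


# Hypothetical example: chr1 is diploid in sampleA and sampleB only.
command, blacklist_path, blacklist_lines = build_ug_command(
    chrom="chr1",
    ploidy=2,
    emit_samples=["sampleA", "sampleB"],
    all_samples=["sampleA", "sampleB", "sampleC"],
    bams=["sampleA.bam", "sampleB.bam", "sampleC.bam"],
)
print(command)
print(blacklist_path, blacklist_lines)
```

The sketch only builds the command string and the blacklist contents. Writing the blacklist file and running the command are left to the caller so the output can be inspected first.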
