Using GATK on Arabidopsis data after EMS mutagenesis

olga (Technion), Member

I have been using GATK for a while, but so far I have only analyzed human samples, and the current analysis is of Arabidopsis thaliana data. Since I do not have databases of known indels and SNPs for this organism, I am following the suggested workflow without known sites.
I only have 2 samples in the analysis: a wild-type (WT) parental strain and a sample consisting of a pool of 50 plants that underwent EMS mutagenesis. This treatment causes a large number of mutations, and each of the 50 plants in the pool can carry different variants. The goal of the experiment is to find the one strong homozygous mutation that is common to the mutated plants and not present in the parental strain.
Since the data is a bit different from any other data I have worked with, I would like to know whether the standard workflow (running indel realignment without known sites, running HaplotypeCaller (HC), filtering for high-confidence SNPs, and then running BQSR and HC again) is also recommended in this case. If so, should I apply different cutoffs to obtain the high-confidence SNP set? Should I use the variants found in both samples to create the high-confidence SNP file? (The mutagenized sample will contain many mutations with a wide range of frequencies that will not be present in the parental strain at all.)

Thank you very much,
Olga Karinsky
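
The selection described above (a variant that is close to fixed in the mutagenized pool and absent from the parental strain) can be sketched in Python as a simple filter on per-sample allele counts. This is only an illustration, not part of any GATK tool: the `SiteCounts` record, the `candidate_sites` function, and all three thresholds are assumptions made for the example, and in practice the counts would come from the per-sample AD (allelic depth) values of a called VCF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    """Hypothetical record: ref/alt read counts at one site for both samples."""
    chrom: str
    pos: int
    wt_ref: int
    wt_alt: int
    pool_ref: int
    pool_alt: int


def alt_fraction(ref: int, alt: int) -> float:
    """Fraction of reads supporting the alternate allele (0.0 if no coverage)."""
    depth = ref + alt
    return alt / depth if depth else 0.0


def candidate_sites(sites, min_depth=10, min_pool_af=0.9, max_wt_af=0.05):
    """Keep sites that look homozygous-alt in the pool and absent in the WT.

    All thresholds are illustrative assumptions, not recommended cutoffs.
    """
    kept = []
    for s in sites:
        # Skip sites where either sample is too shallow to judge.
        if s.wt_ref + s.wt_alt < min_depth or s.pool_ref + s.pool_alt < min_depth:
            continue
        pool_af = alt_fraction(s.pool_ref, s.pool_alt)
        wt_af = alt_fraction(s.wt_ref, s.wt_alt)
        if pool_af >= min_pool_af and wt_af <= max_wt_af:
            kept.append(s)
    return kept


# Made-up counts, for illustration only.
example_sites = [
    SiteCounts("chr1", 100, 30, 0, 1, 49),   # shared by nearly all pooled plants
    SiteCounts("chr1", 200, 30, 0, 45, 5),   # private to a few plants in the pool
    SiteCounts("chr2", 300, 0, 30, 0, 50),   # already present in the WT background
    SiteCounts("chr3", 400, 3, 0, 0, 50),    # WT coverage too low to judge
]

for site in candidate_sites(example_sites):
    print(site.chrom, site.pos)
```

With the made-up counts above, only the first site passes: the second is a low-frequency mutation carried by a few plants, the third is a background variant of the parental strain, and the fourth lacks WT coverage.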



  • Sheila (Broad Institute), Member, Broadie admin

    Hi Olga Karinsky,

    There are two routes you can take. You can call variants on both samples together and use the most confident variants as the known sites for BQSR, then repeat BQSR and HaplotypeCaller until convergence. Or you can call variants on the two samples separately and run BQSR and HaplotypeCaller on each sample on its own until convergence.

    We really don't know which way is best, because we are not sure how many confident variant sites you will find if you call both samples together. If too many sites are called confidently in the mutated sample, those sites will be masked incorrectly in the normal (wild-type) sample during BQSR. The best thing to do is to try both methods and see which gives you better results.

    Good luck!
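
The "until convergence" loop in either route can be sketched in Python as below. This is a minimal sketch under stated assumptions, not GATK's own procedure: `call_confident_sites` and `recalibrate` are hypothetical callables standing in for "run HaplotypeCaller and filter for confident sites" and "run BQSR with those sites as known sites", and the Jaccard overlap threshold of 0.99 is an arbitrary choice of stopping rule.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two site sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def bootstrap_until_converged(call_confident_sites, recalibrate, reads,
                              max_rounds=5, min_overlap=0.99):
    """Alternate recalibration and calling until the confident set stabilizes.

    call_confident_sites(reads) -> set of sites   (hypothetical stand-in)
    recalibrate(reads, known_sites) -> new reads  (hypothetical stand-in)
    Returns (sites, rounds_run, converged).
    """
    known = call_confident_sites(reads)
    for round_number in range(1, max_rounds + 1):
        # Recalibrate using the current confident set as the known sites,
        # then call again on the recalibrated reads.
        reads = recalibrate(reads, known)
        new_sites = call_confident_sites(reads)
        if jaccard(known, new_sites) >= min_overlap:
            return new_sites, round_number, True
        known = new_sites
    return known, max_rounds, False
```

For the joint route, `reads` would represent both samples together. For the separate route, the function would be run once per sample, so that sites confident only in the mutagenized pool are never used as known sites for the wild-type sample.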

