Results GATK v2.7 vs. v3.7

Juls (Member)

Dear GATK team,

I have started to re-analyse some samples that I analysed a long time ago with v2.7, mainly because a new reference genome has become available for this organism. The old reference genome was of pretty poor quality, with lots of assembly mistakes, lots of scaffolds, and a significant proportion of Ns.
Now I have observed two things:
*) the number of SNPs increased significantly
*) the percentage of overlap (mutual SNPs) between two samples increased significantly
My questions relate to the 'why'.
Part of the increase in the number of SNPs will of course come from additional sequence information in the new genome (in place of the Ns), but that does not explain the large increase I observed or the higher percentage of mutual SNPs.
1) So can a bad reference lead to fewer SNPs being called?
2) Does the new GATK version call more SNPs, and/or is it able to call SNPs more reliably on low-coverage data? The number of SNPs called now appears to be more constant across the samples. With the previous version I observed quite a strong dependency between average coverage and number of SNPs.

Thank you very much for your help!
Best,
Julia

Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @Juls
    Hi Julia,

    I don't think we can help entirely with this. There have been many upgrades since version 2.7. It is possible the reference had an effect: bad mapping can lead to many missed calls. It is also possible that the newer version does much better when calling low-coverage data. You can have a look at the release notes and version highlights to see whether any of the changes may have had an impact on your results.

    Keep in mind we recommend everyone use the latest version, so I would stick to the results from 3.7.

    -Sheila
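
For readers who want to reproduce the "mutual SNPs" figure discussed in this thread, here is a minimal sketch in plain Python. It rests on several assumptions that are not stated in the original post: both call sets are VCF text on the same reference, a SNP is a record with a single-base REF and a single-base ALT, and "overlap" means the number of shared sites divided by each sample's own SNP count. The function names `snp_sites` and `overlap_percentages` and the example records are hypothetical, made up purely for illustration.

```python
# Hypothetical helper names and toy records; not part of GATK.
BASES = {"A", "C", "G", "T"}


def snp_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for simple biallelic SNP records."""
    sites = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Assumption: keep only single-base substitutions (no indels,
        # no multi-allelic records).
        if ref in BASES and alt in BASES:
            sites.add((chrom, pos, ref, alt))
    return sites


def overlap_percentages(sites_a, sites_b):
    """Return (shared count, % of A that is shared, % of B that is shared)."""
    shared = sites_a & sites_b
    pct_a = 100.0 * len(shared) / len(sites_a) if sites_a else 0.0
    pct_b = 100.0 * len(shared) / len(sites_b) if sites_b else 0.0
    return len(shared), pct_a, pct_b


# Toy records (tab-separated: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).
SAMPLE_A = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr1\t300\t.\tG\tGA\t50\tPASS\t.",  # indel, ignored
    "chr2\t50\t.\tT\tC\t50\tPASS\t.",
]
SAMPLE_B = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr2\t50\t.\tT\tC\t50\tPASS\t.",
    "chr2\t75\t.\tG\tA\t50\tPASS\t.",
    "chr2\t90\t.\tA\tC\t50\tPASS\t.",
]

if __name__ == "__main__":
    shared, pct_a, pct_b = overlap_percentages(
        snp_sites(SAMPLE_A), snp_sites(SAMPLE_B)
    )
    print(f"shared={shared} pct_of_A={pct_a:.1f} pct_of_B={pct_b:.1f}")
```

Reporting the shared fraction separately for each sample makes it easier to see whether a low overlap comes from one sample simply having fewer calls, for example because of lower coverage. Sites are only comparable when both call sets use the same reference coordinates, so this compares two samples within one analysis, not results across the old and new reference genomes.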
