known_sites for the b37 genome for Base Recalibration (GATK 4.0.4)
Hi GATK team, I see in the latest WDL scripts of the Broad pipeline (posted on GitHub) that the databases used as known_sites for the base recalibration step are dbSNP_138 and Mills_and_1000G_gold_standard_indels for the b37 human genome. I have a few questions about the usage of known sites:
a) I would like to upgrade to the latest dbSNP_151 available on the dbSNP website and use the All.vcf provided there. Do you foresee any issues with that dbSNP version? May I ask why the Broad Institute is not using the latest dbSNP build in its scripts? Is there any special reason beyond maintaining stability?
b) Is it OK to use a very large database such as gnomAD as known_sites? Do you think it could negatively affect the calculations underlying the base recalibration algorithm? (Note: I understand how base quality error rates are calculated and adjusted after known sites from population databases are masked in a BAM, and I am therefore afraid that masking a large number of sites may hurt the sensitivity and accuracy of base recalibration. I am aware of the role the four covariates play, but I am still worried that masking more and more sites would lower the estimated error rate, since fewer unmasked sites remain to be counted as error mismatches, ultimately resulting in less recalibration happening at ALL sites. On the other side of the coin, masking too many sites may not be helpful: the base qualities of those masked sites may not get fixed significantly, so a site with a low machine-annotated base quality would remain low quality and hence not be called appropriately by the variant calling step.)
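To make the concern in (b) concrete, here is a toy calculation of the masking effect. This is only an illustration, not GATK's actual Bayesian empirical-quality estimate, and all counts below are invented:

```python
import math

def empirical_quality(mismatches, observations):
    # Toy Phred-scaled mismatch rate with a small pseudo-count,
    # loosely in the spirit of BQSR's empirical estimate.
    # NOT GATK's real calculation.
    p = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(p)

# Hypothetical pileup: 1,000,000 observed bases, 1,500 mismatches overall.
total_obs, total_mm = 1_000_000, 1_500

# Small known-sites mask: removes 600 sites carrying 500 of the mismatches.
q_small_mask = empirical_quality(total_mm - 500, total_obs - 600)

# Very large mask (gnomAD-sized): removes 400,000 sites carrying
# 1,200 of the mismatches.
q_large_mask = empirical_quality(total_mm - 1_200, total_obs - 400_000)

# The larger mask leaves a thinner error pool, so the apparent empirical
# quality rises and less downward recalibration is applied.
print(round(q_small_mask, 1), round(q_large_mask, 1))
```

Under these made-up numbers the large mask yields a higher apparent empirical quality (~33 vs ~30), which is exactly the "less recalibration everywhere" effect I am worried about.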
c) How about also using the 1000G_phase3 VCFs as known_sites, in addition to dbSNP_151 and the old Mills_and_1000G_gold_standard_indels?
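For reference, the combination in (c) would look something like the sketch below, since GATK 4's BaseRecalibrator accepts the --known-sites argument multiple times. File names are placeholders for whatever b37 reference and resource VCFs are actually used:

```shell
# Sketch of a GATK 4 BQSR run with several known-sites resources.
gatk BaseRecalibrator \
  -R human_g1k_v37.fasta \
  -I sample.bam \
  --known-sites dbsnp_151.b37.vcf.gz \
  --known-sites Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
  --known-sites 1000G_phase3.sites.b37.vcf.gz \
  -O sample.recal.table

# Apply the recalibration table to produce the recalibrated BAM.
gatk ApplyBQSR \
  -R human_g1k_v37.fasta \
  -I sample.bam \
  --bqsr-recal-file sample.recal.table \
  -O sample.recal.bam
```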
d) Over the last 8 years, sequencing techniques have improved, and base qualities with them. Does Base Recalibration still make sense as a necessary step in exome or whole-genome analysis? Does it have a significant effect on variant qualities? I still have to look at the change logs for the Base Recalibration tool over the past years. Could you highlight any significant changes made to this tool since its introduction?
Note: Please view these questions from the standpoint of a clinical lab.