known_sites for the v37 genome in Base Recalibration (GATK 4.0.4)

Hi GATK team, I see in the latest WDL scripts of the Broad pipeline (posted on GitHub) that the databases used as known_sites for the base recalibration process are 'dbSNP_138' and 'Mills_and_1000g_gold_standard_indels' for the v37 human genome. I have a few questions about the usage of known sites:

a) I would like to upgrade to the latest dbSNP_151 available on the dbSNP website and use the All.vcf available there. Do you foresee any issues with using that dbSNP version? May I ask why the Broad Institute is not using the latest dbSNP version in its scripts? Is there any special reason for this other than maintaining stability?

b) Is it OK to use a huge database such as gnomAD as known_sites? Do you think it could have any negative impact on the calculations underlying the base recalibration algorithm? (Note: I understand how you calculate base quality error rates and adjust them after masking known sites (from population databases) in a BAM, and so I am afraid that masking a large number of sites may affect the sensitivity and accuracy of the base recalibration process. I am aware of the role that the 4 covariates play, but I am still worried that masking more and more sites would result in a reduced error rate, because fewer unmasked sites remain to be hypothesized as 'error mismatches', ultimately resulting in less recalibration happening at ALL sites.) On the other side of the coin, masking too many sites may not be helpful, because the base quality of those masked sites may not be corrected significantly: a site with a low machine-annotated base quality would remain low quality and hence not be called appropriately by the variant calling step.

c) How about also using the 1000g_phase3 VCFs as known_sites, in addition to dbSNP_151 and the old Mills_and_1000G_gold_standard_indels?

d) Over the last 8 years, sequencing techniques have improved, and base qualities with them. Does Base Recalibration still make sense as a necessary step in exome or whole genome analysis? Does it have a significant effect on the variant qualities? I have yet to look at the change logs for the Base Recalibration tool for past years. Could you highlight any significant changes made to this tool since its inception?

Note: Please view these questions from the standpoint of a clinical lab.



  Sheila Broad Institute

    Hi S,

    a) I suspect the team simply has not had the time to change to the latest version. There should be no issue with using the latest version.

    b) We still recommend the known sites files here, but the team has simply not had the time to test and validate using gnomAD data. I suspect that will be a future plan. In your case, it is better to overmask a bit than to undermask, so you should be fine using gnomAD. Let us know how it goes.
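
The "overmask rather than undermask" advice can be illustrated with a toy tally. This is only a sketch under stated assumptions, not GATK's implementation: the site counts are invented, the `+1/+2` pseudocount is a simplification chosen for the example, and the extra masked sites are assumed to be no more or less error-prone than the rest. The names `empirical_quality` and `tally` are hypothetical.

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled error rate; the +1/+2 pseudocount is an assumption of this sketch."""
    return -10 * math.log10((mismatches + 1) / (observations + 2))

def tally(sites, masked):
    """Sum mismatches and observed bases over sites that are NOT masked."""
    mismatches = observations = 0
    for pos, (n_bases, n_mismatch) in sites.items():
        if pos not in masked:
            observations += n_bases
            mismatches += n_mismatch
    return mismatches, observations

# Toy data: 1000 sites with 100 bases each. Every 10th site carries one
# sequencing error (true error rate 1e-3, about Q30). Two sites are real
# heterozygous variants, contributing 50 non-error mismatches each.
variants = {15, 515}
sites = {}
for pos in range(1000):
    if pos in variants:
        sites[pos] = (100, 50)
    elif pos % 10 == 0:
        sites[pos] = (100, 1)
    else:
        sites[pos] = (100, 0)

# Undermasking: real variants are counted as errors.
q_under = empirical_quality(*tally(sites, masked=set()))
# Exact masking: only the true variant sites are skipped.
q_exact = empirical_quality(*tally(sites, masked=variants))
# Overmasking: half of all sites are skipped, including both variants.
overmask = {pos for pos in sites if pos % 20 >= 10}
q_over = empirical_quality(*tally(sites, masked=overmask))

print(f"undermasked: Q{q_under:.1f}")
print(f"exact mask:  Q{q_exact:.1f}")
print(f"overmasked:  Q{q_over:.1f}")
```

In this toy case, leaving real variants unmasked pulls the estimate several Phred units below the true value, while masking half of all sites leaves it almost unchanged because mismatches and observations shrink together. The cost of overmasking is fewer observations per covariate bin, which matters more for small datasets.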

    c) That should be fine, but it may be overkill to input so many known sites files.
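
For reference, supplying several known sites files means repeating the `--known-sites` argument of GATK4's `BaseRecalibrator`. The sketch below only assembles and prints such a command line. All file names are hypothetical placeholders, and the helper `build_bqsr_command` is invented for this example.

```python
import shlex

def build_bqsr_command(bam, reference, known_sites, out_table):
    """Assemble a GATK4 BaseRecalibrator call with one --known-sites per file."""
    cmd = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for vcf in known_sites:
        cmd += ["--known-sites", vcf]
    cmd += ["-O", out_table]
    return cmd

# Hypothetical file names; substitute the paths of your own resources.
cmd = build_bqsr_command(
    bam="sample.bam",
    reference="human_g1k_v37.fasta",
    known_sites=[
        "dbsnp_151.b37.vcf.gz",
        "Mills_and_1000G_gold_standard.indels.b37.vcf.gz",
        "1000G_phase3.b37.vcf.gz",
    ],
    out_table="sample.recal.table",
)
print(shlex.join(cmd))
```

Each VCF needs an index and must use the same contig names as the reference, so a file downloaded directly from dbSNP may need to be checked against the v37 sequence dictionary before use.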

    d) As Geraldine always says, "BQSR is like fire insurance". If something went wrong during the sequencing, BQSR could be very helpful.


