Known indel/SNP databases for Indel-based realignment
Dear GATK team,
Could you please clarify, based on your experience or on the logic of the realignment algorithm, which option is likely to produce a more accurate set of indels at the indel-based realignment stage: dbSNP, the 1K gold standard (Mills...), or no known database at all? Speed and efficiency are not my concern.
Based on the documentation I found on your site, the "known" variants are used to identify "intervals" of interest, and realignment is then performed around the indels in those intervals. So it makes sense to me to use as many indels as possible for choosing the intervals: the more accurate calls in the 1K gold-standard datasets, plus others even if they are unreliable (as many of those in dbSNP are). After all, that increases the number of indel regions to be investigated and therefore potentially increases the accuracy. Depending on your algorithm's logic, it also seems possible that providing no known database would increase the chance of investigating more candidate misalignments and therefore improve the accuracy.
But if your logic instead uses the "known" indel sets to skip realignment, i.e. to ignore candidates around known sites, then it makes sense to use only a more accurate set such as the 1K gold standard.
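To make the first scenario concrete, here is a toy Python sketch of what I have in mind. It is not GATK code and I do not know whether the real algorithm works this way: the function name `candidate_intervals`, the fixed `pad` of 10 bases, and the example sites are all made up for illustration. It only shows that, if known sites simply seed intervals, adding a second database can only add to (never remove from) the regions examined.

```python
from typing import Iterable, List, Tuple

Site = Tuple[str, int]            # (chromosome, 1-based position); hypothetical format
Interval = Tuple[str, int, int]   # (chromosome, start, end), inclusive


def candidate_intervals(sites: Iterable[Site], pad: int = 10) -> List[Interval]:
    """Pad each known site by `pad` bases and merge overlapping windows.

    Assumption: known sites only seed intervals. This is a guess at the
    logic being asked about, not the actual GATK implementation.
    """
    merged: List[Interval] = []
    for chrom, pos in sorted(set(sites)):
        start, end = max(1, pos - pad), pos + pad
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous window on the same chromosome: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up sites standing in for a small trusted set and a larger noisy set.
gold = [("chr1", 100), ("chr1", 105)]
dbsnp = [("chr1", 100), ("chr1", 500), ("chr2", 40)]

gold_only = candidate_intervals(gold)
combined = candidate_intervals(gold + dbsnp)
print(len(gold_only), len(combined))
```

Under that assumption the combined set always covers at least the gold-standard intervals, which is why I would expect more known sites to help. The second scenario above would be the opposite case.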
Please let me know what you suggest.