Quality Score Re-calibration without dbSNP

edge_diners Member
edited October 2012 in Ask the GATK team

Hi,

As far as I know, GATK SNP calling involves the following steps:
1. Base-calling and image analysis
2. Alignment
3. SNP-calling

The last step of Alignment is quality score re-calibration, which includes count covariates and table re-calibration.
How can I run quality score re-calibration if I don't have a dbSNP file for my query sequence?
My sequence is a newly assembled transcriptome, so I don't have any suitable dbSNP for it.

Or can I just stop at "Local realignment around indels" and continue to SNP-calling for a newly sequenced transcriptome?
Thanks, and looking forward to hearing from you.

Best Answer

  • Geraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    If you want to try recalibrating your data despite not having a dbsnp database, this is what you need to read:

    I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

    The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately, without this information the data becomes almost completely unusable, because the reference-mismatching SNP sites cause the quality of the bases to be inferred as much, much lower than it actually is.
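To make the effect described above concrete, here is a minimal arithmetic sketch. It is not the recalibrator's actual model: it only uses the standard Phred relation Q = -10 * log10(error rate) and treats every mismatch at a real polymorphic site as if it were a sequencing error. The function names and the rates are hypothetical, chosen purely for illustration.

```python
import math


def phred(error_rate):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(true_error_rate, snp_mismatch_rate, mask_known_snps):
    """Quality inferred from the observed mismatch rate.

    Simplifications (assumptions, not the recalibrator's behaviour):
    masking removes every SNP mismatch, and the small change in the
    number of counted bases after masking is ignored.
    """
    counted_snp_rate = 0.0 if mask_known_snps else snp_mismatch_rate
    return phred(true_error_rate + counted_snp_rate)


# Hypothetical numbers: bases whose true error rate is 1 in 10,000 (Q40),
# where 1 base in 1,000 mismatches the reference because of real variation.
with_known_sites = empirical_quality(1e-4, 1e-3, mask_known_snps=True)
without_known_sites = empirical_quality(1e-4, 1e-3, mask_known_snps=False)

print(f"known sites masked: Q{with_known_sites:.1f}")
print(f"no known sites:     Q{without_known_sites:.1f}")
```

With these made-up rates, the same bases go from Q40 to roughly Q29.6 once the polymorphic sites are counted as errors, which is the kind of downward bias the answer warns about.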

    However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
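The "take the SNPs that you have the highest confidence in" step could be sketched as a small filter over the VCF from the first round of calling. This is only an illustration under stated assumptions: the function name `select_confident_snps` and the QUAL cutoff of 100 are hypothetical, and a sensible threshold depends on your data. Records are kept only if they are single-base substitutions, have a FILTER value of `PASS` or `.` (no filters applied), and meet the QUAL cutoff.

```python
def select_confident_snps(vcf_lines, min_qual=100.0):
    """Yield VCF header lines plus the SNP records that look high-confidence.

    min_qual is a hypothetical cutoff; tune it for your own data.
    """
    bases = {"A", "C", "G", "T"}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Keep all header lines so the output is still a VCF.
            yield line
            continue
        fields = line.split("\t")
        ref, alt, qual, filt = fields[3], fields[4], fields[5], fields[6]
        # A SNP here means a single-base REF and single-base ALT allele(s).
        is_snp = ref.upper() in bases and all(
            allele.upper() in bases for allele in alt.split(",")
        )
        # Treat "." (no filtering applied) the same as PASS; this is a
        # design choice for the sketch, not a GATK rule.
        passed = filt in ("PASS", ".")
        confident = qual != "." and float(qual) >= min_qual
        if is_snp and passed and confident:
            yield line
```

A typical use would be to read the first-round VCF line by line, write the yielded lines to a new file, and supply that file as the known-sites VCF for recalibration. Dropping filtered records up front also avoids any question about how records not marked `PASS` are treated downstream.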

    If that's too complicated and you decide to skip it, then you don't need the PrintReads step at all. Just use your realigned/fixed BAM file as input to the genotyper.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
  • Hi Geraldine,

    Can I run PrintReads on the BAM file output by Picard's FixMateInformation.jar, using the following command?
    java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I input.marked.realigned.fixed.bam \
    -o output.bam

    I am leaving out "recalibration_report.grp" because I don't have any dbsnp or reference for it.
    Thus I am unable to generate recalibration_report.grp :(

    Kindly correct me if I am wrong.
    Thanks for any advice.

  • bright_young Shenzhen, China Member

    @Geraldine_VdAuwera
    I have finished the GATK re-alignment using the SNP file generated by GATK as a dbSNP reference. However, the low-quality SNP sites in these files are not removed; they are just marked as "PASS", "filter", or "LowQual;filter".
    I wonder whether the low-quality SNPs will be used as part of the dbSNP reference.
    Thank you!!!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I think they're ignored, but even if they are used it is harmless and will not hurt your analysis.
