Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

(BP2.3) Variant Quality Score Recalibration

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin
edited February 4 in Methods and Workflows

This article is part of the Best Practices workflow documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

The GATK callers (HaplotypeCaller and UnifiedGenotyper) are by design very lenient in calling variants in order to achieve a high degree of sensitivity. This is a good thing because it minimizes the chance of missing real variants, but it does mean that we need to refine the call set to reduce the amount of false positives, which can be quite large. The best way to perform this refinement is to use variant quality score recalibration (VQSR). In the first step of this two-step process, the program uses machine learning methods to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, or on targeted sequencing data. If for either of these reasons you find that you cannot perform variant recalibration on your data, we recommend you use hard-filtering instead. See the methods articles and FAQs for more details on how to do this.

image
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • him72him72 Posts: 3Member

    Hi, I have WGS data and I am having issues with few of the chromosomes X,Y,M and some random chromosomes always fails to go through vqsr step. I am using gatk 2.5.2 and also tried 3.1.1 and when I put in the commend it just quits. How do I know if this is due to memory issue? I just get my prompt back. Here is my commend: java -Xmx20g -Xms20g -jar /srv/gs1/software/gatk/gatk-3.1.1/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R /srv/gs1/projects/scg/Resources/GATK/hg19/ucsc.hg19.fasta \ -input /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.gatk.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /srv/gs1/projects/scg/Resources/GATK/hg19/hapmap_3.3.hg19.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 /srv/gs1/projects/scg/Resources/GATK/hg19/1000G_omni2.5.hg19.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 /srv/gs1/projects/scg/Resources/GATK/hg19/1000G_phase1.snps.high_confidence.hg19.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /srv/gs1/projects/scg/Resources/GATK/hg19/dbsnp_137.hg19.vcf \ -an DP \ -an QD \ -an FS \ -an MQRankSum \ -an ReadPosRankSum \ -mode SNP \ --percentBadVariants 0.05 \ --maxGaussians 4 \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ -recalFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.tmp.snp.vcf \ -tranchesFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.tranches.gatk.snp.recal.csv \ -rscriptFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.gatk.recal.snp.R

    Also How much memory do I need to give this job?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    This sounds like a platform issue. Are you using a personal computer or a server?

    Geraldine Van der Auwera, PhD

  • him72him72 Posts: 3Member

    I am using a server that's running sun grid engine

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    If you're just getting your prompt back you should ask your IT folks for help. Sounds like there's something problematic that's not GATK-specific.

    Geraldine Van der Auwera, PhD

  • him72him72 Posts: 3Member

    I am not so sure, because when I don't separate the data by chromosome it runs fine it just takes too long. Is there a minimum number of variants for the input?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,260Administrator, GSA Member admin

    There is no hard minimum. Anyway if you have two few variants you get an error message. You wouldn't just get your prompt back.

    Geraldine Van der Auwera, PhD

  • jhparkjhpark Posts: 2Member

    I have one question. How about gVCF? Is it also same?

Sign In or Register to comment.