(BP2.3) Variant Quality Score Recalibration

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

This article is part of the Best Practices workflow documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

The GATK callers (HaplotypeCaller and UnifiedGenotyper) are by design very lenient in calling variants in order to achieve a high degree of sensitivity. This is a good thing because it minimizes the chance of missing real variants, but it does mean that we need to refine the call set to reduce the amount of false positives, which can be quite large. The best way to perform this refinement is to use variant quality score recalibration (VQSR). In the first step of this two-step process, the program uses machine learning methods to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, or on targeted sequencing data. If for either of these reasons you find that you cannot perform variant recalibration on your data, we recommend you use hard-filtering instead. See the methods articles and FAQs for more details on how to do this.

image
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • him72him72 Posts: 3Member

    Hi,
    I have WGS data and I am having issues with few of the chromosomes X,Y,M and some random chromosomes always fails to go through vqsr step.
    I am using gatk 2.5.2 and also tried 3.1.1 and when I put in the commend it just quits.
    How do I know if this is due to memory issue? I just get my prompt back.
    Here is my commend:
    java -Xmx20g -Xms20g -jar /srv/gs1/software/gatk/gatk-3.1.1/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /srv/gs1/projects/scg/Resources/GATK/hg19/ucsc.hg19.fasta \
    -input /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.gatk.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /srv/gs1/projects/scg/Resources/GATK/hg19/hapmap_3.3.hg19.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /srv/gs1/projects/scg/Resources/GATK/hg19/1000G_omni2.5.hg19.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /srv/gs1/projects/scg/Resources/GATK/hg19/1000G_phase1.snps.high_confidence.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /srv/gs1/projects/scg/Resources/GATK/hg19/dbsnp_137.hg19.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode SNP \
    --percentBadVariants 0.05 \
    --maxGaussians 4 \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.tmp.snp.vcf \
    -tranchesFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.tranches.gatk.snp.recal.csv \
    -rscriptFile /srv/gsfs0/projects//WGS/D01/OUTPUT/chrM.gatk.recal.snp.R

    Also How much memory do I need to give this job?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    This sounds like a platform issue. Are you using a personal computer or a server?

    Geraldine Van der Auwera, PhD

  • him72him72 Posts: 3Member

    I am using a server that's running sun grid engine

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    If you're just getting your prompt back you should ask your IT folks for help. Sounds like there's something problematic that's not GATK-specific.

    Geraldine Van der Auwera, PhD

  • him72him72 Posts: 3Member

    I am not so sure, because when I don't separate the data by chromosome it runs fine it just takes too long.
    Is there a minimum number of variants for the input?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    There is no hard minimum. Anyway if you have two few variants you get an error message. You wouldn't just get your prompt back.

    Geraldine Van der Auwera, PhD

  • jhparkjhpark Posts: 3Member

    I have one question. How about gVCF? Is it also same?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    @jhpark‌ Can you please clarify your question? I'm not sure I understand what you are asking.

    Geraldine Van der Auwera, PhD

  • jhparkjhpark Posts: 3Member

    I'm sorry for my late reply. My question is that the concept and command line are the same as that of general vcf.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    @jhpark, the basic format specification is the same for regular VCF and for GVCF, but GVCF has some extra information which is described here: http://www.broadinstitute.org/gatk/guide/article?id=4017

    To generate a GVCF, you need to use the HaplotypeCaller with appropriate arguments (see -ERC GVCF).

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.