
VQSR

mehar Posts: 58 Member

Hi,

I am working on the dog genome and trying to use VQSR on my data.

Here is the command I used:

java -Xmx4G -jar GenomeAnalysisTK.jar -R genome.fa -T VariantRecalibrator -input GATK-snp.vcf -resource:dbsnp,known=false,training=true,truth=true,prior=6.0 canFam3_SNP.vcf -mode SNP -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an Inbreed

1. I only have a dbSNP file as a training set. I first set the options known=true,training=false,truth=false,prior=6.0 on the command line as per the documentation, but that did not work, and the suggestion was to use known=false,training=true,truth=true,prior=6.0 instead. What does prior=6.0 mean here? Is there a threshold for the prior?

2. The above command produces empty tranches and recal files.

3. Even though the files are empty, I proceeded to ApplyRecalibration with the command below:

java -Xmx4G -jar GenomeAnalysisTK.jar -R genome.fa -T ApplyRecalibration -input GATK-snp.vcf --ts_filter_level 99.0 -tranchesFile output.tranches -recalFile output.recal -mode SNP -o recalibrated.filtered.vcf.

It gives the error:

ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

ERROR Name FeatureType Documentation
ERROR BCF2 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_bcf2_BCF2Codec.html
ERROR VCF VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCFCodec.html
ERROR VCF3 VariantContext http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCF3Codec.html
ERROR

Any help to fix these?


Answers

  • Geraldine_VdAuwera Posts: 6,412 Administrator, GATK Developer

    Hi there,

    1. You have to specify at least one training set containing truth variants for VQSR to work. The prior is the prior likelihood that you assign to variants in the truth set. It represents the probability that a variant in that set is indeed true and not an artifact. The value depends mainly on how confident you are about the quality of the call set. See more discussion on this here.

    2. What was the console output? Did you get any warnings or error messages?

    3. If the files are empty, there is no point in running the next step; it will not work.

    Geraldine Van der Auwera, PhD
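
To make the prior in point 1 concrete: as far as I know, VariantRecalibrator reads the prior= value on a Phred scale, so it can be converted to a probability as in the minimal Python sketch below. The function name is made up for illustration, and the Phred interpretation is an assumption to check against the tool documentation for your GATK version.

```python
def phred_prior_to_probability(prior_q: float) -> float:
    """Convert a Phred-scaled prior Q into the probability that a variant
    in the resource is real, assuming p = 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-prior_q / 10.0)


# Compare a few prior values, including the prior=6.0 from the command above.
for q in (2.0, 6.0, 12.0, 15.0):
    print(f"prior={q:>4} -> P(true) = {phred_prior_to_probability(q):.4f}")
```

Under that assumption, prior=6.0 corresponds to a probability of roughly 0.75 that a variant in the resource is real, so there is no fixed threshold; a larger value simply expresses more confidence in the call set.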

  • mehar Posts: 58 Member

    Thanks. There seems to be an error with the -an Inbreed annotation. I removed it and it works now. I added the options -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 to the VariantRecalibrator command above, then ran ApplyRecalibration. Now I have the recalibrated scores. Could you let me know how to interpret the VQSLOD scores and the PASS or fail filter?

    Does a higher score mean the variant is more reliable, or is it the other way around?

  • Geraldine_VdAuwera Posts: 6,412 Administrator, GATK Developer

    That is addressed in the documentation for the VQSR method. We are happy to answer detailed questions, but please read the method documentation before asking general questions.

    Geraldine Van der Auwera, PhD
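
For readers with the same question: VQSLOD is the log odds of a variant being true versus false under the trained model, so higher values indicate more reliable calls, and ApplyRecalibration marks each record as PASS or with a tranche filter name in the FILTER column. Below is a minimal Python sketch for pulling both values out of a VCF data line. The helper name and the two records are made up for illustration, and the tranche label only shows the general naming pattern.

```python
def vqslod_and_filter(vcf_line: str):
    """Return (FILTER, VQSLOD) for one tab-separated VCF data line.

    VQSLOD is None when the INFO field has no VQSLOD key.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    filter_value = fields[6]  # column 7 of a VCF record is FILTER
    info = {}
    for entry in fields[7].split(";"):  # column 8 is INFO: key=value;flag;...
        key, _, value = entry.partition("=")
        info[key] = value
    vqslod = float(info["VQSLOD"]) if "VQSLOD" in info else None
    return filter_value, vqslod


# Hypothetical records, not taken from the poster's data.
records = [
    "chr1\t1000\t.\tA\tG\t250.3\tPASS\tQD=21.4;VQSLOD=5.82",
    "chr1\t2000\t.\tC\tT\t35.1\tVQSRTrancheSNP99.00to99.90\tQD=2.1;VQSLOD=-1.37",
]
for line in records:
    print(vqslod_and_filter(line))
```

In this toy input, the PASS record carries the higher VQSLOD and the record with the lower VQSLOD carries a tranche filter, which is the pattern to expect after filtering at a given --ts_filter_level.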

  • mehar Posts: 58 Member

    One of the most frequent questions about filtering parameters is what the ideal thresholds are for values such as QUAL (quality of the SNP) and mapping quality (MQ), and the most frequent answer is that it depends on the dataset :)

    QUAL and MQ are Phred-scaled probability scores for the variant. Can we use QUAL > 40 and MQ > 40 to get a good set of filtered variants irrespective of the dataset?

  • Geraldine_VdAuwera Posts: 6,412 Administrator, GATK Developer

    Unfortunately there is no absolute rule that will yield a good set of filtered variants irrespective of the dataset. Part of the problem is how you define a good set. Is it a very sensitive set, or a very specific set? If you use very stringent quality filters, you will probably get a very specific set, but you will miss variants that are real despite having low scores. If you lower the filter thresholds to retrieve those variants, you also let in false positives.

    That is the point of VQSR, to be able to identify patterns of covariation that are more informative than simply filtering on quality scores, and to fine-tune the filtering to achieve your desired compromise between sensitivity and specificity. But it is not perfect, and it is not possible to use with every dataset. In any case, you need to experiment with the settings to find what works for you.

    Geraldine Van der Auwera, PhD
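
The trade-off between sensitivity and specificity described above can be illustrated with a small Python sketch. The scores and true/false labels are invented toy data, and the function is a hypothetical helper for hard filtering on a single score, not part of GATK or of how VQSR works internally.

```python
def sensitivity_specificity(calls, threshold):
    """Compute (sensitivity, specificity) for a hard score filter.

    calls: list of (score, is_real) pairs; a call is kept when score >= threshold.
    """
    tp = sum(1 for score, real in calls if real and score >= threshold)
    fn = sum(1 for score, real in calls if real and score < threshold)
    tn = sum(1 for score, real in calls if not real and score < threshold)
    fp = sum(1 for score, real in calls if not real and score >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up toy calls: five real variants and five artifacts with overlapping scores.
toy_calls = [
    (95, True), (80, True), (60, True), (35, True), (20, True),
    (70, False), (45, False), (30, False), (15, False), (10, False),
]

for threshold in (25, 40, 75):
    sens, spec = sensitivity_specificity(toy_calls, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy set, raising the threshold removes more artifacts but also discards more real variants, which is why no single cutoff such as QUAL > 40 suits every dataset.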
