Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

VQSR plots

h_asifh_asif Posts: 41Member
edited February 26 in Ask the GATK team

hi i run VQSR on the vcf file generated by unified genotyper and filtered PASS 63412 out of 86840 (files with snps and indels). as i run unified genotyper with -glm BOTH command. i have two questions

1) the number of pass snps are different when i counted them in two ways(first with original output of UG and other by separating snps and indel into two separate files using awk script

grep -v "#" sample1_recalibrated_snps_PASS.vcf | grep -c "PASS"
63412
grep -v "#" sample1_merged_recalibrated_snps_raw_indels.vcf| grep -c "LowQual“
18725

Statistics for separate snp file. here i use awk script to separate snps and indels (using awk script)

Rest is fine only problem is that pass snps no differ think why

grep -v  "^#" sample1_snp.vcf| grep -c "PASS
63402
grep -v  "^#" sample1_snp.vcf| grep -c "LowQual“
18725

2) i run VQSR on snps generated by unified genotyper i need to ask query about VQSR tranche plot for Snps. in my case tranche is not showing any false positive call see plot attached what do i interpret that there is no FP which seems surprising

when i tried to run VQSR on INDELS (in the same file) it doesnt work as i had 884 indels which i read from VQSR documentation and questions asked by ppl is small.

Post edited by Geraldine_VdAuwera on
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    Hi there,

    1) Maybe there is something wrong with your awk script. An easy way to separate SNPs and Indels is to use GATK's SelectVariants tool using the -selectType argument. Have a look at the documentation here:
    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html

    2) Can you please post your VQSR command lines for filtering SNPs? Your Ti/Tv ratio seems very high too. Also, you are correct that VQSR won't work on only 800 indels.

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member
    edited February 26

    thank you so much. i will try select variants

    my command line was

    java -Xmx3g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \
     -R h_hg19.fa \ -input sample1_merged_raw_snps.vcf \
     -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf.vcf \
     -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf \
     -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137_hg19.vcf \
     -an QD \
     -an HaplotypeScore \
     -an MQRankSum \
     -an ReadPosRankSum \
     -an FS \
     --maxGaussians 4 \
     -numBad 1000 \
     -mode SNP \
     -recalFile sample1_merged_variants.snps.recal \
     -tranchesFile sample1_merged_variants.snps.tranches \
    
    Post edited by h_asif on
  • h_asifh_asif Posts: 41Member
    edited February 26

    secondstep : Apply the desired level of recalibration to the SNPs in the call set

    java -Xmx3g -jar GenomeAnalysisTK.jar \
     -T ApplyRecalibration \
     -R human_hg19.fa \
     -input sample1_merged_raw_snps.vcf \
     -mode SNP \
     --ts_filter_level 99.0 \
     -recalFile sample1_merged_variants.snps.recal \
     -tranchesFile sample1_merged_variants.snps.tranches \ 
     -o sample1_recalibrated_snps.vcf
    
    Post edited by h_asif on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    This all looks normal, I still don't know why you're getting those surprising results. Can you also tell me what version of GATK you are using, and what kind of dataset you are working on? (e.g. whole genome or exome, number of samples, average coverage depth, etc)

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member

    GATK version GenomeAnalysisTK-2.7-2 and i am working on whole exome data from one sample treated with UV

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    Ah, version 2.7 had some model stability problems so maybe it's not doing the right thing. Please try running this again with version 2.8. Note that on a single exome we can't guarantee that VQSR will work properly. Typically we recommend running on 30 exomes.

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member
    edited February 26

    do i need to use same command line for 2.8 version

    yeah but i read on your forum people using this with one exome and even in a training session they run VQSR for small data here only chr13 http://www.cbs.dtu.dk/courses/27626/phdcourse/cancer_exercise.php

    so do you suggest i should trust output if i get recalibrated results

    second thing imp is i run GATK pipeline without using bedfile for the exomes? what do you suggest at what step shall i run that with bedfile thank you for your co-operation on every step

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    Yes, same command but without numbad; in 2.8 the program will determine that for you.

    i read on your forum people using this with one exome and even in a training session they run VQSR for small data here only chr13

    People do all sorts of things that we don't recommend doing, even though we tell them not to. We are not the NGS police... (too bad!)

    It is possible to run VQSR on single whole genomes with deep coverage, that is true, but for a single exome there is typically not enough data. It is possible to get the program to run successfully, but that does not guarantee the results are reliable. So, if you recalibrate your data successfully, that is fine, but you must be critical about your results. It is a good idea then to also do hard-filtering and compare results e.g. using VariantEval.

    Yes you should definitely use your exome intervals bedfile. You should use it for base recalibration (BQSR) and it is also better to use it for variant calling since it will shorten the runtime. If you used it for calling, you don't need to use it for VQSR; but if you didn't use it for calling then you should use it in both VQSR commands.

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member

    i am sorry i never mean that. hope you will understand thank you for your co-operation always I will try hard-filtering and will compare results and will try to run VQSR with bed file

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    Hey, no problem. Good luck with your work.

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member
    edited February 27

    i need to share a thought regarding VQSR from 2.7 GATK version and GATK 2.8 version i have tried to run this on my exome sample run in one lane and got these plots where ti/tv is in range of 1.5-1.8 (do u think it is fine as it is exome data ) plot attached: onelane_GATK2.7.tranches , onelane_GATK2.8_02.tranches

    then i tried to run VQSR on a merged bam file. we run our same sample in six lanes, i aligned the output in Lifescope. the six output bam files are then merged using picard (MergeSamFiles.jar). on these merged bam file i run GATK pipeline including VQSR and the plot attached showing ti/tv too high plot attached: Merged_GATK2.7.tranches , Merged_GATK2.8.tranches I have written my commandline above except for GATK2.8 i didnt use numBad as suggested i am going to compare results with GATK hardfilter also Thank you so much for your co-operation right from the day when i downloaded GATK tool kit uptill now

    Post edited by h_asif on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    Hi there, if you are working with exome data you must restrict the analysis to the exome capture targets using the -L argument to pass in the target list.

    Geraldine Van der Auwera, PhD

  • h_asifh_asif Posts: 41Member
    edited February 27

    you mean -L and the name of bed file ( as suggested here [-L targets.interval_list] if i use bedfile in BQSR then i will add that with -knownSites nameofbedfile

    Post edited by h_asif on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,286Administrator, GATK Developer admin

    you mean -L and the name of bed file ( as suggested here [-L targets.interval_list]

    Yes that's correct

    if i use bedfile in BQSR then i will add that with -knownSites nameofbedfile

    No that is wrong, just pass it in with -L the same as above.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.