The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

# VQSR plots

Posts: 70Member
edited February 2014

hi
i run VQSR on the vcf file generated by unified genotyper and filtered PASS 63412 out of 86840 (files with snps and indels). as i run unified genotyper with -glm BOTH command. i have two questions

1) the number of pass snps are different when i counted them in two ways(first with original output of UG and other by separating snps and indel into two separate files using awk script

grep -v "#" sample1_recalibrated_snps_PASS.vcf | grep -c "PASS"
63412
grep -v "#" sample1_merged_recalibrated_snps_raw_indels.vcf| grep -c "LowQual“
18725


Statistics for separate snp file. here i use awk script to separate snps and indels (using awk script)

Rest is fine only problem is that pass snps no differ think why

grep -v  "^#" sample1_snp.vcf| grep -c "PASS
63402
grep -v  "^#" sample1_snp.vcf| grep -c "LowQual“
18725


2) i run VQSR on snps generated by unified genotyper i need to ask query about VQSR tranche plot for Snps.
in my case tranche is not showing any false positive call see plot attached what do i interpret that there is no FP which seems surprising

when i tried to run VQSR on INDELS (in the same file) it doesnt work as i had 884 indels which i read from VQSR documentation and questions asked by ppl is small.

Post edited by Geraldine_VdAuwera on
Tagged:

Hi there,

1) Maybe there is something wrong with your awk script. An easy way to separate SNPs and Indels is to use GATK's SelectVariants tool using the -selectType argument. Have a look at the documentation here:

2) Can you please post your VQSR command lines for filtering SNPs? Your Ti/Tv ratio seems very high too. Also, you are correct that VQSR won't work on only 800 indels.

Geraldine Van der Auwera, PhD

• Posts: 70Member
edited February 2014

thank you so much. i will try select variants

my command line was

java -Xmx3g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \
-R h_hg19.fa \ -input sample1_merged_raw_snps.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137_hg19.vcf \
-an QD \
-an HaplotypeScore \
-an MQRankSum \
-an FS \
--maxGaussians 4 \
-mode SNP \
-recalFile sample1_merged_variants.snps.recal \
-tranchesFile sample1_merged_variants.snps.tranches \

Post edited by h_asif on
• Posts: 70Member
edited February 2014

secondstep : Apply the desired level of recalibration to the SNPs in the call set

java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R human_hg19.fa \
-input sample1_merged_raw_snps.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile sample1_merged_variants.snps.recal \
-tranchesFile sample1_merged_variants.snps.tranches \
-o sample1_recalibrated_snps.vcf

Post edited by h_asif on

This all looks normal, I still don't know why you're getting those surprising results. Can you also tell me what version of GATK you are using, and what kind of dataset you are working on? (e.g. whole genome or exome, number of samples, average coverage depth, etc)

Geraldine Van der Auwera, PhD

• Posts: 70Member

GATK version GenomeAnalysisTK-2.7-2 and i am working on whole exome data from one sample treated with UV

Ah, version 2.7 had some model stability problems so maybe it's not doing the right thing. Please try running this again with version 2.8. Note that on a single exome we can't guarantee that VQSR will work properly. Typically we recommend running on 30 exomes.

Geraldine Van der Auwera, PhD

• Posts: 70Member
edited February 2014

do i need to use same command line for 2.8 version

yeah but i read on your forum people using this with one exome and even in a training session they run VQSR for small data here only chr13
http://www.cbs.dtu.dk/courses/27626/phdcourse/cancer_exercise.php

so do you suggest i should trust output if i get recalibrated results

second thing imp is i run GATK pipeline without using bedfile for the exomes? what do you suggest at what step shall i run that with bedfile
thank you for your co-operation on every step

Post edited by Geraldine_VdAuwera on

Yes, same command but without numbad; in 2.8 the program will determine that for you.

i read on your forum people using this with one exome and even in a training session they run VQSR for small data here only chr13

People do all sorts of things that we don't recommend doing, even though we tell them not to. We are not the NGS police... (too bad!)

It is possible to run VQSR on single whole genomes with deep coverage, that is true, but for a single exome there is typically not enough data. It is possible to get the program to run successfully, but that does not guarantee the results are reliable. So, if you recalibrate your data successfully, that is fine, but you must be critical about your results. It is a good idea then to also do hard-filtering and compare results e.g. using VariantEval.

Yes you should definitely use your exome intervals bedfile. You should use it for base recalibration (BQSR) and it is also better to use it for variant calling since it will shorten the runtime. If you used it for calling, you don't need to use it for VQSR; but if you didn't use it for calling then you should use it in both VQSR commands.

Geraldine Van der Auwera, PhD

• Posts: 70Member

i am sorry i never mean that. hope you will understand
thank you for your co-operation always
I will try hard-filtering and will compare results and will try to run VQSR with bed file

Hey, no problem. Good luck with your work.

Geraldine Van der Auwera, PhD

• Posts: 70Member
edited February 2014

i need to share a thought regarding VQSR from 2.7 GATK version and GATK 2.8 version
i have tried to run this on my exome sample run in one lane and got these plots where ti/tv is in range of 1.5-1.8 (do u think it is fine as it is exome data ) plot attached: onelane_GATK2.7.tranches , onelane_GATK2.8_02.tranches

then i tried to run VQSR on a merged bam file. we run our same sample in six lanes, i aligned the output in Lifescope. the six output bam files are then merged using picard (MergeSamFiles.jar). on these merged bam file i run GATK pipeline including VQSR and the plot attached showing ti/tv too high plot attached: Merged_GATK2.7.tranches , Merged_GATK2.8.tranches
I have written my commandline above except for GATK2.8 i didnt use numBad as suggested
i am going to compare results with GATK hardfilter also
Thank you so much for your co-operation right from the day when i downloaded GATK tool kit uptill now

Hi there, if you are working with exome data you must restrict the analysis to the exome capture targets using the -L argument to pass in the target list.

Geraldine Van der Auwera, PhD

• Posts: 70Member
edited February 2014

you mean -L and the name of bed file ( as suggested here [-L targets.interval_list]
if i use bedfile in BQSR then i will add that with
-knownSites nameofbedfile

Post edited by h_asif on

you mean -L and the name of bed file ( as suggested here [-L targets.interval_list]

Yes that's correct

if i use bedfile in BQSR then i will add that with -knownSites nameofbedfile

No that is wrong, just pass it in with -L the same as above.

Geraldine Van der Auwera, PhD