Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Variant Quality Score Recalibration (VQSR)

2»

Comments

  • vsvintivsvinti Member ✭✭

    Hi @Sheila

    I only have ~2k variants that I am pretty confident are real (based on genotyping of same samples on arrays), and a similar number that I am pretty confident are false (from trios). So I would use them alongside the human resources recommended in your best practices.

    At the moment, I am using these variants to obtain custom cut offs for annotations that seem to characterize true variants (ie QD, etc), and use these thresholds to 'hard filter' my calls. So I could:

    • use the true / false sets to add to the VQSR training models or
    • use the vcfs after this stringent hard filtering to train the VQSR model

    I was learning towards the first point, but the link you refer me to seems to indicate that my true / false sets may be too small to make a difference. Regardless, I can try to see what I get with either.

    So back to my question: if I want to use these custom sets in addition to recommended, how would that change my commands below:

        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
        -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    

    Would I just add another line:

    -resource:custom_true,known=false,training=true,truth=true,prior=? custom_true.vcf 
    

    ? And can I use my false set anywhere?

    Thanks
    Vicky

  • SheilaSheila Broad InstituteMember, Broadie admin

    @vsvinti
    Hi Vicky,

    Yes, you can add in the custom resource file like you have above. For the prior, you can try playing around with different values, but because you are very sure about those variants, you should be safe starting with a pretty high value.

    I don't think you can use your negative/false set anywhere in VQSR.

    Good luck! Let us know how it goes.

    -Sheila

  • MagdaMagda Member

    Hi @Geraldine_VdAuwera
    As recommended I started everything from scratch using the genome ref. provided in the bundle (as well as the resource files), and after doing the whole pipeline again, the pattern I have is more or less the same.

    To remind you: I've done Expanded Exome (illumina) (64Mb) in 72 samples. The pipeline I've done is here.

    1. sickle-master/sickle pe -f Samples/${ind4}${adaptors}_L004_R1.fastq.gz -r Samples/${ind4}${adaptors}_L004_R2.fastq.gz -t sanger -o Samples/Sickle_30/${ind4}_filtered_L004_R1.fastq -p Samples/Sickle_30/${ind4}_filtered_L004_R2.fastq -s Samples/Sickle_30/${ind4}_filtered_L004_singles.fastq -q 30 > Samples/Sickle_30/${ind4}_L004_sickle_stats
    2. bwa/bwa mem -M -t 16 -R "@RG\tID:${ind4}.Run2_L4\tLB:Lib2\tPL:illumina\tPU:Run2_L4\tSM:${ind4}" human_g1k_v37_decoy.fasta.gz Samples/2.Sickle_30/${ind4}_filtered_L004_R1.fastq.gz Samples/2.Sickle_30/${ind4}_filtered_L004_R2.fastq.gz > Samples/3.Alignment_bwa_V2/${ind4}_L004.sam
    3. samtools-1.2/samtools view -b -S -o Samples/4.Samtools_bam_V2/${ind3}_L003.bam Samples/3.Alignment_bwa_V2/${ind3}_L003.sam.gz
    4. samtools-1.2/samtools sort Samples/4.Samtools_bam_V2/${ind4}_L004.bam Samples/5.Bam_sorted_V2/${ind4}_L004.sorted
    5. samtools-1.2/samtools index Samples/5.Bam_sorted_V2/${ind4}_L004.sorted.bam
    6. samtools-1.2/samtools merge Samples/6.Bam_Merged_Samples_V2/${ind}_Merged.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L001.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L002.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L003.sorted.bam Samples/5.Bam_sorted_V2/${ind}_L004.sorted.bam
    7. java -Xmx4g -jar picard.jar MarkDuplicates I=../Samples/6.Bam_Merged_Samples_V2/${ind}_Merged.sorted.bam O=../Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam METRICS_FILE=../Samples/7.MarkDup_Picard_V2/${ind}_Mkdup_metrics.txt VALIDATION_STRINGENCY=LENIENT; done;
    8. sed 's/^chr//g' nexterarapidcapture_expandedexome_targetedregions.bed > nexterarapidcapture_expandedexome_targetedregions_nochr.bed
      java -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam --known 1000G_phase1.indels.b37.vcf --known Mills_and_1000G_gold_standard.indels.b37.vcf -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -o Samples/8.GATK_ReaTarInt_V2/${ind}_RealignerTarget.intervals
    9. java -Xmx4g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I Samples/7.MarkDup_Picard_V2/${ind}_Mkdup.sorted.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals Samples/8.GATK_ReaTarInt_V2/${ind}_RealignerTarget.intervals -o Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam

    10.1 java -Xmx6g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37_decoy.fasta -I Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam -knownSites 1000G_phase1.snps.high_confidence.b37.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -o Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recalibrated.table

    10.3 java -Xmx4g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37_decoy.fasta -I Samples/9.GATK_IndelRealigner_V2/${ind}_Mkdup_Realigned.sorted.bam -BQSR Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recalibrated.table -o Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recal.bam

    11.1 java -Xmx6g -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I Samples/10.GATK_BaseRecalibrator_V2/${ind}_Recal.bam -o Samples/11.GATK_HaploCaller_V2/${ind}_raw_g.vcf --dbsnp dbsnp_138.b37.vcf --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -L nexterarapidcapture_expandedexome_targetedregions_nochr.bed -ip 100 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000

    11.2 java -Xmx6g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ../../human_g1k_v37_decoy.fasta -V H01_raw.g.vcf -V H08_raw.g.vcf -V H10_raw.g.vcf -V H29_raw.g.vcf -V H91_raw.g.vcf -V N424_raw.g.vcf -V N872_raw.g.vcf ...... (in total 72 inds) -D ../../dbsnp_138.b37.vcf -o Output_Joint_D.vcf

    12.1.1 java -Xmx6g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T VariantRecalibrator -R ../../human_g1k_v37_decoy.fasta -input ../11.GATK_HaploCaller_V2/Output_Joint_D.vcf -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -nt 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../../hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 ../../1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../../1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../../dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 -rscriptFile recalibrate_SNP_plots.R

    12.1.2 java -Xmx3g -jar ../../GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T ApplyRecalibration -R ../../human_g1k_v37_decoy.fasta -input ../11.GATK_HaploCaller_V2/Output_Joint_D.vcf -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf --ts_filter_level 99.5 -mode SNP

    And I still have the same type of plot, that is, higher TiTv fro the highest tranches, which should be the other way round. What I don't understand is why it says that the TiTv at 99.5 (the tranche I finally applied) is 1.584, and when I calculate it with bcftools I get Ts/tv = 2.2 for SNPs?

    In any case, I'm starting doing some filtering of the data, that is, filtering for a minimal VQSLOD score, sites with a max missing data, etc....

    We have also applied the same pipeline (with the newest GATK version) for another Expanded Exome dataset of another population, and we also have the same pattern in the tranches plot.

    Well, I don't know what can produce this. In any case, I'll just combine the VQSR + filtering until I get a High quality SNP dataset.

    Thank you.

    Magda

    Issue · Github
    by Sheila

    Issue Number
    1711
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Magda,

    The TiTv reported in the BQSR graph is the novel TiTv, ie the TiTv of SNPs in your callset that have not been previously reported (= not seen in the resource files you're using). If we looked at TiTv across everything, the number would be heavily biased by all the known SNPs that are probably real. So we look at only the novel ones to evaluate whether their TiTv number looks good overall.

    As we've discussed elsewhere, our expectations for TiTv are based on what parts of the genome are being examined. Since we haven't looked at the Expanded Exome specific regions ourselves, we have no basis for predicting what their value should be -- but to be frank I would be surprised if it was that different from the rest of the exome.

    I would recommend you evaluate your callset with other methods to determine whether you have a problem with excessive false positive rate, or something else. See the documentation on hard filters for some tips and tricks relating to looking at the distribution of annotation values.

  • bassubassu UAEMember

    Hi,

    I have 10+ RNASeq and Whole genome data (human) with me. For RNAseq I followed the GATK best practice method and for WGS I followed the usual bowtie2 approach. In both the analysis I used BQSR and the vcf used was “All_20160527.vcf” from ncbi. I had done filtering on the VCF (RNAseq) as below

    gatk -Xmx30048M -T HaplotypeCaller -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -I recal.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o bqsr.vcf

    gatk -Xmx30048M -T VariantFiltration -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -V bqsr.vcf -window 35 -cluster 3 -filterName FS -filter "FS'>'30.0" -filterName QD -filter "QD'<'2.0" -o filtered_bqsr.vcf

    My questions are as follows
    1) Does it make sense to do VQSR on the RNA Seq data set? If so is there any filters I need to lookinto
    2) Do I need to do VQSR on the merged VCF (I have one with bcftools) or individually
    3) In the “-resource” should I need to use the same “All_20160527.vcf” too or use the hapmap data or both
    4) Some of the samples here are related (siblings,cousin) so does I need to do any extra filtration ?
    5) Does Question 4 has any effect with "-an InbreedingCoeff" ? As I might use this data of Association mapping(plink)

  • mscjuliamscjulia United StatesMember

    Hello,

    I have run VQSR on a WGS study with 35 samples. Now I need to compare among some particular samples out of this 35 pool. For example, I would like to compare sample1 to sample2. I understand that I can use SelectVariants to easily extract these 2 samples to a different vcf, but my question is what should I do to the VQSR Tranches? The "PASS" is for all of the 35 samples, correct?

    If so, is there other filters I should use for such comparison? Would DP combined with GQ be a good choice? Thanks a lot

  • mscjuliamscjulia United StatesMember

    Just to add to the above, I worry that when I compare sample1 and sample 2, if I only look at "PASS" snps, I will miss some snps GT that are in good quality for sample1 and sample2, but not sample9, and thus marked as "VQSRTranchesSNP99.9to100".

  • SheilaSheila Broad InstituteMember, Broadie admin

    @bassu
    Hi,

    I think this thread and this thread will help.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie admin

    @mscjulia
    Hi,

    VQSR takes into account the site-level annotations which are calculated from all samples. It is unlikely that a site will fail because one sample has "bad" annotation values.

    I am not sure what your end goal is, but you can do a comparison of PASS sites and all sites in SelectVariants. SelectVariants by default keeps all sites, but if you would like to compare only PASS sites, you can use --excludeFiltered.

    -Sheila

  • mscjuliamscjulia United StatesMember

    @Sheila

    Thank you for your reply. You mentioned that if one sample has bad annotation, all the others are good, the site won't fall; but how about one sample has good annotation, all the others are bad?

    The "PASS" is for each site, but not for each sample at the site - is this understanding correct please? (In another word, if I want to look at one sample for the site, the "PASS", which represents 35 samples at this site, would not mean much.)

    Thank you so much!

  • SteveLSteveL BarcelonaMember ✭✭

    Dear @mscjulia ,

    If I understand VQSR correctly, what it does is tell you how good the variant calls in your dataset are at a particular POSITION, but not specific calls within particular samples at that position..

    It does this by:

    1) comparing positions in your dataset to "known" polymorphic positions - for all these known positions, it looks at the characteristics of the specific calls in your dataset for these positions. which should be your "good" positions.

    2) comparing the calls in the rest of your positions to these "good" positions

    3) if the call characteristics in 2 and 1 look more or less the same, the position will likely be given a "PASS"

    4) as the call characteristics differ more and more between 1 and 2 above, then those in 2 will be put into different "tranches", representing reduced confidence that we can make reliable calls at that position.

    Thus the "PASS" is position information - it means that generally that position is reliably callable in your dataset - this does not mean that every call in that position will be correct - this is why you have call specific DP and GQ etc - if one sample was particularly badly covered relative to the others in a particular position, then the positions is still likely to be a PASS, but the specific all, say a heterozygote with an AD of 1,4, should be taken with a big pinch of salt.. Likewise, in a position that is not "PASS", this does not mean every call will be unreliable, though sometimes it might - e.g. ALL your samples are heterozygote at a particular position - this likely indicates an underlying mappability issue.

    In your example of one good, and all the others bad, then it is likely that the position will not be a PASS. As ever here there is a trade off between sensitivity and specificity and there is no perfect answer.

    In your example of a PASS site with 35 samples, this indicates that most of the calls are likely to be reliable, but again, if one sample only has a DP of 5 and all the others had a DP ~30, then I would wonder what was going on there.

    Hope this helps a little.

  • mscjuliamscjulia United StatesMember

    @SteveL Thank you so much for the detailed explanation. It makes perfect sense - thank you!

  • zhaoqiang90zhaoqiang90 Member

    Hi,@Geraldine_VdAuwera
    If I use GATK VQSR to get a good result, is that mean that I must use Broad Bundle's data in order to train my variants. I have 12 exome samples, and I wonder that should I use hard-filtering to get the final result or to conduct VQSR to my vcf file(which as your recommendation I should add another 18 exomes samples to get a reliable result). Which is better as you suggestion?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @zhaoqiang90,

    You can use any known sites resource that contains confident sites of variation for your organism. According to Article#1259, for exome capture data, you will want a minimum of 30 samples. The document also suggests two alternative approaches for those with fewer samples: (i) add more samples or (ii) perform VQSR with --maxGaussians 4.

  • zhaoqiang90zhaoqiang90 Member
    edited July 2017

    Hi, @shlee
    If I pad my orginal 12 exome data set with 1000genomes up to 30 samples, should the paded exome data must match up my samples at pathology or disease?I have 12 scurvy patients exome data, but I added another 18 normal exome data. Will the irrelevant data set bring some error( for instance false positive rate) to my original data variants? If I want know some specific snps or indels among the small set of my data set, should I conduct joint genotyping with these 12 sample together or seperate?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2017

    Hi @zhaoqiang90,

    It's news to me there is a genetic component to scurvy.

    For the purposes of VQSR, it is okay that the additional exome data not match the disease of your cohort. However, there are other considerations in padding and Geraldine discusses them in part, e.g. in these two threads:

    Also, if it were me, I'd try multiple different approaches to filtering to see what gave the best results--hard-filtering, VQSR with just my 18 samples and VQSR with padding. Do you have orthogonal validation/truth data for your samples, e.g. SNP array data? Comparing against these could help your analysis.

    Relatedly, we have a hands-on tutorial that we give at workshops that outlines some considerations in hard-filtering and shows two approaches to assessing filtering results against a truthset. It is the Variant Filtering Tutorial and you can find a link to the worksheet here. I developed and wrote the bulk of the content of this version of the tutorial, so let me know if anything is unclear.

  • robertbrobertb torontoMember

    Hi,

    I ran a VQSR, Genotype Refinement, variant filtration, etc., work flow for my wgs data but found that I had too many false positives. My VQSR command and variant filtration commands are below. For those sites that actually passed filtration I also only took genotypes with GQ > 35 and DP > 10. But that still was not enough. I was wondering if I had done something wrong in the VQSR step (below). I made sure to run VQSR on all the chromosomes simultaneously. I think the problem is still at the site level in terms of QUAL and QD but I didn't think that I needed to hard filter on those if I was using VQSR. I mean if VQSR says there's a variant there then that should be good enough right? So maybe I should have made the VQSR options more stringent? I'm not sure. If someone could recommend some good filtering options to compliment VQSR I'd appreciate it. Or if there's some problems with my VQSR comman I'd like to hear about it. Thanks. _Robert

    VQSR

    /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ucsc.hg19.fasta \
    -input GATKVQSR.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /hapmap_3.3.hg19.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
    -an DP \
    -an QD \
    -an MQ \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile GATK_VQSR_recalibrate_SNP.recal \
    -tranchesFile GATK_VQSR_recalibrate_SNP.tranches \
    -rscriptFile GATK_VQSR_recalibrate_SNP_plots.R &&
    \

    Apply the desired level of recalibration to the SNPs in the call set

    /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R ucsc.hg19.fasta \
    -input GATK_VQSR.vcf \
    -mode SNP \
    --ts_filter_level 99.5 \
    -recalFile GATK_VQSR_recalibrate_SNP.recal \
    -tranchesFile GATK_VQSR_recalibrate_SNP.tranches \
    -o GATK_VQSR_recalibrated_snps_raw_indels.vcf &&
    \

    Build the Indel recalibration model

    /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ucsc.hg19.fasta \
    -input GATK_VQSR_recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
    -an QD \
    -an DP \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile GATK_VQSR_recalibrate_INDEL.recal \
    -tranchesFile GATK_VQSR_recalibrate_INDEL.tranches \
    -rscriptFile GATK_VQSR_recalibrate_INDEL_plots.R &&
    \

    Apply the desired level of recalibration to the Indels in the call set

    /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R ucsc.hg19.fasta \
    -input GATK_VQSR_recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile GATK_VQSR_recalibrate_INDEL.recal \
    -tranchesFile GATK_VQSR_recalibrate_INDEL.tranches \
    -o GATK__VQSR_recal_var_test.vcf; fi

    FILTER GENOTYPES

    !/bin/bash

    /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx8g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R ucsc.hg19.fasta \
    -V GATK_recal_variants_geno.vcf \
    -G_filter "GQ < 30.0" \
    -G_filterName lowGQ \
    -G_filter "DP < 10" \
    -G_filterName lowDP \
    -o GATK_ALL_FILT.vcf \

  • SheilaSheila Broad InstituteMember, Broadie admin

    @robertb
    Hi,

    How do you know there are a lot of false positives left after VQSR?

    You should not have to do any extra filtering after VQSR. You can try setting a more stringent filter level if you want to remove more potential false positives. Have you tried different filter levels? Do you see less false positives with a more stringent filter level?

    -Sheila

  • KatieKatie United StatesMember ✭✭

    Hi - Thank you for a fantastic resource! Earlier in the thread, it's mentioned that the resource SNP sets are only used for SNP locations along the genome and annotations are not used. If that's the case, then would it be possible to use a BED file or list of SNP positions as an input? I'm working with a non-model organism and am trying to develop an approach to use VQSR based on known SNP differences between 2 published genomes.

    Thanks for your help!
    All the best,

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Katie
    Hi,

    I don't think you can do that. The tool requires a VCF or BCF as input for resource files. However, if you do some googling, you can find some ways to convert a bed file to VCF which may work :smile:

    -Sheila

  • KatieKatie United StatesMember ✭✭

    Got it - thanks for the help! It looks like there is a bedr R package that could help.

  • KatieKatie United StatesMember ✭✭

    Hi,
    I'm running VQSR and get about 10% of SNP calls with a Nan VQSLOD score. This makes it difficult to use the score as a threshold for retaining or discarding variants. Do all of the variant annotations have to have numeric scores for VQSLOD to be calculated (i.e. some of the MQRankSum and ReadPosRankSum scores are Nan)? If so, how do you recommend dealing with Nan values in some of the annotations of interest? Thank you!

    My call is below:
    gatk --java-options '-Xmx10g' VariantRecalibrator \
    -R ${REF_DIR}${ref} \
    --variant ${VARS_DIR}${vcf} \
    -resource truePos,known=false,training=true,truth=true,prior=15.0:/ifs/labs/andrews/walter/varcal/results/sims/old/nucmer.vcf \
    -an DP \
    -an QD \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an FS \
    -an SOR \
    -an MQ \
    -mode SNP \
    -tranche 100.0 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
    --output ${vcf%*}".recal" \
    --tranches-file ${vcf%
    }".tranches" \
    --rscript-file ${vcf%_
    }".plots.R" \
    --max-gaussians 4 \
    --mq-cap-for-logit-jitter-transform $mqcap

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Getting nans for VQSLOD is not expected. You should have a proper numeric values for all sites that have been put through recalibration. Did you run on both SNPs and indels in series?

  • KatieKatie United StatesMember ✭✭

    Okay great, I didn't think so. I am only running for SNPs as I have already used gatk SelectVariants to select for only SNPs.

    I have SNPeff annotations on my VCF and wondering if that poses a problem? Or if the problem is due to a small SNP set (~1000 SNPs along a 4.4 Mb genome).

    I am trying to develop a pipeline to use VQSR for non-model organisms, so for my true positive VCF, I am using a VCF file of true positive variants developed when comparing 2 ref. genomes (i.e. it doesn't include annotations, only SNP location information). I validated the true positives VCF with gatk ValidateVariants.

    Any advice would be great, thank you!

  • KatieKatie United StatesMember ✭✭
    edited January 2018

    I am attaching my vcf file (bwa_samtools_all.vcf.gz) as well as my truth sites file (nucmer.vcf.gz) in case they will be helpful for diagnosing why VQSR is calculating NaNs for the VQSLOD score. Thank you so much for your help on this!

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Katie
    Hi,

    I think the issue may be related to a small dataset. We usually recommend running VQSR on at least 30 exomes or 1 whole genome. You may consider hard filtering. We have some helpful documents in the tutorials and methods and algorithms sections.

    Keep in mind we have a new tool to replace VQSR coming out soon which may not limit your work so much :smile:

    -Sheila

  • KatieKatie United StatesMember ✭✭

    @Sheila,
    Thank you for letting me know. Looking forward to the new tool. I experimented a bit and can make this work using the max-gaussians argument. All the best,

  • KatieKatie United StatesMember ✭✭

    Sorry - I take that back, the problem was not to do with too many gaussian mixture models. Instead I'd been using a mq-cap-for-logit-jitter-transform which somehow was introducing issues.
    Just wanted to update in case anyone else is using this for non-model organisms. Cheers!

  • timhtimh brissy, AUMember

    Hi. Is there a way of telling how many Gaussians have been identified by VQSR?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited March 2018

    @timh In recent versions there is an option to output the details of the model that was produced (--output-model), which I believe should include that information.

  • timhtimh brissy, AUMember
  • ArynioArynio Member

    Hi,

    Dear GATK team. I'm just learning about VQSR and its advantages over hard filters. I'm currently working with ~8x whole-genome data from a non-model species for which we have a reference genome, and I was wondering how to perform VQSR. Would it make sense to use called SNPs from a reduced dataset at ~30x as the highly validated "true" input? Or would that carry way too much noise?

    Thanks in advance for your advice, and also for all your great documentation that has already helped me many times before.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Arynio
    Hi,

    I think this thread will help.

    -Sheila

  • ArynioArynio Member

    @Sheila

    Thank you very much!

  • Hi,

    I am wondering what are commands in GATK4 to apply recalibration results to sequencing data? The line "-BQSR" in GATK3 was not recognized in GATK4.

    Also, I am having a really hard time trying to find a webpage with step-by-step command lines for GATK4 (germline SNPs / Indels with exome-seq data). Any recommendations will be much appreciated.

    Thanks!

    Best,
    Siyu

  • SheilaSheila Broad InstituteMember, Broadie admin

    @liusiyu93
    Hi Siyu,

    Have a look at this thread and the Best Practices.

    You may also find the WDLs provided here helpful.

    -Sheila

  • liuqiliuqi 104 Building, 1# Courtyard, Beichen West Road, Chaoyang District, Beijing, ChinaMember

    Hello,

    Dear GATK team, I use VQSR to filter variants, and when I see the recalibrated_snps.tranches.pdf, it is so strange. see the attachment.

    could you tell me why this occur?

    Thank you very much!

    Best,
    Qi Liu

  • SheilaSheila Broad InstituteMember, Broadie admin

    @liuqi
    Hi Qi,

    Can you tell us which version of GATK you used and what kind of data you are working with?

    Thanks,
    Sheila

  • liuqiliuqi 104 Building, 1# Courtyard, Beichen West Road, Chaoyang District, Beijing, ChinaMember

    @Sheila
    Hello, Sheila,

    Thank you for your reply. Sorry for the late reply.

    I use GATK 3.5, my data is a human population data which is a >30X NGS data .

    because I use GATK 3.5 to call SNP in my first data, so I use the same version to call SNP in my second data.

    my first data's picture look like normally, but the second is so strange.

    Do I need use new version to do VQSR or calling?

    Thank you very much!

    Best,
    Qi Liu

  • Hi,
    I am doing VQSR on a non-model organism (SNPs called by GATK and verified by samtools mpileup or freebayes). I have 70+ samples sequenced to 5X-10X (WGS). I created a training set by using concordant sites between mpileup, freebayes and haplotypecaller with stringent filtering parameters. For VQSR, I used different max-gaussian settings (4,6 and the default 8). The plots generated look very different from each other and I was wondering if there is any resource that would help guide to choose this setting from the plots. All the plots look very different from each other. My G8 plots show 2 clusters, with high lod scores, mostly separated out by depth. Tranches plot also looks a bit odd. G4 and G6 give a single cluster. I am aiming for high specificity and towards using G4 (just because this was suggested here) - tranche 90 for further analyses. Do you think it would still be better to use G6 or G8? Thanks a lot for your insights!
    Best,
    Melis

    Issue · Github
    by Sheila

    Issue Number
    3143
    State
    open
    Last Updated
  • SheilaSheila Broad InstituteMember, Broadie admin

    @liuqi
    Hi Qi,

    Yes, it may be best to try using GATK3 latest version. You can just test the VQSR step with it.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie admin

    @melisakman
    Hi Melis,

    I will need to take some time and consult my team before getting back to you. If I don't respond by the end of the week, please post again.

    -Sheila

  • liuqiliuqi 104 Building, 1# Courtyard, Beichen West Road, Chaoyang District, Beijing, ChinaMember

    @Sheila
    Hello Sheila,

    I have tryed to use GATK-3.8.1 to test VQSR step, but get the same result. It is also so strange.

    how can I resolve this problem?
    I use the same argument as my first data.

    Best,
    Qi Liu

    @Sheila said:
    @liuqi
    Hi Qi,

    Yes, it may be best to try using GATK3 latest version. You can just test the VQSR step with it.

    -Sheila

  • MehulSMehulS Member

    Hi, I plan to generate subsetted VCFs containing variants only on about 90 genes of my interest since I need to reduce time so I'm using Haplotype Caller on an interval using a .list file.

    What parameters are recommended for using VQSR on interval-delimited data ? I can include more genes and expand the intervals but want to avoid calling on the whole genome because HC takes 6 hours.

    I have ~100 low coverage Illumina paired-end WGS samples (5-10X), GATK 4.0.12

Sign In or Register to comment.