Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

VariantRecalibratior questions; no plot shown in .R.PDF

1) When I used UnifiedGenotyper and VariantRecalibrator to analyze“NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam” downloaded from bundle, I can get right format VCF (98447 SNP) and “.R.PDF” that appeared to show right plots.

2) However, using the same approach to analyze my own 8 recalibrated bam files (target re-sequencing project, I can only obtain right format VCF (2654 SNPs), but with the error message followed by VariantRecalibrator:

ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)

3) In order to test whether I can use recalibration step, first I run UnifiedGenotyper by combining all my recalibrated Bam files and NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam file together and got right format VCF (101540 SNP) ). Then, I run VariantRecalibrator: this time, there is no error message, but R.PDF file didn't show plots.

Could you explain what might happen?

Thanks.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    The error you got using your own bam files means that your dataset is too small to use the VariantRecalibrator. You will need to use hard filtering. We give some recommendations about what values to use in the Best Practices document.

    For your other problem (no plot when you run VariantRecalibrator on the combined datasets) I cannot say for sure, but maybe you made a mistake in your command line. Are the other expected files generated normally?

  • xiaojunzhaoxiaojunzhao Member

    Yes, all other files seem right. I even can use “ApplyRecalibration” with newly generated .recal and .tranche files to generate new VCF, which still shows the same number of SNP (101540 SNP). Is there other way to determine whether VariantRecalibrator was run correctly? Please see attachment of .R.PDF (one shows right plots while another shows no plot). Thanks.

  • xiaojunzhaoxiaojunzhao Member
    edited December 2012

    This is part of recal file, which didn't show plots and had wrong value of VQSLOD. Why this happen when I combine my own bam files and NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO
    1   9097241 .   N   <VQSR>  .   .   END=9097241;VQSLOD=-32419729168085060000.0000;culprit=HaplotypeScore
    1   9097251 .   N   <VQSR>  .   .   END=9097251;VQSLOD=-31284081795478163000.0000;culprit=HaplotypeScore
    20  61098   .   N   <VQSR>  .   .   END=61098;VQSLOD=-39972755389554426000.0000;culprit=FS
    20  61795   .   N   <VQSR>  .   .   END=61795;VQSLOD=-38562259676594690000.0000;culprit=QD
    20  63244   .   N   <VQSR>  .   .   END=63244;VQSLOD=-40207657655898956000.0000;culprit=MQ
    20  63799   .   N   <VQSR>  .   .   END=63799;VQSLOD=-41081738415351390000.0000;culprit=QD
    

    Here is part of recal file when I only used "NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam"

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO
    20  61098   .   N   <VQSR>  .   .   END=61098;VQSLOD=10.2111;culprit=FS
    20  61795   .   N   <VQSR>  .   .   END=61795;VQSLOD=11.7290;culprit=QD
    20  63244   .   N   <VQSR>  .   .   END=63244;VQSLOD=11.6669;culprit=FS
    20  63799   .   N   <VQSR>  .   .   END=63799;VQSLOD=12.1791;culprit=QD
    20  65900   .   N   <VQSR>  .   .   END=65900;VQSLOD=9.7638;culprit=QD
    
    Post edited by Geraldine_VdAuwera on
  • rpoplinrpoplin ✭✭✭ Member ✭✭✭

    Hi there,

    Can you please post the full log output from the VariantRecalibrator command to help us understand what is going on?

    Thanks!

  • xiaojunzhaoxiaojunzhao Member

    Hi,

    Please see 3 attached log files:

    1) MyBamOnly.txt --- input VCF file generated from 8 my own bam files (recalibration didn't work and generated error message)
    2) NA12878BamFromBundle.txt --- input VCF file generated from "NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam" (recalibration works)
    3) MyBamandNA12878BamFromBundle.txt --- input VCF file generated from 8 bam and NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam files (recalibration run, but generated wrong recal file and no plot in R.PDF)

    Thanks.

  • rpoplinrpoplin ✭✭✭ Member ✭✭✭

    Hi there,

    The problem is just that you have too few variants to do reasonable error modeling. Our best practices recommendation is to combine your samples with 30 EXOME samples from the 1000 Genomes Project (for example). Combing your targeted sequencing data with a single whole genome variant callset is definitely not recommended by us. The data type and distribution of annotation statistics is too divergent.

    Have a look at the relevant sections of our best practice recommendations here:

    1. Notes about small whole exome projects or small target experiments

    In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):

    Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline).

    Use the VQSR with the smaller SNP callset but experiment with the argument settings. For example, try adding --maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters.

    Use hard filters (detailed below).

    Best of luck,

Sign In or Register to comment.