
Is it mandatory to have the 'QualByDepth' annotation in the annotated raw indel file for VariantAnnotator?

bishwo Posts: 16 Member

I annotated the raw indel file (produced by UnifiedGenotyper), 1000G_omni2.5.b37.sites.vcf, and hapmap_3.3.b37.sites.vcf with all possible annotations, including QD (QualByDepth), using VariantAnnotator. However, I got an error when I tried to run VariantRecalibrator. It complained that QD was not found in any training variant. Is QD an important annotation for indel filtering? Can it be ignored?

P.S. I did not use the sample BAM file while annotating the training data sets.

.
.
.
INFO  15:11:55,999 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK.vcf
INFO  15:12:21,650 TraversalEngine -  chr1:128346793        1.98e+07   30.0 s        1.5 s      4.1%        12.1 m    11.6 m
INFO  15:12:51,650 TraversalEngine -  chr9:130658800        5.26e+07   60.0 s        1.1 s     53.9%       111.2 s    51.2 s
INFO  15:13:13,618 VariantDataManager - QD:      mean = NaN      standard deviation = NaN
INFO  15:13:16,417 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
##### ERROR ------------------------------------------------------------------------------------------

Answers

  • Geraldine_VdAuwera Posts: 6,176 Administrator, GATK Developer admin

    A good rule of thumb is that if the program refuses to run without a certain input, then yes, that input is important... ;)

    According to the error message, the file that does not have the annotations is this one: NCBI_dbsnp_for_GATK.vcf. You don't mention it in the files you annotated. Have you tried annotating it?

    Geraldine Van der Auwera, PhD

  • bishwo Posts: 16 Member

    NCBI_dbsnp_for_GATK.vcf is the dbSNP file, which is not used as a training set (training=false). Is it necessary to annotate non-training data sets as well?

  • Geraldine_VdAuwera Posts: 6,176 Administrator, GATK Developer admin

    All the variants that are going to be used in the model must have the necessary annotations. Please see the documentation for full details on how this tool works and what inputs must be given.

    Geraldine Van der Auwera, PhD
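
A quick way to confirm whether a VCF actually carries the annotation before handing it to VariantRecalibrator is to count the records whose INFO column contains a QD key. The following is a minimal sketch and not part of the GATK: the function name `count_qd` is hypothetical, and it assumes an uncompressed, tab-delimited VCF.

```shell
#!/bin/sh
# count_qd (hypothetical helper): print "<total records> <records with QD>"
# for an uncompressed, tab-delimited VCF given as the first argument.
count_qd() {
    awk -F'\t' '
        /^#/ { next }                    # skip meta-information and header lines
        { total++ }                      # every remaining line is a variant record
        $8 ~ /(^|;)QD=/ { with_qd++ }    # INFO is column 8; look for a QD key
        END { printf "%d %d\n", total, with_qd }
    ' "$1"
}
```

For example, `count_qd Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf` prints the total number of records followed by the number that have a QD value. If the second number is 0, the annotation was not written to that file.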

  • bishwo Posts: 16 Member
    edited November 2012

    I annotated NCBI_dbsnp_for_GATK.vcf, but I still got an error.

    .
    .
    .
    INFO 13:15:42,304 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf
    INFO 13:15:42,599 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
    INFO 13:15:42,600 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 13:16:07,855 TraversalEngine - chr1:85835656 1.50e+07 30.0 s 2.0 s 2.8% 18.0 m 17.5 m
    INFO 13:16:37,855 TraversalEngine - chr3:128202943 3.21e+07 60.0 s 1.9 s 20.0% 5.0 m 4.0 m
    INFO 13:17:07,856 TraversalEngine - chr8:12438529 4.77e+07 90.0 s 1.9 s 45.4% 3.3 m 108.3 s
    INFO 13:17:37,857 TraversalEngine - chr14:37574960 6.33e+07 2.0 m 1.9 s 72.3% 2.8 m 46.0 s
    INFO 13:18:05,298 VariantDataManager - QD: mean = NaN standard deviation = NaN
    INFO 13:18:08,441 GATKRunReport - Uploaded run statistics report to AWS S3
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
  • Geraldine_VdAuwera Posts: 6,176 Administrator, GATK Developer admin

    Can you post your command line?

    Geraldine Van der Auwera, PhD

  • bishwo Posts: 16 Member

    Command Line

    $ OUTPUT="Laine_pool2_9_AGTTCC_L005"
    $ WHOLEGENOME="/group/htb/projects/ReferenceData/illumina/hg19/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
    $ DBSNP="NCBI_dbsnp_for_GATK-annotated.vcf"
    $ 
    $ TRAININGSET1="hapmap_3.3.b37.sites.sorted.annotated.vcf"
    $ TRAININGSET2="1000G_omni2.5.b37.sites.sorted.annotated.vcf"
    $ THREADS=16
    $ java -Xmx4g -jar GenomeAnalysisTK.jar \
    >    -T VariantRecalibrator \
    >    -nt $THREADS \
    >    -R $WHOLEGENOME \
    >    -input $OUTPUT-raw-indel-annotated.vcf \
    >    -recalFile $OUTPUT-indel.recal \
    >    -tranchesFile $OUTPUT-indel.tranches \
    >    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $TRAININGSET1 \
    >    -resource:omni,known=false,training=true,truth=false,prior=12.0 $TRAININGSET2 \
    >    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $DBSNP \
    >    --maxGaussians 4 \
    >    -an QD \
    >    -an HaplotypeScore \
    >    -an MQRankSum \
    >    -an ReadPosRankSum \
    >    -an FS \
    >    -an MQ \
    >    -mode INDEL \
    >    -rscriptFile  VariantRecalibrator-indel.r
    

    Output

    INFO  10:03:25,978 HelpFormatter - --------------------------------------------------------------------------------- 
    INFO  10:03:25,981 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-13-g1706365, Compiled 2012/10/12 19:21:06 
    INFO  10:03:25,981 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  10:03:25,982 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  10:03:25,982 HelpFormatter - Program Args: -T VariantRecalibrator -nt 16 -R /group/htb/projects/ReferenceData/illumina/hg19/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -input Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf -recalFile Laine_pool2_9_AGTTCC_L005-indel.recal -tranchesFile Laine_pool2_9_AGTTCC_L005-indel.tranches -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.sorted.annotated.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.sorted.annotated.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 NCBI_dbsnp_for_GATK-annotated.vcf --maxGaussians 4 -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode INDEL -rscriptFile VariantRecalibrator-indel.r 
    INFO  10:03:25,983 HelpFormatter - Date/Time: 2012/11/07 10:03:25 
    INFO  10:03:25,983 HelpFormatter - --------------------------------------------------------------------------------- 
    INFO  10:03:25,983 HelpFormatter - --------------------------------------------------------------------------------- 
    INFO  10:03:26,010 ArgumentTypeDescriptor - Dynamically determined type of Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf to be VCF 
    INFO  10:03:26,016 ArgumentTypeDescriptor - Dynamically determined type of hapmap_3.3.b37.sites.sorted.annotated.vcf to be VCF 
    INFO  10:03:26,021 ArgumentTypeDescriptor - Dynamically determined type of 1000G_omni2.5.b37.sites.sorted.annotated.vcf to be VCF 
    INFO  10:03:26,025 ArgumentTypeDescriptor - Dynamically determined type of NCBI_dbsnp_for_GATK-annotated.vcf to be VCF 
    INFO  10:03:26,045 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  10:03:26,133 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,178 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,234 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,299 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:26,528 MicroScheduler - Running the GATK in parallel mode with 16 concurrent threads 
    INFO  10:03:26,645 TrainingSet - Found hapmap track:    Known = false   Training = true     Truth = true    Prior = Q15.0 
    INFO  10:03:26,646 TrainingSet - Found omni track:  Known = false   Training = true     Truth = false   Prior = Q12.0 
    INFO  10:03:26,646 TrainingSet - Found dbsnp track:     Known = true    Training = false    Truth = false   Prior = Q6.0 
    INFO  10:03:26,742 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,750 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,758 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,766 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,773 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,781 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,788 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,804 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,812 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,836 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,845 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,851 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,859 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,866 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,873 RMDTrackBuilder - Loading Tribble index from disk for file Laine_pool2_9_AGTTCC_L005-raw-indel-annotated.vcf 
    INFO  10:03:26,880 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,903 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,931 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,958 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:26,986 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,009 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,036 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,060 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,083 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,109 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,136 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,164 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,190 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,217 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,243 RMDTrackBuilder - Loading Tribble index from disk for file hapmap_3.3.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,270 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,366 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,413 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,461 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,544 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,591 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,707 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,882 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:27,922 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,009 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,051 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,095 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,269 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,332 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,377 RMDTrackBuilder - Loading Tribble index from disk for file 1000G_omni2.5.b37.sites.sorted.annotated.vcf 
    INFO  10:03:28,423 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:28,623 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:28,677 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
    INFO  10:03:28,678 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  10:03:29,069 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:29,426 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:29,657 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:30,861 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:31,246 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:31,435 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:31,729 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:32,351 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:32,956 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:33,137 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:33,313 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:33,581 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:35,830 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK-annotated.vcf 
    INFO  10:03:56,732 TraversalEngine -   chr1:70443113        1.25e+07   30.0 s        2.4 s      2.3%        22.0 m    21.5 m 
    INFO  10:04:26,733 TraversalEngine -   chr3:30914684        3.04e+07   60.0 s        2.0 s     16.9%         5.9 m     4.9 m 
    INFO  10:04:56,734 TraversalEngine -   chr7:35746800        4.52e+07   90.0 s        2.0 s     41.0%         3.7 m     2.2 m 
    INFO  10:05:26,735 TraversalEngine -  chr13:21363534        6.09e+07    2.0 m        2.0 s     68.0%         2.9 m    56.4 s 
    INFO  10:05:56,738 TraversalEngine -   chrX:97689745        7.67e+07    2.5 m        2.0 s     96.2%         2.6 m     5.9 s 
    INFO  10:05:58,828 VariantDataManager - QD:      mean = NaN  standard deviation = NaN 
    INFO  10:06:02,996 GATKRunReport - Uploaded run statistics report to AWS S3 
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365): 
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
    ##### ERROR ------------------------------------------------------------------------------------------
    
  • ebanks Posts: 682 GATK Developer mod

    Oh, I see what's going on: you are not following our best practices recommendations. Please go back and read them, especially as it concerns the statistical filtering of indels.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • bishwo Posts: 16 Member
    edited November 2012

    According to the Best Practices recommendations for statistical filtering, I need to choose one of the following options because I have a small whole-exome sample:

    1. Adding more samples
    2. Running VQSR with the arguments --maxGaussians 4 --percentBad 0.12
    3. Using hard filters

    I selected the second option. I added --maxGaussians 4 and --percentBad 0.12.

    I need to use the training set Mills_and_1000G_gold_standard.indels.b37.sites.vcf.

    Am I still doing this the wrong way? Please correct me if I have misunderstood.

  • ebanks Posts: 682 GATK Developer mod
  • bishwo Posts: 16 Member

    Thanks!

    I followed the instructions at the given link. However, I ran into yet another problem. I have only one exome sample in which to call and filter indels. Am I getting this error because the sample data set is small?

    ##### ERROR MESSAGE: Bad input: Error during negative model training. Minimum number of variants to use in training is larger than the whole call set. One can attempt to lower the --minNumBadVariants arugment but this is unsafe.

    I tried decreasing the value of the --minNumBadVariants argument. When I ran VariantRecalibrator with --minNumBadVariants 382, I got the following error.

    INFO  10:21:34,421 VariantDataManager - QD:      mean = 29.86    standard deviation = 12.37 
    INFO  10:21:34,421 VariantDataManager - FS:      mean = -0.01    standard deviation = 0.11 
    INFO  10:21:34,422 VariantDataManager - HaplotypeScore:      mean = 0.00     standard deviation = 0.11 
    INFO  10:21:34,422 VariantDataManager - ReadPosRankSum:      mean = -0.04    standard deviation = 1.40 
    INFO  10:21:34,423 VariantDataManager - Training with 202 variants after standard deviation thresholding. 
    WARN  10:21:34,423 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable. 
    INFO  10:21:34,428 GaussianMixtureModel - Initializing model with 30 k-means iterations... 
    INFO  10:21:34,536 VariantRecalibratorEngine - Finished iteration 0. 
    INFO  10:21:34,561 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.28747 
    INFO  10:21:34,576 VariantRecalibratorEngine - Convergence after 9 iterations! 
    INFO  10:21:34,579 VariantDataManager - Found 0 variants overlapping bad sites training tracks. 
    WARN  10:21:34,580 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable. 
    INFO  10:21:36,597 GATKRunReport - Uploaded run statistics report to AWS S3 
    Exception in thread "main" java.lang.NullPointerException
        at org.broadinstitute.sting.gatk.CommandLineGATK.checkForMaskedUserErrors(CommandLineGATK.java:138)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:110)
    
  • ebanks Posts: 682 GATK Developer mod

    If you are running with only one sample then you should not be using VQSR. You need to use hard filters.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • bishwo Posts: 16 Member

    So, I need to use VariantFiltration. I do not understand what the mask file is. Is it mandatory to provide a mask file?

    I used VQSR for SNP calling on a single sample and did not get any error. Shouldn't I use VQSR for SNP calling either?

  • Geraldine_VdAuwera Posts: 6,176 Administrator, GATK Developer admin

    @bishwo, your previous runs were probably okay because you had enough SNPs, but you have fewer indels, so it doesn't work.

    The mask file is not required for using VariantFiltration. To understand how to use this tool, please read the following documentation carefully:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html

    http://www.broadinstitute.org/gatk/guide/article?id=51

    Geraldine Van der Auwera, PhD
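
As a rough illustration of hard filtering without a mask file, the sketch below wraps a VariantFiltration call in a shell function. The function name `hard_filter_indels`, the filter name, and the `JAVA` override are hypothetical, and the thresholds in the expression are illustrative assumptions only; take the actual values from the documentation linked above. It also assumes GenomeAnalysisTK.jar is in the current directory, as in the command line posted earlier in this thread.

```shell
#!/bin/sh
# hard_filter_indels (hypothetical helper): run VariantFiltration on an indel VCF.
#   $1 = reference FASTA, $2 = input VCF, $3 = output VCF
# Setting JAVA=echo prints the arguments instead of launching the GATK (dry run).
hard_filter_indels() {
    ${JAVA:-java} -Xmx4g -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R "$1" \
        -V "$2" \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" \
        --filterName "indel_hard_filter" \
        -o "$3"
}
```

A call might look like `hard_filter_indels genome.fa raw-indel-annotated.vcf indel-filtered.vcf`, where the file names are placeholders. No `--mask` argument is passed, in line with the point that the mask file is optional.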

  • bishwo Posts: 16 Member

    Thanks for the information.

    Now I have used hard filters for indels. Even with hard filters, 0% of the indels were filtered. For SNPs I used VQSR, and only 10% of the SNPs were filtered. Is it usual to get almost the same number of SNPs/indels even after the filtering step? In my opinion, filtering is not worth doing.
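
To see how many records a filtering step actually flagged, one can tally the FILTER column (column 7) of the output VCF. This is a minimal sketch, assuming an uncompressed, tab-delimited VCF; the function name `filter_summary` is hypothetical.

```shell
#!/bin/sh
# filter_summary (hypothetical helper): print each distinct FILTER value and
# the number of records carrying it, one "<value><TAB><count>" pair per line.
filter_summary() {
    awk -F'\t' '
        /^#/ { next }          # skip meta-information and header lines
        { count[$7]++ }        # FILTER is column 7
        END { for (f in count) printf "%s\t%d\n", f, count[f] }
    ' "$1" | LC_ALL=C sort     # fixed locale so the ordering is reproducible
}
```

Records that passed show `PASS`, records that failed show the name of the filter, and records that were never filtered show `.`. If every indel shows `PASS`, it may be worth checking that the annotations named in the filter expression are actually present in the file.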
