Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Disagreement between HaplotypeCaller, VariantAnnotator, and ValidateVariants over a dbSNP annotation

grosscogrossco Posts: 4Member
edited December 2012 in Ask the GATK team

I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

#Fasta from the gz in the resource bundle

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
 -I chrom_bams/286T.chr3.bam \
 -o hapc_vcfs/286T.chr3.raw.vcf 

The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
     --dbsnp $dbsnp  --alwaysAppendDbsnpId \
    -A BaseQualityRankSumTest -A DepthOfCoverage \
    -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
    -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
    -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
    -A TandemRepeatAnnotator \
    --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
    --out varanno_vcfs/286T.chr3.va.vcf

This all works nicely, but I go back and use ValidateVariants just to be sure:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
   --dbsnp ${dbsnp} \
   --variant:vcf varanno_vcfs/286T.chr3.va.vcf \
    1> report/ValidateVariants/286T.chr3.va.valid.out \
    2> report/ValidateVariants/286T.chr3.va.valid.err &

An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP

I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .

There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS

My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.

My questions are:

  1. Is this really a problem that arises from slightly different reference assemblies?
  2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
  3. Is there at tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?"
  4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.

-- Colin

Post edited by grossco on

Best Answer


  • JramblaJrambla Posts: 4Member
    edited January 2013

    For a small target, deep coverage, 454 sequenced experiment. I'm finding a disagreement too between the HaplotypeScore, FS, BaseQRankSum, MQRankSum values set by HaplotypeCaller and the ones set by UnifiedGenotyper or VariantAnnotator. Applying VariantAnnotator after HaplotypeCaller ammends the values to the UnifiedGenotyper ones.

    Downsampling is hopefully disabled using the parameter --downsampling_type NONE.

    Is this also a bug that will be corrected in the 2.4 release or is it another kind of issue?

    Thanks in advance


    Post edited by Jrambla on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,192Administrator, GATK Developer admin

    Hi Jordi, what you're seeing is not a bug. The different values are expected because UG and HC use different models of the variation. You see VariantAnnotator "amending" the HC values to match the UG ones because VA uses the UG model.

    Geraldine Van der Auwera, PhD

  • JramblaJrambla Posts: 4Member

    Hmmm... I know that both use different models, but.... I'm working with a resequencing from 1000genomes samples, I'm getting HaplotypeScores on the hundreds for HaplotypeCaller and on the units/tenths for UnifiedGenotyper, indeed for the known SNPs, which makes me "nervous" about the issue. Furthermore, because I'm using the best practice hard filtering recommendations, the forner SNPs get filtered out and the later are kept. Because I'm a rookie... do you think such a big difference makes sense because of the models? Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,192Administrator, GATK Developer admin

    The difference is fine, but you make a good point about the filtering recommendations --those are formulated for UG results and shouldn't be applied as such for HC results. We will add HC-specific recommendations to the Best Practices documentation.

    Geraldine Van der Auwera, PhD

  • JramblaJrambla Posts: 4Member
    edited March 2013

    I've checked the Best Practices documentation and find nothing new about that. We are revising our results and having the HC recommendation could be very helpful. Do you know when will the recommendations be updated?. Thanks in advance!


    Post edited by Jrambla on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,192Administrator, GATK Developer admin

    Hi Jordi,

    I can't give you an estimate right now as we are currently working on improving the calculation model. Once that is done we'll need to run tests on a lot of data to determine what are the best generic settings. In the meantime you should experiment with the parameters to find the settings that best suit your data. Good luck!

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.