The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

# Disagreement between HaplotypeCaller, VariantAnnotator, and ValidateVariants over a dbSNP annotation

Posts: 4Member
edited December 2012

I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta"
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"

java -Xms1g -Xmx2g -jar $gatk -R${indx} -T HaplotypeCaller \
-I chrom_bams/286T.chr3.bam \
-o hapc_vcfs/286T.chr3.raw.vcf


The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R${indx} -T VariantAnnotator \
--dbsnp $dbsnp --alwaysAppendDbsnpId \ -A BaseQualityRankSumTest -A DepthOfCoverage \ -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \ -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \ -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \ -A TandemRepeatAnnotator \ --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \ --out varanno_vcfs/286T.chr3.va.vcf  This all works nicely, but I go back and use ValidateVariants just to be sure: java -Xms1g -Xmx2g -jar$gatk -R ${indx} -T ValidateVariants \ --dbsnp${dbsnp} \
--variant:vcf varanno_vcfs/286T.chr3.va.vcf \
1> report/ValidateVariants/286T.chr3.va.valid.out \
2> report/ValidateVariants/286T.chr3.va.valid.err &


An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP


I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .


There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat \$dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS


My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.

My questions are:

1. Is this really a problem that arises from slightly different reference assemblies?
2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
3. Is there at tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?"
4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.

-- Colin

Post edited by grossco on
Tagged:

• Posts: 683 mod

Hi Colin,

Thank you very much for reporting this - it is in fact a bug. I am going to create a fix for this but it unfortunately will not be available until our 2.4 release (as the 2.3 release has already been set up for today).

• Posts: 4Member
edited January 2013

For a small target, deep coverage, 454 sequenced experiment. I'm finding a disagreement too between the HaplotypeScore, FS, BaseQRankSum, MQRankSum values set by HaplotypeCaller and the ones set by UnifiedGenotyper or VariantAnnotator. Applying VariantAnnotator after HaplotypeCaller ammends the values to the UnifiedGenotyper ones.

Downsampling is hopefully disabled using the parameter --downsampling_type NONE.

Is this also a bug that will be corrected in the 2.4 release or is it another kind of issue?

Jordi

Post edited by Jrambla on

Hi Jordi, what you're seeing is not a bug. The different values are expected because UG and HC use different models of the variation. You see VariantAnnotator "amending" the HC values to match the UG ones because VA uses the UG model.

Geraldine Van der Auwera, PhD

• Posts: 4Member

Hmmm... I know that both use different models, but.... I'm working with a resequencing from 1000genomes samples, I'm getting HaplotypeScores on the hundreds for HaplotypeCaller and on the units/tenths for UnifiedGenotyper, indeed for the known SNPs, which makes me "nervous" about the issue. Furthermore, because I'm using the best practice hard filtering recommendations, the forner SNPs get filtered out and the later are kept. Because I'm a rookie... do you think such a big difference makes sense because of the models? Thanks in advance.

The difference is fine, but you make a good point about the filtering recommendations --those are formulated for UG results and shouldn't be applied as such for HC results. We will add HC-specific recommendations to the Best Practices documentation.

Geraldine Van der Auwera, PhD

• Posts: 4Member
edited March 2013

I've checked the Best Practices documentation and find nothing new about that. We are revising our results and having the HC recommendation could be very helpful. Do you know when will the recommendations be updated?. Thanks in advance!

Jordi

Post edited by Jrambla on