It looks like you're new here. If you want to get involved, click one of these buttons!
grossco
Posts: 4Member ✭
I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.
The HaplotypeCaller command line is:
gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar" #Fasta from the gz in the resource bundle indx="/home/ref/ucsc.hg19.fasta" dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf" java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \ -I chrom_bams/286T.chr3.bam \ -o hapc_vcfs/286T.chr3.raw.vcf
The VariantAnnotator command line is:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
--dbsnp $dbsnp --alwaysAppendDbsnpId \
-A BaseQualityRankSumTest -A DepthOfCoverage \
-A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
-A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
-A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
-A TandemRepeatAnnotator \
--variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
--out varanno_vcfs/286T.chr3.va.vcf
This all works nicely, but I go back and use ValidateVariants just to be sure:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
--dbsnp ${dbsnp} \
--variant:vcf varanno_vcfs/286T.chr3.va.vcf \
1> report/ValidateVariants/286T.chr3.va.valid.out \
2> report/ValidateVariants/286T.chr3.va.valid.err &
An issue arises with a rsID that is flagged as not being present in dbSNP.
...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP
I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.
The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AAAGAGAAGAGAAGAG A 1865.98 .
There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AA A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG . PASS
My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."
The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.
The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.
My questions are:
As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.
-- Colin
ebanks
Posts: 481 mod
Answers
For a small target, deep coverage, 454 sequenced experiment. I'm finding a disagreement too between the HaplotypeScore, FS, BaseQRankSum, MQRankSum values set by HaplotypeCaller and the ones set by UnifiedGenotyper or VariantAnnotator. Applying VariantAnnotator after HaplotypeCaller ammends the values to the UnifiedGenotyper ones.
Downsampling is hopefully disabled using the parameter --downsampling_type NONE.
Is this also a bug that will be corrected in the 2.4 release or is it another kind of issue?
Thanks in advance
Jordi
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Jordi, what you're seeing is not a bug. The different values are expected because UG and HC use different models of the variation. You see VariantAnnotator "amending" the HC values to match the UG ones because VA uses the UG model.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •Hmmm... I know that both use different models, but.... I'm working with a resequencing from 1000genomes samples, I'm getting HaplotypeScores on the hundreds for HaplotypeCaller and on the units/tenths for UnifiedGenotyper, indeed for the known SNPs, which makes me "nervous" about the issue. Furthermore, because I'm using the best practice hard filtering recommendations, the forner SNPs get filtered out and the later are kept. Because I'm a rookie... do you think such a big difference makes sense because of the models? Thanks in advance.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The difference is fine, but you make a good point about the filtering recommendations --those are formulated for UG results and shouldn't be applied as such for HC results. We will add HC-specific recommendations to the Best Practices documentation.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I've checked the Best Practices documentation and find nothing new about that. We are revising our results and having the HC recommendation could be very helpful. Do you know when will the recommendations be updated?. Thanks in advance!
Jordi
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Jordi,
I can't give you an estimate right now as we are currently working on improving the calculation model. Once that is done we'll need to run tests on a lot of data to determine what are the best generic settings. In the meantime you should experiment with the parameters to find the settings that best suit your data. Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •