HaplotypeCaller: False negatives near intron boundaries
I am running HaplotypeCaller on RNA-seq, following the best practices instructions at this page:
I am running the pipeline on NA12878 so that I can compare its output to a set of "gold-standard" calls for this genome. The pipeline finished successfully, and the output looks pretty good. However, I have noticed a number of false negatives with high read coverage that are located a short distance from exon/intron boundaries. Initially, I thought the problem was that the polymorphism is located too close to the intron. However, when I run the same pipeline on another sample with this SNP, HaplotypeCaller does successfully report the variant.
I have attached a picture to illustrate the case I am describing. There are 3 VCF files loaded. The top track is the set of "gold-standard" calls for NA12878. Below that, "HaplotypeCaller NA12878" shows the calls made from the NA12878 RNA-seq sample. Below that, "HaplotypeCaller sample2" is the output of running HaplotypeCaller on another RNA-seq sample that happens to have this SNP. The bottom track is the RNA-seq alignment for NA12878.
In the RNA-seq sample for which HaplotypeCaller does not report this variant, the coverage at this locus is 674 reads (598 support the alternate allele).
Looking at the primary data in IGV, this variant is very apparent in the mapped reads. Furthermore, among the calls that are false negatives with respect to the NA12878 "gold-standard" set, I have hundreds of examples of high-coverage SNPs located < 50bp from an exon/intron boundary. Is there any parameter I can set so that variants near introns are reported? Is HaplotypeCaller omitting variants located near introns by design?
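In case it helps to reproduce the count, here is a rough sketch of how false negatives near exon/intron boundaries could be tallied. It assumes the gold-standard and HaplotypeCaller calls have already been parsed into sets of (chrom, pos, ref, alt) tuples and the exons into (chrom, start, end) tuples with 1-based inclusive coordinates. The function names are hypothetical and not part of GATK.

```python
from bisect import bisect_left
from collections import defaultdict


def exon_boundaries(exons):
    """Collect sorted exon start/end positions per chromosome.

    exons: iterable of (chrom, start, end), assumed 1-based inclusive.
    """
    boundaries = defaultdict(set)
    for chrom, start, end in exons:
        boundaries[chrom].update((start, end))
    return {chrom: sorted(pos) for chrom, pos in boundaries.items()}


def distance_to_nearest_boundary(boundaries, chrom, pos):
    """Return the distance in bp to the closest exon boundary, or None."""
    positions = boundaries.get(chrom)
    if not positions:
        return None
    i = bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = positions[max(i - 1, 0):i + 1]
    return min(abs(pos - b) for b in candidates)


def false_negatives_near_boundaries(truth, called, exons, max_dist=50):
    """List truth variants missing from `called` within < max_dist bp of a boundary.

    truth, called: sets of (chrom, pos, ref, alt) tuples (hypothetical inputs).
    Returns a sorted list of (chrom, pos, ref, alt, distance).
    """
    boundaries = exon_boundaries(exons)
    hits = []
    for chrom, pos, ref, alt in truth - called:
        dist = distance_to_nearest_boundary(boundaries, chrom, pos)
        if dist is not None and dist < max_dist:
            hits.append((chrom, pos, ref, alt, dist))
    return sorted(hits)


if __name__ == "__main__":
    # Toy data for illustration only; not real NA12878 calls.
    exons = [("chr1", 100, 200), ("chr1", 500, 600)]
    truth = {("chr1", 190, "A", "G"), ("chr1", 150, "C", "T"), ("chr1", 510, "G", "A")}
    called = {("chr1", 510, "G", "A")}
    for hit in false_negatives_near_boundaries(truth, called, exons):
        print(hit)
```

The strict `dist < max_dist` comparison mirrors the "< 50bp" cutoff above. Matching on the exact (chrom, pos, ref, alt) tuple is a simplification and would miss variants that are represented differently in the two VCFs.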
The command I am using to run HaplotypeCaller is:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I in.fa -BQSR bqsr.table -nct 4 -o out.vcf -ARO activeRegions.tsv --dontUseSoftClippedBases
I have noticed that if I run MuTect2 in tumor-only mode, it does report these variants for NA12878.
Please let me know if you have any suggestions.