HaplotypeCaller: False negatives near intron boundaries

sshenkersshenker Cambridge, MAMember

Hi,

I am running HaplotypeCaller on RNA-seq, following the best practices instructions at this page:

https://software.broadinstitute.org/gatk/guide/article?id=3891

I am running the pipeline on NA12878, so that I can compare its output to a set of "gold-standard" calls for this genome. The pipeline finished sucessfully, and the output looks pretty good. However I have noticed that there are a number of false positives that have high read coverage that are located a short distance from exon/intron boundaries. Initially, I thought that there was an issue about the polymorphism being located too close to the intron. However, when I run the same pipeline on another sample with this SNP, HaplotypeCaller does successfully report the variant.

I have attached a picture to illustrate the case I am describing. There are 3 vcf files loaded. The top is the set of "gold-standard" calls for NA12878. Below that "HaplotypeCaller NA12878" is the calls made from the RNA-seq sample. The sample below that "HaplotypeCaller sample2" is the output of running HaplotypeCaller on another RNA-seq sample that happens to have this SNP. Below is the RNA-seq for NA12878.

In the RNA-seq sample for which HaplotypeCaller does not report this variant, the coverage at this locus is 674 reads (598 support the alternate allele).

Looking at the primary data in IGV, this variant is very apparent in the mapped reads. Furthermore, of the events that are false-negatives with respect to the set of NA12878 "gold-standard" calls, I have hundreds of examples of high-coverage SNPs located < 50bp from an exon/intron boundary. Is there any parameter I can set so that variants near introns are reported? Is HaplotypeCaller omitting variants located near introns by design?

The command I am using to run haplotypecaller is:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I in.fa -BQSR bqsr.table -nct 4 -o out.vcf -ARO activeRegions.tsv --dontUseSoftClippedBases

I have noticed that if I run MuTect2 in tumor-only mode, it does report these variants for the NA12878.

Please let me know if you have any suggestions.

Issue · Github
by Sheila

Issue Number
1644
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Answers

  • sshenkersshenker Cambridge, MAMember

    I found a possible workaround, and I wanted to check if anyone had any input about its correctness. I thought the problem might have something to do about the variants at the end of an aligned segment. Instead of representing the read as being spliced with an N-cigar-element, I tried transforming the spliced reads so that all nucleotides after the splice junction are replaced with N, and match directly to the genome. I have attached a picture to illustrate what I mean.

    Here the tracks are mostly as in the above post, from top to bottom:
    1. gold-standard calls for NA12878
    2. HaplotypeCaller output, run according to GATK best practices
    3. HaplotypeCaller output with N-padded reads instead of split'n'trim
    4. NA12878 RNA-seq
    5. N-padded RNA-seq

    Comparing track 4. vs 5. you can see how reads that map across this splice junction are replaced with a read that maps continuously to the genome, padded with N's where any artificial alignment was created by the transformation.

    I am curious if you can see any caveats to this approach, whether this is a valid approach, or this violates assumptions made by HaplotypeCaller. I will try replacing the split-n-trim step with this transformation and perform a systematic comparison myself to see how it affects HaplotypeCaller output vs the NA12878 calls, but I would appreciate any input you have.

  • sshenkersshenker Cambridge, MAMember

    curiously, if I set minPruning higher than the default, these false negatives are recovered. Does this mean the HaplotypeCaller is excluding these regions because there are too many variants observed in the reads?

    I tried setting the maxReadsInRegionPerSample lower, but this did not help. It seems setting the minPruning too high will lower the sensitivity. Does it make sense to run HaplotypeCaller twice for high/low -minPruning values, and merging the output? Would it help to filter highly mismatched reads before running HC?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @sshenker, sorry for the late response. HaplotypeCaller does not omit variants located near introns by design; based on your observation of the effect of minPruning, it might be due to some difficulties in the assembly step. Have you tried emitting the bamout file for this region? As described in this doc, you can emit a file that shows how HaplotypeCaller has realigned reads. This can help diagnose puzzling calls.

Sign In or Register to comment.