HaplotypeCaller not catching variants

Hello,
I am currently having an issue where HaplotypeCaller is not catching at least one specific variant. I have searched forums and have used the -bamout option as suggested by some. While the bam file output from this appears to be much different than the bam file generated from my BQSR step, it still shows the variant 10% of 220 reads with an average quality of 30.1; the BQSR bam shows 57% of 310 reads with average quality of 30.8.

I have run UnifiedGenotyper on this same interval and that picked up the variant. I am very confused as to why this variant is not picked up.

I do not understand why one is able to pick it up and the other is not. It is important that this variant be found because this is a control sample, and is showing as a somatic mutation when compared to its affected sample when in fact it is a germline mutation.

Does anyone have any suggestions?

Answers

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    There are a few possibilities here and unfortunately without looking at the data is basically impossible to reach to a conclusion. You could start with sharing here some IGV plots including the original alignment .bam and the realigned output of HaplotypeCaller (-bamout argument) on that region as you mention.

    From what you say, 57% reads that supported the variant became only 10% in HaplotypeCaller (HC); that is rather a big change in allele balance. As we expect to see around 50% of reads supporting each allele in a heterozygous call, it might be the case that the home-ref genotype becomes as likely (the remaining 10% are then considered errors) and in that case HC will not output the site.

    A trick to produce an output for such a site is to artificially increase the ploidy, say to 8 (-ploidy 8). I would expect that then HC would call GT as 0/0/0/0/0/0/0/1 or 0...0/1/1. Is not a fix for the issue at hand but just another way to diagnose it.

    Another thing that might we worth trying is to turn off downsampling. In HC there are two kinds of downsampling: the general downsampling available in all tools that consume reads (disable it with -dt NONE) and also an imposed limit of reads within an active region. The latter is 10000 by default and is controlled by the argument (-maxReadsInRegionPerSample).

  • trevorconleytrevorconley San DiegoMember

    I have attached a screenshot of the region in question.

    I have also tried your ways of diagnosing and have not been able to get the variant to be called.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @trevorconley
    Hi,

    Have a look at this document. There are a few more tips there.

    Can you please post a larger region in the IGV Screenshot? Please include about 100 bases before and after the site of interest.

    -Sheila

  • valentinvalentin Cambridge, MAMember, Dev ✭✭
    edited November 2016

    @trevorconley, I guess that that is not IGV, what software are you using for plotting alignments?

  • trevorconleytrevorconley San DiegoMember

    @Sheila said:
    @trevorconley
    Hi,

    Have a look at this document. There are a few more tips there.

    Can you please post a larger region in the IGV Screenshot? Please include about 100 bases before and after the site of interest.

    -Sheila

    Hello,
    I have attached a larger range surrounding the variant in question (the yellow one in the middle). The control image is of the original bam and bamout files, which is not calling a variant at the site. The other image is of my case, which is calling a variant at the site. They are both run through exactly the same pipeline from start to finish.

  • trevorconleytrevorconley San DiegoMember

    @valentin said:
    @trevorconley, I guess that that is not IGV, what software are you using for plotting alignments?

    I use GenomeBrowse

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    Thanks for the new plots... it seems that many reads have been filtered out and that can happens in a few ways....

    Could you give us an estimate of how many of those reads have mapping quality 20 or lower? Also whether their mates (the other read in the pair) maps on the same chromosome or somewhere else?

    That regions seems quite dense in terms on variants including also a indel. I think that internally we may discard reads that are "too divergent" from the reference and that could also be the cause as well.

  • trevorconleytrevorconley San DiegoMember

    I switched to IGV for ease of getting qualities of an individual base. The lowest quality for that position was 20, and it appeared once; everything else was in the high 20s or 30s. Every single read shows that its mate is mapped to the same chromosome

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    I think that it is quite likely due to the last reason, but I would need the data to verify that. Perhaps you can post an extract. @Sheila, what is the protocol here?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @valentin @trevorconley
    Hi.

    You can submit a bug report (if you can share the files!) following these instructions.

    Thanks,
    Sheila

  • trevorconleytrevorconley San DiegoMember

    Since I have two samples from the same subject, is it acceptable to put both of those into the report? Since one of them called the variant, and the other did not, I find that it would be useful to have that information. It is just baffling how they both produce roughly the same values for the variant, but only my case sample seems to detect it when calling HC

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    The more info the better @trevorconley as long as the instructions to reproduce the problem are clear.

  • trevorconleytrevorconley San DiegoMember
    edited November 2016

    Hello,
    I have uploaded PersImmune_GATK.zip into the ftp directory. Inside there are two folders, one for my case and one for the control. The interval BAM file for each one is located in there, as well as a log of the HaplotypeCaller run. The command that I used is in the main directory; the only information that varies between the two runs is anything that has to do with the samples.

    Forgot to put the position most in question; Chromosome 6:57398201. There are also a few others that each calls that the other does not, but currently those are not as big of a deal.

    Really I am just confused why one of them calls the variant and the other does not

    Issue · Github
    by Sheila

    Issue Number
    1433
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @trevorconley
    Hi,

    Thanks. I will take a look soon.

    -Sheila

    Issue · Github
    by Sheila

    Issue Number
    1530
    State
    open
    Last Updated
  • trevorconleytrevorconley San DiegoMember

    Hello @Sheila,

    Do you have any updates on the progress of this?

    Thanks,
    Trevor

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @trevorconley
    Hi Trevor,

    Sorry, I have not gotten to this yet. I should get to it by the beginning of next week though. Please post again if I do not post by Tuesday.

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @trevorconley
    Hi Trevor,

    I just put in a bug report. You can keep track of it here.

    -Sheila

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    I have taken a look a the submitted data for the bug report and It seem that @trevorconley has bumped in to a weakness of the algorithm that we use to select the best haplotypes in complex regions where the local assembly graph has far too many possible paths (haplotypes) to analyze them all.

    Thanks very much for that @trevorconley. Unfortunately a fix for that would not be a quick one.

    Workarounds:

    In this particular instance you should be able to recover that variant if either:

    1. you increase the number of haplotypes to keep for analysis too e.g. 512 (using -maxNumHaplotypesInPopulation 512). The default is 128; I haven't tried 256 or any other number in between that also could be sufficient.

    2. you increase the kmer size to e.g. 31 (argument -kmerSize 31).

    Even if those changes would rescue this variant there are consequences:

    Number 1 would result in a considerable increase in CPU time in complex regions...

    and Number 2 may result in some lack of sensitivity across the genome. You can use more than one kmer size (-kmerSize 10 -kmerSize 25 -kmerSize 31) paying an additional CPU cost to prevent that everywhere and not just complex regions.

    Notice that 10 and 25 are the default kmer sizes so you are just adding 31 to the set.

    V.

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    Notice that 31 is kind of a magic number ... it might work with some other kmer size .. I personally prefer to stick to prime numbers (despite the default 10 and 25).

  • trevorconleytrevorconley San DiegoMember

    Wow. Thank you for the update @valentin. I will play around a little bit with those settings to see what works best in my case and update on any results I find.

  • trevorconleytrevorconley San DiegoMember

    @valentin I have found that setting the maxNumHaplotypesInPopulation to 256 does indeed call the variant that was not previously being called. I will leave this as is right now as I await the results of the issued bug report.

Sign In or Register to comment.