Haplotype caller not picking up variants for HiSeq Runs

Hello,
We were sequencing all our data in HiSeq and now moved to nextseq. We have sequenced the same batch of samples on both the sequencers. Both are processed using the same pipeline/parameters.
What I have noticed is, GATK 3.7 HC is not picking up variants, even though the coverage is good and is evidently present in the BAM file.

For example the screenshot below shows the BAM files for both NextSeq and HiSeq sample. There are atleast 3
variants in the region 22:29885560-29885861(NEPH, exon 5) that is expected to be picked up for HiSeq.

These variants are picked up for NextSeq samples (even though the coverage for hiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen ?

Many thanks,

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @nans_bn
    Hi,

    Can you check if this still occurs in GATK4 latest beta?
    Also, can you post the bamout files for the region?

    Thanks,
    Sheila

  • robertbrobertb torontoMember
    edited March 21

    Hi,

    I'm having a similar issue when doing concordance on a platinum genome sample sequenced on a hiseq and nextseq. The FN rate for Indels is significantly higher for the nextseq sample (8119) than the hiseq sample (8116). When I looked at what truth set indels were called for the hiseq sample and not the nextseq sample (the vcfs were generated on the same pipeline) it looks like they should have been called by haplotypeCaller in the nextseq samples ( I looked at the reads in IGV).

    It's not a filtering issue. These are unfiltered VCFs straight from haplotypeCaller and Genotype GVCF.

    Anyways, kind of odd. I guess a lot of what I'm seeing in the BAMs for the nextseq sample in these cases are "uninformative" reads? In the screen shot below 8119 (top) is the nextseq sample and 8116 (bottom) is the hiseq sample. Those this indel was called in the bottom and not the top.

    Screenshot_from_2018_03_21_11_30_20

    Post edited by robertb on
  • robertbrobertb torontoMember

    Here the line from the 8116 VCF:

    1. 1 1247578 . T TGG 141.87 . AC=2;AF=1.00;AN=2;DP=16;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.37;SOR=3.611 GT:AD:DP:GQ:PL 1/1:0,5:5:15:179,15,0

    So it was called hom_var. Odd that 8119 wasn't even called. And the AD and DP = 5. So there were a lot of "uninformative" reads here as well it's just that all the informative ones happened to carry the variant? I'd really like some kind of more plausible explanation for this. thanks - Robert

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @robertb,

    The FN rate for Indels is significantly higher for the nextseq sample (8119) than the hiseq sample (8116).

    This is a <1% difference.

    We will need the exact command you ran. If you are using an intervals list with the -L parameter, I suggest you omit this or use a minimum of a full contig length per interval.

    If you are comparing data from two different platforms using the same library sample, then read length is not a factor here. To correctly interpret what is going on with calling, you need to (i) view the bamout generated by HaplotypeCaller and (ii) color your reads according to instructions in section 6 of this article:

    Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
    Drag the alignments panel to center the red variant.
    Right-click on the alignments track and
    Group by sample
    Sort by base
    Color by tag: HC.
    

    Uninformative reads will be in gray.

  • robertbrobertb torontoMember
    edited March 22

    Hi,

    The Concordance was done on the same sample from the same library but sequenced on different Illumina platforms. As we use exome sequencing I used the intervals -L option when running concordance, yes, because I'm only interested in target regions. The pipeline and concordance were all run using GATK 3.6.0.
    Thanks for the info about bamout I wasn't aware of it. I'll have to get back you about what I find. Thanks for your Help - Robert

    Specifically, we get the VCFs from GenotypeGVCFs and do the following:

    A) Filter variants

    java -jar -Xmx4G $GATK \
    -R gatk_bundle/3.6.0/human_g1k_v37_decoy.fasta \
    --variant /nextseq/${1}.${2}.recal.filtered.snp.vcf \
    -o ${1}.${2}.snp.gatkF_all.vcf \
    --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' \
    --filterName 'INDELFILTERS' \
    -T VariantFiltration \

    B) Run concordance

    java -jar -Xmx4G $GATK \
    -R gatk_bundle/3.6.0/human_g1k_v37_decoy.fasta \
    --comp:VCF 202214_S1.genome.PASS.header.indels.vcf \
    --eval:VCF ${1}.${2}.PL.filt.target.indels.vcf \
    -o ${1}.${2}.PL.filt.target.indels.eval.txt \
    -T GenotypeConcordance \
    -L /baits/SS_clinical_research_exomes/S06588914.MT/S06588914_Covered.sort.merged.bed \

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @robertb,
    I suspect you will find that the difference in indel calls relates to the low complexity context, i.e. the string of Gs, you are showing for the insertion region. The bamout should help you see what is going on but the VCF records for the regions should also be telling.

  • robertbrobertb torontoMember

    I posted the VCF record for the truth set indel called in hiseq and not nextseq (above). Unless I'm mistaken, the VCF record states that indel was called 1/1 based on AD = 0,5 and DP = 5 so no reads supported ref allele and five reads supported the indel. Basically, all of the reads I see in the bam that appeared to support the ref allele were not used by HC.

    Similarly for the nextseq sample it looks like if any of the reads were used by HC they were all for the ref allele ( because there's no indel there at all) and no VCF record. It's not a filtering issue it's just not there. So for whatever reason HC just did not use those reads supporting the indel.

    I'm curious about the FN rate between nextseq and hiseq. Are you saying the difference should only be <1%? Does that apply to WES as well? Because if that's the case then something is wrong because I'm getting a muich greater than 1% difference between platforms.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @robertb, I was pointing out that calling a <1% change "significantly higher" as you did in your original post something that could use further clarification.

    Both platforms, hiseq and nextseq, like many others, sequence by synthesis (SBS). The following video illustrates SBS's bridge amplification, a step where we know that variability can be introduced, especially for regions of low complexity.

    Such PCR amplification can introduce variability and these may manifest in small differences, especially for regions of low complexity, where polymerase slippage is common both in vivo and in vitro. This is a known limitation of SBS that we chalk up to as sequencing artifact.

    The software's logic is invariant. That you observe differences between the sequencing runs on two different platforms or even two different runs on the same platform in such a low complexity region is not surprising. This is why the field (i) has high confidence calling regions, (ii) calls on cohorts, (iii) uses sophisticated filtering such as VQSR. So one approach used in the field in calculating concordance is to use a high-confidence intervals list, e.g. that provided by GIAB.

    We ask that you post the bamout results if you'd like us to consider why the insertions present in both raw BAMs result in a call only in one.

  • robertbrobertb torontoMember
    edited March 28

    Hi,

    Thanks for your help. I did as you instructed and am looking through things.
    We did a machine comparability a couple of years ago and there was a nextseq run (3231) with a FN rate for indels around 10% or something like that. But the pipeline has changed since then. So I reran the 3231 fastq through the current pipeline (8251) and compared the results. The 8251 FN rate for indels was around 15%.

    I then took all the indel calls made in 3231 and not in 8251 that were on target and passing quality filters and used those intervals to get the bamout for 8251. I then loaded the bamout for 8251 with the original bam for 8251 to see what the difference was at these sites. I mean, there were no calls at these sites so I was expecting to see either shaded reads in the bamout or no reads at all, or at least none that supported the variant.

    I've noticed that a lot of the truth set indels called in 3231 and not in 8251 did not have any reads in these regions in the bamout, despite the fact that there were reads at these sites in the original BAM and some of the reads carried the variant that was called in 3231 (see example below). Presumably GATK saw no evidence of variation at these sites????

    I'm also noticing that these sites are characterised by low complexity sequenced.

    Anyways, I'm not done looking, but because these VCF came from the same fastq data the difference in indel FN must somehow be due to the pipeline. It's not a filtering issue as I've said because our quality filters have a minimal impact of FN rate and are primarily removing false positives.

    I'm wondering whether changes in GATK over the years could explain these differences? I'll have to track it down but clearly the old version of the pipeline made more calls and picked more of these truth set calls up compared to the new pipeline and it has something to do with the steps from BWA to variant calling.

    In this example, there's an insertion in the original bam (3231) but nothing was done in the bamout for 8251. Many of the sites I've looked are like this: there's nothing in the bamout.

    Screenshot_from_2018_03_28_17_21_12

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited March 30

    @robertb
    Hi,

    I suspect this has to do with the homoplymer T run in the region. Can you try the suggestions in section 6 here?

    -Sheila

    EDIT: I didn't see it in your other posts, but are you sure these missed calls are true positives? How did you verify them?

  • robertbrobertb torontoMember

    An update on this.
    I found that the higher FN rate of indels with nextseq was due to haplotype caller and the min-base_quality score parameter. Default is 10 but we had it set at 20. That makes a big difference for nextseq data compared to hiseq data, presumably because the base quality score is on average lower for nextseq data ( we also used recalibrated bams and I need to look at the impact of that as well). By selecting the default min_base_quality _score (10) the FN rate for indels decreased significantly, but the false positive rate for snps and indels also increased.

    Does GATK have a good tool ( like DepthofCoverage) that can take a bam as input and calculate the average base quality score across all the reads? That would be useful. I just want to first confirm that the base quality scores for nextseq is lower than for hiseq. I mean I think it should be because you've got two colours instead of four).

    thanks - Robert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @robertb
    Hi Robert,

    I think QualityScoreDistribution will do what you want. There are also some tools in the tool docs under "Diagnostics and Quality Control" that may be helpful.

    -Sheila

Sign In or Register to comment.