On Monday and Tuesday, November 12-13, the communications team will be out of the office for a U.S. federal holiday and a team event. We will be back in action on November 14th and apologize for any inconvenience this may cause. Thank you for using the forum.

Haplotype caller not picking up variants for HiSeq Runs

Hello,
We were sequencing all our data in HiSeq and now moved to nextseq. We have sequenced the same batch of samples on both the sequencers. Both are processed using the same pipeline/parameters.
What I have noticed is, GATK 3.7 HC is not picking up variants, even though the coverage is good and is evidently present in the BAM file.

For example the screenshot below shows the BAM files for both NextSeq and HiSeq sample. There are atleast 3
variants in the region 22:29885560-29885861(NEPH, exon 5) that is expected to be picked up for HiSeq.

These variants are picked up for NextSeq samples (even though the coverage for hiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen ?

Many thanks,

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @nans_bn
    Hi,

    Can you check if this still occurs in GATK4 latest beta?
    Also, can you post the bamout files for the region?

    Thanks,
    Sheila

  • robertbrobertb torontoMember
    edited March 21

    Hi,

    I'm having a similar issue when doing concordance on a platinum genome sample sequenced on a hiseq and nextseq. The FN rate for Indels is significantly higher for the nextseq sample (8119) than the hiseq sample (8116). When I looked at what truth set indels were called for the hiseq sample and not the nextseq sample (the vcfs were generated on the same pipeline) it looks like they should have been called by haplotypeCaller in the nextseq samples ( I looked at the reads in IGV).

    It's not a filtering issue. These are unfiltered VCFs straight from haplotypeCaller and Genotype GVCF.

    Anyways, kind of odd. I guess a lot of what I'm seeing in the BAMs for the nextseq sample in these cases are "uninformative" reads? In the screen shot below 8119 (top) is the nextseq sample and 8116 (bottom) is the hiseq sample. Those this indel was called in the bottom and not the top.

    Screenshot_from_2018_03_21_11_30_20

    Post edited by robertb on
  • robertbrobertb torontoMember

    Here the line from the 8116 VCF:

    1. 1 1247578 . T TGG 141.87 . AC=2;AF=1.00;AN=2;DP=16;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.37;SOR=3.611 GT:AD:DP:GQ:PL 1/1:0,5:5:15:179,15,0

    So it was called hom_var. Odd that 8119 wasn't even called. And the AD and DP = 5. So there were a lot of "uninformative" reads here as well it's just that all the informative ones happened to carry the variant? I'd really like some kind of more plausible explanation for this. thanks - Robert

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hi @robertb,

    The FN rate for Indels is significantly higher for the nextseq sample (8119) than the hiseq sample (8116).

    This is a <1% difference.

    We will need the exact command you ran. If you are using an intervals list with the -L parameter, I suggest you omit this or use a minimum of a full contig length per interval.

    If you are comparing data from two different platforms using the same library sample, then read length is not a factor here. To correctly interpret what is going on with calling, you need to (i) view the bamout generated by HaplotypeCaller and (ii) color your reads according to instructions in section 6 of this article:

    Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
    Drag the alignments panel to center the red variant.
    Right-click on the alignments track and
    Group by sample
    Sort by base
    Color by tag: HC.
    

    Uninformative reads will be in gray.

  • robertbrobertb torontoMember
    edited March 22

    Hi,

    The Concordance was done on the same sample from the same library but sequenced on different Illumina platforms. As we use exome sequencing I used the intervals -L option when running concordance, yes, because I'm only interested in target regions. The pipeline and concordance were all run using GATK 3.6.0.
    Thanks for the info about bamout I wasn't aware of it. I'll have to get back you about what I find. Thanks for your Help - Robert

    Specifically, we get the VCFs from GenotypeGVCFs and do the following:

    A) Filter variants

    java -jar -Xmx4G $GATK \
    -R gatk_bundle/3.6.0/human_g1k_v37_decoy.fasta \
    --variant /nextseq/${1}.${2}.recal.filtered.snp.vcf \
    -o ${1}.${2}.snp.gatkF_all.vcf \
    --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' \
    --filterName 'INDELFILTERS' \
    -T VariantFiltration \

    B) Run concordance

    java -jar -Xmx4G $GATK \
    -R gatk_bundle/3.6.0/human_g1k_v37_decoy.fasta \
    --comp:VCF 202214_S1.genome.PASS.header.indels.vcf \
    --eval:VCF ${1}.${2}.PL.filt.target.indels.vcf \
    -o ${1}.${2}.PL.filt.target.indels.eval.txt \
    -T GenotypeConcordance \
    -L /baits/SS_clinical_research_exomes/S06588914.MT/S06588914_Covered.sort.merged.bed \

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hi @robertb,
    I suspect you will find that the difference in indel calls relates to the low complexity context, i.e. the string of Gs, you are showing for the insertion region. The bamout should help you see what is going on but the VCF records for the regions should also be telling.

  • robertbrobertb torontoMember

    I posted the VCF record for the truth set indel called in hiseq and not nextseq (above). Unless I'm mistaken, the VCF record states that indel was called 1/1 based on AD = 0,5 and DP = 5 so no reads supported ref allele and five reads supported the indel. Basically, all of the reads I see in the bam that appeared to support the ref allele were not used by HC.

    Similarly for the nextseq sample it looks like if any of the reads were used by HC they were all for the ref allele ( because there's no indel there at all) and no VCF record. It's not a filtering issue it's just not there. So for whatever reason HC just did not use those reads supporting the indel.

    I'm curious about the FN rate between nextseq and hiseq. Are you saying the difference should only be <1%? Does that apply to WES as well? Because if that's the case then something is wrong because I'm getting a muich greater than 1% difference between platforms.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    @robertb, I was pointing out that calling a <1% change "significantly higher" as you did in your original post something that could use further clarification.

    Both platforms, hiseq and nextseq, like many others, sequence by synthesis (SBS). The following video illustrates SBS's bridge amplification, a step where we know that variability can be introduced, especially for regions of low complexity.

    Such PCR amplification can introduce variability and these may manifest in small differences, especially for regions of low complexity, where polymerase slippage is common both in vivo and in vitro. This is a known limitation of SBS that we chalk up to as sequencing artifact.

    The software's logic is invariant. That you observe differences between the sequencing runs on two different platforms or even two different runs on the same platform in such a low complexity region is not surprising. This is why the field (i) has high confidence calling regions, (ii) calls on cohorts, (iii) uses sophisticated filtering such as VQSR. So one approach used in the field in calculating concordance is to use a high-confidence intervals list, e.g. that provided by GIAB.

    We ask that you post the bamout results if you'd like us to consider why the insertions present in both raw BAMs result in a call only in one.

  • robertbrobertb torontoMember
    edited March 28

    Hi,

    Thanks for your help. I did as you instructed and am looking through things.
    We did a machine comparability a couple of years ago and there was a nextseq run (3231) with a FN rate for indels around 10% or something like that. But the pipeline has changed since then. So I reran the 3231 fastq through the current pipeline (8251) and compared the results. The 8251 FN rate for indels was around 15%.

    I then took all the indel calls made in 3231 and not in 8251 that were on target and passing quality filters and used those intervals to get the bamout for 8251. I then loaded the bamout for 8251 with the original bam for 8251 to see what the difference was at these sites. I mean, there were no calls at these sites so I was expecting to see either shaded reads in the bamout or no reads at all, or at least none that supported the variant.

    I've noticed that a lot of the truth set indels called in 3231 and not in 8251 did not have any reads in these regions in the bamout, despite the fact that there were reads at these sites in the original BAM and some of the reads carried the variant that was called in 3231 (see example below). Presumably GATK saw no evidence of variation at these sites????

    I'm also noticing that these sites are characterised by low complexity sequenced.

    Anyways, I'm not done looking, but because these VCF came from the same fastq data the difference in indel FN must somehow be due to the pipeline. It's not a filtering issue as I've said because our quality filters have a minimal impact of FN rate and are primarily removing false positives.

    I'm wondering whether changes in GATK over the years could explain these differences? I'll have to track it down but clearly the old version of the pipeline made more calls and picked more of these truth set calls up compared to the new pipeline and it has something to do with the steps from BWA to variant calling.

    In this example, there's an insertion in the original bam (3231) but nothing was done in the bamout for 8251. Many of the sites I've looked are like this: there's nothing in the bamout.

    Screenshot_from_2018_03_28_17_21_12

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited March 30

    @robertb
    Hi,

    I suspect this has to do with the homoplymer T run in the region. Can you try the suggestions in section 6 here?

    -Sheila

    EDIT: I didn't see it in your other posts, but are you sure these missed calls are true positives? How did you verify them?

  • robertbrobertb torontoMember

    An update on this.
    I found that the higher FN rate of indels with nextseq was due to haplotype caller and the min-base_quality score parameter. Default is 10 but we had it set at 20. That makes a big difference for nextseq data compared to hiseq data, presumably because the base quality score is on average lower for nextseq data ( we also used recalibrated bams and I need to look at the impact of that as well). By selecting the default min_base_quality _score (10) the FN rate for indels decreased significantly, but the false positive rate for snps and indels also increased.

    Does GATK have a good tool ( like DepthofCoverage) that can take a bam as input and calculate the average base quality score across all the reads? That would be useful. I just want to first confirm that the base quality scores for nextseq is lower than for hiseq. I mean I think it should be because you've got two colours instead of four).

    thanks - Robert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @robertb
    Hi Robert,

    I think QualityScoreDistribution will do what you want. There are also some tools in the tool docs under "Diagnostics and Quality Control" that may be helpful.

    -Sheila

  • robertbrobertb torontoMember

    Yes the base quality score distribution was different for nextseq and hiseq machines. There was a notable hump in the 10-20 score range for nextseq and not hiseq. Dropping the base quality thresehold from 20 to 10 in the nextseq data rescued virtually all of the truth set variants that where called in the hiseq data when the base quality threshold was set at 10. So problem solved: our pipeline was optimized for hiseq2500 machines.

    Moving on, I'm now working with hiseq10X data and using a bcbio pipeline with gatk3.8 (we're moving to gatk4 as well). I'm finding that the number of indels being called is fewer than what is expected. Could be these are being filtered after VQSR. But the problem could also be haplotypeCaller. I though that if I posed my haplotypeCaller parameters someone may be able to recommend some changes. The
    standard_min_confidence_threshold_for_calling=10.0 is a default I guess. I'm trying to lower it to 0 but am having trouble doing this within the confines of bcbio. I think they may be hard coding this one somewhere. There's also the no low complexity region bed file that is being used to avoid calling in low complexity regions.

    It's possible I suppose that if my expectation of how many indels should be called were based on analyses that do not exclude low complexity regions that using this bed file in the interval option could significantly lower the number of indels called, right? But I don't expect many or any GIAB truth set calls to be in these regions so when I do concordance I shouldn't get a lot of false negatives, right?

    thanks - Robert

    here's the parameters:

    GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.8-1-0-gf15c1c3ef,Date="Mon Jun 11 22:16:01 EDT 2018",Epoch=1528769761665,

    CommandLineOptions="analysis_type=HaplotypeCaller
    input_file=[/hpf/largeprojects/pray/rbaldwin/WGS/REP/L004_REP/work/align/202214_L4/202214_L4-sort-recal.bam]
    showFullBamList=false
    read_buffer_size=null
    read_filter=[BadCigar, NotPrimaryAlignment]
    disable_read_filter=[]
    intervals=[/hpf/largeprojects/pray/rbaldwin/WGS/REP/L004_REP/work/gatk-haplotype/1/wgs_test-1_0_16882676-regions-nolcr.bed]
    excludeIntervals=null
    interval_set_rule=INTERSECTION
    interval_merging=ALL
    interval_padding=0
    reference_sequence=/hpf/tools/centos6/bcbio/20180111/genomes/Hsapiens/GRCh37/seq/GRCh37.fa
    nonDeterministicRandomSeed=false
    disableDithering=false
    maxRuntime=-1
    maxRuntimeUnits=MINUTES
    downsampling_type=BY_SAMPLE
    downsample_to_fraction=null
    downsample_to_coverage=500
    baq=OFF
    baqGapOpenPenalty=40.0
    refactor_NDN_cigar_string=false
    fix_misencoded_quality_scores=false
    allow_potentially_misencoded_quality_scores=false
    useOriginalQualities=false
    defaultBaseQualities=-1
    performanceLog=null
    BQSR=null
    quantize_quals=0
    static_quantized_quals=null
    round_down_quantized=false
    disable_indel_quals=false
    emit_original_quals=false
    preserve_qscores_less_than=6
    globalQScorePrior=-1.0
    secondsBetweenProgressUpdates=10
    validation_strictness=SILENT
    remove_program_records=false
    keep_program_records=false
    sample_rename_mapping_file=null
    unsafe=LENIENT_VCF_PROCESSING
    use_jdk_deflater=false
    use_jdk_inflater=false
    disable_auto_index_creation_and_locking_when_reading_rods=false
    no_cmdline_in_header=false
    sites_only=false
    never_trim_vcf_format_field=false
    bcf=false
    bam_compression=null
    simplifyBAM=false
    disable_bam_indexing=false
    generate_md5=false
    num_threads=1
    num_cpu_threads_per_data_thread=1
    num_io_threads=0
    monitorThreadEfficiency=false
    num_bam_file_handles=null
    read_group_black_list=null
    pedigree=[]
    pedigreeString=[]
    pedigreeValidationType=STRICT
    allow_intervals_with_unindexed_bam=false
    generateShadowBCF=false
    variant_index_type=DYNAMIC_SEEK
    variant_index_parameter=-1
    reference_window_stop=0
    phone_home= gatk_key=null
    tag=NA
    logging_level=INFO
    log_to_file=null

    help=false
    version=false
    out=/hpf/largeprojects/pray/rbaldwin/WGS/REP/L004_REP/work/bcbiotx/tmp2I74wS/wgs_test-1_0_16882676.vcf.gz
    likelihoodCalculationEngine=PairHMM
    heterogeneousKmerSizeResolution=COMBO_MIN
    dbsnp=(RodBinding name= source=UNBOUND)
    dontTrimActiveRegions=false
    maxDiscARExtension=25
    maxGGAARExtension=300
    paddingAroundIndels=150
    paddingAroundSNPs=20
    comp=[]
    annotation=[MappingQualityRankSumTest, MappingQualityZero, QualByDepth, ReadPosRankSumTest, RMSMappingQuality, BaseQualityRankSumTest, FisherStrand, GCContent, HaplotypeScore, HomopolymerRun, DepthPerAlleleBySample, Coverage, ClippingRankSumTest, DepthPerSampleHC]
    excludeAnnotation=[]
    group=[StandardAnnotation, StandardHCAnnotation]
    debug=false
    useFilteredReadsForAnnotations=false
    emitRefConfidence=NONE
    bamOutput=null
    bamWriterType=CALLED_HAPLOTYPES
    emitDroppedReads=false
    disableOptimizations=false
    annotateNDA=false
    useNewAFCalculator=false
    heterozygosity=0.001
    indel_heterozygosity=1.25E-4
    heterozygosity_stdev=0.01
    standard_min_confidence_threshold_for_calling=10.0
    standard_min_confidence_threshold_for_emitting=30.0
    max_alternate_alleles=6
    max_genotype_count=1024
    max_num_PL_vastandard_min_confidence_threshold_for_calling=10.0lues=100
    input_prior=[]
    sample_ploidy=2
    genotyping_mode=DISCOVERY
    alleles=(RodBinding name= source=UNBOUND)
    contamination_fraction_to_filter=0.0
    contamination_fraction_per_sample_file=null
    p_nonref_model=null
    exactcallslog=null
    output_mode=EMIT_VARIANTS_ONLY
    allSitePLs=false
    gcpHMM=10
    pair_hmm_implementation=VECTOR_LOGLESS_CACHING

    phredScaledGlobalReadMismappingRate=45
    noFpga=false
    nativePairHmmThreads=1
    useDoublePrecision=false
    sample_name=null
    kmerSize=[10, 25]
    dontIncreaseKmerSizesForCycles=false
    allowNonUniqueKmersInRef=false
    numPruningSamples=1
    recoverDanglingHeads=false
    doNotRecoverDanglingBranches=false
    minDanglingBranchLength=4
    consensus=false
    maxNumHaplotypesInPopulation=128
    errorCorrectKmers=false
    minPruning=2
    debugGraphTransformations=false
    allowCyclesInKmerGraphToGeneratePaths=false
    graphOutput=null
    kmerLengthForReadErrorCorrection=25
    minObservationsForKmerToBeSolid=20
    GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99]
    indelSizeToEliminateInRefModel=10
    min_base_quality_score=10
    includeUmappedReads=false
    useAllelesTrigger=false
    doNotRunPhysicalPhasing=true
    keepRG=null
    justDetermineActiveRegions=false
    dontGenotype=false
    dontUseSoftClippedBases=false
    captureAssemblyFailureBAM=false
    errorCorrectReads=false
    pcr_indel_model=CONSERVATIVE
    maxReadsInRegionPerSample=10000
    minReadsPerAlignmentStart=10
    mergeVariantsViaLD=false
    activityProfileOut=null
    activeRegionOut=null
    activeRegionIn=null
    activeRegionExtension=null
    forceActive=false
    activeRegionMaxSize=null
    bandPassSigma=null
    maxReadsInMemoryPerSample=30000
    maxTotalReadsInMemory=10000000
    maxProbPropagationDistance=50
    activeProbabilityThreshold=0.002
    min_mapping_quality_score=20
    filter_reads_with_N_cigar=false
    filter_mismatching_base_and_quals=false
    filter_bases_not_stored=false">

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @robertb
    Hi Robert,

    It's possible I suppose that if my expectation of how many indels should be called were based on analyses that do not exclude low complexity regions that using this bed file in the interval option could significantly lower the number of indels called, right?

    Yes, that is correct.

    But I don't expect many or any GIAB truth set calls to be in these regions so when I do concordance I shouldn't get a lot of false negatives, right?

    Yes, I agree, but can you tell us which version of GIAB NA12878 callset you are using. It seems the very latest version has a more restricted calling region.

    I'm finding that the number of indels being called is fewer than what is expected.

    What exactly do you mean by this? Do you have numbers? Can you post some example records in GIAB that are not called by HaplotypeCaller?

    Thanks,
    Sheil

Sign In or Register to comment.