Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Wrong Ref call being reported in CombineGVCFs

Hi,

I've got an issue when I run CombineGVCFs to merge a bunch of gvcf samples. I'm getting incorrect REF calls at certain point.
The GATK version is 3.5.

ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:183085898, C* vs. T*

tabix extra-gvcf.gvcf.gz 1:183085898-183085898
1 183085898 . T . . . GT:DP:GQ:MIN_DP:PL

tabix sorted.vcf.gz 1:183085898-183085898
1 183085898 . C *, . . DP=3072 GT:AD:DP:GQ:MIN_DP:PL:SB

I also ran a test using an interval file (one bed line of 10 basepairs 1:183085890-183085900) and this returned the correct REF (T) at position 183085898 - alt was without a star. This is perplexing!

I've now started running ValidateVariants on the input to see if it's corrupt in some way and it reports the error:

ERROR MESSAGE: File sorted.vcf.gz fails strict validation: one or more of the ALT allele(s) for the record at position 1:2069681 are not observed at all in the sample genotypes

Pulling this position from the merged file, I see lots of samples with no observations (./.) - don't know if this is an issue:

1 2069681 rs3753242 C T, . . BaseQRankSum=1.69;ClippingRankSum=-2.570e-01;DP=962;MQ=60.00;MQ0=0;MQRankSum=-9.900e-02;ReadPosRankSum=0.167 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:16,15,0:31:99:.:399,0,391,447,436,883:7,9,11,4 ./. ./. ./.:.:31:84:30:0,81,861,81,861,861 ./. ./. ./. ./.:.:41:93:41:0,94,1156,94,1156,1156 ./. ./. ./. ./. ./. ./.:0,51,0:51:99:.:1678,153,0,1678,153,1678:0,0,30,21 ./. ./.:0,35,0:35:99:.:1194,105,0,1194,105,1194:0,0,17,18 ./. ./.:.:36:99:36:0,99,1485,99,1485,1485 ./.:18,20,0:38:99:.:474,0,339,526,399,925:9,9,11,9 ./. ./. ./. ./.:.:36:67:36:0,68,870,68,870,870 ./. ./. ./. ./.:16,13,0:29:99:.:357,0,411,405,450,855:9,7,0,13 ./. ./. ./. ./.:0,44,0:44:99:.:1390,131,0,1390,131,1390:0,0,25,19 ./. ./.:19,15,0:34:99:.:363,0,470,421,514,935:10,9,8,7 ./. ./. ./. ./. ./. ./. ./.:20,12,0:32:99:.:305,0,501,365,537,902:8,12,4,8 ./. ./. ./. ./. ./. ./. ./.:.:29:50:29:0,51,743,51,743,743 ./.:23,15,0:38:99:.:386,0,547,454,592,1047:11,12,7,8 ./.:15,16,0:31:99:.:393,0,360,438,407,845:8,7,7,9 ./.:.:33:98:33:0,95,972,95,972,972 ./. ./. ./. ./. ./.:.:33:88:33:0,88,972,88,972,972 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.:24,15,0:39:99:.:368,0,608,440,653,1093:13,11,8,7 ./. ./. ./. ./. ./.:.:31:67:31:0,67,944,67,944,944 ./. ./. ./. ./.:.:35:86:35:0,87,1033,87,1033,1033 ./. ./.:11,23,0:34:99:.:647,0,247,680,316,997:6,5,13,10 ./.:17,15,0:32:99:.:418,0,434,469,479,949:9,8,9,6 ./. ./.:13,13,0:26:99:.:368,0,295,406,334,740:7,6,5,8 ./. ./.:.:35:87:35:0,87,945,87,945,945 ./. ./.:.:24:45:24:0,45,643,45,643,643 ./. ./.:.:38:95:38:0,95,997,95,997,997 ./.:.:34:90:34:0,90,923,90,923,923 ./. ./. ./. ./. ./.:13,17,0:30:99:.:491,0,313,530,364,894:4,9,10,7

Has anyone had this issue before or could maybe tell me if my commands below are wrong.
Thanks in advance for your time,

Sean.

java -jar /usr/local/bin/GATK.jar -T CombineGVCFs -V HMHMCCCXX_s1_1_GSLv3-7_01_SL146599.realigned.recalibrated.g.vcf.gz -V HMHMCCCXX_s2_1_GSLv3-7_02_SL146600.realigned.recalibrated.g.vcf.gz -V HMHMCCCX
X_s3_1_GSLv3-7_03_SL146601.realigned.recalibrated.g.vcf.gz -V HMHMCCCXX_s4_1_GSLv3-7_04_SL146602.realigned.recalibrated.g.vcf.gz ..... -R hs37d5.fa -o merged.vcf -L intervals.bed

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @foxyjohn
    Hi Sean,

    Did you use the same reference when you created the GVCFs? Can you please post the command you ran to get the GVCFs?

    Thanks,
    Sheila

  • foxyjohnfoxyjohn Member

    Hi Sheila,

    Thanks for taking the time to respond. They were generated with a reference sequence from a third party. I'm gonna ask them to send it along and check for differences. Appreciate your input.

    Sean.

  • foxyjohnfoxyjohn Member

    Hi Sheila,

    I've managed to get the reference sequence used to generate the gVCFs (genome.fa which it turns out is actually hs37d5.fa) and extracted multiple test fasta regions around sites that were failing. All the sites are identical in this sequence as in our own local one (hg19.fa).
    I am very confident the reference is not the issue.

    I'm pasting the HaplotypeCaller command used to generate gVCFs below:

    In total 270 samples were analysed with this command. Later another 94 were analysed with exactly the same command using the exact same GATK version. When we run CombineGVCF on these 2 batches we get the inconsistent reference error mentioned previously.

    GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Tue Jan 12 05:59:19 UTC2016", Epoch=1452578359170, CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/home/dnanexus/in/sorted_bam/file.realigned.recalibrated.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=[1] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=genome.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencod

    ed_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=4 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allo
    w_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false dbsnp=(RodBinding name=dbsnp source=dbsnp_138.b37.vcf.gz) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false doNotRunPhysicalPhasing=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAs
    semblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=1000 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You're using an older version from last year that had a bug in that tool. See https://www.broadinstitute.org/gatk/blog?id=5596.

  • foxyjohnfoxyjohn Member

    Thanks for the link Geraldine. So is it the gvcf's generated by HaplotypeCaller that all need to be regenerated? I'll try CombineGVCFs using the latest GATK version 3.6 first perhaps.

    Sean.

  • foxyjohnfoxyjohn Member

    All HaplotypeCaller analysis was done using Version=3.3-0-g37228af. CombineGVCFs was done using v3.5-0-g36282e4.

  • foxyjohnfoxyjohn Member

    Geraldine,

    We ran CombineGCVFs (version 3.5) previously on a first batch of samples also generated using HC version 3.3 with no problems. Why would this be an issue now, considering the new batch of samples(94) were generated using the same HC version. The 2 batches are 270 & 94 samples big, use the exact same reference, same GATK version (3.3) for HC and same version for CombineGVCF(3.5) and were combined into those batches perfectly fine. Only now, on merging the 2 batches, do we experience the inconsistent reference issue. It's perplexing.

    Sean.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmph, and there goes my neat, convenient explanation.

    My advice would be to go back and redo the combining from the original GVCFs with the latest version of GATK, after checking that all the original GVCFs have the same REF allele at the offending position.

    The presence of a star (in your original post) makes me think there is a spanning deletion in the area that may be contributing to the problem. The early handling of spanning deletions (circa 3.4) was a bit buggy, and that has been smoothed out in later versions. So it would be good to know if 3.6 replicates the issue or not.

  • foxyjohnfoxyjohn Member

    Hi Geraldine,

    Yes one of the samples in the the new merged batch has a deletion as you point out. In fact we also notice that the position that the mismatch is reported is an offset - when we grep around this location we notice the right call is a 1 basepair offset.

    Unfortunately, I have tried version 3.6 to do the CombineGVCF on the 2 batches but to no avail, the same message is reported.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sorry, I mean you need to try regenerating the batches themselves with 3.6. Based on your original post it looks like the underlying problem (introduction of the wrong REF allele) happened when you made the batches.

  • foxyjohnfoxyjohn Member

    No worries.

    Both batches were made with 3.5. Only the HC calls were made with 3.3 - do you mean these HC calls need to be all redone? (I want to avoid this for now, though I know it'll have to be done in the long run)
    I am thinking of testing another strategy: tabix extracting the regions individually from each gVCF, then merging all the extracted regions to make the batch instead of one giant merge of the full gVCFs with -L intervals.file

    Sean.

Sign In or Register to comment.