Why does GenotypeGVCF add duplicated SNPs past the end of the GVCF region?

I ran HaplotypeCaller per sample on the entire genome, then merged multiple samples together into the same GVCF using CombineGVCFs. The CombineGVCFs step was run with -L to split up the GVCFs into 200kbp chunks to speed up processing. Each chunk was passed to GenotypeGVCFs with -L set to only genotype at desired variant sites from a VCF.

In at least one of the resulting VCF files from GenotypeGVCF, there are additional SNPs appended to the end of the chunk.

This is the tail of the CombineGVCF chunk we created with -L Pf3D7_04_v3:1-200000:

Pf3D7_04_v3 199997 . T *, . . DP=11046 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0 ...
Pf3D7_04_v3 199998 . T *, . . DP=11046 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0 ...
Pf3D7_04_v3 199999 . A *, . . DP=11046 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0 ...
Pf3D7_04_v3 200000 . T *, . . DP=11046 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0 ...

However, this is the tail of the GenotypeGVCF for the same chunk created with a -L :

Pf3D7_04_v3 199880 . C . 36.75 . AN=178;DP=9932;FractionInformativeReads=1.00;GC=28.57;InbreedingCoeff=-0.0039;NCC=2;VariantType=SYMBOLIC ...
Pf3D7_04_v3 200001 . A . 3501.96 . AN=182;DP=10066;FractionInformativeReads=0.00;GC=9.52;NCC=0;VariantType=NO_VARIATION ...
Pf3D7_04_v3 200002 . A . 3501.96 . AN=182;DP=10066;FractionInformativeReads=0.00;GC=9.52;NCC=0;VariantType=NO_VARIATION ...
Pf3D7_04_v3 200004 . A . 3501.96 . AN=182;DP=10066;FractionInformativeReads=0.00;GC=9.52;NCC=0;VariantType=NO_VARIATION ...

Note that the GenotypeGVCF result reports positions 200001, 200002, 200004 even though they are not included in the CombineGVCF GVCF.

The VCF list of desired SNP sites includes the following sites (among others):
Pf3D7_04_v3 199880 . C T 4297.63 PASS - -
Pf3D7_04_v3 200001 . A T 5638.72 PASS - -
Pf3D7_04_v3 200002 . A T 2871.53 PASS - -
Pf3D7_04_v3 200004 . A C 32763.20 PASS - -
Pf3D7_04_v3 200635 . T C 41.78 PASS - -

Why is GenotypeGVCF appending positions 200001, 200002, 200004? Is this a deletion starting on or before 200000 and spanning positions 200001, 200002, 200004?

Issue · Github
by shlee

Issue Number
2664
State
open
Last Updated

Answers

  • tnsangertnsanger Member

    I am using GATK v3.7

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited November 2017

    Hi @tnsanger,

    The way you approach parallelization using -L is something we warn against. Please parallelize in chucks that do not break up contiguous non-N regions of a contig. For example, instead of 200kbp chuncks, process chr1 separately from chr2 separately from chr3 + chr4, etc.

    What you describe could be a bug. If you are interested in pursuing the source of this as a potential bug, we'll need more information. For example, per sample level GVCF, do all the INFO field's END annotations give you 200000 or do some of them go past this locus? This can happen in HaplotypeCaller because the tool will expand the reassembly region as it deems fit, so that variants under consideration are enclosed by regions that merge back into the reference.

    We will also need the exact commands you ran for each step. Thanks.

    P.S. Check out Picard's ScatterIntervalsByNs to create such as intervals list split by Ns.

  • tnsangertnsanger Member

    HaplotypeCaller (executed per sample):

    java -Xmx5625m -Xms5625m -server -XX:+UseSerialGC -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Pfalciparum.genome.fasta -I sample1.bam -o sample1.hc.gvcf.gz --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --max_alternate_alleles 6

    java -Xmx5625m -Xms5625m -server -XX:+UseSerialGC -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Pfalciparum.genome.fasta -I sample2.bam -o sample2.hc.gvcf.gz --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --max_alternate_alleles 6

    . . .

    CombineGVCFs: Merge samples, and chunk by 200kbp

    java -Xmx5625m -Xms5625m -server -XX:+UseSerialGC -jar GenomeAnalysisTK.jar -T CombineGVCFs -R Pfalciparum.genome.fasta -V newline_separated_per_sample_haplotypecaller_gvcf_output_file_paths.list -o merged_samples.chr4.1_200000.gvcf.gz -L Pf3D7_04_v3:1-200000

    GenotypeGVCFs: executed per 200kbp chunk across all samples

    java -Xmx11025m -Xms11025m -server -XX:+UseSerialGC -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R Pfalciparum.genome.fasta --variant merged_samples.chr4.1_200000.gvcf.gz -o chr4.1_200000.genotyped.gvcf.gz --max_alternate_alleles 6 --annotation AS_BaseQualityRankSumTest --annotation AS_FisherStrand --annotation AS_InbreedingCoeff --annotation AS_InsertSizeRankSum --annotation AS_MQMateRankSumTest --annotation AS_MappingQualityRankSumTest --annotation AS_QualByDepth --annotation AS_ReadPosRankSumTest --annotation AS_StrandOddsRatio --annotation ExcessHet --annotation FisherStrand --annotation FractionInformativeReads --annotation GCContent --annotation GenotypeSummaries --annotation HomopolymerRun --annotation QualByDepth --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation VariantType -allSites -L desired_variant_sites.vcf.gz

  • tnsangertnsanger Member

    There are indeed many HaplotypeCaller per-sample GVCF output files that have INFO column with END field past position 200000:

    e.g.

    sample1.hc.gvcf.gz:
    Pf3D7_04_v3 198718 . T . . END=200781 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

    sample2.hc.gvcf.gz:
    Pf3D7_04_v3 199995 . A . . END=200022 GT:DP:GQ:MIN_DP:PL 0/0:26:70:24:0,70,782

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @tnsanger,

    Ok, thanks for clarifying your pipeline details with commands. Here is your pipeline:

    [1] HaplotypeCaller in ERC GVCF mode over entirety of sample without intervals
    [2] CombineGVCFs per 200kbp chunks
    [3] GenotypeGVCFs per same 200kbp chunks.

    The 200kbp chunks are arbitrary, and do not correspond to contig ends nor regions of Ns, yes?

    In our production, we consistently use the same split-by-Ns intervals at step [1], [2] and [3]. The split-by-N intervals also splits between contigs. Again, I recommend the same to you. You can simply apply split-by-N intervals to your CombineGVCFs and GenotypeGVCFs steps. No need to run HaplotypeCaller again. I think this will remove the odd calls you are seeing now, i.e. the "appending positions 200001, 200002, 200004" calls. Can you let us know if this is the case by running [2] and [3] over this region using the contig surrounding it as the interval? If this solves the issue and you are satisfied, then great. If not, we'll need to get a bug report from you following instructions in https://software.broadinstitute.org/gatk/guide/article?id=1894, so we can examine in detail what is going on.

    We can only be certain GVCF blocks are separated per contig and by regions with consecutive Ns across which reads would not align. Combining and genotyping mixed-edge blocks, or in a manner that splits a block, well, these can give rise to odd calls. I'm sorry this isn't clear in our documentation. I'll see what I can do to clarify this.

Sign In or Register to comment.