Missing positions after GenotypeGVCFs

Hi,

Sorry if this is a re-post but I have read through the last post on this issue: http://gatkforums.broadinstitute.org/gatk/discussion/4343/missing-positions-in-the-gvcf-file

I'm having the same issue except with GATK 3.6.0.

I ran haplotype caller on ~2,500 samples using the following cmd:
module load gatk/3.6.0 && module load java/1.8.0_91 && java -jar -Djava.io.tmpdir=/hpf/largeprojects/pray/llau/tmp/ -Xmx24G $GATK -T HaplotypeCaller -R /hpf/largeprojects/pray/llau/internal_databases/gatk_bundle/2.8_b37/human_g1k_v37_decoy.fasta -I .recalibrated.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -rf BadCigar --dontUseSoftClippedBases --min_base_quality_score 20 --emitRefConfidence GVCF -o .raw_variants.gvcf

I then ran genotypeGVCFs using the following cmd:
module load gatk/3.6.0 && module load java/1.8.0_91 && java -jar -Djava.io.tmpdir=/hpf/largeprojects/pray/llau/tmp/ -Xmx24G $GATK -T GenotypeGVCFs --disable_auto_index_creation_and_locking_when_reading_rods --max_alternate_alleles 500 -R /hpf/largeprojects/pray/llau/internal_databases/gatk_bundle/2.8_b37/human_g1k_v37_decoy.fasta -L 20:62025520-63025520 -allSites --variant .g.vcf --variant .g.vcf..... --variant .vcf -o .vcf

If you view the attached screenshot of the .vcf file you can see I'm missing a couple of random positions (chr20: 62025529). I thought -allSites should report all the position? Am I screwing something up?

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @llau
    Hi,

    Can you post some example GVCF records at some of the sites that are missing? Can you confirm this occurs with the latest version?

    Thanks,
    Sheila

  • llaullau Member

    Hi Sheila,

    Thanks so much for replying! Attached is one of the gvcf records with a NON_REF call at one of the missing location 62,025,529 in the NON_REF block of chr20:62,025,454 - 62,025,654 - it is part of a block. Sorry I thought if I used haplotype caller with the above commands that it should output all locations/blocks? I'm going to to try running this command with GATK3.7 and will let you know if I'm still having this issue. Thanks again!

  • llaullau Member

    Hi Shelia, I re-ran the same command using GATK3.7.0 unfortunately, I'm still having the same issues - I gained the position 62,025,529 I was missing in GATK3.6.0 but have lost position 62,024,439.

  • llaullau Member

    I'm currently trying to combineGVCFs first then do genotypeGVCFs

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Your expectation is correct in that the program should write out a record for every position when run in -allSites mode.

    Is this error deterministic or does the missing position vary between runs, for the same version?
  • llaullau Member

    Hi Geraldine,

    Sorry for the delay - it seems the missing positions are deterministic ex. I'm always missing the first two "9th" position. chr20:62025529 and chr20:62025539. However, when I used combineGVCFs I get all positions but once I use genotypeGVCFs with -allSites mode I lose these positions.

    Issue · Github
    by Sheila

    Issue Number
    1659
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • llaullau Member

    Sorry, just to give you context. My primary goal is to calculate internal allele frequencies from my ~2,500 samples. I can try re-calling haplotypeCaller with --emitRefConfidence BP_RESOLUTION, and then genotypeGVCFs -allSites do you think this would work?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    This is quite strange -- are you using a pipelining system to parallelize execution over intervals, by any chance? If so, how are you concatenating scattered segments? This looks like it could be caused by a problem at the gather stage. I have not seen anything like that within HaplotypeCaller/GenotypeGVCFs that I can remember.
  • llaullau Member

    Yes - I'm trying to call the entire genome in 1million base pair chunks. I'm trying to calculate internal allele frequencies using our samples on the entire genome. I use haplotypeCaller to make the gVCF files for 2,500 samples then I use genotypeGVCF to only merge a million base pair interval at a time.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    How are you doing the chunking? You might be getting some edges effects due to your scatter-gather implementation.
  • llaullau Member

    I'm using -L in the genotypeGVCFs to handle the chunking. For example -L 1:0-1000000, -L 1:1000000-2000000. Thanks so much for answering all my questions! I'm curious as to how Broad calculated internal allele frequencies for the ExAC data? Did you use g.vcf (similar to what I'm doing?) or (I'm guess the easier way...) database?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Ah, do you use interval padding at all? We use -ip 100 to remove edge effects.

    I believe ExAC was done the old fashioned way through hierarchical merging with CombineGVCFs, because the database wasn't ready at the time.

    I'll ask if we can perhaps get you some help setting up the database system.
  • llaullau Member

    Thanks Geraldine! I'll give -ip 100 a try. I'm very curious as to what/how they stored all the info -> I'm assuming they are storing homozygous reference calls also with the annotation? We are struggle with how to store all our information and we're only WES. I'm hoping to figure out something that scales well to WGS but am starting to believe it's a unicorn...

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @llau
    Hi,

    I think you will find this thread helpful.

    -Sheila

Sign In or Register to comment.