To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

missing SNPs after gvcf combine and slow combination step

Hi, I'm using GATK ver 3.4 for SNP calling and I have some question about it. My data set has 500 samples, and I used genome data as reference for bowtie/GATK

1) I called SNP by sample (gvcf) with haplotype and then combined gvcf, however, the combination takes a long time, the GATK wants to recreate gvcf.idx files (4 of my gatk mission stuck at this step), one gatk combination finished after about 20 days calculation. I also try to use '-nct' to improve this, but it still stuck at preparing idx files.

2) For that finished gatk combination data set, I also used Unifiedgenotype with Gr.sorted.bam as input to call SNPs. The result is output with Gr.sorted.bam has 5 times more SNPs number than gvcf combination, and most missing SNPs could be found in individual gvcf files but missing in final result.

Could you help me with these? Thank you!

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Mr_Polar
    Hi,

    Can you tell us the exact commands you ran for HaplotypeCaller, CombineGVCFs and GenotypeGVCFs? How many GVCFs are you combining at a time with CombineGVCFs?

    Also, how are you comparing the VCF from HaplotypeCaller/GenotypeGVCFs and UnifiedGenotyper?

    Thanks,
    Sheila

  • Mr_PolarMr_Polar USMember
    edited March 2016

    @Sheila said:
    @Mr_Polar
    Hi,

    Can you tell us the exact commands you ran for HaplotypeCaller, CombineGVCFs and GenotypeGVCFs? How many GVCFs are you combining at a time with CombineGVCFs?

    Also, how are you comparing the VCF from HaplotypeCaller/GenotypeGVCFs and UnifiedGenotyper?

    Thanks,
    Sheila

    Hi Sheila,

    Thank you for replying.

    Here is my data set information:

    Data set 1 has 500 samples, which are aligned to genome data (around 1.5 gb)
    Data set 2 has 150 samples, which are aligned to a much smaller reference

    The commands I used for individual SNP calling is:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample1.Gr.sotred.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 --minPruning 1 -emitRefConfidence GVCF -o sample1.g.vcf

    The gvcf file size for two data sets are:
    Data set1's gvcf files are around 700 MB
    Data set2's gvcf files are around 150 MB

    For data set 1, I tried to use CombineGVCFs to merge gvcf before joint, however, it crashed because lack of memory (I assigned 100 GB memory for it). So for both data sets, I skipped the merge gvcf step and directly went to joint GVCF with GenotypeGVCFs, commands are:

    GenomeAnalysisTK.jar -nct 6 -T CombineGVCFs -R reference.fa --variant sample1.g.vcf --variant sample sample2.g.vcf -o output.vcf

    These two data sets finished. However, I had another 3 similar data sets with same commands stuck while joint gvcfs with GenotypeGVCFs, two of them were trying to rebuild g.vcf.idx for some samples and it took almost one day for one sample, one was totally freeze while loading data.

    For comparing the VCF file from HaplotypeCaller/GenotypeGVCFs and UnifiedGenotyper:

    I used this command for UnifiedGentyper:

    GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref.fa -I sample1.Gr.sorted.bam -I sample2.Gr.sorted.bam -dcov 1000 -glm BOTH -o output.vcf.

    For data set 1, I got about 5 times more SNPs with UnifiedGenotyper than HaplotypeCaller/GenotypeGVCFs (unfinished yet, I only compared the SNPs from one finished chromosome). For data set 2, I also have at least 40% more SNPs from UnifiedGenotyper.

    Then I compared the SNPs from both methods, and I found HaplotypeCaller/GenotypeGVCFs miss lots of tags which are existed in UnifiedGenotyper, but it also have some tags (1/3 of missing tags number) which are not in UnifiedGenotyper results.

    Then I searched the gvcf files for missing SNPs, and I found almost every missing SNPs I tested could be found in individual gvcf files, but when I jointed them, they were missing from final results.

    Because we'll do large data set in the future, and HaplotypeCaller does give us great advantage when dealing with large number of samples, UnifiedGenotyper usually need months to finish the similar job. we hope you could help us to solve our problem about joint gvcf, thank you!

    Bests,
    Peng

    Post edited by Mr_Polar on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Mr_Polar
    Hi Peng,

    You mentioned you were not able to run CombineGVCFs, but you were able to run GenotypeGVCFs for your first two datasets. Did you try to run CombineGVCFs on all the GVCFs at once? Can you try running CombineGVCFs on 10-20 GVCFs at a time? Then, you can run GenotypeGVCFs on those combined GVCFs.

    As for the differences between UnifiedGenotyper and HaplotypeCaller, it seems like the quality of the variants that are missing is bad. Can you confirm that the variants missing in the VCF from HaplotypeCaller have low quality scores in the GVCF? Also, if you can post a few examples, that would be very helpful.

    Thanks,
    Sheila

  • Mr_PolarMr_Polar USMember

    @Sheila said:
    @Mr_Polar
    Hi Peng,

    You mentioned you were not able to run CombineGVCFs, but you were able to run GenotypeGVCFs for your first two datasets. Did you try to run CombineGVCFs on all the GVCFs at once? Can you try running CombineGVCFs on 10-20 GVCFs at a time? Then, you can run GenotypeGVCFs on those combined GVCFs.

    As for the differences between UnifiedGenotyper and HaplotypeCaller, it seems like the quality of the variants that are missing is bad. Can you confirm that the variants missing in the VCF from HaplotypeCaller have low quality scores in the GVCF? Also, if you can post a few examples, that would be very helpful.

    Thanks,
    Sheila

    Hi Sheila,

    Thank you for suggestion, I'll try to combine several GVCF and run GenotypeGVCF again.

    For your second question, we did check the quality scores, there was no significant different from absent/miss SNPs, we're doing an comprehensive comparison of different GATK methods, and we'll submit result to board ASAP.

    Bests,
    Peng

  • Mr_PolarMr_Polar USMember

    We use GATK to call SNPs in genotyping-by-sequencing data. We originally used Unified Genotyper, but then switched to Haplotype Caller because the former was no longer supported. With Haplotype Caller, we call SNPs individually in each sample, and then do joint genotyping. The problem is that we get many more SNPs using Unified Genotyper than using Haplotype Caller. As an example, after a series of filtering steps, we obtained 5534 SNPs using Unified Genotyper. Using Haplotype Caller, followed by the same filtering steps, we obtained 4593 SNPs. 4079 SNPs were common between the two SNP calling methods. We have seen this lower efficiency for SNP calling in multiple analyses, to the effect that we now have switched back completely to using Unified Genotyper. Do you have any idea what is causing the lower efficiency SNP calling with Haplotype Caller?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Mr_Polar, I'm afraid it's impossible to answer your question without more information. Can you tell us whether (and how) you validated the SNPs and have evidence that they should be called? If so, are there similarities in the variant context properties of the the SNPs that are missing? Also, what kind of data are you working with (WGS, Exome, gene panel, ...) and how much coverage do you have on average? What kind of filtering did you apply? Finally, have you tried lowering the emit and call confidence thresholds?

Sign In or Register to comment.