
Larger Sample Sizes are Reducing SNPs dramatically

Ace (United States)
I'm running a pipeline with a little over 200 samples. I've tried this in both the "regular" and GVCF modes of HaplotypeCaller, and have gotten ~1200 and ~500 SNPs, respectively. Biologically, I don't believe this is possible given the diversity of the dataset. Additionally, with smaller sample sizes I get many more SNPs: after removing 65 of the samples I had over 12,000 SNPs, and in a 35-sample subset I used to pilot the pipeline I was getting hundreds of thousands. Removing low-coverage samples, or samples with highly variable coverage, did not help. Is this a normal outcome of HaplotypeCaller, and how can I rescue the lost SNPs?

Answers

  • bhanuGandham (admin, Cambridge MA)

    Hi

    Can you please post the version of HaplotypeCaller you are using and the exact commands you ran?

  • Ace (United States)
    Sure!

    I am using GATK version 4.0.10.

    I initially tried with this command (this is also how I did my initial 35 samples which had no errors):

    gatk HaplotypeCaller -R ${ref}.fasta -O ${group}.vcf.gz -I BamFiles.list -ploidy 1

    Then I tried the GVCF method using
    #Step 1: Getting individual gVCFs
    for sample in $(cat samples.list); do
    gatk HaplotypeCaller -I $(cat BamFiles.list | grep $sample) -R ${ref}.fasta -O ${sample}.g.vcf.gz -ploidy 1 -ERC GVCF
    done

    #Step 2: Combining individual gVCFs
    #note, after this step everything still seems "normal". It's only at step 3 the variants seem to get cut down drastically
    gatk CombineGVCFs -R ${ref}.fasta $(for sample in $(cat samples.list); do echo "--variant" ${sample}.g.vcf.gz; done) -O GVCF.g.vcf

    #Step 3: Genotyping the combined gVCF
    gatk GenotypeGVCFs -R ${ref}.fasta --variant GVCF.g.vcf -O GVCF.vcf

    I'm currently working on a pipeline that uses the GenomicsDB (gendb) method instead of CombineGVCFs, but since this is a haploid organism, and as far as I know GenomicsDB is not recommended for non-diploid organisms, I'm a little wary of that approach. Thank you for your help!
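As an aside, the inline command substitution used above to pass one --variant flag per sample can be fragile with quoting; a bash array makes the argument list explicit. This is a minimal sketch, assuming samples.list holds one sample name per line (the two demo names below are placeholders, not samples from this thread):

```shell
# Build the --variant argument list for CombineGVCFs as a bash array.
# The two sample names here are illustrative placeholders.
printf 'sampleA\nsampleB\n' > samples.list

variant_args=()
while read -r sample; do
    variant_args+=(--variant "${sample}.g.vcf.gz")
done < samples.list

echo "${variant_args[@]}"
# → --variant sampleA.g.vcf.gz --variant sampleB.g.vcf.gz

# The real invocation would then be:
# gatk CombineGVCFs -R ${ref}.fasta "${variant_args[@]}" -O GVCF.g.vcf
```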
  • bhanuGandham (admin, Cambridge MA)
    edited June 25

    Hi @Ace

    Sorry, I got your question wrong. I am looking into this and will get back to you shortly. It would be helpful to know which haploid organism you are analyzing.

  • bhanuGandham (admin, Cambridge MA)
    edited June 25

    @Ace

    Take a look at the GQ values in the "regular" VCF; the additional variants you are seeing could be edge cases. Also apply variant filtration to the VCFs generated by the two modes and see whether the difference is still that large. Let me know what you find.
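For reference, a hard-filtering step along the lines suggested here might look like the following. This is a sketch only: the QD and FS annotations and their thresholds are common GATK hard-filter examples, not values from this thread, and the file names simply reuse those from the commands above. The command is printed rather than executed:

```shell
# Hypothetical VariantFiltration invocation; thresholds are illustrative
# placeholders and should be tuned to the dataset.
cmd='gatk VariantFiltration -R ref.fasta -V GVCF.vcf --filter-expression "QD < 2.0 || FS > 60.0" --filter-name basic_snp_filter -O GVCF.filtered.vcf'
echo "$cmd"
```

Sites failing the expression are flagged (not removed) with the given filter name, so both modes' outputs can be compared after filtering on PASS sites.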

  • Ace (United States)
    The issue isn't that two different methods give me different results; that I can chalk up to filtering. The issue is that different numbers of samples give me wildly different results, where the run with MORE samples actually yields FEWER variants than any smaller combination of samples. In theory this should not happen unless adding samples reduces precision.
    The haploid genome is around 25 Mbp.
  • bhanuGandham (admin, Cambridge MA)
    edited July 1

    Hi @Ace

    I checked with the dev team: you see this discrepancy because when you give HaplotypeCaller multiple input BAMs it performs joint calling across all of the inputs, whereas when you run HaplotypeCaller on individual samples in GVCF mode it only calls variants in that one sample. The results become equivalent after you run CombineGVCFs and GenotypeGVCFs on the GVCFs.

  • Ace (United States)
    edited July 9
    Yes, I get similar outputs when I run HC > CombineGVCFs > GenotypeGVCFs as when I run HaplotypeCaller alone. That's not the issue.

    The issue is that I get unexpected outputs when I run my full set of over 200 samples versus a subset of around 30. In theory, the 200-sample run should yield about as many or more variants than the subset, but it yields SIGNIFICANTLY fewer.

    As an update, it appears that only part of the first chromosome is being captured, and I believe it's a memory issue. However, I am already dedicating 150 GB of RAM with -Xmx at 120 GB, and "just giving more memory" isn't an option. So I think it's simply too many high-coverage samples for GATK genotyping. I am currently trying to run 4 separate batches and later merge them with CombineVariants. While I know this isn't ideal, I am running out of options given how much memory the GenotypeGVCFs step requires with many samples. Is variant recalibration an option to improve accuracy in this case?
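The batching workaround described above can be sketched with the standard split utility. The 200 generated sample names are stand-ins for the real ones, and the batch size of 50 is illustrative:

```shell
# Generate a stand-in list of 200 placeholder sample names.
seq -f 'sample%g' 1 200 > samples.list

# Split into batches of 50 samples each: batch_aa, batch_ab, batch_ac, batch_ad.
split -l 50 samples.list batch_

ls batch_*
# Each batch_* list would then drive its own CombineGVCFs + GenotypeGVCFs run,
# with the per-batch VCFs merged afterwards.
```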
  • bhanuGandham (admin, Cambridge MA)
    edited July 11

    Hi @Ace

    How much physical memory does your machine have?
    Take a look at this thread to understand memory requirements for GATK tools: https://gatkforums.broadinstitute.org/gatk/discussion/8755/filtersamreads-java-lang-outofmemoryerror-java-heap-space
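As a reference point, GATK 4 passes JVM settings through --java-options, and -Xmx needs headroom below the total allocation because the JVM also uses native (non-heap) memory. A hedged sketch using the poster's numbers, printed rather than executed:

```shell
# Hypothetical GenotypeGVCFs invocation with an explicit 120 GB heap;
# the machine itself would need meaningfully more than 120 GB available.
cmd='gatk --java-options "-Xmx120g" GenotypeGVCFs -R ref.fasta --variant GVCF.g.vcf -O GVCF.vcf'
echo "$cmd"
```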
