Does GenotypeGVCFs call SNPs fixed in only 1 sample?

Apologies if this was addressed elsewhere, but I have looked carefully through the documentation. Simply put: using the joint analysis pipeline of GATK's HaplotypeCaller (-ERC GVCF) -> CombineGVCFs-> GenotypeGVCFs, are SNPs called when they are fixed in only 1 sample? For example, imagine 3 samples total, where the correct genotypes should be: A/A, A/A, T/T. Will this variant be called correctly, given sufficient coverage/quality? This is very important to what I am trying to accomplish, as I only have 3 samples, and would like to analyze SNPs that are fixed between samples. It's unclear to me that this pipeline will work, since HaplotypeCaller will not see any find these variants within samples individually. But maybe GenotypeGVCFs finds them? This is not clearly documented.

Best Answer

Answers

  • An afterthought: would these SNPs be discovered by HaplotypeCaller based on discordance between each sample and the reference fasta? Again, sorry to be dense here.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @maierpa
    Hi,

    would these SNPs be discovered by HaplotypeCaller based on discordance between each sample and the reference fasta?

    HaplotypeCaller will output variants at sites that are different from the reference FASTA file. It will not use the other samples and compare their variants. You can determine differences in variants between samples using other tools (like SelectVariants).

    In your case, are you sure you are only going to have 3 samples? We recommend using the GVCF workflow because it is scalable to many samples and allows for additional samples at later dates.

    The GVCF workflow and running HaplotypeCaller on all samples together should produce the same results. There were some concerns of singletons being dropped, but this is fixed by using --use-new-qual-calculator. Also, are your samples related? If so, the Genotype Refinement workflow may interest you.

    -Sheila

  • maierpamaierpa Member
    edited April 2018

    Thanks Sheila. Yes, there are 3 essentially complete de novo transcriptomes, so there will only be 3 samples total. To be clear, will the workflow above correctly find my example variant? i.e...

    A (fasta reference)
    A/A (sample 1)
    A/A (sample 2)
    T/T (sample 3)

    Using the GVCF workflow of...
    HaplotypeCaller (-ERC GVCF) on each sample -->
    CombineGVCFs -->
    GenotypeGVCFs -->
    VariantFiltration
    ...will the 3-sample VCF file contain this variant, and these genotypes? (Since T is a variant compared to the fasta in Sample 3?) Or do I need to explore some other tool, such as SelectVariants? Do I need to rerun the pipeline using --use-new-qual-calculator on HaplotypeCaller, or should this work without adding that flag? (Running HaplotypeCaller on 3 large BAM files took about 3 days, and CombineGVCFs has been running for 1 day without sign of finishing. I want to make sure I'm not wasting any time!)

    These are not related samples, they are sequenced from disparate regions of the species distribution to find range-wide SNPs.

  • @Sheila,

    Couple more points of clarification. You said GenotypeGVCFs can be run without CombineGVCFs, but this only seems to be true for v3.8, not v4.0+, right? The overview states it takes a single input, so it might be easier to use this earlier version for the GenotypeGVCFs, to skip the slow CombineGVCFs step?

    Regarding intervals (-L) to speed up compute time... I'd like to use a vcf file of biallelic SNPs to speed up computation, with interval padding (-ip) of 100, since the reads are 100bp. From other posts, I've gleaned that it's only necessary to specify -L and -ip once, i.e. during the HaplotypeCaller step. But I'm wondering if I can further speed things up during GenotypeGVCFs by invoking only 1 bp SNP positions as intervals (i.e. without any padding). Those positions are all I'm interested in. My thinking was that GenotypeGVCFs does not seem to use the reads, hence the padding might not make any difference for this step, but I'm worried this will not work since v3.8 mentions using a 20 bp sliding window.

  • @Sheila, can you please clarify these other 2 questions? Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @maierpa
    Hi,

    Yes, I was thinking you were using GATK3. It is true you need to run CombineGVCFs or GenomicsDBImport before GenotypeGVCFs in GATK4. Sorry for the confusion.

    The overview states it takes a single input, so it might be easier to use this earlier version for the GenotypeGVCFs, to skip the slow CombineGVCFs step?

    The problem with this is that you generated the GVCFs with GATK4. It is not a good idea to mix and match versions.

    But I'm wondering if I can further speed things up during GenotypeGVCFs by invoking only 1 bp SNP positions as intervals (i.e. without any padding).

    I think the recommendation was that you don't have to use -L after HaplotypeCaller. But, I think it would be a good idea to try using -L in GenotypeGVCFs, but with the same padding and intervals as HaplotypeCaller. I feel like I have seen some speedup (at least on small intervals) with -L in GenotypeGVCFs. Also, the team is aware that GenotypeGVCFs is slow, and they have said that using -newQual is going to help with that. However, I am unsure if you can use newQual if you did not use it in HaplotypeCaller. It seems like it is okay, but I have asked in this issue. If I don't update you with an answer on that, you can check the issue to see if someone from the team has responded there.

    -Sheila
    P.S. After writing all of that, I realized you probably were hoping to go with using GenotypeGVCFs from GATK3. I don't think it is a good idea. You will have to solve the issue with CombineGVCFs/GenomicsDBImport before proceeding to GenotypeGVCFs. Hopefully the other thread helped. If not, it may just be worth it to run HaplotypeCaller with GATK3 and proceed from there.

  • Update: for those in the future seeking an answer on this topic, I solved the problem by scripting a solution. Scripts can be made available upon request. Here's the protocol that worked for me:

    (Keep in mind, my goal was calling only biallelic SNPs for a de novo transcriptome, and genotyping the same 3 individuals used to construct that transcriptome.)

    1) Run the entire GATK 4.0.1.2 pipeline on pooled samples, with best practices for RNAseq. Since HaplotypeCaller is relatively fast, and the joint genotyping tools are relatively slow, identifying biallelic SNPs first allows much faster computation in the steps below. Some important addenda to the RNAseq protocol: (1) Use -new-qual, it's faster (and doesn't lose singletons in the steps below); (2) choose the correct -ploidy, in this case 6, for 3 diploid individuals. After using VariantFiltration as outlined, I used SelectVariants to select only biallelic SNPs with --select-type-to-include SNP and --restrict-alleles-to BIALLELIC.

    2) Run the same pipeline for each sample separately, up until and including the HaplotypeCaller step, except this time choose the GVCF output option. Now is the fun part.

    3) Use the break_blocks tool from gvcftools to convert from GVCF to a VCF with all positions. (This probably could have been achieved using -ERC BP_RESOLUTION in the previous step, but I hadn't yet crafted this solution at that point :smile: ). For gvcftools, you can create a bedfile for your fasta easily enough using awk, e.g. awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' $fai_file > fasta.bed; You can can also crunch down file size at this point by removing the ##contig headers, e.g.
    pattern="^##contig";
    sed "/$pattern/d" file.vcf > smaller.file.vcf;

    4) Then run a script that temporarily changes all chromosomes to a single one, and all positions to sequential. This is best done by creating an artificial reference (list of chromosomes, positions, REF/ALT SNP values) and an artificial fasta file from the REF values. Make sure you are very careful here, if your code is not perfect, you may accidentally rename chromosomes/positions differently for different individuals. I implemented this in R using the vcfR and plyr packages. You'll also want to make indices for your shiny new fasta file using samtools faidx, and a dictionary using picardtools CreateSequenceDictionary.

    5) Now, add in your fake ##contig entries in each VCF using GATK's UpdateVCFSequenceDictionary, or else GenomicsDBImport will yell at you.

    6) You can finally read in the VCF files (remember, these are just positions at globally identified SNPs) using GenomicsDBImport. Given the small file sizes, it will run very fast now. I had about 500,000 SNPs, and it ran in about 90 minutes.

    7) Last major step, run GenotypeGVCFs on the database. Don't forget -new-qual, and for bi-allelic SNPs you can specify --max-alternate-alleles 2 to speed things up. This took less than 3 hours for me. Not bad compared to the estimates of 10-60 years I was getting with the conventional protocol. This has the advantage that I won't be dead when it finishes.

    8) Afterwards, you'll need to run a script similar to the one in (4), that reverses the chromosomes/positions back to normal. Again, be very careful, and test your code thoroughly. A couple of variants may have differing REF/ALT SNP calls compared to the pooled run (or even be indels). So it's a good idea to run SelectVariants again, and then use R to check that all SNPs match up correctly. When in doubt, remove possible erroneous SNPs.

    That's it! I know GATK is working on making GenomicsDBImport useable on multiple intervals, and that may or may not make it fast enough for this type of data (with hundreds of thousands of contigs). I'm posting this in the hopes that it saves someone the massive headache I got from working through this problem.

    Cheers!
    Paul

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @maierpa
    Hi,

    Thanks for sharing/posting your solution here :smiley:

    -Sheila

Sign In or Register to comment.