
Seems like CombineGVCFs is freezing

Hello,

I am using the latest version of GATK (gatk-4.0.3.0). According to the online Tool Documentation, CombineGVCFs is required before GenotypeGVCFs for multiple samples, so I am running CombineGVCFs after SNP calling. My problem is that CombineGVCFs runs very slowly, and it appears to have frozen after reading the input files (I only have 6 samples), as shown below:

11:33:49.467 INFO CombineGVCFs - Initializing engine
11:33:56.507 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/151_RWG1_assembly.g.vcf
11:34:02.884 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/179_RWG1_assembly.g.vcf
11:34:08.362 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/338_RWG1_assembly.g.vcf
11:34:13.497 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/374_RWG1_assembly.g.vcf
11:34:22.429 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/449_RWG1_assembly.g.vcf
11:34:27.592 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/RWG1_control_RWG1_assembly.g.vcf
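
For reference, the invocation was of this shape (reconstructed from the log above; the reference fasta and output names are placeholders, not copied from my actual command):

```shell
# Combine the six per-sample GVCFs into one multi-sample GVCF.
# reference.fasta and combined.g.vcf are placeholder names.
gatk CombineGVCFs \
    -R reference.fasta \
    -V 151_RWG1_assembly.g.vcf \
    -V 179_RWG1_assembly.g.vcf \
    -V 338_RWG1_assembly.g.vcf \
    -V 374_RWG1_assembly.g.vcf \
    -V 449_RWG1_assembly.g.vcf \
    -V RWG1_control_RWG1_assembly.g.vcf \
    -O combined.g.vcf
```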

Could anyone help with or explain this issue? Thank you very much!

Best,
Yuanwen


Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    edited April 2018

    Hi @yuanwen, we actually recommend using another tool, called GenomicsDBImport.

    CombineGVCFs is provided as a backup solution for people who cannot use GenomicsDBImport (which has some limitations, for example it can only run on diploid data at the moment). We know it is quite inefficient and typically requires a lot of memory. I encourage you to try GenomicsDBImport first before spending any more effort trying to get CombineGVCFs to work for you.

    See this doc article for more info: https://software.broadinstitute.org/gatk/documentation/article?id=11813
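
    A minimal sketch of that alternative (sample names, workspace path, and the interval here are placeholders, not from this thread):

```shell
# Consolidate per-sample GVCFs into a GenomicsDB workspace for one
# interval, then joint-genotype directly from the workspace.
gatk GenomicsDBImport \
    -V sample1.g.vcf \
    -V sample2.g.vcf \
    --genomicsdb-workspace-path my_database \
    -L contig_1

# GenotypeGVCFs reads the workspace via a gendb:// URI.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://my_database \
    -O joint_genotyped.vcf.gz
```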

  • maierpa · Member
    edited April 2018

    Hi @Geraldine_VdAuwera ,

    I encountered the same problem today, documented here. Can you answer these 2 questions:

    (1) If you sidestep the slow CombineGVCFs tool by passing multiple samples directly into GenotypeGVCFs v3.8, the estimated time to completion is very long (years). Would GenotypeGVCFs finish in a reasonable time if I used GenomicsDBImport first?

    (2) For data with ~250,000 contigs, what is the recommended pipeline for splitting up by interval and then getting a single output? The documentation says "for now you need to run on each interval separately. We recommend scripting this of course." In other words: how?

    Thanks!
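
    For concreteness, the kind of interval scripting I have in mind might look like this (a sketch on a toy .fai index; all file names are made up):

```shell
# Toy samtools-style .fai index (name<TAB>length) standing in for the
# real reference index.
printf 'contig_1\t1500\ncontig_2\t900\ncontig_3\t2000\n' > toy.fai

# Write one "contig:1-length" interval per line.
awk 'BEGIN {FS="\t"} {print $1 ":1-" $2}' toy.fai > intervals.list

# In practice you would launch one import job per interval; here the
# commands are only echoed, since the real run needs GATK installed.
while read -r ivl; do
    echo "gatk GenomicsDBImport -L ${ivl} --genomicsdb-workspace-path db_${ivl%%:*}"
done < intervals.list
```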

  • yuanwen · Member

    @Geraldine_VdAuwera , thank you very much for the prompt reply! I will try GenomicsDBImport as you suggest, but I have the same question as @maierpa about what kind of intervals I should use. My reference is a de novo transcriptome assembly with more than 400,000 contigs. Could you please give me some suggestions?

    I appreciate the help!

    Best,
    Yuanwen

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Yes I would expect GenotypeGVCFs to have much more reasonable runtime if you run it on a GenomicsDB workspace.

    If you go to the link I referenced above you’ll find more info and a link to the script we use in production. It's probably more complex than you need, so you'll need to adapt it.

    To be frank we never work with genomes with so many contigs, so we don't have a ready-made solution for that. Typically for unfinished genomes we recommend making an artificially scaffolded reference to reduce the number of contigs. For de novo transcriptome I'm not sure there's any good, principled way to do this. We usually prefer to align RNAseq data to the common reference genome rather than work with de novo transcriptomes, which create a lot of complexity downstream when it comes to comparing things. But to be fair we haven't done much with RNAseq, so take that with a grain of salt.

  • Thanks @Geraldine_VdAuwera ,

    You wrote: "I encourage you to try GenomicsDBImport first before spending any more effort trying to get CombineGVCFs to work for you. See this doc article for more info: https://software.broadinstitute.org/gatk/documentation/article?id=11813" and "If you go to the link I referenced above you'll find more info and a link to the script we use in production."

    I don't see any link to a script at the above link. Please clarify?

    Unfortunately, even using GenomicsDBImport, it takes 2 hours for 1 contig. Scaling up to ~250,000 contigs, that's an estimated 60 years to completion, not even including the GenotypeGVCFs part (which would likely more than double the time). Parallelizing will not make a huge difference here.

    I'm getting the sense from your forums and answers that GATK is not capable of genotyping my dataset, which is only moderately sized (3 individuals, 250,000 contigs). Completion times for CombineGVCFs, GenomicsDBImport, and GenotypeGVCFs (at least paired with CombineGVCFs) run to years! This is obviously not tenable. @Geraldine_VdAuwera , @Sheila , @shlee , do you have any other suggestions?

  • SkyWarrior · Turkey · Member ✭✭✭

    Having bazillions of contigs is not useful. There are ways to consolidate contigs to reduce their number. There must be a topic on that here on the forums.

  • yuanwen · Member

    Thank you all for your suggestions. Since right now we only need allele depth information for each variant site, I will probably go ahead and run in -ERC BP_RESOLUTION mode, then extract the allele depth for each possible variant.

    Best,
    Yuanwen
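
    A BP_RESOLUTION run would look roughly like this (reference and BAM names are placeholders):

```shell
# Emit a record for every covered position, including hom-ref sites,
# so per-site allele depths (AD) can be extracted afterwards.
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -O sample.bp_resolution.g.vcf \
    -ERC BP_RESOLUTION
```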

  • Thank you all for the input.

    For anyone else down the road with a similar situation (few samples, RNAseq, goal of genotyping), I found these threads particularly helpful:
    https://www.biostars.org/p/284097/
    https://www.biostars.org/p/10926/

    GATK is a great program in general, but the above suggests it currently lacks workable support for multi-sample genotyping on this kind of data. I opted to use GATK's Best Practices for RNAseq, including HaplotypeCaller (with --useNewAFCalculator, to make sure singletons were included), and then bcftools to merge the VCF files. If the goal is to find well-supported SNPs (especially singletons), joint calling has an increased false-negative rate, so the whole CombineGVCFs/GenomicsDBImport/GenotypeGVCFs debacle can probably be safely sidestepped for <10 samples. Of course, this makes quality filters all the more important.
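
    For reference, the merge step was along these lines (file names are placeholders; inputs must be bgzipped and indexed first):

```shell
# Compress and index each per-sample VCF, then merge into one
# multi-sample VCF. sample1/sample2 are placeholder names.
bgzip sample1.vcf && tabix -p vcf sample1.vcf.gz
bgzip sample2.vcf && tabix -p vcf sample2.vcf.gz
bcftools merge sample1.vcf.gz sample2.vcf.gz -O z -o merged.vcf.gz
```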

  • SkyWarrior · Turkey · Member ✭✭✭

    GATK is all about multi-sample genotyping. I have 1200 WES samples that I joint-genotyped, and I can catch disease-causing singleton variants as efficiently as I could with single-sample genotyping (with a lower false-positive rate). Single-sample variant calling will generate ./. results in a multisample VCF for samples that had no call at a position. You will not be able to tell whether that is due to a deletion, a structural variant, or simply that the region is hom-ref for that sample.

  • Homozygous reference sites and sites with missing data can be distinguished by outputting all sites (i.e. BP_RESOLUTION or GVCF mode). Homozygous-ref sites with sufficient confidence will then be 0/0, not ./. . Also, low-coverage variants can be partially rescued by running a pooled sample through HaplotypeCaller (e.g. S1, S2, S3, pooled_S1_S2_S3). Again, the reason for pursuing single-sample genotyping is that the current pipeline for multi-sample genotyping takes weeks, months, or years to finish, not that joint genotyping wouldn't be preferable in many cases. There are numerous threads describing problems finishing in a reasonable amount of time. It's unclear that scaffolding contigs would solve this efficiency problem, and as was stated above, there's no good, principled way to do it anyway. This is the best solution that I'm aware has been presented.

  • Update: for those in the future seeking an answer on this topic, I solved the problem by scripting a solution. Scripts can be made available upon request. Here's the protocol that worked for me:

    (Keep in mind, my goal was calling only biallelic SNPs for a de novo transcriptome, and genotyping the same 3 individuals used to construct that transcriptome.)

    1) Run the entire GATK 4.0.1.2 pipeline on pooled samples, with best practices for RNAseq. Since HaplotypeCaller is relatively fast, and the joint genotyping tools are relatively slow, identifying biallelic SNPs first allows much faster computation in the steps below. Some important addenda to the RNAseq protocol: (1) Use -new-qual, it's faster (and doesn't lose singletons in the steps below); (2) choose the correct -ploidy, in this case 6, for 3 diploid individuals. After using VariantFiltration as outlined, I used SelectVariants to select only biallelic SNPs with --select-type-to-include SNP and --restrict-alleles-to BIALLELIC.

    2) Run the same pipeline for each sample separately, up until and including the HaplotypeCaller step, except this time choose the GVCF output option. Now is the fun part.

    3) Use the break_blocks tool from gvcftools to convert the GVCF to a VCF with all positions. (This probably could have been achieved using -ERC BP_RESOLUTION in the previous step, but I hadn't yet crafted this solution at that point :smile: ). For gvcftools, you can create a bed file for your fasta easily enough from the .fai index using awk, e.g.
    awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' $fai_file > fasta.bed;
    You can also crunch down the file size at this point by removing the ##contig headers, e.g.
    pattern="^##contig";
    sed "/$pattern/d" file.vcf > smaller.file.vcf;

    4) Then run a script that temporarily renames all chromosomes to a single one, with all positions made sequential. This is best done by creating an artificial reference (a list of chromosomes, positions, and REF/ALT SNP values) and an artificial fasta file from the REF values. Be very careful here: if your code is not perfect, you may accidentally rename chromosomes/positions differently for different individuals. I implemented this in R using the vcfR and plyr packages. You'll also want to make an index for your shiny new fasta file using samtools faidx, and a dictionary using Picard's CreateSequenceDictionary.

    5) Now, add in your fake ##contig entries in each VCF using GATK's UpdateVCFSequenceDictionary, or else GenomicsDBImport will yell at you.

    6) You can finally read in the VCF files (remember, these are just positions at globally identified SNPs) using GenomicsDBImport. Given the small file sizes, it will run very fast now. I had about 500,000 SNPs, and it ran in about 90 minutes.

    7) Last major step, run GenotypeGVCFs on the database. Don't forget -new-qual, and for bi-allelic SNPs you can specify --max-alternate-alleles 2 to speed things up. This took less than 3 hours for me. Not bad compared to the estimates of 10-60 years I was getting with the conventional protocol. This has the advantage that I won't be dead when it finishes.

    8) Afterwards, you'll need to run a script similar to the one in (4), that reverses the chromosomes/positions back to normal. Again, be very careful, and test your code thoroughly. A couple of variants may have differing REF/ALT SNP calls compared to the pooled run (or even be indels). So it's a good idea to run SelectVariants again, and then use R to check that all SNPs match up correctly. When in doubt, remove possible erroneous SNPs.

    That's it! I know GATK is working on making GenomicsDBImport useable on multiple intervals, and that may or may not make it fast enough for this type of data (with hundreds of thousands of contigs). I'm posting this in the hopes that it saves someone the massive headache I got from working through this problem.

    Cheers!
    Paul
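
    The chromosome-renaming trick in steps 4 and 8 can be sketched at the shell level like this (a toy awk version of what I actually did in R; the file names are made up):

```shell
# Toy VCF body (header omitted): CHROM POS ID REF ALT
printf 'c1\t5\t.\tA\tG\nc2\t7\t.\tT\tC\nc3\t2\t.\tG\tA\n' > body.tsv

# Collapse everything onto one artificial chromosome "A" with sequential
# positions, writing a map file so step 8 can reverse the renaming.
awk 'BEGIN {FS=OFS="\t"}
     {n++; print $1, $2, "A", n > "chrom_map.tsv"; $1="A"; $2=n; print}' \
    body.tsv > renamed.tsv
```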

  • prasundutta87 · Edinburgh · Member
    edited May 2018

    Thanks a lot, Paul. I think it's a good idea. We had discussed doing something like this, as we were in a similar soup in our lab, but since we had used -ERC GVCF and not -ERC BP_RESOLUTION, gvcftools seems like a boon. I was wondering if the scripts for this protocol could be made available, so that I can adapt them to suit my requirements.

  • maierpa · Member

    Hi @prasundutta87,

    Sure. I've uploaded the scripts (and a couple necessary sample input files) to github.
    https://github.com/paulmaier/GATK-Joint-Genotyping-Pipeline

    You'll need to change around the path and sample names, etc., but this should contain all the info needed to repeat what I did. Good luck!

    Paul

  • prasundutta87 · Edinburgh · Member

    Thanks a lot @maierpa. I will write to you on GitHub if I have any queries, or do you prefer here? Maybe it will be helpful to other users as well?

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    @maierpa
    Hi Paul,

    This is so nice of you to post a solution here. Hopefully it will help other users in the future, and not just @prasundutta87 :smile:

    -Sheila

  • maierpa · Member

    No problem! :smile: Yeah, feel free to post questions/comments here, and/or on github.

  • prasundutta87 · Edinburgh · Member
    edited May 2018

    Hello @maierpa. So, I tried your approach but wrote my own scripts, the reason being that my reference genome is 2.83 Gb and an unblocked GVCF (not even gzipped yet; you can check the tail of the unzipped file) is 272 GB. So R is out of the picture.

    I used a small script to unblock the GVCF, rename all chromosomes to a single artificial one, and make the positions sequential:

    break_blocks --region-file fasta.bed --ref <original_reference_genome> < blocked_gvcf.g.vcf | awk 'BEGIN{FS=OFS="\t";counter=1;} $0 ~ /#/ {print $0}; $0 !~ /#/ {$1="A";$2=counter++;print;}' | bgzip > unblocked_single_scaffold.g.vcf.gz

    My reference genome has 2,836,166,969 bases, so technically my unblocked GVCF should also have that many sites. But it has 2,833,405,762. Am I missing something here? @Sheila may also be able to help here.

    Secondly, I cannot tabix-index my unblocked GVCF file. According to this post about htslib (https://github.com/samtools/htslib/issues/348), the TBI index format does not work beyond 2^29 bases (about 536 million). The CSI index format, on the other hand, does support coordinates up to 2^31. But that also fails, because I have more bases than 2^31. One solution would be to split the unblocked GVCF into two artificial scaffolds rather than one. Maybe that will help, but I still need to try it.

    P.S. I am dealing with WGS germline variant calling
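
    The two-scaffold idea could be sketched like this (untested on real data; a toy awk demo with a small LIMIT standing in for the real index limit):

```shell
# Toy single-scaffold body: CHROM POS with sequential positions.
printf 'A\t1\nA\t2\nA\t3\nA\t4\nA\t5\n' > body.tsv

# Roll over to a second artificial scaffold "B" once the per-scaffold
# position would pass LIMIT (a value under the index cap in real use;
# 3 here just for the demo).
awk -v LIMIT=3 'BEGIN {FS=OFS="\t"}
     {n++;
      if (n <= LIMIT) {$1="A"; $2=n} else {$1="B"; $2=n-LIMIT}
      print}' body.tsv > split.tsv
```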

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    @prasundutta87
    Hi,

    My reference genome has 2836166969 bases, so technically my unblocked gvcf should also have that many sites.

    Can you clarify what you mean by "unblocked GVCF"? How did you generate it? The exact command would help.

    Thanks,
    Sheila

  • prasundutta87 · Edinburgh · Member
    edited May 2018

    Hi @Sheila,

    An unblocked GVCF is one generated using -ERC BP_RESOLUTION; it has all the sites, including homozygous-reference ones. Using the 'break_blocks' program from gvcftools, one can break up the GVCF blocks created with the -ERC GVCF parameter. The exact command I used is in my post above.

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    @prasundutta87
    Hi,

    I see. How did you find out there are more sites in your VCF compared to the reference? Can you run a small test on a few bases using -L and see if that gives different numbers?

    Thanks,
    Sheila

  • prasundutta87 · Edinburgh · Member

    Hi @Sheila,

    I checked it using bcftools view -H <g.vcf> | wc -l

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    @prasundutta87
    Hi,

    It looks like you are checking the number of lines? It is possible some sites take up more than one line.

    -Sheila

  • prasundutta87 · Edinburgh · Member

    Yes. Using bcftools view with the -H parameter (remove header), I am just counting the variant sites present in my unblocked g.vcf file. And the problem is that my unblocked g.vcf has fewer sites than my reference genome has bases. They should be exactly the same, right?

  • SkyWarrior · Turkey · Member ✭✭✭

    Does your reference genome contain N masked sites?

    Do your reads cover the whole reference genome or are there blank spots with no reads mapped?

    Do you have INDELS?

  • prasundutta87 · Edinburgh · Member
    edited May 2018

    The reference genome I am using is a draft assembly and has a lot of gaps (Ns). It is also soft-masked, but I don't think soft masking will cause any issue. I am not sure about blank spots, though.

    On extracting all the sites and assembling them into a sequence, there were Ns in it as well.

  • SkyWarrior · Turkey · Member ✭✭✭

    Can you check the number of Ns and compare it to your other numbers? That may explain why some of your numbers don't match.

  • prasundutta87 · Edinburgh · Member

    Hi, sorry for replying late. The number of Ns in my reference genome and the number of Ns present in the GVCF file are exactly the same.

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭
    edited June 2018

    @prasundutta87
    Hi,

    Can you check the indel records and post some examples here? To confirm, you did not use -L to generate the GVCF?

    Thanks,
    Sheila

  • prasundutta87 · Edinburgh · Member

    Sure. Some indel examples:

    NW_005435090.1 242 . T TCAGTTA,<NON_REF> 0.75 . BaseQRankSum=1.282;ClippingRankSum=0;DP=11;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=37136;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB 0/1:4,1,0:5:30:30,0,165,42,168,210:2,2,0,1
    NW_005435090.1 401 . A AT,<NON_REF> 38.73 . BaseQRankSum=-0.842;ClippingRankSum=0;DP=8;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=25984;ReadPosRankSum=-0.253 GT:AD:DP:GQ:PL:SB 0/1:2,3,0:5:46:76,0,46,82,55,138:0,2,1,2
    NW_005435091.1 303 . CAAAA C,<NON_REF> 0 . DP=7;MLEAC=0,0;MLEAF=nan,nan;RAW_MQ=21229 GT:PGT:PID ./.:0|1:303_CAAAA_C
    NW_005435092.1 227 . TACCTGA T,<NON_REF> 54.52 . DP=4;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=7391 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:90,6,0,90,6,90:0,0,1,1
    NW_005435093.1 151 . A AGCTTCTAAAGAGTGCAT,<NON_REF> 54.52 . DP=8;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=19740 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:90,6,0,90,6,90:0,0,2,0
    NW_005435093.1 237 . G GT,<NON_REF> 20.56 . DP=9;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=23340 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:56,6,0,56,6,56:0,0,2,0
    NW_005435093.1 566 . T TA,<NON_REF> 57.73 . BaseQRankSum=-0.16;ClippingRankSum=0;DP=10;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.589;RAW_MQ=35881;ReadPosRankSum=0.366 GT:AD:DP:GQ:PL:SB 0/1:5,4,0:9:95:95,0,126,110,138,248:2,3,0,4

    I did not use -L to generate the GVCF.

  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    @prasundutta87
    Hi,

    Okay, so can you check the position after NW_005435091.1 303? Is it 304? Can you post the records following that? I am not certain whether deleted sites are included in the GVCF output for a single sample.

    -Sheila
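
    A quick shell-level illustration of why deletion records could explain the shortfall (toy records; whether the GVCF output actually omits the deleted positions is exactly the open question above):

```shell
# Toy records: a SNP, a 4-base deletion (REF=CAAAA spans 5 reference
# bases), and an insertion (REF spans 1 base).
printf 'c1\t100\t.\tA\tG\nc1\t200\t.\tCAAAA\tC\nc1\t300\t.\tT\tTA\n' > body.tsv

# A deletion record covers length(REF) reference bases in a single line,
# so counting lines undercounts reference positions wherever deletions
# occur.
lines=$(wc -l < body.tsv)
bases=$(awk 'BEGIN {FS="\t"} {sum += length($4)} END {print sum}' body.tsv)
echo "records=${lines} ref_bases_covered=${bases}"
```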
