Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CombineGVCFs takes forever if there are no calls in one g.vcf

dbeckerdbecker MunichMember ✭✭✭

Hi,

we have one sample which produced only ~2000 mapped reads and therefore got no calls at all from HaplotypeCaller. Since we are doing the whole pipeline at once, we merged all per sample g.vcf of that run into one to do GenotypeGVCFs. In most runs this taks a few hours. In this case it took 2.5 weeks. As I removed the g.vcf of this sample it was done in 6 hours.

Why does CombineGVCFs takes so much longer if there is one file without calls? I attached a g.vcf of chromosme 2 as a txt file (since g.vcf was not possibe).

Looking forward to your ideas.

Best,
Daniel

Best Answer

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @dbecker,

    Which run took 2.5 weeks--GenotypeGVCFs or CombineGVCFs?

    Looking through the records, I see your depths are extremely low. Oy, I would not include such a sample in a cohort analysis. Interestingly, your extremely-low-coverage VCF contains one variant call for chr2. I should hope your sequencing center performed QC on the data and subsequently either re-sequenced this sample to acceptable coverage depths or let you know that your sample was degraded and informed you that any data from it would be unreliable.

  • dbeckerdbecker MunichMember ✭✭✭

    Hi,

    CombineGVCFs was the tool that took weeks.

    Of course we do QC and re-sequence such samples. The Sample got no peak on TapeStation for this run and was not added to the pool. But it somehow made its way onto the samplesheet. It got a few reads based on sequencing errors in the indexes of other samples and since we are working on a complete automation of our pipeline, analysis with GATK was started. In the end of our pipeline is a Database and researchers can review all QC, coverage and variant calls. A failed sample that accidentally appeared on the samplesheet should get in there and be marked as failed.

    I'm aware that this sample failed, but I don't understand, why CombineGVCFs takes so much more time to combine the samples of a run, if there are no variants in one sample.

    Best,
    Daniel

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @dbecker,

    Yes, I agree it's strange that CombineGVCFs should take so much time to simply combine the GVCFs. One guess is that it has to do with having to handle a large number of low-quality gq-bands that this GVCF presents. You could test this by banding the GQ bands using the --gvcf-gq-bands parameter so that instead of each integer value being a separate band, you have bands of ten integers each, e.g. 0-10, 11-20, etc. You can read more about this parameter here.

  • dbeckerdbecker MunichMember ✭✭✭

    Hi

    Could you explain a little more why these gq-bands seem to be the problem? I don't get it yet. This was 1 out of 25 samples in this run and all of them had non-banded gq. 24 samples were combined quite fast and only adding the one without calls made it slow. So I don't really understand why the gq values shoud be the problem. Furthermore, since this sample has no calls, there are continuous regions of >18,000,000bp with the same gq. This sample has very little variation and therefore shouldn't make the process slower.

    Best regards,
    Daniel

  • dbeckerdbecker MunichMember ✭✭✭

    Hi,

    we tried the GenomicsDB and it took more time than combinegvcf. I don't think we will switch until we move from local servers to the cloud. In a perfect world there will be the possibility to always add new samples to the one existing GenomicsDB by that time. As long as we are working on local servers we will use combinegvcf and just add some automatic checks for the mapping qc before we try to call.

    Thanks for your help!

    Best,
    Daniel

  • Hi,

    I am working on a classic variant calling on a non model organism.

    I have 250 samples and followed the GATK best practice.

    I have produced 250 g.vcf with HaplotypeCaller, now the next step is to combine those g.vcf and produce a .vcf (with GenotypeGVCFs) either:

    A) Solution A: using GenomicsDBImport

    B) Solution B: using CombineGVCFs

    But those methods are super slow.

    I am wondering if it is possible to produce one vcf per g.vcf with GenotypeGVCFs (quite fast) and then combine the 250 vcf with an another program ?

    Does it produce the same result ? Thanks.

    aws training in chennai

    advanced aws training in chennai

  • dbeckerdbecker MunichMember ✭✭✭

    I don't think so. A g.vcf contains quality metrics for all regions, even those without calls (see the file added to my first post). VCF only contain calls and their qc. Therefore you can't combine the qualities for every position in every sample and produce wrong/incomplete information.

    Best,
    Daniel

Sign In or Register to comment.