The Frontline Support team will be offline February 18 for President's Day but will be back February 19th. Thank you for your patience as we get to all of your questions!
Using CombineVariants on gvcf files.

Currently I am following GATK best practice for using HC 3.0+, however I'm splitting my calls to chromosomal regions (-L). Next are the following step I perform working up to GenotypeGVCF and my question.
1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
2 - I use CombineGVCF to merge 2 .. n number of individuals together. This is done because some analysis have 300+ individuals.
3- I then use CombineGVCF again to merge all the file from step 2 into one large gvcf file for one large joint GenotypeGVCF step.
4 - GenotypeGVCF is done again based on chromosomal regions (-L), which is followed by a additional CatVariants before VQSR.
The question I have this this:
Given the size of the analysis I have noticed that my CombineGVCF done in step 3 can take anywhere from 4-8 hours. I was wondering if I could change this step to use CombineVariants and have the result be the same (unlost data). The main reason for this would be because GATK currently allow CombineVariants to use the -nt option.
Thanks for you time and work.
--Shawn
Best Answers
-
Geraldine_VdAuwera Cambridge, MA admin
Hi Shawn,
Step 3 is actually not necessary unless you have 200+ combined GVCFs resulting from step 2. You can just pass a list of GVCFs to GenotypeGVCFs and it will do the big joint genotyping step as if they were all in one file.
If you did need to do a step 3, you would not be able to use CombineVariants because it is not capable of combining GVCF-specific information appropriately. It might work (ie it would run) but the results would be invalid.
-
Geraldine_VdAuwera Cambridge, MA admin
Hi @srynearson1,
This might be due to a technicality of how the indices are produced. Try adding these parameters to your CatVariants command:
--variant_index_type LINEAR --variant_index_parameter 128000
Answers
Hi Shawn,
Step 3 is actually not necessary unless you have 200+ combined GVCFs resulting from step 2. You can just pass a list of GVCFs to GenotypeGVCFs and it will do the big joint genotyping step as if they were all in one file.
If you did need to do a step 3, you would not be able to use CombineVariants because it is not capable of combining GVCF-specific information appropriately. It might work (ie it would run) but the results would be invalid.
Strangely, I took your advice and ran the following steps:
1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
2 - You can just pass a list of GVCFs to GenotypeGVCFs and it will do the big joint genotyping step as if they were all in one file
Result
I used the 17 Platinum genomes in a split run across each chromosome (-L). Each run would increase time needed to complete (up to about 50+h) before it eventually died with a call to increase Xmx argument (was running with -Xmx100g).
Any thoughts would be appreciated.
I thought I would add a log file using the above parameters.
Hi @srynearson1,
This might be due to a technicality of how the indices are produced. Try adding these parameters to your CatVariants command:
--variant_index_type LINEAR --variant_index_parameter 128000
That was the trick. Reduced the whole run time, start to finish to 2 hours.
Thanks a bunch.
--Shawn
Yay
I believe that this discussion answers my question. For some samples I have a collection of gvcfs for each chromosome while for others I just have a single gvcf. For joint genotyping I was going to pass all these gvcfs to GenotypeGVCfs together as one long list ( more than 200 gvcfs but only about 50 samples overall ). I haven't read anything that would make me think that this would be a problem. There's the fact that I've got over 200 gvcf's but most of these files are small. Am I still going to see the run time issues?
Thanks!
@robertb The bottleneck for this operation is that the more files you have that need to be open at the same time, the more memory you'll need. You might get away with it, depending on what kind of infrastructure you're working on, but I would still recommend combining GVCFs per-sample before running GenotypeGVCFs. You can do this fairly quickly with CatVariants. If you go with that, be sure to use the parameters I indicated above.
Yes, Thanks. As with more recent versions of TK I specified .g.vcf for output so there's no need for the variant_index settings
In the course of doing CatVariants for the chromosomes for each sample I noticed that for 3 samples some chromosomes were not represented meaning there were no gvcfs for these chromosomes. Seems strange to me. Is this normal? Could it suggest a problem with the BAM files given to HaplotypeCaller?
Thanks
To be clear you need to give the indexing parameters to CatVariants otherwise it won't know to use the GVCF settings. That tool doesn't have the extension recognition logic (an unfortunate oversight).
Regarding your second question, can you clarify what is missing? From what step?
That's curious because I had initially followed your instructions and got a warning from the program stating that I didn't need to use them when specifying g.vcf for -out.
When I ran HC I used the -L option for each chromosome ( chr1-chr22 + chrother ). When I used CatVariants to merge the chromosome gvcfs for each sample I noticed that for three of my samples anywhere from 3-5 of the chromosomes were missing ( there were no gvcf for these chromosomes ). For example for sample A I was missing chromosomes 2, 4, 6, 9, and 17. I'd assumed that all chromosomes would be present following HC.
Here's the relevant section of one of my log files in case your interested. Note the lats line.
INFO 23:34:36,524 HelpFormatter - -------------------------------------------------------
INFO 23:34:36,532 HelpFormatter - Program Name: org.broadinstitute.gatk.tools.CatVariants
INFO 23:34:36,538 HelpFormatter - Program Args: -R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta --variant_index_type LINEAR --variant_index_parameter 128000 -V /scratch/rb14sp/Control/VCF/01C08034.chr1.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr2.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr3.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr4.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr5.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr6.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr7.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr8.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr9.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr10.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr11.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr12.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr13.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr14.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr15.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr16.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr17.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr18.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr19.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr20.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr21.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr22.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chrother.g.vcf -out /scratch/rb14sp/Control/VCF/01C08034.g.vcf
INFO 23:34:36,568 HelpFormatter - Executing as [email protected] on Linux 2.6.32-358.14.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00.
INFO 23:34:36,569 HelpFormatter - Date/Time: 2016/02/13 23:34:36
INFO 23:34:36,569 HelpFormatter - -------------------------------------------------------
INFO 23:34:36,569 HelpFormatter - -------------------------------------------------------
WARN 23:34:46,099 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter
9575 [main] WARN org.broadinstitute.gatk.engine.GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter
.
Ah perhaps the logic was put in and I didn't realize -- sounds like that's working ok then.
Do you have logs for the runs that were supposed to produce the missing files? What kind of pipelining system do you use?
Unfortunately I inherited these bams from someone so I'm not clear on the details and I'm a little suspicious about what they did exactly. What I do know is the following:
1) mapped the reads with bwa 0.7.9 mem ( human high seq reads 30X coverage )
2) Some combination of sort, mark dups, mergebams, relaign and bqsr using GATK version 3.3 or 3.4
I inherited the recalibrated bams although no record of bqsr appears in the headers. They say that they ran it and I believe them but the absence of a record concerns me because I didn't get the same result with my bam files.
One thing that I did do however was use GATK 3.5 for the HaplotypeCaller step. I believe that they were using 3.3 or 3.4 up to that point.
Thanks for all your help, it's really appreciated especially at this hour!
Here's my HC script:
if [ ! ${1} ]; then echo "$0 basenameOfBQSRbam"; exit; fi
echo ${1} && \
\
HaplotypeCaller
for i in {1..22}; do echo -ne "$i "; \
sqsub -r 36h -o /work/rb14sp/CONlog/${1}.chr${i}.HC.log -f serial --mpp 8G \
java -Xmx6g -jar /work/rb14sp/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta \
--dbsnp /work/lianglab/bin/GATK/resources/hg19/dbsnp_138.hg19.vcf \
-L chr${i} \
-I /scratch/rb14sp/Control/BAM/${1}.bam \
-o /scratch/rb14sp/Control/VCF/${1}.chr${i}.g.vcf \
--emitRefConfidence GVCF \
2>/work/rb14sp/CONlog/$1.vcf.log; done && \
\
echo -ne "chrother " && \
sqsub -r 20h -o /work/rb14sp/CONlog/${1}.chr${i}.HC.log -f serial --mpp 8G \
java -Xmx6g -jar /work/rb14sp/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta \
--dbsnp /work/lianglab/bin/GATK/resources/hg19/dbsnp_138.hg19.vcf \
-XL chr1 -XL chr2 -XL chr3 -XL chr4 -XL chr5 -XL chr6 -XL chr7 -XL chr8 -XL chr9 -XL chr10 -XL chr11 -XL chr12 \
-XL chr13 -XL chr14 -XL chr15 -XL chr16 -XL chr17 -XL chr18 -XL chr19 -XL chr20 -XL chr21 -XL chr22 \
-I /scratch/rb14sp/Control/BAM/${1}.bam \
-o /scratch/rb14sp/Control/VCF/${1}.chrother.g.vcf \
--emitRefConfidence GVCF; exit
Hi @robertb, if you have files missing when running through a script like this you need to look at the jobscheduler logs to see what happened. Probably some of the submissions failed.