How do I specify a list of samples for GenotypeGVCFs?

This is the recommended code for GenotypeGVCFs:
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    --variant sample1.g.vcf \
    --variant sample2.g.vcf \
    -o output.vcf
Is there some way to specify input g.vcfs from a variable or a text file with sample names?
echo "$files"
s1.g.vcf s2.g.vcf
or
cat files.txt
s1.g.vcf s2.g.vcf
I tried --variant $files and --variant <(echo $files), but neither works.
Best Answer
This approach worked for me:
# get all files ending with .g.vcf and prefix each one with --variant
samples=$(find . | sed 's/^\.\///' | grep -E '\.g\.vcf$' | sed 's/^/--variant /')
Then
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    -o output.vcf \
    $(echo $samples)
Answers
For bash scripts, I use a loop to build up a single variable that has all my samples. For example:
Then just put $tmp in when I call GATK.
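A minimal sketch of such a loop, assuming the variable is called tmp as in the post above and that the gVCF file names contain no spaces. The function name build_variant_args is made up for illustration:

```shell
# Build one string of "--variant <file>" pairs, one pair per gVCF.
# Assumption: file names contain no spaces, because $tmp is later
# expanded unquoted so that the shell splits it back into words.
build_variant_args() {
    tmp=""
    for f in "$@"; do
        tmp="$tmp --variant $f"
    done
}

# Hypothetical usage, with the file names from the question:
#   build_variant_args s1.g.vcf s2.g.vcf
#   java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta $tmp -o output.vcf
```

Leaving $tmp unquoted in the java call is deliberate here: quoting it would hand GATK the whole string as a single argument.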
Put all the file names, one per line, in a single file named files.list or similar (keep the .list extension). Pass that file as the --variant parameter and you are set. You don't need to fiddle with loops and other things.
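A minimal sketch of the list-file approach, assuming GATK3 and one gVCF path per line. The helper name write_gvcf_list is made up for illustration, and as far as I know GATK3 recognizes such a file by its .list extension:

```shell
# Write one gVCF path per line to a list file.
# First argument: the list file to create; remaining arguments: gVCF paths.
write_gvcf_list() {
    out=$1
    shift
    printf '%s\n' "$@" > "$out"
}

# Hypothetical usage, with the file names from the question:
#   write_gvcf_list files.list s1.g.vcf s2.g.vcf
#   java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant files.list -o output.vcf
```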
@SkyWarrior
Is it possible that your suggestion is not available in GATK4?
If I use -V sample.list which contains
the GenotypeGVCFs returns the following error
Cannot read test/samples_gvcf.list because no suitable codecs found
I also tried this format for the sample list (tab separated)
and the same error occurs
You cannot use that in GATK4.0. You need to combine your gVCFs first, either into a GenomicsDB workspace with GenomicsDBImport or into a combined gVCF with CombineGVCFs. GenotypeGVCFs can no longer genotype several gVCFs on the fly in GATK4.0. And this is for the common good (it is faster!).
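A rough sketch of the GATK4 route, assuming you go through GenomicsDBImport with its --sample-name-map option, which takes a tab-separated file of sample name and gVCF path. The helper name write_sample_map, the workspace name my_database and the interval chr20 are placeholders, and deriving the sample name from the file name is an assumption you may need to change:

```shell
# Print a tab-separated sample map: sample name, then gVCF path, one line per file.
# Assumption: the sample name is the file name minus the .g.vcf suffix.
write_sample_map() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(basename "$f" .g.vcf)" "$f"
    done
}

# Hypothetical usage (my_database and chr20 are placeholders):
#   write_sample_map s1.g.vcf s2.g.vcf > samples.map
#   gatk GenomicsDBImport --sample-name-map samples.map \
#       --genomicsdb-workspace-path my_database -L chr20
#   gatk GenotypeGVCFs -R reference.fasta -V gendb://my_database -O output.vcf
```

GenomicsDBImport needs at least one interval (-L), so the import is done per interval or per contig.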
A little benchmark result may push more people to GATK4.0:
1200 WES samples at 60X-100X, GenotypeGVCFs, GATK3.8-1 with -nt 4: more than 70 hours. (This is the only parallelization option on the local machine, 4 threads per data thread. If you are running on a local machine, try 8 or more to see your system die in vain!)
1200 WES samples at 60X-100X, GenomicsDBImport + GenotypeGVCFs, GATK4.0.3.0: ~40 hours. (GenomicsDBImport was parallelized 4 times with 6 contigs per thread, and GenotypeGVCFs was parallelized 4 times with 6 contigs per thread. All on the local machine: no Spark, no WDL, no Cromwell whatsoever. I could have parallelized 8 times with no issues, but I did not care much!)