Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

How do I specify a list of samples for GenotypeGVCFs?

This is the recommended code for GenotypeGVCFs

java -jar GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R reference.fasta \
   --variant sample1.g.vcf \
   --variant sample2.g.vcf \
   -o output.vcf

Is there some way to specify input g.vcfs from a variable or a text file with sample names?

echo "$files"
s1.g.vcf
s2.g.vcf

or

cat files.txt
s1.g.vcf
s2.g.vcf

I tried --variant $files and --variant <(echo $files), but that doesn't work.

Tagged:

Best Answer

  • rmfrmf
    Accepted Answer

    This approach worked for me:

    # get all files ending with g.vcf and add --variant before it
    samples=$(find . | sed 's/.\///' | grep -E 'g.vcf$' | sed 's/^/--variant /')
    

    Then

    java -jar GenomeAnalysisTK.jar \
       -T GenotypeGVCFs \
       -R reference.fasta \
       -o output.vcf \
       $(echo $samples)
    

Answers

  • Greg_OwensGreg_Owens Member
    edited December 2017

    For bash scripts, I use a loop to build up a single variable that has all my samples. For example:

    while read prefix
    do
            tmp="$tmp --variant $gvcf/$prefix.g.vcf"
    done < $path/gvcf.list.txt
    

    Then just put $tmp in when I call GATK.

  • rmfrmf Member
    Accepted Answer

    This approach worked for me:

    # get all files ending with g.vcf and add --variant before it
    samples=$(find . | sed 's/.\///' | grep -E 'g.vcf$' | sed 's/^/--variant /')
    

    Then

    java -jar GenomeAnalysisTK.jar \
       -T GenotypeGVCFs \
       -R reference.fasta \
       -o output.vcf \
       $(echo $samples)
    
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Put all the file names in a single file named files.list or whatever. Give that file as --variant parameter and you are set. You don't need to fiddle with loops and other things.

  • sam0persam0per Member

    @SkyWarrior
    Is it possible that your suggestion is not available in GATK4?

    If I use -V sample.list which contains

    CZA121.g.vcf.gz
    CZA122.g.vcf.gz
    CZA123.g.vcf.gz
    CZA126.g.vcf.gz 
    

    the GenotypeGVCFs returns the following error
    Cannot read test/samples_gvcf.list because no suitable codecs found

    I also tried this format for the sample list (tab separated)

    CZA121  CZA121.g.vcf.gz
    CZA122  CZA122.g.vcf.gz
    CZA123  CZA123.g.vcf.gz
    CZA126  CZA126.g.vcf.gz
    

    and the same error occurs

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited April 18

    You cannot use that in GATK4.0. You need to combine your gvcfs into GenomicsDB or into a combined GVCF via CombineGVCFs. You cannot genotype them on the fly anymore in GATK4.0. And this is for the common good (It is faster!).

    A little benchmark result may push more people to GATK4.0

    1200 60X-100X WES GenotypeGVCFs GATK3.8-1 -nt 4 > 70 hours (The only parallelization option on the local machine 4 threads per data thread. Try 8 or more to see your system die in vain! if running on a local machine)
    1200 60X-100X WES GenomicsDBImport + GenotypeGVCFs GATK4.0.3.0 ~40 hours. (GenomicsDBImport parallelized 4 times with 6 contigs each thread and GenotypeGVCFs parallellized 4 times with 6 contigs each thread. All on the local machine no spark no wdl no cromwell whatsoever. Could have parallelized 8 times with no issues but I did not care much!)

Sign In or Register to comment.