How do I specify a list of samples for GenotypeGVCFs?

This is the recommended code for GenotypeGVCFs

java -jar GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R reference.fasta \
   --variant sample1.g.vcf \
   --variant sample2.g.vcf \
   -o output.vcf

Is there some way to specify input g.vcfs from a variable or a text file with sample names?

echo "$files"
s1.g.vcf
s2.g.vcf

or

cat files.txt
s1.g.vcf
s2.g.vcf

I tried --variant $files and --variant <(echo $files), but that doesn't work.

Tagged:

Best Answer

  • rmfrmf Member
    Accepted Answer

    This approach worked for me:

    # get all files ending with g.vcf and add --variant before it
    samples=$(find . | sed 's/.\///' | grep -E 'g.vcf$' | sed 's/^/--variant /')
    

    Then

    java -jar GenomeAnalysisTK.jar \
       -T GenotypeGVCFs \
       -R reference.fasta \
       -o output.vcf \
       $(echo $samples)
    

Answers

  • Greg_OwensGreg_Owens Member
    edited December 2017

    For bash scripts, I use a loop to build up a single variable that has all my samples. For example:

    while read prefix
    do
            tmp="$tmp --variant $gvcf/$prefix.g.vcf"
    done < $path/gvcf.list.txt
    

    Then just put $tmp in when I call GATK.

  • rmfrmf Member
    Accepted Answer

    This approach worked for me:

    # get all files ending with g.vcf and add --variant before it
    samples=$(find . | sed 's/.\///' | grep -E 'g.vcf$' | sed 's/^/--variant /')
    

    Then

    java -jar GenomeAnalysisTK.jar \
       -T GenotypeGVCFs \
       -R reference.fasta \
       -o output.vcf \
       $(echo $samples)
    
  • SkyWarriorSkyWarrior TurkeyMember

    Put all the file names in a single file named files.list or whatever. Give that file as --variant parameter and you are set. You don't need to fiddle with loops and other things.

  • sam0persam0per Member

    @SkyWarrior
    Is it possible that your suggestion is not available in GATK4?

    If I use -V sample.list which contains

    CZA121.g.vcf.gz
    CZA122.g.vcf.gz
    CZA123.g.vcf.gz
    CZA126.g.vcf.gz 
    

    the GenotypeGVCFs returns the following error
    Cannot read test/samples_gvcf.list because no suitable codecs found

    I also tried this format for the sample list (tab separated)

    CZA121  CZA121.g.vcf.gz
    CZA122  CZA122.g.vcf.gz
    CZA123  CZA123.g.vcf.gz
    CZA126  CZA126.g.vcf.gz
    

    and the same error occurs

  • SkyWarriorSkyWarrior TurkeyMember
    edited April 18

    You cannot use that in GATK4.0. You need to combine your gvcfs into GenomicsDB or into a combined GVCF via CombineGVCFs. You cannot genotype them on the fly anymore in GATK4.0. And this is for the common good (It is faster!).

    A little benchmark result may push more people to GATK4.0

    1200 60X-100X WES GenotypeGVCFs GATK3.8-1 -nt 4 > 70 hours (The only parallelization option on the local machine 4 threads per data thread. Try 8 or more to see your system die in vain! if running on a local machine)
    1200 60X-100X WES GenomicsDBImport + GenotypeGVCFs GATK4.0.3.0 ~40 hours. (GenomicsDBImport parallelized 4 times with 6 contigs each thread and GenotypeGVCFs parallellized 4 times with 6 contigs each thread. All on the local machine no spark no wdl no cromwell whatsoever. Could have parallelized 8 times with no issues but I did not care much!)

Sign In or Register to comment.