SelectVariants Large VCF slow runtime

I am attempting to subset and filter a large VCF file (10k exome samples, 250 GB) using SelectVariants. My goal is to subset by individual sample (a custom script iterates over the samples and issues a separate SelectVariants command for each), keeping only heterozygous calls with an alt allele depth > 5 and GQ > 30, at SNP sites that pass the filter. My issue is very slow runtime, which seems surprising given that I only want calls from a single sample. It may be a problem with how I have set up my SelectVariants command (shown below), or it may be a problem with SelectVariants and large VCFs in general.

Here is the command I am using:

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz -sn -selectType SNP -select 'vc.getGenotype("").isHet()' -select 'vc.getGenotype("").getAD().1 > 5' -select 'vc.getGenotype("").getGQ() > 30' -select 'vc.isNotFiltered()'
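For reference, the per-sample iteration described above could be sketched roughly as follows. This is only an illustration, not the actual custom script: the function name, the sample names, and the `<sample>.filtered.vcf.gz` output naming are hypothetical, and only flags that already appear in this thread are used. The sketch prints the commands rather than running them.

```python
import shlex


def build_select_variants_cmd(sample, vcf="very.large.vcf.gz", ref="ref.fa",
                              jar="GATK.3.7.jar", threads=None):
    """Build one SelectVariants command (as an argument list) for a single sample.

    The output file name is a hypothetical convention: <sample>.filtered.vcf.gz
    """
    genotype = 'vc.getGenotype("%s")' % sample
    cmd = [
        "java", "-jar", jar,
        "-T", "SelectVariants",
        "-R", ref,
        "-V", vcf,
        "-o", "%s.filtered.vcf.gz" % sample,
        "-sn", sample,
        "-selectType", "SNP",
        "-select", genotype + ".isHet()",
        "-select", genotype + ".getAD().1 > 5",
        "-select", genotype + ".getGQ() > 30",
        "-select", "vc.isNotFiltered()",
    ]
    if threads:
        # Optional multithreading flag, as in '-nt 16'
        cmd += ["-nt", str(threads)]
    return cmd


if __name__ == "__main__":
    # Hypothetical sample names; in practice these would come from the VCF header.
    for name in ["SAMPLE_A", "SAMPLE_B"]:
        args = build_select_variants_cmd(name)
        # Print a shell-safe command line instead of executing it.
        print(" ".join(shlex.quote(a) for a in args))
```

Each returned list could be passed to `subprocess.run` to actually launch the job; that step is left out here so the sketch has no side effects.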


  • matthewzatzman SickKids Member

    I have solved this by multithreading with '-nt 16', which has significantly reduced the expected runtime.

  • Sheila Broad Institute Member, Broadie, Moderator


    You may also be interested in WDL. Users have reported issues with multi-threading, and WDL is much more stable.


  • jrisse Wageningen Member


    I have a similar problem. I'm trying to filter a large VCF file (23 GB unzipped) to select high-quality SNPs. The reference is a draft assembly with a lot of smallish contigs, and the genome size is 2.1 Gb. The VCF contains around 260 samples. It was generated with samtools/bcftools and should contain all relevant fields. I tried multithreading and increasing memory. I also disabled the downsampling of reads. The command has now been running for more than 6 days, and progress is 0% according to stdout. The output file contains only the VCF header, nothing else. I've used SelectVariants before, but don't recall anything taking this long. Any suggestions? The command is:

    java -Xmx124g -jar GenomeAnalysisTK.jar -nt 10 -T SelectVariants -R reference.fa --variant samtools_snps.vcf -dt NONE -env --selectTypeToInclude SNP -select "DP >= 150 && QUAL >= 50" -o samtools_snps_filtered_DP20_Q50.vcf


  • Sheila Broad Institute Member, Broadie, Moderator

    Hi Judith,

    How many contigs are in your reference? This thread will help too.
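One way to answer the contig-count question is to count the header lines in the reference FASTA. A minimal sketch, assuming an uncompressed FASTA in which every contig starts with a `>` line; the function name is made up for illustration, and `reference.fa` is the file name from the command above.

```python
def count_contigs(fasta_path):
    """Count sequence records in a FASTA file by counting '>' header lines."""
    n_contigs = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_contigs += 1
    return n_contigs


# Example call (assumes reference.fa exists in the working directory):
# print(count_contigs("reference.fa"))
```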


  • tytolin Member

    Hello, Sheila

    How can I reduce the runtime of GATK SelectVariants (SV)?

    I have run into the same problem: GATK spends far too much time in SV when searching for concordance.
    I have a VCF file with 429275 variants from samtools and a VCF containing 5843388 variants from GATK UG. My server has spent twenty days running the command

    -T SelectVariants -nt 8 -R reference.fa -V samtools.vcf --concordance gatk.vcf -o re_concordance.vcf

    Judging from nohup.out, it has only just finished the first scaffold:
    scaffold122_len806_cov30:332 23.0 8.2 d 15250.3 w 0.0% 15250.3 w 15249.1 w
    scaffold122_len806_cov30:332 23.0 16.5 d 15250.3 w 0.0% 15250.3 w 15247.9 w
    It spent 8.3 days on a scaffold that is only 806 bp long.

