This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
SelectVariants Large VCF slow runtime
I am attempting to subset and filter a large (10k exome sample, 250GB) VCF file using SelectVariants. My goal is to subset by individual samples (iterating over each sample using a custom script and passing an individual SelectVariants command for each), selecting only heterozygous alleles, with an alt allele depth > 5, GQ > 30, and for SNPs that pass the filter. My issue is very slow runtime, which seems like it shouldn't be a problem when I only want calls from a single sample. I feel it may be an issue with how I have set up my SelectVariants command (shown below), or it may be an issue with SelectVariants and large VCFs.
Here is the command I am using:
java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz -sn sample.name -selectType SNP -select 'vc.getGenotype("sample.name").isHet()' -select 'vc.getGenotype("sample.name").getAD().1 > 5' -select 'vc.getGenotype("sample.name").getGQ() > 30' -select 'vc.isNotFiltered()'