SelectVariants Large VCF slow runtime

I am trying to subset and filter a large VCF file (10k exome samples, 250 GB) using SelectVariants. My goal is to subset by individual sample (a custom script iterates over the samples and issues a separate SelectVariants command for each), keeping only heterozygous SNP calls with an alt allele depth > 5 and GQ > 30 at sites that pass the filter. The problem is very slow runtime, which seems surprising when I only want calls from a single sample. It may be an issue with how I have set up my SelectVariants command (shown below), or with how SelectVariants handles large VCFs.

Here is the command I am using:

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz -sn sample.name -selectType SNP -select 'vc.getGenotype("sample.name").isHet()' -select 'vc.getGenotype("sample.name").getAD().1 > 5' -select 'vc.getGenotype("sample.name").getGQ() > 30' -select 'vc.isNotFiltered()'
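
A minimal sketch of the per-sample iteration described above, assuming Python 3.8+ for the wrapper script (the original custom script is not shown). The function name `build_select_variants_cmd`, the sample names, and the `<sample>.filtered.vcf.gz` output naming are hypothetical; the GATK arguments are copied from the command above. The sketch only builds and prints the commands, so running them (for example with `subprocess.run`) is left to the caller.

```python
import shlex
from typing import List


def build_select_variants_cmd(
    sample: str,
    vcf: str = "very.large.vcf.gz",
    ref: str = "ref.fa",
    jar: str = "GATK.3.7.jar",
) -> List[str]:
    """Build one SelectVariants command (as an argument list) for one sample."""
    # JEXL prefix that addresses this sample's genotype, as in the post above.
    genotype = f'vc.getGenotype("{sample}")'
    return [
        "java", "-jar", jar,
        "-T", "SelectVariants",
        "-R", ref,
        "-V", vcf,
        # Hypothetical naming scheme: one output file per sample.
        "-o", f"{sample}.filtered.vcf.gz",
        "-sn", sample,
        "-selectType", "SNP",
        "-select", f"{genotype}.isHet()",
        "-select", f"{genotype}.getAD().1 > 5",
        "-select", f"{genotype}.getGQ() > 30",
        "-select", "vc.isNotFiltered()",
    ]


if __name__ == "__main__":
    # Hypothetical sample names; in practice these would come from the VCF header.
    for sample_name in ["sampleA", "sampleB"]:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(build_select_variants_cmd(sample_name)))
```

Keeping the command as an argument list avoids shell-quoting problems with the JEXL expressions, which contain double quotes, parentheses, and `>`.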

Comments

  • matthewzatzman, SickKids (Member)

    I solved this by multithreading with '-nt 16', which significantly reduced the expected runtime.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @matthewzatzman
    Hi,

    You may also be interested in WDL. Users have reported issues with multi-threading, and WDL is much more stable.

    -Sheila

  • jrisse, Wageningen (Member)

    Hi,

    I have a similar problem. I'm trying to filter a large VCF file (23 GB unzipped) to select high-quality SNPs. The reference is a draft assembly with a lot of smallish contigs, and the genome size is 2.1 Gb. The VCF contains around 260 samples; it was generated with samtools/bcftools and should contain all relevant fields. I tried multi-threading and increasing memory, and I also disabled the downsampling of reads. The command has now been running for more than 6 days, and progress is still 0% according to stdout. The output file contains only the VCF header, nothing else. I've used SelectVariants before, but don't recall anything taking this long. Any suggestions? The command is:

    java -Xmx124g -jar GenomeAnalysisTK.jar -nt 10 -T SelectVariants -R reference.fa --variant samtools_snps.vcf -dt NONE -env --selectTypeToInclude SNP -select "DP >= 150 && QUAL >= 50" -o samtools_snps_filtered_DP20_Q50.vcf

    Regards
    Judith

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @jrisse
    Hi Judith,

    How many contigs are in your reference? This thread will help too.

    -Sheila

  • tytolin (Member)

    @Sheila
    Hello Sheila,

    How can I reduce the runtime of GATK SelectVariants (SV)?

    I have run into the same problem: GATK spends far too much time in SV when searching for concordance.
    I have a VCF file with 429275 variants from samtools, and a VCF containing 5843388 variants from GATK UG. My server has spent twenty days running the command

    -T SelectVariants -nt 8 -R reference.fa -V samtools.vcf --concordance gatk.vcf -o re_concordance.vcf

    and, judging from nohup.out, it has only just finished the first scaffold:
    scaffold122_len806_cov30:332 23.0 8.2 d 15250.3 w 0.0% 15250.3 w 15249.1 w
    scaffold122_len806_cov30:332 23.0 16.5 d 15250.3 w 0.0% 15250.3 w 15247.9 w
    It spent 8.3 days on a scaffold that is only 806 bp long.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @tytolin
    Hi,

    That is too long. Can you try using -L? Try running on one contig first, and hopefully that helps.

    What kind of data are you working with?

    -Sheila

  • tytolin (Member)

    @Sheila

    Whole-genome data from a passerine species, about 1.2 Gb.
    I will try using -L with one scaffold.

    Thanks a lot

    -Tyto

  • tytolin (Member)
    edited May 2

    @Sheila

    I used SV with -L to extract one short scaffold from the VCF files generated by UG and samtools mpileup, to test whether my GATK installation is OK.
    There are only 33 and 21 variants on this short scaffold in the two files.
    However, it has taken three days so far and is still running.
    Is that usual or unusual?

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @tytolin
    Hi Tyto,

    It is so nice to hear about users who work with non-human organisms. Passerine pictures from Google are quite nice :smile:

    It is pretty odd that it is taking a long time. Can you try using -L instead of subsetting out the contigs with SelectVariants? Also, how long are the contigs in your reference, and how many contigs are there?

    Thanks,
    Sheila

  • tytolin (Member)

    @Sheila
    Hi Sheila,

    I used -L to extract a 2716 bp scaffold from two VCF files generated with different programs. Then I ran SV --concordance on the two new VCFs, which contain only that one 2716 bp scaffold. That --concordance run has been going for 4 days.

    I also ran SV -L scaffold2716 --concordance on the two VCFs generated with the two different programs.
    It took about three days to finish. The output seems quite normal.

    Thanks,
    Tyto

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @tytolin
    Hi Tyto,

    I see. After subsetting the VCF to just one contig, are you also using -L when running SelectVariants with --concordance?

    How long are the contigs in your VCF, and how many are there?

    Thanks,
    Sheila

  • tytolin (Member)

    @Sheila
    Hi Sheila,

    No, I didn't use -L with SelectVariants --concordance after subsetting the VCF.

    I only took one contig, 2716 bp long, which is the same one I subset.

    Thanks,
    Tyto

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @tytolin
    Hi Tyto,

    Thanks. What about in your reference in general? How many contigs are there and how long are they? I ask because I am wondering if the tool is simply taking a long time to load the entire reference.

    -Sheila

  • tytolin (Member)

    @Sheila
    There are about 600 thousand contigs in my reference.
    A mammalian reference I compared it with has far fewer contigs than the one I am using.
    That might be the reason why I can't complete the SV --concordance run.
    I will give it a try.
    Thanks a lot.

    -Tyto

  • Sheila, Broad Institute (Member, Broadie, Moderator)
    edited May 11

    @tytolin
    Hi Tyto,

    We have some threads on how to stitch together contigs with Ns. That may help you, as GATK tools are not made to take in so many contigs.

    -Sheila

  • tytolin (Member)

    @Sheila
    Hello Sheila,

    I now have a reference with 8000 contigs, and it worked. Thanks.
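
For readers with a similarly fragmented reference, the idea mentioned above of stitching contigs together with Ns might look roughly like the following sketch. It is not a GATK tool and not necessarily how the 8000-contig reference in this thread was produced. The function names, the `super_N` scaffold naming, the batch size, and the gap length are all hypothetical choices.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a FASTA file."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any description after it.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def stitch_contigs(
    records: Iterable[Tuple[str, str]],
    contigs_per_scaffold: int = 100,  # assumed batch size, tune as needed
    gap_len: int = 1000,              # assumed N-spacer length
) -> Tuple[List[Tuple[str, str]], Dict[str, Tuple[str, int]]]:
    """Join contigs into fewer, longer scaffolds separated by runs of N.

    Returns the stitched scaffolds plus, for every original contig, the
    scaffold it landed in and its 0-based start offset within that scaffold.
    """
    gap = "N" * gap_len
    scaffolds: List[Tuple[str, str]] = []
    offsets: Dict[str, Tuple[str, int]] = {}
    batch: List[Tuple[str, str]] = []

    def flush() -> None:
        if not batch:
            return
        scaffold_name = f"super_{len(scaffolds) + 1}"
        position = 0
        for contig_name, seq in batch:
            offsets[contig_name] = (scaffold_name, position)
            position += len(seq) + gap_len
        scaffolds.append((scaffold_name, gap.join(seq for _, seq in batch)))
        batch.clear()

    for record in records:
        batch.append(record)
        if len(batch) == contigs_per_scaffold:
            flush()
    flush()  # write out the final, possibly smaller, batch
    return scaffolds, offsets
```

The offset table matters because any position on an original contig has to be shifted by that contig's offset to locate it on the stitched scaffold. Data aligned or called against the old reference would presumably need to be redone against the stitched one.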
