We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Using VariantRecalibrator over a large dataset.

init_jsinit_js Member
edited May 2018 in Ask the GATK team

I'm using VariantRecalibrator on about 2TiB of raw SNPs, and I'm hoping there's a way I can build the model iteratively over smaller windows of input files. I'm fine with doing this sequentially/serially, it wouldn't need to be parallelized. My raw VCF files (thousands of them), which I supply as a series of --variant NNNNNNNN.vcf.gz arguments to VariantRecalibrator, are non-overlapping, and are all jointcalled over the same set of samples.

We're using GATK4.0.1.2 (will switch to, and we are following best practices. My SNP training set has been computed separately, as I don't have access to high-quality "truth" SNP sets for the plant genome I'm working with (sunflower).

Here is the gist of my VariantRecalibrator call (I have a similar one for INDELs):



HOME="${workdir}" /gatk/gatk VariantRecalibrator \
    --java-options "$JAVA_OPTIONS" \
    --TMP_DIR "${XTMP}/gatk" \
    --arguments_file "$argsfile" `:  that is simply a concatenation of all input files` \
    --resource ${resname},${resparams}:"${TMPSNPS}" \
    --mode SNP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
    --truth-sensitivity-tranche 100.0 \
    --truth-sensitivity-tranche  99.0 \
    --truth-sensitivity-tranche  90.0 \
    --truth-sensitivity-tranche  70.0 \
    --truth-sensitivity-tranche  50.0 \
    --rscript-file  "$workdir/${snp_rscript}" \
    --tranches-file "$workdir/${snp_tranches}" \
    --output        "$workdir/${snp_recal}"

The arguments_file simply contains a series of --variant XXX.vcf.gz.

My input VCF files are stored on the NAS inside tarballs (.tar), and I can't have them both in archive form and de-tarred, because it won't fit on the compute node. I'm hoping there is a way to iterate the model recalibration in batches, where on a subsequent batch I pick up the model where I left it off on the previous iteration?

I'm wondering what my options are to eventually iterate over my entire dataset.

  • Can I feed VariantRecalibrator pipe (fifo) files so I can stream vcf data into it? (the fifo would be backed by a program that would simply concatenate files one after the other on the stream, or just feed tar -xf batch001.tar 0000001.vcf.gz etc.).

  • Is there a plugin/codec that can read all the vcf.gz files contained in a tarball? Then I could probably leave all the tar files on the NAS/filer.

PS. In the docs, the VariantRecalibrator example mentions --recal-file FOO, but that option is not supported in (https://software.broadinstitute.org/gatk/documentation/tooldocs/ )

Post edited by init_js on


  • init_jsinit_js Member

    I've just noticed in the code that there's provision to load an existing model:


    Is this input the output of a previous run of the variant recalibration?

  • init_jsinit_js Member
    edited June 2018

    I don't think --input-model X and --output-model X are that useful after all. If the walk stops before enough data has been accumulated, then the variantrecalibratorengine can't output anything. So i'd have to know ahead of time how much data is enough to feed into the first batch.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    I hope the WDLs we provide for running on many samples will help you. Have a look at Laura's response here that outlines some benefits of using the WDL and where to find them.


  • init_jsinit_js Member

    Great. That WDL information looks promising. The salient bit I need is probably the --sample-every 10 aspect of the workflow, which I don't think is described anywhere else in the docs. I'm stuck with Snakemake for the moment, so I can't readily incorporate WDL concepts, I have to replicate them. I think that should be fine.

    I've also found a workaround to my problem:

    VariantRecalibrator seems overly sensitive to the number of input files, and not so much the aggregate quantity of data in those files. When I posted initially, I was passing all of my ~4500 1Mbp vgf.gz, obtained from a GenotypeGVCFs scatter operation.

    When passing thousands of input files, the initial phase of VariantRecalibrator (before the traversal) would take hours, and consume over 40GB ram for my dataset. Once the traversal started, the memory footprint drastically reduced. As the traversal went on, the number of variants processed by minute would also go down to a grinding halt:

    • Chr01 got processed in 46min (153Mbp) (240000 variants/minute)
    • Chr02 got processed in 100min (180Mbp)
    • Chr03 got processed in 170min (168Mbp)
    • By the middle of chromosome 4, it would be down to (104000 variants/minute). There are 17 chromosomes plus bits.

    The workaround:

    If I gather all my GVCFs by chromosome first, then it's happier. The processing also doesn't seem to go down, so the total runtime appears linear in the quantity of data. It's still 250000 variants/min by the start chromosome 9. The job still takes a day or so, so it would be nice to be able to distribute the work. -- hopefully --sample-every 10 can help.

  • init_jsinit_js Member
    edited June 2018

    I meant to ask also...

    Is there any benefit to having an index (.vcf.gz.tbi) to each of the input files of VariantRecalibrator? It's not clear that it would be beneficial for traversal.

    (GatherVCF doesn't produce an index file if the inputs are vcf.gz, so I have to make an extra call to produce it, which takes hours per chromosome)

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    I thought the tools in GATK4 now produce a VCF index on the fly, like in GATK3. Is that not true? What is the "extra call" that you use to produce the index?


  • init_jsinit_js Member
    edited June 2018

    Some tools in GATK4 do, but not all of them, depending on the input formats.

    This is gatk

    [Tue Jun 05 07:12:32 UTC 2018] Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11; Deflater: Intel; Inflater: Intel; Picard version: Version:2.17.2
    INFO    2018-06-05 07:12:32     GatherVcfs      Checking inputs.
    INFO    2018-06-05 07:12:32     GatherVcfs      Checking file headers and first records to ensure compatibility.
    INFO    2018-06-05 07:12:42     GatherVcfs      Gathering by copying gzip blocks. Will not be able to validate position non-overlap of files.
    WARNING 2018-06-05 07:12:42     GatherVcfs      Index creation not currently supported when gathering block compressed VCFs.
    INFO    2018-06-05 07:12:42     GatherVcfs      Gathering /home/user/src/snp-calling/x/data/tmp/tmp.call_vcf_chrom.atUoPQ/inputs/vcf_9bfba45a254e34ffd716ee4b080d91106edecfa9_HanXRQChr06-000000001-001000000.vcf.gz
    INFO    2018-06-05 07:12:47     GatherVcfs      Gathering /home/user/src/snp-calling/x/data/tmp/tmp.call_vcf_chrom.atUoPQ/inputs/vcf_9bfba45a254e34ffd716ee4b080d91106edecfa9_HanXRQChr06-001000001-002000000.vcf.gz
    INFO    2018-06-05 07:12:54     GatherVcfs      Gathering /home/user/src/snp-calling/x/data/tmp/tmp.call_vcf_chrom.atUoPQ/inputs/vcf_9bfba45a254e34ffd716ee4b080d91106edecfa9_HanXRQChr06-002000001-003000000.vcf.gz

    I happen to use the program tabix -p vcf <inputfile>.vcf.gz to re-create the index on the gathered vcf. (yields a .vcf.gz.tbi)

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    I see. Yes, the index file is necessary for the tools to run. There is a tool called IndexFeatureFile that you can use as well as tabix.


  • init_jsinit_js Member
    edited June 2018

    I've found out a few more things to watch out for when running VariantRecalibrator over large datasets, with help from @Sheila and @gauthier .

    My pipeline's not written in WDL, but I've been reproducing the steps laid out in the other thread, as well as the WDL pipeline it refers to (here).

    When I initially posted to this discussion, I was passing the scattered output of a GenotypeGVCFs call into VariantRecalibrator. There were thousands of small-ish 1Mbp .vcf.gz contigs. The first thing that helped greatly was to gather those. VariantRecalibrator has difficulty making any progress if there are too many inputs. I've settled for a per-chromosome gather (instead of a single file) to still be able to move files between filesystems (my total amount of vcf.gz data is in the low-TB range, but a single 2TB file is cumbersome to move around).

    The second improvement, was to convert the raw VCFs with Picard's MakeSitesOnlyVcf. VariantRecalibrator doesn't need all the information in the input VCF. I assumed it would not be affected by superfluous data, but it turns out it matters. It's not very good at not processing what it doesn't need. That has reduced my input dataset by about 50x. The WDL pipeline also makes an ExcessHet filter pass with a very conservative threshold before the MakeSitesOnlyVcf, but in my case the data reduction is marginal (1% or so). I'm not sure I really need the ExcessHet filter, but I've put it in place anyways. Note that you run VariantRecalibrator over the SitesOnly vcf, but you apply the model over the complete VCF.

    My SNP resource set (Known=false, Training=true, Truth=true, Prior=Q10.0), contains only polymorphic sites, so I've added --trust-all-polymorphic flag to my VariantRecalibrator call. My training resource set has 752947 records, collected from ~67 plants. The documentation mentions that the flag will drastically speed up computation -- which sounds great. But, the speed improvement was, in my setting, disappointingly marginal at best.

    A further tweak (from what's in the WDL), was to rename the extension on the recal file from .recal to .recal.vcf.gz. VariantCalibrator complains that it can't detect the output format of the table otherwise.

    SNP_01:26:30.949 WARN  GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF.

    Another requirement for making further progress (or any at all) was to bump RAM allocation for the JVM heap. The symptom, when not enough ram is provided, is that the speed at which variants are processed (Variants/Minute) will taper off, and the program will come to a grinding halt partway through the input. You may or may not get a GC memory overhead limit error in the end, depending on how patient you are. The garbage collector eventually ends up biting its own tail. For instance, you might see something like this. Progress is slowed down dramatically by the third input (there are 20 total):

    SNP_01:26:30.966 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    SNP_01:26:40.974 INFO  ProgressMeter - HanXRQChr01:14849584              0.2               1168000        7003097.8
    SNP_01:26:50.977 INFO  ProgressMeter - HanXRQChr01:44921389              0.3               2393000        7175053.7
    SNP_01:27:30.988 INFO  ProgressMeter - HanXRQChr01:100839935              1.0               6899000        6896471.3
    SNP_01:27:40.991 INFO  ProgressMeter - HanXRQChr01:113245994              1.2               8068000        6912959.7
    SNP_01:27:50.998 INFO  ProgressMeter - HanXRQChr01:126277305              1.3               9242000        6928728.5
    SNP_01:28:01.072 INFO  ProgressMeter - HanXRQChr01:137251113              1.5              10384000        6914599.6
    SNP_01:31:36.268 INFO  ProgressMeter - HanXRQChr03:27538860              5.1              25690000        5048787.9
    SNP_01:32:54.747 INFO  ProgressMeter - HanXRQChr03:60014081              6.4              27390000        4282129.7
    SNP_01:33:10.593 INFO  ProgressMeter - HanXRQChr03:66700089              6.7              27695000        4158127.5
    SNP_01:40:07.186 INFO  ProgressMeter - HanXRQChr03:133671870             13.6              32467000        2386636.0
    SNP_01:40:23.955 INFO  ProgressMeter - HanXRQChr03:134624322             13.9              32578000        2346585.6
    SNP_01:41:14.643 INFO  ProgressMeter - HanXRQChr03:137302303             14.7              32902000        2233983.7
    SNP_01:41:31.455 INFO  ProgressMeter - HanXRQChr03:137901701             15.0              33000000        2198805.3
    SNP_01:46:37.498 INFO  ProgressMeter - HanXRQChr03:150503471             20.1              34363000        1708848.2
    SNP_01:51:27.068 INFO  ProgressMeter - HanXRQChr03:157679736             24.9              35071000        1406496.0
    SNP_01:51:44.156 INFO  ProgressMeter - HanXRQChr03:157806902             25.2              35102000        1391841.1
    SNP_01:52:01.219 INFO  ProgressMeter - HanXRQChr03:157957495             25.5              35131000        1377458.5

    I've bumped my ram usage up until I've managed to get through all the input, which was about 76GB ram for 2TB of zipped input, for my particular case.

    Another improvement, wrt threading: I assumed the variant "walkers"/visitor patterns would be single-threaded, but allocating 2 cores to the VariantRecalibrator helps greatly. I suspect the GC and other data workers are at play. 1 core is 100% in use during most of the processing, but there are short bursts where CPU usage climbs to all available cpus -- I suspect it's the GC.

    There is an argument to VariantRecalibrator, called --output-model, which is really an operation mode parameter. It allows a scattered approach to the model recalibrator -- but the initial pass for the model creation still needs to go through all your input data. You would use this mode if you have >10000 samples. The other thread I've linked, above, has more details on that front.

Sign In or Register to comment.