GATK on RAD Data - extraordinarily long run times

Hi all,

I have Type 2B RAD data from many individuals from several populations of my non-model species, mapped using Bowtie 0.12.8 to a reference database made by extracting all potential RAD sites from the available genome. I would like to run a first-pass UnifiedGenotyper run on a single individual, but even on a supercomputer and using the -nct and -nt flags, GATK says it will need 4.9 weeks to finish!
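For reference, my invocation looks roughly like this (file names, memory, and thread counts here are placeholders rather than my exact command):

    # GATK UnifiedGenotyper, SNPs only (Bowtie gives ungapped alignments)
    java -Xmx8g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R rad_tags.fasta \
        -I individual1.bam \
        -glm SNP \
        -nt 8 -nct 4 \
        -o individual1.raw.vcf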

A collaborator suggested that GATK may just not handle many reference contigs well, but I have already reduced my reference database from the 1.6 million possible tags to the 95,000 tags that were seen at least 100x across all my individuals.
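In case the details matter, the coverage filter I applied can be sketched like this (assuming all individuals are merged into one sorted, indexed BAM; the commands are illustrative, not my exact pipeline):

    # samtools idxstats prints one row per contig: name, length, mapped, unmapped
    samtools idxstats all_individuals.bam \
        | awk '$3 >= 100 { print $1 }' > tags_100x.txt

    # extract just those tags from the full reference
    # (samtools will build the .fai index if it is missing)
    xargs samtools faidx rad_tags_full.fasta < tags_100x.txt > rad_tags_min100x.fasta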

Does GATK respond to the number of contigs like this? Are there any tips you can give me to reduce the amount of time necessary to something more reasonable?

Thank you!



  • Carneiro · Charlestown, MA · Member

    Hi Roxana, by 1.6 million do you mean the size of the 'fasta reference' you're using, or the number of contigs in this reference?

    In general, the scaling factor for the GATK is going to be the reads (the BAM files), not the reference, unless your reference is orders of magnitude larger than the human reference. It's hard to say what is causing the slow behavior without seeing your data.

    • How large is your reference? (The number of contigs shouldn't matter as much as its total size.)
    • How much coverage do you have per sample?
    • Are you trying to make calls on all at the same time?
    • Are you calling SNPs and indels, or just SNPs?
    • Is your organism diploid?
  • Roxana · Member

    Hi Dr. Carneiro,

    • I have ~1.6 million unique RAD sites/contigs in my original reference file, each 36 bp long. I was able to reduce that reference db to ~95,000 contigs. When I use the full reference db, GATK says it will take 23 weeks; with the smaller db, it will "only" take 4.9 weeks.
    • For the individual I am working with first, my average coverage across all tags is 93x for 70,303 sites (2,530,908 total bases).
    • Yes, I am trying to call all SNPs at once, instead of working with a subsetted region.
    • I'm only interested in SNPs and not indels at the moment; in fact, I'm using Bowtie 1 to map the reads, which doesn't support gapped mapping (a sketch of the mapping command is below this list). I'm not positive this is the best way to approach this, but it seemed the simplest for this preliminary analysis.
    • Yes, my organism is diploid.
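    As mentioned above, a sketch of the mapping step (the index name and filter options are illustrative, not my exact flags):

        # Bowtie 0.12.x: ungapped alignment, SAM output, unique hits only
        bowtie -S --best -m 1 rad_tags_index individual1.trimmed.fastq individual1.sam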

    Thank you!

  • Roxana · Member

    Update! Someone on SeqAnswers suggested that mapping to the whole genome instead of to the extracted potential tags may improve things. I mapped a single individual to the genome (12.5k contigs), then ran the UnifiedGenotyper on that sample. We're down to 51 minutes! Very interesting that the number of contigs in the reference makes such a huge difference (12.5k contigs/1 hour vs. 95k contigs/4.9 weeks...).

    However, I'm not 100% convinced that mapping to the genome is the correct way to handle Type 2B RAD tags. I don't have data to support my gut feeling yet, but it seems preferable to restrict possible mapping sites to regions that definitely contain the restriction site. I can imagine situations where reads are mismapped when given the full genome. Do you have any opinions about which reference I should be using here?

    My uncertainty aside, only one reference choice is computationally feasible.

  • Carneiro · Charlestown, MA · Member

    This sounds like something we should take a look at and try to optimize. There is no reason for two references of the same size but different numbers of contigs to have such different runtimes. Is your data shareable? We would love to use it as a test case to find out what is going on here.

    On human data, the recommended approach is to always align to the entire reference, even for targeted sequencing, because there is always some off-target sequence. If you limit your alignment to the captured region, the aligner will try as hard as it can to force-align reads there, even when they don't belong, and the resulting badly mapped regions will inflate your false positive rate.

    For other organisms where the reference is not so well defined, I don't have an answer. My gut reaction would be to align to the whole genome, for the same reason, but if the alignments are bad, you may be biasing yourself the other way. It is hard to say without data to back either approach.

    Let me know if you can share your data and we will have someone take a look at the performance issue.

  • Roxana · Member

    Just to be clear, the full genome has many more bases than my in silico RAD-digested reference, just distributed across fewer contigs. I am more than happy to share the files I'm working with.

    Upon more thought, I think that mapping to the genome is actually a better idea than mapping to my digested potential sites after all. While mapping to the genome may result in some "sloppily" mapped reads, mapping to the digested fragments may force bad matches instead. After discussion with a few collaborators, and as this is the only computationally viable option anyway, I'm on board with this method.

    What is the best way to share the data? I can give you the full genome I'm working with, the file with all potential in silico-digested sites extracted, a file with the RAD sites actually seen in my data, and one of the trimmed/filtered sample .fastq files.

    Thank you!

  • Carneiro · Charlestown, MA · Member

    I'd rather have the files you're using in your run of the GATK. That would be the BAM file and the RAD-digested reference fasta. This way we can replicate your runtime and figure out if there is anything we didn't provision for when traversing this highly chopped reference.

    Instructions for submitting your data are here.

    Thank you,

  • jeremy_she · Thailand · Member
    edited July 2014

    I am finding the same thing, mapping to the rubber tree genome, which unfortunately has about 1.2 million contigs. Using 6 samples, each with around 100M reads, GATK says it will take 100 weeks! I checked with versions 2.8.1 and 3.1.1. From what I can see, the problem seems to be with multi-threading: I am using -nt 8 and -nct 4, but the node I am using rarely uses more than a single CPU. Perhaps the multi-threading code distributes the workload of the current reference sequence rather than distributing individual reference sequences?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    Hi @jeremy_she,

    Unfortunately the GATK is not built to handle such large numbers of contigs. The main problems are at the level of memory and I/O (too many files need to be held open at the same time), and multi-threading just aggravates those issues. To handle RAD-seq data, I would recommend combining many of the contigs into supercontigs. You can use stretches of Ns as spacers to prevent reads from mapping across the junctions; a few hundred Ns should do the trick (for longer reads, just increase the size of the spacers).
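    A minimal sketch of the supercontig construction (it assumes single-line FASTA records, which short RAD tags usually are; the spacer length and tags-per-supercontig count are arbitrary choices, not magic numbers):

        # join tags with 300-N spacers, 10,000 tags per supercontig
        awk -v spacer="$(printf 'N%.0s' $(seq 1 300))" -v per=10000 '
            /^>/ { next }                       # drop the individual tag headers
            {
                if (n % per == 0) {             # start a new supercontig
                    if (n > 0) printf "\n"
                    printf ">supercontig_%d\n", ++k
                }
                printf "%s%s", $0, spacer       # tag followed by its N spacer
                n++
            }
            END { if (n > 0) printf "\n" }
        ' rad_tags.fasta > supercontigs.fasta

    Remember to re-align your reads to the new reference, and keep a record of each tag's offset within its supercontig if you need to translate the resulting VCF coordinates back to individual tags.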

  • Hi @Geraldine_VdAuwera,
    I have called SNPs and indels on a non-model species following the Best Practices for Exome/Panel + WGS. The reads are from RAD-seq. I combined the many contigs into supercontigs, and the run time is now very short. However, the depth of coverage in my data is very low (about 2x expected), and the results do not look good. Do you have any suggestions about the procedure for calling SNPs and indels?
    1. Do you recommend recalibrating base quality scores?
    2. When calling SNPs and indels, could the artificial LD influence the results from HaplotypeCaller and GenotypeGVCFs?
    3. Do you have any suggestions for hard filtering on QualByDepth, FisherStrand, RMSMappingQuality, MappingQualityRankSumTest, and ReadPosRankSumTest? I think FisherStrand and ReadPosRankSumTest are not suitable here; a sketch of the kind of command I mean is below.
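    For reference, the kind of hard-filter command I have in mind (the thresholds are the generic starting points from the documentation, not values I have validated on my data):

        # VariantFiltration with FS and ReadPosRankSum left out, per my concern above
        # (QD = QualByDepth, MQ = RMSMappingQuality, MQRankSum = MappingQualityRankSumTest)
        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R supercontigs.fasta \
            -V raw_snps.vcf \
            --filterExpression "QD < 2.0 || MQ < 40.0 || MQRankSum < -12.5" \
            --filterName "rad_hard_filter" \
            -o filtered_snps.vcf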
    Thanks a lot.

  • Sheila · Broad Institute · Member, Broadie, Moderator


    What do you mean by "the results do not look good"?

    You may find this thread and this thread useful.


    P.S. 2x coverage is really not useful because you won't be able to tell sequencing errors from actual variants.
