To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

UnifiedGenotyper takes too long to init with big reference

Hello, I'm using GATK version 3.7-0-gcfedb67 in a Linux 3.13.0-93-generic x86_64 with 64 CPUs
UnifiedGenotyper is taking way to long to run with a large, fragmented fragmented reference (>300,000 contigs / 1.3 Gb genome size).
I'm targeting ~5,000 contigs with capture seq. When I map to the subset of contigs everything goes really fast in GATK. A test with 8 samples shows that takes UnifiedGenotyper 15 minutes to run.
The problem is that I want to map to the whole reference to avoid forcing the mapping of un-targeted reads onto unspecific regions. When I do map to the whole reference the UnifiedGenotyper starts to be very slow, no matter what I do.
If I use the raw bam and the whole reference it takes 48 hours to run!
If I use an interval bed file it still takes 21 hours to run!
The culprit seems to be the step labelled "GenomeAnalysisEngine - Strictness is SILENT" and after "initializing BAM"
These two steps takes seconds for the subset reference, but several hours when using the whole reference, even using the intervals file

Here's a portion of the log:

INFO 14:50:15,323 HelpFormatter - Date/Time: 2017/10/27 14:50:15
INFO 14:50:15,323 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:50:15,324 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:50:15,346 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:16:29,036 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 23:16:29,062 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 23:17:34,078 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 65.02
INFO 07:40:07,489 IntervalUtils - Processing 502876553 bp from intervals
INFO 07:40:08,511 GenomeAnalysisEngine - Preparing for traversal over 8 BAM files
INFO 07:40:08,637 GenomeAnalysisEngine - Done preparing for traversal
INFO 07:40:08,638 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 07:40:08,639 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 07:40:08,640 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 07:40:08,778 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
WARN 07:40:08,779 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 07:40:08,780 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO 07:40:38,652 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s

I'm using a standard command:
java -Xmx20g -jar ${GATKLoc}GenomeAnalysisTK.jar \ -R ${REF} \ -T UnifiedGenotyper \ -I ${BAMlist} \ -mbq 20 \ -L contigsDP5_halfIndivduals.bed \ -o SNPcall_${RUNname}_intervals.raw.vcf >> $LOG 2>> $ERR

Lastly, if I try to subset both the bam files and the reference to the regions of the interval bed file, it complains that cannot find the first contig. But it still takes several hours to init!

INFO 14:55:42,119 HelpFormatter - Date/Time: 2017/10/27 14:55:42
INFO 14:55:42,119 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:55:42,120 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:55:42,140 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:56:43,750 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 14:56:43,757 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:57:51,236 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 67.48
INFO 14:58:14,530 GenomeAnalysisEngine - Preparing for traversal over 8 BAM files
INFO 21:52:09,996 GenomeAnalysisEngine - Done preparing for traversal
INFO 21:52:09,997 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 21:52:09,997 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 21:52:09,998 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 21:52:10,064 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
WARN 21:52:10,064 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 21:52:10,065 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Badly formed genome location: Contig Contig0 given as location, but this contig isn't present in the Fasta sequence dictionary

I want to scale up to hundreds of samples so I'm afraid that it will take weeks!
Is there a way to speed up the process
Thanks in advance!

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Hernan
    Hi,

    It is true GATK tools are not designed to deal with too many contigs in the reference. Have a look at this thread for some helpful tips.

    In your case, it looks like the bed file contains Contig0, but that contig is not present in your reference. Can you confirm Contig0 is in the FASTA .dict file?

    Thanks,
    Sheila

  • HernanHernan MelbourneMember

    Thanks for the quick reply @Sheila
    Yes I read those threads before I was wondering if something has changed since? Also I think this is slightly a different issue, because I don't want to call SNPs for the whole genome, but for a rather small set of ~5K contigs. I guess the question is why if I use the interval file, GATK still needs to load the whole reference, why it takes so long, and if there's a way around it?
    When I use the interval bed file the program works as expected, it just take 20 hours (instead of 15 minutes when using a reduced ref)
    GATK complains when I use a subset bam and a subet ref, because contig 0 is still in the bam header but not in the reference (this is my best guess).

    best,
    H

  • HernanHernan MelbourneMember
    edited October 2017

    I have used the approach of putting all "non-target" regions together into a single super scaffold + using an intervals bed file, and it worked great. It took 30 min to complete and render as many SNPs as using only the intervals file (which takes 20 hours to complete). So intervals or subsetting bam files does not work, merging the reference does
    thanks, H

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Hernan
    Hi H,

    I guess the question is why if I use the interval file, GATK still needs to load the whole reference, why it takes so long, and if there's a way around it?

    GATK tools still load the whole reference you input regardless of the interval file. Unfortunately, that is just the way the tools work, and there is no way around it.

    When I use the interval bed file the program works as expected, it just take 20 hours (instead of 15 minutes when using a reduced ref)

    Yes, I was going to say you can try using just the pieces of the reference you are interested in. We have a way to fix the BAM file and only represent the contigs in the reference snippet here.

    Glad to see you found a workaround on your own :smile:

    -Sheila

Sign In or Register to comment.