reference genome with millions of contigs: How to get HC or UG started

Good day everyone,

I have been starting to use GATK (v. 3.5) with a reference genome that contains 2.6 x 10E6 contigs.
I managed the bam processing steps (realignment) to work with this genome and its bam files, however, for variant calling, GATK does not seem to get past the "ProgressMeter" - "Starting" step for hours, e.g. this is the output of a single sample Haplotype Caller command:

INFO  10:11:29,459 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:11:29,462 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 
INFO  10:11:29,462 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  10:11:29,463 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  10:11:29,466 HelpFormatter - Program Args: -T HaplotypeCaller -I target_indel_realignment/Cierzo.1.bam -R /project/production/Indexes/samtools/hvulgare.WGS_Morex_assembly_v3.fasta -et NO_ET -K /project/production/DAT/apps/GATK/2.4.9/sderdak_pcb.ub.es.key -dt NONE -rf BadCigar -L /scratch/production/sderdak/BARLEY_06/PstI/contigs_for_variantcalling_sorted.intervals --never_trim_vcf_format_field -A AlleleBalance -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A DepthPerSampleHC -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A ClippingRankSumTest -A LikelihoodRankSumTest -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest -A StrandBiasBySample -A StrandOddsRatio -A VariantType -ploidy 2 --min_base_quality_score 10 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --GVCFGQBands 10 --GVCFGQBands 13 --GVCFGQBands 20 --GVCFGQBands 30 --standard_min_confidence_threshold_for_calling 30 --standard_min_confidence_threshold_for_emitting 30 -o /scratch_tmp/20084084/BARLEY_06_Cierzo_PstI_allchr.vcf.gz 
INFO  10:11:29,470 HelpFormatter - Executing as [email protected] on Linux 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15. 
INFO  10:11:29,470 HelpFormatter - Date/Time: 2016/03/11 10:11:29 
INFO  10:11:29,470 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:11:29,470 HelpFormatter - -------------------------------------------------------------------------------- 
WARN  10:11:29,503 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values  for --variant_index_type and --variant_index_parameter 
WARN  10:11:29,505 GATKVCFUtils - Creating Tabix index for /scratch_tmp/20084084/BARLEY_06_Cierzo_PstI_allchr.vcf.gz, ignoring user-specified index type and parameter 
INFO  10:11:29,965 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:15:50,633 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  10:15:50,651 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  10:24:16,731 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 506.08 
INFO  10:27:37,841 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 
INFO  10:27:38,246 IntervalUtils - Processing 192926869 bp from intervals 
INFO  10:27:42,925 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  10:27:43,097 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:27:43,100 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:27:43,103 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining 
INFO  10:27:43,106 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime 
INFO  10:27:43,109 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output 
INFO  10:27:43,111 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output 
WARN  10:27:43,175 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. 
INFO  10:28:13,112 ProgressMeter -        Starting              0.0    30.0 s           49.6 w      100.0%    30.0 s       0.0 s 
INFO  10:28:43,122 ProgressMeter -        Starting              0.0    60.0 s           99.2 w      100.0%    60.0 s       0.0 s 
INFO  10:29:13,124 ProgressMeter -        Starting              0.0    90.0 s          148.8 w      100.0%    90.0 s       0.0 s 
INFO  10:29:43,126 ProgressMeter -        Starting              0.0   120.0 s          198.5 w      100.0%   120.0 s       0.0 s 
INFO  10:30:13,128 ProgressMeter -        Starting              0.0     2.5 m          248.1 w      100.0%     2.5 m       0.0 s 
INFO  10:30:43,130 ProgressMeter -        Starting              0.0     3.0 m          297.7 w      100.0%     3.0 m       0.0 s 
INFO  10:31:13,133 ProgressMeter -        Starting              0.0     3.5 m          347.3 w      100.0%     3.5 m       0.0 s 
[...]
INFO  12:14:13,842 ProgressMeter -        Starting              0.0   106.5 m        10566.7 w      100.0%   106.5 m       0.0 s 
INFO  12:14:43,844 ProgressMeter -        Starting              0.0   107.0 m        10616.3 w      100.0%   107.0 m       0.0 s 
INFO  12:15:13,853 ProgressMeter -        Starting              0.0   107.5 m        10665.9 w      100.0%   107.5 m       0.0 s 
INFO  12:15:43,856 ProgressMeter -        Starting              0.0   108.0 m        10715.5 w      100.0%   108.0 m       0.0 s 
INFO  12:16:13,858 ProgressMeter -        Starting              0.0   108.5 m        10765.1 w      100.0%   108.5 m       0.0 s 
INFO  12:16:43,866 ProgressMeter -        Starting              0.0   109.0 m        10814.8 w      100.0%   109.0 m       0.0 s

An attempt to run a group analysis of 6 samples with UnifiedGenotyper was even running for almost a day outputting only the "starting" log message:

INFO  17:40:44,328 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  17:40:44,330 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 
INFO  17:40:44,330 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  17:40:44,330 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  17:40:44,335 HelpFormatter - Program Args: -T UnifiedGenotyper -R /project/production/Indexes/samtools/hvulgare.WGS_Morex_assembly_v3.fasta -I Cierzo/target_indel_realignment/Cierzo.1.bam -I RIL1/target_ind
el_realignment/RIL1.1.bam -I RIL2/target_indel_realignment/RIL2.1.bam -I RIL3/target_indel_realignment/RIL3.1.bam -I SBCC042/target_indel_realignment/SBCC042.1.bam -I SBCC073/target_indel_realignment/SBCC073.1.ba
m -o BARLEY_06_PstI_allsamples_allchr.vcf.gz -glm BOTH --downsampling_type NONE -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A Coverage -A DepthPerAlleleB
ySample -A FisherStrand -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A InbreedingCoeff -A QualByDepth -A ReadPosRankSumTest -A SampleList -A SpanningDeletions -A TandemRepeatAnnotator -A VariantType --sa
mple_ploidy 2 --phone_home NO_ET -K /project/production/DAT/apps/GATK/2.4.9/sderdak_pcb.ub.es.key -nt 8 -L contigs_for_variantcalling.intervals 
INFO  17:40:44,337 HelpFormatter - Executing as [email protected] on Linux 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15. 
INFO  17:40:44,338 HelpFormatter - Date/Time: 2016/03/09 17:40:44 
INFO  17:40:44,338 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  17:40:44,338 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  17:40:44,885 GenomeAnalysisEngine - Strictness is SILENT 
INFO  17:46:11,736 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  17:46:11,765 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  18:48:15,693 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 3723.92 
INFO  18:52:25,254 IntervalUtils - Processing 192926869 bp from intervals 
INFO  18:52:33,829 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 16 processors available on this machine 
INFO  18:52:34,126 GenomeAnalysisEngine - Preparing for traversal over 6 BAM files 
INFO  18:52:34,346 GenomeAnalysisEngine - Done preparing for traversal 
INFO  18:52:34,352 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  18:52:34,357 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  18:52:34,361 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  18:52:34,558 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. 
WARN  18:52:34,565 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. 
INFO  18:52:34,568 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. 
INFO  18:53:04,397 ProgressMeter -        Starting         0.0    30.0 s      49.7 w      100.0%    30.0 s       0.0 s 
INFO  18:53:34,432 ProgressMeter -        Starting         0.0    60.0 s      99.3 w      100.0%    60.0 s       0.0 s 
INFO  18:54:04,462 ProgressMeter -        Starting         0.0    90.0 s     149.0 w      100.0%    90.0 s       0.0 s 
INFO  18:54:34,492 ProgressMeter -        Starting         0.0   120.0 s     198.6 w      100.0%   120.0 s       0.0 s 
INFO  18:55:04,522 ProgressMeter -        Starting         0.0     2.5 m     248.3 w      100.0%     2.5 m       0.0 s 
INFO  18:55:34,552 ProgressMeter -        Starting         0.0     3.0 m     298.0 w      100.0%     3.0 m       0.0 s 
INFO  18:56:04,582 ProgressMeter -        Starting         0.0     3.5 m     347.6 w      100.0%     3.5 m       0.0 s 
INFO  18:56:34,612 ProgressMeter -        Starting         0.0     4.0 m     397.3 w      100.0%     4.0 m       0.0 s 
INFO  18:57:04,642 ProgressMeter -        Starting         0.0     4.5 m     446.9 w      100.0%     4.5 m       0.0 s 
[...]
INFO  10:56:05,772 ProgressMeter -        Starting         0.0    16.1 h   15250.3 w      100.0%    16.1 h       0.0 s 
INFO  10:57:05,832 ProgressMeter -        Starting         0.0    16.1 h   15250.3 w      100.0%    16.1 h       0.0 s 
INFO  10:58:05,892 ProgressMeter -        Starting         0.0    16.1 h   15250.3 w      100.0%    16.1 h       0.0 s 
INFO  10:59:05,952 ProgressMeter -        Starting         0.0    16.1 h   15250.3 w      100.0%    16.1 h       0.0 s 
INFO  11:00:06,012 ProgressMeter -        Starting         0.0    16.1 h   15250.3 w      100.0%    16.1 h       0.0 s 

I have already tried several tricks to optimize the performance, including:

  • adding the java option -XX:MaxPermSize=2g which helped with fasta indexing and sequence dictionary generation in picard
  • restricting the GATK steps to contigs which actually have reads (specified with the GATK -L option)

Is there anything else I can do to get this to run at all, or faster, or just wait?

Thanks a lot for any suggestions you might have!

Best,
Sophia

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Sophia,

    GATK has a design flaw that makes it use a lot of memory when it handles references with very many contigs such as yours. Increasing available memory may help. If that's still very slow, one thing you can do is create super contigs to reduce the number of contigs in your reference file.

  • SophiaSophia Member

    I was running these commands with -Xmx120g which is the maximum memory I can allocate currently on our cluster.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I see, then you'll need to take the super contigs approach. It's annoying but right now it's the only workaround that I'm aware of.

  • SophiaSophia Member

    Thanks Geraldine. I monitored cpu and memory usage of one of the UG jobs with the following settings:

    • on one entire computing node with 16 cpu and ~125 GB RAM
      using:
      -Xmx120g
      -XX:MaxPermSize=2g
    • GATK -nt 16

    The job ran for 24 hours (outputting "starting"), and consumed over the whole time approx. 1 cpu of computing resources and approx. 65G of RAM. So, allocating more memory will not help here.

    I will try again, just letting it run "forever" and see what happens.

    cheers,
    Sophia

  • SophiaSophia Member
    edited March 2016

    To follow up on this:
    I ran the same command again, allowing for 7 days, using:
    -Xmx120g
    -XX:MaxPermSize=2g
    - GATK -nt 8

    I saw the following in the log of the execution:

    • The GATK command was in "starting" for the first 71.3 hours.
    • After that, it started processing the contigs, as it should, although very slowly.
    • At the end of the 7 days, it had reached contig 50 out of 64311, after being stuck at this same position since 91.9 hours.
    • Over the whole time of the execution, the job consumed approx. 1 cpu of computing resources (which is surprising, considering the -nt 8 argument) and approx. 65G of RAM, so neither memory nor processing capacity were limiting factors here.

    The prediction for remaining time at the end of the 7 days was 4022.3 weeks.

    No variants have been output to the output file yet (contains only the vcf header).

    So I believe we can consider this not viable.

    I will have to go for the reduction of the number of contigs, either chosing a subset of contigs with relevant size and annotation or generating "super-contigs", as Geraldine recommended.

    thanks for the advice,
    best,
    Sophia

  • RyanYeoRyanYeo SingaporeMember

    Hi, I seem to face an error:

    INFO 23:27:32,235 GenomeAnalysisEngine - Strictness is SILENT
    INFO 23:27:32,334 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
    INFO 23:27:32,339 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 23:27:32,368 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
    INFO 23:27:32,577 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO 23:27:32,589 GenomeAnalysisEngine - Done preparing for traversal
    INFO 23:27:32,590 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 23:27:32,590 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 23:27:32,591 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 23:27:32,654 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    WARN 23:27:32,655 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
    INFO 23:27:32,656 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.

    May I know how can the problem be resolved?
    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @RyanYeo
    Hi,

    I am not sure I understand the issue. Are you asking why you get the WARN statement? I am helping here.

    -Sheila

Sign In or Register to comment.