GATK / HaplotypeCaller and a genome w/ 250K references?

Hello,

I am trying to understand what I'm seeing from HaplotypeCaller 3.6. I recently tried to use a genome build that includes the primary chromosomes plus ~250K unplaced contigs. When I run HaplotypeCaller, the stderr is as follows (note the timestamps):

INFO 17:25:43,150 HelpFormatter - Program Args: -T HaplotypeCaller -R 69_Mmul_8.0.1.fasta -I myBam.bam -o output.g.vcf.gz --emitRefConfidence GVCF -A DepthPerSampleHC -A HomopolymerRun --max_alternate_alleles 12 -nct 12
INFO 17:25:43,159 HelpFormatter - Executing as [email protected] on Linux 2.6.32-504.30.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27.
INFO 17:25:43,159 HelpFormatter - Date/Time: 2016/10/31 17:25:43
INFO 17:25:43,159 HelpFormatter - ----------------------------------------------------------------------------------
INFO 17:25:43,159 HelpFormatter - ----------------------------------------------------------------------------------
WARN 17:25:43,165 GATKVCFUtils - Creating Tabix index for SequenceO.work/output.g.vcf.gz, ignoring user-specified index type and parameter
INFO 17:25:43,175 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:00:39,540 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 23:00:39,547 SAMDataSource$SAMReaders - Initializing SAMRecords in serial

Note the roughly 5.5-hour lapse between 17:25 and 23:00. The only thing I can see that's different here is that my genome contains an enormous number of contigs. Is that a reasonable explanation for why the initialization is taking so long? Do people working in other organisms include unplaced contigs in their analyses, or for any reason have a genome with a large number of contigs? Thanks in advance for any help or suggestions.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @bbimber,

    Unfortunately the current version of GATK doesn't deal well with very large numbers of contigs. In our pipeline we restrict analysis to only the canonical chromosomes (using -L). This may improve in the future, but for now it's a constraint you have to work with. In your case, what's taking so long is creating the index for the output file.
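
One way to build the interval list for -L is to filter the contig names in the reference's .fai index. The Python below is only a minimal sketch: the naming pattern for canonical chromosomes and the helper names are assumptions, so adjust the pattern to match your own reference.

```python
import re

# Assumption: primary chromosomes are named like "1", "chr1", "X", "chrY", "MT".
# Adjust this pattern to match the contig naming in your own reference.
CANONICAL = re.compile(r"^(chr)?(\d+|X|Y|M|MT)$")


def canonical_contigs(fai_lines, pattern=CANONICAL):
    """Return contig names from .fai lines whose name matches `pattern`."""
    names = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        # The first tab-separated column of a .fai index is the contig name.
        name = line.split("\t", 1)[0].strip()
        if pattern.match(name):
            names.append(name)
    return names


def write_interval_list(fai_path, out_path):
    """Write one canonical contig name per line to `out_path` (hypothetical helper)."""
    with open(fai_path) as fai:
        names = canonical_contigs(fai)
    with open(out_path, "w") as out:
        out.write("\n".join(names) + "\n")
    return names
```

For example, `write_interval_list("69_Mmul_8.0.1.fasta.fai", "canonical.intervals")` would produce a file that could then be passed as `-L canonical.intervals`, assuming the index file sits next to the reference and the pattern fits its contig names.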

  • bbimber, Home (Member)

    Interesting. Do you use the unplaced contigs for human alignments and other steps upstream of GATK, or do you omit them for everything?

    I'm considering concatenating all the unplaced contigs into a single pseudo 'ChrUn' chromosome, which might improve the situation. GATK isn't the only tool that doesn't seem to work particularly well with this many reference sequences.
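
The concatenation idea could be sketched roughly as follows in Python. This is an illustration only: the function name, the spacer of 100 N bases between contigs, and the offset table are assumptions, not anything GATK prescribes. The offsets are kept so that positions on the pseudo-chromosome can be mapped back to the original contigs.

```python
def build_pseudo_chromosome(contigs, spacer_len=100):
    """Join (name, sequence) pairs into one sequence separated by runs of N.

    Returns the joined sequence and a list of (name, start, end) offsets,
    0-based and half-open, locating each contig in the joined sequence.
    The spacer length is an arbitrary choice for this sketch.
    """
    spacer = "N" * spacer_len
    parts = []
    offsets = []
    position = 0
    for index, (name, seq) in enumerate(contigs):
        if index > 0:
            # Separate neighbouring contigs with a block of Ns.
            parts.append(spacer)
            position += spacer_len
        offsets.append((name, position, position + len(seq)))
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets
```

Reads would need to be realigned against the new reference, since coordinates on the pseudo-chromosome differ from those on the individual contigs.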
