
# The GATK reference model pipeline for incremental joint discovery, in full detail

Cambridge, MA
edited March 2014

Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot), so we're really stoked to finally be able to share the details of how it's going to work in practice.

You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:

The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.

The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.

So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
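The three steps above can be sketched in GATK 3.x-era syntax roughly as follows. This is a minimal sketch, not the canonical pipeline: the jar path, reference, and sample file names are placeholders, and the two variant_index arguments are the ones this thread shows as required for gVCF output in this version.

```shell
# Step 1: per-sample calling in gVCF mode (repeat for each sample).
# All file names below are placeholders.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sampleA.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sampleA.g.vcf

# Step 2 (optional, for very large cohorts): merge batches of gVCFs.
java -jar GenomeAnalysisTK.jar -T CombineGVCFs \
    -R reference.fasta \
    -V sampleA.g.vcf -V sampleB.g.vcf \
    -o batch1.g.vcf

# Step 3: joint genotyping across all gVCFs (or combined gVCFs).
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    -V sampleA.g.vcf -V sampleB.g.vcf \
    -o cohort_raw_variants.vcf
```

Only step 3 needs to be rerun when a new sample arrives, which is what makes the workflow incremental.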

See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.

Let us know what you think!


• The link to the "gory details" just goes to this page. Where can I find these gory details?

• Cambridge, MA

• Hi Geraldine,

I have started this program on my bam file containing all chromosomes (mouse). It managed to call genotypes for 3 chromosomes in 5 days. I cannot run programs for longer than that due to limits in wall time. Do you recommend running each chromosome separately, maybe? Or is there any other way to speed up the calculations?
Thanks and best,

Bettina

• Hi Geraldine,

This new method seems terribly slow. I am running a batch of BAM files which are about 2.5GB each, over only a 14MB region of the genome, and each one is taking about 12 hours. I am using nct=1 because I can't risk using higher values as that was unreliable in the past.

Am I doing something wrong?

java -Xmx4g -Djava.io.tmpdir=javatempdir -jar GenomeAnalysisTK.jar -R canfam3.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L chr13:38900000-53300000 -I AS_case_07_final.bam -o gVCF1_AS_case_07_final_variants.vcf -S STRICT -nct 1

• Cambridge, MA

@Bettina_Harr‌ and @mike_boursnell‌ That's odd, as in our hands this goes much faster than the previous versions. Are you working with very deep coverage? You can certainly parallelize by running on each chromosome individually, to help cut down on runtimes.
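The per-chromosome scatter can be sketched like this for a mouse genome. It only prints one HaplotypeCaller command per chromosome; "echo" stands in for whatever submit wrapper your scheduler uses (e.g. bsub), and the jar path, reference, and BAM names are placeholders.

```shell
# Build one HaplotypeCaller command per mouse chromosome (1-19, X, Y)
# so each can be submitted as its own cluster job.
CMDS=$(for CHR in $(seq 1 19) X Y; do
  echo "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller" \
       "-R reference.fasta -I sample.recal.bam" \
       "--emitRefConfidence GVCF" \
       "--variant_index_type LINEAR --variant_index_parameter 128000" \
       "-L $CHR -o sample.${CHR}.g.vcf"
done)
echo "$CMDS"
```

Each job writes its own per-chromosome gVCF, which you can concatenate afterwards.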

• Hi Geraldine, no my coverage is not extremely high, average 20-fold per site.

• Quite high coverage, 300-400. What is the 128000 parameter?

• Cambridge, MA

The 128000 parameter has to do with the index compression scheme -- I think it specifies block size.

Let me call in some of the devs who have been working on the HC's performance. Are you running 3.0 or 3.1?

• Hi Geraldine, ok, I have restarted the program processing one chromosome at a time. Looking at my intermediate output, I noticed that the software issued a warning:

WARN 02:33:07,773 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:65884356 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?

Thanks!

• oh, I am running 2.8.1. Did not know there was already a new version. Could this be part of the reason? I have done all my pre-processing now with 2.8.1. Can I switch to 3.1 with the HaplotypeCaller? Or should I stay with 2.8.1?
Thanks!

• Cambridge, MA

I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?

Yes, this is typically due to indels in a context with homopolymers and/or repeats. You can lower the maximum number of alt alleles if you want, which will make the program run faster.

I am running 2.8.1.

You should definitely switch to 3.1; this new method is specifically designed for work in 3.1, and involves new code that is not in 2.8.

• ok, but I do not need to repeat the pre-processing, i.e. indel realignment and base recalibration, when switching to 3.1 now, correct?

• Charlestown, MA
edited March 2014

Correct, @Bettina_Harr‌: you don't need to reprocess (BQSR & Indel Realignment) to switch. Just run the HaplotypeCaller. I actually didn't even know that we made the reference model available in 2.8; it was very experimental at that point. Try 3.1 and let us know how it goes.

• Cambridge, MA

@mike_boursnell To clarify, are you finding that running HC in GVCF mode is much slower than previously?

• Yes. That's right.

• Cambridge, MA

@mike_boursnell‌ Do you have any comparison data, e.g. runtimes for running two different commands on the same data? We'd like to get a sense of what the difference is. The HC does do quite a bit more work in GVCF mode, so it's expected that the upfront runtime will be a little more, but not orders of magnitude.

• OK, I'll have a look. I'm not sure how to compensate for other things running on our server at the same time. Any suggestions?

• Cambridge, MA

Not really -- if you were able to reserve a set of nodes that would be ideal of course. Not knowing your setup/access I can't really advise. We have dedicated machines for testing this kind of thing so it's a lot easier.

• Hi Geraldine,
I ran GATK 3.1 UnifiedGenotyper (UG) and HaplotypeCaller (HC) on whole-genome NA12878. With UG I get my VCF file with all variants called (nearly 4 million), but when I run HC I obtain far fewer variants (nearly 1.3 million), and in the HC log I see this warning (a lot of them):

WARN 05:26:17,926 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:7458214 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

My BWA alignment is OK; I see only two alternate alleles (diploid). Do you know the reason for this warning? Thanks.

• Cambridge, MA

Hi @sergiomv‌,

First, I would say that the different numbers of variants are most probably unrelated to the warnings you're seeing. I can't really comment on those numbers; you should examine the overlap and the major differences between the two callsets. Have a look at the distribution of annotation values. See if you are missing major blocks of variants, or if there is a correlation with annotation values, e.g. if the variants HC is missing all have low MQ in the UG callset. HaplotypeCaller is more stringent about things like MQ than UG.

Regarding the alleles question, I would bet that HC is seeing multiple possible deletion alleles in your data, because of the low complexity of the context. Keep in mind that the HC does a de novo reassembly step with the data so it may come to a different alignment than BWA. And since considering multiple alleles decreases performance, the HC is set to only proceed with the most likely alleles and discard the rest. You don't need to worry about this.

• edited April 2014

Hi Geraldine, a couple of questions:
1) I ran the Haplotype Caller and in the end I am getting a short summary of my reads.

INFO  05:16:33,184 ProgressMeter - Total runtime 30549.58 secs, 509.16 min, 8.49 hours
INFO  05:16:33,185 MicroScheduler - 5481436 reads were filtered out during the traversal out of approximately 21788872 total reads (25.16%)
INFO  05:16:33,186 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  05:16:33,186 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  05:16:33,187 MicroScheduler -   -> 5434226 reads (24.94% of total) failing HCMappingQualityFilter
INFO  05:16:33,187 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  05:16:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  05:16:33,188 MicroScheduler -   -> 47210 reads (0.22% of total) failing NotPrimaryAlignmentFilter
INFO  05:16:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter


I am wondering about those 25% reads that were "filtered" during traversal. Could you tell me if I have to be concerned about this, and what that actually means?

• edited April 2014

2) I am still struggling with hitting the wall time when I am trying to run HaplotypeCaller on the "large" mouse chromosomes, i.e. chr1. It takes more than 4 days. We have a large computing cluster; I tried to start Java with 16 processors and 20g RAM, but the program would not run any faster. I also set the max alternate alleles to 2 hoping it would save some time, but it does not. I attach my jobscript below. Is there anything else I could try to speed things up? I am now using version 3.1.1.

#!/bin/sh
#BSUB -q fat-long
#BSUB -R scratch
#BSUB -W 120:00
#BSUB -o TP7-10F1A2.recal.bamX_out
#BSUB -e TP7-10F1A2.recal.bamX_error
#BSUB -R 'span[hosts=1]'
#BSUB -n 16
java -Xmx20g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF

• Cambridge, MA

1) HaplotypeCaller applies more stringent quality filters than our other tools. Any reads that have a mapping quality less than 20 (if I remember correctly) will be ignored. In your case it looks like many reads have a mapping quality that's considered too low to be useful, unfortunately. This depends mainly on the aligner you used and the quality of your data.

2) You can try using a more aggressive pruning setting (see here) and/or adding a layer of multithreading if your cluster allows it (see here). In addition, if this is exome data, you can provide a list of capture targets using the -L argument.

• thank you Geraldine. One more short question: does it make sense to set anything for the -Xmx parameter?

• Hi Geraldine,
FYI, the -nt option does not seem to be supported in HaplotypeCaller. I am testing the -nct option now.

INFO 01:45:01,617 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,620 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 01:45:01,621 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 01:45:01,626 HelpFormatter - Program Args: -T HaplotypeCaller -nt 2 -nct 4 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF
INFO 01:45:01,630 HelpFormatter - Executing as bharr@gwda036 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
INFO 01:45:01,631 HelpFormatter - Date/Time: 2014/04/23 01:45:01
INFO 01:45:01,631 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,632 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:02,319 GenomeAnalysisEngine - Strictness is SILENT
INFO 01:45:02,461 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 01:45:02,471 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 01:45:02,522 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
INFO 01:45:02,595 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 01:45:02,604 IntervalUtils - Processing 171031299 bp from intervals
INFO 01:45:02,617 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 4 CPU thread(s) for each of 2 data thread(s), of 64 processors available on this machine

##### ERROR ------------------------------------------------------------------------------------------
• Cambridge, MA

@Bettina_Harr‌ If you're using -nct (which is what I meant to recommend, sorry if that wasn't clear) you may want to increase the memory allocation, yes. See the FAQs on multithreading for some general guidelines.

• Unfortunately, I am not very experienced with parallelism. I ran the code overnight WITHOUT a special -Xmx parameter (I did not know what reasonable value I should use in multi-processing mode). It runs fine for about 5 h and then exits with exit code 1. The error file contains this at the end. Do you know what could be the problem? Did it simply run out of memory? I used -nct 6 and reserved 10 processors in my jobscript, to be on the safe side.

INFO 01:42:29,369 ProgressMeter - 1:85086078 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h
INFO 01:43:29,371 ProgressMeter - 1:85087145 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h

##### ERROR stack trace

java.lang.NullPointerException
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

• Cambridge, MA

This looks like it might be a bug, actually. If you can isolate the region in which this issue occurred and produce a test case, we'll have a look and try to fix it. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

• Hi Geraldine, ok, I performed the following test: the program generated the error after position 85299925 on chromosome 1 (see error below). Therefore I re-ran it with the -L option (-L 1:85298925-85309925):

java -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -L 1:85298925-85309925 -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 16B.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o 16B.recal.bam1.gVCF1

This generated no error and a normal-looking gVCF file. I think it is not a bug in itself, as the error only arises with chromosome 1 files, and not with any other chromosome of the same individual mouse. I have now rerun the chr1 files with the -Xmx option set to 20g, reserving 20 processors for it (each processor has 4 GB of memory allocated to it).
The run is still waiting in the queue, so I will only be able to say what happened tomorrow morning. When I start the program interactively and then use "top" to monitor the memory, I can see it uses 18.6 GB. Maybe -Xmx20g is still not enough?

PID   USER   PR  NI  VIRT   RES   SHR  S  %CPU   %MEM  TIME+    COMMAND
23772 bharr  20  0   18.6g  842m  11m  S  781.1  1.3   0:16.51  java

error file:

INFO 22:32:03,103 ProgressMeter - 1:85296437 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
INFO 22:32:33,105 ProgressMeter - 1:85299925 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
WARN 22:32:33,427 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at 1:85287439 has 4 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace

java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(Unknown Source)
at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

• Cambridge, MA

Ack, this is worse than I thought. "ConcurrentModificationException" means this is most likely a problem with multithreading; it seems several threads may be trying to modify the same information at the same time, which is bad and causes the crash you see. This might be triggered by something in your data. At this point I would recommend not using multithreading at all with HaplotypeCaller.

Instead of multithreading, you may want to run HaplotypeCaller individually per chromosome, then concatenate the resulting variants (e.g. using the CatVariants utility). This will let you keep using your cluster for parallelism without running into this thread-safety issue.

• I've been wondering about that. I've been using both -nct and -pairHMM VECTOR blah when running in gvcf mode, and I have been experiencing random crashes that aren't reproducible (I run it again and it goes through fine) on human exomes. About 1 or 2 crashes per 100 runs so far. My stack trace is not the same as above... I haven't had time to look at all of them, but below is my latest crash that I've already rerun through... I'm actually going to do a sanity check through my last 150 to see if there were any that crashed that I missed.
##### ERROR stack trace

java.lang.NullPointerException
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51)
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41)
at java.util.TimSort.binarySort(TimSort.java:265)
at java.util.TimSort.sort(TimSort.java:208)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at java.util.Collections.sort(Collections.java:217)
at org.broadinstitute.sting.utils.sam.ReadUtils.sortReadsByCoordinate(ReadUtils.java:320)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.finalizeActiveRegion(HaplotypeCaller.java:1111)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1001)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:808)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)

##### ERROR MESSAGE: Code exception (see stack trace for error itself)



• Hi Geraldine, I am already running everything on a per-chromosome basis, and this is where I hit my runtime limits with chromosome 1. It takes more than four days, even when allowing 20g when I start Java. It seems that maybe I have to subdivide chromosome 1 into chunks to be able to do it :-(!

• Cambridge, MA

Wait - you're saying it's taking four days per chromosome per sample? What kind of coverage do you have?

• medium coverage ~20 fold.

• edited April 2014

these are my CPU times (in seconds):
TP7-10F1A2.recal.bamMT_out: CPU time : 73

TP7-10F1A2.recal.bamX_out: CPU time : 26442

TP7-10F1A2.recal.bamY_out: CPU time : 42362

TP7-10F1A2.recal.bam19_out: CPU time : 45158

TP7-10F1A2.recal.bam15_out: CPU time : 70149

TP7-10F1A2.recal.bam18_out: CPU time : 91582

TP7-10F1A2.recal.bam16_out: CPU time : 113535

TP7-10F1A2.recal.bam14_out: CPU time : 116572

TP7-10F1A2.recal.bam12_out: CPU time : 123313

TP7-10F1A2.recal.bam17_out: CPU time : 132176

TP7-10F1A2.recal.bam3_out: CPU time : 133359

TP7-10F1A2.recal.bam7_out: CPU time : 143940

TP7-10F1A2.recal.bam8_out: CPU time : 145699

TP7-10F1A2.recal.bam6_out: CPU time : 148685

TP7-10F1A2.recal.bam11_out: CPU time : 150079

TP7-10F1A2.recal.bam9_out: CPU time : 153481

TP7-10F1A2.recal.bam10_out: CPU time : 160582

TP7-10F1A2.recal.bam13_out: CPU time : 163721

TP7-10F1A2.recal.bam5_out: CPU time : 196669

TP7-10F1A2.recal.bam2_out: CPU time : 253768

TP7-10F1A2.recal.bam4_out: CPU time : 258922

TP7-10F1A2.recal.bam1_out: CPU time : 355144

• Cambridge, MA

That seems awfully slow. Unfortunately I can't really help you deal with performance and platform issues as that is beyond the scope of support I can provide, sorry.

• Hi Geraldine,

ok, I got it to work. With this configuration, the whole of chromosome 1 takes a little more than 1 day. So it is not a bug in the -nct option after all; one just seems to need tons of memory.

#BSUB -n 20

java -jar -Xmx40g /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 14.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L 1 --variant_index_type LINEAR --variant_index_parameter 128000 -o 14.recal.bam1.gVCF

• Hi. Can we use this gVCF method to analyse a bunch of BAM files that are all from the same sample (just so they can be analysed in parallel)? If so, how would you have to set up the Read Group information in these BAM files?

• Cambridge, MA

@mike_boursnell I'm not sure what you mean, but if you're saying you want to compare the BAMs as if they belonged to different samples (e.g. to compare results on technical replicates), then all you need to do is give them different SM tags so that GATK treats them separately when you do the GenotypeGVCFs step.

• This is for whole genome sequencing of a single sample. We are looking at splitting the single FASTQ file into (say) 10 smaller bits and then processing them in parallel. When it gets to the HaplotypeCaller gVCFs stage there will be 10 separate BAM files, but which are all actually the same sample. If we give them the same SM tag will GATK assume they are the same sample and produce a single-column VCF file?

• Cambridge, MA

@mike_boursnell‌ Oh OK, got it. Yes, you should be able to just pass in the separate bam files. As long as they have the same SM tag, the data will get re-aggregated by the engine and served appropriately to the HC, leading to a single-sample VCF file.
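A minimal sketch of that invocation, assuming the split BAMs carry identical SM tags in their @RG headers (all file names below are placeholders):

```shell
# All split parts of the one sample are passed together with repeated -I flags;
# because they share the same SM tag, the engine re-aggregates them into a
# single sample and HaplotypeCaller emits a single-sample gVCF.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I sample_part01.bam -I sample_part02.bam -I sample_part03.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sample.g.vcf
```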

• Hi Geraldine,
I have now completed the variant calling using HaplotypeCaller and recalibrated my variants. It all worked, but I am now in a position where my output files contain ONLY variable sites, and not the invariable ones. The VCF file that GenotypeGVCFs produces no longer contains the invariable sites. I would like to have the info on all sites in the final VCF file, even for those sites that are not variable, as I'd like to see their coverage. Is this possible at all at this point?

• Cambridge, MA

Yes, you can get that using the -inv argument with GenotypeGVCFs. However, in the current version this will cause no-call site records to be emitted, so the hom-ref sites won't have much information attached to them. I believe the next version will produce detailed records at all sites.
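For example, a sketch using the flag named above (the jar path and file names are placeholders):

```shell
# GenotypeGVCFs with non-variant site emission enabled (-inv, per the reply
# above), so hom-ref sites also appear in the output VCF.
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf -V sample2.g.vcf \
    -inv \
    -o all_sites.vcf
```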

• thank you Geraldine! I have another question. The R script does not automatically run on my machine, so I was trying to run the script manually by typing:

library(ggplot2)

source("7.recalibrate_SNP_plots.R")

Error: 'opts' is deprecated. Use 'theme' instead. (Defunct; last used in version 0.9.1)

I am getting the error above.

I am using ggplot2 1.0.0.

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.1.0 (64-bit)

Simply replacing "opts" with "theme" does not do the trick; I then get another error:

Error: theme_rect is deprecated. Use 'element_rect' instead. (Defunct; last used in version 0.9.1)

Thanks!

Bettina

• Cambridge, MA

There are a few other functions related to the opts -> theme change that also need to be adapted. The error message tells you which. Just step through the script and change every one that throws an error in the same way.
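If you'd rather patch the script in one pass, a sed sketch like the following may help. The opts -> theme rename is confirmed by your errors; the other substitutions are the standard ggplot2 0.9.2 deprecation renames (theme_rect -> element_rect, theme_line -> element_line, theme_text -> element_text, theme_blank -> element_blank), so verify them against any remaining error messages. GNU sed syntax; -i.bak keeps a backup of the original script.

```shell
# Batch-apply the ggplot2 0.9.2 renames to the GATK plotting script.
sed -i.bak \
    -e 's/opts(/theme(/g' \
    -e 's/theme_rect(/element_rect(/g' \
    -e 's/theme_line(/element_line(/g' \
    -e 's/theme_text(/element_text(/g' \
    -e 's/theme_blank(/element_blank(/g' \
    7.recalibrate_SNP_plots.R
```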