The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

# Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:

• If you want to analyze more than 100 samples at a time (for performance reasons)
• If you are working with non-diploid organisms (UG can handle different levels of ploidy while HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)

In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 11Member

Dear Geraldine, I have to different setups:

1. I have a pool of individuals of a F2 generation. These individuals all decend from two diploid parents. So I still end up with a diploid pool and would use HC, right?
2. I have a pool of non related diploid individuals, so I have to use UG?

Best regards,

David

Hi David,

If you are doing true pooled calling, you need to use UG and set the ploidy to (real ploidy per individual) x (number of individuals).

Geraldine Van der Auwera, PhD

• University of KentuckyPosts: 3Member

Is HaplotypeCaller appropriate for making genotype calls from haploid progeny (individuals, not pools) from a diploid parent?

HaplotypeCaller can only genotype diploid individuals. For haploids, please use UnifiedGenotyper.

Geraldine Van der Auwera, PhD

• University of KentuckyPosts: 3Member

hmm, not the answer I expected. Is there some kind of intrinsic assumption about levels of herterozygosity? which program would one use to genotype isogenic dipliod lines?

HaplotypeCaller assumes that the organism is diploid and therefore looks for allele frequencies around 0, 0.5, 1. With UnifiedGenotyper you can set ploidy to whatever you want and the model will be adapted accordingly. Neither makes any assumptions about levels of heterozygosity if by that you mean the frequency at which you find het sites in your organism.

Keep in mind that we work exclusively on human genomes, so we may not be aware of the particularities of other types of organisms. If you need information that is tailored to those peculiarities, you need to clarify in more detail what you are working with and what are the concerns/constraints.

Geraldine Van der Auwera, PhD

• University of KentuckyPosts: 3Member

OK, that makes sense!

• CaliforniaPosts: 8Member
edited April 23

Hi Geraldine,

My understanding is that for larger cohorts (e.g. over 100 samples) HC is now also recommended. However, the FAQs page still recommends UG in this case so just wanted to mention that. Looks like the website is undergoing a lot of updates so this particular update to the recommendations might already be in the works but just wanted to mention it in case it is helpful.

Post edited by crojo on

@crojo, thanks for pointing this out! Indeed, we're in the middle of updating everything so a few pages are still lagging, but as a matter of fact I had forgotten about that FAQ article. I'll update this asap to avoid confusion.

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Hi Geraldine,

I recently analysed some targeted gene-sequencing data using HC with Best Practices 3.x using the gvcf option. The aim was to identify rare mutations. Although it worked well for all variants, there was one that it wasn't able to detect. A sequencing platform analysed the same fastq files and called variants using UG, pileup, mpileup, and HC with an earlier version of Best Practices and they found an indel with UG and pileup that HC missed. I verified the indel by Sanger sequencing, so it is real and indeed the one causing the disease. Thought it may be useful for the team to see why HC missed it. The region where the indel lies is highly repetitive and polymoprhic, so that may explain it, but I did find the same indel by visual inspection, and it is present in ~30% of the reads. I'd like to share the BAM and VCFs with you if you are interested in taking a look. Can I upload this onto the ftp? Are there any other files/info you will need? I can give you my command lines but not those of the sequencing platform.

Thanks, Megana

Hi Megana,

This is indeed interesting -- could you please post your command lines? Ideally we'd want the center's commands too to find out what they did differently. But we can start by looking at yours and see if we can identify any potential reasons why the indel would be missed. A screenshot would be good too, please. Depending on what we find on first examination we'll decide whether we need to see the actual data. Thanks for volunteering this information!

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Hi Geraldine,

Sure, here are my command lines to generate the raw calls.

java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -I $ARGV[$n] -L $L --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o${dir}/${filename}.gvcf java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta @variant -o${dir}/\${project_name}.rawvariants.vcf

I have also attached a screenshot of the BAM file I generated with the deletion clearly visible and the VCF generated by the sequencing centre with the variant.

Let me know if you need anything else.

Thanks, Megana !

Thanks -- this does look like something HC should call. One more thing to try -- can you please run HC on this region with -ERC BP_RESOLUTION and the -bamout argument to output the reassembled reads? This will tell us what HC is seeing once it has performed reassembly. You can compare the output bam to the original and see if reads have moved. It can also show if HC didn't actually process parts of the region of interest, which happens in rare cases. Please show the IGV screenshot and the relevant lines from the GVCF (in full).

Incidentally (not causing the problem but FYI), you don't need to pass confidence threshold parameters to HC when running in ERC mode, because they are set to 0 internally. You can pass them to GenotypeGVCFs, however.

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Hi Geraldine,

Thanks for the pointer re: the confidence threshold paramaters. So I ran HC in BP_RESOLUTION mode and have attached a screenshot of the resulting bam (top) in comparison with the old bam (bottom). Indeed, there weren't any reads assembled in the region of interest, which explains the lack of variant calling! Why is that? I have also attached a screenshot of the resulting gvcf with the deleted base in yellow.

Thanks, Megana

Hi Megana,

Looking at the original bam files, it seems like there are a lot of indels and SNPs around your site of interest. This is usually a sign of bad mapping.

Because the reads are not present in the bamout file, the region is not being called as an active region. Have you checked the mapping qualities of the reads in the region, as well as the base qualities around the site? Active regions may not be triggered because of low quality scores.

I hope these help!

-Sheila

Hi again,

Can you upload a snippet of the region so we can replicate the issue? Instruction on how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

Thanks, Sheila

• FrancePosts: 13Member

Hi Sheila,

Sorry for the slow response. Thanks for the links. Indeed, the reads in the region have low MQ scores and BQ scores because it is highly repetitive and also polymorphic. I was just curious why UG called it but not HC, but that may have to also do with the specific command line used by the sequencing centre. I have uploaded the snippet file of the bam onto the ftp (prasad.snippet.bam) and the corresponding index file. Please do let me know if you need something else.

Thanks, Megana

Hi Megana,

I just submitted a bug report. I am not sure when this will be fixed, but thank you for sharing this! We are looking for test cases that are verified variants that are not picked up by Haplotype Caller.

I will keep you updated on the status.

-Sheila

• FrancePosts: 13Member

@Geraldine_VdAuwera

I took your suggestion and ran Haplotype caller without the confidence threshold parameters but then tried to use them with the GenotypeGVCF tool. But it is throwing up an error. Here is my command line and the error

java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -stand_emit_conf 10 -stand_call_conf 30 --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-M.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-J.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-H.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-F.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CH.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CA.aligned.sorted.dedup.realign.recal.gvcf' -o '/media/newdrive/REANALYSIS/Plaquettes/plaquettes.rawvariants.vcf'

ERROR ------------------------------------------------------------------------------------------

##### ERROR MESSAGE: Argument with name 'stand_emit_conf' isn't defined.

I checked to make sure that the argument was available in GenotypeGVCFs. Am I missing something obvious here?

Thanks, Megana

• Posts: 360Member, GSA Collaborator ✭✭✭

@mprasad‌ - It's just a version issue. -stand_emit_conf was added to GGVCFs in version 3.2