Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode.

Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer
edited April 10 in Methods and Workflows

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above.

Overview and motivations

This new approach replaces the joint discovery workflow we previously recommended, in which variants were called jointly on multiple samples, with a much smarter approach that reduces the computational burden and solves the "N+1 problem".


This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.


In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).


In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.


This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

Workflow details

1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the following options:

--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
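
For reference, a complete per-sample command would look something like this (a minimal sketch; the reference and file names are placeholders to adapt to your own data):

    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fa \
        -I sample1.bam \
        --emitRefConfidence GVCF \
        --variant_index_type LINEAR \
        --variant_index_parameter 128000 \
        -o sample1.g.vcf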

2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.
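
For example, each batch could be merged with a command along these lines (a sketch; file names are placeholders, and each batch would list ~200 --variant inputs):

    java -jar GenomeAnalysisTK.jar \
        -T CombineGVCFs \
        -R reference.fa \
        --variant sample1.g.vcf \
        --variant sample2.g.vcf \
        --variant sample200.g.vcf \
        -o batch1.g.vcf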

3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
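
For example (a sketch; file names are placeholders):

    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fa \
        --variant batch1.g.vcf \
        --variant batch2.g.vcf \
        -o raw_variants.vcf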

4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
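
For orientation, the SNP pass might look roughly like the sketch below; the resource files, annotations and tranche sensitivity threshold are illustrative only, so check the VQSR documentation for the current recommendations, and repeat the procedure in INDEL mode for indels.

    java -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R reference.fa \
        -input raw_variants.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS \
        -mode SNP \
        -recalFile snps.recal \
        -tranchesFile snps.tranches

    java -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R reference.fa \
        -input raw_variants.vcf \
        -recalFile snps.recal \
        -tranchesFile snps.tranches \
        --ts_filter_level 99.5 \
        -mode SNP \
        -o recalibrated_snps_raw_indels.vcf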

That's it! Fairly simple in practice, but we predict this is going to have a huge impact on how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!


Geraldine Van der Auwera, PhD


Comments

  • blakeoft · Connecticut · Posts: 9 · Member
    edited March 12

    When I run HaplotypeCaller in GVCF mode, my ALT field always contains <NON_REF>. In particular, even when an alt allele is actually detected, <NON_REF> still appears. From what I have read in another post in Methods and Workflows, this is not what is expected.

  • ebanks · Posts: 683 · GATK Developer

    It is expected (and critical to the pipeline's running correctly). That old documentation is outdated and will be updated when Geraldine gets back from vacation.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • blakeoft · Connecticut · Posts: 9 · Member

    Excellent. Thank you for your reply.

  • blueskypy · Posts: 213 · Member

    Regarding step 2. Optional data aggregation step

    I have 400 samples; can I put them all together? What would be the memory requirement for running this step with 400 samples? How long would it take on the Broad's clusters?

    Thanks!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @blueskypy,

    I don't have any numbers on hand; I can ask the devs but my quick answer is no, don't try combining 400 at a time. We always try to minimize the number of intermediate steps, so if we found that 200 gives the best tradeoff on the Broad's arguably pretty hefty cluster, running on twice as many is probably not a good idea unless you have an even heftier one.

    Geraldine Van der Auwera, PhD

  • blueskypy · Posts: 213 · Member

    Regarding the N+1 problem, I guess it can be done in either step 2 (when N is big) or step 3, right?

    Does the calling on N change after the "+1"? If it does, does the calling change in such a way that the previous data analysis on N will mostly still hold?

    Thanks!

  • blueskypy · Posts: 213 · Member
    edited March 17

    @Geraldine_VdAuwera said: Hi blueskypy,

    I don't have any numbers on hand; I can ask the devs but my quick answer is no, don't try combining 400 at a time. We always try to minimize the number of intermediate steps, so if we found that 200 gives the best tradeoff on the Broad's arguably pretty hefty cluster, running on twice as many is probably not a good idea unless you have an even heftier one.

    Hi Geraldine, thanks so much for the help! I hope you are not working during your vacation! So should I split the 400 into two cohorts, run each through CombineGVCFs separately, and then use CombineGVCFs to combine them? If so, would it be quicker if I split them into smaller cohorts, e.g. 50, run each with CombineGVCFs and then use CombineGVCFs to combine them all?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Regarding the N+1 problem, I guess it can be done in either step 2 (when N is big) or step 3, right?

    Does the calling on N change after the "+1"? If it does, does the calling change in such a way that the previous data analysis on N will mostly still hold?

    I'm sorry but I'm not sure what you mean here, can you please clarify?

    I hope you are not working during your vacation!

    Hah, nope, my wife wouldn't allow it :)

    would it be quicker if I split them into smaller cohorts, e.g. 50, run each with CombineGVCFs and then use CombineGVCFs to combine them all?

    I think so, especially if you can run the 50-piece batches in parallel.

    Geraldine Van der Auwera, PhD

  • blueskypy · Posts: 213 · Member
    edited March 17

    OK, regarding the N+1 question, what I meant is the following:

    1. For a cohort of N samples, and for a sample X in the cohort, do the variant calls of sample X change before and after the "+1" step?

    2. After adding the additional sample ("+1"), do I have to redo all the data analyses on the N samples? A specific example: take two subsets of N, N1 and N2, where N1 is disease and N2 is control, and I compare N1 and N2 to identify disease-associated variants. Would the list of candidate variants change before and after the "+1" step? (Assume the sample added by "+1" is not in N1 or N2.)

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Ah, I see.

    1. It could happen that a variant would not be called in sample X with N samples, but would be called with N+1 samples -- that's the effect of multisample calling. The reverse shouldn't happen in principle, but in practice it has been observed as a result of limitations in the HC's ability to process graphs for very large cohorts. The new reference model-based pipeline completely bypasses the latter issue.

    2. In the "classic" pipeline (pre-3.0) yes, you would have to re-call the entire cohort in order to reap the benefits of multisample calling. In the new pipeline, no, you don't need to re-call any samples. You just call the new sample(s), add them to the combined gVCF if necessary, and re-run the joint genotyping step (which is very fast). Both ways yield the same results (with the list of variants before and after +1 being potentially different) but the second is immensely faster.

    Geraldine Van der Auwera, PhD

  • blueskypy · Posts: 213 · Member
    edited March 18

    Hi Geraldine,

    Regarding breaking 400 samples into sub-cohorts of 50 samples each and then combining the eight gVCF files later, I have two questions:

    1. Should each sub-cohort be created randomly, or should each obey the general principles of a cohort, e.g. same sex/race, etc.?

    2. For step 2 of data aggregation, if it's faster to break 400 samples into eight sub-cohorts of 50 each than into two sub-cohorts of 200 each, would it be even faster to break 400 into 16 sub-cohorts of 25 each, since all sub-cohorts can run in parallel? Is there any recommendation on the optimum size of sub-cohorts?

    Thanks!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @blueskypy‌,

    1. Random is fine, the order is not important here as it won't change anything. As long as you run the joint genotyping on all the samples of course.

    2. That's quite possible. We haven't examined the trade-off; our recommendations are geared towards very large cohorts. But if you try out a few settings and make any interesting observations, please do share!

    Geraldine Van der Auwera, PhD

  • blueskypy · Posts: 213 · Member

    I really like the new approach in v3.0; but in terms of calling accuracy, is it similar to v2.8?

  • pdexheimer · Posts: 360 · Member, GSA Collaborator

    My experience is that no, it's not similar to 2.8. It's much better...

  • blueskypy · Posts: 213 · Member
    edited March 18

    For step 2 of data aggregation on N sub-cohorts of M samples each, what's the time/space complexity in terms of M and N? I'm just trying to figure out the fastest way of finishing that step for 400 samples; possible approaches are, for example:

    1. Break 400 into 8 sub-cohorts of 50 each or 16 sub-cohorts of 25 each and run the sub-cohorts in parallel. It seems there isn't a clear answer from the team on which one is better.

    2. This is my new question. In aggregating the sub-cohorts, is it faster to run CombineGVCFs on all of them at once, or to combine two or more sub-cohorts into larger sub-cohorts (so a few such combinations can run in parallel) and then combine those larger sub-cohorts into the final cohort? In summary, which is faster: a one-step or a multi-step combination?

    3. Also, can the data aggregation and joint genotyping steps be run at the chromosome level, so that each chromosome can run in parallel?

    Finally, I just wonder what time and memory it takes to run 200 samples through the data aggregation and joint genotyping steps at the Broad? If it's within two days, I'll just go ahead and run it. But I don't want to later realize that I'll have to wait for 10 days for 400 samples.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Fastest way may be to just go ahead and do it already ;)

    1. If you do the comparison, please share with the class.
    2. If you have a reasonably small number of sub-cohorts (i.e. fewer than ~200) you don't actually need to aggregate them further. It is not necessary to end up with a single gVCF; you just need to make sure you're not running on hundreds of individual files.
    3. Yes, they can be done at chromosome level if you really want to parallelize the heck out of your data (see the sketch below this list). But I'm not convinced it's necessary.
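
    For anyone who wants to try it, the mechanics would be the engine's -L interval argument, one job per contig (a sketch; contig names depend on your reference, and the per-chromosome outputs need to be concatenated afterwards, e.g. with CatVariants):

      # one GenotypeGVCFs job per chromosome, run in parallel
      java -jar GenomeAnalysisTK.jar \
          -T GenotypeGVCFs \
          -R reference.fa \
          -L 20 \
          --variant batch1.g.vcf \
          --variant batch2.g.vcf \
          -o raw_variants.chr20.vcf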

    Geraldine Van der Auwera, PhD

  • blueskypy · Posts: 213 · Member

    Hi Geraldine, sorry for this late addition to my question, but could you give me a hint?

    Finally, I just wonder what time and memory it takes to run 200 samples through the data aggregation and joint genotyping steps at the Broad? If it's within two days, I'll just go ahead and run it. But I don't want to later realize that I'll have to wait for 10 days for 400 samples.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    I honestly have no scale to give you as I haven't run this myself, and I've been away for 3 weeks (working remotely + vacation) while the devs have been doing the large-scale testing. But my understanding is that it's really fast (rumor is someone here decided to run the pipeline on as many samples as possible and got to N=90K very rapidly -- and only stopped because there were no more samples available) so I really doubt it'll even take two days for 400 samples. I'll put in a to-do to document expected runtimes but right now I have a few fires to deal with, so it's low priority, sorry.

    Geraldine Van der Auwera, PhD

  • smk_84 · Posts: 59 · Member

    Can we pass the --variant inputs as a .list file, as we did in the multi-sample calling step with reduced reads before?

    thanks saad

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @smk_84‌,

    Yes, I believe you can pass the --variant inputs as a list file. Just to be clear, you should not run this method (HC in GVCF mode) on reduced BAMs, as it will not work. You need to go back to the unreduced BAMs.

    Geraldine Van der Auwera, PhD

  • patwardh · Posts: 3 · Member
    edited April 1

    Hi,

    Just for clarification: in the new GATK release 3.0/3.1, has multisample (batch) calling been replaced with joint calling? Or is traditional batch calling still an option that can be used in place of joint calling (i.e. by setting a flag)? We are starting to test 3.1 and would like to compare the results from both these workflows.

    thanks

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @patwardh,

    Apologies for the confusion, we are working on updating the documentation to reflect the new recommendations. In a nutshell, we recommend replacing the batch/multisample calling with the new GVCF-based workflow described in this document. But you can absolutely still proceed as previously, no flags needed. It is the new workflow that requires additional flags/arguments.

    Geraldine Van der Auwera, PhD

  • smk_84 · Posts: 59 · Member

    Hi, when using UnifiedGenotyper with the new version GATK 3.0 I am not getting any indels. This is how I am running my command:

    java -Xmx10g -XX:+UseSerialGC -Djava.io.tmpdir=/home/skhan/tmp -jar /home/skhan/bio/GATK_3.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R $reference -I inderealign.list -o $hash_files{'UnifiedGenotyper'}

    Ideally the multi-sample VCF file that is created should also contain indels, but when I try to filter the file for indels I am not getting any. Can you kindly tell me what would be the ideal way of calling indels using UG?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @smk_84 By default the UG outputs only SNPs. You need to set the -glm argument to BOTH to also get indels.
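
    For example, adding it to the command you posted (a sketch):

      java -Xmx10g -jar GenomeAnalysisTK.jar \
          -T UnifiedGenotyper \
          -R $reference \
          -I inderealign.list \
          -glm BOTH \
          -o multisample_snps_indels.vcf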

    Geraldine Van der Auwera, PhD

  • JTeer · Posts: 10 · Member

    First, thanks for including this type of approach in GATK. So far, it works in our hands, and appears to have addressed a long-running issue we've seen with multisample calling on overlapping indels using UnifiedGenotyper. I had a few questions for you:

    1. I noticed that the "DP" FORMAT field gets a dot when a sample is nonreference, but does get a value when reference. Is there any way that field can include a DP value for nonref calls?

    2. The GenotypeGVCFs output contains many header lines describing information that isn't actually included in the file. For example, what appear to be ClinVar fields: CLNACC, CLNALLELE, CLNDBN, etc. I didn't see any documentation describing how to actually get that information added, so these empty headers now essentially pollute the namespace, which interferes with the annotations we add later. Is there any way to omit unused headers?

    3. Any plans to allow GenotypeGVCFs (or other GATK programs) to read compressed VCF files? I thought the BGZIP reading issue in Java had been fixed, and using compressed files would help with IO and space limitations!

    Thanks to all involved for the effort put into this feature.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @JTeer‌, I'm glad to hear this is working for you.

    1. I'll look into the DP question -- I'm not sure what is the desired behavior here.

    2. Those don't sound like any annotations emitted by GATK, is there any chance those were added to one of your VCFs by a different program?

    3. This is already functional. GATK can read gzipped VCFs if they are accompanied by a Tabix index. You can also have GATK output gzipped VCFs; to do so, just add a .gz extension for the output file name. The one problem right now is a known bug in the indexing, which is producing a regular .idx file instead of a Tabix .tbi index. We have a fix for this internally which will be in the next version (3.2).
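
    For example (a sketch with placeholder file names):

      # a .gz extension makes GATK write bgzipped output directly
      java -jar GenomeAnalysisTK.jar \
          -T GenotypeGVCFs \
          -R reference.fa \
          --variant sample1.g.vcf \
          -o output.vcf.gz

      # until the indexing fix ships, the Tabix index can be rebuilt by hand
      tabix -p vcf output.vcf.gz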

    Geraldine Van der Auwera, PhD

  • JTeer · Posts: 10 · Member

    Thanks for the response! I forgot to mention I'm using version 3.1-1.

    1. If you're asking me, I would want to see the DP value reflect the total depth for all genotypes, reference and nonref.

    2. I don't think the unused VCF header annotations are added by our pipeline: I see them in the file directly generated by GenotypeGVCFs (but not in the GVCFs generated by HaplotypeCaller). I wonder if VariantAnnotator is somehow adding them? My command is below; I'll likely trim the command, as I'm not sure -A Coverage and -A StrandBiasBySample are actually doing anything (I don't see SB FORMAT fields)...

      java \
          -Xmx${MEM}                                                         \
          -Djava.io.tmpdir=${JAVA_TMPDIR}                                    \
          -jar ${GATK}                                                       \
          -T GenotypeGVCFs                                                   \
          -R ${REF_SEQ}                                                      \
          -A Coverage                                                        \
          -A FisherStrand                                                    \
          -A StrandBiasBySample                                              \
          -D $SNP_DBSNP                                                      \
          -o ${SMPL_NAME}.vcf                                                \
          -nt $PROCS                                                         \
          -V samples.vcf.list
    3. Great, thank you for the heads up!

  • QiangHu · Posts: 3 · Member

    Hi Geraldine,

    We are working with 48 exome sequencing samples on an Intel Xeon 16-core node. The CombineGVCFs step took about 40 hours and the GenotypeGVCFs step took about 60 hours (with -nt 16). Is this normal?

    Thanks!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @QiangHu, I can't really comment on individual performance/speed as it depends so much on your platform. For what it's worth this sounds reasonable to me.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Although I should add that you don't need to run CombineGVCFs if you only have 48 exomes, you can just skip that step.

    Geraldine Van der Auwera, PhD

  • QiangHu · Posts: 3 · Member

    Thanks @Geraldine_VdAuwera‌! We plan to do the GenotypeGVCFs at chromosome level in parallel. Hope it can save some time.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @JTeer, sorry for the delay -- I needed to check some things with the devs. What you observe for DP and the CLN... annotations is definitely not expected. Could you share a snippet of data that reproduces both issues that we can test locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • JTeer · Posts: 10 · Member

    @Geraldine_VdAuwera, sorry for my delay as well - too much else going on! Requested test case has been uploaded as ref_based_testing_OK1.tgz (please delete the other empty failed attempts.)

  • smk_84 · Posts: 59 · Member

    Hi,

    I am running GenotypeGVCFs on each chromosome in parallel and then trying to combine the outputs using CombineGVCFs. But whenever I try to use these commands I get the following error at the CombineGVCFs step.

    Here is the order of the commands I am using:

    -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L Chr01 -R Gmax_275_v2.0.fa -I HN001_indel_realigned.bam -o HN001_Chr01.vcf

    -T GenotypeGVCFs -R Gmax_275_v2.0.fa -o GVCF_Chr01.vcf -L Chr01 --variant HN001_Chr01.vcf --variant HN002_Chr01.vcf --variant HN003_Chr01.vcf --variant HN004_Chr01.vcf --variant HN005_Chr01.vcf

    -T CombineGVCFs -R Gmax_275_v2.0.fa -o GVCF.vcf --variant GVCF_Chr01.vcf --variant GVCF_Chr02.vcf --variant GVCF_Chr03.vcf --variant GVCF_Chr04.vcf --variant GVCF_Chr05.vcf --variant GVCF_Chr06.vcf --variant GVCF_Chr07.vcf --variant GVCF_Chr08.vcf --variant GVCF_Chr09.vcf --variant GVCF_Chr10.vcf --variant GVCF_Chr11.vcf --variant GVCF_Chr12.vcf --variant GVCF_Chr13.vcf --variant GVCF_Chr14.vcf --variant GVCF_Chr15.vcf --variant GVCF_Chr16.vcf --variant GVCF_Chr17.vcf --variant GVCF_Chr18.vcf --variant GVCF_Chr19.vcf --variant GVCF_Chr20.vcf

    Here is the error I am getting:

    ERROR MESSAGE: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 285; please use the Haplotype Caller with gVCF output to generate appropriate records

    What do you think I am doing wrong?

  • FabriceBesnard · Paris · Posts: 23 · Member
    edited April 14

    Hi,

    In the workflow details, you advise using HC with the following parameters:

    --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

    However, in the HaplotypeCaller documentation, I couldn't find any parameters like --variant_index_type or --variant_index_parameter.

    Addendum: I found in the release notes for version 2.8 that you need to add these parameters for the GVCF mode of HC to work well. I'm running version v3.1-1-g07a4bf8. Does this still hold true?

    Thanks! Fabrice

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @smk_84,

    GenotypeGVCFs produces a regular VCF, which cannot be used later by other tools that have "GVCF" in the name. So either you need to run CombineGVCFs before running GenotypeGVCFs, or use CombineVariants on the output VCFs. Make sense?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @FabriceBesnard,

    Those arguments belong to the GATK engine, not to the HaplotypeCaller itself, so you need to look for them here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html

    You should still use them with version 3.1, yes. They are meant to help with speed and file size, so they will not affect the results in terms of calculations. They just affect performance.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @JTeer‌, thanks for the files.

    I've been trying to reproduce your annotations issue without success. I used all the same command lines on your data but I'm not seeing the "orphaned annotations" you see in the header of your TEST2 file. The only explanation I can come up with is that somehow your files were re-headered or concatenated with a vcf file containing these annotations.

    I do however see the missing genotype DP values you mentioned and will look into that.

    Geraldine Van der Auwera, PhD

  • sbourgeois · London, UK · Posts: 5 · Member

    Hi Geraldine,

    sorry, I'm still a bit confused about the bits of information I find here and there in the docs and on the blog.

    First of all, as I'm running GATK on Intel Xeons, is it OK to use java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller ... to improve performance? I've only seen that command on the blog, nothing in the docs.

    I am converting my pipeline from GATK 2.8.1 to 3.1, which, if I'm not mistaken, means splitting the variant calling step into two steps: first step with haplotype caller in gvcf mode on each bam file, and the second one with GenotypeGVCFs on all gvcf files together.

    Could you please tell me if my code is fine for these steps, or if I have misunderstood something?

    Step 1:

    java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -nct 4 -I sample1.bam -R Homo_sapiens.GRCh37.73.dna.fa --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample1.gvcf

    Step 2:

    java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T GenotypeGVCFs -nct 4 -I sample1.gvcf -I sample2.gvcf ... -gt_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o variants.raw.vcf

    Thanks in advance,

    Stephane

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi Stephane,

    Almost, but a few quick corrections (put together in the sketch below the list):

    • -pairHMM VECTOR_LOGLESS_CACHING only applies to HaplotypeCaller
    • -gt_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 all belong to HC too
    • You should use .vcf as the extension even for gVCFs, otherwise the program will throw an error. Many people use .g.vcf for clarity, which is accepted.
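
    Put together, the corrected steps would look roughly like this (a sketch; note also that GenotypeGVCFs takes its inputs via --variant rather than -I, and is multithreaded with -nt rather than -nct):

      java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -nct 4 \
          -I sample1.bam -R Homo_sapiens.GRCh37.73.dna.fa \
          --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
          -gt_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 \
          -o sample1.g.vcf

      java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 4 \
          -R Homo_sapiens.GRCh37.73.dna.fa \
          --variant sample1.g.vcf --variant sample2.g.vcf \
          -o variants.raw.vcf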

    Geraldine Van der Auwera, PhD

  • JTeer · Posts: 10 · Member

    @Geraldine_VdAuwera, thanks for looking into this. I've gone back and dug in a bit more, and have determined that these fields are coming from the "-D dbsnp_138.b37.vcf.gz" option: these headers are in the gatk_bundle_2.8 dbsnp file (from the public ftp site), and although HaplotypeCaller ignores them, GenotypeGVCFs pulls the entire dbsnp file header in. Since the -D option indicates that the file is only used for populating the ID column, I am thinking the header should not be pulled in (my humble opinion). I'd be interested to hear the GATK team's and others' thoughts on this. Thanks!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @JTeer Darn, I did omit the dbsnp from my test. Completely agree that we should not be pulling that header stuff in! Will put in a bugfix request. Thanks for pointing this out!

    Geraldine Van der Auwera, PhD

  • knho · Indiana, USA · Posts: 22 · Member

    Hi Geraldine, I want to make sure whether I can use reduced BAM files for the variant calling (HC in gVCF mode) of each sample in this new version.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @knho‌ No, as noted in the release notes and version highlights documentation, you cannot use reduced BAM files with any tools in GATK 3.0. You should use the unreduced files for analysis.

    Geraldine Van der Auwera, PhD

  • crojo · California · Posts: 8 · Member

    Hi Geraldine,

    Thanks for the information. I noticed the Best Practices Documentation no longer includes an overview of using UnifiedGenotyper to make calls. Although there is still info in the Tool Documentation on how to use UG, I wanted to ask if anything has changed concerning the recommendations for using UG versus HC (again, given the Best Practices no longer outline UG usage)? I have over 100 samples to call at one time and based on the recommendations I would use UG but just wanted to make sure this was still recommended. Thank you for your time!

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @crojo,

    We removed the UG from the Best Practices because HaplotypeCaller is now better in all respects, and we wanted to make it clear that people should use only HC, unless their organism is non-diploid or their dataset is from pooled DNA. If the latter is your case, you can use UG as previously. If not, you should switch to HC, and apply the new workflow to call variants per sample in GVCF mode, then genotype them together as described in this document.

    Geraldine Van der Auwera, PhD

  • ano1986 · Belgium · Posts: 1 · Member

    Hi Geraldine,

    So you create a .g.vcf for each sample separately with HaplotypeCaller and then you run -T GenotypeGVCFs on all the .g.vcfs together. But what is the output then? One .vcf with genotypes for all the samples? Or again one .vcf for each sample?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @ano1986 The output of GenotypeGVCFs is a multisample VCF file with genotypes for all your samples.

    Geraldine Van der Auwera, PhD

  • Zaag · Posts: 12 · Member

    @Geraldine_VdAuwera said: I do however see the missing genotype DP values you mentioned and will look into that.

    Is there any news on this? I see it only with 0/0 genotypes.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @Zaag No news yet but the developers are aware of the problem. It's part of a wider scope that they're working on right now.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @JTeer We have a fix for the dbsnp headers bug you reported a few days ago. It should be available in the nightly builds once it's been through code review, within a day or two.

    Geraldine Van der Auwera, PhD

  • JTeer · Posts: 10 · Member

    @Geraldine_VdAuwera: Great, thanks for looking into this and the other issues!

  • crojo · California · Posts: 8 · Member

    Hi Geraldine,

    I notice the Best Practices section of the website is under development. Perhaps a silly question, but does the document on this page ("Calling variants on cohorts of samples using HC in GVCF mode"), along with the documentation on the various downstream tools (CombineGVCFs, GenotypeGVCFs, etc.), suffice to call variants at this point? Or would you recommend waiting to proceed until the different, or more specific, recommendations you might be working on are included in the Best Practices documentation?

    As always, many thanks.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @crojo,

    That's a fair question, no worries. Yes, this is sufficient to call variants with the new workflow. The new developments in the Best Practices aim to explain in more detail the whys and hows of the process, but there will not be any differences in the recommendations (for this version of the GATK).

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member

    @Geraldine_VdAuwera Uhhh, when will 3.2 be released?

    @JTeer You can do tabix -p vcf foo.vcf.gz for the time being.

    @Geraldine_VdAuwera said: 3. This is already functional. GATK can read gzipped VCFs if they are accompanied by a Tabix index. You can also have GATK output gzipped VCFs; to do so, just add a .gz extension for the output file name. The one problem right now is a known bug in the indexing, which is producing a regular .idx file instead of a Tabix .tbi index. We have a fix for this internally which will be in the next version (3.2).

    @JTeer said: 3. Any plans to allow GenotypeGVCFs (or other GATK programs) to read compressed VCF files? I thought the BGZIP reading issue in Java had been fixed, and using compressed files would help with IO and space limitations!

    @Geraldine_VdAuwera I am interested in the sensitivity/specificity of UG and HC on high and low coverage samples. Has such a comparison of UG and HC been carried out? Can you point me to it, if it exists? Thanks!

    @Geraldine_VdAuwera said: We removed the UG from the Best Practices because HaplotypeCaller is now better in all respects

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi Tommy,

    The 3.2 release is not yet scheduled, so it will be at least another few weeks, possibly more.

    We are planning to write a paper on HC that will include UG vs HC comparisons. In the meantime I'll try to put together a blog post with the main takeaways.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Eh, why wait -- I've posted the draft of the HaplotypeCaller method article; there are still a few missing pieces but it should already help answer quite a lot of questions. See http://www.broadinstitute.org/gatk/guide/article?id=4148

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member

    @Geraldine_VdAuwera‌ Thanks! If I could only vote for employee of the month at the Broad ;) I have a few additional questions. I am most interested in receiving an answer to the first one, in case you need to prioritize your time.

    I) Have the recommendations for the HC arguments --stand_call_conf and --stand_emit_conf changed after the introduction of the new workflow? Previously I have read: "A common question is the confidence score threshold to use for variant detection. We recommend: Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30. Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise."

    II) This document reads: "If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF."

    @Geraldine_VdAuwera said:

    1. If you have a reasonably small number of sub-cohorts (i.e. fewer than ~200) you don't actually need to aggregate them further. It is not necessary to end up with a single gVCF; you just need to make sure you're not running on hundreds of individual files.

    Is it OK for me to merge 2000 gVCF files into 10 with CombineGVCFs and then proceed with GenotypeGVCFs on those 10 gVCFs? Is it recommended to merge batches of ~200 gVCFs merely for computational reasons? If it is computationally feasible to merge all 2000, which only cover 10Mbp each, do you then advise against it?

    III) Is it necessary to compute annotations with HC if those annotations are recalculated by GenotypeGVCFs anyway?

    IV) Does GenotypeGVCFs handle triallelic SNPs? I guess I'll know the answer to this question, once I start testing it.

    V) The inherited --variant_index_parameter argument value of 128000 mentioned in this document seems very much like a magic number. Is there an explanation of 128000 somewhere? It's not even a multiple of 42... Oh well, I'll happily use it with blissful ignorance :)

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Aw, you flatter me, @tommycarstensen‌! For the record the new HC doc was primarily written by my new team member, @Sheila‌, with invaluable expert contributions from @vruano‌. Credit where credit's due :)

    So, here goes.

    1. No formal changes to the confidence threshold recommendations (at least nothing the methods devs have told me). Actually I'll have to check what is the interaction with the reference model and explain exactly how that works in the ref model document (which is next on our plate).

    2. Purely computational reasons -- it's a lot of open files to deal with. I'll have to check what is the best combining strategy; probably would be helpful to provide some kind of guideline/table on this point, huh. But in principle, merging to 10 should be fine (no need to generate a single final one). That said we had a couple of annoying bugs that came up when people did rounds of merging. They're fixed in the nightlies so use that if you want to save yourself some hassle. I'll see if we can release a bugfix version sometime soon for those who don't enjoy the cutting edge of development versions.

    3. Depends which annotations you mean -- the core annotations are necessary because they're what GenotypeGVCFs uses as basis for recalculations. Others aren't needed, that's correct. As for additional annotations for which we need to see the data, you have to either get them through HC or afterward through VariantAnnotator.

    4. Yes, GGVCFs can handle any number of alleles.

    5. It is either a magic number, or 42 multiplied by 3000, adjusted for inflation. Honestly I'm not sure -- that was part of @thibault's mad engineering scheme to make GVCFs reasonably efficient. He may be able/willing to comment... but AFAICT it's an arbitrary number that worked well in testing.

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member
    edited May 17

    Thanks for the answers @Geraldine_VdAuwera! I am curious to know why HC 3.1 with --emitRefConfidence GVCF outputs zero-quality SNPs like the one below. Is that the expected behaviour?

      #CHROM  POS     ID         REF  ALT          QUAL  FILTER  INFO                                              FORMAT          xxx
      1       937796  rs9777993  T    A,<NON_REF>  0     .       DB;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=70.00;MQ0=0  GT:AD:GQ:PL:SB  0/0:1,0,0:3:0,3,59,3,59,59:0,0,0,0

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hmm, is this with the default emit/call confidence thresholds, or are you overriding them in your command line? The HC is meant to be greedy but this seems a little much.

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member

    I am using -stand_call_conf 10 and -stand_emit_conf 10 instead of the default of 30.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Oh, actually the HaplotypeCaller ignores the conf thresholds (sets them to 0) in GVCF mode -- I'll make sure to document this. So what happens here is that the HC greedily assigns QUAL != . (i.e. QUAL >= 0) for all lines where <NON_REF> is not the only ALT allele, and forces genotype calls everywhere.

    Geraldine Van der Auwera, PhD

  • Jack · Posts: 36 · Member

    Hi there. I used HaplotypeCaller to generate gVCF files and then applied joint genotyping on these files; however, in the genotyping step GATK complained that "Line 55597163: there aren't enough columns for line Total compute time in PairHMM computeLikelihoods() : 0.0 (we expected 9 tokens, and saw 1 )", and I could not go on. Can someone kindly point out what's wrong with my commands? Below are my command arguments. For generating gVCFs:

    java -jar GenomeAnalysisTK.jar
        -R ref.fa
        -T HaplotypeCaller
        -I sample.recal.bam
        -o sample.gvcf
        --emitRefConfidence GVCF
        --variant_index_type LINEAR
        --variant_index_parameter 128000
        -nct 3

    For genotyping:

    java -jar GenomeAnalysisTK.jar
        -R ref.fa
        -T GenotypeGVCFs
        --variant sample1.gvcf
        --variant sample2.gvcf
        -o all.sample.vcf
        -nt 3
  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @Jack, the problem is that you're using .gvcf as the extension for your output, which is confusing the parser in the next step. Only .vcf is a valid extension. If you want to signify explicitly in the filename that the files are gVCFs, I recommend using .g.vcf, which will be correctly identified by the parser.

    Geraldine Van der Auwera, PhD

  • Jack · Posts: 36 · Member

    Hi @Geraldine, I changed the extension of my gVCF files to .g.vcf and reran GenotypeGVCFs, but still got the same error. Should I rerun the HaplotypeCaller walker to regenerate the gVCF files?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hmm, now that I read your error message more carefully I think this is another problem. This looks like the log output is getting mixed in with the VCF output. Are you using pipes at all?

    Geraldine Van der Auwera, PhD

  • Jack · Posts: 36 · Member

    No, @Geraldine, I did not use pipes. I ran the two steps separately.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hmm, this is not a good sign. Can you try running again without the -nt/-nct arguments?

    Geraldine Van der Auwera, PhD

  • MikeK · Hamilton NZ · Posts: 3 · Member

    When I generate the gVCF from a single-sample BAM file, would it be appropriate to use a much lower value for the parameters -max_alternate_alleles and -maxNumHaplotypesInPopulation? For a diploid sample I came up with 3: two for the diploid and one for the reference (i.e. three alleles to cover A in the ref, and B and C in the sample). I'd like to use lower values for these parameters so as to minimise the CPU time required. I'd be interested in any experiences or recommendations.

  • erikt · Posts: 7 · Member

    Hello,

    I have some questions about the joint genotyping step. The data I am working with is as follows: I have WES data for 24 samples from a single HiSeq run. 3 of the samples are singletons, and the remaining 21 form trios. Further, 2 of the trios were run again on two lanes of another HiSeq run. I have followed the Best Practices using the latest version of GATK (first per lane, then per sample), through generating gVCF files for each sample. Where I get a little lost is at the joint stage. Should I run GenotypeGVCFs separately on each trio, on all the gVCFs from a given HiSeq run, or on all the gVCFs from both runs? Regarding the 6 samples that were rerun, should I merge these with the data from the first run, and if so at what stage? It would be very convenient if I could keep them separate, as this would give me the 30 exomes that I need for VQSR, but I don't know if this is advisable.

    A lot of questions, I know, but I would very much appreciate any help/guidance.

    Thanks, Erik

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @erikt,

    You can run GenotypeGVCFs on all of your exomes together, including the technical replicates. You'll just need to make sure that the replicates have different SM tags that identify them as separate samples, otherwise they'll get lumped together.
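
    If you need to check or fix the SM tags, something along these lines should work (a sketch; AddOrReplaceReadGroups is the Picard tool for rewriting read groups, and all the names below are placeholders):

      # inspect the sample names recorded in the read groups
      samtools view -H replicate2.bam | grep '^@RG'

      # rewrite the read group with a distinct sample name if needed
      java -jar AddOrReplaceReadGroups.jar \
          I=replicate2.bam O=replicate2.rg.bam \
          RGID=run2.lane1 RGLB=lib1 RGPL=illumina RGPU=run2.lane1.unit1 \
          RGSM=sampleA_replicate2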

    Geraldine Van der Auwera, PhD

  • byb121 · UK · Posts: 23 · Member

    @Geraldine_VdAuwera said: Hmm, this is not a good sign. Can you try running again without the -nt/-nct arguments?

    As I recall, HC does not support -nt/-nct and only parallelizes via scatter-gather; has that changed?

    Thanks

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @byb121‌ The HC supports -nct but quite a few people have run into issues with it so we're trying to move away from that. We find that data-parallelizing HC jobs using something like Queue is more robust and saves more time overall.

    Geraldine Van der Auwera, PhD

  • Hasani · Germany · Posts: 12 · Member

    @Geraldine_VdAuwera said: Hi FabriceBesnard,

    Those arguments belong to the GATK engine, not to the HaplotypeCaller itself, so you need to look for them here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html

    You should still use them with version 3.1, yes. They are meant to help with speed and file size, so they will not affect the results in terms of calculations. They just affect performance.

    Sorry, but the link is broken, and reading the answers did not help "me" understand the story behind --variant_index_type and --variant_index_parameter.

    Would you please clarify why both are so important and how to choose their values?

    Thank you in advance!

    H.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @Hasani There is really no story -- those parameters are basically a technical hack to set the right indexing scheme for GVCFs. We provide the exact values you should use in the document above (and in an error message triggered when they're not supplied on the command line). You should not choose values yourself.

    Geraldine Van der Auwera, PhD

  • catherinelpy · Malaysia · Posts: 1 · Member

    Hi,

    I am using GATK 3.0 to run variant calling to detect mutations in my RNA-seq data. But the mutations that I found using GATK 2.0, calling variants on each sample one by one, were not found in the variant calling results from GATK 3.0.

    What is the difference between running the variant calling process one sample at a time and calling variants as a group? In addition, why are some mutations missed when using GATK 3.0?

    Any help is greatly appreciated. Thank you.

  • Sheila · Broad Institute · Posts: 540 · Member, GATK Developer, Broadie, Moderator

    @catherinelpy

    Hi,

    Can you give us a step-by-step explanation of what you did in each case? Please provide the exact command lines you used.

    Thank you.

    Sheila

  • mbookman · Posts: 2 · Member

    We are interested in producing a gVCF file for the recalibrated variants. The Workflow details above indicate how to produce a gVCF with HaplotypeCaller, but then the input to and output from VQSR is no longer a gVCF.

    What is the prescribed way of using GATK 3.x to produce a recalibrated variant file with the reference calls?

  • Sheila · Broad Institute · Posts: 540 · Member, GATK Developer, Broadie, Moderator
    edited August 22

    @mbookman

    Hello,

    Assuming you have one or more gVCF files after running HaplotypeCaller, you can use GenotypeGVCFs with --includeNonVariantSites. This will include all sites, including non-variant sites. Please read about this here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html#--includeNonVariantSites
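
    For example (a sketch with placeholder file names):

      java -jar GenomeAnalysisTK.jar \
          -T GenotypeGVCFs \
          -R reference.fa \
          --variant sample1.g.vcf \
          --includeNonVariantSites \
          -o all_sites.vcf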

    Then, you can run the file through VQSR and it will retain all the sites when applying the recalibration.

    I hope this helps.

    -Sheila

  • mbookman · Posts: 2 · Member
    edited August 26

    Hi Sheila,

    Sorry, I should have mentioned that we tested with --includeNonVariantSites. The downside to --includeNonVariantSites is that the reference sites are not grouped as they are in the gVCF output from HaplotypeCaller. A distinct record is emitted for every individual site. This explodes the size of the file and, in the small bit of testing done so far, it tripled the downstream processing time.

    On a relatively small example (a subset of chromosomes 1-3), we started with a 1.6G gVCF produced by HaplotypeCaller. If I run that through GenotypeGVCFs without the flag, the output file is 170M. If I run it with the flag, the output file is 25G. Of course the file compresses down significantly, but the concern is the cost of downstream processing of so many individual records that otherwise could be processed in groups.

    To be clear, what I am referencing is that the output from HaplotypeCaller has POS --> END groupings of positions like:

    #CHROM  POS     ID      REF     ALT             QUAL    FILTER  INFO            FORMAT                  142833
    1       1       .       N       <NON_REF>       .       .       END=10046       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    1       10047   .       C       <NON_REF>       .       .       END=10056       GT:DP:GQ:MIN_DP:PL      0/0:51:12:49:0,9,135
    1       10057   .       A       <NON_REF>       .       .       END=10057       GT:DP:GQ:MIN_DP:PL      0/0:54:3:54:0,4,1435
    ...

    while the output from GenotypeGVCFs --includeNonVariantSites looks like:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  142833
    1       1       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       2       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       3       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       4       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       5       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       6       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    1       7       .       N       .       .       .       NCC=1   GT:DP   0/0:0
    ...
    

    Is there a way to get the grouped positions from GenotypeGVCFs?

    Thanks.

  • Sheila · Broad Institute · Posts: 540 · Member, GATK Developer, Broadie, Moderator

    @mbookman

    Hi,

    Unfortunately, there is no way to do what you are asking in the current version of GATK.

    -Sheila

  • knho · Indiana, USA · Posts: 22 · Member

    I had several Java-related issues when I ran HaplotypeCaller with the -nct option. I heard that a group converted the Java-based GATK into a C++-based GATK and compared the running times. Is there any way to get the C++-based GATK version?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @knho If you mean the Gamgee project which I mentioned recently in another thread, it's fully open-source and you're welcome to use it, but be aware that it's not a full re-implementation of GATK. Rather, it's a re-implementation of the utility functions in the core framework of GATK, plus various software libraries like Picard and htslib. Some GATK tools will be reimplemented to work on top of that, and we have some new tools in development based on it as well, but it doesn't replace the existing GATK.

    Geraldine Van der Auwera, PhD

  • pdexheimer · Posts: 360 · Member, GSA Collaborator

    @knho might be thinking of the Intel PairHMM work, which is included in the current version of GATK. But again, this is a single utility function within HaplotypeCaller (albeit which contributes heavily to runtime)

  • knho · Indiana, USA · Posts: 22 · Member

    When we built the PairHMM source, we got an error message: Could not find "build.xml"

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @knho, we are trying to address your issue in the other thread where you posted about this, but you haven't answered my questions there.

    Geraldine Van der Auwera, PhD

  • Giusy · Posts: 9 · Member
    edited September 8

    Hi Geraldine! In a previous comment you suggested that another user include the -glm argument to call indels with UG:

    @Geraldine_VdAuwera said: smk_84 By default the UG outputs only SNPs. You need to set the -glm argument to BOTH to also get indels.

    Is it necessary to add a specific argument for indels with HC as well? This thought occurred to me since I tried to look for indels in my VCF file but they seem to be completely absent. I'm currently using version 3.2-2. Thank you!

  • tommycarstensen · Posts: 112 · Member
    edited September 8

    @Giusy I am not using the -glm argument and my INDELs are being called just fine with version 3.2-2 of HC.

  • Giusy · Posts: 9 · Member

    Thank you, @tommycarstensen! You really cheer me up; it means the problem is just that maybe I'm not looking at the right thing in IGV, but the indels are there. Can you suggest a way to check properly whether I have indels in my VCF?

  • tommycarstensen · Posts: 112 · Member

    @Giusy try this:

      zcat my.vcf.gz | grep -v ^# | awk '{REF=$4; ALT=$5; split(ALT,aALT,","); if(length(REF)>1) {print $1,$2,$3,$4,$5; exit}; for(i in aALT) {if(length(aALT[i])>1) {print $1,$2,$3,$4,$5; exit}}}'

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @Giusy No special arguments are necessary for HC to call indels; in fact it is impossible to turn off indel calling because HC calls everything together.

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member

    In the current VariantRecalibrator example the annotations MQRankSum, ReadPosRankSum, HaplotypeScore and others are used: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.html http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_ReadPosRankSumTest.html http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_HaplotypeScore.html

    The current default GenotypeGVCF annotations are [InbreedingCoeff, FisherStrand, QualByDepth, ChromosomeCounts, GenotypeSummaries].

    Is there a current recommendation with regard to which annotations to use for filtering variants from low and high coverage human sequence data? Will the VR step be inferior if leaving out the annotations HaplotypeScore, MQRankSum, ReadPosRankSum?

    P.S. Is the annotation NCC documented somewhere?

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    Hi @tommycarstensen‌,

    The usual VQSR recommendations apply, see http://www.broadinstitute.org/gatk/guide/article?id=1259

    I'm not familiar with NCC, are you sure that's one of ours? It could be something that was recently added that I missed somehow, but it doesn't ring a bell. What does the VCF header entry say?

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member
    edited September 10

    @Geraldine_VdAuwera Impressive that I managed to ask two trivial/stupid questions in one post, when I should have looked up the answers myself. I checked the header:

    ##INFO=<ID=NCC,Number=1,Type=Integer,Description="Number of no-called samples">

    I ran GenotypeGVCFs with the default settings. I am running it again to include the annotations MQRankSum and ReadPosRankSum; easier for me than running VariantAnnotator.

    Thanks for sending the VQSR recommendation link though. I noticed a few minor changes to the recommended use of VR and AR, which I wasn't aware of.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    LOL, no worries, sometimes it's nice to get a really easy question for a change -- as long as you're not a repeat offender :)

    Plus, this has made me aware of NCC as part of a new-ish annotation, Genotype Summaries, so that's a good thing. See http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_GenotypeSummaries.html for the official (albeit tragically terse and unsurprisingly unhelpful) tool doc for Genotype Summaries.

    FYI we have an on-and-off project to beef up the annotation docs, which keeps getting deprioritized but will eventually get done.

    Geraldine Van der Auwera, PhD

  • tommycarstensen · Posts: 112 · Member

    I went celebrating too fast. I added --annotation MQRankSum and --annotation ReadPosRankSum to my GenotypeGVCFs command and then got this:

    ##### ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Annotation MQRankSum was not found; please check that you have specified the annotation name correctly

    The annotations are definitely present in all my CombineGVCFs output files, so I don't know what is going on.

    Problem solved. If I use --annotation MappingQualityRankSumTest and --annotation ReadPosRankSumTest, then it works and the annotations MQRankSum and ReadPosRankSum are added to the GenotypeGVCFs output. I thought I would share this with others who are confused by abbreviated annotation names. Thanks.

    @Geraldine_VdAuwera additional documentation is a great and much appreciated Christmas present. Looking forward to even more great documentation snowing on us all later this year.

  • Geraldine_VdAuwera · Posts: 6,423 · Administrator, GATK Developer

    @tommycarstensen‌ Yes the confusion on when to use abbreviated vs full names for the annotations is an ongoing cause of grief for lots of people. This will also be tackled in the annotation docs, when they come out... I'm hoping we can get that done before the first snowfall, but failing that, Christmas sounds about right :)

    So much in the documentation backlog, and the devs keep coming out with new tools & features... Oh well, job security is nice :)

    Geraldine Van der Auwera, PhD

  • knho · Indiana, USA · Posts: 22 · Member

    I ran HaplotypeCaller using the option "-pairHMM VECTOR_LOGLESS_CACHING"; however, the running time is almost the same as without the option. Please let me know whether I can expect some speed-up when I use "-pairHMM VECTOR_LOGLESS_CACHING" in HaplotypeCaller.
