The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Calling only some positions

d_nid_ni Posts: 17


I'm trying to use GATK to validate reproducibility of the NGS part. We run the same sample 3 times on a MiSeq and I have a list of positions and I want to see if GATK has the same calls in all samples or not.

I'm trying to do it with this command:

RUNX.bam -> input bam files
POSSITIONS.vcf -> vcf with 5035 different positions

(GATK v2.5-2-gf57256b)
java - jar GenomeAnalysisTK -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles POSSITIONS.vcf -out_mode EMIT_ALL_SITES

I expected that my result have same amount of positions (calls), but my result only have ~2200


Best Answer


  • d_nid_ni Posts: 17

    Ok, I think I have to explain it better...

    I have a vcf file with a ~5000 snp (no indels) from microarray data and 3 bam files (MiSeq) of the same sample as the microarray vcf.
    What I want to do is compare this data from this vcf (microarray) with the data from the MiSeq with GATK, I hope to have a vcf file result from GATK with same positions as the microarray vcf and the call for each one of the bam files to compare with the vcf from microarray.

    Is that possible with GATK? I tough that "--genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles" are to do that, but as I wrote above, result vcf is a subset of the original one.

    All comments are welcome.

    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAPosts: 11,743 admin

    Hi there,

    Thanks for clarifying, that helps.

    Genotype given alleles mode is not meant for what you are trying to do. In your case you should perform variant calling normally (in the default discovery mode) then compare the resulting variant callsets using VariantEval and Genotype Concordance.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17

    Thanks for your reply!

    I'm not sure if it'll be the result that I expected, I think that I can explain better with an example

    If I have on my microarray vcf this line:

    chr1 5926507 rs555164 T C 100 PASS DB GT 0|0

    But this base doesn't have data on my GATK VCF result because there are no variants on it or I have poor coverage, how I could realize if it's because there is no variant (as on the microarray vcf) or it's because of poor coverage/gap?

    In other words, is there an option to call only a list of positions and get the full result?

    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAPosts: 11,743 admin

    Ah, got it. To get that you can pass your VCF file with the -L argument (so it will be used as a list of positions to call) and the --output_mode set to EMIT_ALL_SITES.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17


    I tried it, but my result vcf only has ~500 calls instead of ~5000 that the microarray vcf has, any idea of why?

    GenomeAnalysisTK -Xmx10g -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES
    GATK version:

  • d_nid_ni Posts: 17

    Hi Geraldine,

    I was taking a look on the result of the command that you told me and I'm not sure how to interpret it... as I told you I only have ~500 calls but my vcf "reference" from iscan has ~5000 can I understand that the calls that aren't on the result vcf are the same as reference? what happen if I have no coverage in all my samples on one of the positions of the iscan vcf?

    In other hand, I was trying to use ENOTYPE_GIVEN_ALLELES because I was in your last workshop and one of your colleagues told that this option can be used to "validate custom captures" and that it's what I'm trying to do, do you have more info about it?

    Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAPosts: 11,743 admin

    Hmm, that's odd because with the -output_mode EMIT_ALL_SITES you should get an output line for every single position that is in your "reference" vcf file that you pass in as intervals file. If this is not working for you, you can upload test files to our FTP server and we will try to reproduce the problem locally. Instructions are here:

    We don't have dedicated documentation for GENOTYPE_GIVEN_ALLELES, but in short it tells the caller "don't look at all possible alleles, just look at the alleles in this VCF and tell me which of those my samples have". So it is primarily meant for diagnostic applications (eg checking if a patient sample has a given allele) and it can be useful for certain validation experiments, that's true. I guess it depends exactly what you're trying to achieve; if you're trying to see whether your sequencing runs can consistently identify specific alleles, then GGA is appropriate for you. If you want to compare the congruence of your results in general it is too limiting. I was thinking more of the latter but if the former is what you want then it does make sense to use GGA.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17


    Thanks again, this forum is really useful!

    So I solved it, but I think that you have a bug on GATK, because when I use:
    GenomeAnalysisTK -Xmx10g -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES
    I get ~500 snps

    But if I run it with UnifiedGenotyper:
    GenomeAnalysisTK -Xmx10g -T UnifiedGenotyper -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES
    I get ~5000 snps!

    So it seems that this option (--output_mode EMIT_ALL_SITES) doesn't work in haplotypecaller


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAPosts: 11,743 admin

    Hmm, I'm glad your problem is solved (and you can probably try out the GGA mode too now) but what you're seeing with HC is a little worrying. Could you maybe send us some test files that reproduce the issue so we can try to debug this locally? Instructions are here:

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17

    Hi Geraldine,

    I tested it with the mini-bundle that I downloaded for the workshop, and I have the same issue... I can paste the commands that I used, I think that is not a problem with my data, I think that it's a bug for this set of options:

    java -jar ../GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T UnifiedGenotyper -R human_b37_20.fasta -I realigned_reads.bam -o TEST_UniG.vcf -L 20:10004000-10005000 --output_mode EMIT_ALL_SITES
    RESULT: 1000 "NON-HEAD" rows on the vcf

    java -jar ../GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I realigned_reads.bam -o TEST_HC.vcf -L 20:10004000-10005000 --output_mode EMIT_ALL_SITES
    RESULT: 13 "NON-HEAD" rows on the vcf

    So, as you see (and you can test if you want because you have already this set of data) HaplotypeCaller is not working fine with --output_mode EMIT_ALL_SITES at least in this two versions of GATK:

    GATK 2.5-2
    GATK 2.6-5 (last release)

    Please, let me know if you need something else, but I don't think that you need my bam files to test it.



Sign In or Register to comment.