Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Calling only some positions

d_nid_ni Posts: 17Member

Hi,

I'm trying to use GATK to validate reproducibility of the NGS part. We run the same sample 3 times on a MiSeq and I have a list of positions and I want to see if GATK has the same calls in all samples or not.

I'm trying to do it with this command:

RUNX.bam -> input bam files POSSITIONS.vcf -> vcf with 5035 different positions

(GATK v2.5-2-gf57256b) java - jar GenomeAnalysisTK -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles POSSITIONS.vcf -out_mode EMIT_ALL_SITES

I expected that my result have same amount of positions (calls), but my result only have ~2200

Tagged:

Best Answer

Answers

  • d_nid_ni Posts: 17Member

    Ok, I think I have to explain it better...

    I have a vcf file with a ~5000 snp (no indels) from microarray data and 3 bam files (MiSeq) of the same sample as the microarray vcf. What I want to do is compare this data from this vcf (microarray) with the data from the MiSeq with GATK, I hope to have a vcf file result from GATK with same positions as the microarray vcf and the call for each one of the bam files to compare with the vcf from microarray.

    Is that possible with GATK? I tough that "--genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles" are to do that, but as I wrote above, result vcf is a subset of the original one.

    All comments are welcome.

    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,285Administrator, GSA Member admin

    Hi there,

    Thanks for clarifying, that helps.

    Genotype given alleles mode is not meant for what you are trying to do. In your case you should perform variant calling normally (in the default discovery mode) then compare the resulting variant callsets using VariantEval and Genotype Concordance.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17Member

    Thanks for your reply!

    I'm not sure if it'll be the result that I expected, I think that I can explain better with an example

    If I have on my microarray vcf this line:

    chr1 5926507 rs555164 T C 100 PASS DB GT 0|0

    But this base doesn't have data on my GATK VCF result because there are no variants on it or I have poor coverage, how I could realize if it's because there is no variant (as on the microarray vcf) or it's because of poor coverage/gap?

    In other words, is there an option to call only a list of positions and get the full result?

    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,285Administrator, GSA Member admin

    Ah, got it. To get that you can pass your VCF file with the -L argument (so it will be used as a list of positions to call) and the --output_mode set to EMIT_ALL_SITES.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17Member

    Hi,

    I tried it, but my result vcf only has ~500 calls instead of ~5000 that the microarray vcf has, any idea of why?

    Command: GenomeAnalysisTK -Xmx10g -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES GATK version: v2.5-2-gf57256b

  • d_nid_ni Posts: 17Member

    Hi Geraldine,

    I was taking a look on the result of the command that you told me and I'm not sure how to interpret it... as I told you I only have ~500 calls but my vcf "reference" from iscan has ~5000 can I understand that the calls that aren't on the result vcf are the same as reference? what happen if I have no coverage in all my samples on one of the positions of the iscan vcf?

    In other hand, I was trying to use ENOTYPE_GIVEN_ALLELES because I was in your last workshop and one of your colleagues told that this option can be used to "validate custom captures" and that it's what I'm trying to do, do you have more info about it?

    Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,285Administrator, GSA Member admin

    Hmm, that's odd because with the -output_mode EMIT_ALL_SITES you should get an output line for every single position that is in your "reference" vcf file that you pass in as intervals file. If this is not working for you, you can upload test files to our FTP server and we will try to reproduce the problem locally. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    We don't have dedicated documentation for GENOTYPE_GIVEN_ALLELES, but in short it tells the caller "don't look at all possible alleles, just look at the alleles in this VCF and tell me which of those my samples have". So it is primarily meant for diagnostic applications (eg checking if a patient sample has a given allele) and it can be useful for certain validation experiments, that's true. I guess it depends exactly what you're trying to achieve; if you're trying to see whether your sequencing runs can consistently identify specific alleles, then GGA is appropriate for you. If you want to compare the congruence of your results in general it is too limiting. I was thinking more of the latter but if the former is what you want then it does make sense to use GGA.

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17Member

    Hi,

    Thanks again, this forum is really useful!

    So I solved it, but I think that you have a bug on GATK, because when I use: GenomeAnalysisTK -Xmx10g -T HaplotypeCaller -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES I get ~500 snps

    But if I run it with UnifiedGenotyper: GenomeAnalysisTK -Xmx10g -T UnifiedGenotyper -R ucsc.hg19.fasta -I RUN1.bam -I RUN2.bam -I RUN3.bam -o RESULT.vcf -L microarray.vcf --output_mode EMIT_ALL_SITES I get ~5000 snps!

    So it seems that this option (--output_mode EMIT_ALL_SITES) doesn't work in haplotypecaller

    Daniel.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,285Administrator, GSA Member admin

    Hmm, I'm glad your problem is solved (and you can probably try out the GGA mode too now) but what you're seeing with HC is a little worrying. Could you maybe send us some test files that reproduce the issue so we can try to debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • d_nid_ni Posts: 17Member

    Hi Geraldine,

    I tested it with the mini-bundle that I downloaded for the workshop, and I have the same issue... I can paste the commands that I used, I think that is not a problem with my data, I think that it's a bug for this set of options:

    java -jar ../GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T UnifiedGenotyper -R human_b37_20.fasta -I realigned_reads.bam -o TEST_UniG.vcf -L 20:10004000-10005000 --output_mode EMIT_ALL_SITES RESULT: 1000 "NON-HEAD" rows on the vcf

    java -jar ../GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I realigned_reads.bam -o TEST_HC.vcf -L 20:10004000-10005000 --output_mode EMIT_ALL_SITES RESULT: 13 "NON-HEAD" rows on the vcf

    So, as you see (and you can test if you want because you have already this set of data) HaplotypeCaller is not working fine with --output_mode EMIT_ALL_SITES at least in this two versions of GATK:

    GATK 2.5-2 GATK 2.6-5 (last release)

    Please, let me know if you need something else, but I don't think that you need my bam files to test it.

    Best,

    Daniel.

Sign In or Register to comment.