The current GATK version is 3.6-0

IndelRealigner/RealignerTargetCreator known site bundle files

mikemike Member Posts: 103
edited November 2012 in Ask the GATK team

Hi, both IndelRealigner and RealignerTargetCreator take an option for known indel sites, as below:

-known /path/to/indels.vcf

However, the resource bundle (for example, the hg19 one) contains several VCF files:

1000G_indels_for_realignment.hg19.vcf
1000G_omni2.5.hg19.sites.vcf
1000G_omni2.5.hg19.vcf
dbsnp_132.hg19.excluding_sites_after_129.vcf
dbsnp_132.hg19.vcf
hapmap_3.3.hg19.sites.vcf
hapmap_3.3.hg19.vcf
indels_mills_devine.hg19.sites.vcf
indels_mills_devine.hg19.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf

Among them, judging by the names alone, 1000G_indels_for_realignment.hg19.vcf and indels_mills_devine.hg19.sites.vcf look like the files meant for IndelRealigner/RealignerTargetCreator. Could you clarify exactly which files to use for this purpose?

With the old version I used 1000G_phase1.indels.hg19.vcf and Mills_and_1000G_gold_standard.indels.hg19.sites.vcf. I compared the new and old files, and they are quite different now.

Thanks

Mike
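
For readers following along, here is a minimal sketch of how the `-known` option from the question fits into the two realignment steps. It only assembles GATK 3-style argument lists and does not run anything. The jar location, the reference, BAM, and output file names are placeholders (assumptions, not values from this thread), and which VCFs to pass is exactly what the question asks, so the sketch leaves that to the caller.

```python
from typing import List, Sequence


def realigner_commands(
    reference: str,
    bam: str,
    known_vcfs: Sequence[str],
    jar: str = "GenomeAnalysisTK.jar",       # assumed jar location
    intervals: str = "realigner.intervals",  # hypothetical output name
    out_bam: str = "realigned.bam",          # hypothetical output name
) -> List[List[str]]:
    """Return argument lists for RealignerTargetCreator and IndelRealigner."""
    # Repeat -known once per VCF so both steps are given the same known sites.
    known_args: List[str] = []
    for vcf in known_vcfs:
        known_args += ["-known", vcf]

    base = ["java", "-jar", jar]
    target_creator = base + [
        "-T", "RealignerTargetCreator",
        "-R", reference,
        "-I", bam,
        *known_args,
        "-o", intervals,
    ]
    realigner = base + [
        "-T", "IndelRealigner",
        "-R", reference,
        "-I", bam,
        *known_args,
        "-targetIntervals", intervals,  # intervals written by the first step
        "-o", out_bam,
    ]
    return [target_creator, realigner]


if __name__ == "__main__":
    # Placeholder paths; substitute your own reference, BAM, and known-sites VCFs.
    for cmd in realigner_commands(
        "/path/to/reference.fasta",
        "/path/to/input.bam",
        ["/path/to/indels.vcf"],
    ):
        print(" ".join(cmd))
```

Building the argument list once and reusing it for both tools is just a convenience to keep the known-sites files consistent between the two steps.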

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,713 admin

    Geraldine Van der Auwera, PhD

  • mikemike Member Posts: 103
    edited November 2012

    Hi, Geraldine:

    Thanks for the input! However, the article does not seem to be updated for GATK v2.0 or newer. For example, the article says that for realignment we should use:

    Mills_and_1000G_gold_standard.indels.b37.sites.vcf
    1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    

    which is exactly what I used with the old version, as described in my original post above. But in the bundle for the new version, those files are gone, or at least their names have changed. Here again are the files in the bundle for the new version:

    1000G_indels_for_realignment.hg19.vcf
    1000G_omni2.5.hg19.sites.vcf
    1000G_omni2.5.hg19.vcf
    dbsnp_132.hg19.excluding_sites_after_129.vcf
    dbsnp_132.hg19.vcf
    hapmap_3.3.hg19.sites.vcf
    hapmap_3.3.hg19.vcf
    indels_mills_devine.hg19.sites.vcf
    indels_mills_devine.hg19.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf
    

    I think I should use 1000G_indels_for_realignment.hg19.vcf for realignment, but what about indels_mills_devine.hg19.sites.vcf and indels_mills_devine.hg19.vcf? Which of those should I use for realignment?

    Thanks again

    Mike

  • mikemike Member Posts: 103

    Thanks a lot for the great detailed info, Geraldine! Appreciated! Mike

  • mikemike Member Posts: 103

    Hi, Geraldine:

    Sorry, I just realized that your web page actually describes the new version. Our installation staff mixed up the new and old versions on our side. Sorry about the confusion; your web page is fine.

    Thanks anyway for the info!

    Best

    Mike

  • Seq2FindSeq2Find USAMember Posts: 2

    Hello Geraldine,

    I happened to use "1000G_omni2.5.hg19.vcf" instead of "Mills_and_1000G_gold_standard.indels.b37.vcf" for the parameter '-known /path/to/indels.vcf' in the IndelRealigner/RealignerTargetCreator step. (I wasn't sure of the purpose of each of the VCF files in the bundle at the time.)

    Should I re-run this step with the recommended file, "Mills_and_1000G_gold_standard.indels.b37.vcf"?

    Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,295 admin

    @Seq2Find
    Hi,

    Yes, you should rerun with the recommended file. The Omni file contains SNPs only, not indels.

    -Sheila
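
One way to check a bundle file yourself before using it with `-known` is to count how many of its alternate alleles are SNPs and how many are indels. The sketch below does this with plain text parsing. It is an illustration only: the function names are made up for this example, and the classification is a simple length comparison of REF and ALT (symbolic and breakend alleles are counted as "other"), not GATK's own logic.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator


def classify_alleles(ref: str, alt_field: str) -> Iterator[str]:
    """Yield 'snp', 'indel', or 'other' for each ALT allele of one record."""
    for alt in alt_field.split(","):
        if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
            yield "other"   # missing, spanning, symbolic, or breakend allele
        elif len(ref) == 1 and len(alt) == 1:
            yield "snp"
        elif len(ref) != len(alt):
            yield "indel"   # length change between REF and ALT
        else:
            yield "other"   # same-length multi-base substitution


def count_variant_types(vcf_lines: Iterable[str]) -> Counter:
    """Count allele types over the data lines of a VCF."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        counts.update(classify_alleles(fields[3], fields[4]))  # REF, ALT
    return counts


def count_variant_types_in_file(path: str) -> Counter:
    """Same as count_variant_types, reading a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variant_types(handle)
```

A file that reports zero indels here gives the realignment tools no known indel sites to work with, which is the situation described in the answer above.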

  • Seq2FindSeq2Find USAMember Posts: 2

    Thank you very much Sheila.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 108

    @Sheila @Geraldine_VdAuwera ,

    Hi,
    I read in the supplementary information of the main 1000G paper (2015) that they used the following files for indel realignment:

    ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz
    ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz

    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/

    In the GATK pipeline, we have:
    Mills_and_1000G_gold_standard.indels.b37.vcf
    1000G_phase1.indels.b37.vcf

    Could you please tell us which files are more up to date and useful in your opinion?

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,295 admin

    @MUHAMMADSOHAILRAZA
    Hi,

    Have a look at our recommendations here.

    -Sheila

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 108

    @Sheila

    Hi,
    Thanks for the reply!
    I already know the recommendations given there for GATK.

    Do you have any idea how these two sets of indels compare:

    ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz (1000G pipeline)
    ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz (1000G pipeline)
    at
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/

    and
    Mills_and_1000G_gold_standard.indels.b37.vcf (GATK Pipeline)
    1000G_phase1.indels.b37.vcf (GATK pipeline)

    Is there any difference between these two sets of files?

    Regards
    sohail
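
For anyone who wants to look into this themselves, a rough first pass is to compare the site sets of two VCFs directly. The sketch below is only that: it treats a site as the tuple (chromosome, position, REF, ALT), optionally strips a leading "chr" so that hg19-style and b37-style contig names can be matched, and counts shared and unique sites. The function names are hypothetical, and an exact-match comparison like this will miss indels that are the same event but written at a different position or with different padding bases, so the counts say nothing definitive about whether two resources are equivalent.

```python
from typing import Dict, Iterable, Set, Tuple

Site = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)


def load_sites(vcf_lines: Iterable[str], strip_chr_prefix: bool = True) -> Set[Site]:
    """Collect one (chrom, pos, ref, alt) tuple per ALT allele."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        if strip_chr_prefix and chrom.startswith("chr"):
            chrom = chrom[3:]  # assumption: "chr1" and "1" name the same contig
        for alt in alts.split(","):
            sites.add((chrom, int(pos), ref.upper(), alt.upper()))
    return sites


def compare_sites(first: Set[Site], second: Set[Site]) -> Dict[str, int]:
    """Count sites shared by both sets and sites unique to each."""
    return {
        "shared": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
    }
```

To use it on real files, open each VCF as text (with `gzip.open(path, "rt")` for `.vcf.gz` files) and pass the file handles to `load_sites`.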

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,713 admin

    Hi @MUHAMMADSOHAILRAZA,

    We have someone on the team currently writing up some documentation on the provenance of our resource files, which may shed some light on this. In the meantime we can't really comment on whether those two sets might be equivalent, which I think is what you're asking. They do sound fairly similar though, for what it's worth.

    Geraldine Van der Auwera, PhD

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 108

    @Geraldine_VdAuwera
    Hi,

    Any update on the comparison of the two file sets?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,713 admin

    No, we've been very busy.

    Geraldine Van der Auwera, PhD
