BAM index files from 1000G not recognized & other questions

firasr83 (New York, Member)
edited November 2013 in Ask the GATK team

I'm working on a small exome capture experiment using matched paired-end data from tumor samples on the Illumina platform (a cohort of 5 patient/tumor samples and a cohort of 5 parental samples). I'm using GATK to find indels and SNPs in these samples.

I have analysis-ready BAMs from these 10 samples and am now ready to do variant calling.

  1. When doing the raw variant calling, is it preferable to create a multisample VCF?
  2. If so, should I create a multisample VCF for each cohort (one VCF for patients, another VCF for parents)?
  3. Since each cohort has fewer than 30 samples, I understand from the GATK guide that I need to download third-party BAMs from 1000G so that GATK can filter the raw VCFs. This means adding roughly 50 additional BAMs from a population similar to the cohorts (in this case, Caucasians) as input files when creating the raw VCF. I've been able to get a list of these BAMs and associated BAIs from and have downloaded the same. After creating a .list file containing the absolute pathnames of all input files (cohort BAMs + 1000G BAMs) and passing it to the -I parameter, GATK complains that the 1000G BAMs don't have associated BAIs, even though they do. I renamed the BAI files to have the same basename as the BAMs, still with no luck. When I omit the 1000G BAMs, GATK proceeds without problems. Is there any way for GATK to recognize the 1000G BAIs, so that I don't have to dedup and index the BAMs on my own?
  4. Once the filtered VCFs are created, how do I identify or extract only the SNPs/indels from the experiment samples rather than from the whole set (cohort + 1000G)? Is there a tool to create a VCF for only these samples?


Best Answers


  • firasr83 (New York, Member)

    @firasr83 said:
    I've been able to get a list of these BAMs and associated BAIs from

    I wanted to correct the URL address to: . That's what I have used.

  • firasr83 (New York, Member)

    I'd posted a comment a few days ago that was pending moderation and hasn't appeared here yet. I'm not sure what happened.

    Thanks for the help, Geraldine. GATK behaves as expected when multiple BAMs are passed explicitly to the -I parameter. The problems appear when I use a .list file of input file names, especially with the 1000G BAMs: GATK then complains that there are no associated BAI files. Re-indexing the 1000G BAMs doesn't help, as GATK then complains that the BAIs are corrupt.

    I tried experimenting and used the script command to log the output. Here's the pastebin:

    Not sure what's wrong. Any help would be appreciated. And I hope to hear from you soon. Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @firasr83, your comment might have been mistaken as spam; our spam blockers get a little overzealous sometimes when links are included in a post. I've whitelisted your account so it shouldn't happen again.

    I'm not sure what's going wrong; the one difference I see is that in one case (passing BAMs directly) you are using relative paths, while in the other (BAM list) you are using absolute paths. This shouldn't make any difference, but I would recommend testing the two complementary cases (absolute paths passed directly, relative paths in the list) just to rule out any funkiness at that level.
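
One way to compare the two cases is to inspect the .list file entry by entry before handing it to GATK. The following is only a minimal Python sketch, not part of GATK: the function name `inspect_bam_list` and the report fields are hypothetical, and relative entries are resolved against the current working directory, which is an assumption about how the job is launched.

```python
import os


def inspect_bam_list(list_path):
    """Return one report dict per non-empty line of a BAM .list file.

    Hypothetical helper for illustration; it only looks at the file system
    and does not reproduce how GATK itself parses the list.
    """
    reports = []
    with open(list_path) as handle:
        for line_no, raw in enumerate(handle, start=1):
            entry = raw.rstrip("\n")   # the line as written, minus the newline
            path = entry.strip()       # the line with surrounding whitespace removed
            if not path:
                continue               # skip blank lines
            reports.append({
                "line": line_no,
                "path": path,
                "is_absolute": os.path.isabs(path),
                # Relative paths are checked against the current directory.
                "exists": os.path.isfile(path),
                # True if the entry carries leading/trailing spaces or tabs.
                "stray_whitespace": entry != path,
            })
    return reports
```

Printing the reports for a list such as `inputs.list` (a placeholder name) shows at a glance which entries are relative, which do not resolve to a file, and which contain whitespace that is invisible in an editor.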

  • firasr83 (New York, Member)

    Thank you @pdexheimer! That was it! GATK should have been a little bit more specific in terms of throwing up the error. :-)

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Nice job with the eagle eye, @pdexheimer :)

    @firasr83, I'll see if we can coax GATK into explicitly specifying which file is lacking an index in the future.
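
In the meantime, a short script can report which BAMs in a list have no index file next to them. This is a rough Python sketch under stated assumptions: the helper names `index_candidates` and `bams_missing_index` are hypothetical, and it assumes an index is named either `sample.bam.bai` or `sample.bai`, the two common naming conventions. It approximates the lookup from the outside and is not GATK's actual logic; it also does not check whether an index is valid or up to date.

```python
import os


def index_candidates(bam_path):
    """Return the index file names assumed here: x.bam.bai and x.bai."""
    stem, _ext = os.path.splitext(bam_path)
    return [bam_path + ".bai", stem + ".bai"]


def bams_missing_index(list_path):
    """Return the BAM paths from a .list file that have no index beside them."""
    missing = []
    with open(list_path) as handle:
        for raw in handle:
            bam = raw.strip()
            if not bam:
                continue  # skip blank lines
            # A BAM counts as indexed if any candidate index file exists.
            if not any(os.path.isfile(c) for c in index_candidates(bam)):
                missing.append(bam)
    return missing
```

Calling `bams_missing_index("inputs.list")` (again a placeholder file name) returns the offending paths, which narrows a generic "missing index" complaint down to specific files.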
