BAM index files from 1000G not recognized & other questions
I'm working on a small exome capture experiment using matched paired-end Illumina data from tumor samples: a cohort of 5 patient (tumor) samples and a cohort of 5 parental samples. I'm using GATK to find indels/SNPs in these samples.
I have analysis-ready BAMs for these 10 samples and am now ready to do variant calling.
- When doing the raw variant calling, is it preferred to create a multisample VCF?
- If so, should I create a multisample VCF for each cohort (one VCF for patients, another VCF for parents)?
- Since each cohort has fewer than 30 samples, I understand from the GATK guide that I need to download third-party BAMs from 1000G so that GATK can filter the raw VCFs. This means adding roughly 50 additional BAMs from a population similar to my cohorts (in this case, Caucasians) as extra input files when creating the raw VCF. I got a list of these BAMs and their associated BAIs from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/exome.alignment.index and downloaded them. I then created a .list file containing the absolute pathnames of all input files (cohort BAMs + 1000G BAMs) and passed it to the -I parameter. GATK complains that the 1000G BAMs don't have associated BAIs, even though they do. I renamed the BAI files to have the same basename as the BAMs, still with no luck. When I omit the 1000G BAMs, GATK proceeds without problems. Is there any way to get GATK to recognize the 1000G BAIs? That would save me from having to dedup and index the files on my own.
- Once the filtered VCFs are created, how do I access only the SNPs/indels from my experiment samples rather than from the whole set (cohort + 1000G)? Is there a tool to create a VCF containing only these samples?
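
To make the setup in the third question concrete, here is a minimal Python sketch of building the .list file and checking for an index next to each BAM. It is not the exact script used. The function names are hypothetical, and the two index names it looks for (`sample.bam.bai` and `sample.bai`) are only the usual naming conventions, so passing this check does not guarantee that GATK will accept the index.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def index_candidates(bam: Path) -> List[Path]:
    """Return the two conventional index locations for a BAM file."""
    return [
        bam.with_name(bam.name + ".bai"),  # sample.bam -> sample.bam.bai
        bam.with_suffix(".bai"),           # sample.bam -> sample.bai
    ]


def find_index(bam: Path) -> Optional[Path]:
    """Return the first index file found next to the BAM, or None."""
    for candidate in index_candidates(bam):
        if candidate.is_file():
            return candidate
    return None


def write_bam_list(bams: Iterable, list_path) -> List[Path]:
    """Write absolute BAM paths, one per line, to list_path.

    Returns the BAMs for which no index was found under either name.
    """
    lines = []
    missing = []
    for bam in bams:
        bam = Path(bam).resolve()  # the .list file holds absolute pathnames
        lines.append(str(bam))
        if find_index(bam) is None:
            missing.append(bam)
    Path(list_path).write_text("\n".join(lines) + "\n")
    return missing
```

For example, `write_bam_list(sorted(Path("bams").glob("*.bam")), "input.list")` (where `bams` is a placeholder directory) writes the list and returns any BAMs whose index is missing, which helps tell a naming or location problem apart from a problem with the index files themselves.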