What is the best way to make an in-house database with allele frequencies (like 1000 Genomes or ExAC)?
I am currently working on a pilot in-house database built from around 35 exomes we have in our laboratory. I am following the GATK Best Practices guidelines for the analysis. The pipeline I am using is:
1. Data pre-processing to make a clean BAM from FASTQ files (BWA, Picard)
2. Mark duplicates (Picard MarkDuplicates)
3. Realignment around indels (RealignerTargetCreator + IndelRealigner)
4. Base quality score recalibration (BaseRecalibrator + PrintReads)
5. Generate a GVCF or VCF (HaplotypeCaller)
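For reference, here is a minimal per-sample sketch of the steps above. It assumes GATK 3.x and Picard jars in the working directory; all file names (`ref.fa`, `dbsnp.vcf`, the FASTQ names) are placeholders for your own data, and real runs would add memory/threading options:

```shell
REF=ref.fa
SAMPLE=sample1   # placeholder sample name

# 1. Align with BWA-MEM (read group required by GATK) and coordinate-sort with Picard
bwa mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" $REF \
    ${SAMPLE}_R1.fastq.gz ${SAMPLE}_R2.fastq.gz | \
    java -jar picard.jar SortSam I=/dev/stdin O=${SAMPLE}.sorted.bam SORT_ORDER=coordinate

# 2. Mark duplicates
java -jar picard.jar MarkDuplicates I=${SAMPLE}.sorted.bam \
    O=${SAMPLE}.dedup.bam M=${SAMPLE}.dup_metrics.txt

# 3. Indel realignment
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $REF \
    -I ${SAMPLE}.dedup.bam -o ${SAMPLE}.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R $REF -I ${SAMPLE}.dedup.bam \
    -targetIntervals ${SAMPLE}.intervals -o ${SAMPLE}.realigned.bam

# 4. BQSR (dbsnp.vcf stands in for your known-sites resources)
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF -I ${SAMPLE}.realigned.bam \
    -knownSites dbsnp.vcf -o ${SAMPLE}.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R $REF -I ${SAMPLE}.realigned.bam \
    -BQSR ${SAMPLE}.recal.table -o ${SAMPLE}.recal.bam

# 5. Call variants (plain VCF shown; see the GVCF option below)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF \
    -I ${SAMPLE}.recal.bam -o ${SAMPLE}.vcf
```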
After this, I see two ways to proceed.
Either I can generate separate VCF files for the 35 samples directly with HaplotypeCaller, hard-filter them, and merge them with VCFtools (vcf-merge) into a multi-sample VCF file. Then I can calculate the minor allele frequency (MAF) for each variant in the multi-sample VCF file using VCFtools.
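To make the MAF step concrete: on a merged file the VCFtools call would be `vcftools --vcf merged.vcf --freq`. The toy example below (hypothetical three-sample VCF) shows with plain awk what that per-site allele frequency amounts to, counting ALT alleles over all called alleles and folding to the minor allele:

```shell
# Toy 3-sample VCF (placeholder data) to illustrate the per-site MAF calculation
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2	S3
1	100	.	A	G	50	PASS	.	GT	0/1	1/1	0/0
1	200	.	C	T	50	PASS	.	GT	0/0	0/1	0/0
EOF

# Count ALT alleles across genotype columns; MAF is the folded allele frequency
awk '!/^#/ {
    alt = 0; n = 0
    for (i = 10; i <= NF; i++) {          # genotype columns start at field 10
        split($i, gt, /[\/|]/)            # handle both 0/1 and 0|1
        for (j in gt) { n++; if (gt[j] == "1") alt++ }
    }
    af = alt / n
    maf = (af < 0.5) ? af : 1 - af
    printf "%s:%s MAF=%.3f\n", $1, $2, maf
}' toy.vcf > maf.txt
cat maf.txt
# Site 1:100 has 3 of 6 ALT alleles (MAF 0.500); site 1:200 has 1 of 6 (MAF 0.167)
```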
The other way to proceed is to generate separate GVCF files for the 35 samples using HaplotypeCaller, do joint genotyping with GenotypeGVCFs to make a multi-sample VCF file, and then perform VQSR. Again, I can calculate the MAF for each variant in the multi-sample VCF file using VCFtools.
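A sketch of that second route, again assuming GATK 3.x; the VQSR resource files (`hapmap.vcf`, `dbsnp.vcf`) are placeholders, only two of the 35 samples are spelled out, and only SNP mode is shown (the indel pass is run the same way afterwards):

```shell
REF=ref.fa

# Per sample: emit a GVCF instead of a plain VCF
for S in sample1 sample2; do   # ... through sample35
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF -I ${S}.recal.bam \
        --emitRefConfidence GVCF -o ${S}.g.vcf
done

# Joint genotyping across all GVCFs into one multi-sample VCF
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $REF \
    -V sample1.g.vcf -V sample2.g.vcf \
    -o joint.vcf

# VQSR, SNP mode (placeholder resource files and a reduced annotation list)
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF -input joint.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
    -mode SNP -recalFile joint.recal -tranchesFile joint.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF -input joint.vcf \
    -mode SNP --ts_filter_level 99.0 \
    -recalFile joint.recal -tranchesFile joint.tranches -o joint.vqsr.vcf
```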
My question is: which of these two ways do you recommend? Would you suggest a different pipeline altogether? Please also correct me if anything in the pipelines above is wrong.