A complete script that processes a trio with WGS data from FASTQ to BAM to VCF
There is a lot of GATK documentation online, but I could not find anything that shows exactly how to process a WGS dataset from start to finish. Right now I have 3 samples with WGS data, and each sample has 4 FASTQ files. For the first sample, for example, they are: s1_L1_1.fa.gz, s1_L1_2.fa.gz, s1_L2_1.fa.gz, s1_L2_2.fa.gz. My question is: what is the exact series of commands I should use to create a VCF file from these 3 samples?
I found a nice example at http://www.htslib.org/workflow, but I am afraid it is not the latest version. I have spent quite some time trying to figure this out. Below is what I run for each of the 3 samples:
bwa mem -t 1000 -k 32 -M hg19.fa s1_L1_1.fa.gz s1_L1_2.fa.gz s1_L2_1.fa.gz s1_L2_2.fa.gz | samtools view -b -S -t hg19.fa.fai - > s1.bam
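One point about the alignment step: `bwa mem` takes the index prefix followed by one FASTQ file, or two for paired-end reads, so the four files of a sample cannot be given in a single call. A minimal sketch of a per-lane alternative is below. It only prints the commands it would run. The read-group fields, the `PL:ILLUMINA` value, the thread count of 4, the function name and the output file names are assumptions for illustration, not part of any official workflow.

```shell
#!/bin/sh
# Dry run: print (do not execute) one alignment command per lane, then a merge.
# Assumes FASTQ names follow <sample>_<lane>_1.fa.gz / <sample>_<lane>_2.fa.gz
# and that hg19.fa has already been indexed with `bwa index`.
print_align_cmds() {
    sample=$1
    shift
    lane_bams=""
    for lane in "$@"; do
        # Hypothetical read group: one ID per lane, one SM per sample.
        # bwa itself turns the literal \t into a tab.
        rg="@RG\\tID:${sample}_${lane}\\tSM:${sample}\\tPL:ILLUMINA"
        out="${sample}_${lane}.sorted.bam"
        printf '%s\n' "bwa mem -t 4 -M -R '${rg}' hg19.fa ${sample}_${lane}_1.fa.gz ${sample}_${lane}_2.fa.gz | samtools sort -@ 4 -o ${out} -"
        lane_bams="${lane_bams} ${out}"
    done
    # Merge the coordinate-sorted lane BAMs into one BAM per sample.
    printf '%s\n' "samtools merge ${sample}.merged.bam${lane_bams}"
}

print_align_cmds s1 L1 L2
```

Keeping one read group per lane lets downstream tools tell the lanes apart while the shared `SM` tag still groups them as one sample.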
samtools sort -@ 4 s1.tmp.bam s1.sorted.bam, then java -jar picard/MarkDuplicates.jar I=s1.sorted.bam O=s1.markdup.bam M=s1.dupStat
gatk RealignerTargetCreator -R hg19.fq.gz -I s1.sorted.bam -known indels.vcf -O realigner.intervals
gatk BaseRecalibrator -R hg19.fq.gz -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -O recal.table
gatk HaplotypeCaller -R hg19.fa.gz -I s1.bam -o s1.gvcf -ERC GVCF
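For comparison, RealignerTargetCreator and IndelRealigner exist only in GATK 3. They were removed in GATK 4, where HaplotypeCaller does its own local reassembly. A rough sketch of a GATK 4 style sequence for one sample is below. It only prints the commands. It assumes that `<sample>.sorted.bam` is coordinate-sorted and carries read groups, that `hg19.fa` is an uncompressed FASTA with its `.fai` and `.dict` files, and that the known-sites VCFs are indexed. The function name and the intermediate file names are hypothetical.

```shell
#!/bin/sh
# Dry run: print (do not execute) a GATK 4 style per-sample sequence.
# Assumptions: input is <sample>.sorted.bam; reference is hg19.fa (hypothetical paths).
print_sample_cmds() {
    s=$1
    ref=hg19.fa
    # Mark duplicates (GATK 4 bundles the Picard tools), then index the BAM.
    printf '%s\n' "gatk MarkDuplicates -I ${s}.sorted.bam -O ${s}.markdup.bam -M ${s}.dupStat"
    printf '%s\n' "samtools index ${s}.markdup.bam"
    # Base quality score recalibration: build the table, then apply it.
    printf '%s\n' "gatk BaseRecalibrator -R ${ref} -I ${s}.markdup.bam --known-sites dbsnp137.vcf --known-sites gold.standard.indels.vcf -O ${s}.recal.table"
    printf '%s\n' "gatk ApplyBQSR -R ${ref} -I ${s}.markdup.bam --bqsr-recal-file ${s}.recal.table -O ${s}.recal.bam"
    # Per-sample calling in gVCF mode.
    printf '%s\n' "gatk HaplotypeCaller -R ${ref} -I ${s}.recal.bam -O ${s}.g.vcf.gz -ERC GVCF"
}

print_sample_cmds s1
```

Each step reads the output of the previous one, so the file passed to HaplotypeCaller is the recalibrated BAM rather than the raw alignment.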
Once I have done the above for each of the 3 samples, I merge the 3 gVCF files together with:
gatk GenotypeGVCFs -R hg19.fa.gz -V s1.gvcf -V s2.gvcf -V s3.gvcf
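In GATK 4, `GenotypeGVCFs` accepts a single `-V` input, so the per-sample gVCFs are usually combined first, for example with `CombineGVCFs`. A minimal sketch of that two-step form is below. It only prints the commands. The `.g.vcf.gz` input names, the `trio.*` output names and the function name are assumptions for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) a GATK 4 style joint-genotyping sequence.
# Assumes each sample has a gVCF named <sample>.g.vcf.gz (hypothetical naming).
print_joint_cmds() {
    ref=$1
    shift
    variants=""
    for sample in "$@"; do
        variants="${variants} -V ${sample}.g.vcf.gz"
    done
    # Step 1: combine the per-sample gVCFs into one multi-sample gVCF.
    printf '%s\n' "gatk CombineGVCFs -R ${ref}${variants} -O trio.g.vcf.gz"
    # Step 2: joint-genotype the combined gVCF into the final VCF.
    printf '%s\n' "gatk GenotypeGVCFs -R ${ref} -V trio.g.vcf.gz -O trio.vcf.gz"
}

print_joint_cmds hg19.fa s1 s2 s3
```

Note that the genotyping step names an output file with `-O`, which the command above it in this post does not.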
Can someone please let me know if I got the above correct? If not, can you please kindly correct me?
Thank you & best regards,