VariantRecalibrator needs -mG 4 with per-lane fastq files, but not with concatenated fastqs?
The headline is a bit imprecise because of the length limit. What I mean is that when I run an NA12878 sample through my pipeline with the fastq files split per lane, VariantRecalibrator crashes unless I add -mG 4. When I concatenate the files into one R1 and one R2 file, the entire pipeline runs with the standard settings and no issues. Here's the command line I used for the failed execution:
VariantRecalibrator -nt 4 -R human_g1k_v37_decoy.fasta -input NA12878-map_sorted_markdup_recalibhaplotype_haplotype.vcf -mode SNP -recalFile NA12878-map_sorted_markdup_recalibhaplotype_haplotypeSNPs.recal -tranchesFile NA12878-map_sorted_markdup_recalibhaplotype_haplotypeSNPs.tranches -resource:omni,known=false,training=true,truth=true,prior=12.0 references/omni.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 references/1000g.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/dbsnp.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 references/hapmap.vcf -an QD -an MQ -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0
The entire pipeline is: NA12878 lanes 1-8 read 1 and NA12878 lanes 1-8 read 2 fastq > bwa > MergeSamFiles > BaseRecalibrator > PrintReads > HaplotypeCaller > GenotypeGVCFs > VariantRecalibrator SNP/INDEL > ApplyRecalibration SNP/INDEL
We compared the bam file from MergeSamFiles with the equivalent bam file from SortSam in the pipeline that uses the concatenated fastq files as input. There are differences in how reads have been mapped and in which reads are marked as unmapped. So the difference seems to arise when bwa maps the fastq files, but why is that?
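In case it helps anyone reproduce the comparison, here is a rough sketch of how the differences between the two bam files could be counted. It is not the exact check we ran. It assumes paired-end reads, identical read names in both files, and input given as SAM text lines (for example from `samtools view`). The function names `primary_placements` and `compare_placements` are made up for this illustration.

```python
from collections import Counter

# Standard SAM FLAG bits.
UNMAPPED = 0x4          # read is unmapped
FIRST_IN_PAIR = 0x40    # read is the first mate of the pair
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment


def primary_placements(sam_lines):
    """Map (read name, mate number) -> (chrom, pos), or None if unmapped.

    Header lines, secondary and supplementary alignments are skipped,
    so each mate contributes at most one record.
    """
    placements = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        chrom, pos = fields[2], int(fields[3])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        # Assumption: paired-end data, so anything not first-in-pair is mate 2.
        mate = 1 if flag & FIRST_IN_PAIR else 2
        placements[(name, mate)] = None if flag & UNMAPPED else (chrom, pos)
    return placements


def compare_placements(a, b):
    """Count how the primary placements in two runs agree or differ."""
    counts = Counter()
    for key in a.keys() & b.keys():
        if a[key] == b[key]:
            counts["same"] += 1
        elif a[key] is None or b[key] is None:
            counts["mapped_in_one_only"] += 1
        else:
            counts["different_position"] += 1
    counts["only_in_a"] = len(a.keys() - b.keys())
    counts["only_in_b"] = len(b.keys() - a.keys())
    return counts
```

Calling `compare_placements(primary_placements(per_lane_lines), primary_placements(concat_lines))` on the two sets of SAM lines gives a count of reads that are placed identically, placed at different positions, or mapped in only one of the two runs. For a whole-genome sample the dictionaries get large, so restricting the input to one chromosome or a subsample is probably sensible.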