
VariantRecalibration needs -mG 4 with per-lane FASTQs, but not with concatenated FASTQs?

oskarv (Bergen, Member)

The headline is a bit imprecise because of the length limit. What I mean is that when I run an NA12878 sample through my pipeline with the FASTQ files split per lane, I need to use -mG 4 in VariantRecalibration for it to not crash. But when I concatenate the files into one R1 file and one R2 file, the entire pipeline runs with the standard settings and no issues. Here's the command line I used for the failed execution:

VariantRecalibrator -nt 4 \
  -R human_g1k_v37_decoy.fasta \
  -input NA12878-map_sorted_markdup_recalibhaplotype_haplotype.vcf \
  -mode SNP \
  -recalFile NA12878-map_sorted_markdup_recalibhaplotype_haplotypeSNPs.recal \
  -tranchesFile NA12878-map_sorted_markdup_recalibhaplotype_haplotypeSNPs.tranches \
  -resource:omni,known=false,training=true,truth=true,prior=12.0 references/omni.vcf \
  -resource:1000G,known=false,training=true,truth=false,prior=10.0 references/1000g.vcf \
  -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/dbsnp.vcf \
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 references/hapmap.vcf \
  -an QD -an MQ -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
  -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 \
  -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0

The entire pipeline is: NA12878 lanes 1-8, read 1 and read 2 FASTQs > bwa > MergeSamFiles > BaseRecalibrator > PrintReads > HaplotypeCaller > GenotypeGVCFs > VariantRecalibrator SNP/INDEL > ApplyRecalibration SNP/INDEL

We took a look at the BAM file from MergeSamFiles and the equivalent BAM file from SortSam in the pipeline that uses the concatenated FASTQ files as input, and the two differ in which reads are mapped and which are marked as unmapped. So the difference seems to arise in bwa when it maps the FASTQ files, but why is that?
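To make a difference like this concrete, it can help to list exactly which reads change status between the two BAM files. The following is only a minimal sketch, not part of the original pipeline: it assumes both BAMs have been dumped to SAM text (for example with `samtools view -h`), and the function names `primary_placements` and `compare_placements` are hypothetical. It keeps only primary alignments and compares, per read and mate, the unmapped flag (0x4), contig, and position.

```python
from typing import Dict, Iterable, List, Tuple

# SAM flag bits (from the SAM specification)
UNMAPPED = 0x4
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

Key = Tuple[str, int]               # (read name, mate number: 1, 2, or 0 if unpaired)
Placement = Tuple[bool, str, int]   # (is mapped, contig, 1-based position)


def primary_placements(sam_lines: Iterable[str]) -> Dict[Key, Placement]:
    """Collect the primary alignment record of every read in SAM text."""
    placements: Dict[Key, Placement] = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only the primary record decides mapped vs. unmapped
        if flag & FIRST_IN_PAIR:
            mate = 1
        elif flag & SECOND_IN_PAIR:
            mate = 2
        else:
            mate = 0
        is_mapped = not (flag & UNMAPPED)
        placements[(fields[0], mate)] = (is_mapped, fields[2], int(fields[3]))
    return placements


def compare_placements(
    a: Dict[Key, Placement], b: Dict[Key, Placement]
) -> Dict[str, List[Key]]:
    """Report reads present in both inputs whose mapping status or position differs."""
    report: Dict[str, List[Key]] = {
        "mapped_only_in_a": [],
        "mapped_only_in_b": [],
        "moved": [],
    }
    for key in sorted(a.keys() & b.keys()):
        mapped_a, mapped_b = a[key][0], b[key][0]
        if mapped_a and not mapped_b:
            report["mapped_only_in_a"].append(key)
        elif mapped_b and not mapped_a:
            report["mapped_only_in_b"].append(key)
        elif mapped_a and a[key] != b[key]:
            report["moved"].append(key)
    return report
```

With the per-lane SAM text as input `a` and the concatenated-FASTQ SAM text as input `b`, the read names in the report can then be looked up in both files to see how their records differ. Holding every read in memory may be impractical for a whole genome, so restricting both dumps to one region first is probably sensible.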

GitHub issue 2066 (filed by Sheila), state: open

Answers

  • Sheila (Broad Institute; Member, Broadie, admin)

    @oskarv
    Hi,

    I have asked Soo Hee to look into this. We will get back to you soon.

    -Sheila

  • shlee (Cambridge; Member, Broadie)

    Hi again @oskarv,

    First, when you say:

    "I need to use -mG 4 in VariantRecalibration for it to not crash."

    Do you mean -nt 4?

    Second, when you concatenate the different lane-level FASTQs, how are you then assigning appropriate read groups (@RG) for effective BQSR?

    Third, when you say there are differences in how BWA maps reads or marks them as unmapped, can you be more specific? We don't officially answer BWA questions, as it is an outside tool, but if you can give an example of the oddity you are seeing, perhaps it will be obvious to one of us what is going on and whether it is happening at the alignment step or at another step.

    Can you also do us a favor and read through this blogpost? It has some guidelines to help you help yourself with questions you have.
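
On the read-group question above: BQSR models errors per read group, so reads from different lanes generally need distinct @RG entries, which a single concatenated FASTQ aligned with one read-group string would not provide. The sketch below is an illustration under stated assumptions, not the poster's method: it assumes Illumina CASAVA 1.8+ style read names (`@instrument:run:flowcell:lane:tile:x:y ...`), and the function name `read_group_from_header` and the ID/PU scheme (`flowcell.lane`) are hypothetical choices. It builds a read-group string in the escaped form that `bwa mem -R` accepts.

```python
def read_group_from_header(header: str, sample: str, library: str) -> str:
    """Build an @RG string for `bwa mem -R` from one FASTQ read-name line.

    Assumes a CASAVA 1.8+ style name:
    @instrument:run:flowcell:lane:tile:x:y [optional comment]
    """
    name = header.strip().split()[0]
    if name.startswith("@"):
        name = name[1:]
    parts = name.split(":")
    if len(parts) < 7:
        raise ValueError(f"unexpected read name format: {header!r}")
    flowcell, lane = parts[2], parts[3]
    unit = f"{flowcell}.{lane}"  # assumed naming scheme: one read group per flowcell lane
    fields = [
        ("ID", unit),
        ("SM", sample),
        ("LB", library),
        ("PL", "ILLUMINA"),
        ("PU", unit),
    ]
    # bwa expects a literal backslash followed by 't' as the field separator
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)
```

Calling this once per lane-level FASTQ (on its first header line) and passing the result to `bwa mem -R` for that lane's alignment would give each lane its own read group before the BAMs are merged.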
