To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

What is the best practice for calling/combining variants across multiple RNA-Seq datasets

Hi,
I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Best Answers

Answers

Sign In or Register to comment.