Single sample vs Multiple samples Haplotype Caller

Waterdrake Manchester Member

Salutations,
Let me start by saying I'm pretty new to bioinformatics, so I may not explain things adequately; please bear with me.
I have a callset of 850 whole exomes that was produced by a lab. They used HaplotypeCaller in gVCF mode with the whole workflow (BQSR without the -L flag, i.e. without a BED file, and so on). My job now is to run HaplotypeCaller on a single-sample basis and compare the results. The illness I'm looking at is probably caused by a single rare SNP or indel, and I have heard that multi-sample calling tends to generalize the results based on the samples' genotypes (missing rare variants). So I have the following questions:
1. Can I run HaplotypeCaller in gVCF mode (probably not, because it's made for the cohort workflow)? So I have to run HaplotypeCaller with "--genotyping_mode DISCOVERY", right? (I have read that there aren't any huge differences in calls between the modes, only in the borderline calls.)
Something like this, probably:
java -jar GenomeAnalysisTK.jar
-T HaplotypeCaller
-R reference.fa
-I preprocessed_reads.bam
-L (do I need to use the -L flag, given that they didn't?)
--genotyping_mode DISCOVERY
-variant_index_type LINEAR
-variant_index_parameter 128000
-stand_emit_conf 10
-stand_call_conf 30
-o raw_variants.vcf
I have two other questions.
As I said, they didn't use the -L flag in their BQSR or when running HaplotypeCaller. Would it be wrong for me to use it in my BQSR and HaplotypeCaller runs? My concern is that I would restrict the sites and so lose rare variants. Then again, if a call is outside the sites targeted by the kit, it's highly likely to be a false positive, right?
My other question is about what comes after HaplotypeCaller: VQSR versus hard filtering. Can I use VQSR even though it's a single sample? Yes, I have read:
" Important notes for exome capture experiments
In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs."
But won't that be the same as what the lab did before me? Adding more samples from the 1000 Genomes Project would generalize the calls in the same way, just with a different set. I just want a unified way of filtering them (by VQSLOD), so I don't want to use hard filtering.
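To make sure I understand the quoted note, here is how I picture the reference model workflow, written as a dry-run sketch that only prints the commands rather than running them. The file names (reference.fa, my_sample, public_sample_1, joint_raw_variants.vcf) are placeholders I made up, the helper names gvcf_cmd and joint_cmd are my own, and I am assuming the same GATK 3.x command style as in my command above, so please correct me if any of this is wrong for the version in use.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints GATK 3.x-style commands instead of executing them.
# All file names below are placeholders (assumptions), not real paths.

GATK="java -jar GenomeAnalysisTK.jar"
REF="reference.fa"

# Step 1: build the per-sample HaplotypeCaller command in GVCF mode.
# The index options mirror the ones in my command above.
gvcf_cmd() {
  local sample="$1"
  echo "$GATK -T HaplotypeCaller -R $REF -I ${sample}.bam" \
       "--emitRefConfidence GVCF" \
       "-variant_index_type LINEAR -variant_index_parameter 128000" \
       "-o ${sample}.g.vcf"
}

# Step 2: build the joint genotyping command over all per-sample GVCFs.
joint_cmd() {
  local cmd="$GATK -T GenotypeGVCFs -R $REF"
  local sample
  for sample in "$@"; do
    cmd="$cmd --variant ${sample}.g.vcf"
  done
  echo "$cmd -o joint_raw_variants.vcf"
}

# Print the commands for one of my samples plus one public sample.
gvcf_cmd my_sample
gvcf_cmd public_sample_1
joint_cmd my_sample public_sample_1
```

If that is right, then my single sample would still be genotyped together with the added samples in step 2, which is exactly the part I am unsure about.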
Sorry for the long post and for the language; as you can probably guess, English is not my first language.
Bye, have a nice day/night.
