The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Effect of cohort in HaplotypeCaller

Posts: 228Member
edited October 2013

I'm trying to measure the effect of the composition of a cohort on the calling of individual sample. So my cohort includes 46 Caucasian and 1 Japanese (J1) exome seq samples from 1KG. I want to check if the Japanese specific variants will be buried in such a cohort.

I'm only using chr22 and the average depth of coverage on all samples are around 10X. The input files to HC are the reduced bam files. Here are the results and my questions:

1). the command:

java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \ -R$refGenome \
--dbsnp $dbSNP \ -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -o$cohort.raw.var.vcf \
-I $cohort.list  2). While the calls to the Caucasians seem normal, all calls to J1 are "./." 3). Then I run HC with J1 alone, the resulting vcf file only contains headers, no content; i.e., the last line in that file is the following: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT jpt.ERR034603 The HC finished w/o reporting any error. Here is a portion from the output of HC: INFO 15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%) INFO 15:13:25,888 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 15:13:25,889 MicroScheduler - -> 136025 reads (5.06% of total) failing HCMappingQualityFilter INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter  4). I run HC with one Caucasian alone, the vcf file looks normal. 5). Then I run HC using a cohort including J1 and four other Japanese samples, the calling to J1 seems normal. Could anyone explain 2. 3, and 4? Should I increase the percentage of Japanese in the Caucasian cohort in order to get calling on J1? Thanks a lot! Post edited by blueskypy on Tagged: ## Answers • Posts: 6,962Administrator, GATK Developer admin Hmm, the only reason HC would emit a "./." call would be if the site was completely uncovered after the reassembly step. Is this happening throughout chr22? Have you tried generating the outputBAM (see argument with that name in the HC tech doc) for the different case figures? Be sure to choose a small region to test this on, not the entire chromosome. Geraldine Van der Auwera, PhD • Posts: 228Member hi, Geraldine, Thanks again for the quick response! I added more info on the point 3 of my post. Does the number with HCMappingQualityFilter and UnmappedReadFilter suggest that the site is covered. I'll also check teh outputBAM file. • Posts: 6,962Administrator, GATK Developer admin The filtering summary suggests that most of the reads are ok (not filtered out), so that's a good sign. It doesn't tell us how well covered the sites are though; the outputBAM will be more helpful for that. Geraldine Van der Auwera, PhD • Posts: 228Member Here is the bamout file from running HC with J1 alone. -bash-4.1$ ls -al jpt.ERR034603.hc*
-rw-r--r-- 1 b01 root  24 Oct 28 16:57 jpt.ERR034603.hc.out.bai
-rw-r--r-- 1 b01 root 192 Oct 28 16:57 jpt.ERR034603.hc.out.bam


Here is the output from DepthOfCoverage of the original reduced bam file as -I for HC:

-bash-4.1more coverage/jpt.ERR034603.sample_summary sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15 jpt.ERR034603 387951939 11.12 9 3 2 17.2 Total 387951939 11.12 N/A N/A N/A  • Posts: 6,962Administrator, GATK Developer admin A screenshot showing the original bam file and the bamOut file open in IGV, centered on a site of interest and if possible also showing the vcf records for that site, would be more helpful to see what's going on. Geraldine Van der Auwera, PhD • Posts: 228Member Well, I couldn't figure out what's wrong with J1. But running using another Japanese sample does not have any problem , either within a Caucasian cohort or alone. So maybe the problem with J1 is just an individual case. • Posts: 6,962Administrator, GATK Developer admin Yeah, it sounds like your J1 sample is problematic. I would encourage you to find out what's wrong with it. Perhaps the read groups are not distinct from another sample in the Caucasian cohort? That is a classic issue. Geraldine Van der Auwera, PhD • Posts: 409Member, GSA Collaborator ✭✭✭✭ Does that say that 17% of the bases have a base quality above Q15? If so, that might explain it • Posts: 228Member All of the %_bases_above_15 are around 13 for the 51 exome seq samples from 1KG. Is it because they are reduced bam files?> @pdexheimer said: Does that say that 17% of the bases have a base quality above Q15? If so, that might explain it • Posts: 228Member I'll check it in IGV. But notice that the jpt.ERR034603.hc.out.bam is only 192 Bytes! @Geraldine_VdAuwera said: A screenshot showing the original bam file and the bamOut file open in IGV, centered on a site of interest and if possible also showing the vcf records for that site, would be more helpful to see what's going on. • Posts: 6,962Administrator, GATK Developer admin How did you generate the jpt.ERR034603.hc.out.bam? Make sure you pad the region around the site you are looking at, with at least 200 bp on either side (maybe even more). Otherwise HC will not see enough data to do anything meaningful. All of the %_bases_above_15 are around 13 for the 51 exome seq samples from 1KG. Is it because they are reduced bam files? No, it shouldn't be caused by the reducing process. Are you looking at a specific region only or the full genomes? Geraldine Van der Auwera, PhD • Posts: 228Member hi, Geraldine, I run fastqc on the bam files at each step of bwa, sortSam, MarkDuplicates, realign, BQSR, ReduceReads. After realign, the quality score looks normal - the medians at positions 1-100 are all around 37; But after BQSR, which finished w/o any complaints, all medians lowered to around 18. See the attachment. Here is my BQSR command: java -Xmx4g -jargatkDir/GenomeAnalysisTK.jar -T BaseRecalibrator \
-R $refGenome \ -I$sampleID.RealnDeDup.bam \
-knownSites $dbSNP -knownSites$snp_g1k -knownSites $indel_g1kP1 -knownSites$indel_MillsG1k \
-o \$sampleID.recal.grp


Here the refGenome is chr22 which is also used with bwa at step 1, so the bam file is also the alignment to chr22. Could you help me to understand why BQSR lowered all quality scores?

Did you generate the BQSR plots? Is there any indication which covariates may be associated with the quality issue?

Geraldine Van der Auwera, PhD

• Posts: 228Member
edited November 2013

hi, Geraldine,
This BQSR lowering the whole quality score does not happen when I use the whole genome as ref. So it must be due to me only using chr22 as ref. I was going to draw the BQSR plot, but then realized only using chr22 as ref is wrong, as you pointed out, because reads of other chrs might be mapped to chr22.

However, even if I use the whole genome as ref in mapping and use -L chr22 in the following steps may not be good either for the BQSR and VQSR. So I think the safest way is to run every step using the whole genome and only extract chr22 from the final vcf file.

Post edited by blueskypy on