Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Effect of cohort in HaplotypeCaller

blueskypyblueskypy Posts: 194Member
edited October 2013 in Ask the GATK team

I'm trying to measure the effect of the composition of a cohort on the calling of individual sample. So my cohort includes 46 Caucasian and 1 Japanese (J1) exome seq samples from 1KG. I want to check if the Japanese specific variants will be buried in such a cohort.

I'm only using chr22 and the average depth of coverage on all samples are around 10X. The input files to HC are the reduced bam files. Here are the results and my questions:

1). the command:

java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -stand_call_conf 50.0 \
 -stand_emit_conf 10.0 \
 -o $cohort.raw.var.vcf \
-I $cohort.list

2). While the calls to the Caucasians seem normal, all calls to J1 are "./."

3). Then I run HC with J1 alone, the resulting vcf file only contains headers, no content; i.e., the last line in that file is the following:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT jpt.ERR034603

The HC finished w/o reporting any error. Here is a portion from the output of HC:

INFO  15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%)
INFO  15:13:25,888 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  15:13:25,889 MicroScheduler -   -> 136025 reads (5.06% of total) failing HCMappingQualityFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

4). I run HC with one Caucasian alone, the vcf file looks normal.

5). Then I run HC using a cohort including J1 and four other Japanese samples, the calling to J1 seems normal.

Could anyone explain 2. 3, and 4? Should I increase the percentage of Japanese in the Caucasian cohort in order to get calling on J1? Thanks a lot!

Post edited by blueskypy on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hmm, the only reason HC would emit a "./." call would be if the site was completely uncovered after the reassembly step. Is this happening throughout chr22? Have you tried generating the outputBAM (see argument with that name in the HC tech doc) for the different case figures? Be sure to choose a small region to test this on, not the entire chromosome.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    hi, Geraldine, Thanks again for the quick response! I added more info on the point 3 of my post. Does the number with HCMappingQualityFilter and UnmappedReadFilter suggest that the site is covered. I'll also check teh outputBAM file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    The filtering summary suggests that most of the reads are ok (not filtered out), so that's a good sign. It doesn't tell us how well covered the sites are though; the outputBAM will be more helpful for that.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    Here is the bamout file from running HC with J1 alone.

    -bash-4.1$ ls -al jpt.ERR034603.hc*
    -rw-r--r-- 1 b01 root  24 Oct 28 16:57 jpt.ERR034603.hc.out.bai
    -rw-r--r-- 1 b01 root 192 Oct 28 16:57 jpt.ERR034603.hc.out.bam
    

    Here is the output from DepthOfCoverage of the original reduced bam file as -I for HC:

    -bash-4.1$ more coverage/jpt.ERR034603.sample_summary
    sample_id       total   mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
    jpt.ERR034603   387951939       11.12   9       3       2       17.2
    Total   387951939       11.12   N/A     N/A     N/A
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    A screenshot showing the original bam file and the bamOut file open in IGV, centered on a site of interest and if possible also showing the vcf records for that site, would be more helpful to see what's going on.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    Well, I couldn't figure out what's wrong with J1. But running using another Japanese sample does not have any problem , either within a Caucasian cohort or alone. So maybe the problem with J1 is just an individual case.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Yeah, it sounds like your J1 sample is problematic. I would encourage you to find out what's wrong with it. Perhaps the read groups are not distinct from another sample in the Caucasian cohort? That is a classic issue.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    Does that say that 17% of the bases have a base quality above Q15? If so, that might explain it

  • blueskypyblueskypy Posts: 194Member

    All of the %_bases_above_15 are around 13 for the 51 exome seq samples from 1KG. Is it because they are reduced bam files?> @pdexheimer said:

    Does that say that 17% of the bases have a base quality above Q15? If so, that might explain it

  • blueskypyblueskypy Posts: 194Member

    I'll check it in IGV. But notice that the jpt.ERR034603.hc.out.bam is only 192 Bytes!

    @Geraldine_VdAuwera said: A screenshot showing the original bam file and the bamOut file open in IGV, centered on a site of interest and if possible also showing the vcf records for that site, would be more helpful to see what's going on.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    How did you generate the jpt.ERR034603.hc.out.bam? Make sure you pad the region around the site you are looking at, with at least 200 bp on either side (maybe even more). Otherwise HC will not see enough data to do anything meaningful.

    All of the %_bases_above_15 are around 13 for the 51 exome seq samples from 1KG. Is it because they are reduced bam files?

    No, it shouldn't be caused by the reducing process. Are you looking at a specific region only or the full genomes?

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    hi, Geraldine,

    I run fastqc on the bam files at each step of bwa, sortSam, MarkDuplicates, realign, BQSR, ReduceReads. After realign, the quality score looks normal - the medians at positions 1-100 are all around 37; But after BQSR, which finished w/o any complaints, all medians lowered to around 18. See the attachment.

    Here is my BQSR command:

    java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R $refGenome \
    -I $sampleID.RealnDeDup.bam \
    -knownSites $dbSNP -knownSites $snp_g1k -knownSites $indel_g1kP1 -knownSites $indel_MillsG1k \
    -o $sampleID.recal.grp
    

    Here the refGenome is chr22 which is also used with bwa at step 1, so the bam file is also the alignment to chr22. Could you help me to understand why BQSR lowered all quality scores?

    per_base_quality.png
    800 x 600 - 9K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Did you generate the BQSR plots? Is there any indication which covariates may be associated with the quality issue?

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member
    edited November 2013

    hi, Geraldine, This BQSR lowering the whole quality score does not happen when I use the whole genome as ref. So it must be due to me only using chr22 as ref. I was going to draw the BQSR plot, but then realized only using chr22 as ref is wrong, as you pointed out, because reads of other chrs might be mapped to chr22.

    However, even if I use the whole genome as ref in mapping and use -L chr22 in the following steps may not be good either for the BQSR and VQSR. So I think the safest way is to run every step using the whole genome and only extract chr22 from the final vcf file.

    Post edited by blueskypy on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Ah, interesting. Yes, you will get better results doing the pre-processing on the whole genome, and both BQSR and VQSR definitely give better results if they see more data. So running on whole genome rather than a segment is better overall, as a rule. The main exception to that rule is if you're working with exome data; then for most steps it's better to use the exome targets list.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.