Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Inconsistency among the depth in the vcf file and in the original bam file and HC -bamout bam file.

Inconsistency among the depth in the vcf file and in the original bam file and HC -bamout bam file.

I got a vcf an SNP with DP value 31 in the vcf file

In the orginal bam file the total counts at that loci is 434

In the HC -bamout bam file the total counts is 123.

HC reassemble the reads making some haplotypes locally , and realign the reads to the new haplotypes in order to get the 2 haplotypes with the highest possiblty, as far as I understand, which could cause some inconsistency between the depth of the locus in the vcf file and the original bam file.

I don't understand the difference between the vcf file and -bamout bam file

Another qusetion, some reads in the -bamout bam file are in vague colors in IGV, what do they mean?

Thank you so much.

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hello,

    Although you started with 434 reads at the locus, please note that Haplotype Caller downsamples to 250 reads. It is not 250 reads per locus, but 250 reads at the start of the active region. So, there can be some variation per locus. Please read about active regions here: https://www.broadinstitute.org/gatk/guide/article?id=4147

    The number of reads that span from the start of the active region to your position of interest must be less than 123 (see below for explanation of vague colored reads). You see those reads in the bamout file. The DP of 31 is the number of reads that passed Haplotype Caller's filtering requirements. Please read about the filters Haplotype Caller applies here: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

    The vague colored reads in IGV are the artificial haplotypes, which are the candidate haplotypes that are found after local reassembly. Please read more about local reassembly here: https://www.broadinstitute.org/gatk/guide/article?id=4146
    So, you can subtract the number of artificial haplotypes from 123 to get the actual number of reads at the position before filtering.

    I hope this explains everything.

    -Sheila

  • sibsccsibscc sibsMember
    edited November 2014

    image
    Hi, @Sheila

    Here is the locus I mentioned above. The upper track is from -bamout the lower track is from the orginal bam file.

    The depth of upper track is 123, the depth of the lower track is 434. the SNP called from the upper track is 31.

    So the vague strips are the haplotypes reassembled from the reads in the region.

    It seems like the reads in the original bam file are moved left and right trimed in order to make the haplotypes.

    I switched off downsampling by -dt NONE.

    So, you mean that HC will automaticly filter out some reads based on :

    • NotPrimaryAlignmentFilter
    • FailsVendorQualityCheckFilter
    • DuplicateReadFilter
    • UnmappedReadFilter
    • MappingQualityUnavailableFilter
    • HCMappingQualityFilter
    • MalformedReadFilter

    These are the reasons why the depth in the bamout file is smaller than the orginal bam file.

    123 is composed by the number of reads and the number of haplotypes. DP value of 31 is substracting the numbern of haplotypes from 123? Is that correct?

    image
    Move down a little, I can find another set of vague fragments, some of them are identical. Are they haplotypes? As far as I read from the documents the haplotypes should not be identical. Am I correct ?

    Thank you to be so patient.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    What happens when you switch off downsampling using -dt NONE? Do the reads stay the same from original bam file to bamout bam file?

    Yes, HaplotypeCaller automatically filters reads based on those filters you specified.

    The DP value of 31 is the number of reads that are not haplotypes or filtered out.

    Although I said the vague colored strands are artificial haplotypes, it looks like you have the reads colored by strand (forward or reverse), not by sample. So, you cannot tell which reads are artificial haplotypes in your case, but you can tell that the vague colored reads are low quality reads. There are a lot of white reads, which means they have been filtered out due to low mapping quality. Can you check the reads in white, and let me know what the mapping qualities are?

    -Sheila

  • sibsccsibscc sibsMember
    edited November 2014

    @Sheila‌

    image

    Hi, The white colored reads are reads with mapping quality of 0.

    After coloring the reads by read group, the light blue reads are the artificial haplotypes the pink reads are original reads with mapping quality > 0.

    According to the coverage track, some bases at the right of the original bam file seems to be clipped. and there is no bases supporting the left 10 bases of aritificial haplotype in the original bam file as seen in IGV.

    Since, HC automaticaly filter the duplicated reads. Is Picard marking duplicates still necessary?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    The reason the white reads are shorter in the bamout file is that there is a larger deletion there than seen in the original bam file. The deletion shortens the reads because it is designated as a black line.

    Are there absolutely no reads that support the artificial haplotype that seems to be 10 bases longer than the reads on the left? Are there no reads farther down the screen shot?

    Picard's MarkDuplicates is absolutely required because it gives a flag to the duplicate reads that HaplotypeCaller's filters use.

    -Sheila

  • sibsccsibscc sibsMember
    edited November 2014

    Hi, @Sheila‌

    After some checking.

    There is no deletion called in the vcf file at or near the region

    I scrolled down but did see any base extending to the 10 bases of the haplotype.
    Is that possible the reads are not shown in IGV? It seems like most of depth in the coverage track are more than which could be seen below in the bam file track.

    In fact, I didn't use Picard MarkDuplicates as recommended, since it is an amplicon seq. But when I did BWA mem mapping, I used option "-M" in order to compare the results with and without duplicates marking in the future. Is that possible the -M option makes HC recognize the duplicates?

    Thank you

  • sibsccsibscc sibsMember

    Hi, @Sheila‌
    After reassembly, many reads with mapping quality generated.

    Do the mapping qualities are recalculated from mapping the reads to the winner two haplotypes ?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    Can you please upload your files so I can have a look at what is going on? Directions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Can you also confirm that you are using GATK version 3.3?

    Thanks.

  • sibsccsibscc sibsMember

    Hi, @Sheila‌,

    Sorry for a miss statement. I missed "mapping quality of 0".

    After reassembly, many reads with mapping quality of 0 generated.

    Do the mapping qualities are recalculated from mapping the reads to the winner two haplotypes ?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    No, the mapping qualities are not recalculated from mapping to the artificial haplotypes.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    I did not mention they are recalculated when the reads are realigned to the reference after reassembly to produce the bamout.

    The most important thing to note here is that the alignment to the haplotype (via pairhmm) is what counts in the likelihood calculations, not any of the original MAPQs.

    -Sheila

  • sibsccsibscc sibsMember
    edited December 2014

    @Sheila said:
    sibscc‌

    Hi,

    Can you please upload your files so I can have a look at what is going on? Directions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

    I created a folder "sibscc", uploaded a .zip file including:

    original bam file, -bamout bam file, the corresponding .bai file, together with the command used .

    The locus mentioned is chr17:59940819

    Thank you

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    This looks like an issue with downsampling. HaplotypeCaller works fine when running without -dt NONE. We are aware of this issue, and I will let you know when it is fixed. However, it is not high on the priority list because the genotype is being output correctly.

    -Sheila

  • sibsccsibscc sibsMember

    @Sheila said:
    sibscc‌

    Hi,

    This looks like an issue with downsampling. HaplotypeCaller works fine when running without -dt NONE. We are aware of this issue, and I will let you know when it is fixed. However, it is not high on the priority list because the genotype is being output correctly.

    -Sheila

    Are the likely slight switch of the positions in IGV from the unvisualized reads, which are visualized in the -bamout bam file?

  • sibsccsibscc sibsMember
    edited December 2014

    Hi, @shaila‌ @Geraldine_VdAuwera‌

    Yes, that is what I mean. I am a little confused about the slight shift of the reads in the -bamout bam file comparing to the original bam file

    According to the answers, HC does some adjustments to the reads locally based on the active region identified.

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sibscc‌

    Hi,

    This will be fixed in a future version of GATK. Thanks for bringing it to our attention.

    -Sheila

Sign In or Register to comment.