

something fishy? VCF depth and BAM depth don't match

prasundutta87 · Edinburgh Member
edited October 2018 in Ask the GATK team

Hi,

After performing multisample variant calling with GATK 4.0.4.0, I extracted one of the samples and wanted to visualize a heterozygous site in IGV. As the attached image shows, the VCF reports a depth (DP) of 1 and a GQ of 99, but the BAM file has many reads covering that position. I used the original BAM file that went into the variant calling. Could a rearrangement at that site have reduced the depth to 1? Should I have used the reassembled BAM instead for visualization? And if the depth is 1 in the VCF file, how can the GQ be 99?

Best Answer

Answers

  • prasundutta87 · Edinburgh Member

    This is the zoomed out image of the site:

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie
    edited October 2018

    Hi @prasundutta87,

    Please post the command used, BAMOUT and the VCF record for further interpretation. Thanks.

  • prasundutta87 · Edinburgh Member
    edited October 2018

    Hi @shlee ,

    I don't have any bamout files generated. I had many sample BAMs at 30x depth with huge file sizes, so I did not generate bamout files; it is also advised in the forum not to generate them routinely. The VCF record is here:

    11 56715403 . A G 11126 PASS AC=1;AF=0.5;AN=2;ANN=G|intron_variant|MODIFIER|LOC112587845|gene18548|transcript|rna39467|pseudogene|1/2|n.95-8287A>G||||||;BaseQRankSum=-0.319;ClippingRankSum=0;DP=1;ExcessHet=0.2813;FS=3.467;InbreedingCoeff=0.1486;MQ=51.44;MQRankSum=0;QD=30.96;ReadPosRankSum=0.299;SOR=0.595 GT:AD:DP:GQ:PL 0/1:1,0:1:99:365,0,134

  • EADG ✭✭✭ · Kiel Member

    Hi @prasundutta87,

    Use the -L argument to limit the region for the bamout. It is advisable to use bamout only for troubleshooting and only on a small region, because HaplotypeCaller becomes very slow when it is activated.

    Create a bamout for your problem region by using -L 11:56715203-56715603; it is always recommended to pad your region of interest with some extra bases.

    Then take a look at the bamout and your variant. Maybe you can post a screenshot of the variant in the bamout here.

    Greetings,
    EADG

  • prasundutta87 · Edinburgh Member
    edited January 18

    Hi,

    Sorry for getting back after so long; I got busy with another project. As instructed, I called variants only for the small region using this command:

    java -jar gatk-package-4.0.4.0-local.jar HaplotypeCaller -R water_buffalo_re_arranged_chrom_ref_genome.fa -I Female_Grantown_buffalo_30x_sorted_markduped_readgroup.bam -L 11:56715203-11:56715603 -O output.vcf.gz -bamout bamout.bam

    I got 8 variants in the interval region and the variant of interest was there. Here is the locus from the VCF file:

    11 56715403 . A G 370.73 . AC=1;AF=0.5;AN=2;BaseQRankSum=0;ClippingRankSum=0;DP=21;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=58.73;MQRankSum=0;QD=26.48;ReadPosRankSum=0.674;SOR=0.693 GT:AD:DP:GQ:PL 0/1:4,10:14:99:408,0,131

    I am posting IGV snapshot of the bamout file generated..

    Here is the image with metadata:

    So there seems to be a rearrangement caused by a 40 bp indel, and the actual DP is 14 instead of 1. I am aware that in this case I performed a single-sample, non-GVCF variant calling; I believe that is why there is still some discrepancy from the original multisample variant calling I performed in the first place. How should I interpret this?

    I would be grateful if some more light can be thrown on this if possible. I can provide more details if needed.

  • prasundutta87 · Edinburgh Member

    Sorry, the actual HaplotypeCaller command used was:

    java -jar /exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/gatk-4.0.4.0/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar HaplotypeCaller -R /exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/water_buffalo_PacBio_genome/water_buffalo_re_arranged_chrom_ref_genome.fa -I Female_Grantown_buffalo_30x_sorted_markduped_readgroup.bam -L 11:56715203-56715603 -O output.vcf.gz -bamout bamout.bam

  • prasundutta87 · Edinburgh Member
    edited January 21

    I just redid the same multisample variant calling with the same commands I used before, all with GATK 4.0.4.0: HaplotypeCaller in GVCF mode > GenomicsDBImport > GenotypeGVCFs.

    Commands used:

    java -jar gatk-package-4.0.4.0-local.jar HaplotypeCaller -R water_buffalo_re_arranged_chrom_ref_genome.fa --native-pair-hmm-threads 4 -I "$animal"_sorted_markduped_readgroup.bam -L 11:56715203-56715603 -ERC GVCF -O "$animal".g.vcf.gz -bamout "$animal"_bamout.bam

    java -Xmx90G -jar gatk-package-4.0.4.0-local.jar GenomicsDBImport -R water_buffalo_re_arranged_chrom_ref_genome.fa --TMP_DIR ./tmp --sample-name-map sample_names_map.txt --reader-threads 2 --genomicsdb-workspace-path 11 -L 11:56715203-56715603

    java -Xmx8G -XX:ConcGCThreads=1 -jar gatk-package-4.0.4.0-local.jar GenotypeGVCFs -R water_buffalo_PacBio_genome/water_buffalo_re_arranged_chrom_ref_genome.fa -new-qual -V gendb://11 -O 11_variants.vcf.gz

    When I checked the GVCF of the sample, this is what I found:

    11 56715403 . A G,ATCCACTCTCCCCTAAACTCCTCTCCACACATATACACGTG, 327.73 . DP=20;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.5,0,0;RAW_MQ=68841 GT:AD:DP:GQ:PL:SB 0/1:1,0,0,0:1:23:365,0,134,23,38,370,229,128,251,456:0,1,0,0

    And the final VCF is this for that locus looks like this:

    11 56715403 . A G 11615.3 . AC=63;AF=0.444;AN=142;BaseQRankSum=-4.310e-01;ClippingRankSum=0.00;DP=1720;ExcessHet=0.2283;FS=6.140;InbreedingCoeff=0.1469;MLEAC=77;MLEAF=0.542;MQ=52.67;MQRankSum=0.00;QD=25.36;ReadPosRankSum=0.299;SOR=0.612 GT:AD:DP:GQ:PGT:PID:PL 0/1:1,0:1:99:.:.:365,0,134

    The IGV view of the bamout file is pasted below

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie

    Hi @prasundutta87,

    @AdelaideR gave me a heads-up on your followup. To summarize your observations with v4.0.4.0:

    • Joint calling with HaplotypeCaller GVCF mode + GenomicsDBImport + GenotypeGVCFs gives the odd DP=1 result that the reads do not reflect.
    • HaplotypeCaller single-sample BAMOUT mode corrects this oddity and gives you DP=14 for the locus.
    • Both approaches give an A->G heterozygous SNP for the sample of interest with high QUAL and also high GQ (99).
    • The intervals and samples seen by either approach differ.

    I think the discussion in https://github.com/broadinstitute/gatk/issues/3697 is partly relevant. I suspect the difference in depth you observe relates to the two different calling approaches and to the use of different intervals, e.g. no intervals vs. a single-locus interval -L 11:56715203-56715603. HaplotypeCaller has undergone a number of improvements since v4.0.4.0, and since any fixes would be made against the latest release, would it be possible for you to confirm you can recapitulate these results with the latest release, v4.0.12.0? If so, could you check whether anything surrounding the variant, or any other sample included, could easily explain the difference? If nothing surfaces easily, would it be possible for you to share a snippet of your data so we can recapitulate the results seen in both approaches? We really appreciate it. Instructions for submitting data are at https://software.broadinstitute.org/gatk/guide/article?id=1894.

  • prasundutta87 · Edinburgh Member

    Hi @shlee,

    Thanks for the reply. Does the GVCF line I pasted above tell you anything?

    11 56715403 . A G,ATCCACTCTCCCCTAAACTCCTCTCCACACATATACACGTG, 327.73 . DP=20;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.5,0,0;RAW_MQ=68841 GT:AD:DP:GQ:PL:SB 0/1:1,0,0,0:1:23:365,0,134,23,38,370,229,128,251,456:0,1,0,0

    My whole analysis of one project is based on variants based on gatk v4.0.4.0. It is not possible for me to change to the latest version and re-do the whole analysis once again. I can test this particular case for now.

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie
    edited January 29

    Hi @prasundutta87,

    Apologies for the delay in replying. I've been dealing with a minor car accident and am finally back online.

    Here's the record you want me to examine:


    It appears the record has somehow dropped the third ALT allele. We know there should be three ALTs because of the hanging comma in the ALT column, the three MLEAC/MLEAF values, the four AD values, and the ten PL values. For example, if the site truly had only two ALTs, the record would have six PLs.
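    The allele arithmetic behind this can be sanity-checked directly: for a diploid call over n total alleles (REF plus ALTs, including <NON_REF> in a GVCF), the number of unordered genotypes, and hence PL entries, is n(n+1)/2, while AD carries one value per allele. A minimal sketch (the helper name `pl_count` is just for illustration):

    ```shell
    # Diploid PL count for n total alleles: one PL per unordered allele pair,
    # i.e. n*(n+1)/2.
    pl_count() {
      n=$1
      echo $(( n * (n + 1) / 2 ))
    }

    pl_count 3   # REF + 2 ALTs  -> 6 PLs
    pl_count 4   # REF + 3 ALTs  -> 10 PLs, matching the record above
    ```

    So ten PLs and four AD values both point to four alleles at the site, i.e. three ALTs.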

    Would you mind checking this record in your VCF again via a different view method to confirm this is what you see? I ask because I've had similar instances of dropped information when copy-pasting formatted columnar data from the command line, particularly when a column's data is rather long.

    The closest possible bug from the GATK Bug Tracker appears to be https://software.broadinstitute.org/gatk/documentation/article?issue=5160&repo=gatk. Does this sound like a plausible scenario for your data?

    As I stated above, in the case this is a bug in the tool, GATK policy is for researchers to recapitulate the results with the latest version of GATK, currently 4.0.12.0. I know how important it is to stick to one version of a tool for research purposes; the team just needs confirmation that the bug is still present in the latest code, since there is some possibility that a code change gives a different outcome. Thank you.

  • prasundutta87 · Edinburgh Member
    edited January 30

    Hi @shlee,

    You were right. This is the complete line from the gvcf file:

    11  56715403    .   A   G,ATCCACTCTCCCCTAAACTCCTCTCCACACATATACACGTG,<NON_REF>   327.73  .   DP=20;ExcessHet=3.0103;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;RAW_MQ=68841.00    GT:AD:DP:GQ:PL:SB   0/1:1,0,0,0:1:23:365,0,134,23,38,370,229,128,251,456:0,1,0,0
    

    Hope this helps.

  • prasundutta87 · Edinburgh Member

    Hi @shlee ,

    This was indeed very helpful. Thanks a lot. I will colour the reads in the bamout file as suggested and see how it goes.

    Actually, I have two projects. One needs joint calling; for the other, I need variants from 4 samples separately. Since I had already performed the joint calling and obtained the variants required for the analysis, I just extracted the 4 samples from the multisample VCF file and am going ahead with that, and that's when I came across this issue. I am hoping it's the best way to go, as calling variants again on the 4 samples separately would be a time-consuming process.

  • prasundutta87 · Edinburgh Member

    Hi,

    So this is how the IGV visualisation looks indicating informative/uninformative reads and coloured by HC tag:

    After adding the sample-specific GVCF, this is how it looks:

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie

    Hi @prasundutta87,

    The HC tag coloring results are consistent with the VCF metrics. One thing I notice is that you have roughly ten artificial haplotypes, which implies there are additional read-supported variants outside the 41-bp zoomed-in view you show. HaplotypeCaller offers options to reduce the number of artificial haplotypes under consideration, including --max-mnp-distance, --adaptive-pruning, --kmer-size and --min-pruning. The first two are newly available in GATK 4.1. Please refer to the tool documentation and search the forum for details on these parameters.
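    As a hedged sketch of how the pruning-related arguments might be supplied (the values below are illustrative, not recommendations; --adaptive-pruning and --max-mnp-distance require GATK 4.1+ and are omitted from this 4.0.x-style command):

    ```shell
    # Illustrative only: tighten assembly to reduce artificial haplotypes.
    # --min-pruning raises the read support needed to keep an assembly-graph path;
    # --kmer-size (repeatable) sets the assembly k-mer sizes to try.
    gatk HaplotypeCaller \
      -R water_buffalo_re_arranged_chrom_ref_genome.fa \
      -I Female_Grantown_buffalo_30x_sorted_markduped_readgroup.bam \
      -L 11:56715203-56715603 \
      --min-pruning 3 \
      --kmer-size 25 --kmer-size 35 \
      -O pruned.vcf.gz -bamout pruned_bamout.bam
    ```

    Fewer surviving haplotypes generally means more reads align unambiguously to one of them, so more reads count as informative.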

  • prasundutta87 · Edinburgh Member
    edited February 6

    Hi @shlee,

    Thanks for your reply. I hope you are recovering well from the accident. I just wanted to know: how will reducing the number of artificial haplotypes under consideration be helpful? How can I tell whether it affects my analysis? Can you point me to a suitable document? I found this: https://gatkforums.broadinstitute.org/gatk/discussion/4146/hc-step-2-local-re-assembly-and-haplotype-determination

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie
    edited February 6

    Hi @prasundutta87,

    Informative reads are those that unambiguously align better to a particular haplotype. If a read can align well to a number of artificial haplotypes, and it is ambiguous which it supports, then it becomes uninformative. My understanding is if you reduce the number of artificial haplotypes, then there is a better chance read support for one of the haplotypes over others will be unambiguous. Of course, this may not be the case for your locus in the end but it is something you could try if you need to better resolve the support for a variant call.

    Thanks for asking about my wellbeing. I really appreciate that. I am doing well.

  • prasundutta87 · Edinburgh Member
    edited February 10

    Hi @shlee ,

    Thanks for the reply. Based on the above discussion, when I separate my samples from a multisample VCF file, should I again perform variant filtration based on INFO/DP or INFO/QD? Part of my research requires cohort-based calling (population genetics work), whereas another part requires single-sample VCF files (allele-specific expression). I am aware that after the multisample variant call the caller is confident of the genotype, but after separation the call starts looking ambiguous, which in theory it should not.

  • SkyWarrior ✭✭✭ · Turkey Member

    One thing to make sure you get correct genotypes after splitting: perform allelic splitting (decomposing multiallelic sites into separate records) before extracting individual samples from a multisample VCF. Sites containing multiple alleles can become very convoluted, and when you extract an individual sample the exact information may unfortunately get lost in translation. You can perform allelic splitting with vt or bcftools norm. In fact, I would also recommend this before doing any sort of variant recalibration after joint calling.
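    As a concrete sketch of that ordering (file and sample names are placeholders; bcftools options as of roughly version 1.9):

    ```shell
    # Split multiallelic records into biallelic ones first (-m-); doing the
    # sample extraction before this step can garble per-allele fields.
    bcftools norm -m- -f reference.fa multisample.vcf.gz -Oz -o split.vcf.gz
    bcftools index -t split.vcf.gz

    # Then subset one sample; -a trims ALT alleles not seen in the kept genotypes.
    bcftools view -s SampleA -a -Oz -o SampleA.vcf.gz split.vcf.gz
    ```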

  • prasundutta87 · Edinburgh Member

    Hi @SkyWarrior,

    I have already got rid of multiallelic sites, as I am only focusing on high-quality biallelic sites for both my projects.

  • shlee ✭✭✭✭✭ · Cambridge Member, Broadie

    Hi @prasundutta87,

    Filtering using cohort-level annotations can empower variant discovery. My understanding is that the INFO field is meant to reflect cohort-level metrics, while the FORMAT annotations are used to scrutinize single samples, e.g. at the genotype level. Currently, GATK tools are mostly focused on enabling cohort-level, i.e. INFO-level, filtering, although this is changing, e.g. with the ability to filter on genotype. I assume this is why folks end up subsetting their callset to a single sample, and SelectVariants adjusts the INFO-level metrics to reflect the subset sample(s). It is possible to retain certain original cohort-level annotations even after subsetting; check out SelectVariants' --keep-original-ac and --keep-original-dp. This produces two sets of annotations: original and adjusted.
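    A minimal sketch of that subsetting (file and sample names are placeholders; the retained cohort values are stored under suffixed annotation names, e.g. AC_Orig):

    ```shell
    # Subset one sample from the joint callset while keeping the original
    # cohort-level AC/AF/AN and DP alongside the recomputed values.
    gatk SelectVariants \
      -R reference.fa \
      -V cohort.vcf.gz \
      --sample-name SampleA \
      --keep-original-ac \
      --keep-original-dp \
      -O SampleA.subset.vcf.gz
    ```

    That way a downstream single-sample analysis can still filter on the cohort-level metrics if needed.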

    when I separate my samples from a multisample VCF file, should I again perform a variant filtration based on INFO/DP or INFO/QD? A part of my research requires cohort based calling (population genetics work), whereas another one requires single samples VCF files (allele-specific expression).

    I think the key question to ask is whether your filtering approaches are retaining the variants of interest. Do you have rare variants or hard-to-call variants that can serve as positive controls? Based on:

    ... I am only focussing on high quality bialleic sites for both my projects.

    It seems this might be about tuning and settling on appropriate filtering thresholds for your data. One question I have for you is whether there may be GATK functionality that would be helpful towards your two-pronged analysis. No promises here but we are keen to make sure we can meet new research needs.
