HaplotypeCaller DP reports low values


sam24 (UK; Posts: 2; Member)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it reports DP and AD values roughly 10x lower than the read counts visible in the IGV browser and reported by UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq (150 bp paired-end). The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping qualities and the number of alignments in a 200 nt window around the example position above, and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, which have ~200x coverage and the same assay principle (just more amplicons), the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and IGV.

Is there an explanation for why I see a difference between HaplotypeCaller and UnifiedGenotyper with these kinds of data?

Many thanks in advance,

Sam

Answers

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hi Sam,

    I believe this is due to an issue with downsampling in HaplotypeCaller, which we are aware of but haven't been able to fix yet. In a nutshell, both callers normally downsample the data to a max coverage of 250 by default, but it seems that downsampling is currently not working for HC. So when you have ~200X coverage, there's no downsampling anywhere so you don't see any differences, but when you have much higher coverage, downsampling kicks in for UG but not for HC, hence the differences you observe.

    We're working to fix the bug; in the meantime you can either hard-downsample your data using PrintReads with -dcov 250 before running HC, or just ignore the differences. The results are correct in both cases. Though the HC runs are bound to take much longer without downsampling...
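
    For example, the hard-downsampling step could look something like this (a sketch reusing your paths from above; the output file name is a placeholder, so double-check the arguments against your GATK version):

    java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
      -T PrintReads \
      -R /gatk_bundle/ucsc.hg19.fasta \
      -I sample1.rg.bam \
      -dcov 250 \
      -o sample1.rg.dcov250.bam

    You would then run HaplotypeCaller on sample1.rg.dcov250.bam instead of the original BAM.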

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Ack, and now my colleague points out that you are using -dt NONE in both cases so downsampling can't be the issue. It was such a nice, tidy explanation... but ignore what I said earlier, sorry.

    Do you get the same discrepancies if you run single-threaded? If not, it might be the multi-threading that is malfunctioning.

    Geraldine Van der Auwera, PhD

  • sam24 (UK; Posts: 2; Member)

    Hi Geraldine,

    Thank you for your comments about downsampling and HC. It would have been a lovely explanation! But yes, I am switching off downsampling. I did try (amongst many iterations) setting -dcov 5000, and this did not change the reported DP values.

    It is good to know that the low numbers should not worry me. I can definitely report that all the variants in my data set are still being called, just with lower numbers.

    I have just re-run the same command using a single thread and unfortunately got the same result.

    Best,
    Sam

  • Kath (Posts: 43; Member)

    Hi,

    Has anyone had any further thoughts about this? I seem to be having the same issue. At first I assumed it was because IGV includes duplicate reads (I have marked, but not removed, duplicates in my BAM files), but I have set my preferences so that IGV should not display duplicates...
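
    In case it helps, one quick way to check how many reads actually carry the duplicate flag (a sketch, assuming samtools is available; the file name is a placeholder, and 1024 is the SAM duplicate-flag bit):

    samtools view -c -f 1024 sample.bam   # reads flagged as duplicates
    samtools view -c sample.bam           # all reads, for comparison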

    Thanks

    Kath

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hi @Kath,

    Unfortunately we haven't had time to look into this in detail, so we're not sure yet what's going on. Since it doesn't seem to have adverse consequences on variant discovery, we're not currently able to treat this as a priority issue. However if you have data that suggests it is a big problem, we would look into it more closely.

    Geraldine Van der Auwera, PhD

  • Maike (Frankfurt; Posts: 1; Member)

    Dear Geraldine,

    Is there any news regarding this?
    I think I'm having this issue too. I have amplicon data and used HaplotypeCaller for a sample containing one individual.
    I should get a depth of about DP=370, but I get values between 19 and 199. The data is not downsampled, and only one of the reads was filtered out (by the NotPrimaryAlignmentFilter).
    Unfortunately, I can't really say how much of a problem this is, because I don't know why it happens or what it does to the results, but I'm worried that most of my data might not be used for the computation, either in a random way or because of some parameter that might introduce a bias.
    Can you say anything about that? I would really appreciate your help.

    Thank you,
    Maike

  • ebanks (Broad Institute; Posts: 698; Member, Administrator, Broadie, Moderator, Dev admin)

    You can test whether reads are being used by the tool with the -bamOut option. You will see which reads align to which haplotypes, and which are uninformative to the calculation.
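
    A minimal sketch of such a run (file names are placeholders; adjust to your own inputs):

    java -jar GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R reference.fasta \
      -I sample.bam \
      -L targets.bed \
      -bamout sample.hc_reassembly.bam \
      -o sample.hc.vcf

    The reassembled reads and artificial haplotypes written to the -bamout file can then be inspected in IGV alongside the original BAM.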

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • Ryan (Posts: 7; Member)

    Hi Geraldine,

    We have a situation similar to Maike's. We are using a targeted capture method and expect coverage of about 1000x in our areas of interest. As above, the issue occurs with HaplotypeCaller even though no downsampling has been selected. We have tried the latest release, GATK 3.2-2, and the issue is still present.

    I can't make the case that we are missing calls because of this, but the fact that the read depth reported by HC is erroneous could have some impact on the results we report.

    If people are using GATK to analyze their amplicon or targeted capture panels, more and more datasets will have this issue.

    Thanks for looking into this!

    Ryan

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hi @Ryan,

    It occurs to me that we haven't yet discussed mapping quality in this thread as a potential culprit. The HC applies fairly stringent MQ-based filtering, automatically discarding any read with MQ lower than 20. Have you checked the MQ of the data in the affected regions? What does the filtering summary (the last bit of console output) say about the HCMappingQualityFilter counts?
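
    For instance, one quick way to tabulate the MAPQ values in a region, along the lines of Sam's awk command above (a sketch; requires an indexed BAM, and the file name and coordinates are placeholders):

    samtools view sample.bam chr17:41245366-41245566 \
      | awk '{count[$5]++} END {for (mq in count) print count[mq], mq}' \
      | sort -k1,1nr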

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 7; Member)

    Hi @Geraldine,

    The mapping quality seems to be pretty good in those areas. Here is the output of HC; in this case we had removed duplicate reads from the input:

    INFO 13:22:29,057 MicroScheduler - 220625 reads were filtered out during the traversal out of approximately 13691080 total reads (1.61%)
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 13:22:29,057 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:22:29,058 MicroScheduler - -> 220625 reads (1.61% of total) failing HCMappingQualityFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:22:29,058 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 13:22:29,059 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 13:22:30,138 GATKRunReport - Uploaded run statistics report to AWS S3

    Thanks,

    Ryan

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hmm, so much for that explanation. Have you checked with -bamOut how many reads are actually left in that region after the HC's reassembly process?

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 7; Member)

    We just checked with -bamOut, and only a small subset of reads (~200-350) is used after HC's assembly, out of the 1,000+ reads available in those areas. I am willing to share some data if that is helpful. Thanks for the help!

    Ryan

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    I see -- can you please check whether this still happens with the latest version (3.2-2)? If so, please upload a snippet of data to our FTP. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • Ryan (Posts: 7; Member)

    Hi Geraldine,

    I have uploaded a snippet of our data to your FTP server. The compressed file is named HC_low_DP_issue_with_high_coverage.tar.gz.
    The snippet was generated with the latest version of GATK, 3.2-2.
    Let me know if you need a larger segment of data, as I limited it to a small region of MTOR with a variant call.

    Thanks!

    Ryan

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Thanks Ryan. I ran some tests on your data and I think I can explain what's happening here.

    The first couple of steps in the HC process involve determining which regions may contain variants ("determine ActiveRegions") and reassembling the reads to determine the possible haplotypes.

    In your case, the HC initially identifies an active region with coordinates chr1:11205021-11205095 containing 3000 reads, then applies trimming and processing that leaves 1070 reads in the active region, which is what it proceeds with to the next step (assembly). The reason for that reduction is a hardcoded downsampling operation that cuts the overall content of the active region down to approximately 1000 reads in total (while respecting a minimum of 5 reads starting per position).

    I believe that is why your particular site of interest ends up with only a fraction of the expected depth. Currently this 1000-read total per active region cannot be overridden from the command line, but I'm going to put in a feature request to make it an adjustable parameter in future versions.

    Geraldine Van der Auwera, PhD

  • mmaj (Posts: 9; Member)

    Hi Geraldine (and others).
    We're experiencing similar problems with HaplotypeCaller (both v3.1-1 and v3.2-2). In fact, our problem is almost identical to the one Ryan reported a few posts back in this thread.
    We use amplicon sequencing and consequently have very high coverage (150 bp PE data). In specific cases, HC reports DP=30 in regions with approximately 1000x good-quality coverage. In contrast, UnifiedGenotyper calls the same variants and reports the "correct" DP of ~1000x. This shouldn't be a result of HCMappingQualityFilter, as only ~2% of reads are discarded by that filter.
    However, we do observe a rather low fraction of reads in these regions having the "proper pair" flag. Could this discrepancy between HC and UGT in DP reporting be due to some inherently different treatment of reads that do not have the proper-pair flag set? The discrepancy between UGT and HC persists for all the aligners we frequently use (BWA, MOSAIK, STAMPY).
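
    For reference, this is roughly how we check the proper-pair fraction (a sketch; the file name is a placeholder, and 0x2 is the SAM proper-pair bit):

    samtools view -c -f 2 sample.bam   # reads with the proper-pair flag set
    samtools view -c sample.bam        # all reads, for comparison
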
    Thanks in advance!

    Mads

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @mmaj

    Hi Mads,

    We are aware of this issue, but it is not a super-high-priority fix, as the output results are correct; it is just the DP that is being reported incorrectly. If the results were incorrect, it would be a higher priority.

    This seems like a quirk rather than a bug, but we will certainly post here when the issue is resolved.

    -Sheila

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @mmaj

    Hi again,

    A correction to my previous comment: the DP is not being reported incorrectly; rather, HaplotypeCaller is using fewer reads than it could/should.

    -Sheila

  • mmaj (Posts: 9; Member)

    Hi Sheila and Geraldine,
    While it may not seem like a bug, I must admit I have a hard time understanding how HC selects a tiny subset of (what appear to be) perfectly good reads in our amplicon sequencing setup. In a very recent sample, we have 600x+ coverage at a clearly heterozygous site, and UnifiedGenotyper correctly reports the DP. However, HC reports only 16 reads for this region, ultimately resulting in a very low quality score for that particular variant.
    You may argue that this is not important, but it makes it very difficult to apply hard filters to our raw callsets, since we now have to loosen our hard filters to account for the misleading DP / quality scores reported by HC.
    Do you know whether this issue is being or will be addressed anytime soon?

    Anyways, thanks for a really great toolkit!!

    -Mads

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @mmaj

    Hi Mads,

    Which version of GATK are you using? Can you post the commands you used? Have you checked the mapping qualities of the reads in the area? If they are low, that could be a potential reason they are getting filtered out.

    -Sheila

  • mmaj (Posts: 9; Member)

    Hi Sheila,
    I'm using GATK v3.2-2.
    command:
    -T HaplotypeCaller --dbsnp $dbSNP -I sample.bam -rf MappingQualityZero -pairHMM VECTOR_LOGLESS_CACHING -o sample.raw.HTC.vcf -R $fastaref -L $ROI -minPruning 2 -nct $threads -dt NONE

    From this run, we lose ~4% of reads to the read filters, primarily HCMappingQualityFilter.
    The mapping quality is generally (always) rather low for this particular setup, as we're sequencing degraded DNA from FFPE tissue. However, for the variant of interest, we get MQ=23.7 from the above HC run. Similarly, we get MQ=23.2 for the same variant using UnifiedGenotyper, but UGT correctly emits the DP (600+ reads) and AD for this variant.
    Consequently, the QUAL emitted by HC and UGT for this variant is ~150 and ~4700, respectively...

    While I'm aware that the mapping qualities are definitely lower for this type of setup, I assumed that reads passing the automatically applied read filters (e.g. HCMappingQualityFilter) would be included in the variant-calling/-reporting process. Although HC's de novo assembly step may to some degree affect the emitted DP and AD, I still don't understand how it could reduce the DP from 600+ to 16 :)

    -Mads

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hi @mmaj,

    I notice you are using -dt NONE. We have had reports that modifying the default downsampling settings with HC somehow adversely affects the downsampling process; in particular, it seems that disabling downsampling leads to decreased reported depth. Unintuitive and vexing, it is. This is a problem specific to HC; we haven't had the bandwidth yet to assign someone to figure it out, so unfortunately, at this time, our only recommendation is to leave the downsampling alone.

    Geraldine Van der Auwera, PhD

  • mmaj (Posts: 9; Member)

    Hi Geraldine,
    Thanks for the suggestion!
    I've rerun the HC command without disabling downsampling (simply removing "-dt NONE" from the above command), and it emits a DP of around 200 reads, which (albeit only a third of the total reads at this position) is a huge difference from the DP=16 emitted with -dt NONE...
    I'll definitely keep the effect of (no) downsampling in mind! Thanks!

    -Mads

  • metheuse (Ann Arbor; Posts: 8; Member)

    I ran HaplotypeCaller on some exome-seq samples with both dedupped reads (>200x coverage) and non-dedupped reads (>2000x coverage; yes, there is a lot of duplication). I was expecting the depth from the non-dedupped reads to be much higher than from the dedupped reads. However, it's remarkably the opposite...
    Here is the command I was using:
    java -jar ${GATK} -T HaplotypeCaller -R ${genome_fasta} -I Sample_${id}_recal.bam \
      -stand_call_conf 30.0 -stand_emit_conf 10.0 --dbsnp ${dbsnp} -L ${target_bed} \
      --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
      -o Sample_${id}.vcf
    Btw, after running HaplotypeCaller and GenotypeGVCFs, there are 2300 variants from the dedupped reads and 2270 variants from the non-dedupped reads, which is, again, against my expectation.
    This is targeted exome-seq in which only the exons of a hundred genes were targeted, which is why there is such high duplication. I wasn't sure whether I should remove duplicates, so I tried both.

    I understand that this happens because HC randomly downsamples the reads: when there are so many duplicates, the unique reads have a much lower chance of being included. I was thinking I should run HC without downsampling by adding -dt NONE; however, after reading this thread, especially the message above, it seems that adding -dt NONE is not a good choice.
    Any advice? Thanks.

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    Hi @metheuse,

    There is actually no expectation of getting either more or fewer variants from dedupped vs. non-dedupped reads, since you don't know ahead of time whether the duplicates support reads that are all ref or reads that carry evidence of variation.

    Regarding your observations about depth, can you give a little more detail and post a few records showing the difference?

    Geraldine Van der Auwera, PhD

  • metheuse (Ann Arbor; Posts: 8; Member)

    @Geraldine_VdAuwera said:
    Hi metheuse,

    There is actually no expectation of getting either more or fewer variants in dedupped reads vs. non-dedupped reads, since you don't know ahead of time if the duplicates support reads that are all ref or that have evidence of variation.

    Regarding your observations about depth, can you give a little more detail and post a few records showing the difference?

    Thanks for replying. I think I'll remove duplicates anyway. What I really want to know is: what is the benefit of letting HC do downsampling? It doesn't make sense to me to sequence DNA ultra-deeply and then not use all of it in variant calling. But from the discussion above on this page, it seems that you don't recommend using -dt NONE. Is that still true? Thanks.

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    The downsampling system is tuned to provide the analysis tools with as much data as they need to be empowered to make the right decisions, and no more. Providing the tools with more data than they need just slows them down and increases your compute costs without benefit. Generally speaking, sequencing DNA ultra-deeply rarely makes sense. The way the Broad's production facility (and many others) works is that the sequencing process is optimized to yield the right depth of data per experimental design. This can vary between studies, and when a new design is introduced, pilot studies are carried out to determine what the depth of sequencing should be. This allows you to get the discovery power you need without wasting money and computing resources.

    Geraldine Van der Auwera, PhD

  • KStamm (Posts: 31; Member)

    @Geraldine_VdAuwera said:
    Thanks Ryan. I ran some tests on your data and I think I can explain what's happening here.

    I believe that is why your particular site of interest ends up with only a fraction of the expected depth. Currently this 1000-read total per active region cannot be overridden from command line, but I'm going to put in a feature request to have this be an adjustable parameter in future versions.

    Has there been any progress on that feature? I've also got ultra-deep amplicon-seq data for which I need accurate variant call depths. The first pass, with default downsampling, yielded this genotype call:

    0/0:543:0:543:0,0,1927
    

    By another technology the sample is clearly HET, and when viewing the alignments manually I see thousands of reads for both REF and ALT. Somehow, perhaps by luck of the starting positions or mapping qualities, the downsampler got only REF reads. Yet the genotype likelihoods are 0,0,1927, indicating an equal chance of het and hom-ref from 543 ref reads?

    Anyway, I re-ran it with -dcov 99999 and now it comes up like

    0/0:998:0:998:0,0,7039
    

    So again, a misleading "0/0" call, and it stops at about 1000 reads when I explicitly asked for 99 thousand.

    I think for now I have to go with a less complicated variant caller.

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    I do believe the devs added a feature to control the maximum number of reads per active region. See this argument.
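
    For example (a sketch with placeholder file names; the argument is only available in recent versions, and the default value may differ between releases):

    java -jar GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R reference.fasta \
      -I sample.bam \
      -L targets.bed \
      --maxReadsInRegionPerSample 10000 \
      -o sample.hc.vcf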

    We're putting in a fix to replace the 0/0 calls with ./. when the GQ/PLs are uninformative.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera (Posts: 10,469; Administrator, Dev admin)

    But that said -- to be frank, HaplotypeCaller isn't designed to handle amplicon data and we've heard from many users like yourself that it doesn't do very well on that datatype. I don't often say this, but you may want to give UnifiedGenotyper a shot.

    Geraldine Van der Auwera, PhD

  • KStamm (Posts: 31; Member)

    Thanks Geraldine, I'll give the new maxReadsInRegionPerSample argument a try.

    In the meanwhile, I got the correct genotype from

    samtools mpileup -C50 -S -D -d 99999 -l region.BED -uf $REF *BAM
    
    0/1:255,0,255:5043:1:99
    

    Now it reports high confidence in the HET, with 5043 reads at the site, but it still isn't reporting per-allele read counts (which, if I recall correctly, might be why I switched from samtools to GATK in the first place!)

    EDIT: I think it's working; the ProgressMeter predicts about 10x longer runtime with the new depth settings.

  • priesgo (Panamá; Posts: 22; Member)

    Dear GATK team,

    Finally got around to playing with your HaplotypeCaller. Thanks for your work, as always!

    I'm reviving this thread because I'm finding a similar issue of DP values from the HC that are too low compared to the UG. What got me to write a comment is that the difference in my case is quite extreme, and it might be affecting hard filtering by QD. Let me show you a couple of variant calls.

    From UG:

    NC_008508 61511 . T A 10.42 FP;LowQual AC=1;AF=1.00;AN=1;BaseQRankSum=-1.205;DP=259;Dels=0.00;FS=0.000;GC=42.89;HRun=0;MLEAC=1;MLEAF=1.00;MQ=3.48;MQ0=136;MQRankSum=-0.123;NDA=1;PercentNBase=0.0000;QD=0.04;ReadPosRankSum=-0.205;SOR=0.311;set=FilteredInAll GT:AD:DP:GQ:PL 1:248,9:259:40:40,0

    From HC:

    NC_008508 61511 . T A 307 PASS AC=1;AF=1.00;AN=1;DP=8;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=58.11;NDA=1;QD=32.41;SOR=4.407;set=variant GT:AD:DP:GQ:PL:SAC:SB 1:0,8:8:99:337,0:0,0,0,8:0,0,0,8

    So the reported DP drops from 259 to just 8. OK, the variant is still called. But, as I'm applying the hard filter "QD < 2.0", I would think something might be erroneous. Now look at the variant call quality values: 10.42 for UG and 307 for HC. It looks as if the difference in DP was compensated for by increasing the quality. In the HC results I am getting very high QD values for all variants. Overall, I'm getting a bunch of variants detected by both UG and HC; UG filters some of them out using the hard filters, while HC does not filter out any variant. It looks as if the HC has lost the ability to filter false positives. Or, more probably, I'm missing part of the picture behind the scenes. As far as I know, the recommended hard filters for UG and HC remain the same.
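
    For reference, I'm applying that QD filter with VariantFiltration, roughly like this (a sketch; file and filter names are placeholders):

    java -jar GenomeAnalysisTK.jar \
      -T VariantFiltration \
      -R reference.fasta \
      -V sample.hc.vcf \
      --filterExpression "QD < 2.0" \
      --filterName "LowQD" \
      -o sample.hc.filtered.vcf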

    Could you, please, give me a hand with this?

    I'm using GATK version 3.5; only 5.17% of reads were filtered out by HCMappingQualityFilter; I used the default downsampling configuration, a ploidy of 1, and the flag -allowNonUniqueKmersInRef.

    Thanks in advance!
    Pablo.

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @priesgo
    Hi Pablo,

    Can you please post the original BAM file and bamout file for the site?

    Thanks,
    Sheila

  • priesgo (Panamá; Posts: 22; Member)

    Thanks for your help, Sheila. The ability to analyse reassemblies is very powerful for validation purposes.

    I reran my HaplotypeCaller command, adding the parameters -bamout $INPUT_BAM.hc_reassembly.bam -forceActive -disableOptimizations.
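
    Spelled out, the rerun looked roughly like this (reference and interval names are placeholders; -ploidy 1 as mentioned above):

    java -jar GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R reference.fasta \
      -I $INPUT_BAM \
      -L targets.bed \
      -ploidy 1 \
      -bamout $INPUT_BAM.hc_reassembly.bam \
      -forceActive \
      -disableOptimizations \
      -o sample.hc.vcf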

    Inspecting this site, the results from the HaplotypeCaller reassembly seem to improve on the UnifiedGenotyper results, as the per-base coverage values indicate (see the attached screenshots for UG and HC). The read-depth annotations still seem erroneous, though: after the rerun, the reported read depth decreased from 8 to 4, while the reassembly shows 89 total reads, 68 of them supporting this variant.

    [Attachments: IGV_screenshot_1.png, IGV_screenshot_UG_coverage.png, IGV_screenshot_HC_coverage.png]

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @priesgo
    Hi Pablo,

    I suspect some of the reads are the artificial haplotypes created by HaplotypeCaller. Can you please try coloring the reads by sample? You can right-click on the reads and choose "color by sample".

    -Sheila

  • priesgo (Panamá; Posts: 22; Member)

    I did not have that option in IGV, only "color by strand", so I did a manual analysis of samples across reads. The sample was the same for every read, but the read group changed. And guess what we had: 8 reads in read group "1" and the rest of the reads in the read group "ArtificialHaplotype". Could you give me some insight into, or point me to the correct documentation for, this artificial haplotype process?
    In light of these findings, I have some questions:
    (1) In this specific case, should I trust HaplotypeCaller, or rely on the filtering of the UnifiedGenotyper results? Looking at the coverage values of the original BAM, this variant looks to the untrained eye like a false positive.
    (2) Are the generic recommended hard filtering thresholds (https://www.broadinstitute.org/gatk/guide/article?id=3225) the same for UnifiedGenotyper and HaplotypeCaller?

    And a last question that came to me while looking at the strand coloring:
    (3) Are we losing the strand information in the reassembly? All reads are on the same strand, but FS=0.0.

    Thanks again, and sorry to raise so many questions. Maybe some of them deserve an independent thread; let me know if you would prefer me to open a separate thread for any of them.

    Pablo.

  • Sheila (Broad Institute; Posts: 3,953; Member, Broadie, Moderator, Dev admin)

    @priesgo
    Hi Pablo,

    For understanding artificial haplotypes, you can look at this article.

    1) We recommend HaplotypeCaller over UnifiedGenotyper, as it makes better indel calls.

    2) Yes, they are. However, they are meant to be a starting point. It is up to you to determine what the best cutoffs are for your own dataset. Have a look here for more information.

    3) I think the FS only takes into account the reads in the bamout, but I will double-check. The artificial haplotypes are all on one strand for simplicity.

    -Sheila
