
Any extra filters when using HaloPlex or amplicon sequencing data?

GDP (Szeged), Member

Hello everyone!
Our group has a HaloPlex dataset from the old days, when Agilent wasn't yet marking PCR duplicates with barcodes, so unfortunately deduplication is out of the question. 61 genes were sequenced in 150 individuals. I ran the GATK4 Best Practices and used hard-filtering settings. To set the filters, I visualized the differences between dbSNP and novel variants (thanks for the tutorials, they helped a lot). The data look good in the sense that 99% of our variants are in dbSNP and not too many variants were filtered, so I am confident that the false-positive rate is low. I am more worried about genotyping; more precisely, I worry that there are more homozygous calls than there should be because of PCR bias.

Are there any other filtering options that could be useful for working with such a dataset? If there are some from amplicon sequencing, those could also be helpful, because the problems that can arise are the same. Are there extra measures one can take to make sure that the homozygotes called from amplicon sequencing data really are homozygotes? Is it worth running the analysis with UnifiedGenotyper as well? (I read that a couple of years ago HaplotypeCaller used to mess up genotypes while UnifiedGenotyper didn't, but I guess things have changed for the better.)

Answers

  • shlee (Cambridge), Member, Broadie, Moderator, admin

    Hi @GDP,

    What is your reasoning for skipping deduplication? From what I can see online about HaloPlex technology, its advantage is the multiple sets of non-overlapping PCR targets for a given target region. I should think you could still run MarkDuplicates for informational purposes, even if you include duplicate fragments in your final analysis. To gauge PCR bias, you can compare the variant calls you get when duplicate fragments are excluded versus included. To that end, after flagging duplicate fragments with MarkDuplicates, you can run the caller with or without --disable-read-filter NotDuplicateReadFilter. The NotDuplicateReadFilter is applied by default by most GATK tools.
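
    For example, a rough sketch of that comparison (the file names are placeholders; adjust to your own pipeline):

    # flag duplicate fragments but keep them in the BAM
    gatk MarkDuplicates -I sample1.bam -O sample1.md.bam -M sample1.dup_metrics.txt

    # default run: the NotDuplicateReadFilter skips flagged duplicates
    gatk HaplotypeCaller -R ref.fasta -I sample1.md.bam -O sample1.dedup.g.vcf.gz -ERC GVCF

    # comparison run: disable the filter so duplicate fragments are included
    gatk HaplotypeCaller -R ref.fasta -I sample1.md.bam -O sample1.withdups.g.vcf.gz -ERC GVCF \
        --disable-read-filter NotDuplicateReadFilter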

  • GDP (Szeged), Member

    Hello shlee,

    Thanks for the quick response. The reasons for skipping the deduplication step are the following:
    1. HaloPlex libraries are created using restriction enzymes. Although some fragments do not start at exactly the same position, most reads start at a restriction site, meaning that one end of the reads is fixed; and since most reads are the same size, in most cases the other end is fixed as well, so they all look like duplicates of each other. The newer HaloPlex method uses barcodes to distinguish true duplicates (PCR products) from apparent duplicates (fragments that start and end at the same positions but arise from different PCR reactions). Unfortunately, our dataset is from a time (I think 2014) when Agilent didn't include these barcodes.
    2. If we deduplicate, we lose 80-90% of the reads, and even though our coverage is good, it is not good enough to allow that. It is also clearly visible that without deduplication we get reasonable variants with a low number of false positives; the only problem is the apparent overrepresentation of homozygotes.

    I have marked the duplicates with Picard, just so as not to lose that information, but I didn't use it anywhere in the pipeline, and I disabled the filter for all GATK tools where it was possible, just in case (I wasn't sure whether it is off by default or not; if I remember correctly, not all tools are clear about that in their descriptions).

  • GDP (Szeged), Member

    But my question did not refer to whether deduplication is necessary; I am quite sure now that, in our case, it is not. I am more interested in any kind of tool (in my question I asked for filtering options, but maybe it is rather a tool) that could help recalibrate the many homozygous calls we get into heterozygous ones.

    Until recently it was only a gut feeling that many of our homozygotes were called wrongly because the alternative alleles (by alternative I mean different from the allele that was called as homozygous, not necessarily different from the reference) got diluted by PCR. But looking at the data in detail makes me even more sure. I am attaching an IGV screenshot of a variant in 3 samples.

    All three samples were called homozygous reference (A); we have 26 other samples that are heterozygous A/G at this site (almost all of the Gs here have quality values above 30).
    In the first sample it is 115 A vs 3 G; after deduplication: 5 A vs 0 G.
    In the second sample it is 102 A vs 6 G; after deduplication: 5 A vs 1 G.
    In the third sample it is 179 A vs 25 G; after deduplication: 6 A vs 0 G.
    I don't doubt the HaplotypeCaller's decision not to call these variants heterozygous (although I could argue a bit about the last one, but OK). If PCR weren't a factor to consider, I'd agree with these decisions. As far as I'm concerned, I am sure that these are heterozygotes in our samples, based on the fact that sequencing errors are relatively rare and that the G is otherwise quite common in the rest of our samples. I should also note that in our results the het/hom-var ratio is pretty low. That is partially explained by the fact that many samples are from native tribes, where inbreeding is common, but we see very low values even where we don't expect inbreeding.

    I am considering rewriting the genotypes in such a way that when there are more than 3 reads supporting an alternative allele (again, I don't mean alternative to the reference, I mean alternative to the one called as homozygous) that was called in another sample, I'd call the genotype heterozygous. But I expect there are more sophisticated tools for that (maybe ones that take the quality of those reads into account). Unfortunately, I couldn't find anything in the literature suggesting that people who do amplicon sequencing use anything like this, even though they could benefit from it if they are interested in genotypes and not only in the presence of rare variants. I don't know if you have any recommendations.
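
    For now, one way I can at least list these cases systematically is to dump per-sample GT and AD with GATK's VariantsToTable (the output name below is just a placeholder) and look for homozygous calls that still have a handful of reads supporting the other allele:

    gatk VariantsToTable -V new_filter_snps.vcf \
        -F CHROM -F POS -F REF -F ALT \
        -GF GT -GF AD -GF GQ \
        -O new_filter_snps.gt_ad.table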

  • shlee (Cambridge), Member, Broadie, Moderator, admin

    I see @GDP.

    I ran the GATK4 Best Practices and used hard-filtering settings. ... So I am confident that the false-positive rate is low.

    From this we assume you are generating HaplotypeCaller GVCFs for each sample, then using CombineGVCFs and GenotypeGVCFs to generate calls.

    Your concern is not with the call sites but with the genotypes at those sites for a number of your samples. You think these sites should be called het but they are hom-ref. You say that PCR bias is likely a factor:

    I don't doubt the HaplotypeCaller's decision not to call these variants heterozygous (although I could argue a bit about the last one, but OK). If PCR weren't a factor to consider, I'd agree with these decisions. As far as I'm concerned, I am sure that these are heterozygotes in our samples, based on the fact that sequencing errors are relatively rare and that the G is otherwise quite common in the rest of our samples. I should also note that in our results the het/hom-var ratio is pretty low. That is partially explained by the fact that many samples are from native tribes, where inbreeding is common, but we see very low values even where we don't expect inbreeding.

    If the support for an ALT is low in a particular sample, then cohort-level genotyping can boost the confidence for the ALT, given that other samples contain the same ALT. You may want to try the new qual model available to both HaplotypeCaller and GenotypeGVCFs (-new-qual or --use-new-qual-calculator). Its benefits are explained by Geraldine in this blog post.
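
    For example, a minimal sketch of where the flag goes (placeholder file names; this is just the per-sample GVCF workflow we assume you are already running, with the flag added):

    gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF -new-qual

    gatk CombineGVCFs -R ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz

    gatk GenotypeGVCFs -R ref.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz -new-qual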

    I am more interested in any kind of tool (in my question I asked for filtering options, but maybe it is rather a tool) that could help recalibrate the many homozygous calls we get into heterozygous ones.

    I will see if our developers have any suggestions on this.

  • yfarjoun (Broad Institute), Dev ✭✭✭

    Two thoughts:

    1. CalculateGenotypePosteriors might help here.
    2. Take a look at CollectIndependentReplicateMetrics (assuming that you have true genotypes for one of your samples) to see whether your decision not to mark duplicates is right. It will tell you how much of the duplication in your data is independent. If this is high, you do not want to mark duplicates. (Rough command sketches follow this list.)
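
    A sketch of both, with placeholder file names (double-check the argument spellings against your GATK/Picard versions, and the VCF passed to CollectIndependentReplicateMetrics should contain confident heterozygous genotypes for that sample):

    # 1. refine genotypes using a population callset as a prior
    gatk CalculateGenotypePosteriors -V cohort.vcf.gz -supporting population.vcf.gz -O cohort.posteriors.vcf.gz

    # 2. estimate how much of the apparent duplication is independent (i.e. not PCR)
    java -jar picard.jar CollectIndependentReplicateMetrics \
        INPUT=sample1.md.bam \
        VCF=sample1.known_hets.vcf \
        OUTPUT=sample1.independent_replicate_metrics.txt
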
  • GDP (Szeged), Member

    @shlee
    Thanks, I was already using -new-qual for GenotypeGVCFs, as that was indicated in the Best Practices. For HaplotypeCaller there was no such recommendation on GitHub, so I didn't use it there, but I have now started a re-run with the flag on. I am a little sceptical that it answers the question, though.

    "From this we assume you are generating HaplotypeCaller GVCFs on each sample then using CombineGVCFs and GenotypeGVCFs to generate calls.

    Your concern is not with the call sites but with the genotypes at those sites for a number of your samples. You think these sites should be called het but they are hom-ref. You say that PCR bias is likely a factor"

    Yes, all of these assumptions are correct.

    "If the support for an ALT is low in a particular sample, then cohort-level genotyping can boost the confidence for the ALT given other samples contain the same ALT."

    Yes, I agree. That is why we decided to use the GATK pipeline rather than other genotyping methods, even though HaplotypeCaller by itself is not ideal when so many duplicates are present in the data.

    @yfarjoun
    Thanks for the tip. I tried CalculateGenotypePosteriors; unfortunately, it had the opposite effect of what I hoped for. It turned even more genotypes into homozygotes. Its effect on the variant I showed above was that the first two samples kept their genotype qualities (99), and the third sample's genotype quality went from 1 to 8 after calculating the posterior probabilities. So the program just made it more certain that these look like homozygotes, while I am convinced that they should be heterozygotes. I don't know if I should play more with the CalculateGenotypePosteriors settings; I don't really understand how the part with the Dirichlet distribution works, so I didn't change the default numbers. Since they only give the initial values, I don't know whether changing them would have much influence.

  • shlee (Cambridge), Member, Broadie, Moderator, admin

    @GDP, I cannot tell from the resolution of your screenshot whether your coverage depths are in the hundreds or the thousands. For high read depths, e.g. 1000x, you should tune the parameters of our assembly-based callers. One parameter to tune is --max-reads-per-alignment-start. If you need more advice with CalculateGenotypePosteriors, please post the command you ran and a snippet of the result. @Sheila will follow up with you.
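
    For example (placeholder file names; setting the value to 0 disables the per-alignment-start downsampling entirely, or you can raise the cap instead):

    gatk HaplotypeCaller -R ref.fasta -I sample1.md.bam -O sample1.g.vcf.gz -ERC GVCF \
        --max-reads-per-alignment-start 0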

  • GDP (Szeged), Member

    @shlee Unfortunately, they are only in the hundreds. I always set --max-reads-per-alignment-start to 0.

    In the meantime, I have the results from the run where I used -new-qual not only for GenotypeGVCFs but also for HaplotypeCaller. There was a slight change in the het/hom-var ratio, but only something like going from 1.07 to 1.09, although it affected almost all samples in this way.

    @Sheila It would be nice if you could give me some tips on calibrating CalculateGenotypePosteriors, because this tool sounds like something that could really help in this situation, but with the default settings it kind of does the opposite.

    I ran it with default settings:
    $gatk --java-options "-Xmx16g" CalculateGenotypePosteriors -V new_filter_snps.vcf -R $reference_files/human_g1k_v37.fasta -supporting $reference_files/dbsnp_138.b37.vcf
    @samplesbam -O posterior_new_filter_snps
    I am working with SNPs and indels in separate VCFs; these VCFs contain a couple of thousand variants for 150 samples. I also tried running this with and without the --disable-tool-default-read-filters setting. Both with and without it, it produces calls with an average het/hom-var ratio of 0.6, so that just makes things much worse. It also introduces a lot of new variants; I have no idea why or how that works. Shouldn't it just recalibrate the genotypes of the existing variants?

    I think that by playing with the --globalPrior and --deNovoPrior values it might be possible to make it do the right job, which should be to recognize heterozygotes more often. I mean, if there are 10 good-quality reads of one allele (not at the same position in the reads, and not always from the same strand) and 100 good-quality reads of another allele, it should be called heterozygous and not homozygous...

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @GDP
    Hi,

    It may be easiest if you submit a bug report with a few sites that are called homozygous when there is "clear" evidence in the BAM file that the site is heterozygous. To confirm, have you looked at the bamout files?
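
    If you haven't generated one yet, a quick sketch (the interval and file names are placeholders; restricting -L to the problem region keeps the output small):

    gatk HaplotypeCaller -R ref.fasta -I sample1.md.bam -O debug.vcf.gz \
        -L 1:1000000-1001000 -bamout sample1.bamout.bam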

    Instructions for submitting a bug report are here.

    Thanks,
    Sheila
