Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

Y chromosome SNPs not found in YCC and cannot be assigned to a particular haplogroup

mqbpkkd2mqbpkkd2 ManchesterMember

Dear all,

I've been doing ancient DNA analysis in Egyptian mummies and I have some issues with my Y chromosome SNPs.
First of all when I map my dataset to the hg19 I have less SNPs than when I map it only to the Y-chromosome and also different ones. I do the SNP calling with GATK in which case I have very few haplotypes, and also I call SNPs from Geneious in which case I have way more. The common ones between GATK and Geneious are less than 10.
My biggest problem is that when I search the UCSC genome browser they come up as validated with an rs number and all, but I cannot find them in the YCC publication, therefore I cannot assign them in a particular haplogroup which is what I have to get out of this analysis. Below are the commands that I'm running to call SNPs after the local realignment and recalibration. Any help would be very much appreciated.

java -jar ../../../NGS_programs/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../../GRCh38_hg38_WholeGenomewithrepeats/hg38.fa -I ../../../NGS_Data/NA_chromosomeY_NGS/NA_Y_mapped_to_complete_genome_GRC38_hg38/NA_Y_TTCTTGAC-GCGCGTAG.hg38aligned.MERGED.Groups.Added.indelrealigner.recal.bam -o NA_Y_TTCTTGAC-GCGCGTAG.hg38aligned.MERGED.HaplotypeCaller.vcf

java -jar ../../../NGS_programs/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SelectVariants -R ../../
GRCh38_hg38_WholeGGRCh38_hg38_WholeGenomewithrepeats/hg38.fa -V NA_Y_TTCTTGAC-GCGCGTAG.hg38aligned.MERGED.HaplotypeCallertest.vcf -selectType SNP -o NA_Y_TTCTTGAC-GCGCGTAG.hg38aligned.MERGED.SelectVariants.vcf

java -jar ../../../NGS_programs/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T VariantFiltration -R ../../GRCh38_hg38_WholeGenomewithrepeats/hg38.fa -V NA_Y_TTCTTGAC-GCGCGTAG.hg38aligned.MERGED.SelectVariants.vcf --filterExpression "DP < 5" --filterName "NA_Y_filter_GATK" -o NA_Y_GATK_filtered_snps_VariantFiltration.vcf

The last command I have also tried with default values i.e. --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "NA_Y_filter_GATK" but nothing changed.

Thank you all.

Best Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    Hi @mqbpkkd2,

    Personally I wasn't aware that Geneious included a variant caller. In a cursory Google search I couldn't find any documentation detailing the algorithm it uses, so I can't comment on what level of accuracy you might expect there.

    Speaking for GATK, and specifically HaplotypeCaller, it is designed to be highly sensitive while excluding obvious artifacts, but it does expect certain levels of quality of the data, both of the sequencing and the underlying starting material. I assume somewhat naively that mummy DNA is likely to be degraded in all sorts of ways, and I would expect to see a lot of noise in the resulting sequence. So my off-the-bat interpretation is that HC is throwing out a lot of sites as obvious artifacts, while Geneious might be less discriminating when run with default settings. You can relax the HC's settings if you want to turn up sensitivity, but that means letting in more false positives. So it really depends on what you need in your work.

    Keep in mind also that chromosome Y is a difficult one to work with, so you should probably restrict your analysis to regions that people have found tractable in past studies. In fact I would recommend using -L to pass in the list of coordinates of regions that are associated with haplogroups of interest (or a list of the SNPS that have been used to define those haplogroups), plus -ip 100 to add some padding (to avoid edge effects). By doing this your analysis will run faster and you will only get results in usable regions.

    Now, I notice you mention DP > 5. Is that an estimate of the average coverage you have overall? Because that is going to be your biggest problem -- it's really not going to be enough to make confident calls. At that level of coverage you really don't have any power to distinguish real variants from artifacts, especially if your starting material is likely to be heavily degraded. The first thing you should do, really, is check that you have enough coverage over the intervals that matter to you. You can do this with DiagnoseTargets.

Answers

  • mqbpkkd2mqbpkkd2 ManchesterMember

    Thank you so much for your quick response, I really appreciate it!! Yes it is real fun once you get things to work (which I'm still struggling for!) :smiley: For the first issue I figured that was the reason why, and thanks for your suggestion I'll work with the whole genome from now on.
    For the second, my goal is to assign my Y SNPs to a haplogroup. But I can't do that because of two reasons:

    1st: I have discrepancies when I call SNPs form GATK and from Geneious (maybe GATK is too stringent for my ancient data??) For example I have 7 SNPs from HaplotypeCaller and maybe more than 50 from Geneious even though my DP is > 5 in both cases.

    2nd: Even though the SNPs exist in the UCSC/ISOGG database and they have an rs number, they do not assign to any haplogroup.
    That makes me wonder about two things:

    -Are my SNPs unique or a result of modern contamination?
    -Why GATK filters out most of my SNPs as they appear on Geneious?
    -Which SNP method do I trust?

    Thank you once again, for your valuable help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Accepted Answer

    Hi @mqbpkkd2,

    Personally I wasn't aware that Geneious included a variant caller. In a cursory Google search I couldn't find any documentation detailing the algorithm it uses, so I can't comment on what level of accuracy you might expect there.

    Speaking for GATK, and specifically HaplotypeCaller, it is designed to be highly sensitive while excluding obvious artifacts, but it does expect certain levels of quality of the data, both of the sequencing and the underlying starting material. I assume somewhat naively that mummy DNA is likely to be degraded in all sorts of ways, and I would expect to see a lot of noise in the resulting sequence. So my off-the-bat interpretation is that HC is throwing out a lot of sites as obvious artifacts, while Geneious might be less discriminating when run with default settings. You can relax the HC's settings if you want to turn up sensitivity, but that means letting in more false positives. So it really depends on what you need in your work.

    Keep in mind also that chromosome Y is a difficult one to work with, so you should probably restrict your analysis to regions that people have found tractable in past studies. In fact I would recommend using -L to pass in the list of coordinates of regions that are associated with haplogroups of interest (or a list of the SNPS that have been used to define those haplogroups), plus -ip 100 to add some padding (to avoid edge effects). By doing this your analysis will run faster and you will only get results in usable regions.

    Now, I notice you mention DP > 5. Is that an estimate of the average coverage you have overall? Because that is going to be your biggest problem -- it's really not going to be enough to make confident calls. At that level of coverage you really don't have any power to distinguish real variants from artifacts, especially if your starting material is likely to be heavily degraded. The first thing you should do, really, is check that you have enough coverage over the intervals that matter to you. You can do this with DiagnoseTargets.

  • mqbpkkd2mqbpkkd2 ManchesterMember

    Hi Sheila, Geraldine thank you both!

    To answer a few of your questions, no I do not have sequence coverage for the entire chromosome, I only enriched specific areas of the genome but even so my dataset is highly fragmented. The Depth that I'm getting is 4-7 reads tops. I am well aware that it's not enough to make confident calls but with ancient datasets we just have to get the best out of it I guess!

    Thank you very much both for your suggestions, I'll try them out and hope for the best!

Sign In or Register to comment.