invariant calls not consistently included in Unififed Genotyper output

Hello!

I am writing to seek your help on a curious result I received from running the Unified Genotyper. I ran Unified Genotyper to obtain all confident sites (both invariant and variant) on a suite of different bam files. For certain regions of the genome that overlap certain bam file reads, it appears that it did not output invariant sites, only variant sites. I do not know if this is due to differences among bam files, my call to GATK, or is a glitch.

Here is a little more background.

  • First, I am using Unified Genotyper to maintain continuity with some analyses I did over a year go.
  • My data:
  • I have 92 low coverage bam files from samples produced from illumina sequencing a RADtag type enzyme digestion library. The bams contain single-end reads, each 44-bp long. I will refer to these reads as ‘RADtags’.
  • I also have 18 high coverage bams produced from illumina sequencing whole genome paired-end libraries, with read lengths varying from 36 to 100. I will refer to these as WGS samples.
  • Previously I called SNPs using Unified Genotyper with mode set to EMIT_VARIANTS_ONLY, on all this data at once, to increase the chances of identifying quality SNPs in the low coverage RADtag data and to get genotypes from the high coverage WGS data at the same sites. The data seemed fine.

  • Current issue:

  • I ran the Unified Genotyper with the mode set to EMIT_ALL_CONFIDENT_SITES for all the same data. The vcf includes both invariant and variant calls. However, it seems I am not getting any invariant sites from regions that overlap the RADtag reads. It seems the program is not emitting invariant sites within RADtags, but is reporting invariant sites outside of RADtags and only for WGS samples.

I have attached a subset of my vcf file, and a snapshot image from viewing these samples in igv. In the igv view, the middle samples are the RADtag samples, the higher coverage samples at the top and bottom are the WGS samples. As you can see, there should be plenty of invariant sites in the regions overlapping the RADTAGs and there should be calls that include these data regions and samples.

Can you think of a reason that could be causing this? I have pasted my call to GATK below.

Thank you!
Amanda

java -jar /usr/local/gatk/3.0-0/GenomeAnalysisTK.jar \
-R mg.new.ref.fa \
-T UnifiedGenotyper \
-I AHQT1G_q29.paired.nodup.realign.bam -I BOG10G_q29.paired.nodup.realign.bam -I CAC6G_q29.paired.nodup.realign.bam \
-I CAC9N_q29.paired.nodup.realign.bam -I DUNG_q29.paired.nodup.realign.bam -I IM62.JGI_q29.paired.nodup.realign.bam \
-I KOOTN_q29.paired.nodup.realign.bam -I LMC24G_q29.paired.nodup.realign.bam -I MAR3G_q29.paired.nodup.realign.bam \
-I MED84G_q29.paired.nodup.realign.bam -I MEN104_q29.paired.nodup.realign.bam -I NHN26_q29.paired.nodup.realign.bam \
-I PED5G_q29.paired.nodup.realign.bam -I REM8G_q29.paired.nodup.realign.bam -I SF5N_q29.paired.nodup.realign.bam \
-I SLP9G_q29.paired.nodup.realign.bam -I SWBG_q29.paired.nodup.realign.bam -I YJS6G_q29.paired.nodup.realign.bam \
-I CACid.277_index2.GCTCGG.sortedrealign.bam -I CACid.279b_index2.CTGATT.sortedrealign.bam -I CACid.247_index2.AATACT.sortedrealign.bam -I CACid.237_index2.CGCTGT.sortedrealign.bam \
-I CACid.240b_index2.GTCTCT.sortedrealign.bam -I CACid.272_index2.ATCTCC.sortedrealign.bam -I CACid.282b_index2.TCTGCT.sortedrealign.bam -I CACid.179_index2.TATAGT.sortedrealign.bam \
-I CACid.187_index2.CAGCAT.sortedrealign.bam -I CACid.189_index2.TAATCC.sortedrealign.bam -I CACid.192_index2.GCAGTT.sortedrealign.bam -I CACid.339_index2.CCTGAA.sortedrealign.bam \
-I CACid.235_index2.AAGCGA.sortedrealign.bam -I CACid.236_index2.AACTTA.sortedrealign.bam -I CACid.249_index2.GTTCCT.sortedrealign.bam -I CACid.370_index2.CTAATA.sortedrealign.bam \
-I CACid.372_index2.CCGAAT.sortedrealign.bam -I CACid.215_index2.CTTGTA.sortedrealign.bam -I CACid.218_index2.CCCTCG.sortedrealign.bam -I CACid.219_index2.ACTGAC.sortedrealign.bam \
-I CACid.221_index2.GTGACT.sortedrealign.bam -I CACid.225_index2.ACTAGC.sortedrealign.bam -I CACid.226_index2.CGACTA.sortedrealign.bam -I CACid.231_index2.GTGAGA.sortedrealign.bam \
-I CACid.232_index2.AATCTA.sortedrealign.bam -I CACid.233_index2.CAAGCT.sortedrealign.bam -I CACid.212_index2.CTCTCA.sortedrealign.bam -I CACid.211_index2.ACTCCT.sortedrealign.bam \
-I CACid.403_index2.GATCAA.sortedrealign.bam -I CACid.280_index2.GGCTTA.sortedrealign.bam -I CACid.283_index2.GTGGAA.sortedrealign.bam -I CACid.285_index2.AAATGA.sortedrealign.bam \
-I CACid.003_index2.GGGTAA.sortedrealign.bam -I CACid.007_index2.CGTCAA.sortedrealign.bam -I CACid.008_index2.GTTGCA.sortedrealign.bam -I CACid.055_index2.ATATAC.sortedrealign.bam \
-I CACid.056_index2.TTAGTA.sortedrealign.bam -I CACid.071_index2.TCGCTT.sortedrealign.bam -I CACid.313_index2.TCGTCT.sortedrealign.bam -I CACid.040_index2.CAATAT.sortedrealign.bam \
-I CACid.292_index2.TCATGG.sortedrealign.bam -I CACid.080_index2.CAGGAA.sortedrealign.bam -I CACid.088_index2.ATCGAC.sortedrealign.bam -I CACid.322_index2.TTGCGA.sortedrealign.bam \
-I CACid.114_index1.GCTCGG.sortedrealign.bam -I CACid.279a_index1.CTGATT.sortedrealign.bam -I CACid.181_index1.AATACT.sortedrealign.bam -I CACid.184_index1.CGCTGT.sortedrealign.bam \
-I CACid.185_index1.GTCTCT.sortedrealign.bam -I CACid.191_index1.ATCTCC.sortedrealign.bam -I CACid.194_index1.TCTGCT.sortedrealign.bam -I CACid.240a_index1.TATAGT.sortedrealign.bam \
-I CACid.238_index1.CAGCAT.sortedrealign.bam -I CACid.371_index1.TAATCC.sortedrealign.bam -I CACid.253_index1.GCAGTT.sortedrealign.bam -I CACid.260_index1.CCTGAA.sortedrealign.bam \
-I CACid.262_index1.AAGCGA.sortedrealign.bam -I CACid.257_index1.AACTTA.sortedrealign.bam -I CACid.258_index1.GTTCCT.sortedrealign.bam -I CACid.216_index1.CTAATA.sortedrealign.bam \
-I CACid.220_index1.CCGAAT.sortedrealign.bam -I CACid.223_index1.CTTGTA.sortedrealign.bam -I CACid.224_index1.CCCTCG.sortedrealign.bam -I CACid.228_index1.ACTGAC.sortedrealign.bam \
-I CACid.229_index1.GTGACT.sortedrealign.bam -I CACid.281_index1.ACTAGC.sortedrealign.bam -I CACid.282a_index1.CGACTA.sortedrealign.bam -I CACid.284_index1.GTGAGA.sortedrealign.bam \
-I CACid.286_index1.AATCTA.sortedrealign.bam -I CACid.287_index1.CAAGCT.sortedrealign.bam -I CACid.288_index1.CTCTCA.sortedrealign.bam -I CACid.072_index1.ACTCCT.sortedrealign.bam \
-I CACid.077_index1.GATCAA.sortedrealign.bam -I CACid.078_index1.GGCTTA.sortedrealign.bam -I CACid.084_index1.GTGGAA.sortedrealign.bam -I CACid.087a_index1.AAATGA.sortedrealign.bam \
-I CACid.087b_index1.GGGTAA.sortedrealign.bam -I CACid.091_index1.CGTCAA.sortedrealign.bam -I CACid.092_index1.GTTGCA.sortedrealign.bam -I CACid.103_index1.CTCACT.sortedrealign.bam \
-I CACid.095_index1.TCCATA.sortedrealign.bam -I CACid.100_index1.ACTCGA.sortedrealign.bam -I CACid.101_index1.GTTCGA.sortedrealign.bam -I CACid.106_index1.ATATAC.sortedrealign.bam \
-I CACid.108_index1.TTAGTA.sortedrealign.bam -I CACid.110_index1.TCGCTT.sortedrealign.bam -I CACid.112_index1.TCGTCT.sortedrealign.bam -I CACid.144_index1.CAATAT.sortedrealign.bam \
-I CACid.149_index1.TCATGG.sortedrealign.bam -I CACid.321_index1.CAGGAA.sortedrealign.bam -I CACid.323_index1.ATCGAC.sortedrealign.bam -I CACid.324_index1.TTGCGA.sortedrealign.bam \
-o CACid.MIM.sites.q40.6.14.15.vcf \
--metrics_file CACid.MIM.sites.q40.6.14.15.metrics.txt \
--genotype_likelihoods_model SNP \
--output_mode EMIT_ALL_CONFIDENT_SITES \
-stand_call_conf 40 \
-stand_emit_conf 40 \
-l INFO

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amsci8
    Hi Amanda,

    Have you tried -out_mode EMIT_ALL_SITES? I think the sites that are not output are the ones that do not have enough confidence to be emitted with EMIT_ALL_CONFIDENT_SITES .

    -Sheila

  • amsci8amsci8 USAMember
    edited July 2015

    Hi Sheila,
    Thanks for your message. I have not tried using EMIT_ALL_SITES. However, I hesitate to think that this is the problem. It seems the program is not calling almost any invariant sites within regions that overlap the RADTAGs for any of the samples, even when there is very high coverage for the WGS samples. It will output invariant sites right outside the region of the RADTAG, for example the very next base, but not within the bps that overlap the RADTAG.

    See the attached tiff image. It called invariant sites starting at bp 474, right outside the RADTAG. It did not call any invariant sites for any sample for the region overlapping the RADTAGS visible in the bottom half of the view (bps 430-473). Despite high mapping qualities and high coverage.

    I can try using EMIT_ALL_SITES, but can you think of another possible reason for this behavior? It seems odd that all the sites that overlap RADTAGs throughout the entire genome would not ever contain callable invariant sites for any sample, despite callable invariant sites right outside them.

    Would it matter that one set of bams are paired-end reads, and another set are single-end reads?

    Thank you so much for your thoughts!

    Amanda

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @amsci8 would it be an option for you to do EMIT_ALL_SITES instead along with --intervals ?

  • amsci8amsci8 USAMember

    Hi Tommy,
    Thanks. Perhaps. I think I'd have to first create a vcf of all the sites in my desired RADTAG samples using EMIT_ALL_SITES. Then use that vcf as the limiter for intervals in my GATK call that includes the high coverage lines. It could be worth a try.

    Still seems odd that this is happening though, so I may try a different caller altogether for comparison.

    Thanks,
    Amanda

  • amsci8amsci8 USAMember
    edited July 2015

    @Sheila
    Hi Sheila,

    I tried your suggestion of running the Unified Genotyper with EMIT_ALL_SITES. I have attached a subset of the data from this run, in case you want to take a look. Note that the lines with "CACid" in the name are the RADTAG samples.

    Indeed, it seems that for the invariant sites that overlap RADTAG samples, the Unified Genotyper is outputting a lower QUAL value. However, the amount of coverage or mapping qualities from the high coverage lines does not vary for these sites. There is just extra data, from the RADTAG samples. It seems odd to me that it would produce a lower QUAL value for a site with MORE data. However, I admit I don't know the details of exactly how the QUAL calculations are done. Also, from reading a few posts here and there, it seems folks at the GATK do not have a lot of confidence (relatively speaking) in the QUAL values for invariant sites from Unified Genotyper...is that a correct interpretation? (My reason for continuing to use it was to be consistent with SNP only analyses I did a while back.)

    Note that I ran an analysis with Samtools for comparison, and it seems it does not have this issue. So, it seems my options are to lower my QUAL threshold for invariant sites (in custom post-processing), switch to Samtools, or try the HaplotypeCaller. Do you have any ideas about why the QUAL values would be downgraded at the sites with these samples? I guess I'm wondering if this may be an issue that occurs with HaplotypeCaller as well...

    Thank you so much for any thoughts you can share!

    Amanda

    ADDITION/EDIT:
    I found the following on article posted by another GATK staff scientist (Geraldine I believe), about the Unified Genotyper (for archival purposes only). It seems that having the additional, low coverage data from the RADTAG samples does lower the QUAL value for invariant sites...
    "Confidently called bases

    Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

    Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample."

    Post edited by amsci8 on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amsci8
    Hi Amanda,

    Did you Mark Duplicates? Perhaps this thread will help you: http://gatkforums.broadinstitute.org/discussion/4475/unifiedgenotyper-and-single-end-radtags

    -Sheila

  • amsci8amsci8 USAMember

    @Sheila
    Hi Sheila,

    I did not Mark Duplicates for the RADSEQ data. I did Mark Duplicates for the high coverage WGS data.

    -Amanda

  • amsci8amsci8 USAMember
    edited July 2015

    @Sheila
    Hi Sheila,

    • What are your thoughts on whether the low and variable coverage of the RADSEQ samples could be driving down the QUAL calculations?
    • Also, do you think it seems reasonable to manually filter based on QUAL, such that I only keep high QUAL variants (>=Q40) but let in lower QUAL invariant sites (>=Q20). If one looks at the attached tiff image, one can see where the QUAL drastically drops in the region of the RADseq samples (the ones on the right), while the sites without them have QUALs in the 40s. Coverage and mapping quality seems comparable otherwise. When I peruse my data, it seems this is the general pattern, that invariant sites overlapping the RADseq data have mysteriously lower QUAL (often values in the range of 25-28), while the neighboring invariant sites without the RADseq data mostly have QUAL values much higher than that.

    Note that in total my dataset includes 18 whole genome resequenced lines, 11 of which are at reasonably high coverage, and 92 low coverage RADseq samples. I should also note that I am not using this data to identify causal sites. I am using it for population genetic analysis. I am not looking at indels.

    Sorry to keep bugging you about this.
    Thank you!
    Amanda

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amsci8
    Hi Amanda,

    I know you checked the mapping qualities, but can you check the base qualites in the RadSeq data? I have a feeling those are what are driving down the QUAL scores.

    Thanks,
    Sheila

    P.S. I appreciate you persistence! :smile:

  • amsci8amsci8 USAMember

    @Sheila
    Hi Sheila,
    Thanks for the tip! I will check that. Do you think perusing various regions in igv is good enough? Or do you recommend a more quantitative way?

    Thanks,
    Amanda :)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amsci8
    Hi Amanda,

    I think the best way is in IGV. You can right click on a read and choose "Show all bases". Then "Shade Bases By" -> Quality. The lower quality bases will be lighter colored than high quality bases.

    -Sheila

  • amsci8amsci8 USAMember

    @Sheila
    Hi Sheila,

    Thanks! I finally got around to checking this. I think the base qualities are actually pretty good. From perusing a handful of individuals in a few different regions, the large majority of bases have phred scores in the mid to high 30s with many at 40. Very few reads have bases with qualities in 20s. Your suggestion also jogged my memory that I did early assessments usting fastqc and the base qualities are pretty good across the length of the reads on average too.

    Hmmm...puzzling.

    Thanks again for any thoughts you have,
    Amanda :smile:

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @amsci8
    Hi Amanda,

    Now I am wondering if this is a Unified Genotyper issue. Can you try with Haplotype Caller and see if the regions are called correctly?

    Thanks,
    Sheila

  • amsci8amsci8 USAMember

    @Sheila
    Hi Sheila,
    I'll try to run with haplotype caller soon.
    Thanks!
    Amanda

Sign In or Register to comment.