Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

missing FOXOG annotation for specific SNP variants in VCFs

fortunofortuno ChicagoMember

Hi,

We've run Mutect2 with OxoGReadCounts walker to obtain FOXOG annotation for somatic variants in VFCs. I know FOXOG annotation is only calculated for SNPs and when denominator (ALT_F1R2 + ALT_F2R1) is not null. However, I would like to know if there are some other conditions under the FOXOG annotation is not calculated since some SNPs were not annotated with FOXOG values. For instance:

GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:710,18:1.351e-03:9:9:19396,531:347:358 0/1:602,5:4.393e-03:4:1:16497,153:288:305

Supposedly, this SNP variant has non-zero ALT_F1R2 and ALT_F2R1 annotated, so it should calculate FOXOG value but this value is not provided. Do you know why FOXOG is not calculated in these cases? I'm providing one additional example, just in case:

GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:818,19:0.012:12:7:24855,568:406:401 0/1:676,22:0.015:15:7:20245,679:359:307

Thanks,
Francisco Ortuno

Tagged:

Issue · Github
by Sheila

Issue Number
1200
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
chandrans

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @fortuno
    Hi Francisco,

    I am checking with the team and will get back to you.

    -Sheila

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @fortuno,

    Are you using a contamination estimate with MuTect2? Either supplying an output of ContEst or some default allele fraction cutoff? Just to confirm, you are indeed seeing the FOXOG annotation for other variant sites in your callset?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Just to confirm one more thing @fortuno , your variant sites without the FOXOG annotation conform to the sequence context described in this OxoG dictionary entry, yes?

  • fortunofortuno ChicagoMember

    Thank you very much for your responses @shlee! Yes, I am seeing the FOXOG annotation in other variant sites and we are using contamination fraction cutoff of 0.02. Could this be related to the FOXOG annotation?

    I have been also checking sequence context. My variant sites without FOXOG annotation may include C→A or G→T cases as described in the OxoG dictionary. Although the context is not CCG→CAG or GGC→GTC, it isn't either in variant sites with FOXOG annotation. I guess the context will be relevant when filtering OxoG artifacts with D-ToxoG tool but not for FOXOG annotation, am I right?

    Any other idea?

    Thank you very much for your assistance.

  • tsato3tsato3 Cambridge, MAMember, Broadie, Dev

    Hello @fortuno,

    I am a developer working on MuTect2. I could not reproduce the error you mentioned; that is, the vcf always prints a FOXOG genotype annotation for the tumor sample for all SNPs. Could you send us the bam (subset to the region of interest), the exact command line, the output vcf file, and the GATK version?

    Thanks,
    Takuto

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Instructions for submitting test data are here.

  • fortunofortuno ChicagoMember

    Hi @tsato3 and @shlee,

    Thank you very much for your responses. For instance, we have found one case in the region chr3:75669000-75670000 when running Mutect2 using the following BAM files in Genomics Data Common (GDC):

    BAM file UUID (Normal Sample): 046b7599-214c-4ad3-b9af-f13eaba72881
    BAM file UUID (Tumor Sample): 5cca592a-a346-4b63-bf58-5efc38498d2d

    The output VCF is also in GDC with the UUID 384a699d-349e-4a43-a5de-509cb14c75b2 (except for some posterior non-relevant header normalizations). I think you could have access to these data, right? The command especification for this specific case is in VCF headers but we generally run Mutect2 in the following way:

    java -Djava.io.tmpdir=/tmp/job_tmp_${BLOCK_NUM} \
    -d64 -jar -Xmx${JAVA_HEAP} \
    -XX:ParallelGCThreads=1 \
    -XX:+UseSerialGC \
    ${GATK_PATH} \
    -T MuTect2 \
    -nct 1 -nt 1 \
    -R ${REF} \
    -L ${REGION} \
    -I:tumor ${TUMOR_BAM} \
    -I:normal ${NORMAL_BAM} \
    --normal_panel ${PON} \
    --cosmic ${COSMIC} \
    --dbsnp ${DBSNP} \
    --contamination_fraction_to_filter ${CONTAMINATION} \ ##Use 0.02 as cut off.
    -o ${OUTPUT_BASE}.${BLOCK_NUM}.mt2.vcf \
    --output_mode EMIT_VARIANTS_ONLY \
    --disable_auto_index_creation_and_locking_when_reading_rods

    Please, let me know if this information is enough or you would need anything else.

    Thanks again for all the support!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @fortuno,

    At this time our team does not have permission to access protected data, not even for a BAM slice, on the NCI/GDC. Given this, it's unclear whether you are allowed to send us the data, even for debugging. I'm now in the process of requesting an eraCommons account to be able to gain access to controlled TCGA data but the admin who is in charge of this is on vacation until after Labor Day. I hope solving your question is not urgent. Is it okay if we get back to you after some weeks?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited August 2016

    @fortuno,

    Just to confirm, you are observing missing FOXOG annotations for both your run of MuTect2 as well as in the GDC data portal's VCF? Can you give us a few of the site coordinates for records missing the annotation, please? Thanks.

  • fortunofortuno ChicagoMember

    Hi @shlee,

    Thank you very much for your interest and support. I work with GDC data portal's VCFs and I'm observing missing FOXOG annotation there. I work with people who run these MuTect2 pipelines but I haven't run them by myself.

    Anyway, my question is not urgent, it's okay if we wait for you being able to gain access to controlled TCGA data. I prefer this way as I'm not sure which data can be or not shared.

    Thanks again,
    Francisco.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited September 2016

    Hi Francisco (@fortuno),

    I have some answers but am missing a key explanation that I need to ask what is the significance of for your work, e.g. are null FOXOG values causing errors in downstream tools. Let's start with the answers and perhaps this will help you report back some trends and then in turn help us figure out what is going on.

    I was able to download a slice of your data as well as GDC's MuTect2 callset. I see that the GDC's callset uses some nightly build of MuTect2 from February and what you observe manifests in this callset as an absent FOXOG label in the FORMAT field. The latest stable release of MuTect2 (GATK v3.6; tool still in beta) changes this to a present FOXOG label in the FORMAT field but still gives some sites where there is no annotation in the samples, i.e. . values.

    I will assume that you are calling with the latest MuTect2 release, given it is still in beta and still undergoing improvements so will comment on my observations from this callset. From going through the code as well as the VCF, there appear to be broadly three circumstances, with the last being tentative, in which you will get these . null values for the FOXOG annotation.

    1. When a site is obviously not a SNP, e.g. is an insertion or deletion.
    2. When ALT_F1R2 and ALT_F2R1 read counts are zero which gives a zero denominator for the calculation, e.g. sites with the homologous_mapping_event annotation.
    3. Tentatively: when the most likely alllele is non-informative. This needs confirmation.

    I identified the record that you originally posted and this site appears to satisfy (1) being a SNP and (2) nonzero ALT read counts but still gives a null FOXOG annotation. Interestingly, for my simplified callset, this site gives a different ALT allele than the GDC dataset and is additionally annotated as a triallelic_site. This means that the site has two ALTs. Other trialleic sites are getting FOXOG annotations. So what gives? To help answer this, would you be able to check, if you exclude sites that match (1) or (2), for the remaining sites with null FOXOG annotations whether the ALT allele is ambiguous, e.g. could just as likely be another ALT allele? Or do you observe any other peculiarities for these sites?

    Let me point out that this particular site would not pass filters. It is annotated with alt_allele_in_normal;clustered_events;panel_of_normals or alt_allele_in_normal;clustered_events;triallelic_site. The ALT counts are much higher in the normal sample than the tumor, and also very low in the overall pileups at ~2% that is around the estimated contamination fraction. Can you explain to us how the FOXOG annotation for such a site is important to your research? We definitely want to fix bugs but need some user-backed insight to prioritize them. Thanks.

    Post edited by shlee on

    Issue · Github
    by Sheila

    Issue Number
    1473
    State
    closed
    Last Updated
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @fortuno
    Hi Francisco,

    I just put in a bug report. You can keep track of it here.

    -Sheila

  • fortunofortuno ChicagoMember
    edited September 2016

    Hi @shlee,

    I'm sorry for my late response. I was out for vacation. I really appreciate your detailed response which it's very useful for us. It seems to be very similar with the experience I had. In fact, I also saw sites satisfying (1) and (2) but I missed to find condition (3).

    I will check this week if sites with null FOXOG annotation not satisfying (1) and (2) are always multiallelic or the ALT allele is ambiguous. I'm using FOXOG annotation to run D-ToxoG tool for labelling oxoG artifacts. First, we convert VCFs in MAF format and then we prepare MAF files to be compatible with D-ToxoG input. I wanted to be sure if I should remove sites with null FOXOG before running this tool. This is clear in (1) and (2) conditions but I didn't know about remaining sites.

    Thanks again for your helpful assistance.

  • fortunofortuno ChicagoMember

    Hi,

    I was checking last week those sites with null FOXOG annotation in GDC data (excluding sites that match your (1) or (2) conditions), but I'm still unable to find a clear pattern for all these sites. That is, ALT allele is not always ambiguous or there is no a list of not passing filters that unequivocally determines sites without FOXOG annotation. I'll be having a look deeply but I wanted to ask if you have figured out anything else or you can at least confirm the (3) tentative condition.

    As I mentioned, I wanted to be sure if I should include this sites in my subsequent D-ToxoG analysis (calculating the FOXOG value by myself) or I should remove them. I know these sites won't pass some filters but our idea is to incorporate an additional filter label to those sites with potential oxoG artifact risk in addition to the filters that are already added.

    Thanks again for all the assistance.

  • fortunofortuno ChicagoMember

    Wow @shlee!

    Thank you very much for your prompt and extensive response. It was very clarifying for me. I totally agree that leaving out FOXOG-missing sites would not be dramatic taking into account their low percentage. I will check other samples in the cohort but I guess percentages would be similar.

    Regarding D-ToxoG, it throws some error if FOXOG is not specified in all the sites and it only admits SNP entries. This is not a problem because I am already filtering out all those non-SNP sites, so I can filter missing-FOXOG too.

    My main concern was to understand better why FOXOG was not provided in these cases and if I should consider important to study these cases deeply. Taking into account the two main reasons we already discussed (SNP sites and read counts are zero) and the low incidence of remaining cases, I think this is enough to run DToxoG without any subjacent concern.

    Thank you very much again for all the assistance I have received. It has been very helpful to discuss with you about it.

    Best,
    Francisco.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I'd glad I could be of some help. And I'll be sure to followup if our team figures out what is going on with these non-annotated sites. At the least we know this phenomena is programmatic, consistent and so far rare.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @fortuno,

    I just wanted to inform you that our developers will implement the FOXOG annotator from scratch in GATK4, which is slated for beta release early next year. Thanks again for your feedback.

  • xiuczxiucz Member
    edited July 2018

    I just wanted to inform you that our developers will implement the FOXOG annotator from scratch in GATK4, which is slated for beta release early next year. Thanks again for your feedback.

    hi, @shlee

    is the FOXOG annotator here now in GATK4? I only have tumor samples with no matched normal samples, and in this situation I cannot find any "orientation_bias" tags in the FILTER field after using collectSequencingArtifactMetrics _ and _FilterByOrientationBias . I wanted to use the D-ToxoG tool, but I failed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I don't believe that is currently set up, though @davidben may have more info on this.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    FilterByOrientationBias has been around for a while. If you don't see anything with the orientation_bias filter it could be that you didn't specify the appropriate artifact-modes argument(s), which by default is only G/T. It's also very possible that there was nothing to filter, which has been the case with almost every bam sequenced at the Broad for several years.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah there you go, I'm a bit out of touch :)

    (Thanks David!)

  • xiuczxiucz Member

    @davidben said:
    FilterByOrientationBias has been around for a while. If you don't see anything with the orientation_bias filter it could be that you didn't specify the appropriate artifact-modes argument(s), which by default is only G/T. It's also very possible that there was nothing to filter, which has been the case with almost every bam sequenced at the Broad for several years.

    Thank you, davidben! Another question, when I worked with normal matched samples( paired, not tumor-only samples), and also I found too many G→T variants(almost 80% in all SNVs, and they are all F1R2 in IGV) in the twice-filtered result, but they are not flagged with "orientation_bias". So, I want to know how to specify an argument or an threshold value to filter out more G→T variants?

    My best wishes. :)

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @xiucz Unfortunately the threshold is hard-coded. As a hack you could manually edit all the G-->T qualities in the preadapter metrics table output by CollectSequencingArtifactMetrics.

    Fortunately, we have been meaning to update our orientation bias filter for a long time and a new Bayesian version is in the final stages of code review. It should be merged some time next week. We would be very curious to see how it does on your data.

Sign In or Register to comment.