Orientation Model (F1R2/F2R1) runs on data from single-end tech

mack812 (Spain, Member)
edited August 2019 in Ask the GATK team


I am currently working on data from a tiny gene panel (~29 kb; target enrichment by amplicons) with deep coverage (mean coverage well above 1000x in most samples). The data are single-end reads.

I followed the recently updated tutorial for somatic variant calling to run Mutect2. Then I realized that it probably does not make any sense to use the --f1r2-tar-gz option and the subsequent LearnReadOrientationModel tool on my data, since it comes from a single-end technology. However, I was surprised to see that not only did the tools not complain about receiving single-end reads, they also produced an .orientation_priors file filled with data, which can be passed to FilterMutectCalls, again without raising any errors. How is this possible?

Please note that I also posted another question, about a week ago, on applying GATK tools to uncommon datasets like the one described here (small panels, single-end, amplicons). It would be great if you could give some guidance on the extent to which GATK is applicable to this sort of data, which is being produced in large volumes around the world for targeted sequencing of small regions of interest (typically hotspot panels in cancer).


  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @mack812 We will get back to you soon!

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @mack812
    The developer said "It doesn't error because in the GATK an unpaired read is considered not F1R2 -- that is read.isF1R2() returns false as opposed to throwing an error. That being said, the read orientation filter should only be used with paired reads."
    I will put this in as a documentation request for our future GATK writer (we've been looking for one since the spring). If you'd like to contribute towards this in any way (for example, by providing an FAQ-style list of things you'd like answers to) that can be easily picked up by someone, discussed with devs, and filled out, that would be amazing.
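
The behaviour the developer describes can be illustrated with a minimal sketch in Python. This is not GATK's code: `is_f1r2` is a hypothetical helper that works directly on SAM flag integers, and it assumes the usual convention that F1R2 means read 1 on the forward strand or read 2 on the reverse strand. The point is only that an unpaired read yields False instead of an error, which is why nothing complains about single-end input.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
REVERSE = 0x10
FIRST_IN_PAIR = 0x40


def is_f1r2(flag: int) -> bool:
    """Hypothetical helper (not GATK's implementation).

    Returns True only for paired reads in F1R2 orientation.
    Unpaired reads return False rather than raising an error.
    """
    if not flag & PAIRED:
        return False  # single-end read: silently "not F1R2"
    reverse = bool(flag & REVERSE)
    first = bool(flag & FIRST_IN_PAIR)
    # F1R2: read 1 on the forward strand, or read 2 on the reverse strand.
    return first != reverse


def has_paired_reads(flags) -> bool:
    """True if at least one flag in the iterable marks a paired read."""
    return any(flag & PAIRED for flag in flags)
```

A check such as `has_paired_reads` over the flags of a BAM could be used to decide whether running the read orientation steps is meaningful at all; for a single-end dataset it would return False.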

  • mack812 (Spain, Member)
    edited September 2019

    Hi @Tiffany_at_Broad,

    Thanks for your reply. I stopped using the read orientation model on this dataset but still, I think it would be a good idea to check what the tools are doing when run on single-end reads. I get F1R2/F2R1 data until the very end (until the unfiltered/filtered vcfs, in which there is an estimation of the number of reads in each orientation for every variant detected, which makes no sense at all on this dataset).

    Regarding my questions about how to handle this sort of dataset with GATK, off the top of my head:

    1. Whether or not to apply BaseRecalibrator considering the small size of these gene panels (around 20-30 kb).

    2. Whether or not to run the CalculateContamination tool, for the same reason: in these small panels, fewer than 10 variants from the population-AF file are used to calculate contamination. I must say, however, that the tool seems to work fine: it detects contamination in reference samples from mixed cell lines while yielding low contamination figures when applied to data from "real" samples.

    3. What is the best configuration of the Mutect2 tool for this sort of data? Please also note that I reported in another post, currently unanswered, that it is impossible to disable or adjust the MappingQualityFilter in Mutect2. This is important because the MQ in the BAMs produced with this technology seems to be lower than that of more common technologies (e.g. Illumina), and therefore several very relevant regions are ignored, contributing to a high false negative rate.

    4. Would it be a good idea to generate a PoN for this sort of data? This is messy data coming from FFPE samples (with the expected FFPE artifacts C>T, G>A) and has lots of artifactual indels in homopolymers, in both cases reaching 5% AF or even higher in the worst scenarios (i.e. the worst-quality FFPEs). Since the orientation model cannot be applied here (single-end reads), I am considering other approaches, such as identifying those variants that appear in more than 75% of the samples analyzed and using them to build a "black list" with which to clean up this mess a little bit.

    5. What approach would be the best for CNV detection (mainly detection of amplifications) in this sort of data?
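
The "black list" idea in item 4 can be sketched as follows. This is an illustrative outline in Python, not a GATK tool or a replacement for a PoN: `build_blacklist` is a hypothetical helper, and it assumes the calls have already been parsed into (chrom, pos, ref, alt) tuples per sample. The 75% threshold comes from the post; everything else is an assumption.

```python
from collections import Counter


def build_blacklist(calls_per_sample, min_fraction=0.75):
    """Hypothetical helper: collect variants seen in more than
    `min_fraction` of the samples.

    calls_per_sample: dict mapping sample name -> iterable of
    (chrom, pos, ref, alt) tuples (assumed to be pre-parsed from VCFs).
    """
    n_samples = len(calls_per_sample)
    if n_samples == 0:
        return set()
    counts = Counter()
    for variants in calls_per_sample.values():
        counts.update(set(variants))  # count each variant once per sample
    # Strictly "more than" the threshold, as described in the post.
    return {v for v, c in counts.items() if c / n_samples > min_fraction}
```

The resulting set could then be used to flag, rather than silently drop, recurrent calls in each sample, so that a true hotspot mutation that happens to be frequent in the cohort can still be reviewed by hand.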

    Just as a reminder, we are talking about data from small panels (20-30 kb overall size) used for detecting somatic variants in cancer hotspots, target enrichment by amplicons, depth around 1000-2000x mean coverage, sequencing with SBS-single nucleotide addition technology (detection through CMOS semiconductor chips), single-end reads (sized ~120 nt), and samples being tumor DNA from FFPE blocks without a normal counterpart.

    Hope this will be useful. I will probably post some more questions here regarding this issue.

    Thanks again for your support.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Thanks @mack812 ! I will pass this along to the team.
