Reference Genome Components

shleeshlee CambridgeMember, Broadie, Moderator
edited March 28 in Dictionary

image This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.

GRCh38/hg38 is the assembly of the human genome released December of 2013, that uses alternate or ALT contigs to represent common complex variation, including HLA loci. Alternate contigs are also present in past assemblies but not to the extent we see with GRCh38. Much of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.

The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.

Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.

  • For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y. Confirm the set you are using by viewing a PAR region of the Y chromosome on IGV as shown in the figure below. The chrY location of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415.
    image
    The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent soft-masked sequence corresponding to repeats from RepeatMasker and Tandem Repeats Finder.

  • The GRCh38 analysis sets also include a contig to siphon off reads corresponding to the Epstein-Barr virus sequence as well as decoy contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with EBV transformation, as well as capture endogenous EBV sequence as EBV naturally infects B cells in ~90% of the world population. Heng Li provides the decoy contigs.


Nomenclature: words to describe components of reference genomes

  • A contig is a contiguous sequence without gaps.

  • Alternate contigs, alternate scaffolds or alternate loci allow for representation of diverging haplotypes. These regions are too complex for a single representation. Identify ALT contigs by their _alt suffix.

    The GRCh38 ALT contigs total 109Mb in length and span 60Mb of the primary assembly. Alternate contig sequences can be novel to highly diverged or nearly identical to corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly only contribute a few million bases. Most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.

  • Primary assembly refers to the collection of (i) assembled chromosomes, (ii) unlocalized and (iii) unplaced sequences. It represents a non-redundant haploid genome.

    (i) Assembled chromosomes for hg38 are chromosomes 1–22 (chr1chr22), X (chrX), Y (chrY) and Mitochondrial (chrM).
    (ii) Unlocalized sequence are on a specific chromosome but with unknown order or orientation. Identify by _random suffix.
    (iii) Unplaced sequence are on an unknown chromosome. Identify by chrU_ prefix.

  • PAR stands for pseudoautosomal region. PAR regions in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or pseudo-autosomal sequence region, the X and Y chromosome sequences are intentionally identical in the genome assembly. Analysis set genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions.

  • Different assemblies shift coordinates for loci and are released infrequently. Hg19 and hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly. In the special case of hg19 and GRCh37, the primary assembly coordinates are identical for loci but patch updates differ. Also, the naming conventions of the references differ, e.g. the use of chr1 versus 1 to indicate chromosome 1, such that these also require lift-over to compare data. GRCh38/hg38 unifies the assemblies and the naming conventions.

  • Patches are regional fixes that are released periodically for a given assembly. GRCh38.p7 indicates the seventh patched minor release of GRCh38. This NCBI page explains in more detail. Patches add information to the assembly without disrupting the chromosome coordinates. Again, they improve representation without affecting chromosome coordinate stability. The two types of patches, fixed and novel, represent different types of sequence.

    (i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
    (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.


The GATK perspective on reference genomes

Within GATK documentation, Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.

GATK tools allow for use of a genomic intervals list that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

  • For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
  • When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
  • However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

     -L HLA-A*01:01:01:01:1+
    

Viewing CRAM alignments on genome browsers

Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hashtag values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a disjoint between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.

Should I switch to a newer reference?

Yes you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.

The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing and add additional alternate haplotypes and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing.

Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims and weigh this against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.


External resources

  1. New 11/16/2016 For a brief history and discussion on challenges in using GRCh38, see the 2015 Genome Biology article Extending reference assembly models by Church et al. (DOI: 10.1186/s13059-015-0587-3).
  2. For press releases highlighting improvements in GRCh38 from December 2013, see http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/ and http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html. The latter post summarizes major improvements, including the correction of thousands of SNPs and indels in GRCh37 not seen in the population and the inclusion of synthetic centromeric sequence.
  3. Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information. See these pages for download and installation instructions.
  4. The Genome Reference Consortium (GRC) provides human, mouse, zebrafish and chicken sequences, and this particular webpage gives an overview of GRCh38. Namely, an interactive chromosome ideogram marks regions with corresponding alternate loci, regions with fix patches and regions containing novel patches. For additional assembly terminology, see http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml.
  5. The UCSC Genome Browser allows browsing and download of genomes, including analysis sets, from many different species. For more details on the difference between GRCh38 reference and analysis sets, see ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt and ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt, respectively. In addition, the site provides annotation files, e.g. here is the annotation database for GRCh38. Within this particular page, the file named gap.txt.gz catalogues the gapped regions of the assembly full of Ns. For our illustration above, the corresponding region in this file shows:

        585    chr14    0    10000    1    N    10000    telomere    no
        1    chr14    10000    16000000    2    N    15990000    short_arm    no
        707    chr14    16022537    16022637    4    N    100    contig    no
    
  6. The Integrative Genomics Viewer is a desktop application for viewing genomics data including alignments. The tool accesses reference genomes you provide via file or URL or that it hosts over a server. The numerous hosted reference genomes include GRCh38. See this page for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to Genomes>Load Genome From Server. A menu lists genomes you can make available in the main genome dropdown menu.


Post edited by shlee on

Comments

  • jingmengjingmeng AustraliaMember

    Thanks very much for the post. It is said that in the analysis set of GRCh38, duplicate copies of centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR regions on chromosome Y are hard-masked. I am not sure of the meaning of duplicate copies of repeat arrays. I know a repeat array is that some nucleotides are repeated several times, for example, ATTCGGATTCGGATTCGG (ATTCGG is repeated three times). Then what are duplicate copies for this case? Could you please provide some information (coordinate and nucleotide) of the hard-masked regions in the analysis set? And could you please explain to me why hard-masking these regions can result optimal mapping? For variant calling on the repeat and low complexity regions, as the repeat regions are hard-masked and ignored, no variant is predicted in these regions and it will result in sensitivity decreasing. Is my guess right? Look forward to hearing from you.

    Issue · Github
    by shlee

    Issue Number
    1609
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @jingmeng,

    We'll get to your question soon.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited January 2017

    Hi @jingmeng,

    Yes, indeed I do phrase it as you say and I agree this could be worded better as the current wording is confusing, especially given folks study actual copy number variants and this is not what I mean.

    For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, duplicate copies of centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR regions on chromosome Y.

    To be more precise, here's what I've come up with and I hope this clarifies your questions.

    For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y.

    As for what types of repeat arrays the genome masks, this is more detail than our current user base needs. However, if you find this is important to current genomics research, please let us know. For now, you can read up on general repeat array types in this wikipedia entry and satellite arrays in this wiki entry. It's my understanding, and perhaps someone in the community can clarify this, that masking these regions allows for better short read mapping to the reference. It's my impression that those in the field have come to this conclusion empirically. If understanding this further is important to your research, then I can ask those in the know here and get back to you in a week or two.

    As for variant calling on repeat and low complexity regions, well, with current technology this seems (in my opinion) an exercise in futility. I fathom such regions are subject to high rates of homologous recombination and polymerase slippage so I suspect variability between people will be high. Also, how can we effectively, by orthogonal methods, confirm the validity of the sequence in such regions? In other words, how can we determine the truth to verify our short-read sequence alignments? The assumptions we use for 4 nucleotides and paired short read sequences cannot apply equally well to regions that only use 1-3 nucleotides. Jingmeng, how much would you trust any variation reported for such regions with the current sequencing technology?

    As for the actual coordinates of the masked repeat arrays, I refer you to reference 5 of the external resources. The gap.txt.gz file tells you exactly which regions are gaps, i.e. hard-masked by Ns. Also, this GRC page has a subtab called Gaps. If you click on it, they show two different types of gaps, spanned gaps and unspanned gaps, which they then define further and show metrics for within GRCh38. You can click on column headers to sort, as I show in the attached screenshots.

    I hope this is helpful and thanks for the feedback. Cheers.

  • mmokrejsmmokrejs Czech RepublicMember

    This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping
    quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.

    I am just guessing what you describe but I also realized I have lost valid read pairs because one read was mapped to say chr11 and the other mate to chr11__XXXX_alt. My 'bwa mem' run was likely not aware of me serving it alternate contigs and discarded the read pair during mate-pair distance check. The reads were assigned mapQ=0. I believe this is what you are talking about. However, I think it is a consequence of me running 'bwa mem' in a wrong way. Read further below.

    Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information.
    See these pages for download and installation instructions.

    Unfortunately the bwa docs do not explain How to make bwa ALT-aware. Below is the way how I call 'bwa mem'.

    bwa mem -t 16 -p -M -R '@RG\tID:XXXX-PB\tPL:ILLUMINA\tLB:S04380110 Agilent SureSelect Human All Exon V5\tSM:XXXX-PB' ftp.broadinstitute.org/bundle/hg38/
    hg38bundle/Homo_sapiens_assembly38.fasta.gz dedup.realigned.pairs.fastq | \
    samtools view [email protected] 15 -T ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz -Sb -o XXXX-PB.bwa.pairs.bam
    [M::bwa_idx_load_from_disk] read 0 ALT contigs

    I believe the above "read 0 ALT contigs" means my bwa needed some additional *.alt file along the indexed reference which is not present in hg38bundle and I did not "just cherry-pick?" the file on my own inside the hg38bundle/ directory .

    bwa.kit/run-gen-ref hs38DH
    

    See:
    https://sourceforge.net/p/bio-bwa/mailman/bio-bwa-help/thread/[email protected]/#msg33179172
    https://github.com/lh3/bwa/tree/master/bwakit
    https://github.com/lh3/bwa/blob/master/bwakit/run-gen-ref

  • mmokrejsmmokrejs Czech RepublicMember
    edited February 2017

    $ bwa.kit/run-gen-ref hs38DH
    [bwa_index] Pack FASTA... 36.89 sec
    [bwa_index] Construct BWT for the packed sequence...
    [BWTIncCreate] textLength=6434693834, availableWord=464768632
    ...
    [bwt_gen] Finished constructing BWT in 711 iterations.
    [bwa_index] 4178.04 seconds elapse.
    [bwa_index] Update BWT... 26.76 sec
    [bwa_index] Pack forward-only FASTA... 26.02 sec
    [bwa_index] Construct SA from BWT and Occ... 1381.03 sec
    [main] Version: 0.7.15-r1140
    [main] CMD: bwa index hs38DH.fa
    [main] Real time: 5648.259 sec; CPU: 5647.889 sec
    ...
    -rw------- 1 mmokrejs mmokrejs 487553 Jan 31 22:33 hs38DH.fa.alt
    -rw------- 1 mmokrejs mmokrejs 3263690197 Jan 31 22:33 hs38DH.fa
    -rw------- 1 mmokrejs mmokrejs 3217347004 Jan 31 23:44 hs38DH.fa.bwt
    -rw------- 1 mmokrejs mmokrejs 804336731 Jan 31 23:44 hs38DH.fa.pac
    -rw------- 1 mmokrejs mmokrejs 455474 Jan 31 23:44 hs38DH.fa.ann
    -rw------- 1 mmokrejs mmokrejs 20199 Jan 31 23:44 hs38DH.fa.amb
    -rw------- 1 mmokrejs mmokrejs 1608673512 Feb 1 00:07 hs38DH.fa.sa
    $

    Basically, it seems I could have just copied&renamed hs38DH.fa.alt from bwa.kit/resource-GRCh38/ into the hg38bundle/ directory and re-created 'bwa index'.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @mmokrejs,

    Please check out Tutorial#8017. It's a long tutorial with details that address many of your questions in the posts above. There are links within the document to the GRCh38 resources that alt-aware mapping requires. After reading the tutorial, if you have specific questions on the contents of the tutorial, then please let us know.

  • shleeshlee CambridgeMember, Broadie, Moderator
    • Picard has a tool, ScatterIntervalsByNs, that can help you create intervals between gapped regions.
    • The tool to use to create an intersection of intervals is Picard's IntervalListTools.
  • oskarvoskarv BergenMember

    @shlee said:

    • Picard has a tool, ScatterIntervalsByNs, that can help you create intervals between gapped regions.
    • The tool to use to create an intersection of intervals is Picard's IntervalListTools.

    I downloaded the GRCh38-p10 version from ftp://ftp.ensembl.org/pub/release-88/fasta/homo_sapiens/dna/ and tried to create an interval list for scatter gather use, but it doesn't work.
    Here's the command and the error message:

    java -jar picard.jar ScatterIntervalsByNs R=Homo_sapiens.GRCh38.dna.primary_assembly.fa O=output.interval_list
    [Fri Apr 21 15:33:17 CEST 2017] picard.util.ScatterIntervalsByNs REFERENCE=Homo_sapiens.GRCh38.dna.primary_assembly.fa OUTPUT=output.interval_list    OUTPUT_TYPE=BOTH MAX_TO_MERGE=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Fri Apr 21 15:33:17 CEST 2017] Executing as [email protected] on Linux 4.4.0-70-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
    [Fri Apr 21 15:33:17 CEST 2017] picard.util.ScatterIntervalsByNs done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=251658240
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.NullPointerException
            at picard.util.ScatterIntervalsByNs.segregateReference(ScatterIntervalsByNs.java:141)
            at picard.util.ScatterIntervalsByNs.doWork(ScatterIntervalsByNs.java:107)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
    

    I tried with picard 2.9 too but it gave the same error message. The reference file on your ftp works as well as this one: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh38.p10.genome.fa.gz, but the one I need to use is the absolute latest one. What could be wrong with it?

    Issue · Github
    by Sheila

    Issue Number
    2000
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @oskarv,

    Both GRCh38.p10.genome.fa (Sanger) and Homo_sapiens.GRCh38.dna.primary_assembly.fa (ENSEMBL) files you give work fine in my hands with ScatterIntervalsByNs. Be sure to create fresh index (samtools faidx) and dictionary (Picard CreateSequenceDictionary) files before running ScatterIntervalsByNs.

    One thing I notice is that the two FASTAs annotate and sort contigs differently:

    WMCF9-CB5:~ shlee$ gzcat ~/Downloads/GRCh38.p10.genome.fa.gz | grep '>'
    >chr1 1
    >chr2 2
    
    WMCF9-CB5:~ shlee$ gzcat ~/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | grep '>'
    >1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
    >10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
    

    After running the FASTA through CreateSequenceDictionary using the tool's default settings, this translates to each header as follows:

    @HD     VN:1.5
    @SQ     SN:chr1 LN:248956422    M5:2648ae1bacce4ec4b6cf337dcae37816     UR:file:/Users/shlee/Downloads/GRCh38.p10.genome.fa.gz
    @SQ     SN:chr2 LN:242193529    M5:4bb4f82880a14111eb7327169ffb729b     UR:file:/Users/shlee/Downloads/GRCh38.p10.genome.fa.gz
    
    @HD     VN:1.5
    @SQ     SN:1    LN:248956422    M5:2648ae1bacce4ec4b6cf337dcae37816     UR:file:/Users/shlee/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    @SQ     SN:10   LN:133797422    M5:907112d17fcb73bcab1ed1c72b97ce68     UR:file:/Users/shlee/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    

    Be aware of this difference in ordering and nomenclature.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hmm, there appear to be some odd Markdown format interpretations going on with our forum site and elements in codeblocks appear unprotected.

    image

    Should be:
    image

  • Hi @shlee,
    I have some question about human reference genome. What is patch version of GRCh38 in GATK Resource Bundle? The annotation database(such as NCBI GRCh38.p7.RefSeq and dbsnp) are mainly for GRCh38.p7, so is OK if I use Resource Bundle reference for alignment and GRCh38.p7 for annotation? Can this cause some error in snp calling?

    Issue · Github
    by shlee

    Issue Number
    2300
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @zhaoqiang90,

    As far as I know, the GATK Resource Bundle provides only major releases, without any patch updates. Let me get back to you about your second question.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @zhaoqiang90,
    The subsetting of contigs works only one way. It should work for your case where every one of your variant contigs is represented in the resource. The converse may not work however. If the tool encounters a variant on a contig that the resource does not represent, I believe it will error.

    Currently, VariantAnnotator is available only in GATK3. We have plans to port it to GATK4 and it is slated to be available late this year.

  • Hi @shlee
    I still have some doubts.
    1. If I use GRCh38.p7 as my reference genome, is it OK to use the file in GATK resource bundle as my training set at GATK VQSR? If it’s not OK, is that means that I must use the reference genome GATK resource bundle?
    2. If I use GATK resource bundle's human reference genome. In the annotation step(usually SnpEff or Annovar), if I use a annotation database which(for example) support for GRCh38.9p7, it this will cause some error in my final result because of minor version’s difference? Or I should the database that support GRCh38 major release?
    3. As mentioned in GATK page, Exome files and itemized resource list coming soon(ish) for GRCh38. If I don't have these file to assist my WES analysis, would this caused some error in the final result?

  • shleeshlee CambridgeMember, Broadie, Moderator

    @zhaoqiang90,

    The developers tell me that unfortunately, we cannot predict how each tool will handle mismatching contig header lines for input files. So you will have to see how it goes with your analysis. As to the scientific validity of mixing references, this is a question you must answer for yourself.

    I think the bigger issue to consider is whether your alignment settings handle patched information correctly.

    (i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
    (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.

    If a read aligns to a fixed patch better than the primary assembly, which alignment representation does the aligner prioritize? Could the two possible alignments end up with MAPQ 0 and secondary alignment flags? If this is the case, then you get less information for the locus than if you had used just the major release of the reference because our callers ignore secondary alignments. If one of the alignments ends up a supplementary alignment, are the mapping qualities distributed in such a way that allows our callers to use the alignment (e.g. HCMappingQualityFilter)? Did the mate align to the other contig, e.g. one mate on the patch contig and the other mate on the primary assembly? Our workflows, without additional tweaking, currently do not call on reads with such criss-crossed mates.

    Unless there is a strong reason for your research to require the patches, e.g. a specific patch contig representing a drastically altered haplotype important to your sample individuals, and you are equip to deal with the vagaries of using patches, e.g. adding these to the alt contig index file for alt-aware BWA handling or some additional post-alignment-processing, then I think you are better off using the major release version for our workflows.

    As for your questions:
    1. I think it should be okay to train VQSR with the resource bundle file.
    2. I'm not familiar with SnpEff nor Annovar as they are not GATK tools and so I cannot comment on your question. At what step do you annotate, what do you annotate and what are the annotations used for?
    3. You should be using exome target intervals provided by your exome kit provider.

  • markwmarkw 7003AMember, Broadie, Moderator, Dev

    These definitions may be helpful for users looking for genomes sequences from RefSeq:

    RefSeq category - shown if the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification:

    • Reference genome - a manually selected high quality genome assembly that NCBI and the community have identified as being important as a standard against which other data are compared
    • Representative genome - a genome computationally or manually selected as a representative from among the best genomes available for a species or clade that does not have a designated reference genome

    Source: https://www.ncbi.nlm.nih.gov/assembly/help/#glossary

Sign In or Register to comment.