If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
(How to) Simulate reads using a reference genome ALT contig
This tutorial shows how to generate simulated reads against a specific target sequence. This can be useful, e.g. if you want to simulate reads for an alternate contig in GRCh38/hg38 to see how they end up mapping between the primary assembly versus the alternate contig.
We use external tools to accomplish this. In Section 1, we use Samtools to subset the target contig sequence from a reference FASTA file. In Section 2, we use wgsim to generate FASTQ format paired reads against the target contig. The resulting read data is ready for alignment.
This tutorial provides example data for you to follow along and includes a mini-reference FASTA. If you are unfamiliar with terms that describe reference genome components, take a few minutes to study the Dictionary entry Reference Genome Components.
Prerequisites and tools involved
This tutorial uses external tools that may require additional dependencies, e.g. the gcc compiler, that may not be available by default on your system.
After downloading wgsim, follow instructions to compile. For v0.3.0, the command is as follows.
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
Samtools. See Tutorial#2899 for installation instructions.
Download example data
tutorial_7859.tar.gz (GoogleDrive; ftp site) contains six files. A mini-reference
.faiindex, the subset
The mini-reference contains two contigs subset from human GRCh38/hg38:
chr19_KI270866v1_alt. In the tutorial, we simulate reads for the 43,156 bp ALT contig. The ALT contig corresponds to a diverged haplotype of chromosome 19. Specifically, it corresponds to chr19:34350807-34392977, which contains the glucose-6-phosphate isomerase or GPI gene. Part of the ALT contig introduces novel sequence that lacks a corresponding region in the primary assembly.
1. Use Samtools to subset target contig sequence from FASTA reference
Each contig in the reference FASTA has a header line beginning with
> that identifies the contig sequence that follows. We need the exact representation of this header line to subset the target contig sequence. The UNIX command below lists all such headers for the FASTA file.
grep '>' chr19_chr19_KI270866v1_alt.fasta
This prints the following for our mini-reference
faidx option of Samtools to subset the ALT contig sequence to a new FASTA file,
samtools faidx chr19_chr19_KI270866v1_alt.fasta chr19_KI270866v1_alt > chr19_KI270866v1_alt.fasta
Optionally introduce variants into reads
To introduce variants into reads, alter the FASTA sequence at this point before simulating reads. For example, to introduce a simple heterozygous SNP, duplicate the contig information within the file, name the duplicate contig differently, and change the base within the duplicated sequence. Search for the target base's sequence context by using TextEdit's Find function. Keep in mind FASTA file sequences contain line breaks.
To generate an alternate FASTA reference based on a VCF of variants, see GATK’s FastaAlternateReferenceMaker.
2. Use wgsim to simulate FASTQ paired reads against the target contig FASTA
Generate simulated reads from
chr19_KI270866v1_alt.fasta with the following command.
wgsim -1151 -2151 -d500 -r0 -e0 -N10000 -R0 -X0 chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq
This gives two FASTQ files,
7859_GPI.read2.fq, one for each mate of the paired reads.
- Each read is 151 bases. Set with
-2151for read1 and read2, respectively.
- The outer distance or insert size is 500 bases with a standard deviation of 50. This is set with the
- The files contain 10K read pairs, and this is set by the
- None of the reads contain indels (
-X0) nor mutations/variants (
- Base quality scales with the value given to
-eso we set it to zero (
-e0) for base quality scores of
I, which is, according to this page and this site, an excellent base quality score equivalent to a Sanger Phred+33 score of 40.
For a 43 kb contig, 10K x 2 x 151 reads should give us ~70x hypothetical coverage. Here are two pairs of reads from
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/1 AGGTATGAGGATCTGGGTCTTCCCGTGTCTGAGTAGGTAGCACCTGGCACAGGTATGAGGATATGGGTCTTCCATGTCTGAGGAGGTAGCACCTGGCACAGATATGAGGATCTGCGTCTTCCAGTGTTTGAGGAGGTGAGTTTGGACTCAG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/1 CACCACTGCTGAGCTCAGGCAAGTGCACAAGGAAAGCTGTGGCTCACTGCTCGGCTCCAGCAGAGGTGGTCCCATGGACCACCTGTTGCTACAGAGGGGTCGGCAGCCCTGTCACTCAAGGCAGGGTTTGCTCTGCAAGCTGCCCCAGCCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/2 AGGGCCAGATCACACCTCCTCAGATATTGACCGACCCAGATCCTTATACCTGCACCAGATCCTACCTCCTCAGGCATTGACAGATCCAGATCCTTATACTTGTGCCAGATCCTACCTCCTTAGACATGGACAGACCCAGATCCTCATACCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2 AGGCCCATGAGGTCAGGTCAGTGTTTATTGAGTACCTGCTGCATACCTAGCTTGGGGAAAGGTAGAGAGGCCCTCAGAGAGGCTTGGAGGGCAAGAGCAACCCAGGCAGGATGAGGGCTCCACTTCCACCTGAGGGCGGGCTGAGCTTGCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
All the bases of all the reads from a simulation have the same base quality, and in this instance each base quality is
I. Notice the read names of the simulated reads contain useful information, e.g the last read name
@chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2 consists of the following.
- input FASTA file name
- the region that the sequence comes from
- sequencing error, substitutions and gaps
0:0:0and the same for the mate
- member pair (0-based indexing in hexadecimal) and mate pair information
- To convert FASTQ to BAM, see Section A of Tutorial#6484.
To align the reads using BWA-MEM (v0.7.15), you can use the following command. Alternatively, see Tutorial#8017.
bwa mem chr19_chr19_KI270866v1_alt.fasta 7859_GPI_alt.read1.fq 7859_GPI_alt.read2.fq > GPI_bwamem.sam
DWGSIM Tutorial, a variant of WGSIM
- This Dictionary entry reviews terminology that describes reference genome components.