Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
SNP in promoter region

Hai, I have Exome-seq and RNA-seq data and I try to find SNP in those sample. I know that the SNP is not in the gene itself but in the promoter region. From what I know, Exome-seq and RNA-seq do not cover promoter region. What is your suggestion about this? Probably you can share some expreience how to find SNP in promoter regiuon with RNA-seq and Exome-seq. data Thank you.
Answers
Hi @barbarian
For exome, it depends on whether you have sequence coverage at the sites that you are interested in. Often the capture intervals extend beyond the limits of the exons. You should do a coverage analysis to find out if that is the case. If so, then it is no problem, just adjust the analysis intervals to include your sites of interest. If you do not have sequence coverage there, then there is no way to do what you want -- you will need to obtain more sequencing data.
For RNAseq, you will probably not have any sequence coverage in the promoter region.
Hello Mrs. Geraldine,
So, I have checked the alignment for both RNA-seq and Exome-seq (same sample) and I noticed that the coverage outside exon in Exome-seq is almost 0 but I could find a little coverage (beween 0-20 reads) in RNA-seq data. I decided to try using RNA-seq data for checking the SNP in the promoter region. My workflow is like this:
java -jar /opt/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -knownSites /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -R /home/local/EnsembleHG38/primary.fa -I 34.dedup.grouped.ordered.bam --filter_reads_with_N_cigar -o 34.table
java -jar /opt/GATK/GenomeAnalysisTK.jar -T PrintReads -R /home/local/EnsembleHG38/primary.fa -I 34.dedup.grouped.ordered.bam -BQSR 34.table -o 34_recal_reads.bam --filter_reads_with_N_cigar
java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels.g.vcf -bamout 34.bamout.bam
And then, I tried to display it in IGV and compare between 34_recal_reads.bam, 34.bamout.bam, and 34.snps.indels.g.vcf. This is the screenshot
It's strange that the bamout result becomes really low in coverage and the g.vcf file seems really loaded. What do you think about it? I guessed I messed up in defining the interval because I didn't run the IndelRealigner step. But still it strange that most of the exon coverage is missing in the bamout. In original bam, the exon area has quite good amount of reads. Thank you for your help.
Hi @barbarian,
You're most probably losing reads because of the
--filter_reads_with_N_cigar
option. If you're working with RNAseq data you need to follow our Best Practices for RNAseq, which is a slightly different workflow and includes steps to handle reads with N cigars appropriately.Hello,
Thank you for your suggestion. I thought the workflow is the same and I skipped that part. I have done the N cigars step and redone the data processing,but it still the same. The screenshot is here:
https://dropbox.com/s/hhigaezmo6kdgzd/screen_shot_intervals.png?dl=0
For that result, I use RealignerTarget creator to generate an interval file and I use HaplotypeCaller with that intervals. I use variation VCF file from here : ftp://ftp.ensembl.org/pub/release-80/variation/vcf/homo_sapiens/. I use 1000genomes phase 3 because the file that I download from GATK is not compatible. I used Hg38 build for my alignment. Where do you think I made a mistake? Thank you for your suggestion.
Can you post the command lines you used for the data processing? And can you show before/after screenshots showing the difference in coverage? Finally, did you check what proportion of the coverage is constituted of low-quality mapped reads and duplicates?
This is my command after I aligned, deduped, and sorted
1. NCigar
java -jar /opt/GATK/GenomeAnalysisTK.jar -T SplitNCigarReads -R ../../EnsembleHG38/primary.fa -I SRR1656634.dedup.grouped.ordered.bam -o 34.splitted.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
BaseRecalibrator
java -jar /opt/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -knownSites /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -R /home/local/EnsembleHG38/primary.fa -I 34.splitted.bam -o 34.table
PrintReads
java -jar /opt/GATK/GenomeAnalysisTK.jar -T PrintReads -R /home/local/EnsembleHG38/primary.fa -I SRR1656634.dedup.grouped.ordered.bam -BQSR 34.table -o 34_recal_reads.bam
HaplotypeCaller (without interval)
java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels.g.vcf -bamout 34.bamout.bam
RealignerTargetCreator
java -jar /opt/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/local/EnsembleHG38/primary.fa -I 34.splitted.bam --known /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.intervals
HaplotypeCaller (with interval)
java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels2.g.vcf -bamout 34.bamout2.bam -L 34.intervals
Result comparison

Without interval
With interval

For this :
"Finally, did you check what proportion of the coverage is constituted of low-quality mapped reads and duplicates?"
I don't think I do that. The coverage in promoter region is really low, probably between 0-20. Mostly around 1-10 and some part has 0 coverage. I know this is not good, but I don't think I have choice to filter low coverage.
@barbarian
Hi,
Can you check whether the reads in the region have low mapping quality or low base quality scores? Also, please check if any of the reads are duplicates. Unfortunately, if the reads have low mapping quality or low mapping quality, there are automatically filtered out by Haplotype Caller. This is good however, because you don't want to get false positive calls from those reads.
-Sheila
@barbarian, I don't understand your workflow at all. Why are you running HaplotypeCaller on the intervals from RealignerTargetCreator? That makes no sense.
My normal work flow is the one without intervals. The step I tried using interval because I don't understand the parameter -L from the sample. It said in the CommandLineGATK guide L is like this: One or more genomic intervals over which to operate
So, I think it uses the intervals from RealignerTargetCreator.
Hello,
Thank you for your reply. I don't think this is caused by low base quality because in the data, reads in exon regions have quite good quailty but it also removed in the bamout file. Compare to the input, which is bam file after BaseRecalibrator and PrintReads, many of reads in the exon region is removed in the bamout file. I think in the bam file after PrintReads is quite good and there are no problem. I don't know why the HaplotypeCaller step remove many exon reads and resulting many indels in the vcf file result.
Oh I see. No, the intervals from RTC are for passing to IndelRealigner only. The intervals you provide will -L are the exome capture intervals if you're doing exome sequencing, or other intervals of interest that you may have.
Unfortunately if you don't have enough god coverage you simply won't be able to do this analysis. You need to obtain more/better sequencing data.
Yah you are right. Actually I feel that it is just impossible to detect SNP in the promoter regions with such low coverage. So, I just tried to finish my analysis process so that at least I can get SNP in the exon region which is has good coverage. From some paper, I know there are several SNP which is occur in the first exon so probably I can at least report those findings. Tht is why I confuse why the reads in exon region is removed in the bamout after HaplotypeCaller even though I'm sure the coverage is quite good for the first exon. I don't know the exact number but it is between 20-40 reads in the middle part of exon and 10-20 reads in the edge part.
OK, well the first picture you posted is not showing, so I can't comment on that, but fyi the way the bamout system works is that it will only show reads in regions where it called a variant. There are some arguments you can use to force it to output all the reads even outside of the reported active region. Have a look at the Guide section called Common Problems; there is one article (I forget the title) that may be useful; and there is a tutorial that explains how to use the special arguments in practice, see https://www.broadinstitute.org/gatk/guide/article?id=5484