SNP in promoter region

barbarianbarbarian JapanMember

Hai, I have Exome-seq and RNA-seq data and I try to find SNP in those sample. I know that the SNP is not in the gene itself but in the promoter region. From what I know, Exome-seq and RNA-seq do not cover promoter region. What is your suggestion about this? Probably you can share some expreience how to find SNP in promoter regiuon with RNA-seq and Exome-seq. data Thank you.

Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @barbarian

    For exome, it depends on whether you have sequence coverage at the sites that you are interested in. Often the capture intervals extend beyond the limits of the exons. You should do a coverage analysis to find out if that is the case. If so, then it is no problem, just adjust the analysis intervals to include your sites of interest. If you do not have sequence coverage there, then there is no way to do what you want -- you will need to obtain more sequencing data.

    For RNAseq, you will probably not have any sequence coverage in the promoter region.

  • barbarianbarbarian JapanMember

    Hello Mrs. Geraldine,

    So, I have checked the alignment for both RNA-seq and Exome-seq (same sample) and I noticed that the coverage outside exon in Exome-seq is almost 0 but I could find a little coverage (beween 0-20 reads) in RNA-seq data. I decided to try using RNA-seq data for checking the SNP in the promoter region. My workflow is like this:

    1. aligned to Ensemble Hg38 with tophat2/bowtie2
    2. dedup with samtools
    3. grouping and reorder with picard
    4. BaseRecalibrator, my command
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -knownSites /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -R /home/local/EnsembleHG38/primary.fa -I 34.dedup.grouped.ordered.bam --filter_reads_with_N_cigar -o 34.table
    5. PrintReads, my command
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T PrintReads -R /home/local/EnsembleHG38/primary.fa -I 34.dedup.grouped.ordered.bam -BQSR 34.table -o 34_recal_reads.bam --filter_reads_with_N_cigar
    6. HaplotypeCaller, my command
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels.g.vcf -bamout 34.bamout.bam

    And then, I tried to display it in IGV and compare between 34_recal_reads.bam, 34.bamout.bam, and 34.snps.indels.g.vcf. This is the screenshot

    It's strange that the bamout result becomes really low in coverage and the g.vcf file seems really loaded. What do you think about it? I guessed I messed up in defining the interval because I didn't run the IndelRealigner step. But still it strange that most of the exon coverage is missing in the bamout. In original bam, the exon area has quite good amount of reads. Thank you for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @barbarian,

    You're most probably losing reads because of the --filter_reads_with_N_cigar option. If you're working with RNAseq data you need to follow our Best Practices for RNAseq, which is a slightly different workflow and includes steps to handle reads with N cigars appropriately.

  • barbarianbarbarian JapanMember

    Hello,

    Thank you for your suggestion. I thought the workflow is the same and I skipped that part. I have done the N cigars step and redone the data processing,but it still the same. The screenshot is here:

    https://dropbox.com/s/hhigaezmo6kdgzd/screen_shot_intervals.png?dl=0

    For that result, I use RealignerTarget creator to generate an interval file and I use HaplotypeCaller with that intervals. I use variation VCF file from here : ftp://ftp.ensembl.org/pub/release-80/variation/vcf/homo_sapiens/. I use 1000genomes phase 3 because the file that I download from GATK is not compatible. I used Hg38 build for my alignment. Where do you think I made a mistake? Thank you for your suggestion.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can you post the command lines you used for the data processing? And can you show before/after screenshots showing the difference in coverage? Finally, did you check what proportion of the coverage is constituted of low-quality mapped reads and duplicates?

  • barbarianbarbarian JapanMember

    This is my command after I aligned, deduped, and sorted
    1. NCigar
    java -jar /opt/GATK/GenomeAnalysisTK.jar -T SplitNCigarReads -R ../../EnsembleHG38/primary.fa -I SRR1656634.dedup.grouped.ordered.bam -o 34.splitted.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

    1. BaseRecalibrator
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -knownSites /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -R /home/local/EnsembleHG38/primary.fa -I 34.splitted.bam -o 34.table

    2. PrintReads
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T PrintReads -R /home/local/EnsembleHG38/primary.fa -I SRR1656634.dedup.grouped.ordered.bam -BQSR 34.table -o 34_recal_reads.bam

    3. HaplotypeCaller (without interval)
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels.g.vcf -bamout 34.bamout.bam

    4. RealignerTargetCreator
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/local/EnsembleHG38/primary.fa -I 34.splitted.bam --known /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.intervals

    5. HaplotypeCaller (with interval)
      java -jar /opt/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/local/EnsembleHG38/primary.fa -I 34_recal_reads.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp /home/local/gatk_resource/hg38/1000GENOMES-phase_3.vcf -o 34.snps.indels2.g.vcf -bamout 34.bamout2.bam -L 34.intervals

    Result comparison
    Without interval
    image

    With interval
    image

    For this :
    "Finally, did you check what proportion of the coverage is constituted of low-quality mapped reads and duplicates?"

    I don't think I do that. The coverage in promoter region is really low, probably between 0-20. Mostly around 1-10 and some part has 0 coverage. I know this is not good, but I don't think I have choice to filter low coverage.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @barbarian
    Hi,

    Can you check whether the reads in the region have low mapping quality or low base quality scores? Also, please check if any of the reads are duplicates. Unfortunately, if the reads have low mapping quality or low mapping quality, there are automatically filtered out by Haplotype Caller. This is good however, because you don't want to get false positive calls from those reads.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @barbarian, I don't understand your workflow at all. Why are you running HaplotypeCaller on the intervals from RealignerTargetCreator? That makes no sense.

  • barbarianbarbarian JapanMember

    My normal work flow is the one without intervals. The step I tried using interval because I don't understand the parameter -L from the sample. It said in the CommandLineGATK guide L is like this: One or more genomic intervals over which to operate
    So, I think it uses the intervals from RealignerTargetCreator.

  • barbarianbarbarian JapanMember

    @Sheila said:
    barbarian
    Hi,

    Can you check whether the reads in the region have low mapping quality or low base quality scores? Also, please check if any of the reads are duplicates. Unfortunately, if the reads have low mapping quality or low mapping quality, there are automatically filtered out by Haplotype Caller. This is good however, because you don't want to get false positive calls from those reads.

    -Sheila

    Hello,

    Thank you for your reply. I don't think this is caused by low base quality because in the data, reads in exon regions have quite good quailty but it also removed in the bamout file. Compare to the input, which is bam file after BaseRecalibrator and PrintReads, many of reads in the exon region is removed in the bamout file. I think in the bam file after PrintReads is quite good and there are no problem. I don't know why the HaplotypeCaller step remove many exon reads and resulting many indels in the vcf file result.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh I see. No, the intervals from RTC are for passing to IndelRealigner only. The intervals you provide will -L are the exome capture intervals if you're doing exome sequencing, or other intervals of interest that you may have.

    Unfortunately if you don't have enough god coverage you simply won't be able to do this analysis. You need to obtain more/better sequencing data.

  • barbarianbarbarian JapanMember

    @Geraldine_VdAuwera said:
    Oh I see. No, the intervals from RTC are for passing to IndelRealigner only. The intervals you provide will -L are the exome capture intervals if you're doing exome sequencing, or other intervals of interest that you may have.

    Unfortunately if you don't have enough god coverage you simply won't be able to do this analysis. You need to obtain more/better sequencing data.

    Yah you are right. Actually I feel that it is just impossible to detect SNP in the promoter regions with such low coverage. So, I just tried to finish my analysis process so that at least I can get SNP in the exon region which is has good coverage. From some paper, I know there are several SNP which is occur in the first exon so probably I can at least report those findings. Tht is why I confuse why the reads in exon region is removed in the bamout after HaplotypeCaller even though I'm sure the coverage is quite good for the first exon. I don't know the exact number but it is between 20-40 reads in the middle part of exon and 10-20 reads in the edge part.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, well the first picture you posted is not showing, so I can't comment on that, but fyi the way the bamout system works is that it will only show reads in regions where it called a variant. There are some arguments you can use to force it to output all the reads even outside of the reported active region. Have a look at the Guide section called Common Problems; there is one article (I forget the title) that may be useful; and there is a tutorial that explains how to use the special arguments in practice, see https://www.broadinstitute.org/gatk/guide/article?id=5484

Sign In or Register to comment.