Are filtering and trimming necessary before mapping and SNP calling using GATK?

Hi GATK team,

I am using GATK to call SNPs from whole-genome resequencing data. According to the FastQC report, base quality drops below 20 after 100 bp (the reads are 120 bp), and Illumina Universal Adapter content reaches 5% after 60 bp. I set the minimum base quality and minimum mapping quality to 30 (--min_base_quality_score 30 --min_mapping_quality_score 30) when calling SNPs in GATK. Are these two settings enough to remove low-quality data? Do I need to remove reads with adapter contamination and trim low-quality reads before mapping and SNP calling? Thanks very much for your help.

The FastQC report is attached.

Best regards,
Baosheng

Answers

  • Hi Sheila,
    Many thanks for your reply. I only want to get high-confidence variants. Another question: I have both an unmasked genome assembly and a hard-masked one (repeat sequences were coded as "N"). Which do you think is better for BWA? If I use the unmasked genome, I have to remove SNPs that map to repeat regions after SNP calling. If I use the hard-masked genome, I do not need this step. Is that correct?
    Thanks for your time.

    Best regards,
    Baosheng

  • Sheila (Broad Institute; Member, Broadie)

    @baosheng.wang1gmail.com
    Hi Baosheng,

    In your case, you will get high confidence calls in the regions where you have good coverage and good quality reads/bases. If you have evenly distributed coverage, you may be able to get high confidence calls even though you have low quality bases at the ends of your reads. For example, some positions may be covered by a few low quality end-of-read bases, but they may also have many good quality middle-of-read bases covering them.

    "If I use the unmasked genome, I have to remove SNPs mapped on the repeat regions, after calling SNP."

    I am not sure I understand this. Why would you remove SNPs in repeat regions? What do you mean by repeat regions? Regions that are homologous to other regions? Or, regions that have many repeated bases?

    Thanks,
    Sheila
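
As a toy illustration of the point above about low-quality end-of-read bases and good-quality middle-of-read bases covering the same position, the sketch below counts how many bases at one position would survive a minimum base quality threshold of 30. The function name and the Phred scores are hypothetical, and this is not how GATK applies its filters internally; it only shows why a position can keep usable depth even when some of the bases covering it are poor.

```python
def usable_depth(base_quals, min_base_quality=30):
    """Count bases at one position whose Phred quality meets the threshold."""
    return sum(1 for q in base_quals if q >= min_base_quality)


# Hypothetical Phred scores for the bases covering a single position:
# three low-quality end-of-read bases and seven good middle-of-read bases.
quals_at_position = [12, 15, 18, 35, 37, 38, 36, 40, 33, 39]

print(usable_depth(quals_at_position))  # 7 of the 10 bases pass the threshold
```

With even coverage, most positions would look like this one: a few bases are discarded, but enough good bases remain to support a call.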

  • Hi Sheila,
    By "repeat region" I mean a sequence that has more than one copy in the genome (some variation may exist among the copies), either in tandem or dispersed. SNPs called from these regions may not be accurate, since reads from paralogs may be treated as sequences from a single locus.

    Thanks!
    Baosheng

  • SkyWarrior (Turkey; Member)

    That is exactly why people do paired end sequencing...

  • Sheila (Broad Institute; Member, Broadie)

    @baosheng.wang1gmail.com
    Hi Baosheng,

    I see. @SkyWarrior is correct that paired-end sequencing can help in repeat regions; however, it may not help in 100% of cases. Have a look at this dictionary entry for more information.

    I think it is best to use the unmasked genome in your case. You can tell whether there are mapping issues by checking whether coverage in those repeat regions is lower than in the non-repeat regions. Are you not interested in the other regions that are masked? How many regions are "masked" in your masked genome?

    Thanks,
    Sheila
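
The coverage comparison suggested above could be sketched roughly as follows. This is a minimal example under stated assumptions, not a GATK feature: it assumes you already have per-base depth for one contig (for example, exported from a depth tool such as samtools depth) and the repeat annotation as 0-based, half-open intervals in BED style. The function name and the numbers are hypothetical.

```python
def mean_depth_by_region(depths, repeat_intervals):
    """Return (mean depth inside repeats, mean depth outside repeats).

    depths: per-base depth for one contig; list index = 0-based position.
    repeat_intervals: (start, end) pairs, 0-based and half-open (BED style).
    """
    in_repeat = [False] * len(depths)
    for start, end in repeat_intervals:
        # Clip each interval to the contig so bad coordinates cannot overflow.
        for pos in range(max(start, 0), min(end, len(depths))):
            in_repeat[pos] = True

    repeat_depths = [d for d, r in zip(depths, in_repeat) if r]
    other_depths = [d for d, r in zip(depths, in_repeat) if not r]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(repeat_depths), mean(other_depths)


# Hypothetical depths for an 8 bp contig with one annotated repeat at [3, 6).
depths = [30, 31, 29, 10, 12, 8, 30, 32]
repeat_mean, other_mean = mean_depth_by_region(depths, [(3, 6)])
print(repeat_mean, other_mean)  # 10.0 30.4
```

A clearly lower mean inside the annotated repeats, as in this made-up example, would be the kind of signal described above.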

  • Hi Sheila,
    Thanks for your suggestion. About 40% of the genome is masked, and I am not focusing on those regions in the current project. However, you are right that it would be interesting to compare sequencing coverage between masked and unmasked regions.

    SkyWarrior, thanks for your comment!

    Best regards,
    Baosheng
