GATK with SOLiD data

zzq, China (Member)

Dear all,
I want to call SNPs from SOLiD data (single-end reads, 35 bp) using a recent version of GATK (3.2-2), but RealignerTargetCreator gives me the following errors.
With the -fixMisencodedQual argument:
ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
Without -fixMisencodedQual:
ERROR MESSAGE: SAM/BAM file SAMFileReader{XXXXX} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 62; please see the GATK --help documentation for options related to this error
With the -allowPotentiallyMisencodedQuals argument, it runs without errors.

My full set of commands is as follows:
bfast match -f ref.fa -r 1.fastq -A 1 -n 16 >1.aligned.bmf

bfast localalign -f ref.fa -m 1.aligned.bmf -A 1 -n 16 >2.aligned.baf

bfast postprocess -f ref.fa -i 2.aligned.baf -A 1 -Y 2 -n 16 -b 0 >2.sam

java -Xmx60g -jar /bin/picard-tools-1.118/AddOrReplaceReadGroups.jar INPUT=2.sam OUTPUT=2.bam SORT_ORDER=coordinate RGID=OS RGLB=OS RGPL=solid RGPU=SRR035385 RGSM=OS

java -Xmx60g -jar /bin/picard-tools-1.118/MarkDuplicates.jar INPUT=2.bam OUTPUT=2rdup.bam METRICS_FILE=2rdup REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES=2000

java -jar /bin/GATK-3.2-2/GenomeAnalysisTK.jar -R ref.fa -T RealignerTargetCreator -I 2rdup.bam -o 2.realn.intervals -nt 8 -allowPotentiallyMisencodedQuals ###I got errors here

Can I still get a correct VCF file if I use the -allowPotentiallyMisencodedQuals argument in the subsequent commands? If any of the commands above are wrong and could be causing this problem, please point them out.

I hope someone can help me with my questions, thank you!

Comments

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Can you please clarify: what did you try first, what happened, and what did you try next? Have you tried running without any argument related to the quality scores?

  • zzq, China (Member)

    Thanks for your reply, and sorry for my unclear explanation.
    I mapped my single-end reads with bfast (0.7.0-a), then sorted them and removed duplicates with Picard (commands as above).
    Next, I plan to use GATK for indel realignment, base recalibration, and SNP calling. I first ran the RealignerTargetCreator step to generate the interval list file, but I got these errors both with the -fixMisencodedQual argument and without any argument related to quality scores. In the end I got past the errors by using the -allowPotentiallyMisencodedQuals argument, and I plan to run realignment and base recalibration with it as well. What worries me is whether I can still get reasonably accurate SNP calls if I ignore the warnings about base quality score encoding.
    Also, I have seen many posts about quality score encoding, but I cannot find a recommended way to solve these problems.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @zzq, have you checked that the qualities were converted correctly from color space to base space? We cannot provide guidance for processing SOLiD data because we do not have experience working with it, but there are many good sources of information, e.g. on Biostars.

  • zzq, China (Member)

    @Geraldine_VdAuwera Yes, I converted them with the solid2fastq program that comes with the bfast aligner. I also ran into the same errors when processing Illumina HiSeq data, both with the -fixMisencodedQual argument and without any argument related to quality scores. For that data, I mapped the reads with bwa (bwa aln and bwa sampe), then sorted them and removed duplicates with Picard.

  • zzq, China (Member)

    @Geraldine_VdAuwera Sorry, my internet connection is unreliable. I hope someone can suggest how to deal with the errors caused by the quality score encoding, or point me to tools that can fix these problems. Thanks!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    You can try plotting the distribution of quality scores to see what is the actual range of quals in your data. This will help you verify what is the actual encoding, and based on that you can decide if there is a problem or not.
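One way to read such a distribution, sketched in Python: quality scores that were written as Phred+64 but decoded as Phred+33 appear 31 units too high (64 - 33 = 31), so a histogram that sits almost entirely above about Q41 hints at Phred+64 input. The function name, the cut-off of 41, and the 1% tolerance are illustrative assumptions, not anything GATK or Picard defines.

```python
def guess_encoding(histogram, typical_max_q=41, tolerance=0.01):
    """Guess how qualities were encoded from a {score: count} histogram.

    Scores are assumed to have been decoded as Phred+33 (as in a BAM file).
    typical_max_q and tolerance are illustrative assumptions.
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    shift = 64 - 33  # Phred+64 data read as Phred+33 appears 31 too high
    too_high = sum(n for q, n in histogram.items() if q > typical_max_q)
    below_shift = sum(n for q, n in histogram.items() if q < shift)
    if too_high / total <= tolerance:
        return "looks like Phred+33"
    if below_shift / total <= tolerance:
        return "looks like Phred+64 decoded as Phred+33"
    return "mixed or unclear"


# Toy histograms (made-up counts, not taken from real data)
print(guess_encoding({2: 10, 20: 500, 40: 1000}))
print(guess_encoding({33: 10, 51: 500, 71: 1000}))
print(guess_encoding({6: 400, 40: 500, 71: 1000}))
```

The third case, where a sizeable share of scores falls below the shift while many others are far above the usual range, is the kind of mixture that a single uniform correction cannot resolve.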

  • zzq, China (Member)
    edited September 2014

    @Geraldine I have just obtained the quality score distribution with Picard QualityScoreDistribution. The results are as follows:

    QUALITY COUNT_OF_Q
    1       1
    4       3
    8       1
    33      109374040
    36      1148984
    37      1952653
    38      15093610
    39      14164430
    40      8497200
    41      9159472
    42      4170620
    43      2467055
    44      7370061
    45      2852887
    46      6919352
    47      7415325
    48      8944089
    49      22878009
    50      13742276
    51      17380743
    52      9314278
    53      17196247
    54      23006647
    55      32835188
    56      45214920
    57      67191858
    58      55869963
    59      46119796
    60      93740488
    61      123904293
    62      213033963
    63      155824541
    64      214768916
    65      533030541
    66      696554686
    67      386672655
    68      646960058
    69      481386567
    70      725898626
    71      862267583
    72      1394198068
    

    From these results, I cannot tell how to resolve this error. Your reply on this problem would be greatly appreciated. Thanks!

  • zzq, China (Member)

    @Geraldine_VdAuwera I have checked the encoding and found that the quality scores of 1, 4 and 8 should not be there. Could you suggest some tools to filter these bad reads out of the BAM file? Thanks!
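One possible approach, sketched in Python on SAM text rather than BAM so that it needs no third-party library: drop every alignment whose QUAL string contains a character below a chosen ASCII cut-off. In the SAM format, header lines start with '@' and QUAL is the 11th tab-separated field. The function name and the default cut-off of '@' (ASCII 64, the lowest character a Phred+64 file can contain) are assumptions for illustration; this is not a GATK or Picard tool.

```python
def drop_low_encoded_reads(sam_lines, min_char="@"):
    """Yield SAM lines, skipping alignments with any QUAL char below min_char.

    min_char="@" (ASCII 64) is an assumed cut-off for Phred+64 data.
    """
    for line in sam_lines:
        if line.startswith("@"):       # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[10]              # QUAL is the 11th mandatory SAM field
        if qual != "*" and min(qual) < min_char:
            continue                   # at least one base looks differently encoded
        yield line


# Tiny made-up example: one header line and two alignments
sam = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\thhhh\n",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\thh%h\n",
]
kept = list(drop_low_encoded_reads(sam))
print(len(kept))  # the header and r1 are kept; r2 contains '%' (ASCII 37)
```

For paired-end data, dropping a single read would leave its mate unpaired, so this sketch is only a reasonable starting point for single-end reads like the ones in this thread.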

  • Sheila, Broad Institute (Member, Broadie admin)
    edited September 2014

    @zzq

    Hi,

    It looks like only a few of your bases have very low quality scores (less than 8), so this should not be a problem. If there were many of them I would recommend checking the processing pipelines.

    You can ignore the warnings and use --allowPotentiallyMisencodedQuals.

    -Sheila
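To put a number on "only a few", the share of bases below a cut-off can be computed from the two-column QUALITY / COUNT_OF_Q table. A minimal Python sketch, using just the first rows of zzq's table as sample input; the function name and the cut-off of 31 are illustrative assumptions (31 being the difference between the Phred+64 and Phred+33 offsets).

```python
def low_quality_fraction(table_text, cutoff=31):
    """Return (low_count, total_count, fraction) for scores below cutoff.

    Expects whitespace-separated 'QUALITY COUNT_OF_Q' rows; non-numeric
    rows (headers, blank lines) are skipped.
    """
    low = total = 0
    for row in table_text.splitlines():
        parts = row.split()
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        quality, count = int(parts[0]), int(parts[1])
        total += count
        if quality < cutoff:
            low += count
    if total == 0:
        raise ValueError("no histogram rows found")
    return low, total, low / total


# First rows of the table posted above (the full table has many more rows)
sample = """
QUALITY COUNT_OF_Q
1       1
4       3
8       1
33      109374040
36      1148984
"""
low, total, fraction = low_quality_fraction(sample)
print(low, total, fraction)
```

Run on the full table, the same function gives the proportion of bases that fall outside the expected range, which is the figure that matters when deciding whether the warning can be ignored.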

  • yuegeorge, hk (Member)

    @Geraldine_VdAuwera
    Hi, have you figured out the cause of this problem?

  • yuegeorge, hk (Member)

    @Geraldine_VdAuwera
    Hi Geraldine,
    When I run BaseRecalibrator:
    /home/elvis/software/JAVA/jre1.7.0_67/bin/java -Xmx15g -jar /home/elvis/software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R /home/george/hg19/ucsc.hg19.fasta \
    -I ./aCGH5875.rmdup.bam \
    -knownSites /media/Analysis/gatk_resource/dbsnp_138.hg19.vcf \
    -knownSites /media/Analysis/gatk_resource/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -knownSites /media/Analysis/gatk_resource/1000G_phase1.indels.hg19.sites.vcf \
    -o ./BaseRecal.grp

    The error is:
    MESSAGE: SAM/BAM file SAMFileReader{/home/george/alignment/autsim/new_default/aCGH5875.rmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66; please see the GATK --help documentation for options related to this error

    If I run with the --fix_misencoded_quality_scores parameter,
    the error is:
    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

    These are the same errors described earlier in this thread, so what is wrong with my BaseRecalibrator run? Thanks.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @yuegeorge You should run the picard tool QualityScoreDistribution (like zzq did above) and post what is the distribution of scores in your data. With that information we can decide what is the best solution.

  • yuegeorge, hk (Member)

    @Geraldine_VdAuwera
    Thanks.
    The results of Picard QualityScoreDistribution are listed below:

    HISTOGRAM java.lang.Byte

    QUALITY COUNT_OF_Q
    6 26771262
    34 9843539
    37 261596
    39 1492736
    40 1124871
    41 586607
    42 771730
    43 375112
    44 348854
    45 625320
    46 290613
    47 997871
    48 727189
    49 1054528
    50 2296820
    51 1442771
    52 2516369
    53 977538
    54 1797541
    55 3253572
    56 4181920
    57 5157597
    58 6833425
    59 7472912
    60 5228875
    61 11409084
    62 16274518
    64 20101300
    65 34473431
    66 73842373
    67 140703031
    68 53275198
    69 92328809
    70 59171108
    71 101201701
    72 118073194
    73 202564942

  • yuegeorge, hk (Member)

    @Geraldine_VdAuwera
    Furthermore, if I use the same file to call variants with HaplotypeCaller, everything looks fine.
    But as soon as I run BaseRecalibrator, the error occurs.
    That is very confusing.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @yuegeorge

    The problem is those bases with Q6. That should not happen in a normally encoded file. What platform were these called on?

    The different reaction of the two tools is because they apply different filters. The HaplotypeCaller probably never sees those very low-scoring bases, whereas BaseRecalibrator sees every base.

    I think it is safe to use the argument to ignore potentially misencoded scores in this case (see above thread for exact name) but you should try to find out why this encoding happened.
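Why both the default check and the fix can fail on the same file can be sketched in Python. This is a simplified model inferred from the two error messages quoted in this thread, not GATK's actual code: the threshold of 60 for an "extremely high" score is an assumption, while the shift of 31 is simply the difference between the Phred+64 and Phred+33 offsets.

```python
MAX_REASONABLE_Q = 60     # assumed threshold for "extremely high" scores
ENCODING_SHIFT = 64 - 33  # 31: difference between the two ASCII offsets


def default_check(quals):
    """Model of the default behaviour: reject scores that look too high."""
    for q in quals:
        if q > MAX_REASONABLE_Q:
            raise ValueError(f"extremely high quality score of {q}")
    return list(quals)


def fix_misencoded(quals):
    """Model of the fix: shift every score down, reject negative results."""
    fixed = []
    for q in quals:
        shifted = q - ENCODING_SHIFT
        if shifted < 0:
            raise ValueError(f"score {q} was already correctly encoded")
        fixed.append(shifted)
    return fixed


def outcome(func, quals):
    """Run func on quals and return either its result or the error text."""
    try:
        return func(quals)
    except ValueError as err:
        return f"error: {err}"


mixed = [6, 66, 73]      # Q6 next to scores in the 60s and 70s, as in the table above
uniform = [34, 66, 73]   # every score at or above the shift
print(outcome(default_check, mixed))
print(outcome(fix_misencoded, mixed))
print(outcome(fix_misencoded, uniform))
```

In this model a file that mixes very low and very high scores fails on both paths, which matches the pair of errors reported above. Skipping the check avoids both errors but, as far as this model goes, leaves the scores themselves unchanged.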
