Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

HaplotypeCaller single sample mode error

wbsimeywbsimey California Academy of SciencesMember ✭✭

Hello,
I am trying to run GATK4 HaplotypeCaller on a bam file from a single individual on Ubuntu 18 with
gatk HaplotypeCaller -R Tse_SBAPGDGG_D.fa -I AHP2746_sorted_dedup.bam -ERC GVCF -O AHP2746.gvcf
But I get the error:
A USER ERROR has occurred: Argument --emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file.

Anyone know why this error is being thrown? The input bam file is not multi-sample, it is a bwa generated bam file then sorted and deduped by elprep. It is from a single individual.

Best Answer

  • wbsimeywbsimey California Academy of Sciences ✭✭
    Accepted Answer

    I think I have solved the issue, which is a read group problem. It turns out that elprep has a flag to add read group info to your bam files (--replace-read-group).
    So I piped the output of bwa mem to elprep and used:
    elprep filter /dev/stdin bwa_mem_bam_elprep.bam --mark-duplicates --mark-optical-duplicates output.metrics --remove-duplicates --sorting-order coordinate --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --nr-of-threads 32
    Then indexed that bam file with:
    samtools index bwa_mem_bam_elprep.bam
    then ran
    gatk HaplotypeCaller -R Tse_SBAPGDGG_D.fa -I bwa_mem_bam_elprep.bam -ERC GVCF -O AHP2746_elprep-test.gvcf

    Thanks again for your help SkyW.

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Is it possible that there are multiple read groups in there with the multiple sample names such as lane numbers?

    Can you check/post the header of the bam file?

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    Hi SW,
    The first few rows of the bam file:
    A00351:187:HMTMMDSXX:3:1441:17879:34741 113 scaffold_6 1 60 49S102M = 60952 61001 TGAGGTCTGCACGGGGCCTGAGAGAGAGAGGAGAAATGGACTTAGTCTCTATGTGCCTGGTCATGTATTTGTTCCTATATCCTCTTTCTACACCTCTTGTCCACTCGCACTAAGCTTCCCTCCCAACCGGCATCTTTGATCTCCTCTCAAA FFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: NM:i:0 MD:Z:102 MC:Z:151M AS:i:102 XS:i:46 SA:Z:scaffold_39937,2,+,73S78M,23,9;
    A00351:187:HMTMMDSXX:3:2343:19045:25457 81 scaffold_6 1 49 91S60M = 29571 29512 TACCTGTGGAAGAAAAGCTTGAAAGCTCGGTTAGTGTTGCAGTGAGGTCTGCACGGGGCCTGAGAGAGAGAGGAGAACTGGACTTAGTCTGTATGTGCCTGGTCATGTATTTGTTCCTATATCCTCTTTCTACACATCTTGTCCACTGGCA ,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,:F,,,::,,,,,F:F::FF:F:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:2 MD:Z:44C11C3 MC:Z:143M8S AS:i:51 XS:i:34
    A00351:187:HMTMMDSXX:3:2163:5737:14278 99 scaffold_6 17 60 151M = 207 338 TATTTGTTCCTATATCCTCTTTCTACACCTCTTGTCCACTCGCACTAAGCTTCCCTCCCAACCGGCATCTTTGATCTCCTCTCAAAGACTGACTAGAATAAGCGTAACTATTTCTGTGCTGGACACCTGGGATTCTGGGTCCATTTTAGAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF NM:i:1 MD:Z:102T48 MC:Z:11M3I137M AS:i:146 XS:i:73
    A00351:187:HMTMMDSXX:3:2334:2410:12461 145 scaffold_6 22 60 151M scaffold_39528 34 0 GTTCCTATATCCTCTTTCTACACATCTTGTCCACTGGCACTAAGCTTCCCTCCCAACCGGCATCTTTGATCTCCTCTCAGAGACTGACTAGAATAAGTGTAACTATTTCTGTGCTGGACACCTGGGATTCTGGGTCCATTTTAGACATTTC F,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFF:FF:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:3 MD:Z:23C11C43A71 MC:Z:131M20S AS:i:136 XS:i:58

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    This sample was sequenced on a Novaseq instrument.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Can you run

    samtools view -H bamfile.bam
    

    ?
    This will print the header only.

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    This printed all of the headers, but here are a few:
    @SQ LN:519 SN:scaffold_41234
    @SQ SN:scaffold_41233 LN:519
    @SQ SN:scaffold_41232 LN:519
    @SQ SN:scaffold_41231 LN:519
    @SQ LN:519 SN:scaffold_41229
    @SQ SN:scaffold_41228 LN:519
    @SQ LN:519 SN:scaffold_41227
    @SQ SN:scaffold_41226 LN:519
    @SQ SN:scaffold_41327 LN:518
    @SQ SN:scaffold_41305 LN:518
    @SQ SN:scaffold_41304 LN:518
    @SQ SN:scaffold_41303 LN:518
    @SQ SN:scaffold_41302 LN:518
    @SQ SN:scaffold_41301 LN:518
    @SQ SN:scaffold_41300 LN:518
    @SQ LN:518 SN:scaffold_41299
    @SQ SN:scaffold_41298 LN:518
    @SQ SN:scaffold_41297 LN:518
    @SQ SN:scaffold_41296 LN:518
    @SQ LN:518 SN:scaffold_41295
    @SQ SN:scaffold_41294 LN:518
    @SQ SN:scaffold_41293 LN:518
    @SQ SN:scaffold_41292 LN:518
    @SQ SN:scaffold_41291 LN:518
    @SQ SN:scaffold_41290 LN:518
    @SQ SN:scaffold_41289 LN:518
    @SQ SN:scaffold_41288 LN:518
    @SQ SN:scaffold_41287 LN:518
    @SQ SN:scaffold_41286 LN:518
    @SQ SN:scaffold_41285 LN:518
    @SQ SN:scaffold_41284 LN:518
    @SQ SN:scaffold_41283 LN:518
    @SQ SN:scaffold_41282 LN:518

    We are mapping these to a reference genome generated with linked reads (10x) and Hi-C data.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Sample name info is usually at the bottom of the header. Can you provide the whole header?

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    I have tried
    samtools view -H bamfile.bam
    samtools view -H bamfile.bam | less
    samtools view bamfile.bam | less
    samtools view bamfile.bam | head
    none of these produce anything different than what I have posted here.
    Is there some other way to view the headers?

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    I also tried
    samtools view -H AHP2746_sorted_dedup.bam | grep @RG
    this returned nothing. Does this mean that none of my files have a read group?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Somethings just don't add up here.

  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    OK, thanks for your help.
    The header of the original fastQ files look like this:
    @A00351:187:HMTMMDSXX:3:1101:17400:1000 1:N:0:GCTCAGTT+AGAAGCCT
    After running
    bwa mem -t 32 Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz > AHP2746.bam
    The header of the bam file looks like this"
    A00351:187:HMTMMDSXX:3:1441:17879:34741 113 scaffold_6 1 60 49S102M = 60952 61001
    I tried bwa mem runs with various -R options, such as
    bwa mem -M -t 32 -R '@RG\tID:sample_1\tSM:AHP2746\tPL:illumina\tPU:lane1\tPI:420' Tse_SBAPGDGG_D AHP2746_L003_R1_001.fastq.gz AHP2746_L003_R2_001.fastq.gz > bwa-mem-Rtest_AHP2746.bam
    which completed, but elprep throws an error when trying to dedupe and sort the bwa bam file
    2019/07/23 10:07:46 gzip: invalid header in NewBGZFReader

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    BWA produces sam output. You need to convert the output to bam using picard or samtools.

  • wbsimeywbsimey California Academy of SciencesMember ✭✭
    Accepted Answer

    I think I have solved the issue, which is a read group problem. It turns out that elprep has a flag to add read group info to your bam files (--replace-read-group).
    So I piped the output of bwa mem to elprep and used:
    elprep filter /dev/stdin bwa_mem_bam_elprep.bam --mark-duplicates --mark-optical-duplicates output.metrics --remove-duplicates --sorting-order coordinate --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --nr-of-threads 32
    Then indexed that bam file with:
    samtools index bwa_mem_bam_elprep.bam
    then ran
    gatk HaplotypeCaller -R Tse_SBAPGDGG_D.fa -I bwa_mem_bam_elprep.bam -ERC GVCF -O AHP2746_elprep-test.gvcf

    Thanks again for your help SkyW.

Sign In or Register to comment.