Invalid SAM?

I used BWA MEM to map reads from an interleaved FASTQ.

fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"

#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"

(The FASTQs are actually CHM1; I used prefetch to fetch .sra files for three different runs from NCBI, converted the SRAs to FASTQs with fastq-dump, and then concatenated them all into one FASTQ with cat.)

The SAM is 515 GB but has no obvious problems, and samtools quickcheck says it's valid. But when I run FixMateInformation or ValidateSamFile from GATK4 (4.0.4.0), I get output like this:

ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities 
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag

There may be even more errors, but this is what I got after two hours.

I can, in fact, see that the first line in the SAM file is

######################################################################################################################################################################################################  4       *       0       0       *       *       0       0       *       *       AS:i:0  XS:i:0  RG:Z:CHM1

Is this SAM really invalid? Or is there something I need to do so that GATK4 will accept it?
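To get a sense of how many records look like the one above, a quick count of alignment lines whose SEQ field (column 10) is `*` may help. This is only a minimal sketch, assuming a tab-delimited SAM on standard input; the function name `count_empty_seq` is made up for illustration and is not part of samtools or GATK.

```shell
# Hypothetical helper: count SAM alignment records whose SEQ field is "*".
# Reads SAM text on stdin. Header lines start with "@", which is not a
# legal first character for a read name, so they can be skipped safely.
count_empty_seq() {
    awk -F '\t' '!/^@/ && $10 == "*" { n++ } END { print n + 0 }'
}

# Example usage (path taken from the script above):
#   count_empty_seq < CHM1.sam
```

On a file this large the scan will take a while, since every line has to be read once.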

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @ccnn
    Hi,

    I need to ask the team and get back to you.

    -Sheila

  • Equalizer (Member)
    Hi
    I ran Picard's ValidateSamFile on my sorted BAM file. The validation output text file contains this:
    ## HISTOGRAM java.lang.String
    Error Type Count
    WARNING:QUALITY_NOT_STORED 87

    WARNING: Record 1, Read name SRX5013139.5351, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities

    How can I filter out this warning?
  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hello @Equalizer

    You can use this GATK tool and suppress warnings using:

    --IGNORE_WARNINGS
    

    More information can be found at this link
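As a rough sketch of what such an invocation might look like with GATK4-style arguments: the helper below only prints the command so it can be reviewed before running. It assumes the `gatk` wrapper script is on the PATH; the function name `print_validate_cmd` and the file name `sorted.bam` are placeholders, not taken from the thread.

```shell
# Hypothetical helper: print (do not run) a ValidateSamFile command line
# that suppresses warnings. $1 is the path to the BAM or SAM file.
print_validate_cmd() {
    printf 'gatk ValidateSamFile -I "%s" --IGNORE_WARNINGS true\n' "$1"
}

# Placeholder input name; substitute your own file.
print_validate_cmd "sorted.bam"
```

Note that `--IGNORE_WARNINGS` hides every warning, not just this one. To silence only the QUAL warning, the tool's `--IGNORE` argument with the type shown in the histogram (`QUALITY_NOT_STORED`) appears to be the narrower choice.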
